Resolving Conflicting Predictions from Multimapping Reads

Stefan Canzar, Khaled Elbassioni, Mitchell Jones, Julián Mestre

Research output: Contribution to journalArticlepeer-review

4 Scopus citations

Abstract

The first step in the analysis of data produced by ultra-high-throughput next-generation sequencing technology is to map short sequence "reads" to a reference genome, if available. Sequencing errors, repeat regions, and polymorphisms may lead a read to align to multiple locations in the genome reasonably well. While ignoring such multimapping reads, or some of their alignments, will reduce the sensitivity of almost any type of downstream analysis (e.g., detecting structural variants), erroneous mappings will typically yield false positive predictions. Here we propose a framework that aims to identify true predictions among a large set of candidate predictions by selecting for each read a unique mapping that collectively imply conflict-free predictions. We formulate this problem as the maximum facility location problem, for which we propose LP-rounding heuristics. We provide a theoretic guarantee on the quality of the solution and demonstrate the utility of our algorithm in resolving conflicting deletions implied by simulated reads mapping ambiguously to Craig Venter's genome model and Illumina sequencing reads of the well-studied NA12878 individual.

Original languageBritish English
Pages (from-to)203-217
Number of pages15
JournalJournal of Computational Biology
Volume23
Issue number3
DOIs
StatePublished - 1 Mar 2016

Keywords

  • algorithms
  • combinatorial optimization

Fingerprint

Dive into the research topics of 'Resolving Conflicting Predictions from Multimapping Reads'. Together they form a unique fingerprint.

Cite this