ABSTRACT
Next Generation Sequencing (NGS) technologies produce large quantities of short reads with relatively high error rates. Erroneous reads that cannot be aligned are either ignored during de novo sequencing or must be suitably corrected. Such reads also pose problems for mapping, since it is difficult to distinguish errors from true variants. Methods for detecting and correcting errors typically rely on the frequencies of substrings of the reads. Suffix trees are often used for this purpose, since they can index and count the frequencies of substrings of all lengths. Existing suffix-tree based methods detect errors by identifying statistically under-represented branches (suffixes) and then fix them. However, they do not refer back to the reads to put the correction in context. Since an error in a single read manifests itself at multiple nodes of a suffix tree, a read-driven approach that relies on these multiple manifestations is expected to perform better. Based on this observation, we develop an algorithm, PLURIBUS, which uses a voting scheme to reconcile the corrections suggested by multiple manifestations of an error. Using simulated sequencing data, we compare the accuracy of PLURIBUS in detecting and correcting errors against existing error correction techniques. We also assess the impact of error correction on the performance of sequence assembly. Our results show that PLURIBUS corrects errors with improved precision and enables the assembler to generate longer contigs, particularly when the genome is longer or coverage is lower. PLURIBUS is freely available at http://compbio.case.edu/pluribus/.
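The voting idea can be illustrated with a minimal sketch. This is not the PLURIBUS implementation: it replaces the suffix tree with plain fixed-length k-mer counts, and the function names and the thresholds `min_count` and `min_votes` are hypothetical choices made for illustration. Each rare k-mer window that covers a position is treated as one manifestation of a possible error, and a base is changed only when enough windows agree on the same replacement.

```python
from collections import Counter, defaultdict


def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def vote_corrections(read, counts, k, min_count=2):
    """Return {position: Counter(base -> votes)} for one read.

    Every under-represented window (count < min_count) proposes the
    single-base substitutions that would turn it into a well-supported
    k-mer. Each proposal is one vote for (position, base).
    """
    votes = defaultdict(Counter)
    for start in range(len(read) - k + 1):
        kmer = read[start:start + k]
        if counts[kmer] >= min_count:
            continue  # window is well supported, no proposal
        for offset in range(k):
            for base in "ACGT":
                if base == kmer[offset]:
                    continue
                candidate = kmer[:offset] + base + kmer[offset + 1:]
                if counts[candidate] >= min_count:
                    votes[start + offset][base] += 1
    return votes


def correct_read(read, counts, k, min_count=2, min_votes=2):
    """Apply a substitution only where at least min_votes windows agree."""
    votes = vote_corrections(read, counts, k, min_count)
    corrected = list(read)
    for pos, tally in votes.items():
        base, n = tally.most_common(1)[0]
        if n >= min_votes:
            corrected[pos] = base
    return "".join(corrected)


# Toy data: five error-free copies and one read with a C->G error at index 5.
reads = ["ACGGTCATTGCA"] * 5 + ["ACGGTGATTGCA"]
counts = count_kmers(reads, 4)
print(correct_read("ACGGTGATTGCA", counts, 4))  # prints ACGGTCATTGCA
```

In the toy data, the four windows that overlap the erroneous base are each seen only once, and all four propose the same substitution, so the correction is accepted. A suffix tree would let the same tally be taken over substrings of all lengths rather than a single k.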
Index Terms
- Suffix-Tree Based Error Correction of NGS Reads Using Multiple Manifestations of an Error