ABSTRACT
The ability to infer actionable information from genomic variation data in a resequencing experiment relies on accurately aligning the sequences to a reference genome. However, this accuracy is inherently limited by the quality of the reference assembly and the repetitive content of the subject's genome. As long-read sequencing technologies become more widespread, it is crucial to investigate the expected improvements in alignment accuracy and variant analysis over existing short-read methods. The ability to quantify the read length and error rate necessary to uniquely map regions of interest in a sequence allows users to make informed decisions regarding experiment design and provides useful metrics for comparing the magnitude of repetition across different reference assemblies. To this end, we have developed NEAT-Repeat, a toolkit for exhaustively identifying the minimum read length required to uniquely map each position of a reference sequence given a specified error rate. Using these tools, we computed the "mappability spectrum" for ten reference sequences, including human and a range of plants and animals, quantifying the theoretical improvements in alignment accuracy that would result from sequencing with longer reads or reads with fewer base-calling errors. Our inclusion of read length and error rate builds upon existing methods for mappability tracks based on uniqueness or aligner-specific mapping scores, and thus enables more comprehensive analysis. We apply our mappability results to whole-genome variant call data, and demonstrate that variants called with low mapping and genotype quality scores are disproportionately found in reference regions that require long reads to be uniquely covered. We propose that our mappability metrics provide a valuable supplement to established variant filtering and annotation pipelines by supplying users with an additional metric related to read mapping quality.
NEAT-Repeat can process large and repetitive genomes, such as those of corn and soybean, in a tractable amount of time by leveraging efficient methods for edit distance computation as well as running multiple jobs in parallel. NEAT-Repeat is written in Python 2.7 and C++, and is available at https://github.com/zstephens/neat-repeat.
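To make the idea of a per-position minimum read length concrete, the following is a minimal brute-force sketch and not NEAT-Repeat's implementation. It makes two simplifying assumptions: errors are modeled as a fixed number of substitutions (Hamming distance) rather than an error rate with edit distance, and only the forward strand is searched. The function name `min_unique_read_lengths` and its parameters are hypothetical, and the quadratic scan is only practical for toy sequences.

```python
def min_unique_read_lengths(ref, max_len, max_mismatches=0):
    """For each start position i, return the smallest read length L <= max_len
    such that ref[i:i+L] matches no other start position of ref within
    max_mismatches substitutions. Returns None where no such length exists.

    Assumption: substitutions only (Hamming distance), forward strand only.
    """
    n = len(ref)
    result = []
    for i in range(n):
        found = None
        # Try read lengths in increasing order; stop at the first unique one.
        for length in range(1, min(max_len, n - i) + 1):
            read = ref[i:i + length]
            unique = True
            for j in range(n - length + 1):
                if j == i:
                    continue  # skip the read's true origin
                window = ref[j:j + length]
                mismatches = sum(a != b for a, b in zip(read, window))
                if mismatches <= max_mismatches:
                    unique = False  # another locus explains this read equally well
                    break
            if unique:
                found = length
                break
        result.append(found)
    return result


if __name__ == "__main__":
    toy_reference = "ACGTACGA"
    print(min_unique_read_lengths(toy_reference, max_len=8))
    # -> [4, 3, 2, 1, 4, 3, 2, None]
```

In this toy reference, allowing one substitution raises the minimum length at position 0 from 4 to 5, because `ACGT` and `ACGA` differ at a single base. This mirrors the abstract's point that tolerating more errors requires longer reads to map a position uniquely.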
Index Terms
- Measuring the Mappability Spectrum of Reference Genome Assemblies