DOI: 10.1145/3233547.3233582
short-paper

Measuring the Mappability Spectrum of Reference Genome Assemblies

Published: 15 August 2018

ABSTRACT

The ability to infer actionable information from genomic variation data in a resequencing experiment relies on accurately aligning the sequences to a reference genome. However, this accuracy is inherently limited by the quality of the reference assembly and the repetitive content of the subject's genome. As long read sequencing technologies become more widespread, it is crucial to investigate the expected improvements in alignment accuracy and variant analysis over existing short read methods. The ability to quantify the read length and error rate necessary to uniquely map regions of interest in a sequence allows users to make informed decisions regarding experiment design and provides useful metrics for comparing the magnitude of repetition across different reference assemblies. To this end, we have developed NEAT-Repeat, a toolkit for exhaustively identifying the minimum read length required to uniquely map each position of a reference sequence given a specified error rate. Using these tools, we computed the "mappability spectrum" for ten reference sequences, including human and a range of plants and animals, quantifying the theoretical improvements in alignment accuracy that would result from sequencing with longer reads or reads with fewer base-calling errors. Our inclusion of read length and error rate builds upon existing methods for mappability tracks based on uniqueness or aligner-specific mapping scores, and thus enables more comprehensive analysis. We apply our mappability results to whole-genome variant call data, and demonstrate that variants called with low mapping and genotype quality scores are disproportionately found in reference regions that require long reads to be uniquely covered. We propose that our mappability metrics provide a valuable supplement to established variant filtering and annotation pipelines by supplying users with an additional metric related to read mapping quality.
NEAT-Repeat can process large and repetitive genomes, such as those of corn and soybean, in a tractable amount of time by leveraging efficient methods for edit distance computation as well as running multiple jobs in parallel. NEAT-Repeat is written in Python 2.7 and C++, and is available at https://github.com/zstephens/neat-repeat.
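
To make the central quantity concrete, the following is a minimal Python sketch of a per-position minimum unique read length and a simple summary of it. It is not the NEAT-Repeat implementation: it assumes an error rate of zero (exact matching only), considers the forward strand only, and uses a brute-force substring search that is practical only for toy sequences. The function names `count_occurrences`, `min_unique_read_lengths`, and `fraction_mappable` are hypothetical and chosen for illustration.

```python
def count_occurrences(seq, kmer):
    """Count exact occurrences of kmer in seq, including overlapping ones."""
    count = 0
    start = seq.find(kmer)
    while start != -1:
        count += 1
        start = seq.find(kmer, start + 1)
    return count


def min_unique_read_lengths(seq):
    """For each start position, return the shortest read length whose
    substring occurs exactly once in seq, or None if every read starting
    there (up to the end of seq) is repeated elsewhere.

    Assumption: exact matching, forward strand only (illustration only).
    """
    n = len(seq)
    lengths = []
    for i in range(n):
        best = None
        for length in range(1, n - i + 1):
            if count_occurrences(seq, seq[i:i + length]) == 1:
                best = length
                break
        lengths.append(best)
    return lengths


def fraction_mappable(lengths, read_length):
    """Fraction of positions that a read of the given length maps uniquely."""
    if not lengths:
        return 0.0
    mappable = sum(1 for m in lengths if m is not None and m <= read_length)
    return mappable / len(lengths)
```

For the toy sequence `ACGTACGA`, `min_unique_read_lengths` returns `[4, 3, 2, 1, 4, 3, 2, None]`: the repeated prefix `ACG` forces reads of length 4 at positions 0 and 4, and the final `A` can never be uniquely placed. Evaluating `fraction_mappable` over a range of read lengths gives a curve in the spirit of the mappability spectrum described above, although a tolerance for mismatches would require an edit-distance comparison in place of the exact search.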


Published in

BCB '18: Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
August 2018
727 pages
ISBN: 9781450357944
DOI: 10.1145/3233547

Copyright © 2018 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery
New York, NY, United States


Acceptance Rates

BCB '18 paper acceptance rate: 46 of 148 submissions, 31%. Overall acceptance rate: 254 of 885 submissions, 29%.
