Measuring the Mappability Spectrum of Reference Genome Assemblies

Authors:
Zachary D. Stephens, University of Illinois at Urbana-Champaign, Champaign, IL, USA
Ravishankar K. Iyer, University of Illinois at Urbana-Champaign, Champaign, IL, USA

BCB '18: Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, August 2018, Pages 47–52
https://doi.org/10.1145/3233547.3233582

Published: 15 August 2018

ABSTRACT

The ability to infer actionable information from genomic variation data in a resequencing experiment relies on accurately aligning the sequences to a reference genome. However, this accuracy is inherently limited by the quality of the reference assembly and the repetitive content of the subject's genome. As long-read sequencing technologies become more widespread, it is crucial to investigate the expected improvements in alignment accuracy and variant analysis over existing short-read methods. The ability to quantify the read length and error rate necessary to uniquely map regions of interest in a sequence allows users to make informed decisions regarding experiment design and provides useful metrics for comparing the magnitude of repetition across different reference assemblies. To this end, we have developed NEAT-Repeat, a toolkit for exhaustively identifying the minimum read length required to uniquely map each position of a reference sequence given a specified error rate. Using these tools, we computed the "mappability spectrum" for ten reference sequences, including human and a range of plants and animals, quantifying the theoretical improvements in alignment accuracy that would result from sequencing with longer reads or reads with fewer base-calling errors. Our inclusion of read length and error rate builds upon existing methods for mappability tracks based on uniqueness or aligner-specific mapping scores, and thus enables more comprehensive analysis. We apply our mappability results to whole-genome variant call data and demonstrate that variants called with low mapping and genotype quality scores are disproportionately found in reference regions that require long reads to be uniquely covered. We propose that our mappability metrics provide a valuable supplement to established variant filtering and annotation pipelines by supplying users with an additional metric related to read mapping quality.
NEAT-Repeat can process large and repetitive genomes, such as those of corn and soybean, in a tractable amount of time by leveraging efficient methods for edit distance computation and by running multiple jobs in parallel. NEAT-Repeat is written in Python 2.7 and C++, and is available at https://github.com/zstephens/neat-repeat.
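
To make the central quantity concrete, the following is a minimal brute-force sketch of a per-position "minimum uniquely mappable read length" on a toy sequence. It is not NEAT-Repeat's implementation: the function names are hypothetical, it considers the forward strand only, and it uses substitution-only (Hamming) mismatches as a simplified stand-in for the edit distance mentioned in the abstract. The number of tolerated mismatches is assumed to be the error rate times the read length, rounded down.

```python
def _has_other_match(seq, read, pos, allowed):
    """Return True if `read` matches some start position other than `pos`
    with at most `allowed` substitutions (Hamming distance, an assumption)."""
    length = len(read)
    for other in range(len(seq) - length + 1):
        if other == pos:
            continue
        mismatches = 0
        for a, b in zip(read, seq[other:other + length]):
            if a != b:
                mismatches += 1
                if mismatches > allowed:
                    break  # this window is already too different
        if mismatches <= allowed:
            return True
    return False


def min_unique_length(seq, pos, error_rate=0.0):
    """Smallest read length starting at `pos` that matches nowhere else in
    `seq` within the tolerated mismatches; None if no such length exists."""
    for length in range(1, len(seq) - pos + 1):
        allowed = int(error_rate * length)  # tolerated substitutions (assumed rule)
        if not _has_other_match(seq, seq[pos:pos + length], pos, allowed):
            return length
    return None


def mappability_spectrum(seq, error_rate=0.0):
    """Minimum unique read length for every position of `seq`."""
    return [min_unique_length(seq, p, error_rate) for p in range(len(seq))]


def fraction_mappable(spectrum, read_length):
    """Fraction of positions uniquely covered by reads of `read_length`."""
    hits = sum(1 for m in spectrum if m is not None and m <= read_length)
    return hits / len(spectrum)


if __name__ == "__main__":
    toy = "ACGTACGA"  # toy sequence with a repeated "ACG" prefix
    spectrum = mappability_spectrum(toy)
    print(spectrum)                          # [4, 3, 2, 1, 4, 3, 2, None]
    print(fraction_mappable(spectrum, 3))    # 0.625
    print(min_unique_length(toy, 0, 0.25))   # 5
```

On the toy sequence, position 0 needs a read of length 4 when reads are error-free, but length 5 once one mismatch in four bases is tolerated, which illustrates why a higher error rate pushes the required read length up. This quadratic-per-position scan is only workable for small inputs; genome-scale use would need the kind of efficient edit distance computation and parallel execution the abstract describes.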

Index Terms

1. Applied computing
  1. Life and medical sciences
    1. Computational biology
      1. Computational genomics
      2. Molecular sequence analysis

Published in
BCB '18: Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
August 2018
727 pages
ISBN: 9781450357944
DOI: 10.1145/3233547
General Chairs: Amarda Shehu (George Mason University, USA), Cathy Wu (University of Delaware, USA)
Program Chairs: Christina Boucher (University of Florida, USA), Jing Li (Case Western Reserve University, USA), Hongfang Liu (Mayo Clinic, USA), Mihai Pop (University of Maryland, USA)
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
mappability
repetitive dna
sequence analysis
Qualifiers
- short-paper
Acceptance Rates
BCB '18 Paper Acceptance Rate: 46 of 148 submissions, 31%
Overall Acceptance Rate: 254 of 885 submissions, 29%