research-article

Efficient relaxed search in hierarchically clustered sequence datasets

Authors:
Kai C. Bader

Technische Universität München, Germany

Technische Universität München, Germany
View Profile

,
Mikhail J. Atallah

Purdue University, IN

Purdue University, IN
View Profile

,
Christian Grothoff

Technische Universität München, Germany

Technische Universität München, Germany
View Profile

ACM Journal of Experimental Algorithmics Volume 17Article No.: 1.4pp 1.1–1.18https://doi.org/10.1145/2133803.2212315

Published:19 July 2012Publication History

ACM Journal of Experimental Algorithmics

Abstract

This article presents a new algorithm for finding oligonucleotide signatures that are specific and sensitive for organisms or groups of organisms in large-scale sequence datasets. We assume that the organisms have been organized in a hierarchy, for example, a phylogenetic tree. The resulting signatures, binding sites for primers and probes, match the maximum possible number of organisms in the target group while having at most k matches outside of the target group.

The key step in the algorithm is the use of the lowest common ancestor (LCA) to search the organism hierarchy; this allows the combinatorial problem in almost linear time (empirically observed) to be solved. The presented algorithm improves performance by several orders of magnitude in terms of both memory consumption and runtime when compared to the best-known previous algorithms while giving identical, exact solutions.

This article gives a formal description of the algorithm, discusses details of our concrete, publicly available implementation, and presents the results from our performance evaluation.

References

Abouelhoda, M. I., Kurtz, S., and Ohlebusch, E. 2004. Replacing suffix trees with enhanced suffix arrays. J. Discrete Algor. 2, 1, 53--86. Google ScholarDigital Library
Amann, R. and Fuchs, B. M. 2008. Single-cell identification in microbial communities by improved fluorescence in situ hybridization techniques. Nat. Rev. Microbiol. 6, 5, 339--348.Google ScholarCross Ref
Amann, R. I., Ludwig, W., and Schleifer, K. H. 1995. Phylogenetic identification and in situ detection of individual microbial cells without cultivation. Microbiol Rev. 59, 1, 143--169.Google Scholar
Bader, K. C., Eissler, T., Evans, N., GauthierDickey, C., Grothoff, C., Grothoff, K., Keene, J., Meier, H., Ritzdorf, C., and Rutherford, M. J. 2010. Distributed stream processing with DUP. In Network and Parallel Computing, Lecture Notes in Computer Science, vol. 6289, Springer, Berlin, 232--246. Google ScholarDigital Library
Bader, K. C., Grothoff, C., and Meier, H. 2011. Comprehensive and relaxed search for oligonucleotide signatures in hierarchically clustered sequence datasets. Bioinformatics 27, 1546--1554. Google ScholarDigital Library
Bartlett, J. M. S. and Stirling, D. 2003. A short history of the polymerase chain reaction. In PCR Protocols, J. M. Bartlett and D. Stirling, Eds., Methods in Molecular Biology Series, vol. 226, Humana Press, 3--6. 10.1385/1-59259-384-4:3.Google Scholar
Bender, M. and Farach-Colton, M. 2000. The lca problem revisited. In LATIN 2000: Theoretical Informatics, G. Gonnet and A. Viola, Eds., Lecture Notes in Computer Science Series, vol. 1776, Springer Berlin, 88--94. Google ScholarDigital Library
Chung, W. H., Rhee, S. K., Wan, X. F., Bae, J. W., Quan, Z. X., and Park, Y. H. 2005. Design of long oligonucleotide probes for functional gene detection in a microbial community. Bioinformatics 21, 4092--4100. Google ScholarDigital Library
Cole, J. R., Wang, Q., Cardenas, E., Fish, J., Chai, B., Farris, R. J., Kulam-Syed-Mohideen, A. S., McGarrell, D. M., Marsh, T., Garrity, G. M., and Tiedje, J. M. 2009. The Ribosomal Database Project: Improved alignments and new tools for rRNA analysis. Nucleic Acids Res. 37, D141--145.Google ScholarCross Ref
DeSantis, T. Z., Hugenholtz, P., Larsen, N., Rojas, M., Brodie, E. L., Keller, K., Huber, T., Dalevi, D., Hu, P., and Andersen, G. L. 2006. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl. Environ. Micorbiol. 72, 7, 5069--5072.Google ScholarCross Ref
Doring, A., Weese, D., Rausch, T., and Reinert, K. 2008. Seqan an efficient, generic c++ library for sequence analysis. BMC Bioinformatics 9, 1, 11.Google ScholarCross Ref
Duitama, J., Kumar, D. M., Hemphill, E., Khan, M., Mandoiu, I. I., and Nelson, C. E. 2009. PrimerHunter: a primer design tool for PCR-based virus subtype identification. Nucleic Acids Res. 37, 2483--2492.Google ScholarCross Ref
Eissler, T., Hodges, C. P., and Meier, H. 2011. Ptpan—overcoming memory limitations in oligonucleotide string matching for primer/probe design. Bioinformatics 27, 20, 2797--2805. Google ScholarDigital Library
Feng, S. and Tillier, E. R. M. R. 2007. A fast and flexible approach to oligonucleotide probe design for genomes and gene families. Bioinformatics 23, 1195--1202. Google ScholarDigital Library
Ficetola, G. F., Coissac, E., Zundel, S., Riaz, T., Shehzad, W., Bessiere, J., Taberlet, P., and Pompanon, F. 2010. An in silico approach for the evaluation of DNA barcodes. BMC Genomics 11, 434.Google ScholarCross Ref
Fischer, J. and Heun, V. 2006. Theoretical and practical improvements on the rmq-problem, with applications to lca and lce. In Combinatorial Pattern Matching, M. Lewenstein and G. Valiente, Eds., Lecture Notes in Computer Science, vol. 4009, Springer, Berlin, 36--48. Google ScholarDigital Library
Fischer, J. and Heun, V. 2007. A new succinct representation of rmq-information and improvements in the enhanced suffix array. In Combinatorics, Algorithms, Probabilistic and Experimental Methodologies, B. Chen, M. Paterson, and G. Zhang, Eds., Lecture Notes in Computer Science, vol. 4614, Springer, Berlin, 459--470. Google ScholarDigital Library
Gabow, H. N., Bentley, J. L., and Tarjan, R. E. 1984. Scaling and related techniques for geometry problems. In Proceedings of the 16th Annual ACM Symposium on Theory of Computing (STOC'84). ACM, New York, 135--143. Google ScholarDigital Library
Kurtz, S., Phillippy, A., Delcher, A. L., Smoot, M., Shumway, M., Antonescu, C., and Salzberg, S. L. 2004. Versatile and open software for comparing large genomes. Genome Biol. 5, R12.Google ScholarCross Ref
Lee, H. P., Sheu, T.-F., and Tang, C. Y. 2010. A parallel and incremental algorithm for efficient unique signature discovery on dna databases. BMC Bioinformatics 11, 132.Google ScholarCross Ref
Loy, A., Maixner, F., Wagner, M., and Horn, M. 2007. probeBase--an online resource for rRNA-targeted oligonucleotide probes: new features 2007. Nucleic Acids Res. 35 (Database issue) D800--804.Google Scholar
Ludwig, W., Strunk, O., Westram, R., Richter, L., Meier, H., Yadhukumar, et al. 2004. ARB: a software environment for sequence data. Nucleic Acids Res. 32, 4, 1363--1371.Google ScholarCross Ref
Olsen, G. 1990. The “Newick's 8:45” tree format. http://evolution.genetics.washington.edu/phylip/newick_doc.html.Google Scholar
Peterlongo, P., Nicolas, J., Lavenier, D., Vorch, R., and Querellou, J. 2009. c-gamma:comparative genome analysis of molecular markers. In Pattern Recognition in Bioinformatics, V. Kadirkamanathan, G. Sanguinetti, M. Girolami, M. Niranjan, and J. Noirel, Eds., Lecture Notes in Computer Science, vol. 5780, Springer, Berlin, 255--269. Google ScholarDigital Library
Phillippy, A. M., Ayanbule, K., Edwards, N. J., and Salzberg, S. L. 2009. Insignia: a DNA signature search web server for diagnostic assay development. Nucleic Acids Res. 37, W229--234.Google ScholarCross Ref
Phillippy, A. M., Mason, J. A., Ayanbule, K., Sommer, D. D., Taviani, E., Huq, A., Colwell, R. R., Knight, I. T., and Salzberg, S. L. 2007. Comprehensive DNA signature discovery and validation. PLoS Comput. Biol. 3, e98.Google ScholarCross Ref
Price, M. N., Dehal, P. S., and Arkin, A. P. 2010. Fasttree 2 approximately maximum-likelihood trees for large alignments. PLoS ONE 5, 3, e9490.Google ScholarCross Ref
Pruesse, E., Quast, C., Knittel, K., Fuchs, B. M., Ludwig, W., Peplies, J., and Glöckner, F. O. 2007. SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 35, 21, 7188--7196.Google ScholarCross Ref
Riaz, T., Shehzad, W., Viari, A., Pompanon, F., Taberlet, P., and Coissac, E. 2011. ecoPrimers: inference of new DNA barcode markers from whole genome sequence analysis. Nucleic Acids Res. 39, e145.Google ScholarCross Ref
Stahl, D. A., Flesher, B., Mansfield, H. R., and Montgomery, L. 1988. Use of phylogenetically based hybridization probes for studies of ruminal microbial ecology. Appl. Environ. Microbiol. 54, 1079--1084.Google Scholar
Stamatakis, A. 2006. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22, 2688--2690. Google ScholarDigital Library

Index Terms

Efficient relaxed search in hierarchically clustered sequence datasets
1. Applied computing
  1. Life and medical sciences
2. Theory of computation
  1. Design and analysis of algorithms

Recommendations

Comprehensive and relaxed search for oligonucleotide signatures in hierarchically clustered sequence datasets

Motivation: PCR, hybridization, DNA sequencing and other important methods in molecular diagnostics rely on both sequence-specific and sequence group-specific oligonucleotide primers and probes. Their design depends on the identification of ...
Read More
Circular code motifs near the ribosome decoding center

A maximal C3 self-complementary trinucleotide circular code X is identified in genes of bacteria, eukaryotes, plasmids and viruses (Michel, 2015; Arquès and Michel, 1996). A translation (framing) code based on the circular code was proposed in Michel (...
Read More
Circular code motifs in the ribosome decoding center

Display Omitted Identification of circular code motifs in the ribosome decoding center.The universally conserved nucleotides A1492 and A1493 in circular code motifs.Identification of the conserved nucleotide G530 in nuclear and chloroplast rRNAs.The ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Journal of Experimental Algorithmics Volume 17, Issue
2012
427 pages
ISSN:1084-6654
EISSN:1084-6654
DOI:10.1145/2133803
Issue’s Table of Contents

Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 19 July 2012
- Revised: 1 April 2012
- Accepted: 1 April 2012
- Received: 1 January 2012
Published in jea Volume 17, Issue
Author Tags
Data mining
oligonucleotide
phylogenetic
primer design
probe design
ribosomal RNA
signature
tree
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 0
  Total Citations
  View Citations
- 257
  Total Downloads
- Downloads (Last 12 months)3
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
This publication has not been cited yet

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Efficient relaxed search in hierarchically clustered sequence datasets

ACM Journal of Experimental Algorithmics

Abstract

References

Cited By

Index Terms

Recommendations

Comprehensive and relaxed search for oligonucleotide signatures in hierarchically clustered sequence datasets

Circular code motifs near the ribosome decoding center

Circular code motifs in the ribosome decoding center

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Efficient relaxed search in hierarchically clustered sequence datasets

ACM Journal of Experimental Algorithmics

Abstract

References

Cited By

Index Terms

Recommendations

Comprehensive and relaxed search for oligonucleotide signatures in hierarchically clustered sequence datasets

Circular code motifs near the ribosome decoding center

Circular code motifs in the ribosome decoding center

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media