skip to main content
research-article

Efficient relaxed search in hierarchically clustered sequence datasets

Published:19 July 2012Publication History
Skip Abstract Section

Abstract

This article presents a new algorithm for finding oligonucleotide signatures that are specific and sensitive for organisms or groups of organisms in large-scale sequence datasets. We assume that the organisms have been organized in a hierarchy, for example, a phylogenetic tree. The resulting signatures, binding sites for primers and probes, match the maximum possible number of organisms in the target group while having at most k matches outside of the target group.

The key step in the algorithm is the use of the lowest common ancestor (LCA) to search the organism hierarchy; this allows the combinatorial problem in almost linear time (empirically observed) to be solved. The presented algorithm improves performance by several orders of magnitude in terms of both memory consumption and runtime when compared to the best-known previous algorithms while giving identical, exact solutions.

This article gives a formal description of the algorithm, discusses details of our concrete, publicly available implementation, and presents the results from our performance evaluation.

References

  1. Abouelhoda, M. I., Kurtz, S., and Ohlebusch, E. 2004. Replacing suffix trees with enhanced suffix arrays. J. Discrete Algor. 2, 1, 53--86. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Amann, R. and Fuchs, B. M. 2008. Single-cell identification in microbial communities by improved fluorescence in situ hybridization techniques. Nat. Rev. Microbiol. 6, 5, 339--348.Google ScholarGoogle ScholarCross RefCross Ref
  3. Amann, R. I., Ludwig, W., and Schleifer, K. H. 1995. Phylogenetic identification and in situ detection of individual microbial cells without cultivation. Microbiol Rev. 59, 1, 143--169.Google ScholarGoogle Scholar
  4. Bader, K. C., Eissler, T., Evans, N., GauthierDickey, C., Grothoff, C., Grothoff, K., Keene, J., Meier, H., Ritzdorf, C., and Rutherford, M. J. 2010. Distributed stream processing with DUP. In Network and Parallel Computing, Lecture Notes in Computer Science, vol. 6289, Springer, Berlin, 232--246. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Bader, K. C., Grothoff, C., and Meier, H. 2011. Comprehensive and relaxed search for oligonucleotide signatures in hierarchically clustered sequence datasets. Bioinformatics 27, 1546--1554. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Bartlett, J. M. S. and Stirling, D. 2003. A short history of the polymerase chain reaction. In PCR Protocols, J. M. Bartlett and D. Stirling, Eds., Methods in Molecular Biology Series, vol. 226, Humana Press, 3--6. 10.1385/1-59259-384-4:3.Google ScholarGoogle Scholar
  7. Bender, M. and Farach-Colton, M. 2000. The lca problem revisited. In LATIN 2000: Theoretical Informatics, G. Gonnet and A. Viola, Eds., Lecture Notes in Computer Science Series, vol. 1776, Springer Berlin, 88--94. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Chung, W. H., Rhee, S. K., Wan, X. F., Bae, J. W., Quan, Z. X., and Park, Y. H. 2005. Design of long oligonucleotide probes for functional gene detection in a microbial community. Bioinformatics 21, 4092--4100. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Cole, J. R., Wang, Q., Cardenas, E., Fish, J., Chai, B., Farris, R. J., Kulam-Syed-Mohideen, A. S., McGarrell, D. M., Marsh, T., Garrity, G. M., and Tiedje, J. M. 2009. The Ribosomal Database Project: Improved alignments and new tools for rRNA analysis. Nucleic Acids Res. 37, D141--145.Google ScholarGoogle ScholarCross RefCross Ref
  10. DeSantis, T. Z., Hugenholtz, P., Larsen, N., Rojas, M., Brodie, E. L., Keller, K., Huber, T., Dalevi, D., Hu, P., and Andersen, G. L. 2006. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl. Environ. Micorbiol. 72, 7, 5069--5072.Google ScholarGoogle ScholarCross RefCross Ref
  11. Doring, A., Weese, D., Rausch, T., and Reinert, K. 2008. Seqan an efficient, generic c++ library for sequence analysis. BMC Bioinformatics 9, 1, 11.Google ScholarGoogle ScholarCross RefCross Ref
  12. Duitama, J., Kumar, D. M., Hemphill, E., Khan, M., Mandoiu, I. I., and Nelson, C. E. 2009. PrimerHunter: a primer design tool for PCR-based virus subtype identification. Nucleic Acids Res. 37, 2483--2492.Google ScholarGoogle ScholarCross RefCross Ref
  13. Eissler, T., Hodges, C. P., and Meier, H. 2011. Ptpan—overcoming memory limitations in oligonucleotide string matching for primer/probe design. Bioinformatics 27, 20, 2797--2805. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Feng, S. and Tillier, E. R. M. R. 2007. A fast and flexible approach to oligonucleotide probe design for genomes and gene families. Bioinformatics 23, 1195--1202. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Ficetola, G. F., Coissac, E., Zundel, S., Riaz, T., Shehzad, W., Bessiere, J., Taberlet, P., and Pompanon, F. 2010. An in silico approach for the evaluation of DNA barcodes. BMC Genomics 11, 434.Google ScholarGoogle ScholarCross RefCross Ref
  16. Fischer, J. and Heun, V. 2006. Theoretical and practical improvements on the rmq-problem, with applications to lca and lce. In Combinatorial Pattern Matching, M. Lewenstein and G. Valiente, Eds., Lecture Notes in Computer Science, vol. 4009, Springer, Berlin, 36--48. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Fischer, J. and Heun, V. 2007. A new succinct representation of rmq-information and improvements in the enhanced suffix array. In Combinatorics, Algorithms, Probabilistic and Experimental Methodologies, B. Chen, M. Paterson, and G. Zhang, Eds., Lecture Notes in Computer Science, vol. 4614, Springer, Berlin, 459--470. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Gabow, H. N., Bentley, J. L., and Tarjan, R. E. 1984. Scaling and related techniques for geometry problems. In Proceedings of the 16th Annual ACM Symposium on Theory of Computing (STOC'84). ACM, New York, 135--143. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Kurtz, S., Phillippy, A., Delcher, A. L., Smoot, M., Shumway, M., Antonescu, C., and Salzberg, S. L. 2004. Versatile and open software for comparing large genomes. Genome Biol. 5, R12.Google ScholarGoogle ScholarCross RefCross Ref
  20. Lee, H. P., Sheu, T.-F., and Tang, C. Y. 2010. A parallel and incremental algorithm for efficient unique signature discovery on dna databases. BMC Bioinformatics 11, 132.Google ScholarGoogle ScholarCross RefCross Ref
  21. Loy, A., Maixner, F., Wagner, M., and Horn, M. 2007. probeBase--an online resource for rRNA-targeted oligonucleotide probes: new features 2007. Nucleic Acids Res. 35 (Database issue) D800--804.Google ScholarGoogle Scholar
  22. Ludwig, W., Strunk, O., Westram, R., Richter, L., Meier, H., Yadhukumar, et al. 2004. ARB: a software environment for sequence data. Nucleic Acids Res. 32, 4, 1363--1371.Google ScholarGoogle ScholarCross RefCross Ref
  23. Olsen, G. 1990. The “Newick's 8:45” tree format. http://evolution.genetics.washington.edu/phylip/newick_doc.html.Google ScholarGoogle Scholar
  24. Peterlongo, P., Nicolas, J., Lavenier, D., Vorch, R., and Querellou, J. 2009. c-gamma:comparative genome analysis of molecular markers. In Pattern Recognition in Bioinformatics, V. Kadirkamanathan, G. Sanguinetti, M. Girolami, M. Niranjan, and J. Noirel, Eds., Lecture Notes in Computer Science, vol. 5780, Springer, Berlin, 255--269. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Phillippy, A. M., Ayanbule, K., Edwards, N. J., and Salzberg, S. L. 2009. Insignia: a DNA signature search web server for diagnostic assay development. Nucleic Acids Res. 37, W229--234.Google ScholarGoogle ScholarCross RefCross Ref
  26. Phillippy, A. M., Mason, J. A., Ayanbule, K., Sommer, D. D., Taviani, E., Huq, A., Colwell, R. R., Knight, I. T., and Salzberg, S. L. 2007. Comprehensive DNA signature discovery and validation. PLoS Comput. Biol. 3, e98.Google ScholarGoogle ScholarCross RefCross Ref
  27. Price, M. N., Dehal, P. S., and Arkin, A. P. 2010. Fasttree 2 approximately maximum-likelihood trees for large alignments. PLoS ONE 5, 3, e9490.Google ScholarGoogle ScholarCross RefCross Ref
  28. Pruesse, E., Quast, C., Knittel, K., Fuchs, B. M., Ludwig, W., Peplies, J., and Glöckner, F. O. 2007. SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 35, 21, 7188--7196.Google ScholarGoogle ScholarCross RefCross Ref
  29. Riaz, T., Shehzad, W., Viari, A., Pompanon, F., Taberlet, P., and Coissac, E. 2011. ecoPrimers: inference of new DNA barcode markers from whole genome sequence analysis. Nucleic Acids Res. 39, e145.Google ScholarGoogle ScholarCross RefCross Ref
  30. Stahl, D. A., Flesher, B., Mansfield, H. R., and Montgomery, L. 1988. Use of phylogenetically based hybridization probes for studies of ruminal microbial ecology. Appl. Environ. Microbiol. 54, 1079--1084.Google ScholarGoogle Scholar
  31. Stamatakis, A. 2006. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22, 2688--2690. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Efficient relaxed search in hierarchically clustered sequence datasets

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image ACM Journal of Experimental Algorithmics
        ACM Journal of Experimental Algorithmics  Volume 17, Issue
        2012
        427 pages
        ISSN:1084-6654
        EISSN:1084-6654
        DOI:10.1145/2133803
        Issue’s Table of Contents

        Copyright © 2012 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 19 July 2012
        • Revised: 1 April 2012
        • Accepted: 1 April 2012
        • Received: 1 January 2012
        Published in jea Volume 17, Issue

        Qualifiers

        • research-article
        • Research
        • Refereed
      • Article Metrics

        • Downloads (Last 12 months)3
        • Downloads (Last 6 weeks)1

        Other Metrics

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader