Abstract
This article presents a new algorithm for finding oligonucleotide signatures that are specific and sensitive for organisms or groups of organisms in large-scale sequence datasets. We assume that the organisms have been organized in a hierarchy, for example, a phylogenetic tree. The resulting signatures, binding sites for primers and probes, match the maximum possible number of organisms in the target group while having at most k matches outside of the target group.
The key step in the algorithm is the use of the lowest common ancestor (LCA) to search the organism hierarchy; this allows the combinatorial problem in almost linear time (empirically observed) to be solved. The presented algorithm improves performance by several orders of magnitude in terms of both memory consumption and runtime when compared to the best-known previous algorithms while giving identical, exact solutions.
This article gives a formal description of the algorithm, discusses details of our concrete, publicly available implementation, and presents the results from our performance evaluation.
- Abouelhoda, M. I., Kurtz, S., and Ohlebusch, E. 2004. Replacing suffix trees with enhanced suffix arrays. J. Discrete Algor. 2, 1, 53--86. Google ScholarDigital Library
- Amann, R. and Fuchs, B. M. 2008. Single-cell identification in microbial communities by improved fluorescence in situ hybridization techniques. Nat. Rev. Microbiol. 6, 5, 339--348.Google ScholarCross Ref
- Amann, R. I., Ludwig, W., and Schleifer, K. H. 1995. Phylogenetic identification and in situ detection of individual microbial cells without cultivation. Microbiol Rev. 59, 1, 143--169.Google Scholar
- Bader, K. C., Eissler, T., Evans, N., GauthierDickey, C., Grothoff, C., Grothoff, K., Keene, J., Meier, H., Ritzdorf, C., and Rutherford, M. J. 2010. Distributed stream processing with DUP. In Network and Parallel Computing, Lecture Notes in Computer Science, vol. 6289, Springer, Berlin, 232--246. Google ScholarDigital Library
- Bader, K. C., Grothoff, C., and Meier, H. 2011. Comprehensive and relaxed search for oligonucleotide signatures in hierarchically clustered sequence datasets. Bioinformatics 27, 1546--1554. Google ScholarDigital Library
- Bartlett, J. M. S. and Stirling, D. 2003. A short history of the polymerase chain reaction. In PCR Protocols, J. M. Bartlett and D. Stirling, Eds., Methods in Molecular Biology Series, vol. 226, Humana Press, 3--6. 10.1385/1-59259-384-4:3.Google Scholar
- Bender, M. and Farach-Colton, M. 2000. The lca problem revisited. In LATIN 2000: Theoretical Informatics, G. Gonnet and A. Viola, Eds., Lecture Notes in Computer Science Series, vol. 1776, Springer Berlin, 88--94. Google ScholarDigital Library
- Chung, W. H., Rhee, S. K., Wan, X. F., Bae, J. W., Quan, Z. X., and Park, Y. H. 2005. Design of long oligonucleotide probes for functional gene detection in a microbial community. Bioinformatics 21, 4092--4100. Google ScholarDigital Library
- Cole, J. R., Wang, Q., Cardenas, E., Fish, J., Chai, B., Farris, R. J., Kulam-Syed-Mohideen, A. S., McGarrell, D. M., Marsh, T., Garrity, G. M., and Tiedje, J. M. 2009. The Ribosomal Database Project: Improved alignments and new tools for rRNA analysis. Nucleic Acids Res. 37, D141--145.Google ScholarCross Ref
- DeSantis, T. Z., Hugenholtz, P., Larsen, N., Rojas, M., Brodie, E. L., Keller, K., Huber, T., Dalevi, D., Hu, P., and Andersen, G. L. 2006. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl. Environ. Micorbiol. 72, 7, 5069--5072.Google ScholarCross Ref
- Doring, A., Weese, D., Rausch, T., and Reinert, K. 2008. Seqan an efficient, generic c++ library for sequence analysis. BMC Bioinformatics 9, 1, 11.Google ScholarCross Ref
- Duitama, J., Kumar, D. M., Hemphill, E., Khan, M., Mandoiu, I. I., and Nelson, C. E. 2009. PrimerHunter: a primer design tool for PCR-based virus subtype identification. Nucleic Acids Res. 37, 2483--2492.Google ScholarCross Ref
- Eissler, T., Hodges, C. P., and Meier, H. 2011. Ptpan—overcoming memory limitations in oligonucleotide string matching for primer/probe design. Bioinformatics 27, 20, 2797--2805. Google ScholarDigital Library
- Feng, S. and Tillier, E. R. M. R. 2007. A fast and flexible approach to oligonucleotide probe design for genomes and gene families. Bioinformatics 23, 1195--1202. Google ScholarDigital Library
- Ficetola, G. F., Coissac, E., Zundel, S., Riaz, T., Shehzad, W., Bessiere, J., Taberlet, P., and Pompanon, F. 2010. An in silico approach for the evaluation of DNA barcodes. BMC Genomics 11, 434.Google ScholarCross Ref
- Fischer, J. and Heun, V. 2006. Theoretical and practical improvements on the rmq-problem, with applications to lca and lce. In Combinatorial Pattern Matching, M. Lewenstein and G. Valiente, Eds., Lecture Notes in Computer Science, vol. 4009, Springer, Berlin, 36--48. Google ScholarDigital Library
- Fischer, J. and Heun, V. 2007. A new succinct representation of rmq-information and improvements in the enhanced suffix array. In Combinatorics, Algorithms, Probabilistic and Experimental Methodologies, B. Chen, M. Paterson, and G. Zhang, Eds., Lecture Notes in Computer Science, vol. 4614, Springer, Berlin, 459--470. Google ScholarDigital Library
- Gabow, H. N., Bentley, J. L., and Tarjan, R. E. 1984. Scaling and related techniques for geometry problems. In Proceedings of the 16th Annual ACM Symposium on Theory of Computing (STOC'84). ACM, New York, 135--143. Google ScholarDigital Library
- Kurtz, S., Phillippy, A., Delcher, A. L., Smoot, M., Shumway, M., Antonescu, C., and Salzberg, S. L. 2004. Versatile and open software for comparing large genomes. Genome Biol. 5, R12.Google ScholarCross Ref
- Lee, H. P., Sheu, T.-F., and Tang, C. Y. 2010. A parallel and incremental algorithm for efficient unique signature discovery on dna databases. BMC Bioinformatics 11, 132.Google ScholarCross Ref
- Loy, A., Maixner, F., Wagner, M., and Horn, M. 2007. probeBase--an online resource for rRNA-targeted oligonucleotide probes: new features 2007. Nucleic Acids Res. 35 (Database issue) D800--804.Google Scholar
- Ludwig, W., Strunk, O., Westram, R., Richter, L., Meier, H., Yadhukumar, et al. 2004. ARB: a software environment for sequence data. Nucleic Acids Res. 32, 4, 1363--1371.Google ScholarCross Ref
- Olsen, G. 1990. The “Newick's 8:45” tree format. http://evolution.genetics.washington.edu/phylip/newick_doc.html.Google Scholar
- Peterlongo, P., Nicolas, J., Lavenier, D., Vorch, R., and Querellou, J. 2009. c-gamma:comparative genome analysis of molecular markers. In Pattern Recognition in Bioinformatics, V. Kadirkamanathan, G. Sanguinetti, M. Girolami, M. Niranjan, and J. Noirel, Eds., Lecture Notes in Computer Science, vol. 5780, Springer, Berlin, 255--269. Google ScholarDigital Library
- Phillippy, A. M., Ayanbule, K., Edwards, N. J., and Salzberg, S. L. 2009. Insignia: a DNA signature search web server for diagnostic assay development. Nucleic Acids Res. 37, W229--234.Google ScholarCross Ref
- Phillippy, A. M., Mason, J. A., Ayanbule, K., Sommer, D. D., Taviani, E., Huq, A., Colwell, R. R., Knight, I. T., and Salzberg, S. L. 2007. Comprehensive DNA signature discovery and validation. PLoS Comput. Biol. 3, e98.Google ScholarCross Ref
- Price, M. N., Dehal, P. S., and Arkin, A. P. 2010. Fasttree 2 approximately maximum-likelihood trees for large alignments. PLoS ONE 5, 3, e9490.Google ScholarCross Ref
- Pruesse, E., Quast, C., Knittel, K., Fuchs, B. M., Ludwig, W., Peplies, J., and Glöckner, F. O. 2007. SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 35, 21, 7188--7196.Google ScholarCross Ref
- Riaz, T., Shehzad, W., Viari, A., Pompanon, F., Taberlet, P., and Coissac, E. 2011. ecoPrimers: inference of new DNA barcode markers from whole genome sequence analysis. Nucleic Acids Res. 39, e145.Google ScholarCross Ref
- Stahl, D. A., Flesher, B., Mansfield, H. R., and Montgomery, L. 1988. Use of phylogenetically based hybridization probes for studies of ruminal microbial ecology. Appl. Environ. Microbiol. 54, 1079--1084.Google Scholar
- Stamatakis, A. 2006. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22, 2688--2690. Google ScholarDigital Library
Index Terms
- Efficient relaxed search in hierarchically clustered sequence datasets
Recommendations
Comprehensive and relaxed search for oligonucleotide signatures in hierarchically clustered sequence datasets
Motivation: PCR, hybridization, DNA sequencing and other important methods in molecular diagnostics rely on both sequence-specific and sequence group-specific oligonucleotide primers and probes. Their design depends on the identification of ...
Circular code motifs near the ribosome decoding center
A maximal C3 self-complementary trinucleotide circular code X is identified in genes of bacteria, eukaryotes, plasmids and viruses (Michel, 2015; Arquès and Michel, 1996). A translation (framing) code based on the circular code was proposed in Michel (...
Circular code motifs in the ribosome decoding center
Display Omitted Identification of circular code motifs in the ribosome decoding center.The universally conserved nucleotides A1492 and A1493 in circular code motifs.Identification of the conserved nucleotide G530 in nuclear and chloroplast rRNAs.The ...
Comments