ABSTRACT
The suffix tree (or equivalently, the enhanced suffix array) provides efficient solutions to many problems involving pattern matching and pattern discovery in large strings, such as those arising in computational biology. Here we address the problem of arranging a suffix array on disk so that querying is fast in practice. We show that the combination of a small trie and a suffix array-like blocked data structure allows queries to be answered as much as three times faster than the best alternative disk-based suffix array arrangement. Construction of our data structure requires only modest processing time on top of that required to build the suffix tree, and requires negligible extra memory.
- M. I. Abouelhoda, S. Kurtz, and E. Ohlebusch. Replacing suffix trees with enhanced suffix arrays. Jour. of Discrete Algorithms, 2(1):53--86, 2004. Google ScholarDigital Library
- M. I. Abouelhoda, E. Ohlebusch, and S. Kurtz. Optimal exact string matching based on suffix arrays. In Proc. String Processing and Information Retrieval Symp., pages 31--43. Springer Verlag, Berlin, Germany, 2002. Google ScholarDigital Library
- S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. Jour. of Molecular Biology, 215(3):403--410, Oct. 1990.Google ScholarCross Ref
- A. Apostolico. The myriad virtues of subword trees. In A. Apostolico and Z. Galil, editors, Combinatorial Algorithms on Words, NATO ASI Series F12, pages 85--96. Springer Verlag, Berlin, Germany, 1985.Google Scholar
- R. A. Baeza-Yates, E. F. Barbosa, and N. Ziviani. Hierarchies of indices for text searching. Information Systems, 21(6):497--514, 1996. Google ScholarDigital Library
- D. Benson, D. J. Lipman, and J. Ostell. Genbank. Nucleic Acids Research, 21(13):2963--2965, 1993.Google ScholarCross Ref
- D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and D. L. Wheeler. Genbank. Nucleic Acids Research, 35:D21--D25, Jan. 2007.Google ScholarCross Ref
- M. Cameron, H. E. Williams, and A. Cannane. Improved gapped alignment in BLAST. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1(3):116--129, 2004. Google ScholarDigital Library
- C. Cheung, J. X. Yu, and H. Lu. Constructing suffix tree for gigabyte sequences with megabyte memory. IEEE Transactions on Knowledge and Data Engineering, 17(1):90--105, 2005. Google ScholarDigital Library
- A. Crauser and P. Ferragina. A theoretical and experimental study on the construction of suffix arrays in external memory. Algorithmica, 32(1):1--35, Jan. 2002.Google ScholarDigital Library
- R. Dementiev, J. Kärkkäinen, J. Mehnert, and P. Sanders. Better external memory suffix array construction. ACM Jour. of Experimental Algorithmics, 2007. To appear. Google ScholarDigital Library
- Ensembl. Ensembl Genome Browser, 2006. http://www.ensembl.org.Google Scholar
- D. Gusfield. Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, Cambridge, UK, 1997. Google ScholarDigital Library
- D. Harman. Overview of the second text retrieval conference (TREC-2). Information Processing and Management, 31(3):271--289, 1995. Google ScholarDigital Library
- T. Kasai, G. Lee, H. Arimura, S. Arikawa, and K. Park. Linear-time longest-common-prefix computation in suffix arrays and its applications. In Proc. Annual Symp. on Combinatorial Pattern Matching, volume 2089 of Lecture Notes in Computer Science, pages 181--192. Springer Verlag, Berlin, Germany, 2001. Google ScholarDigital Library
- S. Kurtz. Reducing the space requirement of suffix trees. Software-Practice and Experience, 29(13):1149--1171, 1999. Google ScholarCross Ref
- U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal of Computing, 22(5):935--948, 1993. Google ScholarDigital Library
- M. A. Maniscalco and S. J. Puglisi. Faster lightweight suffix array construction. In J. Ryan and Dafik, editors, Proc. Australasian Workshop on Combinatorial Algorithms, pages 16--29, 2006.Google Scholar
- G. Manzini. Two space saving tricks for linear time LCP computation. In T. Hagerup and J. Katajainen, editors, Proceedings of 9th Scandinavian Workshop on Algorithm Theory (SWAT '04), volume 3111 of Lecture Notes in Computer Science, pages 372--383. Springer Verlag, Berlin, Germany, 2004.Google Scholar
- G. Manzini and P. Ferragina. Engineering a lightweight suffix array construction algorithm. Algorithmica, 40:33--50, 2004. Google ScholarDigital Library
- E. M. McCreight. A space-economical suffix tree construction algorithm. Jour. of the ACM, 23(2):262--272, 1976. Google ScholarDigital Library
- G. Navarro and V. Mäkinen. Compressed full text indexes. Computing Surveys, 39(1), 2007. Google ScholarDigital Library
- B. Phoophakdee and M. J. Zaki. Genome-scale disk-based suffix tree indexing. In Proc. ACM SIGMOD International Conference on Management of Data, pages 833--844, New York, NY, USA, June 2007. ACM Press. Google ScholarDigital Library
- S. J. Puglisi, W. F. Smyth, and A. Turpin. A taxonomy of suffix array construction algorithms. Computing Surveys, 39(2):1--31, 2007. Google ScholarDigital Library
- J. S. Sim, D. K. Kim, H. Park, and K. Park. Linear-time search in suffix arrays. In M. Miller and K. Park, editors, Proc. Australasian Workshop on Combinatorial Algorithms, pages 139--146, Seoul, Korea, 2003.Google Scholar
- B. Smyth. Computing Patterns in Strings. Addison-Wesley, Reading, Massachusetts, USA, 2003.Google Scholar
- Y. Tian, S. Tata, A. Hankins, and M. Patel. Practical methods for constructing suffix trees. The VLDB Jour., 14(3):281--299, 2005. Google ScholarDigital Library
- E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249--260, 1995.Google ScholarDigital Library
- P. van Emde-Boas. Preserving order in a forest in less than logarithmic time and linear space. Information Processing Letters, 6(3):80--82, 1977.Google ScholarCross Ref
- P. Weiner. Linear pattern matching algorithms. In Proc. Annual Symposium on Foundations of Computer Science, pages 1--11, Silver Spring, MD, USA, 1973. IEEE Computer Society Press.Google Scholar
Index Terms
- Improving suffix array locality for fast pattern matching on disk
Recommendations
The suffix binary search tree and suffix AVL tree
Suffix trees and suffix arrays are classical data structures that are used to represent the set of suffixes of a given string, and thereby facilitate the efficient solution of various string processing problems--in particular on-line string searching. ...
A Suffix Tree Or Not a Suffix Tree?
Combinatorial AlgorithmsAbstractIn this paper we study the structure of suffix trees. Given an unlabeled tree on n nodes and suffix links of its internal nodes, we ask the question “Is a suffix tree?", i.e., is there a string S whose suffix tree has the same topological ...
The affix array data structure and its applications to RNA secondary structure analysis
Efficient string-processing in large data sets like complete genomes is strongly connected to the suffix tree and similar index data structures. With respect to complex structural string analysis like the search for RNA secondary structure patterns, ...
Comments