skip to main content
10.1145/1376616.1376683acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Improving suffix array locality for fast pattern matching on disk

Published:09 June 2008Publication History

ABSTRACT

The suffix tree (or equivalently, the enhanced suffix array) provides efficient solutions to many problems involving pattern matching and pattern discovery in large strings, such as those arising in computational biology. Here we address the problem of arranging a suffix array on disk so that querying is fast in practice. We show that the combination of a small trie and a suffix array-like blocked data structure allows queries to be answered as much as three times faster than the best alternative disk-based suffix array arrangement. Construction of our data structure requires only modest processing time on top of that required to build the suffix tree, and requires negligible extra memory.

References

  1. M. I. Abouelhoda, S. Kurtz, and E. Ohlebusch. Replacing suffix trees with enhanced suffix arrays. Jour. of Discrete Algorithms, 2(1):53--86, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. M. I. Abouelhoda, E. Ohlebusch, and S. Kurtz. Optimal exact string matching based on suffix arrays. In Proc. String Processing and Information Retrieval Symp., pages 31--43. Springer Verlag, Berlin, Germany, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. Jour. of Molecular Biology, 215(3):403--410, Oct. 1990.Google ScholarGoogle ScholarCross RefCross Ref
  4. A. Apostolico. The myriad virtues of subword trees. In A. Apostolico and Z. Galil, editors, Combinatorial Algorithms on Words, NATO ASI Series F12, pages 85--96. Springer Verlag, Berlin, Germany, 1985.Google ScholarGoogle Scholar
  5. R. A. Baeza-Yates, E. F. Barbosa, and N. Ziviani. Hierarchies of indices for text searching. Information Systems, 21(6):497--514, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. D. Benson, D. J. Lipman, and J. Ostell. Genbank. Nucleic Acids Research, 21(13):2963--2965, 1993.Google ScholarGoogle ScholarCross RefCross Ref
  7. D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and D. L. Wheeler. Genbank. Nucleic Acids Research, 35:D21--D25, Jan. 2007.Google ScholarGoogle ScholarCross RefCross Ref
  8. M. Cameron, H. E. Williams, and A. Cannane. Improved gapped alignment in BLAST. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1(3):116--129, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. C. Cheung, J. X. Yu, and H. Lu. Constructing suffix tree for gigabyte sequences with megabyte memory. IEEE Transactions on Knowledge and Data Engineering, 17(1):90--105, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. A. Crauser and P. Ferragina. A theoretical and experimental study on the construction of suffix arrays in external memory. Algorithmica, 32(1):1--35, Jan. 2002.Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. R. Dementiev, J. Kärkkäinen, J. Mehnert, and P. Sanders. Better external memory suffix array construction. ACM Jour. of Experimental Algorithmics, 2007. To appear. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Ensembl. Ensembl Genome Browser, 2006. http://www.ensembl.org.Google ScholarGoogle Scholar
  13. D. Gusfield. Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, Cambridge, UK, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. D. Harman. Overview of the second text retrieval conference (TREC-2). Information Processing and Management, 31(3):271--289, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. T. Kasai, G. Lee, H. Arimura, S. Arikawa, and K. Park. Linear-time longest-common-prefix computation in suffix arrays and its applications. In Proc. Annual Symp. on Combinatorial Pattern Matching, volume 2089 of Lecture Notes in Computer Science, pages 181--192. Springer Verlag, Berlin, Germany, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. S. Kurtz. Reducing the space requirement of suffix trees. Software-Practice and Experience, 29(13):1149--1171, 1999. Google ScholarGoogle ScholarCross RefCross Ref
  17. U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal of Computing, 22(5):935--948, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. M. A. Maniscalco and S. J. Puglisi. Faster lightweight suffix array construction. In J. Ryan and Dafik, editors, Proc. Australasian Workshop on Combinatorial Algorithms, pages 16--29, 2006.Google ScholarGoogle Scholar
  19. G. Manzini. Two space saving tricks for linear time LCP computation. In T. Hagerup and J. Katajainen, editors, Proceedings of 9th Scandinavian Workshop on Algorithm Theory (SWAT '04), volume 3111 of Lecture Notes in Computer Science, pages 372--383. Springer Verlag, Berlin, Germany, 2004.Google ScholarGoogle Scholar
  20. G. Manzini and P. Ferragina. Engineering a lightweight suffix array construction algorithm. Algorithmica, 40:33--50, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. E. M. McCreight. A space-economical suffix tree construction algorithm. Jour. of the ACM, 23(2):262--272, 1976. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. G. Navarro and V. Mäkinen. Compressed full text indexes. Computing Surveys, 39(1), 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. B. Phoophakdee and M. J. Zaki. Genome-scale disk-based suffix tree indexing. In Proc. ACM SIGMOD International Conference on Management of Data, pages 833--844, New York, NY, USA, June 2007. ACM Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. S. J. Puglisi, W. F. Smyth, and A. Turpin. A taxonomy of suffix array construction algorithms. Computing Surveys, 39(2):1--31, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. J. S. Sim, D. K. Kim, H. Park, and K. Park. Linear-time search in suffix arrays. In M. Miller and K. Park, editors, Proc. Australasian Workshop on Combinatorial Algorithms, pages 139--146, Seoul, Korea, 2003.Google ScholarGoogle Scholar
  26. B. Smyth. Computing Patterns in Strings. Addison-Wesley, Reading, Massachusetts, USA, 2003.Google ScholarGoogle Scholar
  27. Y. Tian, S. Tata, A. Hankins, and M. Patel. Practical methods for constructing suffix trees. The VLDB Jour., 14(3):281--299, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249--260, 1995.Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. P. van Emde-Boas. Preserving order in a forest in less than logarithmic time and linear space. Information Processing Letters, 6(3):80--82, 1977.Google ScholarGoogle ScholarCross RefCross Ref
  30. P. Weiner. Linear pattern matching algorithms. In Proc. Annual Symposium on Foundations of Computer Science, pages 1--11, Silver Spring, MD, USA, 1973. IEEE Computer Society Press.Google ScholarGoogle Scholar

Index Terms

  1. Improving suffix array locality for fast pattern matching on disk

            Recommendations

            Comments

            Login options

            Check if you have access through your login credentials or your institution to get full access on this article.

            Sign in
            • Published in

              cover image ACM Conferences
              SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data
              June 2008
              1396 pages
              ISBN:9781605581026
              DOI:10.1145/1376616

              Copyright © 2008 ACM

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 9 June 2008

              Permissions

              Request permissions about this article.

              Request Permissions

              Check for updates

              Qualifiers

              • research-article

              Acceptance Rates

              Overall Acceptance Rate785of4,003submissions,20%

            PDF Format

            View or Download as a PDF file.

            PDF

            eReader

            View online with eReader.

            eReader