research-article

Free Access

Parallel distributed memory construction of suffix and longest common prefix arrays

Authors:
Patrick Flick

Georgia Institute of Technology, Atlanta, Georgia

Georgia Institute of Technology, Atlanta, Georgia
View Profile

,
Srinivas Aluru

Georgia Institute of Technology, Atlanta, Georgia

Georgia Institute of Technology, Atlanta, Georgia
View Profile

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and AnalysisNovember 2015Article No.: 16Pages 1–10https://doi.org/10.1145/2807591.2807609

Published:15 November 2015Publication History

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Pages 1–10

ABSTRACT

Suffix arrays and trees are fundamental string data structures of importance to many applications in computational biology. Consequently, their parallel construction is an actively studied problem. To date, algorithms with best practical performance lack efficient worst-case run-time guarantees, and vice versa. In addition, much of the recent work targeted low core count, shared memory parallelization. In this paper, we present parallel algorithms for distributed memory construction of suffix arrays and longest common prefix (LCP) arrays that simultaneously achieve good worst-case run-time bounds and superior practical performance. Our algorithms run in O(T_sort(n, p) · log n) worst-case time where T_sort(n, p) is the run-time of parallel sorting. We present several algorithm engineering techniques that improve performance in practice. We demonstrate the construction of suffix and LCP arrays of the human genome in less than 8 seconds on 1,024 Intel Xeon cores, reaching speedups of over 110X compared to the best sequential suffix array construction implementation divsufsort.

References

A. Abdelhadi, A. Kandil, and M. Abouelhoda. Cloud-based parallel suffix array construction based on mpi. In Biomedical Engineering (MECBME), 2014 Middle East Conference on, pages 334--337. IEEE, 2014.Google ScholarCross Ref
M. I. Abouelhoda, S. Kurtz, and E. Ohlebusch. Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms, 2(1):53--86, 2004. Google ScholarDigital Library
J. L. Bentley and R. Sedgewick. Fast algorithms for sorting and searching strings. In SODA, volume 97, pages 360--369, 1997. Google ScholarDigital Library
G. E. Blelloch, C. E. Leiserson, B. M. Maggs, C. G. Plaxton, S. J. Smith, and M. Zagha. A comparison of sorting algorithms for the connection machine cm-2. In Proceedings of the third annual ACM symposium on Parallel algorithms and architectures, pages 3--16. ACM, 1991. Google ScholarDigital Library
M. Comin and M. Farreras. Efficient parallel construction of suffix trees for genomes larger than main memory. In Proceedings of the 20th European MPI Users' Group Meeting, pages 211--216. ACM, 2013. Google ScholarDigital Library
G. P. Consortium et al. An integrated map of genetic variation from 1,092 human genomes. Nature, 491(7422):56--65, 2012.Google ScholarCross Ref
M. Deo and S. Keely. Parallel suffix array and least common prefix for the gpu. In ACM SIGPLAN Notices, volume 48, pages 197--206. ACM, 2013. Google ScholarDigital Library
J. Fischer and V. Heun. A new succinct representation of rmq-information and improvements in the enhanced suffix array. In Combinatorics, Algorithms, Probabilistic and Experimental Methodologies, pages 459--470. Springer, 2007. Google ScholarDigital Library
N. Futamura, S. Aluru, and S. Kurtz. Parallel suffix sorting. pages 76--81, 2001.Google Scholar
A. Ghoting and K. Makarychev. Indexing genomic sequences on the ibm blue gene. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, page 61. ACM, 2009. Google ScholarDigital Library
A. Ghoting and K. Makarychev. Serial and parallel methods for i/o efficient suffix tree construction. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, pages 827--840. ACM, 2009. Google ScholarDigital Library
R. Hariharan. Optimal parallel suffix tree construction. In Proceedings of the twenty-sixth annual ACM symposium on Theory of computing, pages 290--299. ACM, 1994. Google ScholarDigital Library
R. Homann, D. Fleer, R. Giegerich, and M. Rehmsmeier. mkesa: enhanced suffix array construction tool. Bioinformatics, 25(8):1084--1085, 2009. Google ScholarDigital Library
J. JaJa and K. W. Ryu. Optimal algorithms on the pipelined hypercube and related networks. Parallel and Distributed Systems, IEEE Transactions on, 4(5):582--591, 1993. Google ScholarDigital Library
J. Kärkkäinen and P. Sanders. Simple linear work suffix array construction. In Automata, Languages and Programming, pages 943--955. Springer, 2003. Google ScholarCross Ref
R. M. Karp, R. E. Miller, and A. L. Rosenberg. Rapid identification of repeated patterns in strings, trees and arrays. In Proceedings of the fourth annual ACM symposium on Theory of computing, pages 125--136. ACM, 1972. Google ScholarDigital Library
T. Kasai, G. Lee, H. Arimura, S. Arikawa, and K. Park. Linear-time longest-common-prefix computation in suffix arrays and its applications. In Combinatorial pattern matching, pages 181--192. Springer, 2001. Google ScholarCross Ref
P. Ko and S. Aluru. Space efficient linear time construction of suffix arrays. In Combinatorial Pattern Matching, pages 200--210. Springer, 2003. Google ScholarCross Ref
F. Kulla and P. Sanders. Scalable parallel suffix array construction. Parallel Computing, 33(9):605--612, 2007. Google ScholarDigital Library
N. J. Larsson and K. Sadakane. Faster suffix sorting. Theoretical Computer Science, 387(3):258--272, 2007. Google ScholarDigital Library
U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. siam Journal on Computing, 22(5):935--948, 1993. Google ScholarDigital Library
E. Mansour, A. Allam, S. Skiadopoulos, and P. Kalnis. Era: efficient serial and parallel suffix tree construction for very long strings. Proceedings of the VLDB Endowment, 5(1):49--60, 2011. Google ScholarDigital Library
G. Manzini and P. Ferragina. Engineering a lightweight suffix array construction algorithm. Algorithmica, 40(1):33--50, 2004. Google ScholarDigital Library
H. Mohamed and M. Abouelhoda. Parallel suffix sorting based on bucket pointer refinement. In Biomedical Engineering Conference (CIBEC), 2010 5th Cairo International, pages 98--102. IEEE, 2010.Google ScholarCross Ref
Y. Mori. libdivsufsort. https://github.com/y-256/libdivsufsort.Google Scholar
B. Nystedt, N. R. Street, A. Wetterbom, A. Zuccolo, Y.-C. Lin, D. G. Scofield, F. Vezzi, N. Delhomme, S. Giacomello, A. Alexeyenko, et al. The norway spruce genome sequence and conifer genome evolution. Nature, 497(7451):579--584, 2013.Google ScholarCross Ref
V. Osipov. Parallel suffix array construction for shared memory architectures. In String Processing and Information Retrieval, pages 379--384. Springer, 2012. Google ScholarDigital Library
S. J. Puglisi, W. F. Smyth, and A. H. Turpin. A taxonomy of suffix array construction algorithms. ACM Computing Surveys (CSUR), 39(2):4, 2007. Google ScholarDigital Library
K.-B. Schürmann and J. Stoye. An incomplex algorithm for fast suffix array construction. Software: Practice and Experience, 37(3):309--329, 2007. Google ScholarDigital Library
J. Shun. Fast parallel computation of longest common prefixes. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 387--398. IEEE Press, 2014. Google ScholarDigital Library
E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249--260, 1995.Google ScholarDigital Library

Index Terms

Parallel distributed memory construction of suffix and longest common prefix arrays

Recommendations

Parallel suffix array and least common prefix for the GPU
PPoPP '13: Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming

Suffix Array (SA) is a data structure formed by sorting the suffixes of a string into lexicographic order. SAs have been used in a variety of applications, most notably in pattern matching and Burrows-Wheeler Transform (BWT) based lossless data ...
Read More
String inference from longest-common-prefix array
Abstract
The suffix array, perhaps the most important data structure in modern string processing, is often augmented with the longest common prefix (LCP) array which stores the lengths of the longest common prefixes for lexicographically ...
Read More
Parallel suffix array and least common prefix for the GPU
PPoPP '13

Suffix Array (SA) is a data structure formed by sorting the suffixes of a string into lexicographic order. SAs have been used in a variety of applications, most notably in pattern matching and Burrows-Wheeler Transform (BWT) based lossless data ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2015
985 pages
ISBN:9781450337236
DOI:10.1145/2807591
General Chair:
Jackie Kern
University of Illinois at Urbana-Champaign, Urbana, Illinois
,
Program Chair:
Jeffrey S. Vetter
Oak Ridge National Laboratory and Georgia Institute of Technology, Oak Ridge, Tennessee
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 15 November 2015
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- research-article
Conference

Acceptance Rates
SC '15 Paper Acceptance Rate79of358submissions,22%Overall Acceptance Rate1,516of6,373submissions,24%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 23
  Total Citations
  View Citations
- 1,114
  Total Downloads
- Downloads (Last 12 months)80
- Downloads (Last 6 weeks)13
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Parallel distributed memory construction of suffix and longest common prefix arrays

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

ABSTRACT

References

Cited By

Index Terms

Recommendations

Parallel suffix array and least common prefix for the GPU

String inference from longest-common-prefix array

Parallel suffix array and least common prefix for the GPU

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Parallel distributed memory construction of suffix and longest common prefix arrays

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

ABSTRACT

References

Cited By

Index Terms

Recommendations

Parallel suffix array and least common prefix for the GPU

String inference from longest-common-prefix array

Parallel suffix array and least common prefix for the GPU

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media