ABSTRACT
Suffix arrays and trees are fundamental string data structures of importance to many applications in computational biology. Consequently, their parallel construction is an actively studied problem. To date, algorithms with best practical performance lack efficient worst-case run-time guarantees, and vice versa. In addition, much of the recent work targeted low core count, shared memory parallelization. In this paper, we present parallel algorithms for distributed memory construction of suffix arrays and longest common prefix (LCP) arrays that simultaneously achieve good worst-case run-time bounds and superior practical performance. Our algorithms run in O(Tsort(n, p) · log n) worst-case time where Tsort(n, p) is the run-time of parallel sorting. We present several algorithm engineering techniques that improve performance in practice. We demonstrate the construction of suffix and LCP arrays of the human genome in less than 8 seconds on 1,024 Intel Xeon cores, reaching speedups of over 110X compared to the best sequential suffix array construction implementation divsufsort.
- A. Abdelhadi, A. Kandil, and M. Abouelhoda. Cloud-based parallel suffix array construction based on mpi. In Biomedical Engineering (MECBME), 2014 Middle East Conference on, pages 334--337. IEEE, 2014.Google ScholarCross Ref
- M. I. Abouelhoda, S. Kurtz, and E. Ohlebusch. Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms, 2(1):53--86, 2004. Google ScholarDigital Library
- J. L. Bentley and R. Sedgewick. Fast algorithms for sorting and searching strings. In SODA, volume 97, pages 360--369, 1997. Google ScholarDigital Library
- G. E. Blelloch, C. E. Leiserson, B. M. Maggs, C. G. Plaxton, S. J. Smith, and M. Zagha. A comparison of sorting algorithms for the connection machine cm-2. In Proceedings of the third annual ACM symposium on Parallel algorithms and architectures, pages 3--16. ACM, 1991. Google ScholarDigital Library
- M. Comin and M. Farreras. Efficient parallel construction of suffix trees for genomes larger than main memory. In Proceedings of the 20th European MPI Users' Group Meeting, pages 211--216. ACM, 2013. Google ScholarDigital Library
- G. P. Consortium et al. An integrated map of genetic variation from 1,092 human genomes. Nature, 491(7422):56--65, 2012.Google ScholarCross Ref
- M. Deo and S. Keely. Parallel suffix array and least common prefix for the gpu. In ACM SIGPLAN Notices, volume 48, pages 197--206. ACM, 2013. Google ScholarDigital Library
- J. Fischer and V. Heun. A new succinct representation of rmq-information and improvements in the enhanced suffix array. In Combinatorics, Algorithms, Probabilistic and Experimental Methodologies, pages 459--470. Springer, 2007. Google ScholarDigital Library
- N. Futamura, S. Aluru, and S. Kurtz. Parallel suffix sorting. pages 76--81, 2001.Google Scholar
- A. Ghoting and K. Makarychev. Indexing genomic sequences on the ibm blue gene. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, page 61. ACM, 2009. Google ScholarDigital Library
- A. Ghoting and K. Makarychev. Serial and parallel methods for i/o efficient suffix tree construction. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, pages 827--840. ACM, 2009. Google ScholarDigital Library
- R. Hariharan. Optimal parallel suffix tree construction. In Proceedings of the twenty-sixth annual ACM symposium on Theory of computing, pages 290--299. ACM, 1994. Google ScholarDigital Library
- R. Homann, D. Fleer, R. Giegerich, and M. Rehmsmeier. mkesa: enhanced suffix array construction tool. Bioinformatics, 25(8):1084--1085, 2009. Google ScholarDigital Library
- J. JaJa and K. W. Ryu. Optimal algorithms on the pipelined hypercube and related networks. Parallel and Distributed Systems, IEEE Transactions on, 4(5):582--591, 1993. Google ScholarDigital Library
- J. Kärkkäinen and P. Sanders. Simple linear work suffix array construction. In Automata, Languages and Programming, pages 943--955. Springer, 2003. Google ScholarCross Ref
- R. M. Karp, R. E. Miller, and A. L. Rosenberg. Rapid identification of repeated patterns in strings, trees and arrays. In Proceedings of the fourth annual ACM symposium on Theory of computing, pages 125--136. ACM, 1972. Google ScholarDigital Library
- T. Kasai, G. Lee, H. Arimura, S. Arikawa, and K. Park. Linear-time longest-common-prefix computation in suffix arrays and its applications. In Combinatorial pattern matching, pages 181--192. Springer, 2001. Google ScholarCross Ref
- P. Ko and S. Aluru. Space efficient linear time construction of suffix arrays. In Combinatorial Pattern Matching, pages 200--210. Springer, 2003. Google ScholarCross Ref
- F. Kulla and P. Sanders. Scalable parallel suffix array construction. Parallel Computing, 33(9):605--612, 2007. Google ScholarDigital Library
- N. J. Larsson and K. Sadakane. Faster suffix sorting. Theoretical Computer Science, 387(3):258--272, 2007. Google ScholarDigital Library
- U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. siam Journal on Computing, 22(5):935--948, 1993. Google ScholarDigital Library
- E. Mansour, A. Allam, S. Skiadopoulos, and P. Kalnis. Era: efficient serial and parallel suffix tree construction for very long strings. Proceedings of the VLDB Endowment, 5(1):49--60, 2011. Google ScholarDigital Library
- G. Manzini and P. Ferragina. Engineering a lightweight suffix array construction algorithm. Algorithmica, 40(1):33--50, 2004. Google ScholarDigital Library
- H. Mohamed and M. Abouelhoda. Parallel suffix sorting based on bucket pointer refinement. In Biomedical Engineering Conference (CIBEC), 2010 5th Cairo International, pages 98--102. IEEE, 2010.Google ScholarCross Ref
- Y. Mori. libdivsufsort. https://github.com/y-256/libdivsufsort.Google Scholar
- B. Nystedt, N. R. Street, A. Wetterbom, A. Zuccolo, Y.-C. Lin, D. G. Scofield, F. Vezzi, N. Delhomme, S. Giacomello, A. Alexeyenko, et al. The norway spruce genome sequence and conifer genome evolution. Nature, 497(7451):579--584, 2013.Google ScholarCross Ref
- V. Osipov. Parallel suffix array construction for shared memory architectures. In String Processing and Information Retrieval, pages 379--384. Springer, 2012. Google ScholarDigital Library
- S. J. Puglisi, W. F. Smyth, and A. H. Turpin. A taxonomy of suffix array construction algorithms. ACM Computing Surveys (CSUR), 39(2):4, 2007. Google ScholarDigital Library
- K.-B. Schürmann and J. Stoye. An incomplex algorithm for fast suffix array construction. Software: Practice and Experience, 37(3):309--329, 2007. Google ScholarDigital Library
- J. Shun. Fast parallel computation of longest common prefixes. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 387--398. IEEE Press, 2014. Google ScholarDigital Library
- E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249--260, 1995.Google ScholarDigital Library
Index Terms
- Parallel distributed memory construction of suffix and longest common prefix arrays
Recommendations
Parallel suffix array and least common prefix for the GPU
PPoPP '13: Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programmingSuffix Array (SA) is a data structure formed by sorting the suffixes of a string into lexicographic order. SAs have been used in a variety of applications, most notably in pattern matching and Burrows-Wheeler Transform (BWT) based lossless data ...
String inference from longest-common-prefix array
AbstractThe suffix array, perhaps the most important data structure in modern string processing, is often augmented with the longest common prefix (LCP) array which stores the lengths of the longest common prefixes for lexicographically ...
Parallel suffix array and least common prefix for the GPU
PPoPP '13Suffix Array (SA) is a data structure formed by sorting the suffixes of a string into lexicographic order. SAs have been used in a variety of applications, most notably in pattern matching and Burrows-Wheeler Transform (BWT) based lossless data ...
Comments