skip to main content
10.1145/3243176.3243207acmconferencesArticle/Chapter ViewAbstractPublication PagespactConference Proceedingsconference-collections
research-article
Public Access

DART: distributed adaptive radix tree for efficient affix-based keyword search on HPC systems

Published:01 November 2018Publication History

ABSTRACT

Affix-based search is a fundamental functionality for storage systems. It allows users to find desired datasets, where attributes of a dataset match an affix. While building inverted index to facilitate efficient affix-based keyword search is a common practice for standalone databases and for desktop file systems, building local indexes or adopting indexing techniques used in a standalone data store is insufficient for high-performance computing (HPC) systems due to the massive amount of data and distributed nature of the storage devices within a system. In this paper, we propose Distributed Adaptive Radix Tree (DART), to address the challenge of distributed affix-based keyword search on HPC systems. This trie-based approach is scalable in achieving efficient affix-based search and alleviating imbalanced keyword distribution and excessive requests on keywords at scale. Our evaluation at different scales shows that, comparing with the "full string hashing" use case of the most popular distributed indexing technique - Distributed Hash Table (DHT), DART achieves up to 55× better throughput with prefix search and with suffix search, while achieving comparable throughput with exact and infix searches. Also, comparing to the "initial hashing" use case of DHT, DART maintains a balanced keyword distribution on distributed nodes and alleviates excessive query workload against popular keywords.

References

  1. MG Aartsen, K Abraham, M Ackermann, J Adams, JA Aguilar, M Ahlers, M Ahrens, D Altmann, K Andeen, T Anderson, et al. 2016. Search for sources of High-Energy neutrons with four years of data from the Icetop Detector. The Astrophysical Journal 830, 2(2016), 129.Google ScholarGoogle ScholarCross RefCross Ref
  2. apache.org. 2014. SolrCloud. https://wiki.apache.org/solr/SolrCloud.Google ScholarGoogle Scholar
  3. Baruch Awerbuch and Christian Scheideler. 2003. Peer-to-peer systems for prefix search. In Proceedings of the twenty-second annual symposium on Principles of distributed computing. ACM, 123--132. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Dirk Bradler, Jussi Kangasharju, and Max Mühlhäuser. 2008. Optimally Efficient Prefix Search and Multicast in Structured P2P Networks. CoRR abs/0808.1207 (2008). http://arxiv.org/abs/0808.1207Google ScholarGoogle Scholar
  5. Ralph Böhme. 2013. libuuid. https://sourceforge.net/projects/libuuid/Google ScholarGoogle Scholar
  6. Hailong Cai and Jun Wang. 2004. Foreseer: a novel, locality-aware peer-to-peer system architecture for keyword searches. In Proceedings of the 5th ACM/IFIP/USENIX international conference on Middleware. Springer-Verlag New York, Inc., 38--58. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Chi Chen, Zhi Deng, Richard Tran, Hanmei Tang, Iek-Heng Chu, and Shyue Ping Ong. 2017. Accurate force field for molybdenum by machine learning large materials data. Physical Review Materials 1, 4 (2017), 043603.Google ScholarGoogle ScholarCross RefCross Ref
  8. Douglas Comer. 1979. Ubiquitous B-tree. ACM Computing Surveys (CSUR) 11, 2 (1979), 121--137. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. J. Corbet. 2006. Trees I: Radix trees. http://lwn.net/Articles/175432/Google ScholarGoogle Scholar
  10. T Cormen, C Leiserson, R Rivest, and Clifford Stein. 2001. The rabin-karp algorithm. Introduction to Algorithms (2001), 911--916.Google ScholarGoogle Scholar
  11. elastic.co. 2017. Distributed Search Execution. https://www.elastic.co/guide/en/elasticsearch/guide/current/distributed-search.html.Google ScholarGoogle Scholar
  12. Brian Everitt and Anders Skrondal. 2002. The Cambridge dictionary of statistics. Vol. 106. Cambridge University Press Cambridge.Google ScholarGoogle Scholar
  13. Apache Software Fundation. 2017. Apache Lucene. https://lucene.apache.org.Google ScholarGoogle Scholar
  14. Gaston H Gonnet, Ricardo A Baeza-Yates, and Tim Snider. 1992. New Indices for Text: Pat Trees and Pat Arrays. Information Retrieval: Data Structures & Algorithms 66 (1992), 82. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Matthew Harren, Joseph Hellerstein, Ryan Huebsch, Boon Loo, Scott Shenker, and Ion Stoica. 2002. Complex queries in DHT-based peer-to-peer networks. Peer-to-peer systems (2002), 242--250. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. John E Hopcroft, Rajeev Motwani, and Jeffrey D Ullman. 2006. Automata theory, languages, and computation. International Edition 24 (2006). Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Yuh-Jzer Joung and Li-Wei Yang. 2006. KISS: A simple prefix search scheme in P2P networks. In Proc. of the WebDB Workshop. 56--61.Google ScholarGoogle Scholar
  18. Yuh-Jzer Joung, Li-Wei Yang, and Chien-Tse Fang. 2007. Keyword search in dhtbased peer-to-peer networks. IEEE Journal on Selected Areas in Communications 25, 1 (2007). Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Donald Knuth. 1997. 6.3: Digital Searching. The Art of Computer Programming Volume 3: Sorting and Searching (1997), 492.Google ScholarGoogle Scholar
  20. Donald E. Knuth, Jr. James H.Morris, and Vaughan R. Pratt. 1977. Fast Pattern Matching in Strings. SIAM J. Comput. 6, 2 (1977), 323--350.Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Viktor Leis, Alfons Kemper, and Thomas Neumann. 2013. The Adaptive Radix Tree: ARTful Indexing for Main-memory Databases. In Proceedings of the 2013 IEEE International Conference on Data Engineering (ICDE 2013) (ICDE '13). IEEE Computer Society, Washington, DC, USA, 38--49. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. J. Liu, D. Bard, Q. Koziol, S. Bailey, and Prabhat. 2017. Searching for millions of objects in the BOSS spectroscopic survey data with H5Boss. In 2017 New York Scientific Data Summit (NYSDS). 1--9.Google ScholarGoogle Scholar
  23. Yaning Liu, George Shu Heng Pau, and Stefan Finsterle. 2017. Implicit sampling combined with reduced order modeling for the inversion of vadose zone hydrological data. Computers & Geosciences (2017). Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Christopher D Manning, Prabhakar Raghavan, Hinrich Schütze, et al. 2008. Introduction to information retrieval, Chapter 20.3 Distributing indexes, 415--416. Volume 1 of {26}.Google ScholarGoogle Scholar
  25. Christopher D Manning, Prabhakar Raghavan, Hinrich Schütze, et al. 2008. Introduction to information retrieval, Chapter 4.4 Distributed indexing, 68--71. Volume 1 of {26}.Google ScholarGoogle Scholar
  26. Christopher D Manning, Prabhakar Raghavan, Hinrich Schütze, et al. 2008. Introduction to information retrieval. Vol. 1. Cambridge university press Cambridge. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Arun Mannodi-Kanakkithodi, Tran Doan Huan, and Rampi Ramprasad. 2017. Mining materials design rules from data: The example of polymer dielectrics. Chemistry of Materials 29, 21 (2017), 9001--9010.Google ScholarGoogle ScholarCross RefCross Ref
  28. Marius. 2017. English Words. https://github.com/dwyl/english-words.Google ScholarGoogle Scholar
  29. Michael Mitzenmacher. 2001. The power of two choices in randomized load balancing. IEEE Transactions on Parallel and Distributed Systems 12, 10 (2001), 1094--1104. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Inc.MongoDB. 2018. MongoDB. https://www.mongodb.com/Google ScholarGoogle Scholar
  31. Oracle. 2017. MySQL. https://www.mysql.comGoogle ScholarGoogle Scholar
  32. David Paez-Espino, I Chen, A Min, Krishna Palaniappan, Anna Ratner, Ken Chu, Ernest Szeto, Manoj Pillay, Jinghua Huang, Victor M Markowitz, et al. 2017. IMG/VR: a database of cultured and uncultured DNA Viruses and retroviruses. Nucleic acids research 45, D1 (2017), D457--D465.Google ScholarGoogle Scholar
  33. PostgreSQL. 2018. PostgreSQL. https://www.postgresql.org/Google ScholarGoogle Scholar
  34. Sriram Ramabhadran, Sylvia Ratnasamy, Joseph M Hellerstein, and Scott Shenker. 2004. Prefix hash tree: An indexing data structure over distributed hash tables. In Proceedings of the 23rd ACM symposium on principles of distributed computing, Vol. 37. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Kai Ren, Qing Zheng, Swapnil Patil, and Garth Gibson. 2014. IndexFS: Scaling file system metadata performance with stateless caching and bulk insertion. In High Performance Computing, Networking, Storage and Analysis, SC14: International Conference for. IEEE, 237--248. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Patrick Reynolds and Amin Vahdat. 2003. Efficient peer-to-peer keyword searching. In Proceedings of the ACM/IFIP/USENIX 2003 International Conference on Middleware. Springer-Verlag New York, Inc., 21--40. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Gerard Salton. 1989. Automatic text processing: The transformation, analysis, and retrieval of. Reading: Addison-Wesley (1989). Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. David J Schlegel, M Blanton, D Eisenstein, B Gillespie, J Gunn, P Harding, P McDonald, R Nichol, N Padmanabhan, W Percival, et al. 2007. SDSS-III: The Baryon Oscillation Spectroscopic Survey (BOSS). In Bulletin of the American Astronomical Society, Vol. 39. 966.Google ScholarGoogle Scholar
  39. Shuming Shi, Guangwen Yang, Dingxing Wang, Jin Yu, Shaogang Qu, and Ming Chen. 2004. Making Peer-to-Peer Keyword Searching Feasible Using Multi-level Partitioning.. In IPTPS. Springer, 151--161. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Hyogi Sim, Youngjae Kim, Sudharshan S Vazhkudai, Geoffroy R Vallée, Seung-Hwan Lim, and Ali R Butt. 2017. Taglt: An Integrated Indexing and Search Service for File Systems. (2017).Google ScholarGoogle Scholar
  41. Jerome Soumagne, Dries Kimpe, Judicael Zounmevo, Mohamad Chaarawi, Quincey Koziol, Ahmad Afsahi, and Robert Ross. 2013. Mercury: Enabling remote procedure call for high-performance computing. In Cluster Computing (CLUSTER), 2013 IEEE International Conference on. IEEE, 1--8.Google ScholarGoogle ScholarCross RefCross Ref
  42. sqlite.org. 2017. SQLite. https://sqlite.org.Google ScholarGoogle Scholar
  43. Houjun Tang, Suren Byna, Bin Dong, Jialin Liu, and Quincey Koziol. 2017. SoMeta: Scalable Object-Centric Metadata Management for High Performance Computing. In Cluster Computing (CLUSTER), 2017 IEEE International Conference on. IEEE, 359--369.Google ScholarGoogle ScholarCross RefCross Ref
  44. Houjun Tang, Suren Byna, François Tessier, Teng Wang, Bin Dong, Jingqing Mu, Quincey Koziol, Jerome Soumagne, Venkatram Vishwanath, Jialin Liu, et al. 2018. Toward Scalable and Asynchronous Object-centric Data Management for HPC. In Proceedings of The 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), 2018. 113--122.Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Guido Urdaneta, Guillaume Pierre, and Maarten van Steen. 2009. Wikipedia Workload Analysis for Decentralized Hosting. Elsevier Computer Networks 53, 11 (July 2009), 1830--1845. http://www.globule.org/publi/WWADH_comnet2009.html. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Brent Welch and John Ousterhout. 1985. Prefix tables: A simple mechanism for locating files in a distributed system. Technical Report. CALIFORNIA UNIV BERKELEY DEPT OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCES. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Wikipedia. 2018. Distributed Hash Table. https://en.wikipedia.org/wiki/Distributed_hash_table. Accessed: 2018-04-15.Google ScholarGoogle Scholar
  48. Dongfang Zhao, Kan Qiao, Zhou Zhou, Tonglin Li, Zhihan Lu, and Xiaohua Xu. 2017. Toward Efficient and Flexible Metadata Indexing of Big Data Systems. IEEE Trans. Big Data 3, 1 (2017), 107--117.Google ScholarGoogle ScholarCross RefCross Ref
  49. Dongfang Zhao, Zhao Zhang, Xiaobing Zhou, Tonglin Li, Ke Wang, Dries Kimpe, Philip Carns, Robert Ross, and Ioan Raicu. 2014. Fusionfs: Toward supporting data-intensive scientific applications on extreme-scale high-performance computing systems. In Big Data (Big Data), 2014 IEEE International Conference on. IEEE, 61--70.Google ScholarGoogle ScholarCross RefCross Ref
  50. Qing Zheng, Kai Ren, Garth Gibson, Bradley W. Settlemyer, and Gary Grider. 2015. DeltaFS: Exascale File Systems Scale Better Without Dedicated Servers. In Proceedings of the 10th Parallel Data Storage Workshop (PDSW '15). ACM, New York, NY, USA, 1--6. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. DART: distributed adaptive radix tree for efficient affix-based keyword search on HPC systems

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      PACT '18: Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques
      November 2018
      494 pages
      ISBN:9781450359863
      DOI:10.1145/3243176

      Copyright © 2018 ACM

      © 2018 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 1 November 2018

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      Overall Acceptance Rate121of471submissions,26%

      Upcoming Conference

      PACT '24
      International Conference on Parallel Architectures and Compilation Techniques
      October 14 - 16, 2024
      Southern California , CA , USA

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader