ABSTRACT
Affix-based search is a fundamental functionality for storage systems. It allows users to find desired datasets, where attributes of a dataset match an affix. While building inverted index to facilitate efficient affix-based keyword search is a common practice for standalone databases and for desktop file systems, building local indexes or adopting indexing techniques used in a standalone data store is insufficient for high-performance computing (HPC) systems due to the massive amount of data and distributed nature of the storage devices within a system. In this paper, we propose Distributed Adaptive Radix Tree (DART), to address the challenge of distributed affix-based keyword search on HPC systems. This trie-based approach is scalable in achieving efficient affix-based search and alleviating imbalanced keyword distribution and excessive requests on keywords at scale. Our evaluation at different scales shows that, comparing with the "full string hashing" use case of the most popular distributed indexing technique - Distributed Hash Table (DHT), DART achieves up to 55× better throughput with prefix search and with suffix search, while achieving comparable throughput with exact and infix searches. Also, comparing to the "initial hashing" use case of DHT, DART maintains a balanced keyword distribution on distributed nodes and alleviates excessive query workload against popular keywords.
- MG Aartsen, K Abraham, M Ackermann, J Adams, JA Aguilar, M Ahlers, M Ahrens, D Altmann, K Andeen, T Anderson, et al. 2016. Search for sources of High-Energy neutrons with four years of data from the Icetop Detector. The Astrophysical Journal 830, 2(2016), 129.Google ScholarCross Ref
- apache.org. 2014. SolrCloud. https://wiki.apache.org/solr/SolrCloud.Google Scholar
- Baruch Awerbuch and Christian Scheideler. 2003. Peer-to-peer systems for prefix search. In Proceedings of the twenty-second annual symposium on Principles of distributed computing. ACM, 123--132. Google ScholarDigital Library
- Dirk Bradler, Jussi Kangasharju, and Max Mühlhäuser. 2008. Optimally Efficient Prefix Search and Multicast in Structured P2P Networks. CoRR abs/0808.1207 (2008). http://arxiv.org/abs/0808.1207Google Scholar
- Ralph Böhme. 2013. libuuid. https://sourceforge.net/projects/libuuid/Google Scholar
- Hailong Cai and Jun Wang. 2004. Foreseer: a novel, locality-aware peer-to-peer system architecture for keyword searches. In Proceedings of the 5th ACM/IFIP/USENIX international conference on Middleware. Springer-Verlag New York, Inc., 38--58. Google ScholarDigital Library
- Chi Chen, Zhi Deng, Richard Tran, Hanmei Tang, Iek-Heng Chu, and Shyue Ping Ong. 2017. Accurate force field for molybdenum by machine learning large materials data. Physical Review Materials 1, 4 (2017), 043603.Google ScholarCross Ref
- Douglas Comer. 1979. Ubiquitous B-tree. ACM Computing Surveys (CSUR) 11, 2 (1979), 121--137. Google ScholarDigital Library
- J. Corbet. 2006. Trees I: Radix trees. http://lwn.net/Articles/175432/Google Scholar
- T Cormen, C Leiserson, R Rivest, and Clifford Stein. 2001. The rabin-karp algorithm. Introduction to Algorithms (2001), 911--916.Google Scholar
- elastic.co. 2017. Distributed Search Execution. https://www.elastic.co/guide/en/elasticsearch/guide/current/distributed-search.html.Google Scholar
- Brian Everitt and Anders Skrondal. 2002. The Cambridge dictionary of statistics. Vol. 106. Cambridge University Press Cambridge.Google Scholar
- Apache Software Fundation. 2017. Apache Lucene. https://lucene.apache.org.Google Scholar
- Gaston H Gonnet, Ricardo A Baeza-Yates, and Tim Snider. 1992. New Indices for Text: Pat Trees and Pat Arrays. Information Retrieval: Data Structures & Algorithms 66 (1992), 82. Google ScholarDigital Library
- Matthew Harren, Joseph Hellerstein, Ryan Huebsch, Boon Loo, Scott Shenker, and Ion Stoica. 2002. Complex queries in DHT-based peer-to-peer networks. Peer-to-peer systems (2002), 242--250. Google ScholarDigital Library
- John E Hopcroft, Rajeev Motwani, and Jeffrey D Ullman. 2006. Automata theory, languages, and computation. International Edition 24 (2006). Google ScholarDigital Library
- Yuh-Jzer Joung and Li-Wei Yang. 2006. KISS: A simple prefix search scheme in P2P networks. In Proc. of the WebDB Workshop. 56--61.Google Scholar
- Yuh-Jzer Joung, Li-Wei Yang, and Chien-Tse Fang. 2007. Keyword search in dhtbased peer-to-peer networks. IEEE Journal on Selected Areas in Communications 25, 1 (2007). Google ScholarDigital Library
- Donald Knuth. 1997. 6.3: Digital Searching. The Art of Computer Programming Volume 3: Sorting and Searching (1997), 492.Google Scholar
- Donald E. Knuth, Jr. James H.Morris, and Vaughan R. Pratt. 1977. Fast Pattern Matching in Strings. SIAM J. Comput. 6, 2 (1977), 323--350.Google ScholarDigital Library
- Viktor Leis, Alfons Kemper, and Thomas Neumann. 2013. The Adaptive Radix Tree: ARTful Indexing for Main-memory Databases. In Proceedings of the 2013 IEEE International Conference on Data Engineering (ICDE 2013) (ICDE '13). IEEE Computer Society, Washington, DC, USA, 38--49. Google ScholarDigital Library
- J. Liu, D. Bard, Q. Koziol, S. Bailey, and Prabhat. 2017. Searching for millions of objects in the BOSS spectroscopic survey data with H5Boss. In 2017 New York Scientific Data Summit (NYSDS). 1--9.Google Scholar
- Yaning Liu, George Shu Heng Pau, and Stefan Finsterle. 2017. Implicit sampling combined with reduced order modeling for the inversion of vadose zone hydrological data. Computers & Geosciences (2017). Google ScholarDigital Library
- Christopher D Manning, Prabhakar Raghavan, Hinrich Schütze, et al. 2008. Introduction to information retrieval, Chapter 20.3 Distributing indexes, 415--416. Volume 1 of {26}.Google Scholar
- Christopher D Manning, Prabhakar Raghavan, Hinrich Schütze, et al. 2008. Introduction to information retrieval, Chapter 4.4 Distributed indexing, 68--71. Volume 1 of {26}.Google Scholar
- Christopher D Manning, Prabhakar Raghavan, Hinrich Schütze, et al. 2008. Introduction to information retrieval. Vol. 1. Cambridge university press Cambridge. Google ScholarDigital Library
- Arun Mannodi-Kanakkithodi, Tran Doan Huan, and Rampi Ramprasad. 2017. Mining materials design rules from data: The example of polymer dielectrics. Chemistry of Materials 29, 21 (2017), 9001--9010.Google ScholarCross Ref
- Marius. 2017. English Words. https://github.com/dwyl/english-words.Google Scholar
- Michael Mitzenmacher. 2001. The power of two choices in randomized load balancing. IEEE Transactions on Parallel and Distributed Systems 12, 10 (2001), 1094--1104. Google ScholarDigital Library
- Inc.MongoDB. 2018. MongoDB. https://www.mongodb.com/Google Scholar
- Oracle. 2017. MySQL. https://www.mysql.comGoogle Scholar
- David Paez-Espino, I Chen, A Min, Krishna Palaniappan, Anna Ratner, Ken Chu, Ernest Szeto, Manoj Pillay, Jinghua Huang, Victor M Markowitz, et al. 2017. IMG/VR: a database of cultured and uncultured DNA Viruses and retroviruses. Nucleic acids research 45, D1 (2017), D457--D465.Google Scholar
- PostgreSQL. 2018. PostgreSQL. https://www.postgresql.org/Google Scholar
- Sriram Ramabhadran, Sylvia Ratnasamy, Joseph M Hellerstein, and Scott Shenker. 2004. Prefix hash tree: An indexing data structure over distributed hash tables. In Proceedings of the 23rd ACM symposium on principles of distributed computing, Vol. 37. Google ScholarDigital Library
- Kai Ren, Qing Zheng, Swapnil Patil, and Garth Gibson. 2014. IndexFS: Scaling file system metadata performance with stateless caching and bulk insertion. In High Performance Computing, Networking, Storage and Analysis, SC14: International Conference for. IEEE, 237--248. Google ScholarDigital Library
- Patrick Reynolds and Amin Vahdat. 2003. Efficient peer-to-peer keyword searching. In Proceedings of the ACM/IFIP/USENIX 2003 International Conference on Middleware. Springer-Verlag New York, Inc., 21--40. Google ScholarDigital Library
- Gerard Salton. 1989. Automatic text processing: The transformation, analysis, and retrieval of. Reading: Addison-Wesley (1989). Google ScholarDigital Library
- David J Schlegel, M Blanton, D Eisenstein, B Gillespie, J Gunn, P Harding, P McDonald, R Nichol, N Padmanabhan, W Percival, et al. 2007. SDSS-III: The Baryon Oscillation Spectroscopic Survey (BOSS). In Bulletin of the American Astronomical Society, Vol. 39. 966.Google Scholar
- Shuming Shi, Guangwen Yang, Dingxing Wang, Jin Yu, Shaogang Qu, and Ming Chen. 2004. Making Peer-to-Peer Keyword Searching Feasible Using Multi-level Partitioning.. In IPTPS. Springer, 151--161. Google ScholarDigital Library
- Hyogi Sim, Youngjae Kim, Sudharshan S Vazhkudai, Geoffroy R Vallée, Seung-Hwan Lim, and Ali R Butt. 2017. Taglt: An Integrated Indexing and Search Service for File Systems. (2017).Google Scholar
- Jerome Soumagne, Dries Kimpe, Judicael Zounmevo, Mohamad Chaarawi, Quincey Koziol, Ahmad Afsahi, and Robert Ross. 2013. Mercury: Enabling remote procedure call for high-performance computing. In Cluster Computing (CLUSTER), 2013 IEEE International Conference on. IEEE, 1--8.Google ScholarCross Ref
- sqlite.org. 2017. SQLite. https://sqlite.org.Google Scholar
- Houjun Tang, Suren Byna, Bin Dong, Jialin Liu, and Quincey Koziol. 2017. SoMeta: Scalable Object-Centric Metadata Management for High Performance Computing. In Cluster Computing (CLUSTER), 2017 IEEE International Conference on. IEEE, 359--369.Google ScholarCross Ref
- Houjun Tang, Suren Byna, François Tessier, Teng Wang, Bin Dong, Jingqing Mu, Quincey Koziol, Jerome Soumagne, Venkatram Vishwanath, Jialin Liu, et al. 2018. Toward Scalable and Asynchronous Object-centric Data Management for HPC. In Proceedings of The 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), 2018. 113--122.Google ScholarDigital Library
- Guido Urdaneta, Guillaume Pierre, and Maarten van Steen. 2009. Wikipedia Workload Analysis for Decentralized Hosting. Elsevier Computer Networks 53, 11 (July 2009), 1830--1845. http://www.globule.org/publi/WWADH_comnet2009.html. Google ScholarDigital Library
- Brent Welch and John Ousterhout. 1985. Prefix tables: A simple mechanism for locating files in a distributed system. Technical Report. CALIFORNIA UNIV BERKELEY DEPT OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCES. Google ScholarDigital Library
- Wikipedia. 2018. Distributed Hash Table. https://en.wikipedia.org/wiki/Distributed_hash_table. Accessed: 2018-04-15.Google Scholar
- Dongfang Zhao, Kan Qiao, Zhou Zhou, Tonglin Li, Zhihan Lu, and Xiaohua Xu. 2017. Toward Efficient and Flexible Metadata Indexing of Big Data Systems. IEEE Trans. Big Data 3, 1 (2017), 107--117.Google ScholarCross Ref
- Dongfang Zhao, Zhao Zhang, Xiaobing Zhou, Tonglin Li, Ke Wang, Dries Kimpe, Philip Carns, Robert Ross, and Ioan Raicu. 2014. Fusionfs: Toward supporting data-intensive scientific applications on extreme-scale high-performance computing systems. In Big Data (Big Data), 2014 IEEE International Conference on. IEEE, 61--70.Google ScholarCross Ref
- Qing Zheng, Kai Ren, Garth Gibson, Bradley W. Settlemyer, and Gary Grider. 2015. DeltaFS: Exascale File Systems Scale Better Without Dedicated Servers. In Proceedings of the 10th Parallel Data Storage Workshop (PDSW '15). ACM, New York, NY, USA, 1--6. Google ScholarDigital Library
Index Terms
- DART: distributed adaptive radix tree for efficient affix-based keyword search on HPC systems
Recommendations
Scalable Multimodal Search with Distributed Indexing by Sparse Hashing
ICMR '15: Proceedings of the 5th ACM on International Conference on Multimedia RetrievalMultimedia search systems must deal with an increasingly large and heterogeneous amount of data. Several challenges exist when deploying real-world search engines for such data. Existing literature does not properly tackle the many efficiency issues ...
Ontology-based semantic search for large-scale RDF data
WAIM'13: Proceedings of the 14th international conference on Web-Age Information ManagementIn recent years, the Web of Data has emerged with the release of growing amount of Linked Data. Since traditional Information Retrieval (IR) technologies are no longer suit for the retrieval on Linked Data, it becomes difficult for ordinary users to ...
Distributed image search in camera sensor networks
SenSys '08: Proceedings of the 6th ACM conference on Embedded network sensor systemsRecent advances in sensor networks permit the use of a large number of relatively inexpensive distributed computational nodes with camera sensors linked in a network and possibly linked to one or more central servers. We argue that the full potential of ...
Comments