ABSTRACT
Indexes are models: a \btree-Index can be seen as a model to map a key to the position of a record within a sorted array, a Hash-Index as a model to map a key to a position of a record within an unsorted array, and a BitMap-Index as a model to indicate if a data record exists or not. In this exploratory research paper, we start from this premise and posit that all existing index structures can be replaced with other types of models, including deep-learning models, which we term \em learned indexes. We theoretically analyze under which conditions learned indexes outperform traditional index structures and describe the main challenges in designing learned index structures. Our initial results show that our learned indexes can have significant advantages over traditional indexes. More importantly, we believe that the idea of replacing core components of a data management system through learned models has far reaching implications for future systems designs and that this work provides just a glimpse of what might be possible.
- Database architects blog: The case for b-tree index structures. http://databasearchitects.blogspot.de/2017/12/ the-case-for-b-tree-index-structures.html.Google Scholar
- Google's sparsehash documentation. https://github.com/sparsehash/ sparsehash/blob/master/src/sparsehash/sparse_hash_map.Google Scholar
- An in-depth look at google's first tensor processing unit (tpu). https://cloud.google.com/blog/big-data/2017/05/ an-in-depth-look-at-googles-first-tensor-processing-unit-tpu.Google Scholar
- Intel Xeon Phi. https://www.intel.com/content/www/us/en/products/ processors/xeon-phi/xeon-phi-processors.html.Google Scholar
- Moore Law is Dead but GPU will get 1000X faster by 2025. https://www.nextbigfuture.com/2017/06/ moore-law-is-dead-but-gpu-will-get-1000x-faster-by-2025.html.Google Scholar
- NVIDIA NVLink High-Speed Interconnect. http://www.nvidia.com/object/ nvlink.html.Google Scholar
- Stanford DAWN cuckoo hashing. https://github.com/stanford-futuredata/ index-baselines.Google Scholar
- Trying to speed up binary search. http://databasearchitects.blogspot.com/ 2015/09/trying-to-speed-up-binary-search.html.Google Scholar
- M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: A system for large-scale machine learning. In OSDI, volume 16, pages 265--283, 2016. Google ScholarDigital Library
- S. Abu-Nimeh, D. Nappa, X. Wang, and S. Nair. A comparison of machine learning techniques for phishing detection. In eCrime, pages 60--69, 2007. Google ScholarDigital Library
- K. Alexiou, D. Kossmann, and P.-A. Larson. Adaptive range filters for cold data: Avoiding trips to siberia. Proc. VLDB Endow., 6(14):1714--1725, Sept. 2013. Google ScholarDigital Library
- M. Armbrust, A. Fox, D. A. Patterson, N. Lanham, B. Trushkowsky, J. Trutna, and H. Oh. SCADS: scale-independent storage for social computing applications. In CIDR, 2009.Google Scholar
- M. Athanassoulis and A. Ailamaki. BF-tree: Approximate Tree Indexing. In VLDB, pages 1881--1892, 2014. Google ScholarDigital Library
- Performance comparison: linear search vs binary search. https://dirtyhandscoding.wordpress.com/2017/08/25/ performance-comparison-linear-search-vs-binary-search/.Google Scholar
- R. B. Basnet, S. Mukkamala, and A. H. Sung. Detection of phishing attacks: A machine learning approach. Soft Computing Applications in Industry, 226:373--383, 2008.Google ScholarCross Ref
- R. Bayer. Symmetric binary b-trees: Data structure and maintenance algorithms. Acta Inf., 1(4):290--306, Dec. 1972. Google ScholarDigital Library
- R. Bayer and E. McCreight. Organization and maintenance of large ordered indices. In SIGFIDET (Now SIGMOD), pages 107--141, 1970. Google ScholarDigital Library
- S. Bickel, M. Brückner, and T. Scheffer. Discriminative learning under covariate shift. Journal of Machine Learning Research, 10(Sep):2137--2155, 2009. Google ScholarDigital Library
- M. Böhm, B. Schlegel, P. B. Volk, U. Fischer, D. Habich, and W. Lehner. Efficient in-memory indexing with generalized prefix trees. In BTW, pages 227--246, 2011.Google Scholar
- J. Boyar and K. S. Larsen. Efficient rebalancing of chromatic search trees. Journal of Computer and System Sciences, 49(3):667 -- 682, 1994. 30th IEEE Conference on Foundations of Computer Science. Google ScholarDigital Library
- M. Brantner, D. Florescu, D. A. Graf, D. Kossmann, and T. Kraska. Building a database on S3. In SIGMOD, pages 251--264, 2008. Google ScholarDigital Library
- A. Broder and M. Mitzenmacher. Network applications of bloom filters: A survey. Internet mathematics, 1(4):485--509, 2004.Google ScholarCross Ref
- F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber. Bigtable: A distributed storage system for structured data (awarded best paper!). In OSDI, pages 205--218, 2006. Google ScholarDigital Library
- K. Cho, B. van Merrienboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, pages 1724--1734, 2014.Google ScholarCross Ref
- A. Crotty, A. Galakatos, K. Dursun, T. Kraska, C. Binnig, U. Çetintemel, and S. Zdonik. An architecture for compiling udf-centric workflows. PVLDB, 8(12):1466--1477, 2015. Google ScholarDigital Library
- M. Dietzfelbinger, A. Karlin, K. Mehlhorn, F. Meyer auF der Heide, H. Rohnert, and R. E. Tarjan. Dynamic perfect hashing: Upper and lower bounds. SIAM Journal on Computing, 23(4):738--761, 1994. Google ScholarDigital Library
- J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121--2159, 2011. Google ScholarDigital Library
- A. Dvoretzky, J. Kiefer, and J. Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, pages 642--669, 1956.Google ScholarCross Ref
- B. Fan, D. G. Andersen, M. Kaminsky, and M. D. Mitzenmacher. Cuckoo filter: Practically better than bloom. In CoNEXT, pages 75--88, 2014. Google ScholarDigital Library
- T. Fawcett. An introduction to roc analysis. Pattern recognition letters, 27(8):861--874, 2006. Google ScholarDigital Library
- E. Fredkin. Trie memory. Commun. ACM, 3(9):490--499, Sept. 1960. Google ScholarDigital Library
- A. Galakatos, M. Markovitch, C. Binnig, R. Fonseca, and T. Kraska. A-tree: A bounded approximate index structure. CoRR, abs/1801.10207, 2018.Google Scholar
- J. Goldstein, R. Ramakrishnan, and U. Shaft. Compressing Relations and Indexes. In ICDE, pages 370--379, 1998. Google ScholarDigital Library
- I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672--2680, 2014. Google ScholarDigital Library
- G. Graefe. B-tree indexes, interpolation search, and skew. In DaMoN, 2006. Google ScholarDigital Library
- G. Graefe and P. A. Larson. B-tree indexes and CPU caches. In ICDE, pages 349--358, 2001. Google ScholarDigital Library
- A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.Google Scholar
- R. Grossi, A. Gupta, and J. S. Vitter. High-order entropy-compressed text indexes. In SODA, pages 841--850. Society for Industrial and Applied Mathematics, 2003. Google ScholarDigital Library
- R. Grossi and G. Ottaviano. The wavelet trie: Maintaining an indexed sequence of strings in compressed space. In PODS, pages 203--214, 2012. Google ScholarDigital Library
- J. Guo and J. Li. CNN based hashing for image retrieval. CoRR, abs/1509.01354, 2015.Google Scholar
- M. Gupta, A. Cotter, J. Pfeifer, K. Voevodski, K. Canini, A. Mangylov, W. Moczydlowski, and A. Van Esbroeck. Monotonic calibrated interpolated look-up tables. The Journal of Machine Learning Research, 17(1):3790--3836, 2016. Google ScholarDigital Library
- J. C. Huang and B. J. Frey. Cumulative distribution networks and the derivative-sum-product algorithm: Models and inference for cumulative distribution functions on graphs. J. Mach. Learn. Res., 12:301--348, Feb. 2011. Google ScholarDigital Library
- K. Kaczmarski. B + -Tree Optimized for GPGPU. 2012.Google Scholar
- C. Kim, J. Chhugani, N. Satish, E. Sedlar, A. D. Nguyen, T. Kaldewey, V. W. Lee, S. A. Brandt, and P. Dubey. Fast: Fast architecture sensitive tree search on modern cpus and gpus. In SIGMOD, pages 339--350, 2010. Google ScholarDigital Library
- T. Kissinger, B. Schlegel, D. Habich, and W. Lehner. Kiss-tree: Smart latchfree in-memory indexing on modern architectures. In DaMoN, pages 16--23, 2012. Google ScholarDigital Library
- T. J. Lehman and M. J. Carey. A study of index structures for main memory database management systems. In VLDB, pages 294--303, 1986. Google ScholarDigital Library
- V. Leis. FAST source. http://www-db.in.tum.de/leis/index/fast.cpp.Google Scholar
- V. Leis, A. Kemper, and T. Neumann. The adaptive radix tree: Artful indexing for main-memory databases. In ICDE, pages 38--49, 2013. Google ScholarDigital Library
- W. Litwin. Readings in database systems. chapter Linear Hashing: A New Tool for File and Table Addressing., pages 570--581. Morgan Kaufmann Publishers Inc., 1988. Google ScholarDigital Library
- M. Magdon-Ismail and A. F. Atiya. Neural networks for density estimation. In M. J. Kearns, S. A. Solla, and D. A. Cohn, editors, NIPS, pages 522--528. MIT Press, 1999. Google ScholarDigital Library
- D. J. Miller and H. S. Uyar. A mixture of experts classifier with learning based on both labelled and unlabelled data. In NIPS, pages 571--577, 1996. Google ScholarDigital Library
- M. Mitzenmacher. Compressed bloom filters. In PODC, pages 144--150, 2001. Google ScholarDigital Library
- M. Mitzenmacher. A model for learned bloom filters and related structures. arXiv preprint arXiv:1802.00884, 2018.Google Scholar
- G. Moerkotte. Small Materialized Aggregates: A Light Weight Index Structure for Data Warehousing. In VLDB, pages 476--487, 1998. Google ScholarDigital Library
- T. Neumann and G. Weikum. RDF-3X: A RISC-style Engine for RDF. Proc. VLDB Endow., pages 647--659, 2008. Google ScholarDigital Library
- OpenStreetMap database ©OpenStreetMap contributors. https://aws. amazon.com/public-datasets/osm.Google Scholar
- R. Pagh and F. F. Rodler. Cuckoo hashing. Journal of Algorithms, 51(2):122-- 144, 2004. Google ScholarDigital Library
- J. Rao and K. A. Ross. Making b+- trees cache conscious in main memory. In SIGMOD, pages 475--486, 2000. Google ScholarDigital Library
- S. Richter, V. Alvarez, and J. Dittrich. A seven-dimensional analysis of hashing methods and its implications on query processing. Proc. VLDB Endow., 9(3):96--107, Nov. 2015. Google ScholarDigital Library
- D. G. Severance and G. M. Lohman. Differential files: Their application to the maintenance of large data bases. In SIGMOD, pages 43--43, 1976. Google ScholarCross Ref
- A. Shahvarani and H.-A. Jacobsen. A hybrid b+-tree as solution for inmemory indexing on cpu-gpu heterogeneous computing platforms. In SIGMOD, pages 1523--1538, 2016. Google ScholarDigital Library
- N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. Outrageously large neural networks: The sparsely-gated mixtureof-experts layer. arXiv preprint arXiv:1701.06538, 2017.Google Scholar
- M. Stonebraker and L. A. Rowe. The Design of POSTGRES. In SIGMOD, pages 340--355, 1986. Google ScholarDigital Library
- I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, pages 3104--3112, 2014. Google ScholarDigital Library
- F. Tramèr, A. Kurakin, N. Papernot, D. Boneh, and P. McDaniel. Ensemble adversarial training: Attacks and defenses. arXiv preprint arXiv:1705.07204, 2017.Google Scholar
- M. Turcanik and M. Javurek. Hash function generation by neural network. In NTSP, pages 1--5, Oct 2016.Google ScholarCross Ref
- J. Wang, W. Liu, S. Kumar, and S. F. Chang. Learning to hash for indexing big data;a survey. Proceedings of the IEEE, 104(1):34--57, Jan 2016.Google ScholarCross Ref
- J. Wang, H. T. Shen, J. Song, and J. Ji. Hashing for similarity search: A survey. CoRR, abs/1408.2927, 2014.Google Scholar
- J. Wang, J. Wang, N. Yu, and S. Li. Order preserving hashing for approximate nearest neighbor search. In MM, pages 133--142, 2013. Google ScholarDigital Library
- Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.Google Scholar
- S. You, D. Ding, K. Canini, J. Pfeifer, and M. Gupta. Deep lattice networks and partial monotonic functions. In NIPS, pages 2985--2993, 2017.Google Scholar
- Y. You, Z. Zhang, C. Hsieh, J. Demmel, and K. Keutzer. Imagenet training in minutes. CoRR, abs/1709.05011, 2017.Google Scholar
- J. Yu and M. Sarwat. Two Birds, One Stone: A Fast, Yet Lightweight, Indexing Scheme for Modern Database Systems. In VLDB, pages 385--396, 2016. Google ScholarDigital Library
- E. Zamanian, C. Binnig, T. Kraska, and T. Harris. The end of a myth: Distributed transaction can scale. PVLDB, 10(6):685--696, 2017. Google ScholarDigital Library
- H. Zhang, D. G. Andersen, A. Pavlo, M. Kaminsky, L. Ma, and R. Shen. Reducing the storage overhead of main-memory OLTP databases with hybrid indexes. In SIGMOD, pages 1567--1581, 2016. Google ScholarDigital Library
Index Terms
- The Case for Learned Index Structures
Recommendations
CDFShop: Exploring and Optimizing Learned Index Structures
SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of DataIndexes are a critical component of data management applications. While tree-like structures (e.g., B-Trees) have been employed to great success, recent work suggests that index structures powered by machine learning models (learned index structures) ...
Data-Driven Learned Metric Index: An Unsupervised Approach
Similarity Search and ApplicationsAbstractMetric indexes are traditionally used for organizing unstructured or complex data to speed up similarity queries. The most widely-used indexes cluster data or divide space using hyper-planes. While searching, the mutual distances between objects ...
ALMSS: Automatic Learned Index Model Selection System
Web and Big DataAbstractIndex is an indispensable part of database. As we enter the era of big data, the traditional index structure is found not to support large-scale data well. Although many index structures such as learned indexes based on machine learning have been ...
Comments