DOI: 10.1145/3183713.3196909
Research Article · Open Access

The Case for Learned Index Structures

Published: 27 May 2018

ABSTRACT

Indexes are models: a B-Tree-Index can be seen as a model that maps a key to the position of a record within a sorted array, a Hash-Index as a model that maps a key to the position of a record within an unsorted array, and a BitMap-Index as a model that indicates whether a data record exists or not. In this exploratory research paper, we start from this premise and posit that all existing index structures can be replaced with other types of models, including deep-learning models, which we term learned indexes. We theoretically analyze under which conditions learned indexes outperform traditional index structures and describe the main challenges in designing learned index structures. Our initial results show that our learned indexes can have significant advantages over traditional indexes. More importantly, we believe that the idea of replacing core components of a data management system with learned models has far-reaching implications for future system designs and that this work provides just a glimpse of what might be possible.
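To make the premise concrete, the sketch below shows one deliberately simplified form a learned range index could take: a single linear model approximates the key-to-position mapping of a sorted array, records its worst-case prediction error during construction, and completes each lookup with a binary search bounded by that error. This is an illustrative assumption, not the design developed in the full paper; the class and method names (SimpleLearnedIndex, lookup) are hypothetical.

```python
import bisect

class SimpleLearnedIndex:
    """Minimal sketch: a linear model over a non-empty sorted list of numeric keys."""

    def __init__(self, sorted_keys):
        self.keys = sorted_keys
        n = len(sorted_keys)
        # Fit position ~ slope * key + intercept from the first and last key
        # (a least-squares fit over all keys would be the more typical choice).
        lo, hi = sorted_keys[0], sorted_keys[-1]
        self.slope = (n - 1) / (hi - lo) if hi != lo else 0.0
        self.intercept = -self.slope * lo
        # Record the maximum prediction error so lookups can bound their search.
        self.max_err = max(abs(self._predict(k) - i)
                           for i, k in enumerate(sorted_keys))

    def _predict(self, key):
        # Predicted position of `key` in the sorted array.
        return int(self.slope * key + self.intercept)

    def lookup(self, key):
        """Return the position of `key` in the array, or None if absent."""
        guess = self._predict(key)
        lo = max(0, guess - self.max_err)
        hi = min(len(self.keys), guess + self.max_err + 1)
        # Binary search only within the model's error bound.
        pos = bisect.bisect_left(self.keys, key, lo, hi)
        if pos < len(self.keys) and self.keys[pos] == key:
            return pos
        return None

# Usage: index = SimpleLearnedIndex(sorted(data)); index.lookup(some_key)
```

The appeal of this framing is that the cost of a lookup depends on the model's error bound rather than on the logarithm of the data size, so a model that captures the key distribution well can replace many levels of a tree traversal with a single prediction plus a short local search.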

