research-article

Open Access

The Case for Learned Index Structures

Authors:
Tim Kraska

Massachusetts Institute of Technology, Cambridge, MA, USA

Massachusetts Institute of Technology, Cambridge, MA, USA
View Profile

,
Alex Beutel

Google, Inc., Mountain View, CA, USA

Google, Inc., Mountain View, CA, USA
View Profile

,
Ed H. Chi

Google, Inc., Mountain View, CA, USA

Google, Inc., Mountain View, CA, USA
View Profile

,
Jeffrey Dean

Google, Inc., Mountain View, CA, USA

Google, Inc., Mountain View, CA, USA
View Profile

,
Neoklis Polyzotis

Google, Inc., Mountain View, CA, USA

Google, Inc., Mountain View, CA, USA
View Profile

SIGMOD '18: Proceedings of the 2018 International Conference on Management of DataMay 2018Pages 489–504https://doi.org/10.1145/3183713.3196909

Published:27 May 2018Publication History

SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data

Pages 489–504

ABSTRACT

Indexes are models: a \btree-Index can be seen as a model to map a key to the position of a record within a sorted array, a Hash-Index as a model to map a key to a position of a record within an unsorted array, and a BitMap-Index as a model to indicate if a data record exists or not. In this exploratory research paper, we start from this premise and posit that all existing index structures can be replaced with other types of models, including deep-learning models, which we term \em learned indexes. We theoretically analyze under which conditions learned indexes outperform traditional index structures and describe the main challenges in designing learned index structures. Our initial results show that our learned indexes can have significant advantages over traditional indexes. More importantly, we believe that the idea of replacing core components of a data management system through learned models has far reaching implications for future systems designs and that this work provides just a glimpse of what might be possible.

References

Database architects blog: The case for b-tree index structures. http://databasearchitects.blogspot.de/2017/12/ the-case-for-b-tree-index-structures.html.Google Scholar
Google's sparsehash documentation. https://github.com/sparsehash/ sparsehash/blob/master/src/sparsehash/sparse_hash_map.Google Scholar
An in-depth look at google's first tensor processing unit (tpu). https://cloud.google.com/blog/big-data/2017/05/ an-in-depth-look-at-googles-first-tensor-processing-unit-tpu.Google Scholar
Intel Xeon Phi. https://www.intel.com/content/www/us/en/products/ processors/xeon-phi/xeon-phi-processors.html.Google Scholar
Moore Law is Dead but GPU will get 1000X faster by 2025. https://www.nextbigfuture.com/2017/06/ moore-law-is-dead-but-gpu-will-get-1000x-faster-by-2025.html.Google Scholar
NVIDIA NVLink High-Speed Interconnect. http://www.nvidia.com/object/ nvlink.html.Google Scholar
Stanford DAWN cuckoo hashing. https://github.com/stanford-futuredata/ index-baselines.Google Scholar
Trying to speed up binary search. http://databasearchitects.blogspot.com/ 2015/09/trying-to-speed-up-binary-search.html.Google Scholar
M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: A system for large-scale machine learning. In OSDI, volume 16, pages 265--283, 2016. Google ScholarDigital Library
S. Abu-Nimeh, D. Nappa, X. Wang, and S. Nair. A comparison of machine learning techniques for phishing detection. In eCrime, pages 60--69, 2007. Google ScholarDigital Library
K. Alexiou, D. Kossmann, and P.-A. Larson. Adaptive range filters for cold data: Avoiding trips to siberia. Proc. VLDB Endow., 6(14):1714--1725, Sept. 2013. Google ScholarDigital Library
M. Armbrust, A. Fox, D. A. Patterson, N. Lanham, B. Trushkowsky, J. Trutna, and H. Oh. SCADS: scale-independent storage for social computing applications. In CIDR, 2009.Google Scholar
M. Athanassoulis and A. Ailamaki. BF-tree: Approximate Tree Indexing. In VLDB, pages 1881--1892, 2014. Google ScholarDigital Library
Performance comparison: linear search vs binary search. https://dirtyhandscoding.wordpress.com/2017/08/25/ performance-comparison-linear-search-vs-binary-search/.Google Scholar
R. B. Basnet, S. Mukkamala, and A. H. Sung. Detection of phishing attacks: A machine learning approach. Soft Computing Applications in Industry, 226:373--383, 2008.Google ScholarCross Ref
R. Bayer. Symmetric binary b-trees: Data structure and maintenance algorithms. Acta Inf., 1(4):290--306, Dec. 1972. Google ScholarDigital Library
R. Bayer and E. McCreight. Organization and maintenance of large ordered indices. In SIGFIDET (Now SIGMOD), pages 107--141, 1970. Google ScholarDigital Library
S. Bickel, M. Brückner, and T. Scheffer. Discriminative learning under covariate shift. Journal of Machine Learning Research, 10(Sep):2137--2155, 2009. Google ScholarDigital Library
M. Böhm, B. Schlegel, P. B. Volk, U. Fischer, D. Habich, and W. Lehner. Efficient in-memory indexing with generalized prefix trees. In BTW, pages 227--246, 2011.Google Scholar
J. Boyar and K. S. Larsen. Efficient rebalancing of chromatic search trees. Journal of Computer and System Sciences, 49(3):667 -- 682, 1994. 30th IEEE Conference on Foundations of Computer Science. Google ScholarDigital Library
M. Brantner, D. Florescu, D. A. Graf, D. Kossmann, and T. Kraska. Building a database on S3. In SIGMOD, pages 251--264, 2008. Google ScholarDigital Library
A. Broder and M. Mitzenmacher. Network applications of bloom filters: A survey. Internet mathematics, 1(4):485--509, 2004.Google ScholarCross Ref
F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber. Bigtable: A distributed storage system for structured data (awarded best paper!). In OSDI, pages 205--218, 2006. Google ScholarDigital Library
K. Cho, B. van Merrienboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, pages 1724--1734, 2014.Google ScholarCross Ref
A. Crotty, A. Galakatos, K. Dursun, T. Kraska, C. Binnig, U. Çetintemel, and S. Zdonik. An architecture for compiling udf-centric workflows. PVLDB, 8(12):1466--1477, 2015. Google ScholarDigital Library
M. Dietzfelbinger, A. Karlin, K. Mehlhorn, F. Meyer auF der Heide, H. Rohnert, and R. E. Tarjan. Dynamic perfect hashing: Upper and lower bounds. SIAM Journal on Computing, 23(4):738--761, 1994. Google ScholarDigital Library
J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121--2159, 2011. Google ScholarDigital Library
A. Dvoretzky, J. Kiefer, and J. Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, pages 642--669, 1956.Google ScholarCross Ref
B. Fan, D. G. Andersen, M. Kaminsky, and M. D. Mitzenmacher. Cuckoo filter: Practically better than bloom. In CoNEXT, pages 75--88, 2014. Google ScholarDigital Library
T. Fawcett. An introduction to roc analysis. Pattern recognition letters, 27(8):861--874, 2006. Google ScholarDigital Library
E. Fredkin. Trie memory. Commun. ACM, 3(9):490--499, Sept. 1960. Google ScholarDigital Library
A. Galakatos, M. Markovitch, C. Binnig, R. Fonseca, and T. Kraska. A-tree: A bounded approximate index structure. CoRR, abs/1801.10207, 2018.Google Scholar
J. Goldstein, R. Ramakrishnan, and U. Shaft. Compressing Relations and Indexes. In ICDE, pages 370--379, 1998. Google ScholarDigital Library
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672--2680, 2014. Google ScholarDigital Library
G. Graefe. B-tree indexes, interpolation search, and skew. In DaMoN, 2006. Google ScholarDigital Library
G. Graefe and P. A. Larson. B-tree indexes and CPU caches. In ICDE, pages 349--358, 2001. Google ScholarDigital Library
A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.Google Scholar
R. Grossi, A. Gupta, and J. S. Vitter. High-order entropy-compressed text indexes. In SODA, pages 841--850. Society for Industrial and Applied Mathematics, 2003. Google ScholarDigital Library
R. Grossi and G. Ottaviano. The wavelet trie: Maintaining an indexed sequence of strings in compressed space. In PODS, pages 203--214, 2012. Google ScholarDigital Library
J. Guo and J. Li. CNN based hashing for image retrieval. CoRR, abs/1509.01354, 2015.Google Scholar
M. Gupta, A. Cotter, J. Pfeifer, K. Voevodski, K. Canini, A. Mangylov, W. Moczydlowski, and A. Van Esbroeck. Monotonic calibrated interpolated look-up tables. The Journal of Machine Learning Research, 17(1):3790--3836, 2016. Google ScholarDigital Library
J. C. Huang and B. J. Frey. Cumulative distribution networks and the derivative-sum-product algorithm: Models and inference for cumulative distribution functions on graphs. J. Mach. Learn. Res., 12:301--348, Feb. 2011. Google ScholarDigital Library
K. Kaczmarski. B + -Tree Optimized for GPGPU. 2012.Google Scholar
C. Kim, J. Chhugani, N. Satish, E. Sedlar, A. D. Nguyen, T. Kaldewey, V. W. Lee, S. A. Brandt, and P. Dubey. Fast: Fast architecture sensitive tree search on modern cpus and gpus. In SIGMOD, pages 339--350, 2010. Google ScholarDigital Library
T. Kissinger, B. Schlegel, D. Habich, and W. Lehner. Kiss-tree: Smart latchfree in-memory indexing on modern architectures. In DaMoN, pages 16--23, 2012. Google ScholarDigital Library
T. J. Lehman and M. J. Carey. A study of index structures for main memory database management systems. In VLDB, pages 294--303, 1986. Google ScholarDigital Library
V. Leis. FAST source. http://www-db.in.tum.de/leis/index/fast.cpp.Google Scholar
V. Leis, A. Kemper, and T. Neumann. The adaptive radix tree: Artful indexing for main-memory databases. In ICDE, pages 38--49, 2013. Google ScholarDigital Library
W. Litwin. Readings in database systems. chapter Linear Hashing: A New Tool for File and Table Addressing., pages 570--581. Morgan Kaufmann Publishers Inc., 1988. Google ScholarDigital Library
M. Magdon-Ismail and A. F. Atiya. Neural networks for density estimation. In M. J. Kearns, S. A. Solla, and D. A. Cohn, editors, NIPS, pages 522--528. MIT Press, 1999. Google ScholarDigital Library
D. J. Miller and H. S. Uyar. A mixture of experts classifier with learning based on both labelled and unlabelled data. In NIPS, pages 571--577, 1996. Google ScholarDigital Library
M. Mitzenmacher. Compressed bloom filters. In PODC, pages 144--150, 2001. Google ScholarDigital Library
M. Mitzenmacher. A model for learned bloom filters and related structures. arXiv preprint arXiv:1802.00884, 2018.Google Scholar
G. Moerkotte. Small Materialized Aggregates: A Light Weight Index Structure for Data Warehousing. In VLDB, pages 476--487, 1998. Google ScholarDigital Library
T. Neumann and G. Weikum. RDF-3X: A RISC-style Engine for RDF. Proc. VLDB Endow., pages 647--659, 2008. Google ScholarDigital Library
OpenStreetMap database ©OpenStreetMap contributors. https://aws. amazon.com/public-datasets/osm.Google Scholar
R. Pagh and F. F. Rodler. Cuckoo hashing. Journal of Algorithms, 51(2):122-- 144, 2004. Google ScholarDigital Library
J. Rao and K. A. Ross. Making b+- trees cache conscious in main memory. In SIGMOD, pages 475--486, 2000. Google ScholarDigital Library
S. Richter, V. Alvarez, and J. Dittrich. A seven-dimensional analysis of hashing methods and its implications on query processing. Proc. VLDB Endow., 9(3):96--107, Nov. 2015. Google ScholarDigital Library
D. G. Severance and G. M. Lohman. Differential files: Their application to the maintenance of large data bases. In SIGMOD, pages 43--43, 1976. Google ScholarCross Ref
A. Shahvarani and H.-A. Jacobsen. A hybrid b+-tree as solution for inmemory indexing on cpu-gpu heterogeneous computing platforms. In SIGMOD, pages 1523--1538, 2016. Google ScholarDigital Library
N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. Outrageously large neural networks: The sparsely-gated mixtureof-experts layer. arXiv preprint arXiv:1701.06538, 2017.Google Scholar
M. Stonebraker and L. A. Rowe. The Design of POSTGRES. In SIGMOD, pages 340--355, 1986. Google ScholarDigital Library
I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, pages 3104--3112, 2014. Google ScholarDigital Library
F. Tramèr, A. Kurakin, N. Papernot, D. Boneh, and P. McDaniel. Ensemble adversarial training: Attacks and defenses. arXiv preprint arXiv:1705.07204, 2017.Google Scholar
M. Turcanik and M. Javurek. Hash function generation by neural network. In NTSP, pages 1--5, Oct 2016.Google ScholarCross Ref
J. Wang, W. Liu, S. Kumar, and S. F. Chang. Learning to hash for indexing big data;a survey. Proceedings of the IEEE, 104(1):34--57, Jan 2016.Google ScholarCross Ref
J. Wang, H. T. Shen, J. Song, and J. Ji. Hashing for similarity search: A survey. CoRR, abs/1408.2927, 2014.Google Scholar
J. Wang, J. Wang, N. Yu, and S. Li. Order preserving hashing for approximate nearest neighbor search. In MM, pages 133--142, 2013. Google ScholarDigital Library
Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.Google Scholar
S. You, D. Ding, K. Canini, J. Pfeifer, and M. Gupta. Deep lattice networks and partial monotonic functions. In NIPS, pages 2985--2993, 2017.Google Scholar
Y. You, Z. Zhang, C. Hsieh, J. Demmel, and K. Keutzer. Imagenet training in minutes. CoRR, abs/1709.05011, 2017.Google Scholar
J. Yu and M. Sarwat. Two Birds, One Stone: A Fast, Yet Lightweight, Indexing Scheme for Modern Database Systems. In VLDB, pages 385--396, 2016. Google ScholarDigital Library
E. Zamanian, C. Binnig, T. Kraska, and T. Harris. The end of a myth: Distributed transaction can scale. PVLDB, 10(6):685--696, 2017. Google ScholarDigital Library
H. Zhang, D. G. Andersen, A. Pavlo, M. Kaminsky, L. Ma, and R. Shen. Reducing the storage overhead of main-memory OLTP databases with hybrid indexes. In SIGMOD, pages 1567--1581, 2016. Google ScholarDigital Library

Index Terms

The Case for Learned Index Structures
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
    2. Machine learning approaches
      1. Neural networks
2. Information systems
  1. Data management systems
    1. Data structures

Recommendations

CDFShop: Exploring and Optimizing Learned Index Structures
SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

Indexes are a critical component of data management applications. While tree-like structures (e.g., B-Trees) have been employed to great success, recent work suggests that index structures powered by machine learning models (learned index structures) ...
Read More
Data-Driven Learned Metric Index: An Unsupervised Approach
Similarity Search and Applications
Abstract
Metric indexes are traditionally used for organizing unstructured or complex data to speed up similarity queries. The most widely-used indexes cluster data or divide space using hyper-planes. While searching, the mutual distances between objects ...
Read More
ALMSS: Automatic Learned Index Model Selection System
Web and Big Data
Abstract
Index is an indispensable part of database. As we enter the era of big data, the traditional index structure is found not to support large-scale data well. Although many index structures such as learned indexes based on machine learning have been ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data
May 2018
1874 pages
ISBN:9781450347037
DOI:10.1145/3183713
General Chairs:
Gautam Das
University of Texas at Arlington, USA
,
Christopher Jermaine
Rice University, USA
,
Philip Bernstein
Microsoft Research, USA
Copyright © 2018 Owner/Author
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike International 4.0 License.
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 27 May 2018
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
b-tree
bloom-filter
cdf
hash-map
index structures
learned data structures
learned index
learned index structure
linear regression
mixture of experts
neural net
Qualifiers
- research-article
Conference

Acceptance Rates
SIGMOD '18 Paper Acceptance Rate90of461submissions,20%Overall Acceptance Rate785of4,003submissions,20%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 438
  Total Citations
  View Citations
- 18,441
  Total Downloads
- Downloads (Last 12 months)3,649
- Downloads (Last 6 weeks)601
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

The Case for Learned Index Structures

SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data

ABSTRACT

References

Cited By

Index Terms

Recommendations

CDFShop: Exploring and Optimizing Learned Index Structures

Data-Driven Learned Metric Index: An Unsupervised Approach

ALMSS: Automatic Learned Index Model Selection System

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

The Case for Learned Index Structures

SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data

ABSTRACT

References

Cited By

Index Terms

Recommendations

CDFShop: Exploring and Optimizing Learned Index Structures

Data-Driven Learned Metric Index: An Unsupervised Approach

ALMSS: Automatic Learned Index Model Selection System

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media