skip to main content
10.1145/3132747.3132765acmconferencesArticle/Chapter ViewAbstractPublication PagessospConference Proceedingsconference-collections
research-article

PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees

Published:14 October 2017Publication History

ABSTRACT

Key-value stores such as LevelDB and RocksDB offer excellent write throughput, but suffer high write amplification. The write amplification problem is due to the Log-Structured Merge Trees data structure that underlies these key-value stores. To remedy this problem, this paper presents a novel data structure that is inspired by Skip Lists, termed Fragmented Log-Structured Merge Trees (FLSM). FLSM introduces the notion of guards to organize logs, and avoids rewriting data in the same level. We build PebblesDB, a high-performance key-value store, by modifying HyperLevelDB to use the FLSM data structure. We evaluate PebblesDB using micro-benchmarks and show that for write-intensive workloads, PebblesDB reduces write amplification by 2.4-3x compared to RocksDB, while increasing write throughput by 6.7x. We modify two widely-used NoSQL stores, MongoDB and HyperDex, to use PebblesDB as their underlying storage engine. Evaluating these applications using the YCSB benchmark shows that throughput is increased by 18-105% when using PebblesDB (compared to their default storage engines) while write IO is decreased by 35-55%.

Skip Supplemental Material Section

Supplemental Material

pebblesdb.mp4

mp4

2.2 GB

References

  1. Ittai Abraham, James Aspnes, and Jian Yuan. 2005. Skip B-trees. In Proceedings of the 9th International Conference on Principles of Distributed Systems (OPODIS 2005). 366--380. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Alok Aggarwal, Jeffrey Vitter, et al. 1988. The input/output complexity of sorting and related problems. Commun. ACM 31, 9 (1988), 1116--1127. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Nitin Agrawal, Vijayan Prabhakaran, Ted Wobber, John D. Davis, Mark S. Manasse, and Rina Panigrahy. 2008. Design Tradeoffs for SSD Performance. In Proceedings of the 2008 USENIX Annual Technical Conference. 57--70. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Jung-Sang Ahn, Chiyoung Seo, Ravi Mayuram, Rahim Yaseen, Jin-Soo Kim, and Seungryoul Maeng. 2016. ForestDB: A Fast Key-Value Storage System for Variable-Length String Keys. IEEE Trans. Comput. 65, 3 (2016), 902--915. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Reed Allman. 2014. Rock Solid Queues @ Iron.io. https://www.youtube.com/watch?v=HTjt6oj-RL4. (2014).Google ScholarGoogle Scholar
  6. David G. Andersen, Jason Franklin, Michael Kaminsky, Amar Phanishayee, Lawrence Tan, and Vijay Vasudevan. 2009. FAWN: A Fast Array Of Wimpy Nodes. In Poceedings of the ACM SIGOPS 22nd Symposium On Operating Systems Principles (SOSP 09). ACM, 1--14. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Apache. 2017. Search Results Apache Flink: Scalable Stream and Batch Data Processing. https://flink.apache.org. (2017).Google ScholarGoogle Scholar
  8. Austin Appleby. 2016. SMHasher test suite for MurmurHash family of hash functions. https://github.com/aappleby/smhasher. (2016).Google ScholarGoogle Scholar
  9. Anirudh Badam, KyoungSoo Park, Vivek S. Pai, and Larry L Peterson. 2009. HashCache: Cache Storage for the Next Billion. In Proceedings of the 6th USENIX Symposium on Network Systems Design and Implementation (NSDI 09). 123--136. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Oana Balmau, Diego Didona, Rachid Guerraoui, Willy Zwaenepoel, Huapeng Yuan, Aashray Arora, Karan Gupta, and Pavan Konka. 2017. TRIAD: Creating Synergies Between Memory, Disk and Log in Log Structured Key-Value Stores. In Proceedings of the 2017 USENIX Annual Technical Conference (USENIX ATC 17). Santa Clara, CA, 363--375. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Michael A. Bender, Martin Farach-Colton, Jeremy T. Fineman, Yonatan R. Fogel, Bradley C. Kuszmaul, and Jelani Nelson. 2007. Cache-Oblivious Streaming B-trees. In Proceedings of the 19th Annual ACM Symposium on Parallel Algorithms and Architectures. ACM, 81--92. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Michael A. Bender, Martín Farach-Colton, Rob Johnson, Simon Mauras, Tyler Mayer, Cynthia A. Phillips, and Helen Xu. 2017. Write-Optimized Skip Lists. In Proceedings of the 36th ACM Symposium on Principles of Database Systems (PODS '17). ACM, New York, NY, USA, 69--78. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Burton H. Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (1970), 422--426. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Douglas Comer. 1979. Ubiquitous B-tree. ACM Computing Surveys (CSUR) 11, 2 (1979), 121--137. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Alexander Conway, Ainesh Bakshi, Yizheng Jiao, William Jannen, Yang Zhan, Jun Yuan, Michael A. Bender, Rob Johnson, Bradley C. Kuszmaul, Donald E. Porter, Jun Yuan, and Martin Farach-Colton. 2017. File Systems Fated for Senescence? Nonsense, Says Science!. In Proceedings of the 15th USENIX Conference on File and Storage Technologies (FAST 17). 45--58. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking Cloud Serving Systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing (SOCC 10). ACM, 143--154. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Biplob Debnath, Sudipta Sengupta, and Jin Li. 2011. SkimpyStash: RAM Space Skimpy Key-Value Store on Flash-Based Storage. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. ACM, 25--36. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Robert Escriva, Bernard Wong, and Emin Gün Sirer. 2012. HyperDex: a distributed, searchable key-value store. In Proceedings of the ACM SIGCOMM 2012 Conference. 25--36. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Facebook. 2017. FIFO compaction style. https://github.com/facebook/rocksdb/wiki/FIFO-compaction-style. (2017).Google ScholarGoogle Scholar
  20. Facebook. 2017. RocksDB | A persistent key-value store. http://rocksdb.org. (2017).Google ScholarGoogle Scholar
  21. Facebook. 2017. RocksDB Users. https://github.com/facebook/rocksdb/blob/master/USERS.md. (2017).Google ScholarGoogle Scholar
  22. Facebook. 2017. Universal Compaction. https://github.com/facebook/rocksdb/wiki/Universal-Compaction. (2017).Google ScholarGoogle Scholar
  23. Guy Golan-Gueta, Edward Bortnikov, Eshcar Hillel, and Idit Keidar. 2015. Scaling Concurrent Log-structured Data Stores. In Proceedings of the Tenth European Conference on Computer Systems (Eurosys 15). ACM, 32. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Daniel Golovin. 2010. The B-Skip-List: A Simpler Uniquely Represented Alternative to B-Trees. CoRR abs/1005.0662 (2010).Google ScholarGoogle Scholar
  25. Google. 2017. LevelDB. https://github.com/google/leveldb. (2017).Google ScholarGoogle Scholar
  26. Laura M. Grupp, Adrian M. Caulfield, Joel Coburn, Steven Swanson, Eitan Yaakobi, Paul H. Siegel, and Jack K. Wolf. 2009. Characterizing Flash Memory: Anomalies, Observations, and Applications. In Proceedings of 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42). IEEE, 24--33. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. James Hamilton. 2009. The Cost of Latency. http://perspectives.mvdirona.com/2009/10/the-cost-of-latency/. (2009).Google ScholarGoogle Scholar
  28. HyperDex. 2016. HyperDex Benchmark Setup. http://hyperdex.org/performance/setup/. (2016).Google ScholarGoogle Scholar
  29. HyperDex. 2017. HyperLevelDB Performance Benchmarks. http://hyperdex.org/performance/leveldb/. (2017).Google ScholarGoogle Scholar
  30. Cockroach Labs. 2017. CockroachDB. https://github.com/cockroachdb/cockroach. (2017).Google ScholarGoogle Scholar
  31. Dgraph labs. 2017. Dgraph: Graph database for production environment. https://dgraph.io. (2017).Google ScholarGoogle Scholar
  32. FAL Labs. 2011. Kyoto Cabinet: a straightforward implementation of DBM. http://fallabs.com/kyotocabinet/. (2011).Google ScholarGoogle Scholar
  33. LevelDB. 2016. LevelDB db_bench benchmark. https://github.com/google/leveldb/blob/master/db/db_bench.cc. (2016).Google ScholarGoogle Scholar
  34. Hyeontaek Lim, Bin Fan, David G. Andersen, and Michael Kaminsky. 2011. SILT: A memory-efficient, high-performance key-value store. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (SOSP 11). ACM, 1--13. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. LinkedIn. 2016. FollowFeed: LinkedIn's Feed Made Faster and Smarter. http://bit.ly/2onMQwN. (2016).Google ScholarGoogle Scholar
  36. Percona LLC. 2017. Percona TokuDB. https://www.percona.com/software/mysql-database/percona-tokudb. (2017).Google ScholarGoogle Scholar
  37. Lanyue Lu, Thanumalayan Sankaranarayana Pillai, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2016. WiscKey: Separating Keys from Values in SSD-conscious Storage. In Proceedings of the 14th USENIX Conference on File and Storage Technologies (FAST 16). 133--148. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Leonardo Marmol, Swaminathan Sundararaman, Nisha Talagala, and Raju Rangaswami. 2015. NVMKV: a Scalable, Lightweight, FTL-aware Key-Value Store. In 2015 USENIX Annual Technical Conference (USENIX ATC 15). 207--219. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Neal Mielke, Todd Marquart, Ning Wu, Jeff Kessenich, Hanmant Belgal, Eric Schares, Falgun Trivedi, Evan Goodness, and Leland R. Nevill. 2008. Bit Error Rate in NAND Flash Memories. In Proceedings of the IEEE International Reliability Physics Symposium, (IRPS 08). IEEE, 9--19.Google ScholarGoogle Scholar
  40. MongoDB. 2017. MongoDB. https://www.mongodb.com. (2017).Google ScholarGoogle Scholar
  41. Dushyanth Narayanan, Eno Thereska, Austin Donnelly, Sameh Elnikety, and Antony Rowstron. 2009. Migrating server storage to SSDs: analysis of tradeoffs. In Proceedings of the 4th ACM European conference on Computer Systems (Eurosys 09). ACM, 145--158. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Suman Nath and Aman Kansal. 2007. FlashDB: Dynamic Self-tuning Database for NAND flash. In Proceedings of the 6th International Conference on Information Processing in Sensor Networks. ACM, 410--419. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Netflix. 2016. Application Data Caching using SSDs. http://techblog.netflix.com/2016/05/application-data-caching-using-ssds.html. (May 2016).Google ScholarGoogle Scholar
  44. Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil. 1996. The log-structured merge-tree (LSM-tree). Acta Informatica 33, 4 (1996), 351--385. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Oracle. 2017. Oracle Berkeley DB. http//:www.oracle.com/technetwork/database/database-technologies/berkeleydb/overview/index.html. (2017).Google ScholarGoogle Scholar
  46. Pinterest. 2016. Open-sourcing Rocksplicator, a real-time RocksDB data replicator. http://bit.ly/2pv5nZZ. (2016).Google ScholarGoogle Scholar
  47. William Pugh. 1989. Skip lists: A probabilistic alternative to balanced trees. Algorithms and Data Structures (1989), 437--449. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. William Pugh. 1990. A Skip List Cookbook. Technical Report CS-TR-2286.1. University of Maryland. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Parthasarathy Ranganathan. 2011. From Microprocessors to Nanostores: Rethinking Data-Centric Systems. Computer 44, 1 (2011), 39--48. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. Apache Samza. 2017. State Management. http://samza.apache.org/learn/documentation/0.8/container/state-management.html. (2017).Google ScholarGoogle Scholar
  51. Russell Sears and Raghu Ramakrishnan. 2012. bLSM:a General Purpose Log Structured Merge Tree. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM, 217--228. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. Pradeep J. Shetty, Richard P. Spillane, Ravikant R. Malpani, Binesh Andrews, Justin Seyster, and Erez Zadok. 2013. Building Workload-Independent Storage with VT-trees. In Proceedings of the 11th USENIX Conference on File and Storage Technologies (FAST 13). 17--30. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. RocksDB Issue Tracker. 2014. Strategies to Reduce Write Amplification 19. https://github.com/facebook/rocksdb/issues/19. (2014).Google ScholarGoogle Scholar
  54. Uber. 2016. Cherami: Uber Engineering's Durable and Scalable Queue in Go. https://eng.uber.com/cherami/. (2016).Google ScholarGoogle Scholar
  55. Vijay Vasudevan, Michael Kaminsky, and David G. Andersen. 2012. Using Vector Interfaces To Deliver Millions Of IOPS From A Networked Key-Value Storage Server. In Proceedings of the Third ACM Symposium on Cloud Computing (SOCC 12). ACM, 8. Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. Peng Wang, Guangyu Sun, Song Jiang, Jian Ouyang, Shiding Lin, Chen Zhang, and Jason Cong. 2014. An Efficient Design And Implementation Of LSM-Tree Based Key-Value Store On Open-Channel SSD. In Proceedings of the Ninth European Conference on Computer Systems (Eurosys 14). ACM, 16. Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. Xingbo Wu, Yuehai Xu, Zili Shao, and Song Jiang. 2015. LSM-trie: An LSM-Tree-Based Ultra-Large Key-Value Store for Small Data Items. In Proceedings of the 2015 USENIX Annual Technical Conference (USENIX ATC 15). 71--82. Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. Demetrios Zeinalipour-Yazti, Song Lin, Vana Kalogeraki, Dimitrios Gunopulos, and Walid A. Najjar. 2005. MicroHash: An Efficient Index Structure for Flash-Based Sensor Devices. In Proceedings of the 4th USENIX Conference on File and Storage Technologies (FAST '05). Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        SOSP '17: Proceedings of the 26th Symposium on Operating Systems Principles
        October 2017
        677 pages
        ISBN:9781450350853
        DOI:10.1145/3132747

        Copyright © 2017 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 14 October 2017

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article
        • Research
        • Refereed limited

        Acceptance Rates

        Overall Acceptance Rate131of716submissions,18%

        Upcoming Conference

        SOSP '24

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader