research-article

PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees

Authors:
Pandian Raju

University of Texas at Austin

University of Texas at Austin
View Profile

,
Rohan Kadekodi

University of Texas at Austin

University of Texas at Austin
View Profile

,
Vijay Chidambaram

University of Texas at Austin and VMware Research

University of Texas at Austin and VMware Research
View Profile

,
Ittai Abraham

VMware Research

VMware Research
View Profile

SOSP '17: Proceedings of the 26th Symposium on Operating Systems PrinciplesOctober 2017Pages 497–514https://doi.org/10.1145/3132747.3132765

Published:14 October 2017Publication History

SOSP '17: Proceedings of the 26th Symposium on Operating Systems Principles

Pages 497–514

ABSTRACT

Key-value stores such as LevelDB and RocksDB offer excellent write throughput, but suffer high write amplification. The write amplification problem is due to the Log-Structured Merge Trees data structure that underlies these key-value stores. To remedy this problem, this paper presents a novel data structure that is inspired by Skip Lists, termed Fragmented Log-Structured Merge Trees (FLSM). FLSM introduces the notion of guards to organize logs, and avoids rewriting data in the same level. We build PebblesDB, a high-performance key-value store, by modifying HyperLevelDB to use the FLSM data structure. We evaluate PebblesDB using micro-benchmarks and show that for write-intensive workloads, PebblesDB reduces write amplification by 2.4-3x compared to RocksDB, while increasing write throughput by 6.7x. We modify two widely-used NoSQL stores, MongoDB and HyperDex, to use PebblesDB as their underlying storage engine. Evaluating these applications using the YCSB benchmark shows that throughput is increased by 18-105% when using PebblesDB (compared to their default storage engines) while write IO is decreased by 35-55%.

Supplemental Material

pebblesdb.mp4

mp4

2.2 GB

Download

References

Ittai Abraham, James Aspnes, and Jian Yuan. 2005. Skip B-trees. In Proceedings of the 9th International Conference on Principles of Distributed Systems (OPODIS 2005). 366--380. Google ScholarDigital Library
Alok Aggarwal, Jeffrey Vitter, et al. 1988. The input/output complexity of sorting and related problems. Commun. ACM 31, 9 (1988), 1116--1127. Google ScholarDigital Library
Nitin Agrawal, Vijayan Prabhakaran, Ted Wobber, John D. Davis, Mark S. Manasse, and Rina Panigrahy. 2008. Design Tradeoffs for SSD Performance. In Proceedings of the 2008 USENIX Annual Technical Conference. 57--70. Google ScholarDigital Library
Jung-Sang Ahn, Chiyoung Seo, Ravi Mayuram, Rahim Yaseen, Jin-Soo Kim, and Seungryoul Maeng. 2016. ForestDB: A Fast Key-Value Storage System for Variable-Length String Keys. IEEE Trans. Comput. 65, 3 (2016), 902--915. Google ScholarDigital Library
Reed Allman. 2014. Rock Solid Queues @ Iron.io. https://www.youtube.com/watch?v=HTjt6oj-RL4. (2014).Google Scholar
David G. Andersen, Jason Franklin, Michael Kaminsky, Amar Phanishayee, Lawrence Tan, and Vijay Vasudevan. 2009. FAWN: A Fast Array Of Wimpy Nodes. In Poceedings of the ACM SIGOPS 22nd Symposium On Operating Systems Principles (SOSP 09). ACM, 1--14. Google ScholarDigital Library
Apache. 2017. Search Results Apache Flink: Scalable Stream and Batch Data Processing. https://flink.apache.org. (2017).Google Scholar
Austin Appleby. 2016. SMHasher test suite for MurmurHash family of hash functions. https://github.com/aappleby/smhasher. (2016).Google Scholar
Anirudh Badam, KyoungSoo Park, Vivek S. Pai, and Larry L Peterson. 2009. HashCache: Cache Storage for the Next Billion. In Proceedings of the 6th USENIX Symposium on Network Systems Design and Implementation (NSDI 09). 123--136. Google ScholarDigital Library
Oana Balmau, Diego Didona, Rachid Guerraoui, Willy Zwaenepoel, Huapeng Yuan, Aashray Arora, Karan Gupta, and Pavan Konka. 2017. TRIAD: Creating Synergies Between Memory, Disk and Log in Log Structured Key-Value Stores. In Proceedings of the 2017 USENIX Annual Technical Conference (USENIX ATC 17). Santa Clara, CA, 363--375. Google ScholarDigital Library
Michael A. Bender, Martin Farach-Colton, Jeremy T. Fineman, Yonatan R. Fogel, Bradley C. Kuszmaul, and Jelani Nelson. 2007. Cache-Oblivious Streaming B-trees. In Proceedings of the 19th Annual ACM Symposium on Parallel Algorithms and Architectures. ACM, 81--92. Google ScholarDigital Library
Michael A. Bender, Martín Farach-Colton, Rob Johnson, Simon Mauras, Tyler Mayer, Cynthia A. Phillips, and Helen Xu. 2017. Write-Optimized Skip Lists. In Proceedings of the 36th ACM Symposium on Principles of Database Systems (PODS '17). ACM, New York, NY, USA, 69--78. Google ScholarDigital Library
Burton H. Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (1970), 422--426. Google ScholarDigital Library
Douglas Comer. 1979. Ubiquitous B-tree. ACM Computing Surveys (CSUR) 11, 2 (1979), 121--137. Google ScholarDigital Library
Alexander Conway, Ainesh Bakshi, Yizheng Jiao, William Jannen, Yang Zhan, Jun Yuan, Michael A. Bender, Rob Johnson, Bradley C. Kuszmaul, Donald E. Porter, Jun Yuan, and Martin Farach-Colton. 2017. File Systems Fated for Senescence? Nonsense, Says Science!. In Proceedings of the 15th USENIX Conference on File and Storage Technologies (FAST 17). 45--58. Google ScholarDigital Library
Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking Cloud Serving Systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing (SOCC 10). ACM, 143--154. Google ScholarDigital Library
Biplob Debnath, Sudipta Sengupta, and Jin Li. 2011. SkimpyStash: RAM Space Skimpy Key-Value Store on Flash-Based Storage. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. ACM, 25--36. Google ScholarDigital Library
Robert Escriva, Bernard Wong, and Emin Gün Sirer. 2012. HyperDex: a distributed, searchable key-value store. In Proceedings of the ACM SIGCOMM 2012 Conference. 25--36. Google ScholarDigital Library
Facebook. 2017. FIFO compaction style. https://github.com/facebook/rocksdb/wiki/FIFO-compaction-style. (2017).Google Scholar
Facebook. 2017. RocksDB | A persistent key-value store. http://rocksdb.org. (2017).Google Scholar
Facebook. 2017. RocksDB Users. https://github.com/facebook/rocksdb/blob/master/USERS.md. (2017).Google Scholar
Facebook. 2017. Universal Compaction. https://github.com/facebook/rocksdb/wiki/Universal-Compaction. (2017).Google Scholar
Guy Golan-Gueta, Edward Bortnikov, Eshcar Hillel, and Idit Keidar. 2015. Scaling Concurrent Log-structured Data Stores. In Proceedings of the Tenth European Conference on Computer Systems (Eurosys 15). ACM, 32. Google ScholarDigital Library
Daniel Golovin. 2010. The B-Skip-List: A Simpler Uniquely Represented Alternative to B-Trees. CoRR abs/1005.0662 (2010).Google Scholar
Google. 2017. LevelDB. https://github.com/google/leveldb. (2017).Google Scholar
Laura M. Grupp, Adrian M. Caulfield, Joel Coburn, Steven Swanson, Eitan Yaakobi, Paul H. Siegel, and Jack K. Wolf. 2009. Characterizing Flash Memory: Anomalies, Observations, and Applications. In Proceedings of 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42). IEEE, 24--33. Google ScholarDigital Library
James Hamilton. 2009. The Cost of Latency. http://perspectives.mvdirona.com/2009/10/the-cost-of-latency/. (2009).Google Scholar
HyperDex. 2016. HyperDex Benchmark Setup. http://hyperdex.org/performance/setup/. (2016).Google Scholar
HyperDex. 2017. HyperLevelDB Performance Benchmarks. http://hyperdex.org/performance/leveldb/. (2017).Google Scholar
Cockroach Labs. 2017. CockroachDB. https://github.com/cockroachdb/cockroach. (2017).Google Scholar
Dgraph labs. 2017. Dgraph: Graph database for production environment. https://dgraph.io. (2017).Google Scholar
FAL Labs. 2011. Kyoto Cabinet: a straightforward implementation of DBM. http://fallabs.com/kyotocabinet/. (2011).Google Scholar
LevelDB. 2016. LevelDB db_bench benchmark. https://github.com/google/leveldb/blob/master/db/db_bench.cc. (2016).Google Scholar
Hyeontaek Lim, Bin Fan, David G. Andersen, and Michael Kaminsky. 2011. SILT: A memory-efficient, high-performance key-value store. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (SOSP 11). ACM, 1--13. Google ScholarDigital Library
LinkedIn. 2016. FollowFeed: LinkedIn's Feed Made Faster and Smarter. http://bit.ly/2onMQwN. (2016).Google Scholar
Percona LLC. 2017. Percona TokuDB. https://www.percona.com/software/mysql-database/percona-tokudb. (2017).Google Scholar
Lanyue Lu, Thanumalayan Sankaranarayana Pillai, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2016. WiscKey: Separating Keys from Values in SSD-conscious Storage. In Proceedings of the 14th USENIX Conference on File and Storage Technologies (FAST 16). 133--148. Google ScholarDigital Library
Leonardo Marmol, Swaminathan Sundararaman, Nisha Talagala, and Raju Rangaswami. 2015. NVMKV: a Scalable, Lightweight, FTL-aware Key-Value Store. In 2015 USENIX Annual Technical Conference (USENIX ATC 15). 207--219. Google ScholarDigital Library
Neal Mielke, Todd Marquart, Ning Wu, Jeff Kessenich, Hanmant Belgal, Eric Schares, Falgun Trivedi, Evan Goodness, and Leland R. Nevill. 2008. Bit Error Rate in NAND Flash Memories. In Proceedings of the IEEE International Reliability Physics Symposium, (IRPS 08). IEEE, 9--19.Google Scholar
MongoDB. 2017. MongoDB. https://www.mongodb.com. (2017).Google Scholar
Dushyanth Narayanan, Eno Thereska, Austin Donnelly, Sameh Elnikety, and Antony Rowstron. 2009. Migrating server storage to SSDs: analysis of tradeoffs. In Proceedings of the 4th ACM European conference on Computer Systems (Eurosys 09). ACM, 145--158. Google ScholarDigital Library
Suman Nath and Aman Kansal. 2007. FlashDB: Dynamic Self-tuning Database for NAND flash. In Proceedings of the 6th International Conference on Information Processing in Sensor Networks. ACM, 410--419. Google ScholarDigital Library
Netflix. 2016. Application Data Caching using SSDs. http://techblog.netflix.com/2016/05/application-data-caching-using-ssds.html. (May 2016).Google Scholar
Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil. 1996. The log-structured merge-tree (LSM-tree). Acta Informatica 33, 4 (1996), 351--385. Google ScholarDigital Library
Oracle. 2017. Oracle Berkeley DB. http//:www.oracle.com/technetwork/database/database-technologies/berkeleydb/overview/index.html. (2017).Google Scholar
Pinterest. 2016. Open-sourcing Rocksplicator, a real-time RocksDB data replicator. http://bit.ly/2pv5nZZ. (2016).Google Scholar
William Pugh. 1989. Skip lists: A probabilistic alternative to balanced trees. Algorithms and Data Structures (1989), 437--449. Google ScholarDigital Library
William Pugh. 1990. A Skip List Cookbook. Technical Report CS-TR-2286.1. University of Maryland. Google ScholarDigital Library
Parthasarathy Ranganathan. 2011. From Microprocessors to Nanostores: Rethinking Data-Centric Systems. Computer 44, 1 (2011), 39--48. Google ScholarDigital Library
Apache Samza. 2017. State Management. http://samza.apache.org/learn/documentation/0.8/container/state-management.html. (2017).Google Scholar
Russell Sears and Raghu Ramakrishnan. 2012. bLSM:a General Purpose Log Structured Merge Tree. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM, 217--228. Google ScholarDigital Library
Pradeep J. Shetty, Richard P. Spillane, Ravikant R. Malpani, Binesh Andrews, Justin Seyster, and Erez Zadok. 2013. Building Workload-Independent Storage with VT-trees. In Proceedings of the 11th USENIX Conference on File and Storage Technologies (FAST 13). 17--30. Google ScholarDigital Library
RocksDB Issue Tracker. 2014. Strategies to Reduce Write Amplification 19. https://github.com/facebook/rocksdb/issues/19. (2014).Google Scholar
Uber. 2016. Cherami: Uber Engineering's Durable and Scalable Queue in Go. https://eng.uber.com/cherami/. (2016).Google Scholar
Vijay Vasudevan, Michael Kaminsky, and David G. Andersen. 2012. Using Vector Interfaces To Deliver Millions Of IOPS From A Networked Key-Value Storage Server. In Proceedings of the Third ACM Symposium on Cloud Computing (SOCC 12). ACM, 8. Google ScholarDigital Library
Peng Wang, Guangyu Sun, Song Jiang, Jian Ouyang, Shiding Lin, Chen Zhang, and Jason Cong. 2014. An Efficient Design And Implementation Of LSM-Tree Based Key-Value Store On Open-Channel SSD. In Proceedings of the Ninth European Conference on Computer Systems (Eurosys 14). ACM, 16. Google ScholarDigital Library
Xingbo Wu, Yuehai Xu, Zili Shao, and Song Jiang. 2015. LSM-trie: An LSM-Tree-Based Ultra-Large Key-Value Store for Small Data Items. In Proceedings of the 2015 USENIX Annual Technical Conference (USENIX ATC 15). 71--82. Google ScholarDigital Library
Demetrios Zeinalipour-Yazti, Song Lin, Vana Kalogeraki, Dimitrios Gunopulos, and Walid A. Najjar. 2005. MicroHash: An Efficient Index Structure for Flash-Based Sensor Devices. In Proceedings of the 4th USENIX Conference on File and Storage Technologies (FAST '05). Google ScholarDigital Library

Index Terms

PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees
1. Information systems
  1. Data management systems
    1. Data structures
      1. Data layout
        Record and block layout
    2. Database management system engines
      1. Parallel and distributed DBMSs
        Key-value stores

Recommendations

HiLSM: an LSM-based key-value store for hybrid NVM-SSD storage systems
CF '20: Proceedings of the 17th ACM International Conference on Computing Frontiers

In order to ensure data durability and crash consistency, the LSM-tree based key-value stores suffer from high WAL synchronization overhead. Fortunately, the advent of NVM offers an opportunity to address this issue. However, NVM is currently too ...
Read More
An Efficient Memory-Mapped Key-Value Store for Flash Storage
SoCC '18: Proceedings of the ACM Symposium on Cloud Computing

Persistent key-value stores have emerged as a main component in the data access path of modern data processing systems. However, they exhibit high CPU and I/O overhead. Today, due to power limitations it is important to reduce CPU overheads for data ...
Read More
BoLT: Barrier-optimized LSM-Tree
Middleware '20: Proceedings of the 21st International Middleware Conference

Key-value stores such as LevelDB and RocksDB are widely used in various systems due to their high write performance. However, the background compaction operations inherent to the key-value stores are often to blame for write amplification and write ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in

SOSP '17: Proceedings of the 26th Symposium on Operating Systems Principles
October 2017
677 pages
ISBN:9781450350853
DOI:10.1145/3132747

Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 14 October 2017
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
key-value stores
log-structured merge trees
write-optimized data structures
Qualifiers
- research-article
- Research
- Refereed limited
Conference

Acceptance Rates
Overall Acceptance Rate131of716submissions,18%
Upcoming Conference
SOSP '24

Sponsor:

sigops

ACM SIGOPS 29th Symposium on Operating Systems Principles

November 5 - 8, 2024

Austin , TX , USA
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 207
  Total Citations
  View Citations
- 3,207
  Total Downloads
- Downloads (Last 12 months)290
- Downloads (Last 6 weeks)43
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees

SOSP '17: Proceedings of the 26th Symposium on Operating Systems Principles

ABSTRACT

Supplemental Material

References

Cited By

Index Terms

Recommendations

HiLSM: an LSM-based key-value store for hybrid NVM-SSD storage systems

An Efficient Memory-Mapped Key-Value Store for Flash Storage

BoLT: Barrier-optimized LSM-Tree

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees

SOSP '17: Proceedings of the 26th Symposium on Operating Systems Principles

ABSTRACT

Supplemental Material

References

Cited By

Index Terms

Recommendations

HiLSM: an LSM-based key-value store for hybrid NVM-SSD storage systems

An Efficient Memory-Mapped Key-Value Store for Flash Storage

BoLT: Barrier-optimized LSM-Tree

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media