ABSTRACT
Key-value stores such as LevelDB and RocksDB offer excellent write throughput, but suffer high write amplification. The write amplification problem is due to the Log-Structured Merge Trees data structure that underlies these key-value stores. To remedy this problem, this paper presents a novel data structure that is inspired by Skip Lists, termed Fragmented Log-Structured Merge Trees (FLSM). FLSM introduces the notion of guards to organize logs, and avoids rewriting data in the same level. We build PebblesDB, a high-performance key-value store, by modifying HyperLevelDB to use the FLSM data structure. We evaluate PebblesDB using micro-benchmarks and show that for write-intensive workloads, PebblesDB reduces write amplification by 2.4-3x compared to RocksDB, while increasing write throughput by 6.7x. We modify two widely-used NoSQL stores, MongoDB and HyperDex, to use PebblesDB as their underlying storage engine. Evaluating these applications using the YCSB benchmark shows that throughput is increased by 18-105% when using PebblesDB (compared to their default storage engines) while write IO is decreased by 35-55%.
Supplemental Material
- Ittai Abraham, James Aspnes, and Jian Yuan. 2005. Skip B-trees. In Proceedings of the 9th International Conference on Principles of Distributed Systems (OPODIS 2005). 366--380. Google ScholarDigital Library
- Alok Aggarwal, Jeffrey Vitter, et al. 1988. The input/output complexity of sorting and related problems. Commun. ACM 31, 9 (1988), 1116--1127. Google ScholarDigital Library
- Nitin Agrawal, Vijayan Prabhakaran, Ted Wobber, John D. Davis, Mark S. Manasse, and Rina Panigrahy. 2008. Design Tradeoffs for SSD Performance. In Proceedings of the 2008 USENIX Annual Technical Conference. 57--70. Google ScholarDigital Library
- Jung-Sang Ahn, Chiyoung Seo, Ravi Mayuram, Rahim Yaseen, Jin-Soo Kim, and Seungryoul Maeng. 2016. ForestDB: A Fast Key-Value Storage System for Variable-Length String Keys. IEEE Trans. Comput. 65, 3 (2016), 902--915. Google ScholarDigital Library
- Reed Allman. 2014. Rock Solid Queues @ Iron.io. https://www.youtube.com/watch?v=HTjt6oj-RL4. (2014).Google Scholar
- David G. Andersen, Jason Franklin, Michael Kaminsky, Amar Phanishayee, Lawrence Tan, and Vijay Vasudevan. 2009. FAWN: A Fast Array Of Wimpy Nodes. In Poceedings of the ACM SIGOPS 22nd Symposium On Operating Systems Principles (SOSP 09). ACM, 1--14. Google ScholarDigital Library
- Apache. 2017. Search Results Apache Flink: Scalable Stream and Batch Data Processing. https://flink.apache.org. (2017).Google Scholar
- Austin Appleby. 2016. SMHasher test suite for MurmurHash family of hash functions. https://github.com/aappleby/smhasher. (2016).Google Scholar
- Anirudh Badam, KyoungSoo Park, Vivek S. Pai, and Larry L Peterson. 2009. HashCache: Cache Storage for the Next Billion. In Proceedings of the 6th USENIX Symposium on Network Systems Design and Implementation (NSDI 09). 123--136. Google ScholarDigital Library
- Oana Balmau, Diego Didona, Rachid Guerraoui, Willy Zwaenepoel, Huapeng Yuan, Aashray Arora, Karan Gupta, and Pavan Konka. 2017. TRIAD: Creating Synergies Between Memory, Disk and Log in Log Structured Key-Value Stores. In Proceedings of the 2017 USENIX Annual Technical Conference (USENIX ATC 17). Santa Clara, CA, 363--375. Google ScholarDigital Library
- Michael A. Bender, Martin Farach-Colton, Jeremy T. Fineman, Yonatan R. Fogel, Bradley C. Kuszmaul, and Jelani Nelson. 2007. Cache-Oblivious Streaming B-trees. In Proceedings of the 19th Annual ACM Symposium on Parallel Algorithms and Architectures. ACM, 81--92. Google ScholarDigital Library
- Michael A. Bender, Martín Farach-Colton, Rob Johnson, Simon Mauras, Tyler Mayer, Cynthia A. Phillips, and Helen Xu. 2017. Write-Optimized Skip Lists. In Proceedings of the 36th ACM Symposium on Principles of Database Systems (PODS '17). ACM, New York, NY, USA, 69--78. Google ScholarDigital Library
- Burton H. Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (1970), 422--426. Google ScholarDigital Library
- Douglas Comer. 1979. Ubiquitous B-tree. ACM Computing Surveys (CSUR) 11, 2 (1979), 121--137. Google ScholarDigital Library
- Alexander Conway, Ainesh Bakshi, Yizheng Jiao, William Jannen, Yang Zhan, Jun Yuan, Michael A. Bender, Rob Johnson, Bradley C. Kuszmaul, Donald E. Porter, Jun Yuan, and Martin Farach-Colton. 2017. File Systems Fated for Senescence? Nonsense, Says Science!. In Proceedings of the 15th USENIX Conference on File and Storage Technologies (FAST 17). 45--58. Google ScholarDigital Library
- Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking Cloud Serving Systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing (SOCC 10). ACM, 143--154. Google ScholarDigital Library
- Biplob Debnath, Sudipta Sengupta, and Jin Li. 2011. SkimpyStash: RAM Space Skimpy Key-Value Store on Flash-Based Storage. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. ACM, 25--36. Google ScholarDigital Library
- Robert Escriva, Bernard Wong, and Emin Gün Sirer. 2012. HyperDex: a distributed, searchable key-value store. In Proceedings of the ACM SIGCOMM 2012 Conference. 25--36. Google ScholarDigital Library
- Facebook. 2017. FIFO compaction style. https://github.com/facebook/rocksdb/wiki/FIFO-compaction-style. (2017).Google Scholar
- Facebook. 2017. RocksDB | A persistent key-value store. http://rocksdb.org. (2017).Google Scholar
- Facebook. 2017. RocksDB Users. https://github.com/facebook/rocksdb/blob/master/USERS.md. (2017).Google Scholar
- Facebook. 2017. Universal Compaction. https://github.com/facebook/rocksdb/wiki/Universal-Compaction. (2017).Google Scholar
- Guy Golan-Gueta, Edward Bortnikov, Eshcar Hillel, and Idit Keidar. 2015. Scaling Concurrent Log-structured Data Stores. In Proceedings of the Tenth European Conference on Computer Systems (Eurosys 15). ACM, 32. Google ScholarDigital Library
- Daniel Golovin. 2010. The B-Skip-List: A Simpler Uniquely Represented Alternative to B-Trees. CoRR abs/1005.0662 (2010).Google Scholar
- Google. 2017. LevelDB. https://github.com/google/leveldb. (2017).Google Scholar
- Laura M. Grupp, Adrian M. Caulfield, Joel Coburn, Steven Swanson, Eitan Yaakobi, Paul H. Siegel, and Jack K. Wolf. 2009. Characterizing Flash Memory: Anomalies, Observations, and Applications. In Proceedings of 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42). IEEE, 24--33. Google ScholarDigital Library
- James Hamilton. 2009. The Cost of Latency. http://perspectives.mvdirona.com/2009/10/the-cost-of-latency/. (2009).Google Scholar
- HyperDex. 2016. HyperDex Benchmark Setup. http://hyperdex.org/performance/setup/. (2016).Google Scholar
- HyperDex. 2017. HyperLevelDB Performance Benchmarks. http://hyperdex.org/performance/leveldb/. (2017).Google Scholar
- Cockroach Labs. 2017. CockroachDB. https://github.com/cockroachdb/cockroach. (2017).Google Scholar
- Dgraph labs. 2017. Dgraph: Graph database for production environment. https://dgraph.io. (2017).Google Scholar
- FAL Labs. 2011. Kyoto Cabinet: a straightforward implementation of DBM. http://fallabs.com/kyotocabinet/. (2011).Google Scholar
- LevelDB. 2016. LevelDB db_bench benchmark. https://github.com/google/leveldb/blob/master/db/db_bench.cc. (2016).Google Scholar
- Hyeontaek Lim, Bin Fan, David G. Andersen, and Michael Kaminsky. 2011. SILT: A memory-efficient, high-performance key-value store. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (SOSP 11). ACM, 1--13. Google ScholarDigital Library
- LinkedIn. 2016. FollowFeed: LinkedIn's Feed Made Faster and Smarter. http://bit.ly/2onMQwN. (2016).Google Scholar
- Percona LLC. 2017. Percona TokuDB. https://www.percona.com/software/mysql-database/percona-tokudb. (2017).Google Scholar
- Lanyue Lu, Thanumalayan Sankaranarayana Pillai, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2016. WiscKey: Separating Keys from Values in SSD-conscious Storage. In Proceedings of the 14th USENIX Conference on File and Storage Technologies (FAST 16). 133--148. Google ScholarDigital Library
- Leonardo Marmol, Swaminathan Sundararaman, Nisha Talagala, and Raju Rangaswami. 2015. NVMKV: a Scalable, Lightweight, FTL-aware Key-Value Store. In 2015 USENIX Annual Technical Conference (USENIX ATC 15). 207--219. Google ScholarDigital Library
- Neal Mielke, Todd Marquart, Ning Wu, Jeff Kessenich, Hanmant Belgal, Eric Schares, Falgun Trivedi, Evan Goodness, and Leland R. Nevill. 2008. Bit Error Rate in NAND Flash Memories. In Proceedings of the IEEE International Reliability Physics Symposium, (IRPS 08). IEEE, 9--19.Google Scholar
- MongoDB. 2017. MongoDB. https://www.mongodb.com. (2017).Google Scholar
- Dushyanth Narayanan, Eno Thereska, Austin Donnelly, Sameh Elnikety, and Antony Rowstron. 2009. Migrating server storage to SSDs: analysis of tradeoffs. In Proceedings of the 4th ACM European conference on Computer Systems (Eurosys 09). ACM, 145--158. Google ScholarDigital Library
- Suman Nath and Aman Kansal. 2007. FlashDB: Dynamic Self-tuning Database for NAND flash. In Proceedings of the 6th International Conference on Information Processing in Sensor Networks. ACM, 410--419. Google ScholarDigital Library
- Netflix. 2016. Application Data Caching using SSDs. http://techblog.netflix.com/2016/05/application-data-caching-using-ssds.html. (May 2016).Google Scholar
- Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil. 1996. The log-structured merge-tree (LSM-tree). Acta Informatica 33, 4 (1996), 351--385. Google ScholarDigital Library
- Oracle. 2017. Oracle Berkeley DB. http//:www.oracle.com/technetwork/database/database-technologies/berkeleydb/overview/index.html. (2017).Google Scholar
- Pinterest. 2016. Open-sourcing Rocksplicator, a real-time RocksDB data replicator. http://bit.ly/2pv5nZZ. (2016).Google Scholar
- William Pugh. 1989. Skip lists: A probabilistic alternative to balanced trees. Algorithms and Data Structures (1989), 437--449. Google ScholarDigital Library
- William Pugh. 1990. A Skip List Cookbook. Technical Report CS-TR-2286.1. University of Maryland. Google ScholarDigital Library
- Parthasarathy Ranganathan. 2011. From Microprocessors to Nanostores: Rethinking Data-Centric Systems. Computer 44, 1 (2011), 39--48. Google ScholarDigital Library
- Apache Samza. 2017. State Management. http://samza.apache.org/learn/documentation/0.8/container/state-management.html. (2017).Google Scholar
- Russell Sears and Raghu Ramakrishnan. 2012. bLSM:a General Purpose Log Structured Merge Tree. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM, 217--228. Google ScholarDigital Library
- Pradeep J. Shetty, Richard P. Spillane, Ravikant R. Malpani, Binesh Andrews, Justin Seyster, and Erez Zadok. 2013. Building Workload-Independent Storage with VT-trees. In Proceedings of the 11th USENIX Conference on File and Storage Technologies (FAST 13). 17--30. Google ScholarDigital Library
- RocksDB Issue Tracker. 2014. Strategies to Reduce Write Amplification 19. https://github.com/facebook/rocksdb/issues/19. (2014).Google Scholar
- Uber. 2016. Cherami: Uber Engineering's Durable and Scalable Queue in Go. https://eng.uber.com/cherami/. (2016).Google Scholar
- Vijay Vasudevan, Michael Kaminsky, and David G. Andersen. 2012. Using Vector Interfaces To Deliver Millions Of IOPS From A Networked Key-Value Storage Server. In Proceedings of the Third ACM Symposium on Cloud Computing (SOCC 12). ACM, 8. Google ScholarDigital Library
- Peng Wang, Guangyu Sun, Song Jiang, Jian Ouyang, Shiding Lin, Chen Zhang, and Jason Cong. 2014. An Efficient Design And Implementation Of LSM-Tree Based Key-Value Store On Open-Channel SSD. In Proceedings of the Ninth European Conference on Computer Systems (Eurosys 14). ACM, 16. Google ScholarDigital Library
- Xingbo Wu, Yuehai Xu, Zili Shao, and Song Jiang. 2015. LSM-trie: An LSM-Tree-Based Ultra-Large Key-Value Store for Small Data Items. In Proceedings of the 2015 USENIX Annual Technical Conference (USENIX ATC 15). 71--82. Google ScholarDigital Library
- Demetrios Zeinalipour-Yazti, Song Lin, Vana Kalogeraki, Dimitrios Gunopulos, and Walid A. Najjar. 2005. MicroHash: An Efficient Index Structure for Flash-Based Sensor Devices. In Proceedings of the 4th USENIX Conference on File and Storage Technologies (FAST '05). Google ScholarDigital Library
Index Terms
- PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees
Recommendations
HiLSM: an LSM-based key-value store for hybrid NVM-SSD storage systems
CF '20: Proceedings of the 17th ACM International Conference on Computing FrontiersIn order to ensure data durability and crash consistency, the LSM-tree based key-value stores suffer from high WAL synchronization overhead. Fortunately, the advent of NVM offers an opportunity to address this issue. However, NVM is currently too ...
An Efficient Memory-Mapped Key-Value Store for Flash Storage
SoCC '18: Proceedings of the ACM Symposium on Cloud ComputingPersistent key-value stores have emerged as a main component in the data access path of modern data processing systems. However, they exhibit high CPU and I/O overhead. Today, due to power limitations it is important to reduce CPU overheads for data ...
BoLT: Barrier-optimized LSM-Tree
Middleware '20: Proceedings of the 21st International Middleware ConferenceKey-value stores such as LevelDB and RocksDB are widely used in various systems due to their high write performance. However, the background compaction operations inherent to the key-value stores are often to blame for write amplification and write ...
Comments