ABSTRACT
Popular SSD-based key-value stores consume a large amount of DRAM in order to provide high-performance database operations. However, DRAM can be expensive for data center providers, especially given recent global supply shortages that have driven up DRAM costs. In this work, we design a key-value store, MyNVM, which leverages an NVM block device to reduce DRAM usage and the total cost of ownership, while providing latency and queries-per-second (QPS) comparable to MyRocks on a server with a much larger amount of DRAM. Replacing DRAM with NVM introduces several challenges. In particular, NVM has limited read bandwidth, and it wears out quickly under a high write bandwidth.
We design novel solutions to these challenges: using small block sizes with a partitioned index, aligning blocks post-compression to reduce read bandwidth, utilizing dictionary compression, implementing an admission control policy that determines which objects get cached in NVM to control its durability, and replacing interrupts with a hybrid polling mechanism. We implemented MyNVM and measured its performance in Facebook's production environment. Our implementation reduces the size of the DRAM cache from 96 GB to 16 GB and incurs a negligible impact on latency and queries-per-second compared to MyRocks. Finally, to the best of our knowledge, this is the first study on the usage of NVM devices in a commercial data center environment.
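The abstract does not describe how the admission control policy decides what to cache. As a rough illustration of the general idea only (bounding the write rate to NVM so the device does not wear out too quickly), the sketch below uses a token bucket. This is a swapped-in, generic technique, not MyNVM's actual policy; the class name `WriteBudgetAdmission`, its parameters, and the byte budgets are all hypothetical.

```python
class WriteBudgetAdmission:
    """Hypothetical token-bucket admission policy for an NVM cache.

    An object is admitted (written to NVM) only if the current write
    budget covers its size. This is an illustrative assumption, not the
    policy used by MyNVM.
    """

    def __init__(self, bytes_per_second, burst_bytes):
        self.rate = bytes_per_second   # sustained write budget (assumed)
        self.capacity = burst_bytes    # maximum accumulated budget
        self.tokens = burst_bytes      # start with a full budget
        self.last = 0.0                # timestamp of the last refill

    def admit(self, size_bytes, now):
        """Return True if an object of size_bytes may be cached at time now."""
        # Refill the budget in proportion to the elapsed time.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        # Admit only when the remaining budget covers the whole write.
        if size_bytes <= self.tokens:
            self.tokens -= size_bytes
            return True
        return False


# Usage: a 1000 B/s budget with a 4000 B burst allowance.
policy = WriteBudgetAdmission(bytes_per_second=1000, burst_bytes=4000)
first = policy.admit(4000, now=0.0)    # consumes the whole burst budget
second = policy.admit(1, now=0.0)      # rejected: budget is exhausted
third = policy.admit(1000, now=1.0)    # one second later, 1000 B refilled
```

Objects rejected by such a policy would simply be served from the underlying flash without being cached in NVM, trading some hit rate for a bounded write rate.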