research-article
Open Access

The Case for Custom Storage Backends in Distributed Storage Systems

Published: 18 May 2020

Abstract

For a decade, the Ceph distributed file system followed the conventional wisdom of building its storage backend on top of local file systems. This is a preferred choice for most distributed file systems today, because it allows them to benefit from the convenience and maturity of battle-tested code. Ceph’s experience, however, shows that this comes at a high price. First, developing a zero-overhead transaction mechanism is challenging. Second, metadata performance at the local level can significantly affect performance at the distributed level. Third, supporting emerging storage hardware is painstakingly slow.

Ceph addressed these issues with BlueStore, a new backend designed to run directly on raw storage devices. In just two years since its inception, BlueStore has outperformed previously established backends and has been adopted by 70% of users in production. By running in user space and fully controlling the I/O stack, it has enabled space-efficient metadata and data checksums, fast overwrites of erasure-coded data, inline compression, and decreased performance variability, while avoiding a series of performance pitfalls of local file systems. Finally, it makes the adoption of backward-incompatible storage hardware possible, an important trait in a changing storage landscape that is learning to embrace hardware diversity.



Published in
ACM Transactions on Storage, Volume 16, Issue 2: SOSP 2019 Special Section and Regular Papers. May 2020, 194 pages.
ISSN: 1553-3077. EISSN: 1553-3093. DOI: 10.1145/3399155.
Editor: Sam H. Noh

          Copyright © 2020 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 18 May 2020
          • Online AM: 7 May 2020
          • Accepted: 1 March 2020
          • Received: 1 January 2020
