Abstract
For a decade, the Ceph distributed file system followed the conventional wisdom of building its storage backend on top of local file systems. This is a preferred choice for most distributed file systems today, because it allows them to benefit from the convenience and maturity of battle-tested code. Ceph’s experience, however, shows that this comes at a high price. First, developing a zero-overhead transaction mechanism is challenging. Second, metadata performance at the local level can significantly affect performance at the distributed level. Third, supporting emerging storage hardware is painstakingly slow.
Ceph addressed these issues with BlueStore, a new backend designed to run directly on raw storage devices. Within only two years of its inception, BlueStore outperformed previous established backends and was adopted by 70% of users in production. By running in user space and fully controlling the I/O stack, it has enabled space-efficient metadata and data checksums, fast overwrites of erasure-coded data, and inline compression; it has also decreased performance variability and avoided a series of performance pitfalls of local file systems. Finally, it makes the adoption of backward-incompatible storage hardware possible, an important trait in a changing storage landscape that is learning to embrace hardware diversity.
Index Terms
- The Case for Custom Storage Backends in Distributed Storage Systems