ABSTRACT
We have developed Ceph, a distributed file system that provides excellent performance, reliability, and scalability. Ceph maximizes the separation between data and metadata management by replacing allocation tables with a pseudo-random data distribution function (CRUSH) designed for heterogeneous and dynamic clusters of unreliable object storage devices (OSDs). We leverage device intelligence by distributing data replication, failure detection and recovery to semi-autonomous OSDs running a specialized local object file system. A dynamic distributed metadata cluster provides extremely efficient metadata management and seamlessly adapts to a wide range of general purpose and scientific computing file system workloads. Performance measurements under a variety of workloads show that Ceph has excellent I/O performance and scalable metadata management, supporting more than 250,000 metadata operations per second.
- A. Adya, W. J. Bolosky, M. Castro, R. Chaiken, G. Cermak, J. R. Douceur, J. Howell, J. R. Lorch, M. Theimer, and R. Wattenhofer. FARSITE: Federated, available, and reliable storage for an incompletely trusted environment. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI), Boston, MA, Dec. 2002. USENIX. Google ScholarDigital Library
- P. A. Alsberg and J. D. Day. A principle for resilient sharing of distributed resources. In Proceedings of the 2nd International Conference on Software Engineering, pages 562--570. IEEE Computer Society Press, 1976. Google ScholarDigital Library
- A. Azagury, V. Dreizin, M. Factor, E. Henis, D. Naor, N. Rinetzky, O. Rodeh, J. Satran, A. Tavory, and L. Yerushalmi. Towards an object store. In Proceedings of the 20th IEEE / 11th NASA Goddard Conference on Mass Storage Systems and Technologies, pages 165--176, Apr. 2003. Google ScholarDigital Library
- P. J. Braam. The Lustre storage architecture. http://www.lustre.org/documentation.html, Cluster File Systems, Inc., Aug. 2004.Google Scholar
- L.-F. Cabrera and D. D. E. Long. Swift: Using distributed disk striping to provide high I/O data rates. Computing Systems, 4(4):405--436, 1991.Google Scholar
- P. F. Corbett and D. G. Feitelson. The Vesta parallel file system. ACM Transactions on Computer Systems, 14(3):225--264, 1996. Google ScholarDigital Library
- S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP '03), Bolton Landing, NY, Oct. 2003. ACM. Google ScholarDigital Library
- G. A. Gibson, D. F. Nagle, K. Amiri, J. Butler, F. W. Chang, H. Gobioff, C. Hardin, E. Riedel, D. Rochberg, and J. Zelenka. A cost-effective, high-bandwidth storage architecture. In Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 92--103, San Jose, CA, Oct. 1998. Google ScholarDigital Library
- D. Hildebrand and P. Honeyman. Exporting storage systems in a scalable manner with pNFS. Technical Report CITI-05-1, CITI, University of Michigan, Feb. 2005.Google ScholarDigital Library
- D. Karger, E. Lehman, T. Leighton, M. Levine, D. Lewin, and R. Panigrahy. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web. In ACM Symposium on Theory of Computing, pages 654--663, May 1997. Google ScholarDigital Library
- J. Kubiatowicz, D. Bindel, Y. Chen, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, C. Wells, and B. Zhao. OceanStore: An architecture for global-scale persistent storage. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Cambridge, MA, Nov. 2000. ACM. Google ScholarDigital Library
- R. Latham, N. Miller, R. Ross, and P. Carns. A next-generation parallel file system for Linux clusters. Linux-World, pages 56--59, Jan. 2004.Google Scholar
- A. Leung and E. L. Miller. Scalable security for large, high performance storage systems. In Proceedings of the 2006 ACM Workshop on Storage Security and Survivability. ACM, Oct. 2006. Google ScholarDigital Library
- B. Liskov, S. Ghemawat, R. Gruber, P. Johnson, L. Shrira, and M. Williams. Replication in the Harp file system. In Proceedings of the 13th ACM Symposium on Operating Systems Principles (SOSP '91), pages 226--238. ACM, 1991. Google ScholarDigital Library
- C. R. Lumb, G. R. Ganger, and R. Golding. D-SPTF: Decentralized request distribution in brick-based storage systems. In Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 37--47, Boston, MA, 2004. Google ScholarDigital Library
- J. Menon, D. A. Pease, R. Rees, L. Duyanovich, and B. Hillsberg. IBM Storage Tank---a heterogeneous scalable SAN file system. IBM Systems Journal, 42(2):250--267, 2003. Google ScholarDigital Library
- N. Nieuwejaar and D. Kotz. The Galley parallel file system. In Proceedings of 10th ACM International Conference on Supercomputing, pages 374--381, Philadelphia, PA, 1996. ACM Press. Google ScholarDigital Library
- N. Nieuwejaar, D. Kotz, A. Purakayastha, C. S. Ellis, and M. Best. File-access characteristics of parallel scientific workloads. IEEE Transactions on Parallel and Distributed Systems, 7(10):1075--1089, Oct. 1996. Google ScholarDigital Library
- C. A. Olson and E. L. Miller. Secure capabilities for a petabyte-scale object-based distributed file system. In Proceedings of the 2005 ACM Workshop on Storage Security and Survivability, Fairfax, VA, Nov. 2005. Google ScholarDigital Library
- B. Pawlowski, C. Juszczak, P. Staubach, C. Smith, D. Lebel, and D. Hitz. NFS version 3: Design and implementation. In Proceedings of the Summer 1994 USENIX Technical Conference, pages 137--151, 1994.Google Scholar
- O. Rodeh and A. Teperman. zFS---a scalable distributed file system using object disks. In Proceedings of the 20th IEEE / 11th NASA Goddard Conference on Mass Storage Systems and Technologies, pages 207--218, Apr. 2003. Google ScholarDigital Library
- D. Roselli, J. Lorch, and T. Anderson. A comparison of file system workloads. In Proceedings of the 2000 USENIX Annual Technical Conference, pages 41--54, San Diego, CA, June 2000. USENIX Association. Google ScholarDigital Library
- Y. Saito, S. Frølund, A. Veitch, A. Merchant, and S. Spence. FAB: Building distributed enterprise disk arrays from commodity components. In Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 48--58, 2004. Google ScholarDigital Library
- F. Schmuck and R. Haskin. GPFS: A shared-disk file system for large computing clusters. In Proceedings of the 2002 Conference on File and Storage Technologies (FAST), pages 231--244. USENIX, Jan. 2002. Google ScholarDigital Library
- M. Szeredi. File System in User Space. http://fuse.sourceforge.net, 2006.Google Scholar
- H. Tang, A. Gulbeden, J. Zhou, W. Strathearn, T. Yang, and L. Chu. A self-organizing storage cluster for parallel data-intensive applications. In Proceedings of the 2004 ACM/IEEE Conference on Supercomputing (SC '04), Pittsburgh, PA, Nov. 2004. Google ScholarDigital Library
- F. Wang, Q. Xin, B. Hong, S. A. Brandt, E. L. Miller, D. D. E. Long, and T. T. McLarty. File system workload analysis for large scale scientific computing applications. In Proceedings of the 21st IEEE / 12th NASA Goddard Conference on Mass Storage Systems and Technologies, pages 139--152, College Park, MD, Apr. 2004.Google Scholar
- S. A. Weil. Scalable archival data and metadata management in object-based file systems. Technical Report SSRC-04-01, University of California, Santa Cruz, May 2004.Google Scholar
- S. A. Weil, S. A. Brandt, E. L. Miller, and C. Maltzahn. CRUSH: Controlled, scalable, decentralized placement of replicated data. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (SC '06), Tampa, FL, Nov. 2006. ACM. Google ScholarDigital Library
- S. A. Weil, K. T. Pollack, S. A. Brandt, and E. L. Miller. Dynamic metadata management for petabyte-scale file systems. In Proceedings of the 2004 ACM/IEEE Conference on Supercomputing (SC '04). ACM, Nov. 2004. Google ScholarDigital Library
- B. Welch. POSIX IO extensions for HPC. In Proceedings of the 4th USENIX Conference on File and Storage Technologies (FAST), Dec. 2005.Google Scholar
- B. Welch and G. Gibson. Managing scalability in object storage systems for HPC Linux clusters. In Proceedings of the 21st IEEE / 12th NASA Goddard Conference on Mass Storage Systems and Technologies, pages 433--445, Apr. 2004.Google Scholar
- B. S. White, M. Walker, M. Humphrey, and A. S. Grimshaw. LegionFS: A secure and scalable file system supporting cross-domain high-performance applications. In Proceedings of the 2001 ACM/IEEE Conference on Supercomputing (SC '01), Denver, CO, 2001. Google ScholarDigital Library
- J. Wilkes, R. Golding, C. Staelin, and T. Sullivan. The HP AutoRAID hierarchical storage system. In Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP '95), pages 96--108, Copper Mountain, CO, 1995. ACM Press. Google ScholarDigital Library
- T. M. Wong, R. A. Golding, J. S. Glider, E. Borowsky, R. A. Becker-Szendy, C. Fleiner, D. R. Kenchammana-Hosekote, and O. A. Zaki. Kybos: self-management for distributed brick-base storage. Research Report RJ 10356, IBM Almaden Research Center, Aug. 2005.Google Scholar
- J. C. Wu and S. A. Brandt. The design and implementation of AQuA: an adaptive quality of service aware object-based storage device. In Proceedings of the 23rd IEEE / 14th NASA Goddard Conference on Mass Storage Systems and Technologies, pages 209--218, College Park, MD, May 2006.Google Scholar
- Q. Xin, E. L. Miller, and T. J. E. Schwarz. Evaluation of distributed recovery in large-scale storage systems. In Proceedings of the 13th IEEE International Symposium on High Performance Distributed Computing (HPDC), pages 172--181, Honolulu, HI, June 2004. Google ScholarDigital Library
- Ceph: a scalable, high-performance distributed file system
Recommendations
Ceph: a scalable, high-performance distributed file system
OSDI '06: Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7We have developed Ceph, a distributed file system that provides excellent performance, reliability, and scalability. Ceph maximizes the separation between data and metadata management by replacing allocation tables with a pseudo-random data distribution ...
Using ceph's BlueStore as object storage in HPC storage framework
CHEOPS '21: Proceedings of the Workshop on Challenges and Opportunities of Efficient and Performant Storage SystemsIn times of ever-increasing data sizes, data management and insightful analysis are amidst the most severe challenges of high-performance computing. While high-level libraries such as NetCDF, HDF5, and ADIOS2, as well as the associated self-describing ...
Comments