skip to main content
10.5555/1298455.1298485acmconferencesArticle/Chapter ViewAbstractPublication PagesosdiConference Proceedingsconference-collections
Article

Ceph: a scalable, high-performance distributed file system

Published:06 November 2006Publication History

ABSTRACT

We have developed Ceph, a distributed file system that provides excellent performance, reliability, and scalability. Ceph maximizes the separation between data and metadata management by replacing allocation tables with a pseudo-random data distribution function (CRUSH) designed for heterogeneous and dynamic clusters of unreliable object storage devices (OSDs). We leverage device intelligence by distributing data replication, failure detection and recovery to semi-autonomous OSDs running a specialized local object file system. A dynamic distributed metadata cluster provides extremely efficient metadata management and seamlessly adapts to a wide range of general purpose and scientific computing file system workloads. Performance measurements under a variety of workloads show that Ceph has excellent I/O performance and scalable metadata management, supporting more than 250,000 metadata operations per second.

References

  1. A. Adya, W. J. Bolosky, M. Castro, R. Chaiken, G. Cermak, J. R. Douceur, J. Howell, J. R. Lorch, M. Theimer, and R. Wattenhofer. FARSITE: Federated, available, and reliable storage for an incompletely trusted environment. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI), Boston, MA, Dec. 2002. USENIX. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. P. A. Alsberg and J. D. Day. A principle for resilient sharing of distributed resources. In Proceedings of the 2nd International Conference on Software Engineering, pages 562--570. IEEE Computer Society Press, 1976. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. A. Azagury, V. Dreizin, M. Factor, E. Henis, D. Naor, N. Rinetzky, O. Rodeh, J. Satran, A. Tavory, and L. Yerushalmi. Towards an object store. In Proceedings of the 20th IEEE / 11th NASA Goddard Conference on Mass Storage Systems and Technologies, pages 165--176, Apr. 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. P. J. Braam. The Lustre storage architecture. http://www.lustre.org/documentation.html, Cluster File Systems, Inc., Aug. 2004.Google ScholarGoogle Scholar
  5. L.-F. Cabrera and D. D. E. Long. Swift: Using distributed disk striping to provide high I/O data rates. Computing Systems, 4(4):405--436, 1991.Google ScholarGoogle Scholar
  6. P. F. Corbett and D. G. Feitelson. The Vesta parallel file system. ACM Transactions on Computer Systems, 14(3):225--264, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP '03), Bolton Landing, NY, Oct. 2003. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. G. A. Gibson, D. F. Nagle, K. Amiri, J. Butler, F. W. Chang, H. Gobioff, C. Hardin, E. Riedel, D. Rochberg, and J. Zelenka. A cost-effective, high-bandwidth storage architecture. In Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 92--103, San Jose, CA, Oct. 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. D. Hildebrand and P. Honeyman. Exporting storage systems in a scalable manner with pNFS. Technical Report CITI-05-1, CITI, University of Michigan, Feb. 2005.Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. D. Karger, E. Lehman, T. Leighton, M. Levine, D. Lewin, and R. Panigrahy. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web. In ACM Symposium on Theory of Computing, pages 654--663, May 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. J. Kubiatowicz, D. Bindel, Y. Chen, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, C. Wells, and B. Zhao. OceanStore: An architecture for global-scale persistent storage. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Cambridge, MA, Nov. 2000. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. R. Latham, N. Miller, R. Ross, and P. Carns. A next-generation parallel file system for Linux clusters. Linux-World, pages 56--59, Jan. 2004.Google ScholarGoogle Scholar
  13. A. Leung and E. L. Miller. Scalable security for large, high performance storage systems. In Proceedings of the 2006 ACM Workshop on Storage Security and Survivability. ACM, Oct. 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. B. Liskov, S. Ghemawat, R. Gruber, P. Johnson, L. Shrira, and M. Williams. Replication in the Harp file system. In Proceedings of the 13th ACM Symposium on Operating Systems Principles (SOSP '91), pages 226--238. ACM, 1991. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. C. R. Lumb, G. R. Ganger, and R. Golding. D-SPTF: Decentralized request distribution in brick-based storage systems. In Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 37--47, Boston, MA, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. J. Menon, D. A. Pease, R. Rees, L. Duyanovich, and B. Hillsberg. IBM Storage Tank---a heterogeneous scalable SAN file system. IBM Systems Journal, 42(2):250--267, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. N. Nieuwejaar and D. Kotz. The Galley parallel file system. In Proceedings of 10th ACM International Conference on Supercomputing, pages 374--381, Philadelphia, PA, 1996. ACM Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. N. Nieuwejaar, D. Kotz, A. Purakayastha, C. S. Ellis, and M. Best. File-access characteristics of parallel scientific workloads. IEEE Transactions on Parallel and Distributed Systems, 7(10):1075--1089, Oct. 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. C. A. Olson and E. L. Miller. Secure capabilities for a petabyte-scale object-based distributed file system. In Proceedings of the 2005 ACM Workshop on Storage Security and Survivability, Fairfax, VA, Nov. 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. B. Pawlowski, C. Juszczak, P. Staubach, C. Smith, D. Lebel, and D. Hitz. NFS version 3: Design and implementation. In Proceedings of the Summer 1994 USENIX Technical Conference, pages 137--151, 1994.Google ScholarGoogle Scholar
  21. O. Rodeh and A. Teperman. zFS---a scalable distributed file system using object disks. In Proceedings of the 20th IEEE / 11th NASA Goddard Conference on Mass Storage Systems and Technologies, pages 207--218, Apr. 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. D. Roselli, J. Lorch, and T. Anderson. A comparison of file system workloads. In Proceedings of the 2000 USENIX Annual Technical Conference, pages 41--54, San Diego, CA, June 2000. USENIX Association. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Y. Saito, S. Frølund, A. Veitch, A. Merchant, and S. Spence. FAB: Building distributed enterprise disk arrays from commodity components. In Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 48--58, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. F. Schmuck and R. Haskin. GPFS: A shared-disk file system for large computing clusters. In Proceedings of the 2002 Conference on File and Storage Technologies (FAST), pages 231--244. USENIX, Jan. 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. M. Szeredi. File System in User Space. http://fuse.sourceforge.net, 2006.Google ScholarGoogle Scholar
  26. H. Tang, A. Gulbeden, J. Zhou, W. Strathearn, T. Yang, and L. Chu. A self-organizing storage cluster for parallel data-intensive applications. In Proceedings of the 2004 ACM/IEEE Conference on Supercomputing (SC '04), Pittsburgh, PA, Nov. 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. F. Wang, Q. Xin, B. Hong, S. A. Brandt, E. L. Miller, D. D. E. Long, and T. T. McLarty. File system workload analysis for large scale scientific computing applications. In Proceedings of the 21st IEEE / 12th NASA Goddard Conference on Mass Storage Systems and Technologies, pages 139--152, College Park, MD, Apr. 2004.Google ScholarGoogle Scholar
  28. S. A. Weil. Scalable archival data and metadata management in object-based file systems. Technical Report SSRC-04-01, University of California, Santa Cruz, May 2004.Google ScholarGoogle Scholar
  29. S. A. Weil, S. A. Brandt, E. L. Miller, and C. Maltzahn. CRUSH: Controlled, scalable, decentralized placement of replicated data. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (SC '06), Tampa, FL, Nov. 2006. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. S. A. Weil, K. T. Pollack, S. A. Brandt, and E. L. Miller. Dynamic metadata management for petabyte-scale file systems. In Proceedings of the 2004 ACM/IEEE Conference on Supercomputing (SC '04). ACM, Nov. 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. B. Welch. POSIX IO extensions for HPC. In Proceedings of the 4th USENIX Conference on File and Storage Technologies (FAST), Dec. 2005.Google ScholarGoogle Scholar
  32. B. Welch and G. Gibson. Managing scalability in object storage systems for HPC Linux clusters. In Proceedings of the 21st IEEE / 12th NASA Goddard Conference on Mass Storage Systems and Technologies, pages 433--445, Apr. 2004.Google ScholarGoogle Scholar
  33. B. S. White, M. Walker, M. Humphrey, and A. S. Grimshaw. LegionFS: A secure and scalable file system supporting cross-domain high-performance applications. In Proceedings of the 2001 ACM/IEEE Conference on Supercomputing (SC '01), Denver, CO, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. J. Wilkes, R. Golding, C. Staelin, and T. Sullivan. The HP AutoRAID hierarchical storage system. In Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP '95), pages 96--108, Copper Mountain, CO, 1995. ACM Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. T. M. Wong, R. A. Golding, J. S. Glider, E. Borowsky, R. A. Becker-Szendy, C. Fleiner, D. R. Kenchammana-Hosekote, and O. A. Zaki. Kybos: self-management for distributed brick-base storage. Research Report RJ 10356, IBM Almaden Research Center, Aug. 2005.Google ScholarGoogle Scholar
  36. J. C. Wu and S. A. Brandt. The design and implementation of AQuA: an adaptive quality of service aware object-based storage device. In Proceedings of the 23rd IEEE / 14th NASA Goddard Conference on Mass Storage Systems and Technologies, pages 209--218, College Park, MD, May 2006.Google ScholarGoogle Scholar
  37. Q. Xin, E. L. Miller, and T. J. E. Schwarz. Evaluation of distributed recovery in large-scale storage systems. In Proceedings of the 13th IEEE International Symposium on High Performance Distributed Computing (HPDC), pages 172--181, Honolulu, HI, June 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  1. Ceph: a scalable, high-performance distributed file system

                Recommendations

                Comments

                Login options

                Check if you have access through your login credentials or your institution to get full access on this article.

                Sign in
                • Published in

                  cover image ACM Conferences
                  OSDI '06: Proceedings of the 7th symposium on Operating systems design and implementation
                  November 2006
                  407 pages
                  ISBN:1931971471

                  Publisher

                  USENIX Association

                  United States

                  Publication History

                  • Published: 6 November 2006

                  Check for updates

                  Qualifiers

                  • Article

                PDF Format

                View or Download as a PDF file.

                PDF

                eReader

                View online with eReader.

                eReader