DOI: 10.1145/2741948.2741964
Research article · Open Access

Large-scale cluster management at Google with Borg

Published: 17 April 2015

ABSTRACT

Google's Borg system is a cluster manager that runs hundreds of thousands of jobs, from many thousands of different applications, across a number of clusters each with up to tens of thousands of machines.

It achieves high utilization by combining admission control, efficient task-packing, over-commitment, and machine sharing with process-level performance isolation. It supports high-availability applications with runtime features that minimize fault-recovery time, and scheduling policies that reduce the probability of correlated failures. Borg simplifies life for its users by offering a declarative job specification language, name service integration, real-time job monitoring, and tools to analyze and simulate system behavior.
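To make the packing and over-commitment ideas concrete, here is a minimal, hypothetical sketch (this is not Borg's actual scheduler or scoring policy, and all names and numbers are invented for illustration): a greedy best-fit placer that admits a task only if it fits within a machine's over-committed capacity, and otherwise leaves it pending.

```python
# Illustrative sketch only: greedy best-fit packing with a simple
# over-commitment factor. Borg's real scheduler uses its own admission
# control and scoring policies; everything here is hypothetical.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Machine:
    name: str
    cpu: float           # total CPU cores
    ram_gib: float       # total RAM in GiB
    cpu_used: float = 0.0
    ram_used: float = 0.0

@dataclass
class Task:
    job: str
    cpu: float
    ram_gib: float

def fits(m: Machine, t: Task, overcommit: float = 1.2) -> bool:
    """Admit a task only if it fits within the machine's (over-committed) capacity."""
    return (m.cpu_used + t.cpu <= m.cpu * overcommit and
            m.ram_used + t.ram_gib <= m.ram_gib * overcommit)

def best_fit(machines: List[Machine], t: Task) -> Optional[Machine]:
    """Place the task on the feasible machine with the least CPU headroom left."""
    feasible = [m for m in machines if fits(m, t)]
    if not feasible:
        return None  # a real system would queue the task or reject it at admission
    best = min(feasible, key=lambda m: m.cpu - (m.cpu_used + t.cpu))
    best.cpu_used += t.cpu
    best.ram_gib_used = None  # (unused placeholder removed below)
    best.ram_used += t.ram_gib
    return best

if __name__ == "__main__":
    cell = [Machine("m1", cpu=16, ram_gib=64), Machine("m2", cpu=32, ram_gib=128)]
    for i in range(5):
        placed = best_fit(cell, Task(job="webserver", cpu=4, ram_gib=8))
        print(f"task {i} -> {placed.name if placed else 'pending'}")
```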

We present a summary of the Borg system architecture and features, important design decisions, a quantitative analysis of some of its policy decisions, and a qualitative examination of lessons learned from a decade of operational experience with it.


Supplemental Material

a18-sidebyside.mp4 (mp4, 1 GB)


Published in

EuroSys '15: Proceedings of the Tenth European Conference on Computer Systems
April 2015, 503 pages
ISBN: 9781450332385
DOI: 10.1145/2741948
              Copyright © 2015 Owner/Author

              Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

              Publisher

              Association for Computing Machinery

              New York, NY, United States


Acceptance Rates

Overall acceptance rate: 241 of 1,308 submissions, 18%
