ABSTRACT
Google's Borg system is a cluster manager that runs hundreds of thousands of jobs, from many thousands of different applications, across a number of clusters each with up to tens of thousands of machines.
It achieves high utilization by combining admission control, efficient task-packing, over-commitment, and machine sharing with process-level performance isolation. It supports high-availability applications with runtime features that minimize fault-recovery time, and scheduling policies that reduce the probability of correlated failures. Borg simplifies life for its users by offering a declarative job specification language, name service integration, real-time job monitoring, and tools to analyze and simulate system behavior.
We present a summary of the Borg system architecture and features, important design decisions, a quantitative analysis of some of its policy decisions, and a qualitative examination of lessons learned from a decade of operational experience with it.
Index Terms
- Large-scale cluster management at Google with Borg