research-article

Open Access

Large-scale cluster management at Google with Borg

Authors:
Abhishek Verma

Google Inc.

Google Inc.
View Profile

,
Luis Pedrosa

Google Inc.

Google Inc.
View Profile

,
Madhukar Korupolu

Google Inc.

Google Inc.
View Profile

,
David Oppenheimer

Google Inc.

Google Inc.
View Profile

,
Eric Tune

Google Inc.

Google Inc.
View Profile

,
John Wilkes

Google Inc.

Google Inc.
View Profile

EuroSys '15: Proceedings of the Tenth European Conference on Computer SystemsApril 2015Article No.: 18Pages 1–17https://doi.org/10.1145/2741948.2741964

Published:17 April 2015Publication History

EuroSys '15: Proceedings of the Tenth European Conference on Computer Systems

Pages 1–17

ABSTRACT

Google's Borg system is a cluster manager that runs hundreds of thousands of jobs, from many thousands of different applications, across a number of clusters each with up to tens of thousands of machines.

It achieves high utilization by combining admission control, efficient task-packing, over-commitment, and machine sharing with process-level performance isolation. It supports high-availability applications with runtime features that minimize fault-recovery time, and scheduling policies that reduce the probability of correlated failures. Borg simplifies life for its users by offering a declarative job specification language, name service integration, real-time job monitoring, and tools to analyze and simulate system behavior.

We present a summary of the Borg system architecture and features, important design decisions, a quantitative analysis of some of its policy decisions, and a qualitative examination of lessons learned from a decade of operational experience with it.

Supplemental Material

a18-sidebyside.mp4

mp4

1 GB

Download

References

O. A. Abdul-Rahman and K. Aida. Towards understanding the usage behavior of Google cloud users: the mice and elephants phenomenon. In Proc. IEEE Int'l Conf. on Cloud Computing Technology and Science (CloudCom), pages 272--277, Singapore, Dec. 2014. Google ScholarDigital Library
Adaptive Computing Enterprises Inc., Provo, UT. Maui Scheduler Administrator's Guide, 3.2 edition, 2011.Google Scholar
T. Akidau, A. Balikov, K. Bekiroğlu, S. Chernyak, J. Haberman, R. Lax, S. McVeety, D. Mills, P. Nordstrom, and S. Whittle. MillWheel: fault-tolerant stream processing at internet scale. In Proc. Int'l Conf. on Very Large Data Bases (VLDB), pages 734--746, Riva del Garda, Italy, Aug. 2013.Google ScholarDigital Library
Y. Amir, B. Awerbuch, A. Barak, R. S. Borgstrom, and A. Keren. An opportunity cost approach for job assignment in a scalable computing cluster. IEEE Trans. Parallel Distrib. Syst., 11(7): 760--768, July 2000. Google ScholarDigital Library
Apache Aurora. http://aurora.incubator.apache.org/, 2014.Google Scholar
Aurora Configuration Tutorial. https://aurora.incubator.apache.org/documentation/latest/configuration-tutorial/, 2014.Google Scholar
AWS. Amazon Web Services VM Instances. http://aws.amazon.com/ec2/instance-types/, 2014.Google Scholar
J. Baker, C. Bond, J. Corbett, J. Furman, A. Khorlin, J. Larson, J.-M. Leon, Y. Li, A. Lloyd, and V. Yushprakh. Megastore: Providing scalable, highly available storage for interactive services. In Proc. Conference on Innovative Data Systems Research (CIDR), pages 223--234, Asilomar, CA, USA, Jan. 2011.Google Scholar
M. Baker and J. Ousterhout. Availability in the Sprite distributed file system. Operating Systems Review, 25(2): 95--98, Apr. 1991. Google ScholarDigital Library
L. A. Barroso, J. Clidaras, and U. Hölzle. The datacenter as a computer: an introduction to the design of warehouse-scale machines. Morgan Claypool Publishers, 2nd edition, 2013. Google ScholarDigital Library
L. A. Barroso, J. Dean, and U. Holzle. Web search for a planet: the Google cluster architecture. In IEEE Micro, pages 22--28, 2003. Google ScholarDigital Library
I. Bokharouss. GCL Viewer: a study in improving the understanding of GCL programs. Technical report, Eindhoven Univ. of Technology, 2008. MS thesis.Google Scholar
E. Boutin, J. Ekanayake, W. Lin, B. Shi, J. Zhou, Z. Qian, M. Wu, and L. Zhou. Apollo: scalable and coordinated scheduling for cloud-scale computing. In Proc. USENIX Symp. on Operating Systems Design and Implementation (OSDI), Oct. 2014. Google ScholarDigital Library
M. Burrows. The Chubby lock service for loosely-coupled distributed systems. In Proc. USENIX Symp. on Operating Systems Design and Implementation (OSDI), pages 335--350, Seattle, WA, USA, 2006. Google ScholarDigital Library
cAdvisor. https://github.com/google/cadvisor, 2014.Google Scholar
CFS per-entity load patches. http://lwn.net/Articles/531853, 2013.Google Scholar
cgroups. http://en.wikipedia.org/wiki/Cgroups, 2014.Google Scholar
C. Chambers, A. Raniwala, F. Perry, S. Adams, R. R. Henry, R. Bradshaw, and N. Weizenbaum. FlumeJava: easy, efficient data-parallel pipelines. In Proc. ACM SIGPLAN Conf. on Programming Language Design and Implementation (PLDI), pages 363--375, Toronto, Ontario, Canada, 2010. Google ScholarDigital Library
F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: a distributed storage system for structured data. ACM Trans. on Computer Systems, 26(2): 4:1--4:26, June 2008. Google ScholarDigital Library
Y. Chen, S. Alspaugh, and R. H. Katz. Design insights for MapReduce from diverse production workloads. Technical Report UCB/EECS--2012--17, UC Berkeley, Jan. 2012.Google Scholar
C. Curino, D. E. Difallah, C. Douglas, S. Krishnan, R. Ramakrishnan, and S. Rao. Reservation-based scheduling: if you're late don't blame us! In Proc. ACM Symp. on Cloud Computing (SoCC), pages 2:1--2:14, Seattle, WA, USA, 2014. Google ScholarDigital Library
J. Dean and L. A. Barroso. The tail at scale. Communications of the ACM, 56(2): 74--80, Feb. 2012. Google ScholarDigital Library
J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1): 107--113, 2008. Google ScholarDigital Library
C. Delimitrou and C. Kozyrakis. Paragon: QoS-aware scheduling for heterogeneous datacenters. In Proc. Int'l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Mar. 201. Google ScholarDigital Library
C. Delimitrou and C. Kozyrakis. Quasar: resource-efficient and QoS-aware cluster management. In Proc. Int'l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 127--144, Salt Lake City, UT, USA, 2014. Google ScholarDigital Library
S. Di, D. Kondo, and W. Cirne. Characterization and comparison of cloud versus Grid workloads. In International Conference on Cluster Computing (IEEE CLUSTER), pages 230--238, Beijing, China, Sept. 2012. Google ScholarDigital Library
S. Di, D. Kondo, and C. Franck. Characterizing cloud applications on a Google data center. In Proc. Int'l Conf. on Parallel Processing (ICPP), Lyon, France, Oct. 2013. Google ScholarDigital Library
Docker Project. https://www.docker.io/, 2014.Google Scholar
D. Dolev, D. G. Feitelson, J. Y. Halpern, R. Kupferman, and N. Linial. No justified complaints: on fair sharing of multiple resources. In Proc. Innovations in Theoretical Computer Science (ITCS), pages 68--75, Cambridge, MA, USA, 2012. Google ScholarDigital Library
ElasticSearch. http://www.elasticsearch.org, 2014.Google Scholar
D. G. Feitelson. Workload Modeling for Computer Systems Performance Evaluation. Cambridge University Press, 2014.Google Scholar
Fluentd. http://www.fluentd.org/, 2014.Google Scholar
GCE. Google Compute Engine. http://cloud.google.com/products/compute-engine/, 2014.Google Scholar
S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google File System. In Proc. ACM Symp. on Operating Systems Principles (SOSP), pages 29--43, Bolton Landing, NY, USA, 2003. ACM. Google ScholarDigital Library
A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica. Dominant Resource Fairness: fair allocation of multiple resource types. In Proc. USENIX Symp. on Networked Systems Design and Implementation (NSDI), pages 323--326, 2011. Google ScholarDigital Library
A. Ghodsi, M. Zaharia, S. Shenker, and I. Stoica. Choosy: max-min fair sharing for datacenter jobs with constraints. In Proc. European Conf. on Computer Systems (EuroSys), pages 365--378, Prague, Czech Republic, 2013. Google ScholarDigital Library
D. Gmach, J. Rolia, and L. Cherkasova. Selling T-shirts and time shares in the cloud. In Proc. IEEE/ACM Int'l Symp. on Cluster, Cloud and Grid Computing (CCGrid), pages 539--546, Ottawa, Canada, 2012. Google ScholarDigital Library
Google App Engine. http://cloud.google.com/AppEngine, 2014.Google Scholar
Google Container Engine (GKE). https://cloud.google.com/container-engine/, 2015.Google Scholar
R. Grandl, G. Ananthanarayanan, S. Kandula, S. Rao, and A. Akella. Multi-resource packing for cluster schedulers. In Proc. ACM SIGCOMM, Aug. 2014. Google ScholarDigital Library
Apache Hadoop Project. http://hadoop.apache.org/, 2009.Google Scholar
Hadoop MapReduce Next Generation -- Capacity Scheduler. http://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html, 2013.Google Scholar
J. Hamilton. On designing and deploying internet-scale services. In Proc. Large Installation System Administration Conf. (LISA), pages 231--242, Dallas, TX, USA, Nov. 2007. Google ScholarDigital Library
P. Helland. Cosmos: big data and big challenges. http://research.microsoft.com/en-us/events/fs2011/helland\_cosmos\_big\_data\_and\_big\_challenges.pdf, 2011.Google Scholar
B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. Joseph, R. Katz, S. Shenker, and I. Stoica. Mesos: a platform for fine-grained resource sharing in the data center. In Proc. USENIX Symp. on Networked Systems Design and Implementation (NSDI), 2011. Google ScholarDigital Library
IBM Platform Computing. http://www-03.ibm.com/systems/technicalcomputing/platformcomputing/products/clustermanager/index.html.Google Scholar
S. Iqbal, R. Gupta, and Y.-C. Fang. Planning considerations for job scheduling in HPC clusters. Dell Power Solutions, Feb. 2005.Google Scholar
M. Isaard. Autopilot: Automatic data center management. ACM SIGOPS Operating Systems Review, 41(2), 2007. Google ScholarDigital Library
M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. Quincy: fair scheduling for distributed computing clusters. In Proc. ACM Symp. on Operating Systems Principles (SOSP), 2009. Google ScholarDigital Library
D. B. Jackson, Q. Snell, and M. J. Clement. Core algorithms of the Maui scheduler. In Proc. Int'l Workshop on Job Scheduling Strategies for Parallel Processing, pages 87--102. Springer-Verlag, 2001. Google ScholarDigital Library
M. Kambadur, T. Moseley, R. Hank, and M. A. Kim. Measuring interference between live datacenter applications. In Proc. Int'l Conf. for High Performance Computing, Networking, Storage and Analysis (SC), Salt Lake City, UT, Nov. 2012. Google ScholarDigital Library
S. Kavulya, J. Tan, R. Gandhi, and P. Narasimhan. An analysis of traces from a production MapReduce cluster. In Proc. IEEE/ACM Int'l Symp. on Cluster, Cloud and Grid Computing (CCGrid), pages 94--103, 2010. Google ScholarDigital Library
Kubernetes. http://kubernetes.io, Aug. 2014.Google Scholar
Kernel Based Virtual Machine. http://www.linux-kvm.org.Google Scholar
L. Lamport. The part-time parliament. ACM Trans. on Computer Systems, 16(2): 133--169, May 1998. Google ScholarDigital Library
J. Leverich and C. Kozyrakis. Reconciling high server utilization and sub-millisecond quality-of-service. In Proc. European Conf. on Computer Systems (EuroSys), page 4, 2014. Google ScholarDigital Library
Z. Liu and S. Cho. Characterizing machines and workloads on a Google cluster. In Proc. Int'l Workshop on Scheduling and Resource Management for Parallel and Distributed Systems (SRMPDS), Pittsburgh, PA, USA, Sept. 2012. Google ScholarDigital Library
Google LMCTFY project (let me contain that for you). http://github.com/google/lmctfy, 2014.Google Scholar
G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In Proc. ACM SIGMOD Conference, pages 135--146, Indianapolis, IA, USA, 2010. Google ScholarDigital Library
J. Mars, L. Tang, R. Hundt, K. Skadron, and M. L. Soffa. Bubble-Up: increasing utilization in modern warehouse scale computers via sensible co-locations. In Proc. Int'l Symp. on Microarchitecture (Micro), Porto Alegre, Brazil, 2011. Google ScholarDigital Library
S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: interactive analysis of web-scale datasets. In Proc. Int'l Conf. on Very Large Data Bases (VLDB), pages 330--339, Singapore, Sept. 2010. Google ScholarDigital Library
P. Menage. Linux control groups. http://www.kernel.org/doc/Documentation/cgroups/cgroups.txt, 2007--2014.Google Scholar
A. K. Mishra, J. L. Hellerstein, W. Cirne, and C. R. Das. Towards characterizing cloud backend workloads: insights from Google compute clusters. ACM SIGMETRICS Performance Evaluation Review, 37: 34--41, Mar. 2010. Google ScholarDigital Library
A. Narayanan. Tupperware: containerized deployment at Facebook. http://www.slideshare.net/dotCloud/tupperware-containerized-deployment-at-facebook, June 2014.Google Scholar
K. Ousterhout, P. Wendell, M. Zaharia, and I. Stoica. Sparrow: distributed, low latency scheduling. In Proc. ACM Symp. on Operating Systems Principles (SOSP), pages 69--84, Farminton, PA, USA, 2013. Google ScholarDigital Library
D. C. Parkes, A. D. Procaccia, and N. Shah. Beyond Dominant Resource Fairness: extensions, limitations, and indivisibilities. In Proc. Electronic Commerce, pages 808--825, Valencia, Spain, 2012. Google ScholarDigital Library
Protocol buffers. https://developers.google.com/protocol-buffers/, and https://github.com/google/protobuf/., 2014.Google Scholar
C. Reiss, A. Tumanov, G. Ganger, R. Katz, and M. Kozuch. Heterogeneity and dynamicity of clouds at scale: Google trace analysis. In Proc. ACM Symp. on Cloud Computing (SoCC), San Jose, CA, USA, Oct. 2012. Google ScholarDigital Library
M. Schwarzkopf, A. Konwinski, M. Abd-El-Malek, and J. Wilkes. Omega: flexible, scalable schedulers for large compute clusters. In Proc. European Conf. on Computer Systems (EuroSys), Prague, Czech Republic, 2013. Google ScholarDigital Library
B. Sharma, V. Chudnovsky, J. L. Hellerstein, R. Rifaat, and C. R. Das. Modeling and synthesizing task placement constraints in Google compute clusters. In Proc. ACM Symp. on Cloud Computing (SoCC), pages 3:1--3:14, Cascais, Portugal, Oct. 2011. Google ScholarDigital Library
E. Shmueli and D. G. Feitelson. On simulation and design of parallel-systems schedulers: are we doing the right thing? IEEE Trans. on Parallel and Distributed Systems, 20(7): 983--996, July 2009. Google ScholarDigital Library
A. Singh, M. Korupolu, and D. Mohapatra. Server-storage virtualization: integration and load balancing in data centers. In Proc. Int'l Conf. for High Performance Computing, Networking, Storage and Analysis (SC), pages 53:1--53:12, Austin, TX, USA, 2008. Google ScholarDigital Library
Apache Spark Project. http://spark.apache.org/, 2014.Google Scholar
A. Tumanov, J. Cipar, M. A. Kozuch, and G. R. Ganger. Alsched: algebraic scheduling of mixed workloads in heterogeneous clouds. In Proc. ACM Symp. on Cloud Computing (SoCC), San Jose, CA, USA, Oct. 2012. Google ScholarDigital Library
P. Turner, B. Rao, and N. Rao. CPU bandwidth control for CFS. In Proc. Linux Symposium, pages 245--254, July 2010.Google Scholar
V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. O'Malley, S. Radia, B. Reed, and E. Baldeschwieler. Apache Hadoop YARN: Yet Another Resource Negotiator. In Proc. ACM Symp. on Cloud Computing (SoCC), Santa Clara, CA, USA, 2013. Google ScholarDigital Library
VMware VCloud Suite. http://www.vmware.com/products/vcloud-suite/.Google Scholar
A. Verma, M. Korupolu, and J. Wilkes. Evaluating job packing in warehouse-scale computing. In IEEE Cluster, pages 48--56, Madrid, Spain, Sept. 2014.Google ScholarCross Ref
W. Whitt. Open and closed models for networks of queues. AT&T Bell Labs Technical Journal, 63(9), Nov. 1984.Google Scholar
J. Wilkes. More Google cluster data. http://googleresearch.blogspot.com/2011/11/more-google-cluster-data.html, Nov. 2011.Google Scholar
Y. Zhai, X. Zhang, S. Eranian, L. Tang, and J. Mars. HaPPy: Hyperthread-aware power profiling dynamically. In Proc. USENIX Annual Technical Conf. (USENIX ATC), pages 211--217, Philadelphia, PA, USA, June 2014. USENIX Association. Google ScholarDigital Library
Q. Zhang, J. Hellerstein, and R. Boutaba. Characterizing task usage shapes in Google's compute clusters. In Proc. Int'l Workshop on Large-Scale Distributed Systems and Middleware (LADIS), 2011.Google Scholar
X. Zhang, E. Tune, R. Hagmann, R. Jnagal, V. Gokhale, and J. Wilkes. CPI²: CPU performance isolation for shared compute clusters. In Proc. European Conf. on Computer Systems (EuroSys), Prague, Czech Republic, 2013. Google ScholarDigital Library
Z. Zhang, C. Li, Y. Tao, R. Yang, H. Tang, and J. Xu. Fuxi: a fault-tolerant resource management and job scheduling system at internet scale. In Proc. Int'l Conf. on Very Large Data Bases (VLDB), pages 1393--1404. VLDB Endowment Inc., Sept. 2014. Google ScholarDigital Library

Index Terms

Large-scale cluster management at Google with Borg

Recommendations

Virtual cluster management with Xen
Euro-Par'07: Proceedings of the 2007 conference on Parallel processing

Recently, virtualization of hardware resources to run multiple instances of independent virtual machines over physical hosts has gained popularity due to an industry-wide focus on the need to reduce the cost of operation of an enterprise computing ...
Read More
Flexible and efficient resource management in a virtual cluster environment
Read More
Borg: the next generation
EuroSys '20: Proceedings of the Fifteenth European Conference on Computer Systems

This paper analyzes a newly-published trace that covers 8 different Borg [35] clusters for the month of May 2019. The trace enables researchers to explore how scheduling works in large-scale production compute clusters. We highlight how Borg has evolved ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
EuroSys '15: Proceedings of the Tenth European Conference on Computer Systems
April 2015
503 pages
ISBN:9781450332385
DOI:10.1145/2741948
General Chair:
Laurent Réveillère
LaBRI, University of Bordeaux, France
,
Program Chairs:
Tim Harris
Oracle Labs, UK
,
Maurice Herlihy
Brown University
Copyright © 2015 Owner/Author
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 17 April 2015
Check for updates
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate241of1,308submissions,18%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 750
  Total Citations
  View Citations
- 22,244
  Total Downloads
- Downloads (Last 12 months)3,195
- Downloads (Last 6 weeks)380
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Large-scale cluster management at Google with Borg

EuroSys '15: Proceedings of the Tenth European Conference on Computer Systems

ABSTRACT

Supplemental Material

References

Cited By

Index Terms

Recommendations

Virtual cluster management with Xen

Flexible and efficient resource management in a virtual cluster environment

Borg: the next generation