Evolve or Die: High-Availability Design Principles Drawn from Google's Network Infrastructure

Authors:
Ramesh Govindan (Google, USC)
Ina Minei (Google)
Mahesh Kallahalla (Google)
Bikash Koley (Google)
Amin Vahdat (Google)

SIGCOMM '16: Proceedings of the 2016 ACM SIGCOMM Conference, August 2016, Pages 58–72, https://doi.org/10.1145/2934872.2934891

Published: 22 August 2016

ABSTRACT

Maintaining the highest levels of availability for content providers is challenging in the face of scale, network evolution and complexity. Little, however, is known about failures large content providers are susceptible to, and what mechanisms they employ to ensure high availability. From a detailed analysis of over 100 high-impact failure events in a global-scale content provider encompassing several data centers and two WANs, we quantify several dimensions of availability failures. We find that failures are evenly distributed across different network types and planes, but that a large number of failures happen when a management operation is in progress within the network. We discuss some of these failures in detail, and also describe our design principles for high availability motivated by these failures, including using defense in depth, maintaining consistency across planes, failing open on large failures, carefully preventing and avoiding failures, and assessing root cause quickly. Our findings suggest that, as networks become more complicated, failures lurk everywhere, and, counter-intuitively, continuous incremental evolution of the network can, when applied together with our design principles, result in a more robust network.

Index Terms

1. Networks
  1. Network algorithms
    1. Control path algorithms
  2. Network properties
    1. Network manageability
    2. Network reliability

Published in
SIGCOMM '16: Proceedings of the 2016 ACM SIGCOMM Conference, August 2016, 645 pages
ISBN: 9781450341936
DOI: 10.1145/2934872
General Chairs: Marinho Barcellos (UFRGS), Jon Crowcroft (University of Cambridge)
Program Chairs: Amin Vahdat (Google), Sachin Katti (Stanford University)
Copyright © 2016 Owner/Author
This work is licensed under a Creative Commons Attribution-ShareAlike International 4.0 License.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Availability; Control Plane; Management Plane
Qualifiers
- research-article
Acceptance Rates
SIGCOMM '16 Paper Acceptance Rate: 39 of 231 submissions, 17%
Overall Acceptance Rate: 554 of 3,547 submissions, 16%