ABSTRACT
Maintaining the highest levels of availability for content providers is challenging in the face of scale, network evolution and complexity. Little, however, is known about failures large content providers are susceptible to, and what mechanisms they employ to ensure high availability. From a detailed analysis of over 100 high-impact failure events in a global-scale content provider encompassing several data centers and two WANs, we quantify several dimensions of availability failures. We find that failures are evenly distributed across different network types and planes, but that a large number of failures happen when a management operation is in progress within the network. We discuss some of these failures in detail, and also describe our design principles for high availability motivated by these failures, including using defense in depth, maintaining consistency across planes, failing open on large failures, carefully preventing and avoiding failures, and assessing root cause quickly. Our findings suggest that, as networks become more complicated, failures lurk everywhere, and, counter-intuitively, continuous incremental evolution of the network can, when applied together with our design principles, result in a more robust network.
Index Terms
- Evolve or Die: High-Availability Design Principles Drawn from Google's Network Infrastructure