DOI: 10.1145/2934872.2934891
research-article
Open Access

Evolve or Die: High-Availability Design Principles Drawn from Google's Network Infrastructure

Published: 22 August 2016

ABSTRACT

Maintaining the highest levels of availability for content providers is challenging in the face of scale, network evolution and complexity. Little, however, is known about failures large content providers are susceptible to, and what mechanisms they employ to ensure high availability. From a detailed analysis of over 100 high-impact failure events in a global-scale content provider encompassing several data centers and two WANs, we quantify several dimensions of availability failures. We find that failures are evenly distributed across different network types and planes, but that a large number of failures happen when a management operation is in progress within the network. We discuss some of these failures in detail, and also describe our design principles for high availability motivated by these failures, including using defense in depth, maintaining consistency across planes, failing open on large failures, carefully preventing and avoiding failures, and assessing root cause quickly. Our findings suggest that, as networks become more complicated, failures lurk everywhere, and, counter-intuitively, continuous incremental evolution of the network can, when applied together with our design principles, result in a more robust network.

Published in

SIGCOMM '16: Proceedings of the 2016 ACM SIGCOMM Conference
August 2016
645 pages
ISBN: 9781450341936
DOI: 10.1145/2934872

Copyright © 2016 Owner/Author

This work is licensed under a Creative Commons Attribution-ShareAlike International 4.0 License.

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

SIGCOMM '16 paper acceptance rate: 39 of 231 submissions, 17%. Overall acceptance rate: 554 of 3,547 submissions, 16%.
