Research Article · Open Access
DOI: 10.1145/2987550.2987555

Automating Failure Testing Research at Internet Scale

Published: 05 October 2016

ABSTRACT

Large-scale distributed systems must be built to anticipate and mitigate a variety of hardware and software failures. To build confidence that fault-tolerant systems are correctly implemented, Netflix (and similar enterprises) regularly run failure drills in which faults are deliberately injected into their production systems. The combinatorial space of failure scenarios is too large to explore exhaustively. Existing failure testing approaches either explore the space of potential failures randomly or exploit the "hunches" of domain experts to guide the search. Random strategies waste resources testing "uninteresting" faults, while programmer-guided approaches are only as good as human intuition and scale only with human effort.

In this paper, we describe how we adapted and implemented a research prototype called lineage-driven fault injection (LDFI) to automate failure testing at Netflix. Along the way, we describe the challenges that arose in adapting the LDFI model to the complex and dynamic realities of the Netflix architecture. We show how we implemented the adapted algorithm as a service atop the existing tracing and fault injection infrastructure, and present early results.
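The guided search that LDFI performs can be sketched, in miniature, as a hitting-set computation over the lineage of a successful request: each alternative way the request can succeed is a "support" set of components, and a candidate experiment is a minimal set of faults that breaks every support. The service names and the two-support lineage below are hypothetical, chosen purely for illustration:

```python
from itertools import chain, combinations

# Lineage of one successful request: each inner frozenset is an
# alternative combination of components that can produce the outcome.
# Component names here are hypothetical, for illustration only.
lineage = [
    frozenset({"api", "playlist", "cache"}),
    frozenset({"api", "playlist", "db"}),
]

def candidate_fault_sets(lineage, max_faults=2):
    """Enumerate minimal sets of component failures that intersect
    every support in the lineage (a brute-force hitting-set search)."""
    components = sorted(set(chain.from_iterable(lineage)))
    found = []
    for k in range(1, max_faults + 1):
        for faults in combinations(components, k):
            fs = set(faults)
            # A fault set is a candidate if it breaks every support...
            if all(fs & support for support in lineage):
                # ...and no smaller candidate is already contained in it.
                if not any(prev <= fs for prev in found):
                    found.append(fs)
    return found

print(candidate_fault_sets(lineage))
# → [{'api'}, {'playlist'}, {'cache', 'db'}]
```

Each returned set is an experiment worth running: injecting those faults together could plausibly break the request, whereas fault sets that leave some support intact (e.g. failing only the cache) are provably "uninteresting" and need not be tested. The production system described in the paper derives lineage from tracing data and uses a solver rather than brute-force enumeration, but the pruning intuition is the same.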


Published in

SoCC '16: Proceedings of the Seventh ACM Symposium on Cloud Computing, October 2016, 534 pages
ISBN: 9781450345255
DOI: 10.1145/2987550

Copyright © 2016 Owner/Author. This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.

Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

SoCC '16 paper acceptance rate: 38 of 151 submissions (25%). Overall acceptance rate: 169 of 722 submissions (23%).
