ABSTRACT
Large-scale distributed systems must be built to anticipate and mitigate a variety of hardware and software failures. In order to build confidence that fault-tolerant systems are correctly implemented, Netflix (and similar enterprises) regularly run failure drills in which faults are deliberately injected in their production system. The combinatorial space of failure scenarios is too large to explore exhaustively. Existing failure testing approaches either explore the space of potential failures randomly or exploit the "hunches" of domain experts to guide the search. Random strategies waste resources testing "uninteresting" faults, while programmer-guided approaches are only as good as human intuition and only scale with human effort.
In this paper, we describe how we adapted and implemented a research prototype called lineage-driven fault injection (LDFI) to automate failure testing at Netflix. Along the way, we describe the challenges that arose in adapting the LDFI model to the complex and dynamic realities of the Netflix architecture. We show how we implemented the adapted algorithm as a service atop the existing tracing and fault injection infrastructure, and present early results.
Index Terms
- Automating Failure Testing Research at Internet Scale