Automating Failure Testing Research at Internet Scale

Authors:
Peter Alvaro (UC Santa Cruz)
Kolton Andrus (Gremlin, Inc., formerly Netflix)
Chris Sanden (Netflix, Inc.)
Casey Rosenthal (Netflix, Inc.)
Ali Basiri (Netflix, Inc.)
Lorin Hochstein (Netflix, Inc.)

SoCC '16: Proceedings of the Seventh ACM Symposium on Cloud Computing, October 2016, Pages 17–28. https://doi.org/10.1145/2987550.2987555

Published: 05 October 2016

ABSTRACT

Large-scale distributed systems must be built to anticipate and mitigate a variety of hardware and software failures. To build confidence that fault-tolerant systems are correctly implemented, Netflix (and similar enterprises) regularly run failure drills in which faults are deliberately injected into their production systems. The combinatorial space of failure scenarios is too large to explore exhaustively. Existing failure testing approaches either explore the space of potential failures randomly or exploit the "hunches" of domain experts to guide the search. Random strategies waste resources testing "uninteresting" faults, while programmer-guided approaches are only as good as human intuition and scale only with human effort.

In this paper, we describe how we adapted and implemented a research prototype called lineage-driven fault injection (LDFI) to automate failure testing at Netflix. Along the way, we describe the challenges that arose adapting the LDFI model to the complex and dynamic realities of the Netflix architecture. We show how we implemented the adapted algorithm as a service atop the existing tracing and fault injection infrastructure, and present early results.
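The intuition behind a lineage-driven search can be sketched in a few lines. This is only an illustration under stated assumptions, not the implementation described in the paper: the service names and the `support_paths` data are hypothetical, and the brute-force enumeration stands in for whatever solver a real system would use. Given the sets of services that supported a successful request (its lineage), the sketch looks for the smallest combinations of failures that would break every supporting path, since only those are "interesting" faults to inject.

```python
from itertools import combinations


def minimal_fault_sets(support_paths, max_faults):
    """Return minimal sets of services whose failure breaks every support path.

    support_paths: list of sets, each the services one successful
    execution path depended on (hypothetical lineage data).
    max_faults: largest number of simultaneous failures to consider.
    """
    services = sorted(set().union(*support_paths))
    found = []
    for k in range(1, max_faults + 1):
        for candidate in combinations(services, k):
            chosen = set(candidate)
            # A superset of an already-found fault set is not minimal.
            if any(prev <= chosen for prev in found):
                continue
            # The candidate is interesting only if it cuts every path.
            if all(chosen & path for path in support_paths):
                found.append(chosen)
    return found


# Hypothetical lineage: the request succeeded via a cache or, redundantly, a database.
support_paths = [
    {"api", "ratings", "cache"},
    {"api", "ratings", "db"},
]

for fault_set in minimal_fault_sets(support_paths, max_faults=2):
    print(sorted(fault_set))
```

In this toy example there are ten possible combinations of one or two failures among the four services, but only three are worth injecting: failing `api`, failing `ratings`, or failing `cache` and `db` together. Injecting `cache` alone would be an "uninteresting" fault, because the redundant `db` path still supports the outcome.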

Index Terms

1. General and reference
  1. Cross-computing tools and techniques
    1. Reliability
2. Software and its engineering
  1. Software organization and properties
    1. Extra-functional properties
      1. Software reliability

Published in
SoCC '16: Proceedings of the Seventh ACM Symposium on Cloud Computing
October 2016
534 pages
ISBN: 9781450345255
DOI: 10.1145/2987550
Editors: Marcos K. Aguilera, Brian Cooper, Yanlei Diao
Copyright © 2016 Owner/Author
This work is licensed under a Creative Commons Attribution-NoDerivs International 4.0 License.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Fault tolerance
data lineage
fault injection
verification
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
SoCC '16 Paper Acceptance Rate: 38 of 151 submissions, 25%. Overall Acceptance Rate: 169 of 722 submissions, 23%.