Abstract
In the early 2000s, Amazon created GameDay, a program designed to increase resilience by purposely injecting major failures into critical systems semi-regularly to discover flaws and subtle dependencies. A GameDay exercise tests a company’s systems, software, and people as they prepare to respond to a disastrous event. Widespread acceptance of the GameDay concept has taken a few years, but many companies now see its value and have started to adopt their own versions. This discussion considers some of those experiences.