skip to main content
research-article
Free Access

The Antifragile Organization: Embracing Failure to Improve Resilience and Maximize Availability

Published:08 June 2013Publication History
Skip Abstract Section

Abstract

Failure is inevitable. Disks fail. Software bugs lie dormant waiting for just the right conditions to bite. People make mistakes. Data centers are built on farms of unreliable commodity hardware. If you’re running in a cloud environment, then many of these factors are outside of your control. To compound the problem, failure is not predictable and doesn’t occur with uniform probability and frequency. The lack of a uniform frequency increases uncertainty and risk in the system. In the face of such inevitable and unpredictable failure, how can you build a reliable service that provides the high level of availability your users can depend on?

References

  1. Robbins, J., Krishnan, K., Allspaw, J., Limoncelli, T. 2012. Resilience engineering: learning to embrace failure. Communications of the ACM55(11): 40-47; http://dx.doi.org/10.1145/2366316.2366331. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Bennett, C. 2012. Edda - Learn the stories of your cloud deployments. The Netflix Tech Blog; http://techblog.netflix.com/2012/11/edda-learn-stories-of-your-cloud.html.Google ScholarGoogle Scholar
  3. Bennett, C., Tseitlin, A. 2012. Chaos Monkey released into the wild. The Netflix Tech Blog; http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html.Google ScholarGoogle Scholar
  4. Chandra, T. D., Griesemer, R., Redstone, J. 2007. Paxos made live: an engineering perspective. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing: 398-407; http://labs.google.com/papers/paxos_made_live.pdf. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Izrailevsky, Y., Tseitlin, A. 2011. The Netflix simian army. The Netflix Tech Blog; http://techblog.netflix.com/2011/07/netflix-simian-army.html.Google ScholarGoogle Scholar
  6. Sondow, J. 2012. Asgard: Web-based cloud management and deployment. The Netflix Tech Blog; http://techblog.netflix.com/2012/06/asgard-web-based-cloud-management-and.html.Google ScholarGoogle Scholar
  7. Strigini, L. 2009. Fault tolerance and resilience: meanings, measures and assessment. London, U.K.: Centre for Software Reliability, City University London; http://www.csr.city.ac.uk/projects/amber/resilienceFTmeasurementv06.pdf.Google ScholarGoogle Scholar
  8. Taleb, N. 2012. Antifragile: Things That Gain from Disorder. Random House.Google ScholarGoogle Scholar

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in

Full Access

  • Published in

    cover image Queue
    Queue  Volume 11, Issue 6
    Mobile Web Development
    June 2013
    34 pages
    ISSN:1542-7730
    EISSN:1542-7749
    DOI:10.1145/2493944
    Issue’s Table of Contents

    Copyright © 2013 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    • Published: 8 June 2013

    Permissions

    Request permissions about this article.

    Request Permissions

    Check for updates

    Qualifiers

    • research-article
    • Popular
    • Refereed

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HTML Format

View this article in HTML Format .

View HTML Format