skip to main content
10.1145/2723372.2723711acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article
Open Access

Lineage-driven Fault Injection

Published:27 May 2015Publication History

ABSTRACT

In large-scale data management systems, failure is practically a certainty. Fault-tolerant protocols and components are notoriously difficult to implement and debug. Worse still, choosing existing fault-tolerance mechanisms and integrating them correctly into complex systems remains an art form, and programmers have few tools to assist them.

We propose a novel approach for discovering bugs in fault-tolerant data management systems: lineage-driven fault injection. A lineage-driven fault injector reasons backwards from correct system outcomes to determine whether failures in the execution could have prevented the outcome. We present MOLLY, a prototype of lineage-driven fault injection that exploits a novel combination of data lineage techniques from the database literature and state-of-the-art satisfiability testing. If fault-tolerance bugs exist for a particular configuration, MOLLY finds them rapidly, in many cases using a order of magnitude fewer executions than random fault injection. Otherwise, MOLLY certifies that the code is bug-free for that configuration.

References

  1. The Netflix Simian Army. http://techblog.netflix.com/2011/07/netflix-simian-army.html, 2011.Google ScholarGoogle Scholar
  2. Kafka 0.8.0 Documentation. https://kafka.apache.org/08/documentation.html, 2013.Google ScholarGoogle Scholar
  3. S. Abiteboul, E. Antoine, and J. Stoyanovich. The Webdamlog System Managing Distributed Knowledge on the Web. CoRR, abs/1304.4187, 2013.Google ScholarGoogle Scholar
  4. P. A. Alsberg and J. D. Day. A Principle for Resilient Sharing of Distributed Resources. ICSE '76. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. P. Alvaro, N. Conway, J. M. Hellerstein, and W. R. Marczak. Consistency Analysis in Bloom: a CALM and Collected Approach. CIDR'12.Google ScholarGoogle Scholar
  6. P. Alvaro, A. Hutchinson, N. Conway, W. R. Marczak, and J. M. Hellerstein. BloomUnit: Declarative Testing for Distributed Programs. DBTest '12. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. P. Alvaro, W. R. Marczak, N. Conway, J. M. Hellerstein, D. Maier, and R. Sears. Dedalus: Datalog in Time and Space. Datalog'10. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. T. J. Ameloot, F. Neven, and J. Van den Bussche. Relational Transducers for Declarative Networking. PODS'12. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. J. Armstrong. Programming Erlang: Software for a Concurrent World. 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. P. Bailis, A. Davidson, A. Fekete, A. Ghodsi, J. M. Hellerstein, and I. Stoica. Highly Available Transactions: virtues and limitations. VLDB'14. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. P. Bailis and K. Kingsbury. The Network is Reliable. Commun. ACM, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. J. Baker, C. Bond, J. C. Corbett, J. Furman, A. Khorlin, J. Larson, J.-M. Leon, Y. Li, A. Lloyd, and V. Yushprakh. Megastore: Providing Scalable, Highly Available Storage for Interactive Services. CIDR'11.Google ScholarGoogle Scholar
  13. T. Ball, V. Levin, and S. K. Rajamani. A Decade of Software Model Checking with SLAM. Commun. ACM, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. M. Ben-Or. Another Advantage of Free Choice (Extended Abstract): Completely Asynchronous Agreement Protocols. PODC '83. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. P. A. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency Control and Recovery in Database Systems. Addison-Wesley, 1987. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. H. Blodget. Amazon's Cloud Crash Disaster Permanently Destroyed Many Customers? Data. http://www.businessinsider.com/amazon-lost-data-2011-4, April 2011.Google ScholarGoogle Scholar
  17. P. Buneman, S. Khanna, and W.-c. Tan. Why and Where: A Characterization of Data Provenance. ICDT'01. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. C. Cadar, D. Dunbar, and D. Engler. KLEE: Unassisted and Automatic Generation of High-coverage Tests for Complex Systems Programs. OSDI'08. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. T. D. Chandra, V. Hadzilacos, and S. Toueg. The Weakest Failure Detector for Solving Consensus. J. ACM, July 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: a Distributed Storage System for Structured Data. OSDI'06. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. J. Cheney, L. Chiticariu, and W.-C. Tan. Provenance in Databases: Why, How, and Where. Found. Trends databases, April 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. K. Claessen and J. Hughes. QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs. ICFP '00. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. E. Clarke, O. Grumberg, S. Jha, Y. Lu, and H. Veith. Counterexample-guided Abstraction Refinement for Symbolic Model Checking. J. ACM, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, W. Hsieh, S. Kanthak, E. Kogan, H. Li, A. Lloyd, S. Melnik, D. Mwaura, D. Nagle, S. Quinlan, R. Rao, L. Rolig, Y. Saito, M. Szymaniak, C. Taylor, R. Wang, and D. Woodford. Spanner: Google's Globally-distributed Database. OSDI'12. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Y. Cui, J. Widom, and J. L. Wiener. Tracing the lineage of view data in a warehousing environment. ACM Trans. Database Syst., June 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. S. Dawson, F. Jahanian, and T. Mitton. ORCHESTRA: A Fault Injection Environment for Distributed Systems. Technical report, FTCS, 1996.Google ScholarGoogle Scholar
  27. J. Dean. Designs, Lessons and Advice from Building Large Distributed Systems. http://www.cs.cornell.edu/projects/ladis2009/talks/deankeynoteladis2009.pdf, 2009. Ladis'09 Keynote.Google ScholarGoogle Scholar
  28. G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon's highly available key-value store. SOSP'07. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. M. J. Fischer, N. A. Lynch, and M. S. Paterson. Impossibility o Distributed Consensus with One Faulty Process. J. ACM, April 1985. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. D. Fisman, O. Kupferman, and Y. Lustig. On verifying fault tolerance of distributed protocols. In Tools and Algorithms for the Construction and Analysis of Systems, volume 4963 of LNCS. Springer Berlin Heidelberg, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. H. Garcia-Molina. Elections in a distributed computing system. IEEE Trans. Comput., January 1982. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. P. Gill, N. Jain, and N. Nagappan. Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications. SIGCOMM '11. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. J. Gray. Notes on data base operating systems. In Operating Systems, An Advanced Course, 1978. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. J. Gray. Why do computers stop and what can be done about it?, 1985.Google ScholarGoogle Scholar
  35. T. J. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. PODS '07. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. H. S. Gunawi, T. Do, P. Joshi, P. Alvaro, J. M. Hellerstein, A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, K. Sen, and D. Borthakur. FATE and DESTINI: A Framework for Cloud Recovery Testing. NSDI'11. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. S. Han and S. Ratnasamy. Large-scale Computation Not at the Cost of Expressiveness. HotOS'13. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. M. Herschel, M. A. Hernández, and W.-C. Tan. Artemis: A System for Analyzing Missing Answers.Google ScholarGoogle Scholar
  39. G. Holzmann. The SPIN Model Checker: Primer and Reference Manual. Addison-Wesley Professional, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. J. Huang, T. Chen, A. Doan, and J. F. Naughton. On the Provenance of Non-answers to Queries over Extracted Data. VLDB'08. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. ZooKeeper: Wait-free Coordination for Internet-scale Systems. USENIX ATC'10. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. M. Interlandi, L. Tanca, and S. Bergamaschi. Datalog in time and space, synchronously. CEUR'13.Google ScholarGoogle Scholar
  43. F. P. Junqueira, B. C. Reed, and M. Serafini. Zab: High-performance broadcast for primary-backup systems. DSN '11. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. G. A. Kanawati, N. A. Kanawati, and J. A. Abraham. Ferrari: A flexible software-based fault and error injection system. IEEE Trans. Comput., Feb 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. G. Karvounarakis, Z. G. Ives, and V. Tannen. Querying data provenance. SIGMOD '10. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. N. P. Katta, J. Rexford, and D. Walker. Logic programming for software-defined networks. XLDI'12.Google ScholarGoogle Scholar
  47. G. Kiczales, J. Lamping, A. Mendhekar, C. Maeda, C. Lopes, J. marc Loingtier, and J. Irwin. Aspect-oriented Programming. ECOOP'97.Google ScholarGoogle Scholar
  48. C. E. Killian, J. W. Anderson, R. Jhala, and A. Vahdat. Life, Death, and the Critical Transition: Finding Liveness Bugs in Systems Code. NSDI'07. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. K. Kingsbury. Call me maybe: Kafka. http://aphyr.com/posts/293-call-me-maybe-kafka, 2013.Google ScholarGoogle Scholar
  50. S. Köhler, B. Ludäscher, and Y. Smaragdakis. Declarative datalog debugging for mere mortals. In Datalog in Academia and Industry, LNCS. Springer Berlin Heidelberg, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. S. Köhler, B. Ludäscher, and D. Zinn. First-Order Provenance Games. In In Search of Elegance in the Theory and Practice of Computation, volume 8000 of LNCS. Springer, 2013.Google ScholarGoogle ScholarCross RefCross Ref
  52. L. Kuper and R. R. Newton. LVars: Lattice-based Data Structures for Deterministic Parallelism. FHPC'13. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. L. Lamport. Time, Clocks, and the Ordering of Events in a Distributed System. Commun. ACM, Jul 1978. Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. L. Lamport. The Part-time Parliament. ACM Transactions on Computer Systems, May 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. W. Lloyd, M. J. Freedman, M. Kaminsky, and D. G. Andersen. Don'T Settle for Eventual: Scalable Causal Consistency for Wide-area Storage with COPS. SOSP '11. Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. B. T. Loo, T. Condie, J. M. Hellerstein, P. Maniatis, T. Roscoe, and I. Stoica. Implementing Declarative Overlays. SOSP '05. Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. O. Malik. When the Cloud Fails: T-Mobile, Microsoft Lose Sidekick Customer Data. http://gigaom.com/2009/10/10/when-cloud-fails-t-mobile-microsoft-losesidekick-customer-data/, Oct 2009.Google ScholarGoogle Scholar
  58. W. R. Marczak, P. Alvaro, N. Conway, J. M. Hellerstein, and D. Maier. Confluence Analysis for Distributed Programs: A Model-Theoretic Approach. Datalog'12. Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. P. D. Marinescu and G. Candea. LFI: A practical and general library-level fault injector. In DSN. IEEE, 2009.Google ScholarGoogle Scholar
  60. A. Meliou and D. Suciu. Tiresias: The Database Oracle for How-to Queries. SIGMOD '12. Google ScholarGoogle ScholarDigital LibraryDigital Library
  61. S. Mullender, editor. Distributed Systems. Addison-Wesley, second edition, 1993.Google ScholarGoogle Scholar
  62. K.-K. Muniswamy-Reddy, D. A. Holland, U. Braun, and M. Seltzer. Provenance-aware storage systems. ATEC '06. Google ScholarGoogle ScholarDigital LibraryDigital Library
  63. M. Musuvathi, D. Y. W. Park, A. Chou, D. R. Engler, and D. L. Dill. CMC: A Pragmatic Approach to Model Checking Real Code. SIGOPS Oper. Syst. Rev., 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  64. M. Musuvathi, S. Qadeer, T. Ball, G. Basler, P. A. Nainar, and I. Neamtiu. Finding and Reproducing Heisenbugs in Concurrent Programs. OSDI'08. Google ScholarGoogle ScholarDigital LibraryDigital Library
  65. T. Nelson, M. Scheer, A. Ferguson, and S. Krishnamurthi. Tierless Programming and Reasoning for Software-Defined Networks. NSDI'14. Google ScholarGoogle ScholarDigital LibraryDigital Library
  66. R. Perera, U. A. Acar, J. Cheney, and P. B. Levy. Functional Programs That Explain Their Work. ICFP '12. Google ScholarGoogle ScholarDigital LibraryDigital Library
  67. K. Petersen, M. J. Spreitzer, D. B. Terry, M. M. Theimer, and A. J. Demers. Flexible Update Propagation for Weakly Consistent Replication. SIGOPS Oper. Syst. Rev., Dec 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  68. N. Raychaudhuri. Scala in Action. Manning Publications Co., 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  69. S. Riddle, S. Köhler, and B. Ludäscher. Towards Constraint Provenance Games. TaPP'14.Google ScholarGoogle Scholar
  70. M. C. Rinard and P. C. Diniz. Commutativity Analysis: a New Analysis Technique for Parallelizing Compilers. ACM Trans. Program. Lang. Syst., Nov 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  71. F. B. Schneider. Implementing Fault-tolerant Services Using the State Machine Approach: a Tutorial. ACM Comput. Surv., 22(4), Dec. 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  72. K. Sen and G. Agha. Automated Systematic Testing of Open Distributed Programs. In L. Baresi and R. Heckel, editors, Fundamental Approaches to Software Engineering, volume 3922 of LNCS. 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  73. M. A. Shah, J. M. Hellerstein, and E. Brewer. Highly Available, Fault-tolerant, Parallel Dataflows. SIGMOD'04. Google ScholarGoogle ScholarDigital LibraryDigital Library
  74. M. Shapiro, N. Preguiça, C. Baquero, and M. Zawirski. A comprehensive study of Convergent and Commutative Replicated Data Types. Research report, INRIA, 2011.Google ScholarGoogle Scholar
  75. D. Skeen. Nonblocking Commit Protocols. SIGMOD '81. Google ScholarGoogle ScholarDigital LibraryDigital Library
  76. M. Stonebraker. Concurrency control and consistency of multiple copies of data in distributed ingres. IEEE Trans. Softw. Eng., May 1979. Google ScholarGoogle ScholarDigital LibraryDigital Library
  77. T. Stoppard. Arcadia: a play in two acts. Samuel French, Inc., 1993.Google ScholarGoogle Scholar
  78. D. B. Terry, A. J. Demers, K. Petersen, M. J. Spreitzer, M. M. Theimer, and B. B. Welch. Session Guarantees for Weakly Consistent Replicated Data. PDIS '94. Google ScholarGoogle ScholarDigital LibraryDigital Library
  79. R. H. Thomas. A Majority Consensus Approach to Concurrency Control for Multiple Copy Databases. ACM Trans. Database Syst., June 1979. Google ScholarGoogle ScholarDigital LibraryDigital Library
  80. A. Thomson, T. Diamond, S.-C. Weng, K. Ren, P. Shao, and D. J. Abadi. Calvin: Fast distributed transactions for partitioned database systems. SIGMOD '12. Google ScholarGoogle ScholarDigital LibraryDigital Library
  81. J. D. Ullman. Principles of Database and Knowledge-Base Systems: Volume II: The New Technologies. W. H. Freeman & Co., 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  82. W. Vogels. Eventually Consistent. Commun. ACM, January 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  83. Y. Wu, A. Haeberlen, W. Zhou, and B. T. Loo. Answering Why-not Queries in Software-defined Networks with Negative Provenance. HotNets'13. Google ScholarGoogle ScholarDigital LibraryDigital Library
  84. M. Yabandeh, N. Knezevic, D. Kostic, and V. Kuncak. CrystalBall: Predicting and Preventing Inconsistencies in Deployed Distributed Systems. NSDI'09. Google ScholarGoogle ScholarDigital LibraryDigital Library
  85. J. Yang, T. Chen, M. Wu, Z. Xu, X. Liu, H. Lin, M. Yang, F. Long, L. Zhang, and L. Zhou. MODIST: Transparent Model Checking of Unmodified Distributed Systems. NSDI'09. Google ScholarGoogle ScholarDigital LibraryDigital Library
  86. Y. Yu, P. Manolios, and L. Lamport. Model checking tla+ specifications. CHARME '99. Google ScholarGoogle ScholarDigital LibraryDigital Library
  87. M. Zaharia, T. Das, H. Li, S. Shenker, and I. Stoica. Discretized Streams: an Efficient and Fault-tolerant Model for Stream Processing on Large Clusters. HotCloud'12. Google ScholarGoogle ScholarDigital LibraryDigital Library
  88. C. Zamfir and G. Candea. Execution Synthesis: A Technique for Automated Software Debugging. EuroSys '10. Google ScholarGoogle ScholarDigital LibraryDigital Library
  89. W. Zhou, M. Sherr, T. Tao, X. Li, B. T. Loo, and Y. Mao. Efficient querying an maintenance of network provenance at internet-scale. SIGMOD '10. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Lineage-driven Fault Injection

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
      May 2015
      2110 pages
      ISBN:9781450327589
      DOI:10.1145/2723372

      Copyright © 2015 Owner/Author

      Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 27 May 2015

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      SIGMOD '15 Paper Acceptance Rate106of415submissions,26%Overall Acceptance Rate785of4,003submissions,20%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader