ABSTRACT
In large-scale data management systems, failure is practically a certainty. Fault-tolerant protocols and components are notoriously difficult to implement and debug. Worse still, choosing existing fault-tolerance mechanisms and integrating them correctly into complex systems remains an art form, and programmers have few tools to assist them.
We propose a novel approach for discovering bugs in fault-tolerant data management systems: lineage-driven fault injection. A lineage-driven fault injector reasons backwards from correct system outcomes to determine whether failures in the execution could have prevented the outcome. We present MOLLY, a prototype of lineage-driven fault injection that exploits a novel combination of data lineage techniques from the database literature and state-of-the-art satisfiability testing. If fault-tolerance bugs exist for a particular configuration, MOLLY finds them rapidly, in many cases using a order of magnitude fewer executions than random fault injection. Otherwise, MOLLY certifies that the code is bug-free for that configuration.
- The Netflix Simian Army. http://techblog.netflix.com/2011/07/netflix-simian-army.html, 2011.Google Scholar
- Kafka 0.8.0 Documentation. https://kafka.apache.org/08/documentation.html, 2013.Google Scholar
- S. Abiteboul, E. Antoine, and J. Stoyanovich. The Webdamlog System Managing Distributed Knowledge on the Web. CoRR, abs/1304.4187, 2013.Google Scholar
- P. A. Alsberg and J. D. Day. A Principle for Resilient Sharing of Distributed Resources. ICSE '76. Google ScholarDigital Library
- P. Alvaro, N. Conway, J. M. Hellerstein, and W. R. Marczak. Consistency Analysis in Bloom: a CALM and Collected Approach. CIDR'12.Google Scholar
- P. Alvaro, A. Hutchinson, N. Conway, W. R. Marczak, and J. M. Hellerstein. BloomUnit: Declarative Testing for Distributed Programs. DBTest '12. Google ScholarDigital Library
- P. Alvaro, W. R. Marczak, N. Conway, J. M. Hellerstein, D. Maier, and R. Sears. Dedalus: Datalog in Time and Space. Datalog'10. Google ScholarDigital Library
- T. J. Ameloot, F. Neven, and J. Van den Bussche. Relational Transducers for Declarative Networking. PODS'12. Google ScholarDigital Library
- J. Armstrong. Programming Erlang: Software for a Concurrent World. 2007. Google ScholarDigital Library
- P. Bailis, A. Davidson, A. Fekete, A. Ghodsi, J. M. Hellerstein, and I. Stoica. Highly Available Transactions: virtues and limitations. VLDB'14. Google ScholarDigital Library
- P. Bailis and K. Kingsbury. The Network is Reliable. Commun. ACM, 2014. Google ScholarDigital Library
- J. Baker, C. Bond, J. C. Corbett, J. Furman, A. Khorlin, J. Larson, J.-M. Leon, Y. Li, A. Lloyd, and V. Yushprakh. Megastore: Providing Scalable, Highly Available Storage for Interactive Services. CIDR'11.Google Scholar
- T. Ball, V. Levin, and S. K. Rajamani. A Decade of Software Model Checking with SLAM. Commun. ACM, 2011. Google ScholarDigital Library
- M. Ben-Or. Another Advantage of Free Choice (Extended Abstract): Completely Asynchronous Agreement Protocols. PODC '83. Google ScholarDigital Library
- P. A. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency Control and Recovery in Database Systems. Addison-Wesley, 1987. Google ScholarDigital Library
- H. Blodget. Amazon's Cloud Crash Disaster Permanently Destroyed Many Customers? Data. http://www.businessinsider.com/amazon-lost-data-2011-4, April 2011.Google Scholar
- P. Buneman, S. Khanna, and W.-c. Tan. Why and Where: A Characterization of Data Provenance. ICDT'01. Google ScholarDigital Library
- C. Cadar, D. Dunbar, and D. Engler. KLEE: Unassisted and Automatic Generation of High-coverage Tests for Complex Systems Programs. OSDI'08. Google ScholarDigital Library
- T. D. Chandra, V. Hadzilacos, and S. Toueg. The Weakest Failure Detector for Solving Consensus. J. ACM, July 1996. Google ScholarDigital Library
- F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: a Distributed Storage System for Structured Data. OSDI'06. Google ScholarDigital Library
- J. Cheney, L. Chiticariu, and W.-C. Tan. Provenance in Databases: Why, How, and Where. Found. Trends databases, April 2009. Google ScholarDigital Library
- K. Claessen and J. Hughes. QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs. ICFP '00. Google ScholarDigital Library
- E. Clarke, O. Grumberg, S. Jha, Y. Lu, and H. Veith. Counterexample-guided Abstraction Refinement for Symbolic Model Checking. J. ACM, 2003. Google ScholarDigital Library
- J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, W. Hsieh, S. Kanthak, E. Kogan, H. Li, A. Lloyd, S. Melnik, D. Mwaura, D. Nagle, S. Quinlan, R. Rao, L. Rolig, Y. Saito, M. Szymaniak, C. Taylor, R. Wang, and D. Woodford. Spanner: Google's Globally-distributed Database. OSDI'12. Google ScholarDigital Library
- Y. Cui, J. Widom, and J. L. Wiener. Tracing the lineage of view data in a warehousing environment. ACM Trans. Database Syst., June 2000. Google ScholarDigital Library
- S. Dawson, F. Jahanian, and T. Mitton. ORCHESTRA: A Fault Injection Environment for Distributed Systems. Technical report, FTCS, 1996.Google Scholar
- J. Dean. Designs, Lessons and Advice from Building Large Distributed Systems. http://www.cs.cornell.edu/projects/ladis2009/talks/deankeynoteladis2009.pdf, 2009. Ladis'09 Keynote.Google Scholar
- G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon's highly available key-value store. SOSP'07. Google ScholarDigital Library
- M. J. Fischer, N. A. Lynch, and M. S. Paterson. Impossibility o Distributed Consensus with One Faulty Process. J. ACM, April 1985. Google ScholarDigital Library
- D. Fisman, O. Kupferman, and Y. Lustig. On verifying fault tolerance of distributed protocols. In Tools and Algorithms for the Construction and Analysis of Systems, volume 4963 of LNCS. Springer Berlin Heidelberg, 2008. Google ScholarDigital Library
- H. Garcia-Molina. Elections in a distributed computing system. IEEE Trans. Comput., January 1982. Google ScholarDigital Library
- P. Gill, N. Jain, and N. Nagappan. Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications. SIGCOMM '11. Google ScholarDigital Library
- J. Gray. Notes on data base operating systems. In Operating Systems, An Advanced Course, 1978. Google ScholarDigital Library
- J. Gray. Why do computers stop and what can be done about it?, 1985.Google Scholar
- T. J. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. PODS '07. Google ScholarDigital Library
- H. S. Gunawi, T. Do, P. Joshi, P. Alvaro, J. M. Hellerstein, A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, K. Sen, and D. Borthakur. FATE and DESTINI: A Framework for Cloud Recovery Testing. NSDI'11. Google ScholarDigital Library
- S. Han and S. Ratnasamy. Large-scale Computation Not at the Cost of Expressiveness. HotOS'13. Google ScholarDigital Library
- M. Herschel, M. A. Hernández, and W.-C. Tan. Artemis: A System for Analyzing Missing Answers.Google Scholar
- G. Holzmann. The SPIN Model Checker: Primer and Reference Manual. Addison-Wesley Professional, 2003. Google ScholarDigital Library
- J. Huang, T. Chen, A. Doan, and J. F. Naughton. On the Provenance of Non-answers to Queries over Extracted Data. VLDB'08. Google ScholarDigital Library
- P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. ZooKeeper: Wait-free Coordination for Internet-scale Systems. USENIX ATC'10. Google ScholarDigital Library
- M. Interlandi, L. Tanca, and S. Bergamaschi. Datalog in time and space, synchronously. CEUR'13.Google Scholar
- F. P. Junqueira, B. C. Reed, and M. Serafini. Zab: High-performance broadcast for primary-backup systems. DSN '11. Google ScholarDigital Library
- G. A. Kanawati, N. A. Kanawati, and J. A. Abraham. Ferrari: A flexible software-based fault and error injection system. IEEE Trans. Comput., Feb 1995. Google ScholarDigital Library
- G. Karvounarakis, Z. G. Ives, and V. Tannen. Querying data provenance. SIGMOD '10. Google ScholarDigital Library
- N. P. Katta, J. Rexford, and D. Walker. Logic programming for software-defined networks. XLDI'12.Google Scholar
- G. Kiczales, J. Lamping, A. Mendhekar, C. Maeda, C. Lopes, J. marc Loingtier, and J. Irwin. Aspect-oriented Programming. ECOOP'97.Google Scholar
- C. E. Killian, J. W. Anderson, R. Jhala, and A. Vahdat. Life, Death, and the Critical Transition: Finding Liveness Bugs in Systems Code. NSDI'07. Google ScholarDigital Library
- K. Kingsbury. Call me maybe: Kafka. http://aphyr.com/posts/293-call-me-maybe-kafka, 2013.Google Scholar
- S. Köhler, B. Ludäscher, and Y. Smaragdakis. Declarative datalog debugging for mere mortals. In Datalog in Academia and Industry, LNCS. Springer Berlin Heidelberg, 2012. Google ScholarDigital Library
- S. Köhler, B. Ludäscher, and D. Zinn. First-Order Provenance Games. In In Search of Elegance in the Theory and Practice of Computation, volume 8000 of LNCS. Springer, 2013.Google ScholarCross Ref
- L. Kuper and R. R. Newton. LVars: Lattice-based Data Structures for Deterministic Parallelism. FHPC'13. Google ScholarDigital Library
- L. Lamport. Time, Clocks, and the Ordering of Events in a Distributed System. Commun. ACM, Jul 1978. Google ScholarDigital Library
- L. Lamport. The Part-time Parliament. ACM Transactions on Computer Systems, May 1998. Google ScholarDigital Library
- W. Lloyd, M. J. Freedman, M. Kaminsky, and D. G. Andersen. Don'T Settle for Eventual: Scalable Causal Consistency for Wide-area Storage with COPS. SOSP '11. Google ScholarDigital Library
- B. T. Loo, T. Condie, J. M. Hellerstein, P. Maniatis, T. Roscoe, and I. Stoica. Implementing Declarative Overlays. SOSP '05. Google ScholarDigital Library
- O. Malik. When the Cloud Fails: T-Mobile, Microsoft Lose Sidekick Customer Data. http://gigaom.com/2009/10/10/when-cloud-fails-t-mobile-microsoft-losesidekick-customer-data/, Oct 2009.Google Scholar
- W. R. Marczak, P. Alvaro, N. Conway, J. M. Hellerstein, and D. Maier. Confluence Analysis for Distributed Programs: A Model-Theoretic Approach. Datalog'12. Google ScholarDigital Library
- P. D. Marinescu and G. Candea. LFI: A practical and general library-level fault injector. In DSN. IEEE, 2009.Google Scholar
- A. Meliou and D. Suciu. Tiresias: The Database Oracle for How-to Queries. SIGMOD '12. Google ScholarDigital Library
- S. Mullender, editor. Distributed Systems. Addison-Wesley, second edition, 1993.Google Scholar
- K.-K. Muniswamy-Reddy, D. A. Holland, U. Braun, and M. Seltzer. Provenance-aware storage systems. ATEC '06. Google ScholarDigital Library
- M. Musuvathi, D. Y. W. Park, A. Chou, D. R. Engler, and D. L. Dill. CMC: A Pragmatic Approach to Model Checking Real Code. SIGOPS Oper. Syst. Rev., 2002. Google ScholarDigital Library
- M. Musuvathi, S. Qadeer, T. Ball, G. Basler, P. A. Nainar, and I. Neamtiu. Finding and Reproducing Heisenbugs in Concurrent Programs. OSDI'08. Google ScholarDigital Library
- T. Nelson, M. Scheer, A. Ferguson, and S. Krishnamurthi. Tierless Programming and Reasoning for Software-Defined Networks. NSDI'14. Google ScholarDigital Library
- R. Perera, U. A. Acar, J. Cheney, and P. B. Levy. Functional Programs That Explain Their Work. ICFP '12. Google ScholarDigital Library
- K. Petersen, M. J. Spreitzer, D. B. Terry, M. M. Theimer, and A. J. Demers. Flexible Update Propagation for Weakly Consistent Replication. SIGOPS Oper. Syst. Rev., Dec 1997. Google ScholarDigital Library
- N. Raychaudhuri. Scala in Action. Manning Publications Co., 2013. Google ScholarDigital Library
- S. Riddle, S. Köhler, and B. Ludäscher. Towards Constraint Provenance Games. TaPP'14.Google Scholar
- M. C. Rinard and P. C. Diniz. Commutativity Analysis: a New Analysis Technique for Parallelizing Compilers. ACM Trans. Program. Lang. Syst., Nov 1997. Google ScholarDigital Library
- F. B. Schneider. Implementing Fault-tolerant Services Using the State Machine Approach: a Tutorial. ACM Comput. Surv., 22(4), Dec. 1990. Google ScholarDigital Library
- K. Sen and G. Agha. Automated Systematic Testing of Open Distributed Programs. In L. Baresi and R. Heckel, editors, Fundamental Approaches to Software Engineering, volume 3922 of LNCS. 2006. Google ScholarDigital Library
- M. A. Shah, J. M. Hellerstein, and E. Brewer. Highly Available, Fault-tolerant, Parallel Dataflows. SIGMOD'04. Google ScholarDigital Library
- M. Shapiro, N. Preguiça, C. Baquero, and M. Zawirski. A comprehensive study of Convergent and Commutative Replicated Data Types. Research report, INRIA, 2011.Google Scholar
- D. Skeen. Nonblocking Commit Protocols. SIGMOD '81. Google ScholarDigital Library
- M. Stonebraker. Concurrency control and consistency of multiple copies of data in distributed ingres. IEEE Trans. Softw. Eng., May 1979. Google ScholarDigital Library
- T. Stoppard. Arcadia: a play in two acts. Samuel French, Inc., 1993.Google Scholar
- D. B. Terry, A. J. Demers, K. Petersen, M. J. Spreitzer, M. M. Theimer, and B. B. Welch. Session Guarantees for Weakly Consistent Replicated Data. PDIS '94. Google ScholarDigital Library
- R. H. Thomas. A Majority Consensus Approach to Concurrency Control for Multiple Copy Databases. ACM Trans. Database Syst., June 1979. Google ScholarDigital Library
- A. Thomson, T. Diamond, S.-C. Weng, K. Ren, P. Shao, and D. J. Abadi. Calvin: Fast distributed transactions for partitioned database systems. SIGMOD '12. Google ScholarDigital Library
- J. D. Ullman. Principles of Database and Knowledge-Base Systems: Volume II: The New Technologies. W. H. Freeman & Co., 1990. Google ScholarDigital Library
- W. Vogels. Eventually Consistent. Commun. ACM, January 2009. Google ScholarDigital Library
- Y. Wu, A. Haeberlen, W. Zhou, and B. T. Loo. Answering Why-not Queries in Software-defined Networks with Negative Provenance. HotNets'13. Google ScholarDigital Library
- M. Yabandeh, N. Knezevic, D. Kostic, and V. Kuncak. CrystalBall: Predicting and Preventing Inconsistencies in Deployed Distributed Systems. NSDI'09. Google ScholarDigital Library
- J. Yang, T. Chen, M. Wu, Z. Xu, X. Liu, H. Lin, M. Yang, F. Long, L. Zhang, and L. Zhou. MODIST: Transparent Model Checking of Unmodified Distributed Systems. NSDI'09. Google ScholarDigital Library
- Y. Yu, P. Manolios, and L. Lamport. Model checking tla+ specifications. CHARME '99. Google ScholarDigital Library
- M. Zaharia, T. Das, H. Li, S. Shenker, and I. Stoica. Discretized Streams: an Efficient and Fault-tolerant Model for Stream Processing on Large Clusters. HotCloud'12. Google ScholarDigital Library
- C. Zamfir and G. Candea. Execution Synthesis: A Technique for Automated Software Debugging. EuroSys '10. Google ScholarDigital Library
- W. Zhou, M. Sherr, T. Tao, X. Li, B. T. Loo, and Y. Mao. Efficient querying an maintenance of network provenance at internet-scale. SIGMOD '10. Google ScholarDigital Library
Index Terms
- Lineage-driven Fault Injection
Recommendations
Fault Injection
Fault-injection involves the deliberate insertion of faults or errors into a computer system in order to determine its response. It has proven to be an effective method for measuring the parameters of analytical dependability models, validating existing ...
iBiR: Bug-report-driven Fault Injection
Much research on software engineering relies on experimental studies based on fault injection. Fault injection, however, is not often relevant to emulate real-world software faults since it “blindly” injects large numbers of faults. It remains indeed ...
Fault Injection into VHDL Models: Experimental Validation of a Fault Tolerant Microcomputer System
EDCC-3: Proceedings of the Third European Dependable Computing Conference on Dependable ComputingThis work presents a campaign of fault injection to validate the dependability of a fault tolerant microcomputer system. The system is duplex with cold stand-by sparing, parity detection and a watchdog timer. The faults have been injected on a chip-...
Comments