
Systems Approaches to Tackling Configuration Errors: A Survey

Published: 21 July 2015

Abstract

In recent years, configuration errors (i.e., misconfigurations) have become one of the dominant causes of system failures, resulting in many severe service outages and downtime. Unfortunately, it is notoriously difficult for system users (e.g., administrators and operators) to prevent, detect, and troubleshoot configuration errors due to the complexity of the configurations as well as the systems under configuration. As a result, the cost of resolving configuration errors is often tremendous, both in compensating for service disruptions and in diagnosing and recovering from the failures. Their prevalence, severity, and cost have made configuration errors one of the thorniest system problems in need of a solution.

This survey article provides a holistic and structured overview of the systems approaches that tackle configuration errors. To understand the problem fundamentally, we first discuss the characteristics of configuration errors and the challenges of tackling such errors. Then, we discuss the state-of-the-art systems approaches that address different types of configuration errors in different scenarios. Our primary goal is to equip stakeholders with a better understanding of configuration errors and of the potential solutions for resolving them across the spectrum of system development and management. To inspire follow-up research, we further discuss the open problems with regard to system configuration. To the best of our knowledge, this is the first survey on the topic of tackling configuration errors.

References

  1. Bhavish Aggarwal, Ranjita Bhagwan, Tathagata Das, Siddharth Eswaran, Venkata N. Padmanabhan, and Geoffrey M. Voelker. 2009. NetPrints: Diagnosing home network misconfigurations using shared knowledge. In Proceedings of the 6th USENIX Symposium on Networked System Design and Implementation (NSDI’09). Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Alexander Alexandrov, Rico Bergmann, Stephan Ewen, Johann-Christoph Freytag, Fabian Hueske, Arvid Heise, Odej Kao, Marcus Leich, Ulf Leser, Volker Markl, Felix Naumann, Mathias Peters, Astrid Rheinländer, Matthias J. Sax, Sebastian Schelter, Mareike Höger, Kostas Tzoumas, and Daniel Warneke. 2014. The stratosphere platform for big data analytics. The International Journal on Very Large Data Bases (VLDBJ) 23, 6 (December 2014), 939--964. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Eric Anderson, Michael Hobbs, Kimberly Keeton, Susan Spence, Mustafa Uysal, and Alistair Veitch. 2002. Hippodrome: Running circles around storage administration. In Proceedings of the 1st USENIX Conference on File and Storage Technologies (FAST’02). Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Paul Anderson. 1994. Towards a high-level machine configuration system. In Proceedings of the 8th USENIX System Administration Conference (LISA-VIII). Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Paul Anderson and James Cheney. 2012. Toward provenance-based security for configuration languages. In Proceedings of the 4th USENIX Workshop on the Theory and Practice of Provenance (TaPP’12). Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Paul Anderson and Edmund Smith. 2005. Configuration tools: Working together. In Proceedings of the 19th Large Installation System Administration Conference (LISA’05). Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Marc Andreessen. 2011. Why software is eating the world. The Wall Street Journal (August 2011).Google ScholarGoogle Scholar
  8. Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy H. Katz, Andrew Konwinski, Gunho Lee, David A. Patterson, Ariel Rabkin, Ion Stoica, and Matei Zaharia. 2009. Above the Clouds: A Berkeley View of Cloud Computing. Technical Report UCB/EECS-2009-28. University of California Berkeley.Google ScholarGoogle Scholar
  9. Mona Attariyan, Michael Chow, and Jason Flinn. 2012. X-ray: Automating root-cause diagnosis of performance anomalies in production software. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (OSDI’12). Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Mona Attariyan and Jason Flinn. 2008. Using causality to diagnose configuration bugs. In Proceedings of 2008 USENIX Annual Technical Conference (USENIX ATC’08). Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Mona Attariyan and Jason Flinn. 2010. Automating configuration troubleshooting with dynamic information flow analysis. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (OSDI’10). Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Mona Attariyan and Jason Flinn. 2011. Automating configuration troubleshooting with ConfAid. USENIX ;login: 36, 1 (February 2011), 27--36.Google ScholarGoogle Scholar
  13. Rob Barrett, Eser Kandogan, Paul P. Maglio, Eben Haber, Leila A. Takayama, and Madhu Prabaker. 2004. Field studies of computer system administrators: Analysis of system management tools and practices. In Proceedings of the 2004 ACM Conference on Computer Supported Cooperative Work (CSCW’04). Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Luiz André Barroso and Urs Hölzle. 2009. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Morgan and Claypool Publishers. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Lujo Bauer, Scott Garriss, and Michael K. Reiter. 2011. Detecting and resolving policy misconfigurations in access-control systems. ACM Transactions on Information and System Security (TISSEC) 14, 1 (May 2011), 1--28. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Theophilus Benson, Aditya Akella, and David Maltz. 2009. Unraveling the complexity of network management. In Proceedings of the 6th USENIX Symposium on Networked System Design and Implementation (NSDI’09). Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Theophilus Benson, Aditya Akella, and Aman Shaikh. 2011. Demystifying configuration challenges and trade-offs in network-based ISP services. In Proceedings of 2011 Annual Conference of the ACM Special Interest Group on Data Communication (SIGCOMM’11). Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Al Bessey, Ken Block, Ben Chelf, Andy Chou, Bryan Fulton, Seth Hallem, Charles Henri-Gros, Asya Kamsky, Scott McPeak, and Dawson Engler. 2010. A few billion lines of code later: Using static analysis to find bugs in the real world. Commun. ACM 53, 2 (February 2010), 66--75. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Ricardo Bianchini, Richard P. Martin, Kiran Nagaraja, Thu D. Nguyen, and Fabio Oliveira. 2005. Human-aware computer system design. In Proceedings of the 10th Workshop on Hot Topics in Operating Systems (HotOS X). Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Charles Border and Kyrre Begnum. 2014. Educating system administrators. USENIX ;login: 39, 5 (October 2014), 36--39.Google ScholarGoogle Scholar
  21. Jon Brodkin. 2012. Why Gmail Went Down: Google Misconfigured Load Balancing Servers. Retrieved from http://arstechnica.com/information-technology/2012/12/why-gmail-went-down-google-misconfigured-ch-romes-sync-server/.Google ScholarGoogle Scholar
  22. Aaron Brown. 2001. Towards Availability and Maintainability Benchmarks: A Case Study of Software RAID Systems. Technical Report UCB//CSD-01-1132. University of California, Berkeley. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Aaron B. Brown and David A. Patterson. 2001. To err is human. In Proceedings of the 1st Workshop on Evaluating and Architecting System Dependability (EASY’01).Google ScholarGoogle Scholar
  24. Aaron B. Brown and David A. Patterson. 2003. Undo for Operators: Building an undoable e-mail store. In Proceedings of the 2003 USENIX Annual Technical Conference (USENIX ATC’03). Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Mark Burgess. 1995. A site configuration engine. USENIX Computing Systems 8, 3 (1995), 309--337.Google ScholarGoogle Scholar
  26. Cristian Cadar, Daniel Dunbar, and Dawson Engler. 2008. KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI’08). Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Jeffery S. Chase, Darrell C. Anderson, Prachi N. Thakar, Amin M. Vahdat, and Ronald P. Doyle. 2001. Managing energy and server resources in hosting centers. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP’01). Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Surajit Chaudhuri and Gerhard Weikum. 2000. Rethinking database system architecture: Towards a self-tuning risc-style database system. In Proceedings of the 26th International Conference on Very Large Data Bases (VLDB’00). Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Kai Chen, Chuanxiong Guo, Haitao Wu, Jing Yuan, Zhenqian Feng, Yan Chen, Songwu Lu, and Wenfei Wu. 2010. Generic and automatic address configuration for data center networks. In Proceedings of 2010 Annual Conference of the ACM Special Interest Group on Data Communication (SIGCOMM’10). Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Ken A. L. Coar. 1998. Apache Server For Dummies. IDG Books Worldwide Inc. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Computing Research Association. 2003. Grand Research Challenges in Information Systems. Technical Report. Retrieved from http://archive.cra.org/reports/gc.systems.pdf.Google ScholarGoogle Scholar
  32. Thomas Delaet, Wouter Joosen, and Bart Vanbrabant. 2010. A survey of system conguration tools. In Proceedings of the 24th Large Installation System Administration Conference (LISA’10). Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Narayan Desai, Rick Bradshaw, Scott Matott, Sandra Bittner, Susan Coghlan, Remy Evard, Cory Lueninghoener, Ti Leggett, John-Paul Navarro, Gene Rackow, Craig Stacey, and Tisha Stacey. 2005. A case study in configuration management tool deployment. In Proceedings of the 19th Large Installation System Administration Conference (LISA’05). Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. John DeTreville. 2005. Making system configuration more declarative. In Proceedings of the USENIX 10th Workshop on Hot Topics in Operating Systems (HotOS-X). Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Eelco Dolstra and Armijn Hemel. 2007. Purely functional system configuration management. In Proceedings of the USENIX 11th Workshop on Hot Topics in Operating Systems (HotOS-XI). Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Songyun Duan, Vamsidhar Thummala, and Shivnath Babu. 2009. Tuning database configuration parameters with ituned. In Proceedings of the 35th International Conference on Very Large Data Bases (VLDB’09).Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Kevin Elphinstone and Gernot Heiser. 2013. From L3 to seL4: What have we learnt in 20 years of L4 microkernels?. In Proceedings of the 24th Symposium on Operating System Principles (SOSP’13). Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. William Enck, Patrick McDaniel, Subhabrata Sen, Panagiotis Sebos, Sylke Spoerel, Albert Greenberg, Sanjay Rao, and William Aiello. 2007. Configuration management at massive scale: System design and experience. In Proceedings of 2007 USENIX Annual Technical Conference (USENIX ATC’07). Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. William Enck, Thomas Moyer, Patrick McDaniel, Subhabrata Sen, Panagiotis Sebos, Sylke Spoerel, Albert Greenberg, Yu-Wei Eric Sung, Sanjay Rao, and William Aiello. 2009. Configuration management at massive scale: System design and experience. IEEE Journal on Selected Areas in Communications (JSAC) 27, 3 (April 2009), 323--335. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Nick Feamster and Hari Balakrishnan. 2005. Detecting BGP configuration faults with static analysis. In Proceedings of the 2nd USENIX Symposium on Networked Systems Design and Implementation (NSDI’05). Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Patrick Goldsack, Julio Guijarro, Steve Loughran, Alistair Coles, Andrew Farrell, Antonio Lain, Paul Murray, and Peter Toft. 2009. The smartfrog configuration management framework. SIGOPS Operating System Review 43 (January 2009), 16--25. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Jim Gray. 1985. Why do computers stop and what can be done about it? Tandem Technical Report 85.7 (June 1985).Google ScholarGoogle Scholar
  43. Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, Thanh Do, Jeffry Adityatama, Kurnia J. Eliazar, Agung Laksono, Jeffrey F. Lukman, Vincentius Martin, and Anang D. Satria. 2014. What bugs live in the cloud? A study of 3000+ issues in cloud systems. In Proceedings of the 5th ACM Symposium on Cloud Computing (SoCC’14). Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Eben M. Haber and John Bailey. 2007. Design guidelines for system administration tools developed through ethnographic field study. In Proceedings of the 2007 ACM Conference on Human Interfaces to the Management of Information Technology (CHIMIT’07). Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Herodotos Herodotou and Shivnath Babu. 2011. Profiling, what-if analysis, and cost-based optimization of mapreduce programs. In Proceedings of the 37th International Conference on Very Large Data Bases (VLDB’11).Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. John A. Hewson, Paul Anderson, and Andrew D. Gordon. 2012. A declarative approach to automated configuration. In Proceedings of the 26th Large Installation System Administration Conference (LISA’12). Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Dennis G. Hrebec and Michael Stiber. 2001. A survey of system administrator mental models and situation awareness. In Proceedings of the 2001 ACM SIGCPR Conference on Computer Personnel Research (SIGCPR’01). Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. Peng Huang, William J. Bolosky, Abhishek Sigh, and Yuanyuan Zhou. 2015. KungfuValley: A systematic configuration validation framework for cloud services. In Proceedings of the 10th ACM European Conference in Computer Systems (EuroSys’15). Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Arnaud Hubaux, Yingfei Xiong, and Krzysztof Czarnecki. 2012. A user survey of configuration challenges in linux and ecos. In Proceedings of 6th International Workshop on Variability Modeling of Software-intensive Systems (VaMoS’12). Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. Fabian Hueske. 2015. Juggling with Bits and Bytes. Retrieved from http://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html.Google ScholarGoogle Scholar
  51. Weihang Jiang, Chongfeng Hu, Shankar Pasupathy, Arkady Kanevsky, Zhenmin Li, and Yuanyuan Zhou. 2009. Understanding customer problem troubleshooting from storage system logs. In Proceedings of the 7th USENIX Conference on File and Storage Technologies (FAST’09). Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. Lorenzo Keller, Prasang Upadhyaya, and George Candea. 2008. ConfErr: A tool for assessing resilience to human configuration errors. In Proceedings of the 38th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’08).Google ScholarGoogle ScholarCross RefCross Ref
  53. Stuart Kendrick. 2012. What takes us down? USENIX ;login: 37, 5 (October 2012), 37--45.Google ScholarGoogle Scholar
  54. Jeffrey O. Kephart and David M. Chess. 2003. The vision of autonomic computing. IEEE Computer 36, 1 (January 2003), 41--50. Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. Emre Kiciman and Yi-Min Wang. 2004. Discovering correctness constraints for self-management of system configuration. In Proceedings of the 1st International Conference on Autonomic Computing (ICAC’04). Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. Andrew J. Ko, Robin Abraham, Laura Beckwith, Alan Blackwell, Margaret Burnett, Martin Erwig, Chris Scaffidi, Joseph Lawrance, Henry Lieberman, Brad Myers, Mary Beth Rosson, Gregg Rothermel, Mary Shaw, and Susan Wiedenbeck. 2011. The state of the art in end-user software engineering. ACM Computing Surveys (CSUR) 43, 3 (April 2011). Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. Eddie Kohler, Benjie Chen, M. Frans Kaashoek, Robert Morris, and Massimiliano Poletto. 2000. Programming Language Techniques for Modular Router Configurations. Technical Report MIT-LCS-TR-812. MIT Laboratory for Computer Science.Google ScholarGoogle Scholar
  58. Nate Kushman and Dina Katabi. 2010. Enabling configuration-independent automation by non-expert users. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (OSDI’10). Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. Craig Labovitz, Abha Ahuja, and Farnam Jahanian. 1999. Experimental study of internet stability and backbone failures. In Proceedings of the 29th Annual International Symposium on Fault-Tolerant Computing (FTCS’99). Google ScholarGoogle ScholarDigital LibraryDigital Library
  60. Jean-Claude Laprie. 1995. Dependable computing: Concepts, limits, challenges. In Proceedings of the 25th IEEE International Symposium on Fault-Tolerant Computing (FTCS’95). Google ScholarGoogle ScholarDigital LibraryDigital Library
  61. William LeFebvre and David Snyder. 2004. Auto-configuration by file construction: Configuration management with newfig. In Proceedings of the 18th USENIX Large Installation System Administration Conference (LISA’04). Google ScholarGoogle ScholarDigital LibraryDigital Library
  62. Lim Yan Liang. 2013. Linkedin.com inaccessible on Thursday because of server misconfiguration. Retrieved from http://www.straitstimes.com/breaking-news/singapore/story/linkedincom-inaccessible-thursday-because-server-misconfiguration-2013.Google ScholarGoogle Scholar
  63. Cory Lueninghoener. 2011. Getting started with configuration management. USENIX ;login: 36, 2 (April 2011), 12--17.Google ScholarGoogle Scholar
  64. Ratul Mahajan, David Wetherall, and Tom Anderson. 2002. Understanding BGP misconfiguration. In Proceedings of 2002 Annual Conference of the ACM Special Interest Group on Data Communication (SIGCOMM’02). Google ScholarGoogle ScholarDigital LibraryDigital Library
  65. Justin Mason. 2011. Against The Use of Programming Languages in Configuration Files. Retrieved from http://taint.org/2011/02/18/001527a.html. (2011).Google ScholarGoogle Scholar
  66. Roy A. Maxion and Robert W. Reeder. 2005. Improving user-interface dependability through mitigation of human error. International Journal of Human-Computer Studies 63, 1-2 (July 2005), 25--50. Google ScholarGoogle ScholarDigital LibraryDigital Library
  67. Paul McNamara. 2009. Missing dot drops Sweden off the Internet. Retrieved from http://www.computerworld.com/article/2529287/networking/opinion--missing-dot-drops-sweden-off-the-internet.html.Google ScholarGoogle Scholar
  68. James Mickens, Martin Szummer, and Dushyanth Narayanan. 2007. Snitch: Interactive decision trees for troubleshooting misconfigurations. In Proceedings of the 2nd USENIX Workshop on Tackling Computer Systems Problems with Machine Learning Techniques (SYSML’07). Google ScholarGoogle ScholarDigital LibraryDigital Library
  69. Barton P. Miller, David Koski, Cjin Pheow Lee, Vivekananda Maganty, Ravi Murthy, Ajitkumar Natarajan, and Jeff Steidl. 1995. Fuzz Revisited: A Re-Examination of the Reliability of UNIX Utilities and Services. Technical Report No. 1268. University of Wisconsin-Madison, Computer Sciences Department.Google ScholarGoogle Scholar
  70. Rich Miller. 2012. Microsoft: Misconfigured Network Device Caused Azure Outage. Retrieved from http://www.datacenterknowledge.com/archives/2012/07/28/microsoft-misconfigured-network-device-caused-azure-outage/.Google ScholarGoogle Scholar
  71. Rolf Molich and Jakob Nielsen. 1990. Improving a human-computer dialogue. Communications of the ACM 33, 3 (March 1990), 338--348. Google ScholarGoogle ScholarDigital LibraryDigital Library
  72. Sarah Nadi, Thorsten Berger, Christian Kästner, and Krzysztof Czarnecki. 2014. Mining configuration constraints: Static analyses and empirical results. In Proceedings of the 36th International Conference on Software Engineering (ICSE’14). Google ScholarGoogle ScholarDigital LibraryDigital Library
  73. Kiran Nagaraja, Fábio Oliveira, Ricardo Bianchini, Richard P. Martin, and Thu D. Nguyen. 2004. Understanding and dealing with operator mistakes in internet services. In Proceedings of the 6th USENIX Conference on Operating Systems Design and Implementation (OSDI’04). Google ScholarGoogle ScholarDigital LibraryDigital Library
  74. Sanjai Narain. 2005. Network configuration management via model finding. In Proceedings of the 19th Large Installation System Administration Conference (LISA’05). Google ScholarGoogle ScholarDigital LibraryDigital Library
  75. Sanjai Narain, Gary Levin, Vikram Kaul, and Sharad Malik. 2008. Declarative infrastructure configuration synthesis and debugging. Journal of Network and System Management 16, 3 (October 2008), 235--258. Google ScholarGoogle ScholarDigital LibraryDigital Library
  76. Jakob Nielsen and Rolf Molich. 1990. Heuristic evaluation of user interfaces. In Proceedings of the ACM CHI 90 Human Factors in Computing Systems Conference (CHI’90). Google ScholarGoogle ScholarDigital LibraryDigital Library
  77. Donald A. Norman. 1983a. Design principles for human-computer interfaces. In Proceedings of the ACM CHI 83 Human Factors in Computing Systems Conference (CHI’83). Google ScholarGoogle ScholarDigital LibraryDigital Library
  78. Donald A. Norman. 1983b. Design rules based on analyses of human error. Communications of the ACM 26, 4 (April 1983), 254--258. Google ScholarGoogle ScholarDigital LibraryDigital Library
  79. Fábio Oliveira, Kiran Nagaraja, Rekha Bachwani, Ricardo Bianchini, Richard P. Martin, and Thu D. Nguyen. 2006. Understanding and validating database system administration. In Proceedings of 2006 USENIX Annual Technical Conference (USENIX ATC’06). Google ScholarGoogle ScholarDigital LibraryDigital Library
  80. Fábio Oliveira, Andrew Tjang, Ricardo Bianchini, Richard P. Martin, and Thu D. Nguyen. 2010. Barricade: Defending systems against operator mistakes. In Proceedings of the 5th ACM European Conference in Computer Systems (EuroSys’10). Google ScholarGoogle ScholarDigital LibraryDigital Library
  81. OpenLDAP mailing list. 2004. Re: BINDDN in ldap.conf. Retrieved from http://www.openldap.org/lists/openldap-software/200407/msg00648.html. (2004).Google ScholarGoogle Scholar
  82. David Oppenheimer, Archana Ganapathi, and David A. Patterson. 2003. Why do internet services fail, and what can be done about it?. In Proceedings of the 4th USENIX Symposium on Internet Technologies and Systems (USITS’03). Google ScholarGoogle ScholarDigital LibraryDigital Library
  83. Takayuki Osogami and Toshinari Itoko. 2006. Finding probably better system configurations quickly. In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS/Performance’06). Google ScholarGoogle ScholarDigital LibraryDigital Library
  84. Noam Palatin, Arie Leizarowitz, Assaf Schuster, and Ran Wolff. 2006. Mining for misconfigured machines in grid systems. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’06). Google ScholarGoogle ScholarDigital LibraryDigital Library
  85. David Patterson, Aaron Brown, Pete Broadwell, George Candea, Mike Chen, James Cutler, Patricia Enriquez, Armando Fox, Emre Kiciman, Matthew Merzbacher, David Oppenheimer, Naveen Sastry, William Tetzlaff, Jonathan Traupman, and Noah Treuhaft. 2002. Recovery-Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies. Technical Report No. UCB//CSD-02-1175. University of California, Berkeley. Google ScholarGoogle ScholarDigital LibraryDigital Library
  86. Charles Perrow. 1984. Normal Accidents: Living with High-Risk Technologies. Basic Books.Google ScholarGoogle Scholar
  87. Ariel Rabkin and Randy Katz. 2011a. Precomputing possible configuration error diagnosis. In Proceedings of the 26th IEEE/ACM International Conference on Automated Software Engineering (ASE’11). Google ScholarGoogle ScholarDigital LibraryDigital Library
  88. Ariel Rabkin and Randy Katz. 2011b. Static extraction of program configuration options. In Proceedings of the 33rd International Conference on Software Engineering (ICSE’11). Google ScholarGoogle ScholarDigital LibraryDigital Library
  89. Ariel Rabkin and Randy Katz. 2013. How Hadoop clusters break. IEEE Software Magazine 30, 4 (July 2013), 88--94. Google ScholarGoogle ScholarDigital LibraryDigital Library
  90. Vinod Ramachandran, Manish Gupta, Manish Sethi, and Soudip Roy Chowdhury. 2009. Determining configuration parameter dependencies via analysis of configuration data from multi-tiered enterprise applications. In Proceedings of the 6th International Conference on Autonomic Computing and Communications (ICAC’09). Google ScholarGoogle ScholarDigital LibraryDigital Library
  91. James Reason. 1990. Human Error. Cambridge University Press.Google ScholarGoogle Scholar
  92. Charles Reiss, Alexey Tumanov, Gregory R. Ganger, Randy H. Katz, and Michael A. Kozuch. 2012. Heterogeneity and dynamicity of clouds at scale: Google trace analysis. In Proceedings of the 3rd ACM Symposium on Cloud Computing (SoCC’12). Google ScholarGoogle ScholarDigital LibraryDigital Library
  93. William F. Slater. 2002. The Internet Outage and Attacks of October 2002. Retrieved from http://www.isoc-chicago.org/internetoutage.pdf.Google ScholarGoogle Scholar
  94. Yee Jiun Song, Flavio Junqueira, and Benjamin Reed. 2009. BFT for the skeptics. In Proceedings of the Workshop on Theory and Practice of Byzantine Fault Tolerance (BFTW3).Google ScholarGoogle Scholar
  95. StackOverflow. 2015. 2015 Developer Survey. Retrieved from http://stackoverflow.com/research/developer-survey-2015#profile-education.Google ScholarGoogle Scholar
  96. Ya-Yunn Su, Mona Attariyan, and Jason Flinn. 2007. AutoBash: Improving configuration management with operating system causality analysis. In Proceedings of the 21st ACM Symposium on Operating Systems Principles (SOSP’07). Google ScholarGoogle ScholarDigital LibraryDigital Library
  97. Ya-Yunn Su and Jason Flinn. 2009. Automatically generating predicates and solutions for configuration troubleshooting. In Proceedings of 2009 USENIX Annual Technical Conference (USENIX ATC’09). Google ScholarGoogle ScholarDigital LibraryDigital Library
  98. Keir Thomas. 2011. Thanks, Amazon: The Cloud Crash Reveals Your Importance. Retrieved from http://www.pcworld.com/article/226033/thanks_amazon_for_making_possible_much_of_the_internet.html.Google ScholarGoogle Scholar
  99. Bart Vanbrabant, Joris Peeraer, and Wouter Joosen. 2011. Fine-grained access-control for the puppet configuration language. In Proceedings of the 25th Large Installation System Administration Conference (LISA’11). Google ScholarGoogle ScholarDigital LibraryDigital Library
  100. Nicole F. Velasquez, Suzanne Weisband, and Alexandra Durcikova. 2008. Designing tools for system administrators: An empirical test of the integrated user satisfaction model. In Proceedings of the 22nd Large Installation System Administration Conference (LISA’08). Google ScholarGoogle ScholarDigital LibraryDigital Library
  101. Chad Verbowski, Emre Kiciman, Arunvijay Kumar, Brad Daniels, Shan Lu, Juhan Lee, Yi-Min Wang, and Roussi Roussev. 2006a. Flight data recorder: Monitoring persistent-state interactions to improve systems management. In Proceedings of the 7th USENIX Conference on Operating Systems Design and Implementation (OSDI’06). Google ScholarGoogle ScholarDigital LibraryDigital Library
  102. Chad Verbowski, Juhan Lee, Xiaogang Liu, Roussi Roussev, and Yi-Min Wang. 2006b. LiveOps: Systems management as a service. In Proceedings of the 20th Large Installation System Administration Conference (LISA’06). Google ScholarGoogle ScholarDigital LibraryDigital Library
  103. Lev Walkin. 2012. Comment on “Why Config?” Retrieved from http://robey.lag.net//2012/03/26/why-config.html.Google ScholarGoogle Scholar
  104. Helen J. Wang, John C. Platt, Yu Chen, Ruyun Zhang, and Yi-Min Wang. 2004. Automatic misconfiguration troubleshooting with peerpressure. In Proceedings of the 6th USENIX Conference on Operating Systems Design and Implementation (OSDI’04). Google ScholarGoogle ScholarDigital LibraryDigital Library
  105. Yi-Min Wang, Chad Verbowski, John Dunagan, Yu Chen, Helen J. Wang, Chun Yuan, and Zheng Zhang. 2003. STRIDER: A black-box, state-based approach to change and configuration management and support. In Proceedings of the 17th Large Installation Systems Administration Conference (LISA’03). Google ScholarGoogle ScholarDigital LibraryDigital Library
  106. Matt Welsh. 2013. What I wish systems researchers would work on. Retrieved from http://matt-welsh.blogspot.com/2013/05/what-i-wish-systems-researchers-would.html.Google ScholarGoogle Scholar
  107. Andrew Whitaker, Richard S. Cox, and Steven D. Gribble. 2004. Configuration debugging as search: Finding the needle in the haystack. In Proceedings of the 6th USENIX Conference on Operating Systems Design and Implementation (OSDI’04). Google ScholarGoogle ScholarDigital LibraryDigital Library
  108. Wikipedia. 2014. System administrator. Retrieved from http://en.wikipedia.org/wiki/System_administrator.Google ScholarGoogle Scholar
  109. John Wilkes. 2011. Ωmega: Cluster management at Google. Retrieved from https://www.youtube.com/watch?feature=player_detailpage&v==0ZFMlO98Jkct=1674s.Google ScholarGoogle Scholar
  110. Bowei Xi, Zhen Liu, Mukund Raghavachari, Cathy H. Xia, and Li Zhang. 2004. A smart hill-climbing algorithm for application server configuration. In Proceedings of the 13th International World Wide Web Conference (WWW’04).
  111. Yingfei Xiong, Arnaud Hubaux, Steven She, and Krzysztof Czarnecki. 2012. Generating range fixes for software configuration. In Proceedings of the 34th International Conference on Software Engineering (ICSE’12).
  112. Tianyin Xu, Long Jin, Xuepeng Fan, Yuanyuan Zhou, Shankar Pasupathy, and Rukma Talwadker. 2015. Hey, you have given me too many knobs!—Understanding and dealing with over-designed configuration in system software. In Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE’15).
  113. Tianyin Xu, Jiaqi Zhang, Peng Huang, Jing Zheng, Tianwei Sheng, Ding Yuan, Yuanyuan Zhou, and Shankar Pasupathy. 2013. Do not blame users for misconfigurations. In Proceedings of the 24th Symposium on Operating System Principles (SOSP’13).
  114. Tao Ye and Shivkumar Kalyanaraman. 2003. A recursive random search algorithm for large-scale network parameter configuration. In Proceedings of the International Conference on Measurements and Modeling of Computer Systems (SIGMETRICS’03).
  115. Zuoning Yin, Xiao Ma, Jing Zheng, Yuanyuan Zhou, Lakshmi N. Bairavasundaram, and Shankar Pasupathy. 2011. An empirical study on configuration errors in commercial and open source systems. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP’11).
  116. Chun Yuan, Ni Lao, Ji-Rong Wen, Jiwei Li, Zheng Zhang, Yi-Min Wang, and Wei-Ying Ma. 2006. Automated known problem diagnosis with event traces. In Proceedings of the 1st ACM European Conference in Computer Systems (EuroSys’06).
  117. Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, and Michael Stumm. 2014. Simple testing can prevent most critical failures: An analysis of production failures in distributed data-intensive systems. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI’14).
  118. Ding Yuan, Yinglian Xie, Rina Panigrahy, Junfeng Yang, Chad Verbowski, and Arunvijay Kumar. 2011a. Context-based online configuration error detection. In Proceedings of 2011 USENIX Annual Technical Conference (USENIX ATC’11).
  119. Ding Yuan, Jing Zheng, Soyeon Park, Yuanyuan Zhou, and Stefan Savage. 2011b. Improving software diagnosability via log enhancement. In Proceedings of the 16th International Conference on Architecture Support for Programming Languages and Operating Systems (ASPLOS-XVI).
  120. Andreas Zeller. 2009. Why Programs Fail: A Guide to Systematic Debugging (2nd ed.). Morgan Kaufmann Publishers.
  121. Gong Zhang and Ling Liu. 2011. Why do migrations fail and what can we do about it? In Proceedings of the 25th USENIX Large Installation System Administration Conference (LISA’11).
  122. Jiaqi Zhang, Lakshmi Renganarayana, Xiaolan Zhang, Niyu Ge, Vasanth Bala, Tianyin Xu, and Yuanyuan Zhou. 2014. EnCore: Exploiting system environment and correlation information for misconfiguration detection. In Proceedings of the 19th International Conference on Architecture Support for Programming Languages and Operating Systems (ASPLOS’14).
  123. Sai Zhang and Michael D. Ernst. 2013. Automated diagnosis of software configuration errors. In Proceedings of the 35th International Conference on Software Engineering (ICSE’13).
  124. Sai Zhang and Michael D. Ernst. 2014. Which configuration option should I change? In Proceedings of the 36th International Conference on Software Engineering (ICSE’14).
  125. Wei Zheng, Ricardo Bianchini, and Thu D. Nguyen. 2007. Automatic configuration of internet services. In Proceedings of the 2nd ACM European Conference in Computer Systems (EuroSys’07).

                • Published in

                  ACM Computing Surveys, Volume 47, Issue 4
                  July 2015
                  573 pages
                  ISSN: 0360-0300
                  EISSN: 1557-7341
                  DOI: 10.1145/2775083
                  • Editor: Sartaj Sahni

                  Copyright © 2015 ACM

                  Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

                  Publisher

                  Association for Computing Machinery

                  New York, NY, United States

                  Publication History

                  • Published: 21 July 2015
                  • Accepted: 1 June 2015
                  • Revised: 1 February 2015
                  • Received: 1 August 2014

                  Qualifiers

                  • survey
                  • Research
                  • Refereed
