Abstract
In recent years, configuration errors (i.e., misconfigurations) have become one of the dominant causes of system failures, resulting in many severe service outages and downtime. Unfortunately, it is notoriously difficult for system users (e.g., administrators and operators) to prevent, detect, and troubleshoot configuration errors, owing to the complexity of both the configurations and the systems being configured. As a result, the cost of resolving configuration errors is often tremendous, covering both compensation for service disruptions and the effort of diagnosing and recovering from the failures. Their prevalence, severity, and cost have made configuration errors one of the thorniest system problems in need of a solution.
This survey article provides a holistic and structured overview of the systems approaches that tackle configuration errors. To understand the problem fundamentally, we first discuss the characteristics of configuration errors and the challenges of tackling them. We then discuss the state-of-the-art systems approaches that address different types of configuration errors in different scenarios. Our primary goal is to equip stakeholders with a better understanding of configuration errors and of the potential solutions for resolving them across the spectrum of system development and management. To inspire follow-up research, we further discuss open problems in system configuration. To the best of our knowledge, this is the first survey on the topic of tackling configuration errors.
Index Terms
- Systems Approaches to Tackling Configuration Errors: A Survey