- Bernstein, P., Hadzilacos, V., Goodman, N. Distributed recovery. Concurrency Control and Recovery in Database Systems, Chapter 7. Addison-Wesley, 1986; http://research.microsoft.com/en-us/people/philbe/chapter7.pdf.
- Corbett, J. C. et al. Spanner: Google's globally distributed database. In Proceedings of the 10th Usenix Symposium on Operating Systems Design and Implementation, 2012; https://www.usenix.org/conference/osdi12/technical-sessions/presentation/corbett.
- Garduno, E., Kavulya, S. P., Tan, J., Gandhi, R., Narasimhan, P. Theia: Visual signatures for problem diagnosis in large Hadoop clusters. In Proceedings of the 26th International Conference on Large Installation System Administration, 2012, 33--42; https://users.ece.cmu.edu/~spertet/papers/hadoopvis-lisa12-cameraready-v3.pdf.
- Geels, D., Altekar, G., Maniatis, P., Roscoe, T., Stoica, I. Friday: Global comprehension for distributed replay. In Proceedings of the 4th Usenix Conference on Networked Systems Design and Implementation, 2007; https://www.usenix.org/legacy/event/nsdi07/tech/full_papers/geels/geels.pdf.
- Hawblitzel, C., Howell, J., Kapritsos, M., Lorch, J. R., Parno, B., Roberts, M. L., Setty, S., Zill, B. IronFleet: Proving practical distributed systems correct. In Proceedings of the 25th Symposium on Operating Systems Principles, 2015; http://sigops.org/sosp/sosp15/current/2015-Monterey/250-hawblitzel-online.pdf.
- Isaacs, K. E. et al. Combing the communication hairball: Visualizing parallel execution traces using logical time. IEEE Transactions on Visualization and Computer Graphics 20, 12 (Dec. 2014), 2349--2358.
- Killian, C., Anderson, J. W., Jhala, R., Vahdat, A. Life, death, and the critical transition: Finding liveness bugs in systems code. In Proceedings of the 4th Usenix Conference on Networked Systems Design and Implementation, 2007; https://www.usenix.org/legacy/event/nsdi07/tech/killian/killian.pdf.
- Liu, X., Guo, Z., Wang, X., Chen, F., Lian, X., Tang, J., Wu, M., Kaashoek, M. F., Zhang, Z. D3S: Debugging deployed distributed systems. In Proceedings of the 5th Usenix Symposium on Networked Systems Design and Implementation, 2008, 423--437; http://static.usenix.org/event/nsdi08/tech/full_papers/liu_xuezheng/liu_xuezheng.pdf.
- Mace, J., Roelke, R., Fonseca, R. Pivot tracing: Dynamic causal monitoring for distributed systems. In Proceedings of the 25th Symposium on Operating Systems Principles, 2015, 378--393; http://sigops.org/sosp/sosp15/current/2015-Monterey/122-mace-online.pdf.
- Mattern, F. Virtual time and global states of distributed systems. In Proceedings of the International Workshop on Parallel and Distributed Algorithms, 1989; http://homes.cs.washington.edu/~arvind/cs425/doc/mattern89virtual.pdf.
- Newcombe, C., Rath, T., Zhang, F., Munteanu, B., Brooker, M., Deardeuff, M. How Amazon Web Services uses formal methods. Commun. ACM 58, 4 (2015), 66--73; http://cacm.acm.org/magazines/2015/4/184701-how-amazon-web-services-uses-formal-methods/fulltext.
- Project Voldemort; http://www.project-voldemort.com/voldemort/.
- Sambasivan, R. R., Fonseca, R., Shafer, I., Ganger, G. So, you want to trace your distributed system? Key design insights from years of practical experience. Parallel Data Laboratory, Carnegie Mellon University, 2014; http://www.pdl.cmu.edu/PDL-FTP/SelfStar/CMU-PDL-14-102.pdf.
- Scott, C. et al. Minimizing faulty executions of distributed systems. In Proceedings of the 13th Usenix Symposium on Networked Systems Design and Implementation (Santa Clara, CA, Mar. 16--18, 2016), 291--309.
- Sigelman, B. H., Barroso, L. A., Burrows, M., Stephenson, P., Plakal, M., Beaver, D., Jaspan, S., Shanbhag, C. Dapper, a large-scale distributed systems tracing infrastructure. Research at Google, 2010; http://research.google.com/pubs/pub36356.html.
- Wilcox, J. R., Woos, D., Panchekha, P., Tatlock, Z., Wang, X., Ernst, M. D., Anderson, T. Verdi: A framework for implementing and formally verifying distributed systems. In Proceedings of the 36th SIGPLAN Conference on Programming Language Design and Implementation, 2015, 357--368; https://homes.cs.washington.edu/~ztatlock/pubs/verdi-wilcox-pldi15.pdf.
- Xu, W., Huang, L., Fox, A., Patterson, D., Jordan, M. Experience mining Google's production console logs. In Proceedings of the Workshop on Managing Systems via Log Analysis and Machine Learning Techniques, 2010; http://iiis.tsinghua.edu.cn/~weixu/files/slaml10.pdf.
- Yang, J. et al. MoDist: Transparent model checking of unmodified distributed systems. In Proceedings of the 6th Usenix Symposium on Networked Systems Design and Implementation, 2009, 213--228; https://www.usenix.org/legacy/event/nsdi09/tech/full_papers/yang/yang_html/.