ABSTRACT
Surprisingly, console logs rarely help operators detect problems in large-scale datacenter services, because they often consist of a voluminous intermixing of messages from many software components written by independent developers. We propose a general methodology to mine this rich source of information to automatically detect system runtime problems. We first parse console logs by combining source code analysis with information retrieval to create composite features. We then analyze these features using machine learning to detect operational problems. We show that our method's superior ability to create sophisticated features enables analyses that are impossible with previous methods. We also show how to distill the results of our analysis into an operator-friendly one-page decision tree showing the critical messages associated with the detected problems. We validate our approach using the Darkstar online game server and the Hadoop File System, where we detect numerous real problems with high accuracy and few false positives. In the Hadoop case, we are able to analyze 24 million lines of console logs in 3 minutes. Our methodology works on textual console logs of any size and requires no changes to the service software, no human input, and no knowledge of the software's internals.