research-article

Detecting large-scale system problems by mining console logs

Authors:
Wei Xu, University of California at Berkeley, Berkeley, CA, USA
Ling Huang, Intel Labs Berkeley, Berkeley, CA, USA
Armando Fox, University of California at Berkeley, Berkeley, CA, USA
David Patterson, University of California at Berkeley, Berkeley, CA, USA
Michael I. Jordan, University of California at Berkeley, Berkeley, CA, USA

SOSP '09: Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles, October 2009, Pages 117–132, https://doi.org/10.1145/1629575.1629587

Published: 11 October 2009

ABSTRACT

Surprisingly, console logs rarely help operators detect problems in large-scale datacenter services, for they often consist of the voluminous intermixing of messages from many software components written by independent developers. We propose a general methodology to mine this rich source of information to automatically detect system runtime problems. We first parse console logs by combining source code analysis with information retrieval to create composite features. We then analyze these features using machine learning to detect operational problems. We show that our method enables analyses that are impossible with previous methods because of its superior ability to create sophisticated features. We also show how to distill the results of our analysis to an operator-friendly one-page decision tree showing the critical messages associated with the detected problems. We validate our approach using the Darkstar online game server and the Hadoop File System, where we detect numerous real problems with high accuracy and few false positives. In the Hadoop case, we are able to analyze 24 million lines of console logs in 3 minutes. Our methodology works on textual console logs of any size and requires no changes to the service software, no human input, and no knowledge of the software's internals.
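
The feature-creation step described in the abstract can be illustrated with a minimal sketch. It assumes the message templates are already known (in the paper's setting they would be recovered from the logging statements in the source code; here they are hand-written regular expressions), and it builds one plausible kind of composite feature: a per-identifier count of each message type. The template names, the `blk_` identifier format, and the sample log lines are all hypothetical and are not taken from real Hadoop output. The machine-learning detection step is not shown.

```python
import re
from collections import Counter, defaultdict

# Hypothetical message templates. In the approach the abstract describes,
# templates would come from source code analysis, not be written by hand.
TEMPLATES = {
    "receiving": re.compile(r"Receiving block (?P<id>blk_\d+)"),
    "received": re.compile(r"Received block (?P<id>blk_\d+) of size \d+"),
    "deleting": re.compile(r"Deleting block (?P<id>blk_\d+)"),
}
# Fixed ordering of message types so every vector has the same layout.
MESSAGE_TYPES = sorted(TEMPLATES)


def parse_line(line):
    """Return (message_type, identifier) for a matching line, else None."""
    for name, pattern in TEMPLATES.items():
        match = pattern.search(line)
        if match:
            return name, match.group("id")
    return None


def count_vectors(lines):
    """Group parsed messages by identifier and count each message type."""
    counts = defaultdict(Counter)
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            name, ident = parsed
            counts[ident][name] += 1
    return {
        ident: [per_type[name] for name in MESSAGE_TYPES]
        for ident, per_type in counts.items()
    }


# Illustrative log lines only; unmatched lines are skipped.
SAMPLE_LOG = [
    "Receiving block blk_1 src: /10.0.0.1",
    "Received block blk_1 of size 67108864",
    "Receiving block blk_2 src: /10.0.0.2",
    "Receiving block blk_2 src: /10.0.0.3",
    "Deleting block blk_1",
    "unrelated startup message",
]

vectors = count_vectors(SAMPLE_LOG)
for ident in sorted(vectors):
    print(ident, dict(zip(MESSAGE_TYPES, vectors[ident])))
```

Each resulting vector has one entry per message type, in the order given by `MESSAGE_TYPES`. A statistical method such as PCA (listed among the author tags) could then be applied to the matrix of these vectors to flag identifiers whose message pattern is unusual.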

Index Terms

1. Software and its engineering
  1. Software creation and management
    1. Software verification and validation
      1. Software defect analysis
        1. Software testing and debugging
  2. Software organization and properties
    1. Extra-functional properties
      1. Software fault tolerance

Published in
SOSP '09: Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
October 2009
346 pages
ISBN: 9781605587523
DOI: 10.1145/1629575
General Chair: Jeanna Neefe Matthews, Clarkson University
Program Chair: Thomas Anderson, University of Washington
Copyright © 2009 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags
console log analysis
monitoring
pca
problem detection
source code analysis
statistical learning
tracing

Acceptance Rates
Overall Acceptance Rate: 131 of 716 submissions, 18%