A survey of online failure prediction methods

Authors:
Felix Salfner, Humboldt-Universität zu Berlin, Germany
Maren Lenk, Humboldt-Universität zu Berlin, Germany
Miroslaw Malek, Humboldt-Universität zu Berlin, Germany

ACM Computing Surveys, Volume 42, Issue 3, Article No. 10, pp. 1–42
https://doi.org/10.1145/1670679.1670680

Published: 29 March 2010

Abstract

With the ever-growing complexity and dynamicity of computer systems, proactive fault management is an effective approach to enhancing availability. Online failure prediction is the key to such techniques. In contrast to classical reliability methods, online failure prediction is based on runtime monitoring and on a variety of models and methods that use the current state of a system and, frequently, past experience as well. This survey describes these methods. To capture the wide spectrum of approaches in this area, a taxonomy has been developed; the approaches it covers are explained and the major concepts are described in detail.

Supplemental Material

a10-salfner-apndx.pdf (248.5 KB): online appendix to "A survey of online failure prediction methods" (article 10).

Published in

ACM Computing Surveys Volume 42, Issue 3
March 2010
146 pages
ISSN:0360-0300
EISSN:1557-7341
DOI:10.1145/1670679

Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 29 March 2010
- Accepted: 1 October 2008
- Revised: 1 June 2008
- Received: 1 July 2007
Author Tags
Error
failure prediction
fault
prediction metrics
runtime monitoring
Qualifiers
- research-article
- Research
- Refereed
