research-article

A survey of online failure prediction methods

Published: 29 March 2010

Abstract

With the ever-growing complexity and dynamicity of computer systems, proactive fault management is an effective approach to enhancing availability. Online failure prediction is the key to such techniques. In contrast to classical reliability methods, online failure prediction is based on runtime monitoring and on a variety of models and methods that use the current state of a system and, frequently, past experience as well. This survey describes these methods. To capture the wide spectrum of approaches in this area, a taxonomy has been developed; the approaches it classifies are explained and the major concepts are described in detail.

Published in

ACM Computing Surveys, Volume 42, Issue 3
March 2010
146 pages
ISSN: 0360-0300
EISSN: 1557-7341
DOI: 10.1145/1670679

Copyright © 2010 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 29 March 2010
• Accepted: 1 October 2008
• Revised: 1 June 2008
• Received: 1 July 2007