Abstract
With the ever-growing complexity and dynamicity of computer systems, proactive fault management is an effective approach to enhancing availability. Online failure prediction is the key to such techniques. In contrast to classical reliability methods, online failure prediction is based on runtime monitoring and on a variety of models and methods that use the current state of a system and, frequently, past experience as well. This survey describes these methods. To capture the wide spectrum of approaches in this area, a taxonomy has been developed; its branches are explained and the major concepts are described in detail.
Supplemental Material
Online appendix to "A survey of online failure prediction methods," article 10.