ABSTRACT
Resilience is one of the key challenges in maintaining high efficiency on future extreme-scale supercomputers. Researchers and system practitioners rely on field-data studies to understand reliability characteristics and to plan for future HPC systems. In this work, we compare and contrast the reliability characteristics of multiple large-scale HPC production systems. Our study covers more than one billion compute node hours across five different systems over a period of eight years. We confirm previous findings that remain valid, report new findings, and discuss their implications.