Failures in large scale systems: long-term measurement, analysis, and implications

Authors:
Saurabh Gupta (Intel Labs)
Tirthak Patel (Northeastern University)
Christian Engelmann (Oak Ridge National Laboratory)
Devesh Tiwari (Northeastern University)

SC '17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, November 2017, Article No.: 44, Pages 1–12. https://doi.org/10.1145/3126908.3126937

Published: 12 November 2017

ABSTRACT

Resilience is one of the key challenges in maintaining high efficiency of future extreme-scale supercomputers. Researchers and system practitioners rely on field-data studies to understand reliability characteristics and plan for future HPC systems. In this work, we compare and contrast the reliability characteristics of multiple large-scale HPC production systems. Our study covers more than one billion compute node hours across five different systems over a period of 8 years. We confirm previous findings that continue to be valid, discover new findings, and discuss their implications.

Published in
SC '17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2017
801 pages
ISBN: 9781450351140
DOI: 10.1145/3126908
General Chair: Bernd Mohr, Jülich Supercomputing Center, Jülich, Germany
Program Chair: Padma Raghavan, Vanderbilt University, Nashville, TN
Copyright © 2017 ACM
© 2017 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
SC '17 Paper Acceptance Rate: 61 of 327 submissions, 19%
Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%