DOI: 10.1145/3126908.3126937

Failures in large scale systems: long-term measurement, analysis, and implications

Published: 12 November 2017

ABSTRACT

Resilience is one of the key challenges in maintaining high efficiency of future extreme scale supercomputers. Researchers and system practitioners rely on field-data studies to understand reliability characteristics and plan for future HPC systems. In this work, we compare and contrast the reliability characteristics of multiple large-scale HPC production systems. Our study covers more than one billion compute node hours across five different systems over a period of 8 years. We confirm previous findings which continue to be valid, discover new findings, and discuss their implications.

  • Published in

    SC '17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
    November 2017
    801 pages
    ISBN: 9781450351140
    DOI: 10.1145/3126908
    • General Chair: Bernd Mohr
    • Program Chair: Padma Raghavan

    Copyright © 2017 ACM

    © 2017 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • research-article

    Acceptance Rates

    • SC '17 Paper Acceptance Rate: 61 of 327 submissions, 19%
    • Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%
