DOI: 10.1109/SC.2014.62
research-article

Quantitatively modeling application resilience with the data vulnerability factor

Published: 16 November 2014

ABSTRACT

Recent strategies to improve the observable resilience of applications require the ability to classify the vulnerabilities of individual application components (e.g., data structures, instructions) and then selectively apply protection mechanisms to the critical ones. To facilitate this vulnerability classification, it is important to have accurate, quantitative techniques that can be applied uniformly and automatically across real-world applications. Traditional methods cannot effectively quantify vulnerability because they lack a holistic view of system resilience and come with prohibitive evaluation costs. In this paper, we introduce a data-driven, practical methodology to analyze these application vulnerabilities using a novel resilience metric: the data vulnerability factor (DVF). DVF integrates knowledge from both the application and the target hardware into the calculation. To calculate DVF, we extend a performance modeling language to provide a structured, fast modeling solution. We evaluate our methodology on six representative computational kernels; we demonstrate the significance of DVF by quantifying the impact of algorithm optimization on vulnerability and by quantifying the effectiveness of specific hardware protection mechanisms.
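The abstract does not give the DVF formula, so the following is only a minimal sketch of how a metric of this kind could combine hardware and application knowledge. It assumes, purely for illustration, that a data structure's score is the expected number of memory errors over the run (hardware failure rate × size × time) multiplied by the number of hardware accesses to that structure. The names `DataStructure`, `expected_errors`, `vulnerability_score`, and `rank_by_vulnerability`, and all input values, are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class DataStructure:
    """Hypothetical description of one application data structure."""
    name: str
    size_bytes: int
    hw_accesses: int  # assumed: number of main-memory accesses during the run


def expected_errors(fit_per_mbit: float, exec_hours: float, size_bytes: int) -> float:
    """Expected error count for a memory region.

    FIT is failures per 10^9 device-hours; here it is given per Mbit
    (an assumption about how the hardware rate is expressed).
    """
    mbits = size_bytes * 8 / 1e6
    return fit_per_mbit * mbits * exec_hours / 1e9


def vulnerability_score(ds: DataStructure, fit_per_mbit: float, exec_hours: float) -> float:
    """Assumed DVF-like score: expected errors weighted by hardware accesses."""
    return expected_errors(fit_per_mbit, exec_hours, ds.size_bytes) * ds.hw_accesses


def rank_by_vulnerability(structures, fit_per_mbit: float, exec_hours: float):
    """Return structures ordered from most to least vulnerable under this score."""
    return sorted(
        structures,
        key=lambda ds: vulnerability_score(ds, fit_per_mbit, exec_hours),
        reverse=True,
    )


if __name__ == "__main__":
    # Illustrative inputs only; not measurements from any real system.
    structures = [
        DataStructure("matrix_A", size_bytes=1_000_000, hw_accesses=100),
        DataStructure("vector_x", size_bytes=2_000_000, hw_accesses=10),
    ]
    for ds in rank_by_vulnerability(structures, fit_per_mbit=1000.0, exec_hours=1.0):
        print(ds.name, vulnerability_score(ds, 1000.0, 1.0))
```

Under these assumptions, a hardware protection mechanism would be modeled as a lower effective failure rate, and an algorithm optimization as a change in size, run time, or access count, so both kinds of change show up in the same score.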


Published in

SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2014
1054 pages
ISBN: 9781479955008
General Chair: Trish Damkroger
Program Chair: Jack Dongarra

Publisher: IEEE Press


Acceptance Rates

SC '14 paper acceptance rate: 83 of 394 submissions (21%). Overall acceptance rate: 1,516 of 6,373 submissions (24%).
