DOI: 10.1145/2642769.2642776
research-article

Comparing, Contrasting, Generalizing, and Integrating Two Current Designs for Fault-Tolerant MPI

Published: 09 September 2014

ABSTRACT

We compare and contrast the approaches and key features of two proposals for fault-tolerant MPI: User-Level Failure Mitigation (ULFM) and Fault-Aware MPI (FA-MPI). We show how they are complementary and how each could leverage the other through modifications and/or extensions. We show how to "weaken" and extend ULFM to help integrate it with FA-MPI, with the corollary benefit of broadening the applicability of ULFM. We also consider the reducibility of each proposal to the other. This helps identify which components of each are minimally "required" for standardization and which can instead be layered on a future MPI specification.


        • Published in

          ACM Other conferences
          EuroMPI/ASIA '14: Proceedings of the 21st European MPI Users' Group Meeting
          September 2014
          183 pages
          ISBN:9781450328753
          DOI:10.1145/2642769

          Copyright © 2014 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States



          Qualifiers

          • research-article
          • Research
          • Refereed limited

          Acceptance Rates

          EuroMPI/ASIA '14 Paper Acceptance Rate: 18 of 39 submissions, 46%
          Overall Acceptance Rate: 18 of 39 submissions, 46%
