ABSTRACT
We compare and contrast the approaches and key features of two proposals for fault-tolerant MPI: User-Level Failure Mitigation (ULFM) and Fault-Aware MPI (FA-MPI). We show how they are complementary and how each could leverage the other through modifications and/or extensions. We show how to "weaken" and extend ULFM to help integrate it with FA-MPI, with the corollary benefit of broadening ULFM's applicability. We also consider the reducibility of each to the other. This helps identify which components of each are minimally "required" for standardization and which can be layered on a future MPI specification.
Comparing, Contrasting, Generalizing, and Integrating Two Current Designs for Fault-Tolerant MPI