Communication-induced checkpointing and recovery protocols for distributed systems
Publisher:
  • University of Kentucky
  • Lexington, KY
  • United States
ISBN: 978-1-267-23761-3
Order Number: AAI3501629
Pages: 115
Abstract

Checkpointing and rollback recovery are well-established techniques for providing fault tolerance in distributed computations. By relying on process states saved to stable storage during execution, these techniques allow processes to make progress in spite of failures. When a failure occurs, the whole computation can be restarted from a consistent global state that minimizes the amount of lost computation.

Communication-Induced Checkpointing protocols are popular because they help bound rollback propagation during failure recovery: they ensure that each checkpoint taken is part of a consistent global checkpoint of the distributed computation, while still allowing each process to take checkpoints independently. In this dissertation, we first present a fully informed and efficient communication-induced checkpointing protocol that has both less checkpointing overhead and less message overhead than a well-known Communication-Induced Checkpointing protocol proposed by Helary et al. Performance evaluation indicates that our protocol performs better than many of the existing Communication-Induced Checkpointing protocols.

Second, we present a thorough theoretical and experimental evaluation of the Communication-Induced Checkpointing protocols belonging to two families, namely, the F E family and the F Lazy–E family. Based on both the theoretical and the experimental evaluations, we conclude that the performance of protocols can be compared within either the F E family or the F Lazy–E family (but not between the two families) merely by comparing their checkpoint-inducing conditions.

Existing checkpointing and rollback recovery protocols are suitable only for small-scale message-passing systems, yet the ability to provide fault tolerance is important for the success of future large-scale distributed systems. We address this issue by presenting a group-based hybrid protocol that combines optimistic checkpointing with pessimistic message logging. We then present a comprehensive recovery protocol based on this checkpointing protocol. When a failure occurs, it restores the whole system to a consistent global checkpoint and appropriately handles the different kinds of messages that arise during recovery, so that the system resumes from a consistent state.

KEYWORDS: Fault Tolerance, Distributed Systems, Consistent Global Checkpoints, Communication-Induced Checkpointing Protocols, Rollback Recovery
