Communication-induced checkpointing and recovery protocols for distributed systems

January 2011

Author: Yi Luo, University of Kentucky
Adviser: Dakshnamoorthy Manivannan, University of Kentucky

Publisher: University of Kentucky, Lexington, KY, United States

ISBN: 978-1-267-23761-3

Order Number: AAI3501629

Pages: 115

Abstract

Checkpointing and rollback recovery are well-established techniques for providing fault tolerance in distributed computations. By saving process states to stable storage during execution, these techniques allow processes to make progress in spite of failures. When a failure occurs, the whole computation can be restarted from a consistent global state chosen to minimize the amount of lost computation.
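To make the notion of a consistent global state concrete, the following minimal sketch checks whether a set of checkpoints, one per process, contains an orphan message, that is, a message recorded as received whose send is not recorded. The `Message` type, the integer event indices, and the `is_consistent` function are illustrative assumptions and are not taken from the dissertation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """Hypothetical record of one message, for illustration only."""
    sender: str
    receiver: str
    send_time: int  # local event index at the sender when the message was sent
    recv_time: int  # local event index at the receiver when it was delivered


def is_consistent(cut, messages):
    """Return True if the cut has no orphan message.

    `cut` maps each process name to the local event index of its checkpoint.
    """
    for m in messages:
        received = m.recv_time <= cut[m.receiver]
        sent = m.send_time <= cut[m.sender]
        if received and not sent:
            return False  # receive is recorded but the send is not
    return True


messages = [Message("P", "Q", send_time=5, recv_time=3)]
orphan_cut = {"P": 4, "Q": 4}  # Q has recorded the receive, P has not recorded the send
safe_cut = {"P": 5, "Q": 4}    # both the send and the receive are recorded
```

Under these assumptions, `is_consistent(orphan_cut, messages)` is false and `is_consistent(safe_cut, messages)` is true. Rolling back to an inconsistent cut would leave a receiver holding a message that, after recovery, was never sent.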

Communication-Induced Checkpointing protocols are popular because they bound rollback propagation during failure recovery: they ensure that each checkpoint taken is part of a consistent global checkpoint of the distributed computation, while still allowing each process to take checkpoints independently. In this dissertation, we first present a fully informed and efficient Communication-Induced Checkpointing protocol that has both less checkpointing overhead and less message overhead than a well-known Communication-Induced Checkpointing protocol proposed by Helary et al. Performance evaluation indicates that our protocol performs better than many existing Communication-Induced Checkpointing protocols.
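As a rough illustration of how a communication-induced protocol forces checkpoints, the sketch below follows a simple index-based scheme in the style usually attributed to Briatico, Ciuffoletti, and Simoncini. It is not the protocol proposed in this dissertation, nor the protocol of Helary et al. The `Process` class and its method names are assumptions made for the example. Each process piggybacks its checkpoint sequence number on every message and takes a forced checkpoint before delivering a message that carries a larger number.

```python
class Process:
    """Illustrative process using an index-based forced-checkpoint rule."""

    def __init__(self, name):
        self.name = name
        self.sn = 0                           # current checkpoint sequence number
        self.checkpoints = [(0, "initial")]   # (sequence number, kind)

    def basic_checkpoint(self):
        """Checkpoint taken independently, for example on a local timer."""
        self.sn += 1
        self.checkpoints.append((self.sn, "basic"))

    def send(self, payload):
        """Piggyback the current sequence number on the outgoing message."""
        return (self.sn, payload)

    def receive(self, message):
        """Take a forced checkpoint before delivery if the sender is ahead."""
        m_sn, payload = message
        if m_sn > self.sn:
            self.sn = m_sn
            self.checkpoints.append((self.sn, "forced"))
        return payload


p, q = Process("P"), Process("Q")
p.basic_checkpoint()              # P moves to sequence number 1
delivered = q.receive(p.send("x"))  # Q is behind, so it checkpoints first
reply = p.receive(q.send("y"))      # P is not behind, so no checkpoint is forced
```

In this sketch, the checkpoint-inducing condition is the single comparison `m_sn > self.sn`. Protocols in this area differ mainly in how much piggybacked information they use to make that condition fire less often.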

Second, we present a thorough theoretical and experimental evaluation of the Communication-Induced Checkpointing protocols belonging to two families, namely the F_E family and the F_Lazy–E family. Based on both evaluations, we conclude that the performance of protocols within the F_E family, or within the F_Lazy–E family (but not across the two families), can be compared merely by comparing their checkpoint-inducing conditions.

Existing checkpointing and rollback recovery protocols are suitable only for small-scale message-passing systems, yet the ability to provide fault tolerance is important for the success of future large-scale distributed systems. We address this issue by presenting a group-based hybrid protocol that combines optimistic checkpointing with pessimistic message logging. We then present a comprehensive recovery protocol based on this checkpointing protocol, which not only restores the whole system to a consistent global checkpoint when a failure occurs, but also appropriately handles the different kinds of messages that arise during recovery.

KEYWORDS: Fault Tolerance, Distributed Systems, Consistent Global Checkpoints, Communication-Induced Checkpointing Protocols, Rollback Recovery

Recommendations

Quasi-synchronous checkpointing and failure recovery in distributed systems
A communication-induced checkpointing and asynchronous recovery algorithm for multithreaded distributed systems
PDCAT'04: Proceedings of the 5th international conference on Parallel and Distributed Computing: applications and Technologies

Checkpointing and recovery in traditional distributed systems is relatively well established. However, checkpointing and recovery in multithreaded distributed systems has not been studied in the literature. Using the traditional checkpointing and ...
A Communication-Induced Checkpointing and Asynchronous Recovery Protocol for Mobile Computing Systems
PDCAT '05: Proceedings of the Sixth International Conference on Parallel and Distributed Computing Applications and Technologies

Mobile computing systems have many constraints, such as low battery power, low bandwidth, high mobility, and lack of stable storage, which are not present in static distributed systems. In this paper, we propose an efficient communication-induced ...
