Fast crash recovery in distributed file systems

September 1995

Author:
Mary Louise Gray Baker
Univ. of California, Berkeley

Publisher:
University of California at Berkeley
Computer Science Division 571 Evans Hall Berkeley, CA
United States

Order Number: UMI Order No. GAX95-04737

Abstract

This thesis presents fast crash recovery: a simple, efficient, and inexpensive method for increasing availability in distributed systems. In fast crash recovery we assume that critical resources will fail, and we do not attempt to mask the failures with redundant hardware or software. Instead, we design the system to recover so quickly that there is little downtime. This approach is intended for environments that can tolerate occasional failures but cannot afford the cost and overhead of redundant resources.

In particular, I focus on fast recovery of distributed state. An example of distributed state is the file caching information maintained by servers in most modern file systems. This information describes the state of file caches on client workstations. After a crash, a server must recover this information in order to guarantee the consistency of the caches. Unfortunately, distributed state recovery can be slow and complex. The techniques I have developed reduce state recovery from several minutes to under six seconds for a Sprite file server (Ouster88) with 40 clients.
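To make the idea of distributed state concrete, the following is a minimal sketch, not the Sprite implementation. It assumes each client can report which files it has cached and whether it holds them for writing; the function names, the report format, and the conflict rule are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def rebuild_cache_state(client_reports):
    """Rebuild a per-file view of client caches from client reports.

    client_reports maps a client name to a list of (path, is_writer)
    pairs. This report format is an assumption made for this sketch.
    """
    state = defaultdict(lambda: {"readers": set(), "writers": set()})
    for client, files in client_reports.items():
        for path, is_writer in files:
            key = "writers" if is_writer else "readers"
            state[path][key].add(client)
    return dict(state)


def conflicting_files(state):
    """List files with a writer and at least one other client using them."""
    return sorted(
        path
        for path, users in state.items()
        if users["writers"] and len(users["writers"] | users["readers"]) > 1
    )


# Hypothetical reports gathered from two clients after a server restart.
reports = {
    "client-a": [("/src/main.c", True), ("/etc/motd", False)],
    "client-b": [("/src/main.c", False)],
}
state = rebuild_cache_state(reports)
print(conflicting_files(state))  # prints ['/src/main.c']
```

In this sketch the server cannot decide which files need consistency action until every client has reported, which hints at why recovering this kind of state can be slow with many clients.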

This thesis evaluates three distributed state recovery techniques based on their speed, complexity, and performance overhead. The fastest technique is transparent recovery, so-called because client workstations do not communicate with the server during recovery. Instead, the server stores its distributed state in a protected area of its own main memory called the recovery box. The interface to the recovery box helps detect and prevent corruption of this state information.
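The recovery box idea can be illustrated with a small sketch. The abstract does not describe the actual interface, so everything here is an assumption: the class and method names are hypothetical, an ordinary dictionary stands in for the protected memory region, and a CRC-32 per item is one plausible way an interface could detect corruption.

```python
import zlib


class RecoveryBox:
    """Toy stand-in for a protected memory area (illustrative only)."""

    def __init__(self):
        # key -> (payload bytes, CRC-32 of the payload at insert time)
        self._items = {}

    def insert(self, key, payload):
        """Store a copy of the payload together with its checksum."""
        data = bytes(payload)
        self._items[key] = (data, zlib.crc32(data))

    def delete(self, key):
        """Remove an item; deleting a missing key is a no-op."""
        self._items.pop(key, None)

    def recover(self):
        """Return (valid items, corrupted keys) by re-checking checksums."""
        valid, corrupted = {}, []
        for key, (payload, crc) in self._items.items():
            if zlib.crc32(payload) == crc:
                valid[key] = payload
            else:
                corrupted.append(key)
        return valid, corrupted


box = RecoveryBox()
box.insert("file-17", b"client-a:open-for-write")
box.insert("file-42", b"client-b:open-for-read")

# Simulate a stray write that damages one item without updating its checksum.
_, old_crc = box._items["file-42"]
box._items["file-42"] = (b"client-b:0pen-for-read", old_crc)

valid, corrupted = box.recover()
print(sorted(valid), corrupted)  # prints ['file-17'] ['file-42']
```

After a restart, a server built along these lines could reuse the items that pass the check without contacting clients and fall back to a slower technique for any that fail.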

To achieve fast overall recovery times, we must also recover other parts of the system quickly. For example, we can eliminate a lengthy file system consistency check by using a log-structured file system that recovers in seconds (Rosenb91). By combining the improvements described in this thesis, a Sprite file server can reboot in under 30 seconds. This is two orders of magnitude faster than recovery in most modern file systems.

In addition to evaluating distributed state recovery techniques, this thesis presents some overall guidelines for designing distributed systems that will recover quickly from crashes.

Contributors

Mary G Baker
HP Labs

Index Terms

Fast crash recovery in distributed file systems
1. Software and its engineering
  1. Software creation and management
    1. Designing software
  2. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        1. Memory management
          1. Distributed memory
    2. Software system structures
      1. Distributed systems organizing principles
