Understanding and coping with failures in large-scale storage systems
  • Author: Qin Xin
  • Chair: Ethan L. Miller
  • Publisher: University of California at Santa Cruz, Computer and Information Sciences Dept., 265 Applied Sciences Building, Santa Cruz, CA, United States
  • ISBN: 978-0-542-38519-3
  • Order Number: AAI3194083
  • Pages: 134
Abstract

Reliability has become increasingly important for very large-scale storage systems as the demand for storage has grown dramatically. New reliability phenomena emerge as systems scale up; in such systems, failures are the norm rather than the exception. To ensure high reliability for petabyte-scale storage systems in scientific applications, this thesis studies the characterization of failures and techniques for coping with them.

The thesis first describes the architecture of a petabyte-scale storage system and characterizes the challenges of achieving high reliability in such a system. Long disk recovery times and the large number of system components are identified as the main obstacles to high system reliability.

The thesis then presents a fast recovery mechanism, FARM, which greatly reduces data loss in the event of multiple disk failures. The reliability of a petabyte-scale system with and without FARM is evaluated, and various aspects of system reliability, such as failure detection latency, bandwidth utilization for recovery, disk space utilization, and system scale, are examined through simulation.
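FARM's actual mechanism is detailed in the thesis itself; as a back-of-envelope illustration of why fast, distributed recovery matters, the sketch below compares rebuild time when one spare disk absorbs a rebuild against rebuild spread across many disks in parallel. The function name, disk capacity, and per-disk recovery bandwidth are illustrative assumptions, not figures from the thesis.

```python
def rebuild_time_hours(disk_gb, per_disk_recovery_mbps, participants):
    """Estimate rebuild time (hours) for one failed disk's data when
    'participants' disks each contribute per_disk_recovery_mbps of
    recovery bandwidth in parallel. Purely illustrative arithmetic."""
    total_mbps = per_disk_recovery_mbps * participants
    return (disk_gb * 1024) / total_mbps / 3600

# Traditional rebuild onto a single spare vs. declustered parallel
# recovery across 100 disks (assumed 500 GB disk, 20 MB/s per disk).
single = rebuild_time_hours(500, 20, 1)       # roughly 7 hours
parallel = rebuild_time_hours(500, 20, 100)   # two orders of magnitude faster
```

Shrinking the recovery window this way shrinks the interval during which a second failure can cause data loss, which is the intuition behind the reliability gains reported for FARM.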

The overall system reliability is modeled and estimated through quantitative analysis based on Markov models and event-driven simulations. Disk failure models that account for infant mortality are found to yield more precise reliability estimates than the traditional model that assumes a constant failure rate, since infant mortality has a pronounced impact on petabyte-scale systems. To safeguard data against failures of young disk drives, an adaptive data redundancy scheme is presented and evaluated.
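A common way to capture infant mortality is a Weibull lifetime model: a shape parameter below 1 gives a failure rate that is high for young disks and decreases with age, while a shape parameter of exactly 1 reduces to the constant-rate exponential model. The sketch below (with illustrative parameter values, not the thesis's fitted ones) shows how much the two models can disagree early in a disk's life.

```python
def weibull_hazard(t, beta, eta):
    """Instantaneous failure rate h(t) for a Weibull(beta, eta) lifetime.
    beta < 1 models infant mortality (decreasing hazard with age);
    beta = 1 is the constant-rate exponential special case."""
    return (beta / eta) * (t / eta) ** (beta - 1)

ETA = 10000.0  # characteristic life in hours (illustrative assumption)

# At 100 hours of age, the infant-mortality model predicts a failure
# rate several times higher than the constant-rate model.
young = weibull_hazard(100.0, beta=0.7, eta=ETA)
const = weibull_hazard(100.0, beta=1.0, eta=ETA)
```

Underestimating the failure rate of young disks by using a constant-rate model is exactly the error that becomes significant when a system contains thousands of drives, which motivates the adaptive redundancy scheme described above.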

A petabyte-scale storage system is typically built from thousands of components in a complicated interconnect structure. The impact of various failures on the interconnection network is gauged, and performance and robustness under degraded modes are evaluated in a simulated petabyte-scale storage system with different network topology configurations.

This thesis is directed towards understanding and coping with failures in petabyte-scale storage systems. It addresses several emerging reliability challenges posed by the increasing scale of storage systems and studies methods for improving system reliability. The research is intended to help system architects design reliable storage systems at petabyte scale and beyond.

Contributors
  • University of California, Santa Cruz
