Abstract
Component failure in large-scale IT installations is becoming an ever-larger problem as the number of components in a single cluster approaches a million.
This article extends our previous study of disk failures [Schroeder and Gibson 2007]. It presents and analyzes field-gathered disk replacement data from a number of large production systems, including high-performance computing sites and internet services sites. The data covers more than 110,000 disks, some for an entire lifetime of five years, and includes drives with SCSI, FC, and SATA interfaces. The mean time to failure (MTTF) of these drives, as specified in their datasheets, ranges from 1,000,000 to 1,500,000 hours, suggesting a nominal annual failure rate of at most 0.88%.
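The datasheet arithmetic can be made concrete. As a minimal sketch (not from the article), under the constant-failure-rate assumption that underlies a datasheet MTTF, the implied nominal annual failure rate is:

```python
import math

def nominal_afr(mttf_hours: float) -> float:
    """Nominal annual failure rate implied by a datasheet MTTF,
    assuming a constant (exponential) failure rate."""
    hours_per_year = 8760
    return 1.0 - math.exp(-hours_per_year / mttf_hours)

# Datasheet MTTFs of 1,000,000-1,500,000 hours imply nominal AFRs
# of roughly 0.6%-0.9% -- far below the 2-4% replacement rates
# commonly observed in the field.
for mttf in (1_000_000, 1_500_000):
    print(f"MTTF {mttf:>9,} h -> nominal AFR {nominal_afr(mttf):.2%}")
```

The simpler approximation 8760/MTTF gives nearly the same numbers (0.88% at 1,000,000 hours), which is where the abstract's "at most 0.88%" comes from.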
We find that in the field, annual disk replacement rates typically exceed 1%, with 2--4% common and up to 13% observed on some systems. This suggests that field replacement is a fairly different process than one might predict based on datasheet MTTF.
We also find evidence, based on records of disk replacements in the field, that failure rate is not constant with age, and that, rather than a significant infant mortality effect, we see significant early onset of wear-out degradation. That is, the replacement rates in our data grew steadily with age, an effect often assumed not to set in until after a nominal lifetime of five years.
Interestingly, we observe little difference in replacement rates between SCSI, FC, and SATA drives, potentially an indication that disk-independent factors such as operating conditions affect replacement rates more than component-specific ones. On the other hand, we see only one instance of a customer rejecting an entire population of disks as a bad batch, in this case because of media error rates, and this instance involved SATA disks.
Time between replacements, a proxy for time between failures, is not well modeled by an exponential distribution and exhibits significant levels of correlation, including autocorrelation and long-range dependence.
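The distributional claim can be illustrated with a small sketch (hypothetical data, not the article's analysis): inter-arrival times from an i.i.d. exponential model, as a Poisson failure process would produce, show near-zero sample autocorrelation, while a crudely correlated series, standing in for the clustered replacements seen in field data, does not:

```python
import random

def autocorr(xs, lag=1):
    """Sample autocorrelation of the series xs at the given lag."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    cov = sum((xs[i] - mean) * (xs[i + lag] - mean)
              for i in range(n - lag)) / n
    return cov / var

random.seed(0)

# i.i.d. exponential inter-replacement times: the Poisson-process model.
iid = [random.expovariate(1.0) for _ in range(10_000)]

# A crudely correlated series: each gap mixes in the previous one.
corr = [random.expovariate(1.0)]
for _ in range(9_999):
    corr.append(0.7 * corr[-1] + 0.3 * random.expovariate(1.0))

print(f"lag-1 autocorrelation, i.i.d. exponential: {autocorr(iid):+.3f}")
print(f"lag-1 autocorrelation, correlated series:  {autocorr(corr):+.3f}")
```

A test of this kind (alongside Hurst-parameter estimation for long-range dependence) is the sort of check that distinguishes real replacement logs from the exponential model.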
- Bairavasundaram, L. N., Goodson, G. R., Pasupathy, S., and Schindler, J. 2007. An analysis of latent sector errors in disk drives. In Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS).
- CFDR. 2007. The computer failure data repository. http://cfdr.usenix.org/.
- Cole, G. 2000. Estimating drive reliability in desktop computers and consumer electronics systems. TP-338.1. Seagate Technology, November.
- Corbett, P. F., English, R., Goel, A., Grcanac, T., Kleiman, S., Leong, J., and Sankar, S. 2004. Row-diagonal parity for double disk failure correction. In Proceedings of the Conference on File and Storage Technologies (FAST).
- Drummer, D., Khurshudov, A., Riedel, E., and Watts, R. 2006. Personal communication.
- Elerath, J. G. 2000a. AFR: Problems of definition, calculation and measurement in a commercial environment. In Proceedings of the Annual Reliability and Maintainability Symposium.
- Elerath, J. G. 2000b. Specifying reliability in the disk drive industry: No more MTBFs. In Proceedings of the Annual Reliability and Maintainability Symposium.
- Elerath, J. G. and Shah, S. 2004. Server class drives: How reliable are they? In Proceedings of the Annual Reliability and Maintainability Symposium.
- Ghemawat, S., Gobioff, H., and Leung, S.-T. 2003. The Google file system. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP).
- Gibson, G. A. 1992. Redundant Disk Arrays: Reliable, Parallel Secondary Storage. MIT Press, Cambridge, MA.
- Gray, J. 1990. A census of Tandem system availability between 1985 and 1990. IEEE Trans. Reliabil. 39, 4.
- Gray, J. 1986. Why do computers stop and what can be done about it? In Proceedings of the 5th Symposium on Reliability in Distributed Software and Database Systems.
- Heath, T., Martin, R. P., and Nguyen, T. D. 2002. Improving cluster availability using workstation validation. In Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS).
- Iyer, R. K., Rossetti, D. J., and Hsueh, M. C. 1986. Measurement and modeling of computer reliability as affected by system activity. ACM Trans. Comput. Syst. 4, 3.
- Kalyanakrishnam, M., Kalbarczyk, Z., and Iyer, R. 1999. Failure data analysis of a LAN of Windows NT-based computers. In Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems.
- Karagiannis, T. 2002. Selfis: A short tutorial. Tech. rep., University of California, Riverside.
- Karagiannis, T., Molle, M., and Faloutsos, M. 2004. Long-range dependence: Ten years of internet traffic modeling. IEEE Internet Comput. 8, 5.
- LANL. http://www.lanl.gov/projects/computerscience/data/.
- Leland, W. E., Taqqu, M. S., Willinger, W., and Wilson, D. V. 1994. On the self-similar nature of Ethernet traffic. IEEE/ACM Trans. Netw. 2, 1.
- Lin, T.-T. Y. and Siewiorek, D. P. 1990. Error log analysis: Statistical modeling and heuristic trend analysis. IEEE Trans. Reliabil. 39, 4.
- Meyer, J. and Wei, L. 1988. Analysis of workload influence on dependability. In Proceedings of the International Symposium on Fault-Tolerant Computing.
- Murphy, B. and Gent, T. 1995. Measuring system and software reliability using an automated data collection process. Qual. Reliabil. Eng. Int. 11, 5.
- NERSC. 2007. Systems disk failure. http://pdsi.nersc.gov/all_diskfailure.php.
- Nurmi, D., Brevik, J., and Wolski, R. 2005. Modeling machine availability in enterprise and wide-area distributed computing environments. In Proceedings of the International Euro-Par Conference on Parallel Processing.
- Oppenheimer, D. L., Ganapathi, A., and Patterson, D. A. 2003. Why do internet services fail, and what can be done about it? In Proceedings of the USENIX Symposium on Internet Technologies and Systems.
- Patterson, D., Gibson, G., and Katz, R. 1988. A case for redundant arrays of inexpensive disks (RAID). In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD).
- Pinheiro, E., Weber, W. D., and Barroso, L. A. 2007. Failure trends in a large disk drive population. In Proceedings of the Conference on File and Storage Technologies (FAST).
- Prabhakaran, V., Bairavasundaram, L. N., Agrawal, N., Gunawi, H. S., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. 2005. IRON file systems. In Proceedings of the 20th ACM Symposium on Operating Systems Principles (SOSP).
- Ross, S. M. Introduction to Probability Models, 6th ed. Academic Press.
- Sahoo, R. K., Sivasubramaniam, A., Squillante, M. S., and Zhang, Y. 2004. Failure data analysis of a large-scale heterogeneous server environment. In Proceedings of the International Conference on Dependable Systems and Networks (DSN).
- Schroeder, B. and Gibson, G. A. 2007. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? In Proceedings of the Conference on File and Storage Technologies (FAST).
- Schroeder, B. and Gibson, G. A. 2006. A large-scale study of failures in high-performance computing systems. In Proceedings of the International Conference on Dependable Systems and Networks (DSN).
- Schwarz, T., Baker, M., Bassi, S., Baumgart, B., Flagg, W., van Ingen, C., Joste, K., Manasse, M., and Shah, M. 2006. Disk failure investigations at the Internet Archive. In Proceedings of the NASA/IEEE Conference on Mass Storage Systems and Technologies (MSST), Work in Progress Session.
- Talagala, N. and Patterson, D. 1999. An analysis of error behaviour in a large storage system. In Proceedings of the IEEE Workshop on Fault Tolerance in Parallel and Distributed Systems.
- Tang, D., Iyer, R. K., and Subramani, S. S. 1990. Failure analysis and modelling of a VAX cluster system. In Proceedings of the International Symposium on Fault-Tolerant Computing.
- van Ingen, C. and Gray, J. 2005. Empirical measurements of disk failure rates and error rates. Tech. Rep. MSR-TR-2005-166, Microsoft Research, December.
- Xu, J., Kalbarczyk, Z., and Iyer, R. K. 1999. Networked Windows NT system field failure data analysis. In Proceedings of the Pacific Rim International Symposium on Dependable Computing.
- Yang, J. and Sun, F.-B. 1999. A comprehensive review of hard-disk drive reliability. In Proceedings of the Annual Reliability and Maintainability Symposium.