The network is reliable

Authors: Peter Bailis (UC Berkeley), Kyle Kingsbury

Communications of the ACM, Volume 57, Issue 9, September 2014, pp. 48–55. https://doi.org/10.1145/2643130

Published: 01 September 2014

Abstract

An informal survey of real-world communications failures.

References

Abadi, D. Consistency trade-offs in modern distributed database system design: CAP is only part of the story. Computer 45, 2 (2012), 37–42; http://dl.acm.org/citation.cfm?id=2360959.
Amazon Web Services. Summary of the Amazon EC2 and Amazon RDS service disruption in the US East region, 2011; http://aws.amazon.com/message/65648/.
Bailis, P., Davidson, A., Fekete, A., Ghodsi, A., Hellerstein, J.M. and Stoica, I. Highly available transactions: virtues and limitations. In Proceedings of VLDB 2014 (to appear); http://www.bailis.org/papers/hat-vldb2014.pdf.
Bailis, P., Fekete, A., Franklin, M.J., Ghodsi, A., Hellerstein, J.M. and Stoica, I. Coordination-avoiding database systems, 2014; http://arxiv.org/abs/1402.2237.
Bailis, P. and Ghodsi, A. Eventual consistency today: Limitations, extensions, and beyond. ACM Queue 11, 3 (2013); http://queue.acm.org/detail.cfm?id=2462076.
CityCloud, 2011; https://www.citycloud.eu/cloud-computing/post-mortem/.
Davidson, S.B., Garcia-Molina, H. and Skeen, D. Consistency in a partitioned network: A survey. ACM Computing Surveys 17, 3 (1985), 341–370; http://dl.acm.org/citation.cfm?id=5508.
Dwork, C., Lynch, N. and Stockmeyer, L. Consensus in the presence of partial synchrony. JACM 35, 2 (1988), 288–323; http://dl.acm.org/citation.cfm?id=42283.
Fischer, M.J., Lynch, N.A. and Paterson, M.S. Impossibility of distributed consensus with one faulty process. JACM 32, 2 (1985), 374–382; http://dl.acm.org/citation.cfm?id=214121.
Fog Creek Software. May 5–6 network maintenance post-mortem; http://status.fogcreek.com/2012/05/may-5-6-network-maintenance-post-mortem.html.
Gilbert, S. and Lynch, N. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33, 2 (2002), 51–59; http://dl.acm.org/citation.cfm?id=564601.
Gill, P., Jain, N. and Nagappan, N. Understanding network failures in data centers: Measurement, analysis, and implications. In Proceedings of SIGCOMM '11; http://research.microsoft.com/enus/um/people/navendu/papers/sigcomm11netwiser.pdf.
Github. Github availability this week, 2012; https://github.com/blog/1261-github-availability-this-week.
Kielhofner, K. Packets of death; http://blog.krisk.org/2013/02/packets-of-death.html.
Lillich, J. Post mortem: Network issues last week; http://www.freistil.it/2013/02/post-mortem-network-issues-last-week/.
Narayan, P.P.S. Sherpa update, 2010; https://developer.yahoo.com/blogs/ydn/sherpa-7992.html#4.
Prince, M. Today's outage post mortem, 2013; http://blog.cloudflare.com/todays-outage-post-mortem-82515.
Turner, D., Levchenko, K., Snoeren, A. and Savage, S. California fault lines: Understanding the causes and impact of network failures. In Proceedings of SIGCOMM '10; http://cseweb.ucsd.edu/~snoeren/papers/cenic-sigcomm10.pdf.
Twilio. Billing incident post-mortem: breakdown, analysis and root cause; http://www.twilio.com/blog/2013/07/billing-incident-post-mortem.html.

Index Terms

The network is reliable
1. Networks
  1. Network architectures
  2. Network services

Published in
Communications of the ACM Volume 57, Issue 9
September 2014
94 pages
ISSN:0001-0782
EISSN:1557-7317
DOI:10.1145/2663191
Editor:
Moshe Y. Vardi
Association for Computing Machinery, New York, NY
Copyright © 2014 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States