Article

Latency and bandwidth-minimizing failure detectors

Authors:
Kelvin C. W. So

Cornell University, Ithaca, NY

Cornell University, Ithaca, NY
View Profile

,
Emin Gün Sirer

Cornell University, Ithaca, NY

Cornell University, Ithaca, NY
View Profile

EuroSys '07: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007March 2007Pages 89–99https://doi.org/10.1145/1272996.1273008

Published:21 March 2007Publication History

EuroSys '07: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007

Pages 89–99

ABSTRACT

Failure detectors are fundamental building blocks in distributed systems. Multi-node failure detectors, where the detector is tasked with monitoring N other nodes, play a critical role in overlay networks and peer-to-peer systems. In such networks, failures need to be detected quickly and with low overhead. Achieving these properties simultaneously poses a difficult tradeoff between detection latency and resource consumption.

In this paper, we examine this central tradeoff, formalize it as an optimization problem and analytically derive the optimal closed form formulas for multi-node failure detectors. We provide two variants of the optimal solution for optimality metrics appropriate for two different deployment scenarios. √s-LM is a latency-minimizing optimal failure detector that achieves the lowest average failure detection latency given a fixed bandwidth constraint for system maintenance. √s-BM is a bandwidth-minimizing optimal failure detector that meets a desired detection latency target with the least amount of bandwidth consumed. We evaluate our optimal results with node lifetimes chosen from bimodal and Pareto distributions, as well as real-world trace data from PlanetLab hosts, web sites and Microsoft PCs. Compared to standard failure detectors in wide use, √s failure detectors reduce failure detection latencies by 40% on average for the same bandwidth consumption, or conversely, reduce the amount of bandwidth consumed by 30% for the same failure detection latency.

References

M. K. Aguilera, W. Chen, and S. Toueg. Heartbeat: a Timeout-free Failure Detector for Quiescent Reliable Communication. In Proceedings of the International Workshop on Distributed Algorithms, Saarbrücken, Germany, Sept. 1997. Google ScholarDigital Library
M. K. Aguilera, W. Chen, and S. Toueg. Failure Detection and Consensus in the Crash-Recovery Model. In Proceedings of the International Symposium on Distributed Computing, Andros, Greece, Sept. 1998. Google ScholarDigital Library
Y. Amir, D. Dolev, S. Kramer, and D. Malkhi. Transis: A Communication Sub-system for High Availability. In Proceedings of the International Symposium on Fault-Tolerant Computing, Boston, Massachussetts, July 1992.Google Scholar
M. Bakkaloglu, J. J. Wylie, C. Wang, and G. R. Ganger. On Correlated Failures in Survivable Storage Systems. Technical Report CMU-CS-02-129, School of Computer Science, Carnegie Mellon University, May 2002.Google ScholarCross Ref
M. Bertier, O. Marin, and P. Sens. Implementation and Performance Evaluation of an Adaptable Failure Detector. In Proceedings of the International Conference on Dependable Systems and Networks, Washington, DC, June 2002. Google ScholarDigital Library
M. Bertier, O. Marin, and P. Sens. Performance Analysis of Hierarchical Failure Detector. In Proceedings of the International Conference on Dependable Systems and Networks, San Francisco, California, June 2003.Google ScholarCross Ref
W. J. Bolosky, J. R. Douceur, D. Ely, and M. Theimer. Feasibility of a Serverless Distributed File System Deployed on an Existing Set of Desktop PCs. In Proceedings of the SIGMETRICS Conference, Santa Clara, California, June 2000. Google ScholarDigital Library
M. Castro, M. Costa, and A. I. T. Rowstron. Performance and Dependability of Structured Peer-to-Peer Overlays. In Proceedings of the International Conference on Dependable Systems and Networks, Florence, Italy, June 2004. Google ScholarDigital Library
T. D. Chandra and S. Toueg. Unreliable Failure Detectors for Reliable Distributed Systems. Journal of the ACM, 43(2):225--267, Mar. 1996. Google ScholarDigital Library
W. Chen. On the Quality of Service of Failure Detectors. PhD thesis, Cornell University, May 2000. Google ScholarDigital Library
W. Chen, S. Toueg, and M. K. Aguilera. On the Quality of Service of Failure Detectors. IEEE Transactions on Computers, 51(5), May 2002. Google ScholarDigital Library
X. Defago, P. Felber, and A. Schiper. Optimization Techniques for Replicating CORBA Objects. In Proceedings of the IEEE International Workshop on Object-oriented Real-time Dependable Systems, Santa Barbara, California, Jan. 1999. Google ScholarDigital Library
X. Defago, P. Urban, N. Hayashibara, and T. Katayama. Definition and Specification of Accrual Failure Detectors. In Proceedings of the International Conference on Dependable Systems and Networks, Yokohama, Japan, June 2005. Google ScholarDigital Library
J. Dunagan, N. J. A. Harvey, M. B. Jones, D. Kostic, M. Theimer, and A. Wolman. FUSE: Lightweight Guaranteed Distributed Failure Notification. In Proceedings of the Symposium on Operating System Design and Implementation, San Francisco, California, Dec. 2004. Google ScholarDigital Library
S. A. Fakhouri, G. S. Goldszmidt, and I. Gupta. Gulfstream - A System for Dynamic Topology Management in Multi-domain Server Farms. Technical Report RC21954, IBM T.J. Watson Research Center, Feb. 2001.Google ScholarCross Ref
C. Fetzer, M. Raynal, and F. Tronel. An Adaptive Failure Detection Protocol. In Proceedings of the Pacific Rim Symposium on Dependable Computing, Seoul, Korea, Dec. 2001. Google ScholarDigital Library
B. Glade, K. P. Birman, R. Cooper, and R. van Renesse. Light-Weight Process Groups in the ISIS System. Distributed Systems Engineering, 1(1):29--36, Sept. 1993.Google ScholarCross Ref
Gnutella. The Gnutella Protocol Specification. http://www9.limewire.com/developer/gnutella_protocol_0.4.pdf.Google Scholar
P. B. Godfrey, S. Shenker, and I. Stoica. Minimizing Churn in Distributed Systems. In Proceedings of the SIGCOMM Conference, Pisa, Italy, Sept. 2006. Google ScholarDigital Library
I. Gupta, K. Birman, P. Linga, A. Demers, and R. van Rennesse. Kelips: Building an Efficient and Stable P2P DHT Through Increased Memory and Background Overhead. In Proceedings of the International Workshop on Peer-to-Peer Systems, Berkeley, California, Feb. 2003.Google ScholarCross Ref
I. Gupta, T. D. Chandra, and G. S. Goldszmidt. On Scalable and Efficient Distributed Failure Detectors. In Proceedings of the ACM Symposium on Principles of Distributed Computing, Newport, Rhode Island, Aug. 2001. Google ScholarDigital Library
N. Hayashibara, X. Defago, and T. Katayama. Two-ways Adaptive Failure Detection with the ψ-failure Detector. In Proceedings of the International Workshop on Adaptive Distributed Systems, Italy, Oct. 2003.Google Scholar
M. G. Hayden. The Ensemble System. PhD thesis, Cornell University, Jan. 1998. Google ScholarDigital Library
M. Larrea, S. Arevalo, and A. Fernandez. Efficient Algorithms to Implement Unreliable Failure Detectors in Partially Synchronous Systems. In Proceedings of the International Symposium on Distributed Computing, Bratislava, Slovak Republic, Sept. 1999. Google ScholarDigital Library
M. Larrea, A. Fernandez, and S. Arevalo. Optimal Implementation of the Weakest Failure Detector for Solving Consensus. In Proceedings of the ACM Symposium on Principles of Distributed Computing, Portland, Oregon, July 2000. Google ScholarDigital Library
J. Li, J. Stribling, R. Morris, and M. F. Kaashoek. Bandwidth-Efficient Management of DHT Routing Tables. In Proceedings of the Symposium on Networked System Design and Implementation, Boston, Massachusetts, May 2005. Google ScholarDigital Library
L. E. Moser, P. M. Melliar-Smith, D. A. Argarwal, R. K. Budhia, and C. A. Lingley-Papadopoulos. Totem: A Fault-tolerant Multicast Group Communication System. Communications of the ACM, 39(4):54--63, Apr. 1996. Google ScholarDigital Library
V. Paxson. End-to-End Internet Packet Dynamics. In Proceedings of the SIGCOMM Conference, France, Sept. 1997. Google ScholarDigital Library
V. Ramasubramanian, R. Peterson, and E. G. Sirer. Corona: A High Performance Publish-Subscribe System for the World Wide Web. In Proceedings of the Symposium on Networked System Design and Implementation, San Jose, California, May 2006. Google ScholarDigital Library
V. Ramasubramanian and E. G. Sirer. The Design and Implementation of a Next Generation Name Service for the Internet. In Proceedings of the SIGCOMM Conference, Portland, Oregon, Aug. 2004. Google ScholarDigital Library
R. V. Renesse, Y. Minsky, and M. Hayden. A Gossip-Style Failure Detection Service. In Proceedings of the International Conference and Distributed Systems Platforms and Open Distributed Processing, Vienna, Austria, Sept. 1998.Google ScholarCross Ref
S. Rhea, D. Geels, T. Roscoe, and J. Kubiatowicz. Handling Churn in a DHT. In Proceedings of the USENIX Technical Conference, Boston, Massachussetts, June 2004. Google ScholarDigital Library
A. Rowstorn and P. Druschel. Pastry: Scalable, Decentralized Object Location and Routing for Large-scale Peer-to-Peer Systems. In Proceedings of the IFIP/ACM International Conference on Distributed Systems Platforms, Heidelberg, Germany, Nov. 2001. Google ScholarDigital Library
S. Saroiu, K. Gummadi, and S. D. Gribble. Measuring and Analyzing the Characteristics of Napster and Gnutella Hosts. Multimedia Systems Journal, 9(2):170--184, Aug. 2003. Google ScholarDigital Library
Y. J. Song, V. Ramasubramanian, and E. G. Sirer. Optimal Resource Utilization in Content Distribution Networks. Technical Report TR2005-2004, Cornell University, Computing and Information Science, Ithaca, New York, Nov. 2005.Google Scholar
P. Stelling, I. Foster, C. Kesselman, C. Lee, and G. von Laszewski. A Fault Detection Service for Wide Area Distributed Computations. In Proceedings of the IEEE Symposium on High Performance Distributed Computing, Chicago, Illinois, July 1998. Google ScholarDigital Library
J. Stribling. PlanetLab All Pairs Ping Data. http://pdos.csail.mit.edu/~strib/pl_app.Google Scholar
V. Vishnumurthy and P. Francis. On Heterogeneous Overlay Construction and Random Node Selection in Unstructured P2P Networks. In Proceedings of the INFOCOM Conference, Barcelona, Spain, Apr. 2006.Google ScholarCross Ref
H. Weatherspoon, B.-G. Chun, C. W. So, and J. Kubiatowicz. Long-Term Data Maintenance in Wide-Area Storage Systems: A Quantitative Approach. Technical Report UCB/CSD-05-1404, University of California-Berkeley, July 2005.Google Scholar
M. Yajnik, S. B. Moon, J. F. Kurose, and D. F. Towsley. Measurement and Modeling of the Temporal Dependence in Packet Loss. In Proceedings of the INFOCOM Conference, pages 345--352, New York, New York, Mar. 1999.Google Scholar
S. Zhuang, D. Geels, I. Stoica, and R. Katz. On Failure Detection Algorithms in Overlay Networks. In Proceedings of the INFOCOM Conference, Miami, Florida, Mar. 2005.Google ScholarCross Ref

Index Terms

Latency and bandwidth-minimizing failure detectors
1. Networks
  1. Network services

Recommendations

Latency and bandwidth-minimizing failure detectors
EuroSys'07 Conference Proceedings

Failure detectors are fundamental building blocks in distributed systems. Multi-node failure detectors, where the detector is tasked with monitoring N other nodes, play a critical role in overlay networks and peer-to-peer systems. In such networks, ...
Read More
On the Implementation of Unreliable Failure Detectors in Partially Synchronous Systems

Unreliable failure detectors were proposed by Chandra and Toueg as mechanisms that provide information about process failures. Chandra and Toueg defined eight classes of failure detectors, depending on how accurate this information is, and presented an ...
Read More
Unreliable failure detectors for reliable distributed systems

We introduce the concept of unreliable failure detectors and study how they can be used to solve Consensus in asynchronous systems with crash failures. We characterise unreliable failure detectors in terms of two properties—completeness and accuracy. We ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
EuroSys '07: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
March 2007
431 pages
ISBN:9781595936363
DOI:10.1145/1272996
ACM SIGOPS Operating Systems Review Volume 41, Issue 3
EuroSys'07 Conference Proceedings
June 2007
386 pages
ISSN:0163-5980
DOI:10.1145/1272998
Issue’s Table of Contents
Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 21 March 2007
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
failure detection
overlays
wide-area networks
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate241of1,308submissions,18%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 16
  Total Citations
  View Citations
- 472
  Total Downloads
- Downloads (Last 12 months)7
- Downloads (Last 6 weeks)3
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Latency and bandwidth-minimizing failure detectors

EuroSys '07: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007

ABSTRACT

References

Cited By

Index Terms

Recommendations

Latency and bandwidth-minimizing failure detectors

On the Implementation of Unreliable Failure Detectors in Partially Synchronous Systems

Unreliable failure detectors for reliable distributed systems