skip to main content
10.1145/1272996.1273008acmconferencesArticle/Chapter ViewAbstractPublication PageseurosysConference Proceedingsconference-collections
Article

Latency and bandwidth-minimizing failure detectors

Published:21 March 2007Publication History

ABSTRACT

Failure detectors are fundamental building blocks in distributed systems. Multi-node failure detectors, where the detector is tasked with monitoring N other nodes, play a critical role in overlay networks and peer-to-peer systems. In such networks, failures need to be detected quickly and with low overhead. Achieving these properties simultaneously poses a difficult tradeoff between detection latency and resource consumption.

In this paper, we examine this central tradeoff, formalize it as an optimization problem and analytically derive the optimal closed form formulas for multi-node failure detectors. We provide two variants of the optimal solution for optimality metrics appropriate for two different deployment scenarios. √s-LM is a latency-minimizing optimal failure detector that achieves the lowest average failure detection latency given a fixed bandwidth constraint for system maintenance. √s-BM is a bandwidth-minimizing optimal failure detector that meets a desired detection latency target with the least amount of bandwidth consumed. We evaluate our optimal results with node lifetimes chosen from bimodal and Pareto distributions, as well as real-world trace data from PlanetLab hosts, web sites and Microsoft PCs. Compared to standard failure detectors in wide use, √s failure detectors reduce failure detection latencies by 40% on average for the same bandwidth consumption, or conversely, reduce the amount of bandwidth consumed by 30% for the same failure detection latency.

References

  1. M. K. Aguilera, W. Chen, and S. Toueg. Heartbeat: a Timeout-free Failure Detector for Quiescent Reliable Communication. In Proceedings of the International Workshop on Distributed Algorithms, Saarbrücken, Germany, Sept. 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. M. K. Aguilera, W. Chen, and S. Toueg. Failure Detection and Consensus in the Crash-Recovery Model. In Proceedings of the International Symposium on Distributed Computing, Andros, Greece, Sept. 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Y. Amir, D. Dolev, S. Kramer, and D. Malkhi. Transis: A Communication Sub-system for High Availability. In Proceedings of the International Symposium on Fault-Tolerant Computing, Boston, Massachussetts, July 1992.Google ScholarGoogle Scholar
  4. M. Bakkaloglu, J. J. Wylie, C. Wang, and G. R. Ganger. On Correlated Failures in Survivable Storage Systems. Technical Report CMU-CS-02-129, School of Computer Science, Carnegie Mellon University, May 2002.Google ScholarGoogle ScholarCross RefCross Ref
  5. M. Bertier, O. Marin, and P. Sens. Implementation and Performance Evaluation of an Adaptable Failure Detector. In Proceedings of the International Conference on Dependable Systems and Networks, Washington, DC, June 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. M. Bertier, O. Marin, and P. Sens. Performance Analysis of Hierarchical Failure Detector. In Proceedings of the International Conference on Dependable Systems and Networks, San Francisco, California, June 2003.Google ScholarGoogle ScholarCross RefCross Ref
  7. W. J. Bolosky, J. R. Douceur, D. Ely, and M. Theimer. Feasibility of a Serverless Distributed File System Deployed on an Existing Set of Desktop PCs. In Proceedings of the SIGMETRICS Conference, Santa Clara, California, June 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. M. Castro, M. Costa, and A. I. T. Rowstron. Performance and Dependability of Structured Peer-to-Peer Overlays. In Proceedings of the International Conference on Dependable Systems and Networks, Florence, Italy, June 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. T. D. Chandra and S. Toueg. Unreliable Failure Detectors for Reliable Distributed Systems. Journal of the ACM, 43(2):225--267, Mar. 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. W. Chen. On the Quality of Service of Failure Detectors. PhD thesis, Cornell University, May 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. W. Chen, S. Toueg, and M. K. Aguilera. On the Quality of Service of Failure Detectors. IEEE Transactions on Computers, 51(5), May 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. X. Defago, P. Felber, and A. Schiper. Optimization Techniques for Replicating CORBA Objects. In Proceedings of the IEEE International Workshop on Object-oriented Real-time Dependable Systems, Santa Barbara, California, Jan. 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. X. Defago, P. Urban, N. Hayashibara, and T. Katayama. Definition and Specification of Accrual Failure Detectors. In Proceedings of the International Conference on Dependable Systems and Networks, Yokohama, Japan, June 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. J. Dunagan, N. J. A. Harvey, M. B. Jones, D. Kostic, M. Theimer, and A. Wolman. FUSE: Lightweight Guaranteed Distributed Failure Notification. In Proceedings of the Symposium on Operating System Design and Implementation, San Francisco, California, Dec. 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. S. A. Fakhouri, G. S. Goldszmidt, and I. Gupta. Gulfstream - A System for Dynamic Topology Management in Multi-domain Server Farms. Technical Report RC21954, IBM T.J. Watson Research Center, Feb. 2001.Google ScholarGoogle ScholarCross RefCross Ref
  16. C. Fetzer, M. Raynal, and F. Tronel. An Adaptive Failure Detection Protocol. In Proceedings of the Pacific Rim Symposium on Dependable Computing, Seoul, Korea, Dec. 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. B. Glade, K. P. Birman, R. Cooper, and R. van Renesse. Light-Weight Process Groups in the ISIS System. Distributed Systems Engineering, 1(1):29--36, Sept. 1993.Google ScholarGoogle ScholarCross RefCross Ref
  18. Gnutella. The Gnutella Protocol Specification. http://www9.limewire.com/developer/gnutella_protocol_0.4.pdf.Google ScholarGoogle Scholar
  19. P. B. Godfrey, S. Shenker, and I. Stoica. Minimizing Churn in Distributed Systems. In Proceedings of the SIGCOMM Conference, Pisa, Italy, Sept. 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. I. Gupta, K. Birman, P. Linga, A. Demers, and R. van Rennesse. Kelips: Building an Efficient and Stable P2P DHT Through Increased Memory and Background Overhead. In Proceedings of the International Workshop on Peer-to-Peer Systems, Berkeley, California, Feb. 2003.Google ScholarGoogle ScholarCross RefCross Ref
  21. I. Gupta, T. D. Chandra, and G. S. Goldszmidt. On Scalable and Efficient Distributed Failure Detectors. In Proceedings of the ACM Symposium on Principles of Distributed Computing, Newport, Rhode Island, Aug. 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. N. Hayashibara, X. Defago, and T. Katayama. Two-ways Adaptive Failure Detection with the ψ-failure Detector. In Proceedings of the International Workshop on Adaptive Distributed Systems, Italy, Oct. 2003.Google ScholarGoogle Scholar
  23. M. G. Hayden. The Ensemble System. PhD thesis, Cornell University, Jan. 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. M. Larrea, S. Arevalo, and A. Fernandez. Efficient Algorithms to Implement Unreliable Failure Detectors in Partially Synchronous Systems. In Proceedings of the International Symposium on Distributed Computing, Bratislava, Slovak Republic, Sept. 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. M. Larrea, A. Fernandez, and S. Arevalo. Optimal Implementation of the Weakest Failure Detector for Solving Consensus. In Proceedings of the ACM Symposium on Principles of Distributed Computing, Portland, Oregon, July 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. J. Li, J. Stribling, R. Morris, and M. F. Kaashoek. Bandwidth-Efficient Management of DHT Routing Tables. In Proceedings of the Symposium on Networked System Design and Implementation, Boston, Massachusetts, May 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. L. E. Moser, P. M. Melliar-Smith, D. A. Argarwal, R. K. Budhia, and C. A. Lingley-Papadopoulos. Totem: A Fault-tolerant Multicast Group Communication System. Communications of the ACM, 39(4):54--63, Apr. 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. V. Paxson. End-to-End Internet Packet Dynamics. In Proceedings of the SIGCOMM Conference, France, Sept. 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. V. Ramasubramanian, R. Peterson, and E. G. Sirer. Corona: A High Performance Publish-Subscribe System for the World Wide Web. In Proceedings of the Symposium on Networked System Design and Implementation, San Jose, California, May 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. V. Ramasubramanian and E. G. Sirer. The Design and Implementation of a Next Generation Name Service for the Internet. In Proceedings of the SIGCOMM Conference, Portland, Oregon, Aug. 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. R. V. Renesse, Y. Minsky, and M. Hayden. A Gossip-Style Failure Detection Service. In Proceedings of the International Conference and Distributed Systems Platforms and Open Distributed Processing, Vienna, Austria, Sept. 1998.Google ScholarGoogle ScholarCross RefCross Ref
  32. S. Rhea, D. Geels, T. Roscoe, and J. Kubiatowicz. Handling Churn in a DHT. In Proceedings of the USENIX Technical Conference, Boston, Massachussetts, June 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. A. Rowstorn and P. Druschel. Pastry: Scalable, Decentralized Object Location and Routing for Large-scale Peer-to-Peer Systems. In Proceedings of the IFIP/ACM International Conference on Distributed Systems Platforms, Heidelberg, Germany, Nov. 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. S. Saroiu, K. Gummadi, and S. D. Gribble. Measuring and Analyzing the Characteristics of Napster and Gnutella Hosts. Multimedia Systems Journal, 9(2):170--184, Aug. 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Y. J. Song, V. Ramasubramanian, and E. G. Sirer. Optimal Resource Utilization in Content Distribution Networks. Technical Report TR2005-2004, Cornell University, Computing and Information Science, Ithaca, New York, Nov. 2005.Google ScholarGoogle Scholar
  36. P. Stelling, I. Foster, C. Kesselman, C. Lee, and G. von Laszewski. A Fault Detection Service for Wide Area Distributed Computations. In Proceedings of the IEEE Symposium on High Performance Distributed Computing, Chicago, Illinois, July 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. J. Stribling. PlanetLab All Pairs Ping Data. http://pdos.csail.mit.edu/~strib/pl_app.Google ScholarGoogle Scholar
  38. V. Vishnumurthy and P. Francis. On Heterogeneous Overlay Construction and Random Node Selection in Unstructured P2P Networks. In Proceedings of the INFOCOM Conference, Barcelona, Spain, Apr. 2006.Google ScholarGoogle ScholarCross RefCross Ref
  39. H. Weatherspoon, B.-G. Chun, C. W. So, and J. Kubiatowicz. Long-Term Data Maintenance in Wide-Area Storage Systems: A Quantitative Approach. Technical Report UCB/CSD-05-1404, University of California-Berkeley, July 2005.Google ScholarGoogle Scholar
  40. M. Yajnik, S. B. Moon, J. F. Kurose, and D. F. Towsley. Measurement and Modeling of the Temporal Dependence in Packet Loss. In Proceedings of the INFOCOM Conference, pages 345--352, New York, New York, Mar. 1999.Google ScholarGoogle Scholar
  41. S. Zhuang, D. Geels, I. Stoica, and R. Katz. On Failure Detection Algorithms in Overlay Networks. In Proceedings of the INFOCOM Conference, Miami, Florida, Mar. 2005.Google ScholarGoogle ScholarCross RefCross Ref

Index Terms

  1. Latency and bandwidth-minimizing failure detectors

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      EuroSys '07: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
      March 2007
      431 pages
      ISBN:9781595936363
      DOI:10.1145/1272996
      • cover image ACM SIGOPS Operating Systems Review
        ACM SIGOPS Operating Systems Review  Volume 41, Issue 3
        EuroSys'07 Conference Proceedings
        June 2007
        386 pages
        ISSN:0163-5980
        DOI:10.1145/1272998
        Issue’s Table of Contents

      Copyright © 2007 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 21 March 2007

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • Article

      Acceptance Rates

      Overall Acceptance Rate241of1,308submissions,18%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader