We report on the design and correctness of a communication facility for a distributed computer system. The facility provides support for fault tolerant process groups in the form of a family of reliable multicast protocols that can be used both in local and wide-area networks. These protocols attain high levels of concurrency while respecting application-specific delivery ordering constraints, and have varying cost and performance that depends on the degree of ordering desired. In particular, a protocol that enforces causal delivery orderings is introduced, and shown to be a valuable alternative to conventional asynchronous communication protocols. The facility also ensures that the processes belonging to a fault tolerant process group will observe consistent orderings of events affecting the group as a whole, including process failures, recoveries, migration, and dynamic changes to group properties like member rankings. A review of several uses for the protocols in the ISIS system, which supports fault-tolerant resilient objects and bulletin boards, illustrates the significant simplification of higher-level algorithms; made possible by our approach.
Recommendations
Reliable communication in the presence of failures
The design and correctness of a communication facility for a distributed computer system are reported on. The facility provides support for fault-tolerant process groups in the form of a family of reliable multicast protocols that can be used in both ...
On Quiescent Reliable Communication
We study the problem of achieving reliable communication with quiescent algorithms (i.e., algorithms that eventually stop sending messages) in asynchronous systems with process crashes and lossy links. We first show that it is impossible to solve this ...
Reliable Broadcast in Radio Networks with Locally Bounded Failures
This paper studies the reliable broadcast problem in a radio network with locally bounded failures. We present a sufficient condition for achievability of reliable broadcast in a general graph subject to Byzantine/crash-stop failures. We then consider ...