ABSTRACT
This paper describes the design and implementation of Egalitarian Paxos (EPaxos), a new distributed consensus algorithm based on Paxos. EPaxos achieves three goals: (1) optimal commit latency in the wide-area when tolerating one and two failures, under realistic conditions; (2) uniform load balancing across all replicas (thus achieving high throughput); and (3) graceful performance degradation when replicas are slow or crash.
Egalitarian Paxos is to our knowledge the first protocol to achieve the previously stated goals efficiently---that is, requiring only a simple majority of replicas to be non-faulty, using a number of messages linear in the number of replicas to choose a command, and committing commands after just one communication round (one round trip) in the common case or after at most two rounds in any case. We prove Egalitarian Paxos's properties theoretically and demonstrate its advantages empirically through an implementation running on Amazon EC2.
Supplemental Material
- M. K. Aguilera, C. Delporte-Gallet, H. Fauconnier, and S. Toueg. Thrifty generic broadcast. In Proc. 14th International Conference on Distributed Computing, DISC '00, pages 268--282, London, UK, UK, 2000. Springer-Verlag. Google ScholarDigital Library
- J. Baker, C. Bond, J. C. Corbett, J. Furman, A. Khorlin, J. Larson, J.-M. Leon, Y. Li, A. Lloyd, and V. Yushprakh. Megastore: Providing scalable, highly available storage for interactive services. In Proc. of the Conference on Innovative Data system Research (CIDR), pages 223--234, 2011.Google Scholar
- M. Biely, Z. Milosevic, N. Santos, and A. Schiper. Spaxos: Offloading the leader for high throughput state machine replication. In Reliable Distributed Systems (SRDS), 2012 IEEE 31st Symposium on, 2012. Google ScholarDigital Library
- M. Burrows. The Chubby lock service for loosely-coupled distributed systems. In Proc. 7th USENIX OSDI, Seattle, WA, Nov. 2006. Google ScholarDigital Library
- L. J. Camargos, R. M. Schmidt, and F. Pedone. Multicoordinated paxos. In Proc. 26th annual ACM symposium on Principles of distributed computing, PODC '07, pages 316--317, New York, NY, USA, 2007. ACM. Google ScholarDigital Library
- T. D. Chandra and S. Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM, 43: 225--267, Mar. 1996. Google ScholarDigital Library
- T. D. Chandra, R. Griesemer, and J. Redstone. Paxos made live: an engineering perspective. In Proc. 26th ACM SOSP, PODC '07, pages 398--407, New York, NY, USA, 2007. ACM. Google ScholarDigital Library
- J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, W. Hsieh, S. Kanthak, E. Kogan, H. Li, A. Lloyd, S. Melnik, D. Mwaura, D. Nagle, S. Quinlan, R. Rao, L. Rolig, Y. Saito, M. Szymaniak, C. Taylor, R. Wang, and D. Woodford. Spanner: Google's globally-distributed database. In Proc. 10th USENIX OSDI. USENIX, 2012. Google ScholarDigital Library
- M. J. Fischer, N. A. Lynch, and M. S. Paterson. Impossibility of distributed consensus with one faulty process. J. ACM, 32(2):374--382, Apr. 1985. ISSN 0004-5411. Google ScholarDigital Library
- Google AppEngine. High replication datastore, 2012. https://developers.google.com/appengine/docs/java/datastore/overview.Google Scholar
- M. P. Herlihy and J. M. Wing. Linearizability: a correctness condition for concurrent objects. ACM Trans. Program. Lang. Syst., 12(3), July 1990. Google ScholarDigital Library
- P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. ZooKeeper: wait-free coordination for internet-scale systems. In Proc. USENIX ATC, USENIXATC'10, Berkeley, CA, USA, 2010. USENIX Association. Google ScholarDigital Library
- M. Kaptritsos, Y. Wang, V. Quema, A. Clement, L. Alvisi, and M. Dahlin. Eve: Execute-verify replication for multi-core servers. In Proc. 10th USENIX OSDI, Hollywood, CA, Oct. 2012. Google ScholarDigital Library
- T. Kraska, G. Pang, M. J. Franklin, S. Madden, and A. Fekete. MDCC: Multi-data center consistency. In Proc. 8th ACM European Conference on Computer Systems (EuroSys), Apr. 2013. Google ScholarDigital Library
- L. Lamport. The part-time parliament. ACM Transactions on Computer Systems, 16(2):133--169, 1998. ISSN 0734-2071. Google ScholarDigital Library
- L. Lamport. Paxos made simple. ACM SIGACT News, 32 (4), Dec. 2001.Google Scholar
- L. Lamport. Generalized consensus and Paxos. http://research.microsoft.com/apps/pubs/default.aspx?id=64631, 2005.Google Scholar
- L. Lamport. Fast Paxos. http://research.microsoft.com/apps/pubs/default.aspx?id=64624, 2006.Google Scholar
- L. Lamport, D. Malkhi, and L. Zhou. Vertical Paxos and primary-backup replication. Technical report, Microsoft Research, 2009.Google Scholar
- L. Lamport, D. Malkhi, and L. Zhou. Reconfiguring a state machine. SIGACT News, 41(1), Mar. 2010. Google ScholarDigital Library
- B. Liskov and J. Cowling. Viewstamped replication revisited. Technical Report MIT-CSAIL-TR-2012-021, MIT Computer Science and Artificial Intelligence Laboratory, 2012.Google Scholar
- J. MacCormick, N. Murphy, M. Najork, C. A. Thekkath, and L. Zhou. Boxwood: abstractions as the foundation for storage infrastructure. In Proc. 6th USENIX OSDI, San Francisco, CA, Dec. 2004. Google ScholarDigital Library
- Y. Mao, F. P. Junqueira, and K. Marzullo. Mencius: building efficient replicated state machines for WANs. In Proc. 8th USENIX OSDI, pages 369--384, San Diego, CA, Dec. 2008. Google ScholarDigital Library
- I. Moraru, D. G. Andersen, and M. Kaminsky. Epaxos code base. https://github.com/efficient/epaxos, Aug. 2013.Google Scholar
- I. Moraru, D. G. Andersen, and M. Kaminsky. A proof of correctness for Egalitarian Paxos. Technical report, Parallel Data Laboratory, Carnegie Mellon University, Aug. 2013. http://www.pdl.cmu.edu/PDL-FTP/associated/CMU-PDL-13-111.pdf.Google Scholar
- F. Pedone and A. Schiper. Handling message semantics with generic broadcast protocols. Distributed Computing, 15:97--107, Apr. 2002. Google ScholarDigital Library
- F. Pedone and A. Schiper. Optimistic atomic broadcast: a pragmatic viewpoint. Theoretical Computer Science, 291: 79--101, Jan. 2003. Google ScholarDigital Library
- tpc-c. TPC benchmark C. http://www.tpc.org/tpcc/spec/tpcc_current.pdf, 2010.Google Scholar
- P. Zieliński. Optimistic generic broadcast. In Proc. 19th International Symposium on Distributed Computing (DISC), pages 369--383, Kraków, Poland, Sept. 2005. Google ScholarDigital Library
Recommendations
Hierarchical Quorum Consensus: A New Algorithm for Managing Replicated Data
A novel algorithm for managing replicated data is presented. The proposed method is based on organizing the copies of an object into a logical, multilevel hierarchy, and extending the quorum consensus algorithm to such an environment. Several properties ...
Generalized Grid Quorum Consensus for Replica Control Protocol
CICN '11: Proceedings of the 2011 International Conference on Computational Intelligence and Communication NetworksIn distributed systems it is often necessary to provide coordination among the multiple concurrent processes to tolerate the contention, periods of asynchrony and a number of failures. Quorum systems provide a decentralized approach for such ...
On the correctness of Egalitarian Paxos
AbstractThis paper identifies a problem in both the TLA + specification and the implementation of the Egalitarian Paxos protocol. It is related to how replicas switch from one ballot to another when computing the dependencies of a command. The ...
Highlights- There is a problem in the specification and implementation of Egalitarian Paxos.
Comments