ABSTRACT
Distributed storage systems aim to provide strong consistency and isolation guarantees on an architecture that is partitioned across multiple shards for scalability and replicated for fault tolerance. Traditionally, achieving all of these goals has required an expensive combination of atomic commitment and replication protocols -- introducing extensive coordination overhead. Our system, Eris, takes a different approach. It moves a core piece of concurrency control functionality, which we term multi-sequencing, into the datacenter network itself. This network primitive takes on the responsibility for consistently ordering transactions, and a new lightweight transaction protocol ensures atomicity.
The end result is that Eris avoids both replication and transaction coordination overhead: we show that it can process a large class of distributed transactions in a single round-trip from the client to the storage system without any explicit coordination between shards or replicas in the normal case. It provides atomicity, consistency, and fault tolerance with less than 10% overhead -- achieving throughput 3.6-35x higher and latency 72-80% lower than a conventional design on standard benchmarks.
Supplemental Material
- A. Adya, R. Gruber, B. Liskov, and U. Maheshwari. Efficient optimistic concurrency control using loosely synchronized clocks. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, San Jose, CA, USA, June 1995. ACM. Google ScholarDigital Library
- D. Agrawal, A. E. Abbadi, and K. Salem. A taxonomy of partitioned replicated cloud-based database systems. IEEE Data Engineering Bulletin, 38(1):4--9, Mar. 2015.Google Scholar
- J. Baker, C. Bond, J. C. Corbett, J. Furman, A. Khorlin, J. Larson, J.-M. Léon, Y. Li, A. Lloyd, and V. Yushprakh. Megastore: Providing scalable, highly available storage for interactive services. In Proceedings of the 5th Conference on Innovative Data Systems Research (CIDR '11), Asilomar, CA, USA, Jan. 2011. VLDB / ACM.Google Scholar
- M. Balakrishnan, D. Malkhi, J. Davis, V. Prabhakaran, M. Wei, and T. Wobber. CORFU: A distributed shared log. ACM Transactions on Computer Systems, 31(4), Dec. 2013. Google ScholarDigital Library
- M. Balakrishnan, D. Malkhi, T. Wobber, M. Wu, V. Prabhakaran, M. Wei, J. Davis, S. Rao, T. Zou, and A. Zuck. Tango: Distributed data structures over a shared log. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP '13), Farmington, PA, USA, Nov. 2013. ACM. Google ScholarDigital Library
- Barefoot Networks. Tofino. https://www.barefootnetworks.com/technology/.Google Scholar
- P. Bernstein and N. Goodman. Concurrency control in distributed database systems. ACM Computing Surveys, 13(2), June 1981. Google ScholarDigital Library
- P. A. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency Control and Recovery In Database Systems. Addison-Wesley, Boston, MA, USA, Feb. 1987. Google ScholarDigital Library
- P. A. Bernstein, C. W. Reid, and S. Das. Hyder - A transactional record manager for shared flash. In Proceedings of the 5th Conference on Innovative Data Systems Research (CIDR '11), Asilomar, CA, USA, Jan. 2011. VLDB / ACM.Google Scholar
- K. P. Birman and T. A. Joseph. Exploiting virtual synchrony in distributed systems. In Proceedings of the 11th ACM Symposium on Operating Systems Principles (SOSP '87), Austin, TX, USA, Oct. 1987. ACM. Google ScholarDigital Library
- K. P. Birman and T. A. Joseph. Reliable communication in the presence of failures. ACM Transactions on Computer Systems, 5(1):47--76, Jan. 1987. Google ScholarDigital Library
- P. Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKeown, J. Rexford, C. Schlesinger, D. Talayco, A. Vahdat, G. Varghese, and D. Walker. P4: Programming protocol-independent packet processors. ACM SIGCOMM Computer Communication Review, 44(3):87--95, July 2014. Google ScholarDigital Library
- P. Bosshart, G. Gibb, H.-S. Kim, G. Varghese, N. McKeown, M. Izzard, F. Mujica, and M. Horowitz. Forwarding metamorphosis: Fast programmable match-action processing in hardware for SDN. In Proceedings of ACM SIGCOMM 2013, Hong Kong, China, Aug. 2013. ACM. Google ScholarDigital Library
- G. Brebner and W. Jiang. High-speed packet processing using reconfigurable computing. IEEE Micro, 34(1):8--18, Jan. 2014.Google ScholarCross Ref
- M. J. Carey and M. R. Stonebraker. The performance of concurrency control algorithms for database management systems. In Proceedings of the 10th International Conference on Very Large Data Bases (VLDB '84), Singapore, Aug. 1984. Google ScholarDigital Library
- F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI '06), Seattle, WA, USA, Nov. 2006. USENIX. Google ScholarDigital Library
- Y. Chen, X. Wei, J. Shi, R. Chen, and H. Chen. Fast and general distributed transactions using RDMA and HTM. In Proceedings of the 11th ACM SIGOPS EuroSys (EuroSys '16), London, United Kingdom, Apr. 2016. ACM. Google ScholarDigital Library
- B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC '10), Indianapolis, IN, USA, 2010. ACM. Google ScholarDigital Library
- J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, W. Hsieh, S. Kanthak, E. Kogan, H. Li, A. Lloyd, S. Melnik, D. Mwaura, D. Nagle, S. Quinlan, R. Rao, L. Rolig, Y. Saito, M. Szymaniak, C. Taylor, R. Wang, and D. Woodford. Spanner: Google's globally-distributed database. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI '12), Hollywood, CA, USA, Oct. 2012. USENIX. Google ScholarDigital Library
- J. Cowling. Low-Overhead Distributed Transaction Coordination. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, June 2012. Google ScholarDigital Library
- J. Cowling and B. Liskov. Granola: Low-overhead distributed transaction coordination. In Proceedings of the 2012 USENIX Annual Technical Conference, Boston, MA, USA, June 2012. USENIX. Google ScholarDigital Library
- H. T. Dang, D. Sciascia, M. Canini, F. Pedone, and R. Soulé. Netpaxos: Consensus at network speed. In Proceedings of the 1st ACM SIGCOMM Symposium on Software Defined Networking Research (SOSR '15), pages 5:1--5:7, New York, NY, USA, 2015. ACM. Google ScholarDigital Library
- J. Dean and L. A. Barosso. The tail at scale. Communications of the ACM, 56(2):74--80, Feb. 2013. Google ScholarDigital Library
- A. Dey, A. Fekete, R. Nambiar, and U. Rohm. YCSB+T: Benchmarking web-scale transactional databases. In Proceedings of the 30th IEEE International Conference on Data Engineering Workshops (ICDEW '14), Chicago, IL, USA, 2014. IEEE.Google ScholarCross Ref
- A. Dragojević, D. Narayanan, O. Hodson, and M. Castro. FaRM: Fast remote memory. In Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI '14), Seattle, WA, USA, Apr. 2014. USENIX. Google ScholarDigital Library
- A. Dragojević, D. Narayanan, E. B. Nightingale, M. Renzelmann, A. Shamis, A. Badam, and M. Castro. No compromises: Distributed transactions with consistency, availability, and performance. In Proceedings of the 25th ACM Symposium on Operating Systems Principles (SOSP '15), Monterey, CA, USA, Oct. 2015. ACM. Google ScholarDigital Library
- J. Du, S. Elnikety, and W. Zwaenepoel. Clock-SI: Snapshot isolation for partitioned data stores using loosely synchronized clocks. In Proceedings of the 32nd IEEE Symposium on Reliable Distributed Systems (SRDS '13), Braga, Portugal, Oct. 2013. IEEE. Google ScholarDigital Library
- D. G. Ferro and M. Yabandeh. A critique of snapshot isolation. In Proceedings of the 7th ACM SIGOPS EuroSys (EuroSys '12), Bern, Switzerland, Apr. 2012. ACM. Google ScholarDigital Library
- L. Glendenning, I. Beschastnikh, A. Krishnamurthy, and T. Anderson. Scalable consistency in scatter. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP '11), Cascais, Portugal, Oct. 2011. ACM. Google ScholarDigital Library
- S. Harizopoulos, D. J. Abadi, S. Madden, and M. Stonebraker. OLTP through the looking glass, and what we found there. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, Vancouver, BC, Canada, June 2008. ACM. Google ScholarDigital Library
- M. P. Herlihy and J. M. Wing. Linearizabiliy: A correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems, 12(3):463--492, July 1990. Google ScholarDigital Library
- S. Jain, A. Kumar, S. Mandal, J. Ong, L. Poutievski, A. Singh, S. Venkata, J. Wanderer, J. Zhou, M. Zhu, J. Zolla, U. Hölzle, S. Stuart, and A. Vahdat. B4: Experience with a globally-deployed software defined WAN. In Proceedings of ACM SIGCOMM 2013, Hong Kong, China, Aug. 2013. ACM. Google ScholarDigital Library
- E. P. C. Jones, D. J. Abadi, and S. Madden. Low overhead concurrency control for partitioned main memory databases. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, Indianapolis, IN, USA, June 2010. ACM. Google ScholarDigital Library
- M. F. Kaashoek and A. S. Tanenbaum. Group communication in the Amoeba distributed operating system. In Proceedings of the 11th IEEE International Conference on Distributed Computing Systems (ICDCS '91), Arlington, TX, USA, 1991. IEEE.Google ScholarCross Ref
- A. Kalia, M. Kaminsky, and D. G. Andersen. Design guidelines for high performance RDMA systems. In 2016 USENIX Annual Technical Conference (USENIX ATC '16), Denver, CO, July 2016. USENIX. Google ScholarDigital Library
- R. Kallman, H. Kimura, J. Natkins, A. Pavlo, A. Rasin, S. Zdonik, E. P. C. Jones, S. Madden, M. Stonebraker, Y. Zhang, J. Hugg, and D. J. Abadi. H-Store: a high-performance, distributed main memory transaction processing system. Proceedings of the VLDB Endowment, 1(2):1496--1499, 2008. Google ScholarDigital Library
- T. Koponen, M. Casado, N. Gude, J. Stribling, L. Poutievski, M. Zhu, R. Ramanathan, Y. Iwata, H. Inoue, T. Hama, and S. Shenker. Onix: A distributed control platform for large-scale production networks. In Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI '10), Vancouver, BC, Canada, Oct. 2010. USENIX. Google ScholarDigital Library
- T. Kraska, G. Pang, M. J. Franklin, S. Madden, and A. Fekete. MDCC: multi-data center consistency. In Proceedings of the 8th ACM SIGOPS EuroSys (EuroSys '13), Prague, Czech Republic, Apr. 2013. ACM. Google ScholarDigital Library
- L. Lamport. The part-time parliament. ACM Transactions on Computer Systems, 16(2):133--169, May 1998. Google ScholarDigital Library
- L. Lamport. Paxos made simple. ACM SIGACT News, 32(4):18--25, Dec. 2001.Google Scholar
- L. Lamport. Fast Paxos. Distributed Computing, 19(2):79--103, Oct. 2006.Google ScholarDigital Library
- J. Li, E. Michael, and D. R. K. Ports. Eris: Coordination-free consistent transactions using in-network concurrency control {extended version}. Technical Report UW-CSE-17-10-01, University of Washington CSE, Seattle, WA, USA, Oct. 2017.Google Scholar
- J. Li, E. Michael, A. Szekeres, N. K. Sharma, and D. R. K. Ports. Just say NO to Paxos overhead: Replacing consensus with network ordering. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI '16), Savannah, GA, USA, Nov. 2016. USENIX. Google ScholarDigital Library
- B. Liskov. Practical uses of synchronized clocks in distributed systems. In Proceedings of the 10th ACM Symposium on Principles of Distributed Computing (PODC '91), Montreal, QC, Canada, Aug. 1991. ACM. Google ScholarDigital Library
- B. Liskov and J. Cowling. Viewstamped replication revisited. Technical Report MIT-CSAIL-TR-2012-021, MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA, July 2012.Google Scholar
- V. Liu, D. Halperin, A. Krishnamurthy, and T. Anderson. F10: A fault-tolerant engineered network. In Proceedings of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI '13), Lombard, IL, USA, Apr. 2013. USENIX. Google ScholarDigital Library
- H. Mahmoud, F. Nawab, A. Pucher, D. Agrawal, and A. El Abbadi. Low-latency multi-datacenter databases using replicated commit. Proceedings of the VLDB Endowment, 6(9):661--672, July 2013. Google ScholarDigital Library
- S. Mu, L. Nelson, W. Lloyd, and J. Li. Consolidating concurrency control and consensus for commits under conflicts. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI '16), Savannah, GA, USA, Nov. 2016. USENIX. Google ScholarDigital Library
- R. Nishtala, H. Fugal, S. Grimm, M. Kwiatkowski, H. Lee, H. C. Li, R. McElroy, M. Paleczny, D. Peek, P. Saab, D. Stafford, T. Tung, and V. Venkataramani. Scaling memcache at Facebook. In Proceedings of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI '13), Lombard, IL, USA, Apr. 2013. USENIX. Google ScholarDigital Library
- NOX network control platform. The POX SDN controller. https://github.com/noxrepo/pox.Google Scholar
- B. M. Oki and B. H. Liskov. Viewstamped replication: A new primary copy method to support highly-available distributed systems. In Proceedings of the 7th ACM Symposium on Principles of Distributed Computing (PODC '88), Toronto, Ontario, Canada, Aug. 1988. ACM. Google ScholarDigital Library
- R. Ozdag. Intel® Ethernet switch FM6000 series-software defined networking.Google Scholar
- D. Peng and F. Dabek. Large-scale incremental processing using distributed transactions and notifications. In Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI '10), Vancouver, BC, Canada, Oct. 2010. USENIX. Google ScholarDigital Library
- D. R. K. Ports, J. Li, V. Liu, N. K. Sharma, and A. Krishnamurthy. Designing distributed systems using approximate synchrony in data-center networks. In Proceedings of the 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI '15), Oakland, CA, USA, May 2015. USENIX. Google ScholarDigital Library
- D. P. Reed. Naming and synchronization in a decentralized computer system. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, Sept. 1978.Google ScholarDigital Library
- M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, and P. Helland. The end of an architectural era: (it's time for a complete rewrite). In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB '07), Vienna, Austria, Sept. 2007. Google ScholarDigital Library
- R. H. Thomas. A majority consensus approach to concurrency control for multiple copy databases. ACM Transactions on Database Systems, 4(2):180--209, June 1979. Google ScholarDigital Library
- A. Thomson and D. J. Abadi. The case for determinism in database systems. Proceedings of the VLDB Endowment, 3(10), 2010. Google ScholarDigital Library
- A. Thomson, T. Diamond, S.-C. Weng, K. Ren, P. Shao, and D. J. Abadi. Calvin: Fast distributed transactions for partitioned database systems. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, Scottsdale, AZ, USA, May 2012. ACM. Google ScholarDigital Library
- A. Thomson, T. Diamond, S.-C. Weng, K. Ren, P. Shao, and D. J. Abadi. Fast distributed transactions and strongly consistent replication for OLTP database systems. ACM Transactions on Database Systems, 39(2), May 2014. Google ScholarDigital Library
- Transaction Processing Performance Council. TPC Benchmark C. http//:www.tpc.org/tpcc/, Feb. 2010.Google Scholar
- M. Wei, A. Tai, C. J. Rossbach, I. Abraham, M. Munshed, M. Dhawan, J. Stabile, U. Wieder, S. Fritchie, S. Swanson, M. J. Freedman, and D. Malkhi. vCorfu: A cloud-scale object store on a shared log. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI '17), Boston, MA, Mar. 2017. USENIX. Google ScholarDigital Library
- X. Wei, J. Shi, Y. Chen, R. Chen, and H. Chen. Fast in-memory transaction processing using RDMA and HTM. In Proceedings of the 25th ACM Symposium on Operating Systems Principles (SOSP '15), Monterey, CA, USA, Oct. 2015. ACM. Google ScholarDigital Library
- S. Wolf, H. Mühe, A. Kemper, and T. Neumann. An evaluation of strict timestamp ordering concurrency control for main-memory database systems. In IMDM Workshop, 2013.Google Scholar
- XPliant Ethernet switch product family. www.cavium.com/XPliant-Ethernet-Switch-Product-Family.html.Google Scholar
- I. Zhang, N. K. Sharma, A. Szekeres, A. Krishnamurthy, and D. R. K. Ports. Building consistent transactions with inconsistent replication. In Proceedings of the 25th ACM Symposium on Operating Systems Principles (SOSP '15), Monterey, CA, USA, Oct. 2015. ACM. Google ScholarDigital Library
- I. Zhang, N. K. Sharma, A. Szekeres, A. Krishnamurthy, and D. R. K. Ports. When is operation ordering required in replicated transactional storage? IEEE Data Engineering Bulletin, 39(1):27--38, Mar. 2016.Google Scholar
Index Terms
- Eris: Coordination-Free Consistent Transactions Using In-Network Concurrency Control
Recommendations
Fast In-Memory Transaction Processing Using RDMA and HTM
DrTM is a fast in-memory transaction processing system that exploits advanced hardware features such as remote direct memory access (RDMA) and hardware transactional memory (HTM). To achieve high efficiency, it mostly offloads concurrency control such ...
Carousel: Low-Latency Transaction Processing for Globally-Distributed Data
SIGMOD '18: Proceedings of the 2018 International Conference on Management of DataThe trend towards global applications and services has created an increasing demand for transaction processing on globally-distributed data. Many database systems, such as Spanner and CockroachDB, support distributed transactions but require a large ...
Comments