skip to main content
10.1145/3132747.3132751acmconferencesArticle/Chapter ViewAbstractPublication PagessospConference Proceedingsconference-collections
research-article
Public Access

Eris: Coordination-Free Consistent Transactions Using In-Network Concurrency Control

Published:14 October 2017Publication History

ABSTRACT

Distributed storage systems aim to provide strong consistency and isolation guarantees on an architecture that is partitioned across multiple shards for scalability and replicated for fault tolerance. Traditionally, achieving all of these goals has required an expensive combination of atomic commitment and replication protocols -- introducing extensive coordination overhead. Our system, Eris, takes a different approach. It moves a core piece of concurrency control functionality, which we term multi-sequencing, into the datacenter network itself. This network primitive takes on the responsibility for consistently ordering transactions, and a new lightweight transaction protocol ensures atomicity.

The end result is that Eris avoids both replication and transaction coordination overhead: we show that it can process a large class of distributed transactions in a single round-trip from the client to the storage system without any explicit coordination between shards or replicas in the normal case. It provides atomicity, consistency, and fault tolerance with less than 10% overhead -- achieving throughput 3.6-35x higher and latency 72-80% lower than a conventional design on standard benchmarks.

Skip Supplemental Material Section

Supplemental Material

eris.mp4

mp4

2 GB

References

  1. A. Adya, R. Gruber, B. Liskov, and U. Maheshwari. Efficient optimistic concurrency control using loosely synchronized clocks. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, San Jose, CA, USA, June 1995. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. D. Agrawal, A. E. Abbadi, and K. Salem. A taxonomy of partitioned replicated cloud-based database systems. IEEE Data Engineering Bulletin, 38(1):4--9, Mar. 2015.Google ScholarGoogle Scholar
  3. J. Baker, C. Bond, J. C. Corbett, J. Furman, A. Khorlin, J. Larson, J.-M. Léon, Y. Li, A. Lloyd, and V. Yushprakh. Megastore: Providing scalable, highly available storage for interactive services. In Proceedings of the 5th Conference on Innovative Data Systems Research (CIDR '11), Asilomar, CA, USA, Jan. 2011. VLDB / ACM.Google ScholarGoogle Scholar
  4. M. Balakrishnan, D. Malkhi, J. Davis, V. Prabhakaran, M. Wei, and T. Wobber. CORFU: A distributed shared log. ACM Transactions on Computer Systems, 31(4), Dec. 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. M. Balakrishnan, D. Malkhi, T. Wobber, M. Wu, V. Prabhakaran, M. Wei, J. Davis, S. Rao, T. Zou, and A. Zuck. Tango: Distributed data structures over a shared log. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP '13), Farmington, PA, USA, Nov. 2013. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Barefoot Networks. Tofino. https://www.barefootnetworks.com/technology/.Google ScholarGoogle Scholar
  7. P. Bernstein and N. Goodman. Concurrency control in distributed database systems. ACM Computing Surveys, 13(2), June 1981. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. P. A. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency Control and Recovery In Database Systems. Addison-Wesley, Boston, MA, USA, Feb. 1987. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. P. A. Bernstein, C. W. Reid, and S. Das. Hyder - A transactional record manager for shared flash. In Proceedings of the 5th Conference on Innovative Data Systems Research (CIDR '11), Asilomar, CA, USA, Jan. 2011. VLDB / ACM.Google ScholarGoogle Scholar
  10. K. P. Birman and T. A. Joseph. Exploiting virtual synchrony in distributed systems. In Proceedings of the 11th ACM Symposium on Operating Systems Principles (SOSP '87), Austin, TX, USA, Oct. 1987. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. K. P. Birman and T. A. Joseph. Reliable communication in the presence of failures. ACM Transactions on Computer Systems, 5(1):47--76, Jan. 1987. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. P. Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKeown, J. Rexford, C. Schlesinger, D. Talayco, A. Vahdat, G. Varghese, and D. Walker. P4: Programming protocol-independent packet processors. ACM SIGCOMM Computer Communication Review, 44(3):87--95, July 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. P. Bosshart, G. Gibb, H.-S. Kim, G. Varghese, N. McKeown, M. Izzard, F. Mujica, and M. Horowitz. Forwarding metamorphosis: Fast programmable match-action processing in hardware for SDN. In Proceedings of ACM SIGCOMM 2013, Hong Kong, China, Aug. 2013. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. G. Brebner and W. Jiang. High-speed packet processing using reconfigurable computing. IEEE Micro, 34(1):8--18, Jan. 2014.Google ScholarGoogle ScholarCross RefCross Ref
  15. M. J. Carey and M. R. Stonebraker. The performance of concurrency control algorithms for database management systems. In Proceedings of the 10th International Conference on Very Large Data Bases (VLDB '84), Singapore, Aug. 1984. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI '06), Seattle, WA, USA, Nov. 2006. USENIX. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Y. Chen, X. Wei, J. Shi, R. Chen, and H. Chen. Fast and general distributed transactions using RDMA and HTM. In Proceedings of the 11th ACM SIGOPS EuroSys (EuroSys '16), London, United Kingdom, Apr. 2016. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC '10), Indianapolis, IN, USA, 2010. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, W. Hsieh, S. Kanthak, E. Kogan, H. Li, A. Lloyd, S. Melnik, D. Mwaura, D. Nagle, S. Quinlan, R. Rao, L. Rolig, Y. Saito, M. Szymaniak, C. Taylor, R. Wang, and D. Woodford. Spanner: Google's globally-distributed database. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI '12), Hollywood, CA, USA, Oct. 2012. USENIX. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. J. Cowling. Low-Overhead Distributed Transaction Coordination. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, June 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. J. Cowling and B. Liskov. Granola: Low-overhead distributed transaction coordination. In Proceedings of the 2012 USENIX Annual Technical Conference, Boston, MA, USA, June 2012. USENIX. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. H. T. Dang, D. Sciascia, M. Canini, F. Pedone, and R. Soulé. Netpaxos: Consensus at network speed. In Proceedings of the 1st ACM SIGCOMM Symposium on Software Defined Networking Research (SOSR '15), pages 5:1--5:7, New York, NY, USA, 2015. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. J. Dean and L. A. Barosso. The tail at scale. Communications of the ACM, 56(2):74--80, Feb. 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. A. Dey, A. Fekete, R. Nambiar, and U. Rohm. YCSB+T: Benchmarking web-scale transactional databases. In Proceedings of the 30th IEEE International Conference on Data Engineering Workshops (ICDEW '14), Chicago, IL, USA, 2014. IEEE.Google ScholarGoogle ScholarCross RefCross Ref
  25. A. Dragojević, D. Narayanan, O. Hodson, and M. Castro. FaRM: Fast remote memory. In Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI '14), Seattle, WA, USA, Apr. 2014. USENIX. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. A. Dragojević, D. Narayanan, E. B. Nightingale, M. Renzelmann, A. Shamis, A. Badam, and M. Castro. No compromises: Distributed transactions with consistency, availability, and performance. In Proceedings of the 25th ACM Symposium on Operating Systems Principles (SOSP '15), Monterey, CA, USA, Oct. 2015. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. J. Du, S. Elnikety, and W. Zwaenepoel. Clock-SI: Snapshot isolation for partitioned data stores using loosely synchronized clocks. In Proceedings of the 32nd IEEE Symposium on Reliable Distributed Systems (SRDS '13), Braga, Portugal, Oct. 2013. IEEE. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. D. G. Ferro and M. Yabandeh. A critique of snapshot isolation. In Proceedings of the 7th ACM SIGOPS EuroSys (EuroSys '12), Bern, Switzerland, Apr. 2012. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. L. Glendenning, I. Beschastnikh, A. Krishnamurthy, and T. Anderson. Scalable consistency in scatter. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP '11), Cascais, Portugal, Oct. 2011. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. S. Harizopoulos, D. J. Abadi, S. Madden, and M. Stonebraker. OLTP through the looking glass, and what we found there. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, Vancouver, BC, Canada, June 2008. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. M. P. Herlihy and J. M. Wing. Linearizabiliy: A correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems, 12(3):463--492, July 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. S. Jain, A. Kumar, S. Mandal, J. Ong, L. Poutievski, A. Singh, S. Venkata, J. Wanderer, J. Zhou, M. Zhu, J. Zolla, U. Hölzle, S. Stuart, and A. Vahdat. B4: Experience with a globally-deployed software defined WAN. In Proceedings of ACM SIGCOMM 2013, Hong Kong, China, Aug. 2013. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. E. P. C. Jones, D. J. Abadi, and S. Madden. Low overhead concurrency control for partitioned main memory databases. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, Indianapolis, IN, USA, June 2010. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. M. F. Kaashoek and A. S. Tanenbaum. Group communication in the Amoeba distributed operating system. In Proceedings of the 11th IEEE International Conference on Distributed Computing Systems (ICDCS '91), Arlington, TX, USA, 1991. IEEE.Google ScholarGoogle ScholarCross RefCross Ref
  35. A. Kalia, M. Kaminsky, and D. G. Andersen. Design guidelines for high performance RDMA systems. In 2016 USENIX Annual Technical Conference (USENIX ATC '16), Denver, CO, July 2016. USENIX. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. R. Kallman, H. Kimura, J. Natkins, A. Pavlo, A. Rasin, S. Zdonik, E. P. C. Jones, S. Madden, M. Stonebraker, Y. Zhang, J. Hugg, and D. J. Abadi. H-Store: a high-performance, distributed main memory transaction processing system. Proceedings of the VLDB Endowment, 1(2):1496--1499, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. T. Koponen, M. Casado, N. Gude, J. Stribling, L. Poutievski, M. Zhu, R. Ramanathan, Y. Iwata, H. Inoue, T. Hama, and S. Shenker. Onix: A distributed control platform for large-scale production networks. In Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI '10), Vancouver, BC, Canada, Oct. 2010. USENIX. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. T. Kraska, G. Pang, M. J. Franklin, S. Madden, and A. Fekete. MDCC: multi-data center consistency. In Proceedings of the 8th ACM SIGOPS EuroSys (EuroSys '13), Prague, Czech Republic, Apr. 2013. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. L. Lamport. The part-time parliament. ACM Transactions on Computer Systems, 16(2):133--169, May 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. L. Lamport. Paxos made simple. ACM SIGACT News, 32(4):18--25, Dec. 2001.Google ScholarGoogle Scholar
  41. L. Lamport. Fast Paxos. Distributed Computing, 19(2):79--103, Oct. 2006.Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. J. Li, E. Michael, and D. R. K. Ports. Eris: Coordination-free consistent transactions using in-network concurrency control {extended version}. Technical Report UW-CSE-17-10-01, University of Washington CSE, Seattle, WA, USA, Oct. 2017.Google ScholarGoogle Scholar
  43. J. Li, E. Michael, A. Szekeres, N. K. Sharma, and D. R. K. Ports. Just say NO to Paxos overhead: Replacing consensus with network ordering. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI '16), Savannah, GA, USA, Nov. 2016. USENIX. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. B. Liskov. Practical uses of synchronized clocks in distributed systems. In Proceedings of the 10th ACM Symposium on Principles of Distributed Computing (PODC '91), Montreal, QC, Canada, Aug. 1991. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. B. Liskov and J. Cowling. Viewstamped replication revisited. Technical Report MIT-CSAIL-TR-2012-021, MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA, July 2012.Google ScholarGoogle Scholar
  46. V. Liu, D. Halperin, A. Krishnamurthy, and T. Anderson. F10: A fault-tolerant engineered network. In Proceedings of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI '13), Lombard, IL, USA, Apr. 2013. USENIX. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. H. Mahmoud, F. Nawab, A. Pucher, D. Agrawal, and A. El Abbadi. Low-latency multi-datacenter databases using replicated commit. Proceedings of the VLDB Endowment, 6(9):661--672, July 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. S. Mu, L. Nelson, W. Lloyd, and J. Li. Consolidating concurrency control and consensus for commits under conflicts. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI '16), Savannah, GA, USA, Nov. 2016. USENIX. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. R. Nishtala, H. Fugal, S. Grimm, M. Kwiatkowski, H. Lee, H. C. Li, R. McElroy, M. Paleczny, D. Peek, P. Saab, D. Stafford, T. Tung, and V. Venkataramani. Scaling memcache at Facebook. In Proceedings of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI '13), Lombard, IL, USA, Apr. 2013. USENIX. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. NOX network control platform. The POX SDN controller. https://github.com/noxrepo/pox.Google ScholarGoogle Scholar
  51. B. M. Oki and B. H. Liskov. Viewstamped replication: A new primary copy method to support highly-available distributed systems. In Proceedings of the 7th ACM Symposium on Principles of Distributed Computing (PODC '88), Toronto, Ontario, Canada, Aug. 1988. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. R. Ozdag. Intel® Ethernet switch FM6000 series-software defined networking.Google ScholarGoogle Scholar
  53. D. Peng and F. Dabek. Large-scale incremental processing using distributed transactions and notifications. In Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI '10), Vancouver, BC, Canada, Oct. 2010. USENIX. Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. D. R. K. Ports, J. Li, V. Liu, N. K. Sharma, and A. Krishnamurthy. Designing distributed systems using approximate synchrony in data-center networks. In Proceedings of the 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI '15), Oakland, CA, USA, May 2015. USENIX. Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. D. P. Reed. Naming and synchronization in a decentralized computer system. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, Sept. 1978.Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, and P. Helland. The end of an architectural era: (it's time for a complete rewrite). In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB '07), Vienna, Austria, Sept. 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. R. H. Thomas. A majority consensus approach to concurrency control for multiple copy databases. ACM Transactions on Database Systems, 4(2):180--209, June 1979. Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. A. Thomson and D. J. Abadi. The case for determinism in database systems. Proceedings of the VLDB Endowment, 3(10), 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. A. Thomson, T. Diamond, S.-C. Weng, K. Ren, P. Shao, and D. J. Abadi. Calvin: Fast distributed transactions for partitioned database systems. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, Scottsdale, AZ, USA, May 2012. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  60. A. Thomson, T. Diamond, S.-C. Weng, K. Ren, P. Shao, and D. J. Abadi. Fast distributed transactions and strongly consistent replication for OLTP database systems. ACM Transactions on Database Systems, 39(2), May 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  61. Transaction Processing Performance Council. TPC Benchmark C. http//:www.tpc.org/tpcc/, Feb. 2010.Google ScholarGoogle Scholar
  62. M. Wei, A. Tai, C. J. Rossbach, I. Abraham, M. Munshed, M. Dhawan, J. Stabile, U. Wieder, S. Fritchie, S. Swanson, M. J. Freedman, and D. Malkhi. vCorfu: A cloud-scale object store on a shared log. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI '17), Boston, MA, Mar. 2017. USENIX. Google ScholarGoogle ScholarDigital LibraryDigital Library
  63. X. Wei, J. Shi, Y. Chen, R. Chen, and H. Chen. Fast in-memory transaction processing using RDMA and HTM. In Proceedings of the 25th ACM Symposium on Operating Systems Principles (SOSP '15), Monterey, CA, USA, Oct. 2015. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  64. S. Wolf, H. Mühe, A. Kemper, and T. Neumann. An evaluation of strict timestamp ordering concurrency control for main-memory database systems. In IMDM Workshop, 2013.Google ScholarGoogle Scholar
  65. XPliant Ethernet switch product family. www.cavium.com/XPliant-Ethernet-Switch-Product-Family.html.Google ScholarGoogle Scholar
  66. I. Zhang, N. K. Sharma, A. Szekeres, A. Krishnamurthy, and D. R. K. Ports. Building consistent transactions with inconsistent replication. In Proceedings of the 25th ACM Symposium on Operating Systems Principles (SOSP '15), Monterey, CA, USA, Oct. 2015. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  67. I. Zhang, N. K. Sharma, A. Szekeres, A. Krishnamurthy, and D. R. K. Ports. When is operation ordering required in replicated transactional storage? IEEE Data Engineering Bulletin, 39(1):27--38, Mar. 2016.Google ScholarGoogle Scholar

Index Terms

  1. Eris: Coordination-Free Consistent Transactions Using In-Network Concurrency Control

            Recommendations

            Comments

            Login options

            Check if you have access through your login credentials or your institution to get full access on this article.

            Sign in
            • Published in

              cover image ACM Conferences
              SOSP '17: Proceedings of the 26th Symposium on Operating Systems Principles
              October 2017
              677 pages
              ISBN:9781450350853
              DOI:10.1145/3132747

              Copyright © 2017 ACM

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 14 October 2017

              Permissions

              Request permissions about this article.

              Request Permissions

              Check for updates

              Qualifiers

              • research-article
              • Research
              • Refereed limited

              Acceptance Rates

              Overall Acceptance Rate131of716submissions,18%

              Upcoming Conference

              SOSP '24

            PDF Format

            View or Download as a PDF file.

            PDF

            eReader

            View online with eReader.

            eReader