ABSTRACT
We revisit the question of delegation vs. synchronized access to shared memory, and show through analysis and demonstration that delegation can be much faster than locking under a range of common circumstances. Starting from first principles, we propose fast, fly-weight delegation (ffwd). The highly optimized design of ffwd allows it to significantly outperform prior work on delegation, while retaining the scalability advantage.
In experiments with 6 benchmark applications, and 6 shared data structures, running on four different multi-socket systems with up to 128 hardware threads, we compare ffwd to a selection of lock, combining, lock-free, software transactional memory and delegation designs. Overall, we find that ffwd often offers a simple and highly competitive alternative to existing work. By definition, the performance of a fully delegated data structure is limited by the single-thread throughput of said data structure. However, due to cache effects, many data structures offer their best performance when confined to a single thread. With an efficient delegation mechanism, we approach this single-threaded performance in a multi-threaded setting. In application-level benchmarks, we see improvements up to 100% over the next best solution tested (RCL), and multiple micro-benchmarks show improvements in the 5-10x range.
Supplemental Material
- Intel® memory latency chekcer. URL https://software.intel.com/en-us/articles/intelr-memory-latency-checker.Google Scholar
- Fastforward @ github. URL http://bitslab.github.com/bitslab/fastforward.Google Scholar
- Hoard. URL https://www.hoard.org.Google Scholar
- Jemalloc. URL http://jemalloc.net.Google Scholar
- Memcached,. URL https://memcached.org.Google Scholar
- Memslap,. URL http://docs.libmemcached.org/bin/memaslap.html.Google Scholar
- Slab. URL https://github.com/bbu/userland-slab-allocator.Google Scholar
- Splash-2. URL http//:www.capsl.udel.edu/splash.Google Scholar
- J. L. Abell, J. Fern, M. E. Acacio, et al. Glocks: Efficient support for highly-contended locks in many-core cmps. In Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International, pages 893--905. IEEE, 2011. Google ScholarDigital Library
- K. Agrawal, J. T. Fineman, K. Lu, B. Sheridan, J. Sukha, and R. Utterback. Provably good scheduling for parallel programs that use data structures through implicit batching. In Proceedings of the 26th ACM symposium on Parallelism in algorithms and architectures, pages 84--95. ACM, 2014. Google ScholarDigital Library
- S. Al Bahra. Nonblocking algorithms and scalable multicore programming. Queue, 11:40, 2013. Google ScholarDigital Library
- T. E. Anderson. The performance of spin lock alternatives for shared-money multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 1:6--16, 1990. Google ScholarDigital Library
- M. Arbel and H. Attiya. Concurrent updates with rcu: search tree as an example. In Proceedings of the 2014 ACM symposium on Principles of distributed computing, pages 196--205. ACM, 2014. Google ScholarDigital Library
- A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schüpbach, and A. Singhania. The Multikernel: A New OS Architecture for Scalable Multicore Systems. In Proceedings of the ACM SIGOPS 22Nd Symposium on Operating Systems Principles, SOSP '09, pages 29--44, New York, NY, USA, 2009. ACM. Google ScholarDigital Library
- L. Boguslavsky, K. Harzallah, A. Kreinen, K. Sevcik, and A. Vainshtein. Optimal strategies for spinning and blocking. Journal of Parallel and Distributed Computing, 21:246--254, 1994. Google ScholarDigital Library
- G. S. Brodal, J. L. Träff, and C. D. Zaroliagis. A parallel priority queue with constant time operations. Journal of Parallel and Distributed Computing, 49:4--21, 1998. Google ScholarDigital Library
- I. Calciu, D. Dice, T. Harris, M. Herlihy, A. Kogan, V. Marathe, and M. Moir. Message passing or shared memory: Evaluating the delegation abstraction for multicores. In International Conference on Principles of Distributed Systems, pages 83--97. Springer, 2013. Google ScholarDigital Library
- M. Chabbi and J. Mellor-Crummey. Contention-conscious, locality-preserving locks. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, page 22. ACM, 2016. Google ScholarDigital Library
- M. Chabbi, M. Fagan, and J. Mellor-Crummey. High performance locks for multi-level numa systems. In ACM SIGPLAN Notices, volume 50, pages 215--226. ACM, 2015. Google ScholarDigital Library
- A. T. Clements, M. F. Kaashoek, and N. Zeldovich. Scalable address spaces using rcu balanced trees. ACM SIGPLAN Notices, 47:199--210, 2012. Google ScholarDigital Library
- A. T. Clements, M. F. Kaashoek, and N. Zeldovich. Scalable address spaces using rcu balanced trees. ACM SIGPLAN Notices, 47:199--210, 2012. Google ScholarDigital Library
- P.-J. Courtois, F. Heymans, and D. L. Parnas. Concurrent control with readers and writers. Communications of the ACM, 14:667--668, 1971. Google ScholarDigital Library
- T. Craig. Building fifo and priorityqueuing spin locks from atomic swap. Technical report, Technical Report TR 93-02-02, University of Washington, 02 1993.(ftp tr/1993/02/UW-CSE-93-02-02. PS. Z from cs. washington. edu), 1993.Google Scholar
- A. Crauser, K. Mehlhorn, U. Meyer, and P. Sanders. A parallelization of dijkstra's shortest path algorithm. Mathematical Foundations of Computer Science 1998, pages 722--731, 1998. Google ScholarDigital Library
- T. David, R. Guerraoui, and V. Trigonakis. Everything you always wanted to know about synchronization but were afraid to ask. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 33--48. ACM, 2013. Google ScholarDigital Library
- D. Dice, V. J. Marathe, and N. Shavit. Flat-combining numa locks. In Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures, pages 65--74. ACM, 2011. Google ScholarDigital Library
- D. Dice, V. J. Marathe, and N. Shavit. Lock cohorting: a general technique for designing numa locks. In ACM SIGPLAN Notices, volume 47, pages 247--256. ACM, 2012. Google ScholarDigital Library
- A. Dragojević, R. Guerraoui, and M. Kapalka. Stretching transactional memory. In ACM sigplan notices, volume 44, pages 155--165. ACM, 2009. Google ScholarDigital Library
- J. R. Driscoll, H. N. Gabow, R. Shrairman, and R. E. Tarjan. Relaxed heaps: An alternative to fibonacci heaps with applications to parallel computation. Communications of the ACM, 31:1343--1354, 1988. Google ScholarDigital Library
- R. Ennals. Software transactional memory should not be obstruction-free. Technical report, Technical Report IRC-TR-06-052, Intel Research Cambridge Tech Report, 2006.Google Scholar
- P. Fatourou and N. D. Kallimanis. A highly-efficient wait-free universal construction. In Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures, pages 325--334. ACM, 2011. Google ScholarDigital Library
- P. Fatourou and N. D. Kallimanis. Revisiting the combining synchronization technique. In ACM SIGPLAN Notices, volume 47, pages 257--266. ACM, 2012. Google ScholarDigital Library
- P. Felber, C. Fetzer, and T. Riegel. Dynamic performance tuning of word-based software transactional memory. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming, pages 237--246. ACM, 2008. Google ScholarDigital Library
- K. Fraser. Practical lock-freedom. Technical report, University of Cambridge, Computer Laboratory, 2004.Google Scholar
- K. Fraser and T. Harris. Concurrent programming without locks. ACM Transactions on Computer Systems (TOCS), 25: 5, 2007. Google ScholarDigital Library
- B. Gamsa, O. Krieger, J. Appavoo, and M. Stumm. Tornado: Maximizing locality and concurrency in a shared memory multiprocessor operating system. In OSDI, volume 99, pages 87--100, 1999. Google ScholarDigital Library
- T. Harris. A pragmatic implementation of non-blocking linked-lists. Distributed Computing, pages 300--314, 2001. Google ScholarCross Ref
- T. E. Hart, P. E. McKenney, A. D. Brown, and J. Walpole. Performance of memory reclamation for lockless synchronization. Journal of Parallel and Distributed Computing, 67: 1270--1285, 2007. Google ScholarDigital Library
- S. Heller, M. Herlihy, V. Luchangco, M. Moir, W. N. Scherer III, and N. Shavit. A lazy concurrent list-based set algorithm. In International Conference On Principles Of Distributed Systems, pages 3--16. Springer, 2005. Google ScholarDigital Library
- D. Hendler, I. Incze, N. Shavit, and M. Tzafrir. Flat combining and the synchronization-parallelism tradeoff. In Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures, pages 355--364. ACM, 2010. Google ScholarDigital Library
- M. Herlihy. Wait-free synchronization. ACM Transactions on Programming Languages and Systems (TOPLAS), 13:124--149, 1991. Google ScholarDigital Library
- M. Herlihy. A methodology for implementing highly concurrent data objects. ACM Transactions on Programming Languages and Systems (TOPLAS), 15:745--770, 1993. Google ScholarDigital Library
- M. Herlihy and N. Shavit. The art of multiprocessor programming. Morgan Kaufmann, 2011. Google ScholarDigital Library
- M. Herlihy, V. Luchangco, and M. Moir. Obstruction-free synchronization: Double-ended queues as an example. In Distributed Computing Systems, 2003. Proceedings. 23rd International Conference on, pages 522--529. IEEE, 2003. Google ScholarDigital Library
- M. Herlihy, V. Luchangco, M. Moir, and W. N. Scherer III. Software transactional memory for dynamic-sized data structures. In Proceedings of the twenty-second annual symposium on Principles of distributed computing, pages 92--101. ACM, 2003. Google ScholarDigital Library
- M. Herlihy, Y. Lev, V. Luchangco, and N. Shavit. A provably correct scalable concurrent skip list. In Conference On Principles of Distributed Systems (OPODIS), 2006.Google Scholar
- P. W. Howard and J. Walpole. Relativistic red-black trees. Concurrency and Computation: Practice and Experience, 26: 2684--2712, 2014. Google ScholarDigital Library
- F. R. Johnson, R. Stoica, A. Ailamaki, and T. C. Mowry. Decoupling contention management from scheduling. ACM SIGARCH Computer Architecture News, 38:117--128, 2010. Google ScholarDigital Library
- D. Klaftenegger, K. Sagonas, and K. Winblad. Delegation locking libraries for improved performance of multithreaded programs. In European Conference on Parallel Processing, pages 572--583. Springer, 2014.Google ScholarCross Ref
- L. Lamport. Specifying concurrent program modules. ACM Transactions on Programming Languages and Systems (TOPLAS), 5:190--222, 1983. Google ScholarDigital Library
- S. Lists. A probabilistic alternative to balanced trees, william pugh. Communications. of the ACM, 33:668--676, 1990. Google ScholarDigital Library
- T. Liu and E. D. Berger. Sheriff: precise detection and automatic mitigation of false sharing. ACM Sigplan Notices, 46: 3--18, 2011. Google ScholarDigital Library
- J.-P. Lozi, F. David, G. Thomas, J. L. Lawall, G. Muller, et al. Remote core locking: Migrating critical-section execution to improve the performance of multithreaded applications. In USENIX Annual Technical Conference, pages 65--76, 2012. Google ScholarDigital Library
- V. Luchangco, D. Nussbaum, and N. Shavit. A hierarchical clh queue lock. Euro-Par 2006 Parallel Processing, pages 801--810, 2006. Google ScholarDigital Library
- P. Magnusson, A. Landin, and E. Hagersten. Queue locks on cache coherent multiprocessors. In Parallel Processing Symposium, 1994. Proceedings., Eighth International, pages 165--171. IEEE, 1994. Google ScholarDigital Library
- V. J. Marathe, W. N. Scherer, and M. L. Scott. Adaptive software transactional memory. In International Symposium on Distributed Computing, pages 354--368. Springer, 2005. Google ScholarDigital Library
- A. Matveev, N. Shavit, P. Felber, and P. Marlier. Read-log-update: A lightweight synchronization mechanism for concurrent programming. In Proceedings of the 25th Symposium on Operating Systems Principles, pages 168--183. ACM, 2015. Google ScholarDigital Library
- P. E. McKenney and B. Kingsbury. Reader-writer lock for multiprocessor systems, Nov. 23 2004. US Patent 6,823,511.Google Scholar
- J. M. Mellor-Crummey and M. L. Scott. Synchronization without contention. ACM SIGPLAN Notices, 26:269--278, 1991. Google ScholarDigital Library
- J. M. Mellor-Crummey and M. L. Scott. Scalable reader-writer synchronization for shared-memory multiprocessors. In ACM SIGPLAN Notices, volume 26, pages 106--113. ACM, 1991. Google ScholarDigital Library
- J. M. Mellor-Crummey and M. L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Transactions on Computer Systems (TOCS), 9:21--65, 1991. Google ScholarDigital Library
- M. M. Michael. High performance dynamic lock-free hash tables and list-based sets. In Proceedings of the fourteenth annual ACM symposium on Parallel algorithms and architectures, pages 73--82. ACM, 2002. Google ScholarDigital Library
- M. M. Michael and M. L. Scott. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In Proceedings of the fifteenth annual ACM symposium on Principles of distributed computing, pages 267--275. ACM, 1996. Google ScholarDigital Library
- M. S. Moir. Hybrid software/hardware transactional memory, July 1 2008. US Patent 7,395,382.Google Scholar
- M. Nanavati, M. Spear, N. Taylor, S. Rajagopalan, D. T. Meyer, W. Aiello, and A. Warfield. Whose cache line is it anyway?: operating system support for live detection and repair of false sharing. In Proceedings of the 8th ACM European Conference on Computer Systems, pages 141--154. ACM, 2013. Google ScholarDigital Library
- Y. Oyama, K. Taura, and A. Yonezawa. Executing parallel programs with synchronization bottlenecks efficiently. In Proceedings of the International Workshop on Parallel and Distributed Computing for Symbolic and Irregular Applications, volume 16. Citeseer, 1999.Google Scholar
- P. Paige. Parallel algorithms for shortest path problems. In Proc. the 1985 International Conference on Parallel Processing, 1985.Google Scholar
- D. Petrović, T. Ropars, and A. Schiper. On the performance of delegation over cache-coherent shared memory. In Proceedings of the 2015 International Conference on Distributed Computing and Networking, page 17. ACM, 2015. Google ScholarDigital Library
- C. Purcell and T. Harris. Non-blocking hashtables with open addressing. In International Symposium on Distributed Computing, pages 108--121. Springer, 2005. Google ScholarDigital Library
- Z. Radovic and E. Hagersten. Hierarchical backoff locks for nonuniform communication architectures. In High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings. The Ninth International Symposium on, pages 241--252. IEEE, 2003. Google ScholarDigital Library
- C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating mapreduce for multi-core and multiprocessor systems. In High Performance Computer Architecture, 2007. HPCA 2007. IEEE 13th International Symposium on, pages 13--24. Ieee, 2007. Google ScholarDigital Library
- T. Riegel, C. Fetzer, and P. Felber. Snapshot isolation for software transactional memory. In First ACM SIGPLAN Workshop on Languages, Compilers, and Hardware Support for Transactional Computing (TRANSACT'06), pages 1--10. Association for Computing Machinery (ACM), 2006.Google Scholar
- S. Roghanchi, J. Eriksson, and N. Basu. ffwd: delegation is (much) faster than you think. Technical report, University of Illinois At Chicago, 2017.Google Scholar
- L. Rudolph and Z. Segall. Dynamic decentralized cache schemes for MIMD parallel processors, volume 12. ACM, 1984. Google ScholarDigital Library
- B. Saha, A.-R. Adl-Tabatabai, R. L. Hudson, C. C. Minh, and B. Hertzberg. Mcrt-stm: a high performance software transactional memory system for a multi-core runtime. In Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 187--197. ACM, 2006. Google ScholarDigital Library
- P. Sanders. Randomized priority queues for fast parallel access. Journal of Parallel and Distributed Computing, 49:86--97, 1998. Google ScholarDigital Library
- M. L. Scott. Shared-memory synchronization. Synthesis Lectures on Computer Architecture, 8:1--221, 2013. Google ScholarDigital Library
- M. L. Scott and W. N. Scherer. Scalable queue-based spin locks with timeout. In ACM SIGPLAN Notices, volume 36, pages 44--52. ACM, 2001. Google ScholarDigital Library
- O. Shalev and N. Shavit. Predictive log-synchronization. In ACM SIGOPS Operating Systems Review, volume 40, pages 305--315. ACM, 2006. Google ScholarDigital Library
- N. Shavit and I. Lotan. Skiplist-based concurrent priority queues. In Parallel and Distributed Processing Symposium, 2000. IPDPS 2000. Proceedings. 14th International, pages 263--268. IEEE, 2000. Google ScholarDigital Library
- N. Shavit and D. Touitou. Software transactional memory. Distributed Computing, 10:99--116, 1997.Google ScholarCross Ref
- J. M. Stone. A simple and correct shared-queue algorithm using compare-and-swap. In Supercomputing'90., Proceedings of, pages 495--504. IEEE, 1990. Google ScholarDigital Library
- J. M. Stone. A non-blocking compare-and-swap algorithm for a shared circular queue. S. Tzafestas et al., editors, Parallel and Distributed Computing in Engineering Systems, pages 147--152, 1992.Google Scholar
- M. A. Suleman, O. Mutlu, M. K. Qureshi, and Y. N. Patt. Accelerating critical section execution with asymmetric multi-core architectures. In ACM SIGARCH Computer Architecture News, volume 37, pages 253--264. ACM, 2009. Google ScholarDigital Library
- H. Sundell and P. Tsigas. Fast and lock-free concurrent priority queues for multi-thread systems. In Parallel and Distributed Processing Symposium, 2003. Proceedings. International, pages 11--pp. IEEE, 2003. Google ScholarDigital Library
- H. Sundell and P. Tsigas. Lock-free and practical doubly linked list-based deques using single-word compare-and-swap. In OPODIS, pages 240--255. Springer, 2004. Google ScholarDigital Library
- J. Triplett, P. E. McKenney, and J. Walpole. Scalable concurrent hash tables via relativistic programming. ACM SIGOPS Operating Systems Review, 44:102--109, 2010. Google ScholarDigital Library
- J. Triplett, P. E. McKenney, and J. Walpole. Resizable, scalable, concurrent hash tables via relativistic programming. In USENIX Annual Technical Conference, page 11, 2011. Google ScholarDigital Library
- J. D. Valois. Implementing lock-free queues. In Proceedings of the seventh international conference on Parallel and Distributed Computing Systems, pages 64--69, 1994.Google Scholar
- J. D. Valois. Lock-free linked lists using compare-and-swap. In Proceedings of the fourteenth annual ACM symposium on Principles of distributed computing, pages 214--222. ACM, 1995. Google ScholarDigital Library
- J. D. Valois. Lock-free data structures. 1996.Google Scholar
- C. Wang, W.-Y. Chen, Y. Wu, B. Saha, and A.-R. Adl-Tabatabai. Code generation and optimization for transactional memory constructs in an unmanaged language. In Proceedings of the International Symposium on Code Generation and Optimization, pages 34--48. IEEE Computer Society, 2007. Google ScholarDigital Library
- A. Welc, S. Jagannathan, and A. Hosking. Transactional monitors for concurrent objects. ECOOP 2004--Object-Oriented Programming, pages 494--514, 2004.Google ScholarCross Ref
- P.-C. Yew, N.-F. Tzeng, et al. Distributing hot-spot addressing in large-scale multiprocessors. IEEE Transactions on Computers, 100:388--395, 1987. Google ScholarDigital Library
- Y. Zhan and D. E. Porter. Versioned programming: A simple technique for implementing efficient, lock-free, and compos-able data structures. In Proceedings of the 9th ACM International on Systems and Storage Conference, page 11. ACM, 2016. Google ScholarDigital Library
Recommendations
Evaluation of Rodinia Codes on Intel Xeon Phi
ISMS '13: Proceedings of the 2013 4th International Conference on Intelligent Systems, Modelling and SimulationHigh performance computing (HPC) is a niche area where various parallel benchmarks are constantly used to explore and evaluate the performance of Heterogeneous computing systems on the horizon. The Rodinia benchmark suite, a collection of parallel ...
On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing
SAAHPC '11: Proceedings of the 2011 Symposium on Application Accelerators in High-Performance ComputingThe graphics processing unit (GPU) has made significant strides as an accelerator in parallel computing. However, because the GPU has resided out on PCIe as a discrete device, the performance of GPU applications can be bottlenecked by data transfers ...
Vectorizing Unstructured Mesh Computations for Many-core Architectures
PMAM'14: Proceedings of Programming Models and Applications on Multicores and ManycoresAchieving optimal performance on the latest multi-core and many-core architectures depends more and more on making efficient use of the hardware's vector processing capabilities. While auto-vectorizing compilers do not require the use of vector ...
Comments