Abstract
The scalability of multithreaded applications on current multicore systems is hampered by the performance of lock algorithms, due to the costs of access contention and cache misses. The main contribution of this article is a new locking technique, Remote Core Locking (RCL), that aims to accelerate the execution of critical sections in legacy applications on multicore architectures. The idea of RCL is to replace lock acquisitions with optimized remote procedure calls to a dedicated server hardware thread. RCL limits the performance collapse observed with other lock algorithms when many threads try to acquire a lock concurrently. It also removes the need to transfer lock-protected shared data to the hardware thread acquiring the lock, because such data can typically remain in the server’s cache. Other contributions include a profiler that identifies the locks that are bottlenecks in multithreaded applications and can thus benefit from RCL, and a reengineering tool that transforms POSIX lock acquisitions into RCL locks.
Eighteen applications were used to evaluate RCL: the nine applications of the SPLASH-2 benchmark suite, the seven applications of the Phoenix 2 benchmark suite, Memcached, and Berkeley DB with a TPC-C client. On an x86 machine with four AMD Opteron processors and 48 hardware threads, eight of these applications are unable to scale because of locks and benefit from RCL. Using RCL instead of Linux POSIX locks improves performance by up to 2.5 times on Memcached and up to 11.6 times on Berkeley DB with the TPC-C client. On a SPARC machine with two Sun UltraSPARC T2+ processors and 128 hardware threads, three applications benefit from RCL. In particular, performance is improved by up to 1.3 times with respect to Solaris POSIX locks on Memcached, and up to 7.9 times on Berkeley DB with the TPC-C client.
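The delegation idea described above, in which a critical section is shipped to a server thread instead of being run under a lock by the calling thread, can be illustrated with a small sketch. This is not the RCL implementation: the abstract gives no implementation details, so the names `RCLServer`, `execute`, and `run_demo`, the per-client request slots, and the use of a semaphore and events in place of an optimized remote procedure call are all assumptions made for illustration. Python also cannot pin the server to a dedicated hardware thread, so the sketch shows only the control flow, not the cache or contention behavior.

```python
import threading


class RCLServer:
    """Toy server thread that runs critical sections on behalf of clients.

    Hypothetical illustration only: slot layout and wake-up mechanism
    are assumptions, not taken from the article.
    """

    def __init__(self, num_clients):
        # One request slot per client; None means "no pending request".
        self.slots = [None] * num_clients
        self.results = [None] * num_clients
        self.done = [threading.Event() for _ in range(num_clients)]
        # Counts posted requests so the server can sleep instead of spinning.
        self.wakeup = threading.Semaphore(0)
        self.running = True
        self.thread = threading.Thread(target=self._serve, daemon=True)
        self.thread.start()

    def _serve(self):
        while True:
            self.wakeup.acquire()
            if not self.running:
                return
            # Scan every slot and run each pending critical section.
            for i, request in enumerate(self.slots):
                if request is not None:
                    fn, ctx = request
                    self.results[i] = fn(ctx)
                    self.slots[i] = None
                    self.done[i].set()

    def execute(self, client_id, fn, ctx):
        """Replace lock(); fn(ctx); unlock() with a request to the server."""
        done = self.done[client_id]
        done.clear()
        self.slots[client_id] = (fn, ctx)
        self.wakeup.release()
        done.wait()
        return self.results[client_id]

    def stop(self):
        self.running = False
        self.wakeup.release()
        self.thread.join()


def run_demo(num_clients=4, iterations=1000):
    server = RCLServer(num_clients)
    shared = {"counter": 0}

    def increment(ctx):
        # Only the server thread ever runs this, so no lock is needed here.
        ctx["counter"] += 1
        return ctx["counter"]

    def client(client_id):
        for _ in range(iterations):
            server.execute(client_id, increment, shared)

    threads = [
        threading.Thread(target=client, args=(i,)) for i in range(num_clients)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    server.stop()
    return shared["counter"]


if __name__ == "__main__":
    print(run_demo())
```

Because every update to `shared` happens on the server thread, the clients never touch the protected data themselves, which mirrors the claim that lock-protected data can stay with the server rather than moving to whichever thread holds the lock.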
Index Terms
- Fast and Portable Locking for Multicore Architectures