Fast and Portable Locking for Multicore Architectures

Authors:
Jean-Pierre Lozi (Université Nice Sophia Antipolis, CNRS)
Florian David (Sorbonne Universités, Inria, CNRS, UPMC)
Gaël Thomas (SAMOVAR, CNRS, Télécom ParisSud, Université Paris-Saclay)
Julia Lawall (Sorbonne Universités, Inria, CNRS, UPMC)
Gilles Muller (Sorbonne Universités, Inria, CNRS, UPMC)

ACM Transactions on Computer Systems, Volume 33, Issue 4, Article No. 13, pp. 1–62. https://doi.org/10.1145/2845079

Published: 04 January 2016

Abstract

The scalability of multithreaded applications on current multicore systems is hampered by the performance of lock algorithms, due to the costs of access contention and cache misses. The main contribution presented in this article is a new locking technique, Remote Core Locking (RCL), that aims to accelerate the execution of critical sections in legacy applications on multicore architectures. The idea of RCL is to replace lock acquisitions with optimized remote procedure calls to a dedicated server hardware thread. RCL limits the performance collapse observed with other lock algorithms when many threads try to acquire a lock concurrently. It also removes the need to transfer lock-protected shared data to the hardware thread acquiring the lock, because such data can typically remain in the server’s cache. Other contributions presented in this article include a profiler that identifies the locks that are bottlenecks in multithreaded applications and can thus benefit from RCL, and a reengineering tool that transforms POSIX lock acquisitions into RCL locks.

Eighteen applications were used to evaluate RCL: the nine applications of the SPLASH-2 benchmark suite, the seven applications of the Phoenix 2 benchmark suite, Memcached, and Berkeley DB with a TPC-C client. Eight of these applications are unable to scale because of locks and benefit from RCL on an x86 machine with four AMD Opteron processors and 48 hardware threads. On that machine, using RCL instead of Linux POSIX locks improves performance by up to 2.5 times on Memcached, and up to 11.6 times on Berkeley DB with the TPC-C client. On a SPARC machine with two Sun UltraSPARC T2+ processors and 128 hardware threads, three applications benefit from RCL. In particular, performance is improved by up to 1.3 times with respect to Solaris POSIX locks on Memcached, and up to 7.9 times on Berkeley DB with the TPC-C client.
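
The delegation idea summarized in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the class name `RemoteCoreLockSketch`, its methods, and the `run_demo` helper are hypothetical names chosen for illustration. The sketch also uses a semaphore and events where a real implementation might busy-wait on per-client request slots, and Python cannot show the cache-locality benefit. What it does show is the control flow: clients post a critical section to their own slot, and a single server thread executes every critical section, so the shared data is only ever touched by that thread.

```python
import threading


class RemoteCoreLockSketch:
    """Toy model (hypothetical API): one server thread runs all critical sections."""

    def __init__(self, num_clients):
        # One request slot, result slot, and completion flag per client.
        self._slots = [None] * num_clients
        self._results = [None] * num_clients
        self._done = [threading.Event() for _ in range(num_clients)]
        self._pending = threading.Semaphore(0)  # one token per posted request
        self._stop = False
        self._server = threading.Thread(target=self._serve, daemon=True)
        self._server.start()

    def _serve(self):
        while True:
            self._pending.acquire()  # sleep until a client posts a request
            if self._stop:
                return
            # Scan every client's slot and run any pending critical section.
            for cid, request in enumerate(self._slots):
                if request is not None:
                    func, arg = request
                    self._slots[cid] = None
                    self._results[cid] = func(arg)  # executes on the server thread
                    self._done[cid].set()

    def execute(self, client_id, func, arg):
        """Stands in for lock(); critical section; unlock() on the client side."""
        self._done[client_id].clear()
        self._slots[client_id] = (func, arg)
        self._pending.release()
        self._done[client_id].wait()  # block until the server has run func
        return self._results[client_id]

    def shutdown(self):
        """Stop the server; call only after all clients have finished."""
        self._stop = True
        self._pending.release()
        self._server.join()


def run_demo(num_clients=4, increments=1000):
    counter = {"value": 0}  # shared data, modified only by the server thread

    def increment(step):
        counter["value"] += step
        return counter["value"]

    lock = RemoteCoreLockSketch(num_clients)

    def client(cid):
        for _ in range(increments):
            lock.execute(cid, increment, 1)

    threads = [threading.Thread(target=client, args=(cid,))
               for cid in range(num_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    lock.shutdown()
    return counter["value"]
```

Calling `run_demo()` returns `num_clients * increments`, since every increment runs sequentially on the server thread and none can be lost. Giving each client its own slot avoids contention between clients when posting requests, which is the property this sketch is meant to convey.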

Index Terms

1. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Process management
        Mutual exclusion

Published in
ACM Transactions on Computer Systems Volume 33, Issue 4
January 2016
125 pages
ISSN: 0734-2071
EISSN: 1557-7333
DOI: 10.1145/2841315
Editor: Todd C. Mowry, Carnegie Mellon University, Pittsburgh, PA
Copyright © 2016 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 4 January 2016
- Accepted: 1 November 2015
- Revised: 1 September 2015
- Received: 1 February 2015
Author Tags
Multicore
RPC
busy-waiting
locality
locks
memory contention
profiling
reengineering
synchronization
Qualifiers
- research-article
- Research
- Refereed