
Fast and Portable Locking for Multicore Architectures

Published: 04 January 2016

Abstract

The scalability of multithreaded applications on current multicore systems is hampered by the performance of lock algorithms, due to the costs of access contention and cache misses. The main contribution presented in this article is a new locking technique, Remote Core Locking (RCL), that aims to accelerate the execution of critical sections in legacy applications on multicore architectures. The idea of RCL is to replace lock acquisitions by optimized remote procedure calls to a dedicated server hardware thread. RCL limits the performance collapse observed with other lock algorithms when many threads try to acquire a lock concurrently and removes the need to transfer lock-protected shared data to the hardware thread acquiring the lock, because such data can typically remain in the server’s cache. Other contributions presented in this article include a profiler that identifies the locks that are the bottlenecks in multithreaded applications and that can thus benefit from RCL, and a reengineering tool that transforms POSIX lock acquisitions into RCL locks.
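The delegation idea described above can be illustrated with a minimal sketch in C. It is not the authors' implementation: the names `rcl_server_t`, `request_t`, `rcl_execute`, and `run_demo`, the slot layout, and the 64-byte cache-line size are all assumptions made for illustration. Each client publishes a function pointer in its own request slot, and a single server thread polls the slots and runs the critical sections, so the shared counter is only ever touched by the server.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_CLIENTS 8                 /* hypothetical fixed number of slots */

typedef void *(*cs_fn)(void *);       /* a critical section packaged as a function */

typedef struct {
    _Alignas(64) atomic_int pending;  /* 1 while a request awaits the server */
    cs_fn fn;
    void *arg;
    void *ret;
} request_t;                          /* one slot per client; 64-byte line assumed */

typedef struct {
    request_t slots[MAX_CLIENTS];
    atomic_int stop;
    pthread_t thread;
} rcl_server_t;

/* Server: scan the slots and execute pending critical sections one at a time. */
static void *server_loop(void *p) {
    rcl_server_t *s = p;
    while (!atomic_load_explicit(&s->stop, memory_order_acquire)) {
        int served = 0;
        for (int i = 0; i < MAX_CLIENTS; i++) {
            request_t *r = &s->slots[i];
            if (atomic_load_explicit(&r->pending, memory_order_acquire)) {
                r->ret = r->fn(r->arg);
                atomic_store_explicit(&r->pending, 0, memory_order_release);
                served = 1;
            }
        }
        if (!served) sched_yield();   /* demo only: be polite on small machines */
    }
    return NULL;
}

/* Client side. Where lock-based code would write
 *     pthread_mutex_lock(&m); counter++; pthread_mutex_unlock(&m);
 * the critical section is handed to the server instead. */
static void *rcl_execute(rcl_server_t *s, int id, cs_fn fn, void *arg) {
    request_t *r = &s->slots[id];
    r->fn = fn;
    r->arg = arg;
    atomic_store_explicit(&r->pending, 1, memory_order_release);
    while (atomic_load_explicit(&r->pending, memory_order_acquire))
        sched_yield();                /* wait until the server clears the slot */
    return r->ret;
}

/* The critical section: runs only on the server thread. */
static void *increment_cs(void *arg) {
    long *counter = arg;
    (*counter)++;
    return NULL;
}

typedef struct { rcl_server_t *server; int id; int iters; long *counter; } client_arg_t;

static void *client_main(void *p) {
    client_arg_t *c = p;
    for (int i = 0; i < c->iters; i++)
        rcl_execute(c->server, c->id, increment_cs, c->counter);
    return NULL;
}

/* Start one server and nclients clients; return the final counter, or -1 on error. */
long run_demo(int nclients, int iters) {
    rcl_server_t server;
    pthread_t clients[MAX_CLIENTS];
    client_arg_t args[MAX_CLIENTS];
    long counter = 0;

    assert(nclients >= 1 && nclients <= MAX_CLIENTS);
    atomic_init(&server.stop, 0);
    for (int i = 0; i < MAX_CLIENTS; i++)
        atomic_init(&server.slots[i].pending, 0);
    if (pthread_create(&server.thread, NULL, server_loop, &server) != 0)
        return -1;
    for (int i = 0; i < nclients; i++) {
        args[i] = (client_arg_t){ &server, i, iters, &counter };
        if (pthread_create(&clients[i], NULL, client_main, &args[i]) != 0)
            return -1;
    }
    for (int i = 0; i < nclients; i++)
        pthread_join(clients[i], NULL);
    atomic_store_explicit(&server.stop, 1, memory_order_release);
    pthread_join(server.thread, NULL);
    return counter;
}
```

Calling `run_demo(4, 1000)` starts four client threads that each delegate 1000 increments. The sketch yields the processor while waiting so that it stays responsive on small machines; a server pinned to a dedicated hardware thread, as the abstract describes, would presumably spin instead.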

Eighteen applications were used to evaluate RCL: the nine applications of the SPLASH-2 benchmark suite, the seven applications of the Phoenix 2 benchmark suite, Memcached, and Berkeley DB with a TPC-C client. Eight of these applications are unable to scale because of locks and benefit from RCL on an x86 machine with four AMD Opteron processors and 48 hardware threads. By using RCL instead of Linux POSIX locks, performance is improved by up to 2.5 times on Memcached, and up to 11.6 times on Berkeley DB with the TPC-C client. On a SPARC machine with two Sun UltraSPARC T2+ processors and 128 hardware threads, three applications benefit from RCL. In particular, performance is improved by up to 1.3 times with respect to Solaris POSIX locks on Memcached, and up to 7.9 times on Berkeley DB with the TPC-C client.

  • Published in

    ACM Transactions on Computer Systems  Volume 33, Issue 4
    January 2016
    125 pages
    ISSN:0734-2071
    EISSN:1557-7333
    DOI:10.1145/2841315
    Copyright © 2016 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    • Published: 4 January 2016
    • Accepted: 1 November 2015
    • Revised: 1 September 2015
    • Received: 1 February 2015
    Qualifiers

    • research-article
    • Research
    • Refereed
