Abstract
As ever more computation shifts onto multicore architectures, it is increasingly critical to find effective ways of dealing with multithreaded performance bugs like true and false sharing. Previous approaches to fixing false sharing in unmanaged languages have employed highly invasive runtime program modifications. We observe that managed language runtimes, with garbage collection and JIT compilation, present unique opportunities to repair such bugs directly, mirroring the techniques used in manual repairs. We present Remix, a modified version of the Oracle HotSpot JVM that can detect cache contention bugs and repair false sharing at runtime. Remix's detection mechanism leverages recent performance counter improvements on Intel platforms, which allow for precise, unobtrusive monitoring of cache contention at the hardware level. Remix can detect and repair known false sharing issues in the LMAX Disruptor high-performance inter-thread messaging library and the Spring Reactor event-processing framework, automatically providing 1.5-2x speedups over unoptimized code and matching the performance of hand-optimized code. Remix also finds a new false sharing bug in SPECjvm2008, and uncovers a true sharing bug in the HotSpot JVM that, when fixed, improves the performance of three NAS Parallel Benchmarks by 7-25x. Remix incurs no statistically significant performance overhead on other benchmarks that do not exhibit cache contention, making Remix practical for always-on use.
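To make the terms concrete, the following is a minimal Java sketch of false sharing and of the kind of manual padding repair the abstract alludes to. It is not Remix's mechanism, and every class and method name in it is hypothetical. It assumes a 64-byte cache line and a JVM that lays out superclass fields before subclass fields, which is typical of HotSpot but not guaranteed by the language.

```java
// Illustrative sketch only: names are hypothetical and not taken from Remix.
final class FalseSharingSketch {

    /** Two hot fields declared side by side; they will likely share one cache line. */
    static final class Unpadded {
        volatile long a;
        volatile long b;
    }

    /** Manual repair: separate the two hot fields with padding via a class hierarchy. */
    static class HotA {
        volatile long a;
    }

    static class Padding extends HotA {
        // 8 longs = 64 bytes of padding, assuming a 64-byte cache line.
        long p1, p2, p3, p4, p5, p6, p7, p8;
    }

    static final class Padded extends Padding {
        volatile long b;
    }

    /** Runs both tasks on their own threads and returns the elapsed wall-clock time. */
    static long timeNanos(Runnable first, Runnable second) {
        Thread t1 = new Thread(first);
        Thread t2 = new Thread(second);
        long start = System.nanoTime();
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return System.nanoTime() - start;
    }

    /** Each thread writes only its own field, yet both fields may share a cache line. */
    static long[] runUnpadded(long iterations) {
        Unpadded u = new Unpadded();
        long ns = timeNanos(
            () -> { for (long i = 0; i < iterations; i++) u.a++; },
            () -> { for (long i = 0; i < iterations; i++) u.b++; });
        return new long[] { u.a, u.b, ns };
    }

    /** Same workload, but the padding should place the fields on different cache lines. */
    static long[] runPadded(long iterations) {
        Padded p = new Padded();
        long ns = timeNanos(
            () -> { for (long i = 0; i < iterations; i++) p.a++; },
            () -> { for (long i = 0; i < iterations; i++) p.b++; });
        return new long[] { p.a, p.b, ns };
    }

    public static void main(String[] args) {
        long n = 20_000_000L;
        long[] u = runUnpadded(n);
        long[] p = runPadded(n);
        System.out.printf("unpadded: %d ms%n", u[2] / 1_000_000);
        System.out.printf("padded:   %d ms%n", p[2] / 1_000_000);
    }
}
```

Both variants compute the same counts, and only the field layout differs. On a multicore machine the padded variant often runs faster, but timings depend on the hardware, the JVM version, and thread placement, so the sketch prints them rather than asserting a speedup.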
Remix: online detection and repair of cache contention for the JVM
PLDI '16: Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation