Abstract
Region formation is an important step in dynamic binary translation: it selects the hot code regions to translate and optimize. The quality of the formed regions determines how far optimization can go and therefore the final execution performance. Region formation also has a non-trivial cost of its own, so overall performance is highly sensitive to formation overhead. To address both region quality and formation overhead, this article presents a lightweight region formation method guided by processor tracing, such as Intel PT. We leverage the branch history information recorded by the processor to reconstruct the program execution profile and form high-quality regions at low cost. We also present the designs of lightweight hardware performance monitoring sampling and a branch instruction decode cache, both of which minimize region formation overhead. For ARM64 to x86-64 translation, the experimental results show that our method achieves a speedup of up to 1.53× (1.16× on average) on SPEC CPU2006 benchmarks with reference inputs, compared to the well-known software-based trace formation method Next Executing Tail (NET). For x86-64 to ARM64 translation, the results show a speedup of up to 1.25× over NET on CINT2006 benchmarks with reference inputs. A comparison with a relaxed NETPlus region formation method further shows that our method achieves the best performance and the lowest compilation overhead.
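The idea of turning branch history into an execution profile and then into a region can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes the hardware trace has already been decoded into hypothetical `(source_block, target_block)` pairs, and the names `build_edge_profile`, `form_region`, and `min_count` are invented for illustration. Real Intel PT output is a compressed packet stream that must be decoded against the program binary before it takes a form like this.

```python
from collections import Counter, defaultdict


def build_edge_profile(branch_records):
    """Count how often each (source, target) control-flow edge was taken.

    `branch_records` is an assumed, already-decoded list of block pairs.
    """
    return Counter(branch_records)


def form_region(edge_profile, head, min_count):
    """Collect the blocks reachable from `head` through hot edges only.

    An edge is treated as hot when it was seen at least `min_count` times;
    this threshold is an illustrative assumption.
    """
    successors = defaultdict(list)
    for (src, dst), count in edge_profile.items():
        if count >= min_count:
            successors[src].append(dst)

    region = []
    seen = {head}
    worklist = [head]
    while worklist:
        block = worklist.pop()
        region.append(block)
        for nxt in successors[block]:
            if nxt not in seen:
                seen.add(nxt)
                worklist.append(nxt)
    return region


# A loop between blocks A and B, with a rarely taken exit to C.
records = [("A", "B")] * 10 + [("B", "A")] * 9 + [("B", "C")]
profile = build_edge_profile(records)
region = form_region(profile, head="A", min_count=5)
```

In this toy input the region contains the loop blocks `A` and `B`, while the cold exit to `C` is left out because its edge falls below the threshold.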