Processor-Tracing Guided Region Formation in Dynamic Binary Translation

Authors:
Ding-Yong Hong, Institute of Information Science, Academia Sinica, Taiwan
Jan-Jan Wu, Institute of Information Science, Academia Sinica, Taiwan
Yu-Ping Liu, Department of Computer Science and Information Engineering, National Taiwan University, Taiwan
Sheng-Yu Fu, Department of Computer Science and Information Engineering, National Taiwan University, Taiwan
Wei-Chung Hsu, Department of Computer Science and Information Engineering, National Taiwan University, Taiwan

ACM Transactions on Architecture and Code Optimization, Volume 15, Issue 4, Article No.: 52, pp 1–25
https://doi.org/10.1145/3281664

Published: 16 November 2018

Abstract

Region formation is an important step in dynamic binary translation to select hot code regions for translation and optimization. The quality of the formed regions determines the extent of optimizations and thus determines the final execution performance. Moreover, the overall performance is very sensitive to the formation overhead, because region formation can have a non-trivial cost. To address the dual issues of region quality and region formation overhead, this article presents a lightweight region formation method guided by processor tracing, such as Intel PT. We leverage the branch history information stored in the processor to reconstruct the program execution profile and effectively form high-quality regions at low cost. Furthermore, we present the designs of lightweight hardware performance monitoring sampling and the branch instruction decode cache to minimize region formation overhead. Using ARM64 to x86-64 translations, experimental results show that our method achieves a performance speedup of up to 1.53× (1.16× on average) for SPEC CPU2006 benchmarks with reference inputs, compared to the well-known software-based trace formation method, Next Executing Tail (NET). The performance results of x86-64 to ARM64 translations also show a speedup of up to 1.25× over NET for CINT2006 benchmarks with reference inputs. The comparison with a relaxed NETPlus region formation method further demonstrates that our method achieves the best performance and lowest compilation overhead.
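
The idea of rebuilding an execution profile from branch history can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a hand-written control-flow graph and a flat list of taken/not-taken bits, whereas a real Intel PT decoder must also handle target packets for indirect branches and decode the guest binary to find branch instructions. The names `CFG`, `reconstruct_path`, and `form_region`, the `max_blocks` guard, and the hotness threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical toy CFG: block -> tuple of successors.
#   two successors: conditional branch (taken target, fall-through target)
#   one successor:  unconditional direct branch (consumes no history bit)
#   no successors:  program exit
CFG = {
    "A": ("B", "C"),
    "B": ("D",),
    "C": ("D",),
    "D": ("A", "E"),  # taken: loop back to A; not taken: exit to E
    "E": (),
}

def reconstruct_path(cfg, start, tnt_bits, max_blocks=10_000):
    """Replay taken (1) / not-taken (0) bits over the CFG to recover
    the sequence of executed blocks."""
    bits = iter(tnt_bits)
    path = [start]
    block = start
    while len(path) < max_blocks:  # guard against unconditional cycles
        succs = cfg[block]
        if not succs:
            break  # reached an exit block
        if len(succs) == 1:
            block = succs[0]  # direct jump: no bit needed
        else:
            bit = next(bits, None)
            if bit is None:
                break  # branch history exhausted
            block = succs[0] if bit else succs[1]
        path.append(block)
    return path

def form_region(path, threshold):
    """Count edge frequencies along the path and keep edges executed
    at least `threshold` times; the region is the set of their blocks."""
    edge_counts = Counter(zip(path, path[1:]))
    hot_edges = {edge for edge, n in edge_counts.items() if n >= threshold}
    blocks = sorted({b for edge in hot_edges for b in edge})
    return blocks, hot_edges

path = reconstruct_path(CFG, "A", [1, 1, 1, 1, 0, 1, 1, 0])
blocks, hot_edges = form_region(path, threshold=2)
print("".join(path))  # ABDABDACDABDE
print(blocks)         # ['A', 'B', 'D']
```

In this sketch the hot loop A, B, D is selected as one region, while the rarely executed block C and the exit block E are left out. The point is only that edge frequencies fall out of the recorded branch outcomes without instrumenting the translated code.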


Index Terms

1. General and reference
  1. Cross-computing tools and techniques
    1. Design
    2. Performance
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Dynamic compilers

Published in
ACM Transactions on Architecture and Code Optimization Volume 15, Issue 4
December 2018
706 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/3284745
Editor: Koen De Bosschere, Ghent University
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 16 November 2018
- Accepted: 1 September 2018
- Revised: 1 August 2018
- Received: 1 June 2018

Author Tags
Dynamic binary translation
hardware performance monitoring
next executing tail
processor tracing
region formation
Qualifiers
- research-article
- Research
- Refereed

