Abstract
Dynamic binary translation is a technology for transparently translating and modifying a program at the machine code level as it is running. A significant factor in the performance of a dynamic binary translator is its handling of indirect branches. Unlike direct branches, which have a known target at translation time, an indirect branch requires translating a source program counter address to a translated program counter address every time the branch is executed. This translation can impose a serious runtime penalty if it is not handled efficiently.
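The difference between direct and indirect branches can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `CodeCache` class, its method names, the addresses, and the fixed 64-byte block size are hypothetical and are not taken from MAMBO-X64. A direct branch is resolved once when it is translated, whereas an indirect branch pays for a source-to-translated address lookup each time it runs.

```python
# Hypothetical sketch: names, addresses and block size are illustrative only.

class CodeCache:
    """Maps source program counters (SPC) to translated program counters (TPC)."""

    def __init__(self):
        self.spc_to_tpc = {}    # filled in as blocks are translated
        self.next_tpc = 0x8000  # arbitrary start of the translated-code region
        self.lookups = 0        # number of runtime SPC -> TPC translations

    def translate_block(self, spc):
        """Translate the block at `spc` on first use and return its TPC."""
        if spc not in self.spc_to_tpc:
            self.spc_to_tpc[spc] = self.next_tpc
            self.next_tpc += 0x40  # pretend every translated block is 64 bytes
        return self.spc_to_tpc[spc]

    def link_direct_branch(self, target_spc):
        """Direct branch: target known at translation time, so resolve it once."""
        return self.translate_block(target_spc)

    def run_indirect_branch(self, target_spc):
        """Indirect branch: target is a runtime value, so look it up every time."""
        self.lookups += 1
        return self.translate_block(target_spc)


cache = CodeCache()
direct_tpc = cache.link_direct_branch(0x1000)  # resolved once, then hard-wired
for spc in (0x2000, 0x3000, 0x2000):           # targets only known at runtime
    cache.run_indirect_branch(spc)
```

In this toy model the direct branch never touches the table again, while `cache.lookups` grows with every execution of the indirect branch; that per-execution lookup is the cost an efficient translator has to minimize.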
MAMBO-X64, a dynamic binary translator that translates 32-bit ARM (AArch32) code to 64-bit ARM (AArch64) code, uses three novel techniques to improve the performance of indirect branch translation. Together, these techniques allow MAMBO-X64 to achieve a very low performance overhead of only 10% on average compared to native execution of 32-bit programs. Hardware-assisted function returns use a software return address stack to predict the targets of function returns, making use of several novel optimizations while also exploiting hardware return address prediction. This technique has a significant impact on most benchmarks, reducing binary translation overhead compared to native execution by 40% on average and by 90% on some benchmarks. Branch table inference, an algorithm for detecting and translating branch tables, can reduce the overhead of translated code by up to 40% on some SPEC CPU2006 benchmarks. The remaining indirect branches are handled using a fast atomic hash table, which is optimized to work with multiple threads. This last technique translates indirect branches using a single shared hash table while avoiding expensive synchronization in performance-critical lookup code. This allows the performance to be on par with thread-private hash tables while having superior memory scalability.
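As a rough illustration of the idea behind a software return address stack, the sketch below shows a generic version of the technique. It is not MAMBO-X64's implementation: it omits the optimizations and the hardware return address prediction mentioned above, and the `ReturnAddressStack` class, its methods, and all addresses are hypothetical. Each call pushes the expected return address together with its translated counterpart; each return pops the prediction and falls back to a full table lookup only when the prediction is wrong.

```python
# Hypothetical sketch of a generic software return address stack.

class ReturnAddressStack:
    def __init__(self, spc_to_tpc):
        self.spc_to_tpc = spc_to_tpc  # fallback SPC -> TPC table (assumed prefilled)
        self.stack = []               # (return SPC, return TPC) pairs
        self.hits = 0
        self.misses = 0

    def on_call(self, return_spc):
        """At a call, record where the matching return is expected to land."""
        self.stack.append((return_spc, self.spc_to_tpc[return_spc]))

    def on_return(self, actual_spc):
        """At a return, use the prediction if it matches, else do a full lookup."""
        if self.stack:
            predicted_spc, predicted_tpc = self.stack.pop()
            if predicted_spc == actual_spc:
                self.hits += 1
                return predicted_tpc
        self.misses += 1
        return self.spc_to_tpc[actual_spc]


table = {0x1004: 0x8100, 0x2008: 0x8200, 0x3000: 0x8300}
ras = ReturnAddressStack(table)
ras.on_call(0x1004)               # outer call
ras.on_call(0x2008)               # nested call
first = ras.on_return(0x2008)     # matches the prediction
second = ras.on_return(0x3000)    # return address was changed: falls back
```

The first return is served directly from the stack, while the second, whose address does not match the pushed prediction, takes the slower table lookup. Well-behaved call and return pairs therefore avoid the general indirect branch path entirely.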
Index Terms
- Optimizing Indirect Branches in Dynamic Binary Translators