Optimizing Indirect Branches in Dynamic Binary Translators

Published: 5 April 2016

Abstract

Dynamic binary translation is a technology for transparently translating and modifying a program at the machine code level as it is running. A significant factor in the performance of a dynamic binary translator is its handling of indirect branches. Unlike direct branches, which have a known target at translation time, an indirect branch requires translating a source program counter address to a translated program counter address every time the branch is executed. This translation can impose a serious runtime penalty if it is not handled efficiently.
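
To make the per-execution cost concrete, the following is a minimal C sketch of the address translation an indirect branch needs, assuming a small direct-mapped cache placed in front of a slow path. It is not taken from any particular translator: `lookup_indirect`, `translate_block`, the table size, the hash, and the fabricated translated addresses are all hypothetical choices made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_BITS 8
#define CACHE_SIZE (1u << CACHE_BITS)

static_assert((CACHE_SIZE & (CACHE_SIZE - 1)) == 0,
              "CACHE_SIZE must be a power of two so the mask works");

/* One entry maps a source program counter (SPC) to a translated one (TPC). */
typedef struct {
    uint32_t  spc;  /* address in the original program; 0 marks an empty slot */
    uintptr_t tpc;  /* address of the corresponding translated code */
} map_entry;

static map_entry cache[CACHE_SIZE];
static unsigned  slow_path_calls;  /* counts misses, for illustration only */

/* Hypothetical slow path standing in for "find or translate the target".
   It only fabricates a deterministic TPC so the sketch is self-contained. */
static uintptr_t translate_block(uint32_t spc) {
    slow_path_calls++;
    return (uintptr_t)0x100000u + (uintptr_t)spc * 2u;
}

/* Simple illustrative hash: drop the two low bits, keep CACHE_BITS bits. */
static unsigned hash_spc(uint32_t spc) {
    return (spc >> 2) & (CACHE_SIZE - 1);
}

/* Runs every time an indirect branch executes: SPC in, TPC out. */
uintptr_t lookup_indirect(uint32_t spc) {
    assert(spc != 0);  /* 0 is reserved as the empty-slot marker */
    map_entry *e = &cache[hash_spc(spc)];
    if (e->spc != spc) {           /* miss: take the slow path and refill */
        e->tpc = translate_block(spc);
        e->spc = spc;
    }
    return e->tpc;                 /* hit: one load, one compare */
}
```

Even on a hit, the lookup adds a hash, a load, and a compare to every indirect branch, which a direct branch with a target known at translation time never pays.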

MAMBO-X64, a dynamic binary translator that translates 32-bit ARM (AArch32) code to 64-bit ARM (AArch64) code, uses three novel techniques to improve the performance of indirect branch translation. Together, these techniques allow MAMBO-X64 to achieve a very low performance overhead of only 10% on average compared to native execution of 32-bit programs. Hardware-assisted function returns use a software return address stack to predict the targets of function returns, making use of several novel optimizations while also exploiting hardware return address prediction. This technique has a significant impact on most benchmarks, reducing binary translation overhead compared to native execution by 40% on average and by 90% on some benchmarks. Branch table inference, an algorithm for detecting and translating branch tables, can reduce the overhead of translated code by up to 40% on some SPEC CPU2006 benchmarks. The remaining indirect branches are handled using a fast atomic hash table, which is optimized to work with multiple threads. This last technique translates indirect branches using a single shared hash table while avoiding expensive synchronization in performance-critical lookup code. This allows the performance to be on par with thread-private hash tables while having superior memory scalability.
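
The basic idea behind a software return address stack can be sketched in C as follows. This is only an illustration under stated assumptions, not MAMBO-X64's implementation: the abstract does not describe its optimizations or how hardware return prediction is exploited, so none of that is modeled here. The names `ras`, `ras_push`, `ras_pop`, the stack depth, and the convention of returning 0 on a misprediction are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define RAS_SIZE 16  /* assumed depth; a power of two so indices wrap with a mask */

static_assert((RAS_SIZE & (RAS_SIZE - 1)) == 0,
              "RAS_SIZE must be a power of two");

typedef struct {
    uint32_t  spc;  /* return address in the source program */
    uintptr_t tpc;  /* matching address in the translated code */
} ras_entry;

typedef struct {
    ras_entry slots[RAS_SIZE];
    unsigned  top;  /* index of the most recent entry, wrapped with a mask */
} ras;

/* At a translated call: record where the callee is expected to return.
   The buffer is circular, so deep recursion overwrites the oldest entries. */
void ras_push(ras *s, uint32_t ret_spc, uintptr_t ret_tpc) {
    s->top = (s->top + 1) & (RAS_SIZE - 1);
    s->slots[s->top] = (ras_entry){ ret_spc, ret_tpc };
}

/* At a translated return: pop the prediction and check it against the
   actual source return address. Returns the predicted TPC on a match,
   or 0 to signal that the caller must fall back to a general lookup. */
uintptr_t ras_pop(ras *s, uint32_t actual_spc) {
    ras_entry e = s->slots[s->top];
    s->top = (s->top - 1) & (RAS_SIZE - 1);
    return e.spc == actual_spc ? e.tpc : 0;
}
```

Because every pop is verified against the actual return address, stale or overwritten entries only cost a fallback lookup; they cannot redirect execution to the wrong translated code.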

Published in

ACM Transactions on Architecture and Code Optimization, Volume 13, Issue 1
April 2016
347 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/2899032

Copyright © 2016 Owner/Author

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 5 April 2016
• Revised: 1 December 2015
• Accepted: 1 December 2015
• Received: 1 June 2015

Qualifiers

• research-article
• Research
• Refereed
