Abstract
SIMD architectures have trended toward longer vector lengths, and newer vector instruction sets have added richer SIMD features. However, legacy or proprietary applications compiled for a short-SIMD ISA cannot benefit from the improved parallelism and enhanced vector primitives of a long-SIMD architecture, and so achieve only a small fraction of the potential peak performance. This article presents a dynamic binary translation technique that enables short-SIMD binaries to exploit the benefits of new SIMD architectures by rewriting short-SIMD loop code. We propose a general approach that translates loops consisting of short-SIMD instructions to a machine-independent IR, performs SIMD loop transformation and optimization at the IR level, and finally translates the result to long-SIMD instructions. Two solutions are presented to enforce SIMD load/store alignment: one for the problem caused by the binary translator’s internal translation condition, and one general approach using dynamic loop peeling optimization. Benchmark results show average speedups of 1.51× and 2.48× for ARM NEON to x86 AVX2 and ARM NEON to x86 AVX-512 loop transformation, respectively.
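The loop peeling idea mentioned in the abstract can be illustrated with a minimal sketch in C. This is not the article's implementation: the names `VEC_BYTES`, `VEC_LANES`, `peel_count`, and `add_peeled` are hypothetical, the 32-byte width is an assumption chosen to match a 256-bit AVX2 register, and the inner fixed-width loop stands in for a single long-SIMD operation. The alignment test happens at run time, so a few scalar iterations are peeled off until the store address reaches a vector boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VEC_BYTES 32u                          /* assumed long-SIMD width in bytes */
#define VEC_LANES (VEC_BYTES / sizeof(float))  /* floats per assumed vector */

/* Hypothetical helper: number of leading scalar iterations to run before
   dst reaches a VEC_BYTES boundary, capped at n. */
static size_t peel_count(const float *dst, size_t n)
{
    uintptr_t mis = (uintptr_t)dst % VEC_BYTES;
    if (mis == 0)
        return 0;
    if (mis % sizeof(float) != 0)
        return n;  /* peeling can never align dst: stay scalar throughout */
    size_t peel = (VEC_BYTES - mis) / sizeof(float);
    return peel < n ? peel : n;
}

/* dst[i] = a[i] + b[i], with the main loop's stores aligned to VEC_BYTES. */
void add_peeled(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    size_t peel = peel_count(dst, n);

    /* Prologue: peeled scalar iterations up to the alignment boundary. */
    for (; i < peel; ++i)
        dst[i] = a[i] + b[i];

    /* Main loop: each trip models one long-SIMD add with an aligned store. */
    for (; i + VEC_LANES <= n; i += VEC_LANES) {
        assert((uintptr_t)(dst + i) % VEC_BYTES == 0);
        for (size_t l = 0; l < VEC_LANES; ++l)
            dst[i + l] = a[i + l] + b[i + l];
    }

    /* Epilogue: leftover iterations that do not fill a whole vector. */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

Only the store stream is aligned in this sketch; the loads from `a` and `b` may still be unaligned, which a real translator would have to handle separately (for example with unaligned load instructions).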
Index Terms
- Improving SIMD Parallelism via Dynamic Binary Translation