Abstract
Single instruction, multiple data (SIMD) has been adopted for decades because of its superior performance and power efficiency. SIMD capability (i.e., register width, number of registers, and advanced instructions) has diverged rapidly across SIMD instruction-set architectures (ISAs). As a result, migrating existing applications to a host ISA that has fewer but longer SIMD registers and more advanced instructions raises the issue of asymmetric SIMD capability. To date, this issue has been overlooked, so the host SIMD capability is underutilized and performance is suboptimal. In this article, we present a novel binary translation technique called spill-aware superword level parallelism (saSLP), which combines short ARMv8 instructions and registers in the guest binaries to exploit the x86 AVX2 host’s parallelism, register capacity, and gather instructions. Our experimental results show that saSLP improves performance by 1.6× (2.3×) across a number of benchmarks and reduces spilling by 97% (99%) for ARMv8 to x86 AVX2 (AVX-512) translation. Furthermore, with AVX2 (AVX-512) gather instructions, saSLP speeds up several data-irregular applications that cannot be vectorized on ARMv8 NEON by up to 3.9× (4.2×).
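To make the idea of combining short guest instructions concrete, the following is a toy sketch, not the saSLP algorithm itself. It assumes 32-bit elements, so a 128-bit NEON register holds four lanes and a 256-bit AVX2 register holds eight; plain Python lists stand in for registers, and every function name here is hypothetical.

```python
# Toy model only: Python lists stand in for SIMD registers.
# Lane counts assume 32-bit elements (an assumption of this sketch).
GUEST_LANES = 4  # 128-bit guest (NEON) register
HOST_LANES = 8   # 256-bit host (AVX2) register


def guest_add(a, b):
    """One guest vector add over 4 lanes."""
    assert len(a) == len(b) == GUEST_LANES
    return [x + y for x, y in zip(a, b)]


def host_add(a, b):
    """One host vector add over 8 lanes."""
    assert len(a) == len(b) == HOST_LANES
    return [x + y for x, y in zip(a, b)]


def pack(lo, hi):
    """Place two guest registers in the low and high halves of one host register."""
    return lo + hi


def unpack(wide):
    """Split a host register back into its two guest-sized halves."""
    return wide[:GUEST_LANES], wide[GUEST_LANES:]


def translate_pair(a0, b0, a1, b1):
    """Execute two independent guest adds as a single host add."""
    return unpack(host_add(pack(a0, a1), pack(b0, b1)))


def gather(base, indices):
    """Model of a gather load: lane i reads base[indices[i]]."""
    return [base[i] for i in indices]
```

In this model, `translate_pair` produces the same lanes as two separate `guest_add` calls while issuing one wide operation and occupying one host register per operand pair instead of two. `gather` shows the kind of indexed load that lets irregular accesses be expressed as a single vector operation.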
Index Terms
- Exploiting SIMD Asymmetry in ARM-to-x86 Dynamic Binary Translation