Exploiting SIMD Asymmetry in ARM-to-x86 Dynamic Binary Translation

Authors:
Yu-Ping Liu, National Taiwan University, Daan Dist. Taipei, Taiwan
Ding-Yong Hong, Academia Sinica, Nankang Dist. Taipei, Taiwan
Jan-Jan Wu, Academia Sinica, Nankang Dist. Taipei, Taiwan
Sheng-Yu Fu, National Taiwan University, Daan Dist. Taipei, Taiwan
Wei-Chung Hsu, National Taiwan University, Daan Dist. Taipei, Taiwan

ACM Transactions on Architecture and Code Optimization, Volume 16, Issue 1, Article No.: 2, pp 1–24, https://doi.org/10.1145/3301488

Published: 13 February 2019

Abstract

Single instruction multiple data (SIMD) has been adopted for decades because of its superior performance and power efficiency. SIMD capability (i.e., register width, number of registers, and advanced instructions) has diverged rapidly across SIMD instruction-set architectures (ISAs). Migrating existing applications to a host ISA that has fewer but longer SIMD registers and more advanced instructions therefore raises the issue of asymmetric SIMD capability. To date, this issue has been overlooked and the host SIMD capability is underutilized, resulting in suboptimal performance. In this article, we present a novel binary translation technique called spill-aware superword level parallelism (saSLP), which combines short ARMv8 instructions and registers in the guest binaries to exploit the x86 AVX2 host’s parallelism, register capacity, and gather instructions. Our experimental results show that saSLP improves performance by 1.6× (2.3×) across a number of benchmarks and reduces spilling by 97% (99%) for ARMv8 to x86 AVX2 (AVX-512) translation. Furthermore, with AVX2 (AVX-512) gather instructions, saSLP speeds up several data-irregular applications that cannot be vectorized on ARMv8 NEON by up to 3.9× (4.2×).
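
The idea of combining short guest instructions into wider host instructions can be illustrated with a small model. The sketch below is not the saSLP algorithm from the article. It only shows, under simplifying assumptions, how two independent 4-lane guest additions (standing in for 128-bit NEON operations) could be packed into one 8-lane host addition (standing in for a 256-bit AVX2 operation). Python lists stand in for SIMD registers, and the names `vector_add`, `translate_naive`, and `translate_packed` are hypothetical helpers introduced for illustration. The sketch also assumes that the paired guest instructions have no data dependence on each other.

```python
# Illustrative model only: Python lists stand in for SIMD registers.
GUEST_LANES = 4   # a 128-bit guest register holding 4 x 32-bit values
HOST_LANES = 8    # a 256-bit host register holding 8 x 32-bit values


def vector_add(a, b):
    """Lane-wise add; one call models one SIMD instruction."""
    assert len(a) == len(b)
    return [x + y for x, y in zip(a, b)]


def translate_naive(guest_ops):
    """Emit one host instruction per guest instruction.

    Each host instruction uses only GUEST_LANES of the HOST_LANES available.
    """
    results = [vector_add(a, b) for a, b in guest_ops]
    return results, len(guest_ops)


def translate_packed(guest_ops):
    """Pack pairs of independent guest adds into one 8-lane host add."""
    results = []
    host_instructions = 0
    for i in range(0, len(guest_ops) - 1, 2):
        (a0, b0), (a1, b1) = guest_ops[i], guest_ops[i + 1]
        # Concatenate two guest registers into one host-width register.
        wide = vector_add(a0 + a1, b0 + b1)
        assert len(wide) == HOST_LANES
        host_instructions += 1
        # Split the wide result back into the two guest-visible values.
        results.append(wide[:GUEST_LANES])
        results.append(wide[GUEST_LANES:])
    if len(guest_ops) % 2:
        # An unpaired guest instruction is translated at guest width.
        a, b = guest_ops[-1]
        results.append(vector_add(a, b))
        host_instructions += 1
    return results, host_instructions


guest_ops = [
    ([1, 2, 3, 4], [10, 20, 30, 40]),
    ([5, 6, 7, 8], [50, 60, 70, 80]),
]
naive_results, naive_count = translate_naive(guest_ops)
packed_results, packed_count = translate_packed(guest_ops)
print(naive_count, packed_count)
```

In this model both translations produce the same guest-visible results, but the packed version issues half as many host instructions for an even number of independent guest operations. A real translator would also have to check dependences, choose which instructions to pair, and manage the host register file, none of which this sketch attempts.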

Published in
ACM Transactions on Architecture and Code Optimization Volume 16, Issue 1
March 2019
157 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/3313806
Editor:
Koen De Bosschere
Ghent University, Belgium
Copyright © 2019 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 13 February 2019
- Accepted: 1 November 2018
- Revised: 1 October 2018
- Received: 1 July 2018
Author Tags
Dynamic binary translation
SIMD
SLP vectorization
Qualifiers
- research-article
- Research
- Refereed