research-article
Open Access

Exploiting SIMD Asymmetry in ARM-to-x86 Dynamic Binary Translation

Published: 13 February 2019

Abstract

Single instruction multiple data (SIMD) has been adopted for decades because of its superior performance and power efficiency. The SIMD capability (i.e., width, number of registers, and advanced instructions) has diverged rapidly across SIMD instruction-set architectures (ISAs). Therefore, migrating existing applications to another host ISA that has fewer but longer SIMD registers and more advanced instructions raises the issue of asymmetric SIMD capability. To date, this issue has been overlooked and the host SIMD capability is underutilized, resulting in suboptimal performance. In this article, we present a novel binary translation technique called spill-aware superword level parallelism (saSLP), which combines short ARMv8 instructions and registers in the guest binaries to exploit the x86 AVX2 host’s parallelism, register capacity, and gather instructions. Our experimental results show that saSLP improves performance by 1.6× (2.3×) across a number of benchmarks and reduces spilling by 97% (99%) for ARMv8 to x86 AVX2 (AVX-512) translation. Furthermore, with AVX2 (AVX-512) gather instructions, saSLP speeds up several data-irregular applications that cannot be vectorized on ARMv8 NEON by up to 3.9× (4.2×).
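The idea of combining short guest operations into wider host operations can be illustrated with a toy model. The sketch below is not the saSLP algorithm and does not use real SIMD instructions; it assumes 32-bit lanes, models registers as Python lists, and all function names (`guest_add`, `host_add`, `pack`, `unpack`, `gather`) are hypothetical. It only shows that two independent 4-lane adds give the same result as one 8-lane add on packed operands, and that a gather fills every host lane from arbitrary indices.

```python
# Illustrative model only: lanes are Python lists, not real SIMD registers.
GUEST_LANES = 4   # assumed: one 128-bit guest register holds 4 x 32-bit lanes
HOST_LANES = 8    # assumed: one 256-bit host register holds 8 x 32-bit lanes


def guest_add(a, b):
    """One guest-width (4-lane) vector add."""
    assert len(a) == len(b) == GUEST_LANES
    return [x + y for x, y in zip(a, b)]


def host_add(a, b):
    """One host-width (8-lane) vector add."""
    assert len(a) == len(b) == HOST_LANES
    return [x + y for x, y in zip(a, b)]


def pack(lo, hi):
    """Place two guest registers in the low and high halves of one host register."""
    return lo + hi


def unpack(wide):
    """Split a host register back into its two guest-width halves."""
    return wide[:GUEST_LANES], wide[GUEST_LANES:]


def gather(memory, indices):
    """Model of a gather: load one element per lane from arbitrary indices."""
    return [memory[i] for i in indices]


# Two independent guest adds with the same operation ...
a0, b0 = [1, 2, 3, 4], [10, 20, 30, 40]
a1, b1 = [5, 6, 7, 8], [50, 60, 70, 80]
separate = (guest_add(a0, b0), guest_add(a1, b1))

# ... become a single host-width add after packing the operands.
combined = unpack(host_add(pack(a0, a1), pack(b0, b1)))

# Irregular (indexed) accesses fill all eight host lanes with one gather.
table = [100, 101, 102, 103, 104, 105, 106, 107, 108, 109]
gathered = gather(table, [9, 0, 4, 4, 2, 7, 1, 3])
```

In this model the packed version issues one wide add instead of two narrow ones and occupies one host register per operand pair, which is the kind of parallelism and register capacity the abstract refers to.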


Published in

ACM Transactions on Architecture and Code Optimization, Volume 16, Issue 1
March 2019
157 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/3313806

Copyright © 2019 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 13 February 2019
• Accepted: 1 November 2018
• Revised: 1 October 2018
• Received: 1 July 2018

Qualifiers: research-article, Research, Refereed
