research-article

Improving SIMD Parallelism via Dynamic Binary Translation

Authors:
Ding-Yong Hong

Institute of Information Science, Academia Sinica, Taipei, Taiwan

Yu-Ping Liu

Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan

Sheng-Yu Fu

Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan

Jan-Jan Wu

Institute of Information Science, Academia Sinica, Taipei, Taiwan

Wei-Chung Hsu

Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan

ACM Transactions on Embedded Computing Systems, Volume 17, Issue 3, Article No. 61, pp. 1–27. https://doi.org/10.1145/3173456

Published: 12 February 2018

Abstract

Recent trends in SIMD architecture have tended toward longer vector lengths, and more enhanced SIMD features have been introduced in newer vector instruction sets. However, legacy or proprietary applications compiled with short-SIMD ISA cannot benefit from the long-SIMD architecture that supports improved parallelism and enhanced vector primitives, resulting in only a small fraction of potential peak performance. This article presents a dynamic binary translation technique that enables short-SIMD binaries to exploit benefits of new SIMD architectures by rewriting short-SIMD loop code. We propose a general approach that translates loops consisting of short-SIMD instructions to machine-independent IR, conducts SIMD loop transformation/optimization at this IR level, and finally translates to long-SIMD instructions. Two solutions are presented to enforce SIMD load/store alignment, one for the problem caused by the binary translator’s internal translation condition and one general approach using dynamic loop peeling optimization. Benchmark results show that average speedups of 1.51× and 2.48× are achieved for an ARM NEON to x86 AVX2 and x86 AVX-512 loop transformation, respectively.
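
The dynamic loop peeling idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the article's implementation: the function names, the 32-byte vector width, and the element-wise add loop are assumptions chosen for illustration, and the inner lane loop stands in for a single long-SIMD instruction that a real translator would emit. The sketch runs scalar iterations until the store address reaches a vector-aligned boundary, then processes full vector-width chunks, and finishes the remainder in scalar form.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed long-SIMD width for this sketch: 32 bytes (256 bits). */
#define VEC_BYTES 32u
#define LANES (VEC_BYTES / sizeof(float)) /* 8 float lanes per vector */

/* Hypothetical helper: number of leading scalar iterations to run
 * before dst reaches a VEC_BYTES-aligned address, capped at n. */
static size_t peel_count(const float *dst, size_t n)
{
    uintptr_t misalign = (uintptr_t)dst % VEC_BYTES;
    if (misalign == 0)
        return 0;
    if (misalign % sizeof(float) != 0)
        return n; /* alignment is unreachable: run the whole loop scalar */
    size_t peel = (VEC_BYTES - misalign) / sizeof(float);
    return peel < n ? peel : n;
}

/* dst[i] = a[i] + b[i], with the store stream aligned by peeling.
 * Only the store address is aligned here; the loads from a and b
 * may still be unaligned in this sketch. */
static void add_arrays_peeled(float *dst, const float *a, const float *b,
                              size_t n)
{
    size_t i = 0;
    size_t peel = peel_count(dst, n);

    /* Peeled prologue: scalar iterations up to the aligned boundary. */
    for (; i < peel; i++)
        dst[i] = a[i] + b[i];

    /* Main body: one pass per full vector; the lane loop models a
     * single long-SIMD add followed by an aligned vector store. */
    for (; i + LANES <= n; i += LANES) {
        assert((uintptr_t)(dst + i) % VEC_BYTES == 0);
        for (size_t lane = 0; lane < LANES; lane++)
            dst[i + lane] = a[i + lane] + b[i + lane];
    }

    /* Scalar epilogue for the leftover iterations. */
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

The peel count is computed at run time from the actual address, which is what makes the peeling dynamic: a translator working on a binary generally cannot know the alignment of a pointer when it generates code.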



Published in
ACM Transactions on Embedded Computing Systems Volume 17, Issue 3
May 2018
309 pages
ISSN:1539-9087
EISSN:1558-3465
DOI:10.1145/3185335
Editor:
Sandeep K. Shukla
Indian Institute of Technology, India
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org
Publisher
Association for Computing Machinery
New York, NY, United States

Journal Family
ACM Journals for the Design of Smart and Connected Systems
Publication History
- Published: 12 February 2018
- Accepted: 1 December 2017
- Revised: 1 September 2017
- Received: 1 April 2017
Author Tags
Dynamic binary translation
SIMD
compiler annotation
dynamic loop peeling
vectorization
Qualifiers
- research-article
- Research
- Refereed