
Improving SIMD Parallelism via Dynamic Binary Translation

Published: 12 February 2018

Abstract

Recent SIMD architectures have trended toward longer vector lengths, and newer vector instruction sets have introduced enhanced SIMD features. However, legacy or proprietary applications compiled for a short-SIMD ISA cannot benefit from a long-SIMD architecture that supports improved parallelism and enhanced vector primitives, so they achieve only a small fraction of the potential peak performance. This article presents a dynamic binary translation technique that enables short-SIMD binaries to exploit the benefits of new SIMD architectures by rewriting short-SIMD loop code. We propose a general approach that translates loops consisting of short-SIMD instructions to a machine-independent IR, conducts SIMD loop transformation and optimization at the IR level, and finally translates the result to long-SIMD instructions. Two solutions are presented to enforce SIMD load/store alignment: one for the problem caused by the binary translator’s internal translation condition, and one general approach using dynamic loop peeling optimization. Benchmark results show average speedups of 1.51× and 2.48× for ARM NEON to x86 AVX2 and ARM NEON to x86 AVX-512 loop transformation, respectively.
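The loop peeling idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the article's implementation: the function names `peel_counts` and `peeled_add`, the 4-byte element size, and the 32-byte vector width (the size of one AVX2 register) are assumptions chosen for illustration, and Python lists stand in for vector registers. The sketch splits a loop into a scalar prologue that runs until the address reaches a vector-width boundary, an aligned vector body, and a scalar epilogue for the leftover elements.

```python
def peel_counts(base_addr, elem_size, vec_bytes, n):
    """Split n iterations into (prologue, vector_iters, epilogue).

    Hypothetical helper: assumes base_addr is a multiple of elem_size,
    so a whole number of scalar iterations reaches the next boundary.
    """
    if base_addr % elem_size != 0:
        raise ValueError("base address must be element-aligned")
    lanes = vec_bytes // elem_size
    misalign = base_addr % vec_bytes
    # Scalar iterations needed before the address is vec_bytes-aligned.
    peel = 0 if misalign == 0 else (vec_bytes - misalign) // elem_size
    peel = min(peel, n)
    vector_iters = (n - peel) // lanes
    epilogue = n - peel - vector_iters * lanes
    return peel, vector_iters, epilogue


def peeled_add(a, b, base_addr, elem_size=4, vec_bytes=32):
    """Element-wise add, structured as prologue / vector body / epilogue.

    base_addr models the run-time address of the output array; only that
    one stream is aligned here, which is a simplification.
    """
    n = len(a)
    out = [0] * n
    lanes = vec_bytes // elem_size
    peel, vector_iters, epilogue = peel_counts(base_addr, elem_size, vec_bytes, n)
    i = 0
    for _ in range(peel):            # scalar prologue (peeled iterations)
        out[i] = a[i] + b[i]
        i += 1
    for _ in range(vector_iters):    # one "aligned vector op" per iteration
        out[i:i + lanes] = [x + y for x, y in zip(a[i:i + lanes], b[i:i + lanes])]
        i += lanes
    for _ in range(epilogue):        # scalar epilogue for leftover elements
        out[i] = a[i] + b[i]
        i += 1
    return out


print(peel_counts(0x1010, 4, 32, 100))  # (4, 12, 0)
```

With a base address of 0x1010 the array starts 16 bytes past a 32-byte boundary, so four 4-byte elements are peeled before the aligned body begins. Because the address is known only at run time, the split has to be computed dynamically, which is why a binary translator would do this during execution rather than ahead of time.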



          • Published in

            ACM Transactions on Embedded Computing Systems, Volume 17, Issue 3
            May 2018
            309 pages
            ISSN: 1539-9087
            EISSN: 1558-3465
            DOI: 10.1145/3185335

            Copyright © 2018 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 12 February 2018
            • Accepted: 1 December 2017
            • Revised: 1 September 2017
            • Received: 1 April 2017


            Qualifiers

            • research-article
            • Research
            • Refereed
