Abstract
The efficacy of single instruction, multiple data (SIMD) architectures is limited when handling divergent control flows, because the resulting SIMD fragments use only a subset of the available lanes. We propose an iteration interleaving-based SIMD lane partition (IISLP) architecture that interleaves the execution of consecutive iterations and dynamically partitions SIMD lanes into branch paths with comparable execution time. The benefits are twofold: SIMD fragments under divergent branches can execute in parallel, and fragment starvation can be eliminated. Our experiments show that IISLP doubles the performance of a baseline mechanism and provides a speedup of 28% over instruction shuffle.
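To make the lane-utilization argument concrete, the following is a minimal toy cycle model, not the IISLP hardware mechanism itself. It assumes a single two-way branch per iteration, fixed per-path costs, and that work from consecutive iterations can be assigned to any lane at no cost; the function names `serialized_cycles` and `partitioned_cycles` and all parameters are hypothetical and introduced only for illustration.

```python
from math import ceil


def serialized_cycles(masks, cost_then, cost_else):
    """Toy baseline: in every iteration, each branch path that has at least
    one active lane occupies the whole SIMD unit for its full cost."""
    total = 0
    for mask in masks:
        if any(mask):          # some lane takes the "then" path
            total += cost_then
        if not all(mask):      # some lane takes the "else" path
            total += cost_else
    return total


def partitioned_cycles(masks, cost_then, cost_else):
    """Toy idealization of lane partitioning: pool the work of all
    iterations, split the lanes between the two paths, and pick the split
    whose slower side finishes earliest (i.e. comparable execution time).
    Assumes any work item can run on any lane (an assumption of this sketch)."""
    lanes = len(masks[0])
    n_then = sum(sum(mask) for mask in masks)
    n_else = lanes * len(masks) - n_then

    # Fallback schedule: time-share all lanes, one path after the other.
    best = ceil(n_then / lanes) * cost_then + ceil(n_else / lanes) * cost_else
    if n_then == 0 or n_else == 0:
        return best

    # Try every split of the lanes into two concurrently running groups.
    for then_lanes in range(1, lanes):
        else_lanes = lanes - then_lanes
        t = max(ceil(n_then / then_lanes) * cost_then,
                ceil(n_else / else_lanes) * cost_else)
        best = min(best, t)
    return best


if __name__ == "__main__":
    # Four iterations on four lanes; one lane per iteration takes the long path.
    masks = [[True, False, False, False]] * 4
    print(serialized_cycles(masks, cost_then=6, cost_else=2))
    print(partitioned_cycles(masks, cost_then=6, cost_else=2))
```

In this toy setting, the serialized model needs 32 cycles while the idealized partition needs 12; these figures describe only the sketch and say nothing about the measured results reported in the abstract.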