Iteration Interleaving--Based SIMD Lane Partition

Authors:
Yaohua Wang
Dong Wang
Shuming Chen
Zonglin Liu
Shenggang Chen
Xiaowen Chen
Xu Zhou

Affiliation (all authors): National University of Defense Technology, Hunan Province, P.R. China

ACM Transactions on Architecture and Code Optimization, Volume 12, Issue 4, Article No. 58, pp. 1–18. https://doi.org/10.1145/2847253

Published: 04 January 2016

Abstract

The efficacy of single instruction, multiple data (SIMD) architectures is limited when handling divergent control flows, because divergence leaves SIMD fragments using only a subset of the available lanes. We propose an iteration interleaving--based SIMD lane partition (IISLP) architecture that interleaves the execution of consecutive iterations and dynamically partitions SIMD lanes into branch paths with comparable execution time. The benefits are twofold: SIMD fragments under divergent branches can execute in parallel, and fragment starvation can be effectively eliminated. Our experiments show that IISLP doubles the performance of a baseline mechanism and provides a 28% speedup over instruction shuffle.
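
To make the lane-utilization problem concrete, the following is a minimal Python sketch of a toy cost model. It is not the IISLP mechanism described in the paper. It assumes a hypothetical 8-lane machine and a two-way branch whose paths cost `t_then` and `t_else` cycles, and it compares (a) masked execution that runs both paths back to back in every vector iteration against (b) a hypothetical scheme that pools the elements of several iterations and splits the lanes between the two paths so that both finish at a similar time. All function names, lane counts, and cycle costs are illustrative assumptions, and the model ignores data movement, scheduling overhead, and memory effects.

```python
from math import ceil

def serialized_cycles(iters, t_then, t_else):
    """Masked execution: in each vector iteration, every path that has at
    least one active lane is executed in turn on the full SIMD width."""
    total = 0
    for n_then, n_else in iters:
        if n_then:
            total += t_then
        if n_else:
            total += t_else
    return total

def partitioned_cycles(iters, lanes, t_then, t_else):
    """Toy lane partition (assumption, not the paper's algorithm): pool the
    elements of all iterations, give k lanes to the 'then' path and the rest
    to the 'else' path, and pick the k that minimizes the finish time."""
    n_then = sum(a for a, _ in iters)
    n_else = sum(b for _, b in iters)
    if n_then == 0:
        return ceil(n_else / lanes) * t_else
    if n_else == 0:
        return ceil(n_then / lanes) * t_then
    return min(
        max(ceil(n_then / k) * t_then, ceil(n_else / (lanes - k)) * t_else)
        for k in range(1, lanes)
    )

def utilization(iters, lanes, t_then, t_else, cycles):
    """Fraction of lane-cycles that perform useful work."""
    useful = sum(a * t_then + b * t_else for a, b in iters)
    return useful / (lanes * cycles)

# Hypothetical workload: four vector iterations on 8 lanes, given as
# (elements taking the 'then' path, elements taking the 'else' path).
LANES = 8
ITERS = [(5, 3), (2, 6), (4, 4), (6, 2)]
T_THEN, T_ELSE = 10, 30

ser = serialized_cycles(ITERS, T_THEN, T_ELSE)
par = partitioned_cycles(ITERS, LANES, T_THEN, T_ELSE)
print("serialized :", ser, "cycles, utilization",
      round(utilization(ITERS, LANES, T_THEN, T_ELSE, ser), 2))
print("partitioned:", par, "cycles, utilization",
      round(utilization(ITERS, LANES, T_THEN, T_ELSE, par), 2))
```

In this toy setting the serialized schedule leaves roughly half of the lane-cycles idle, while the partitioned schedule finishes sooner by running both paths side by side. The numbers only illustrate why divergence wastes lanes and say nothing about the speedups reported in the abstract.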

Index Terms

1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Single instruction, multiple data
2. Hardware
  1. Very large scale integration design
    1. Application-specific VLSI designs
      1. Application specific processors

Published in
ACM Transactions on Architecture and Code Optimization Volume 12, Issue 4
January 2016
848 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/2836331
Editor: Koen De Bosschere, Ghent University
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 4 January 2016
- Accepted: 1 November 2015
- Revised: 1 October 2015
- Received: 1 July 2015
Author Tags
SIMD
SIMD lane partition
instruction shuffle
iteration interleaving
vector iteration
Qualifiers
- research-article
- Research
- Refereed