Iteration Interleaving-Based SIMD Lane Partition

Published: 04 January 2016
Abstract

Single instruction, multiple data (SIMD) architectures lose efficiency on divergent control flow, because each SIMD fragment then uses only a subset of the available lanes. We propose an iteration interleaving-based SIMD lane partition (IISLP) architecture that interleaves the execution of consecutive iterations and dynamically partitions the SIMD lanes among branch paths with comparable execution time. The benefits are twofold: SIMD fragments under divergent branches can execute in parallel, and fragment starvation is effectively eliminated. Our experiments show that IISLP doubles the performance of a baseline mechanism and provides a speedup of 28% over instruction shuffle.
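
The intuition behind lane partitioning can be illustrated with a toy cost model. The sketch below is not the IISLP hardware mechanism, and it does not reproduce the paper's results. It assumes a single two-way branch, a fixed cycle cost per path, and at least two lanes. The function names (`serialized_time`, `partitioned_time`) and all numbers are hypothetical. The baseline runs both paths back to back whenever a group of iterations diverges. The partitioned variant pools iterations by path and searches for the lane split under which both paths finish at about the same time.

```python
from math import ceil


def serialized_time(outcomes, lanes, cost_taken, cost_not_taken):
    """Toy baseline: each group of `lanes` consecutive iterations executes
    every branch path that at least one of its lanes needs, one after another."""
    total = 0
    for start in range(0, len(outcomes), lanes):
        group = outcomes[start:start + lanes]
        if any(group):          # some lane takes the branch
            total += cost_taken
        if not all(group):      # some lane does not take the branch
            total += cost_not_taken
    return total


def partitioned_time(outcomes, lanes, cost_taken, cost_not_taken):
    """Toy partition: pool iterations by branch path, then pick the lane split
    that minimizes the finish time of the slower path.
    Returns (time, lanes_given_to_taken_path)."""
    if lanes < 2:
        raise ValueError("this sketch assumes at least two lanes")
    n_taken = sum(1 for o in outcomes if o)
    n_not = len(outcomes) - n_taken
    if n_taken == 0:            # no divergence: all lanes run one path
        return ceil(n_not / lanes) * cost_not_taken, 0
    if n_not == 0:
        return ceil(n_taken / lanes) * cost_taken, lanes
    best = None
    for k in range(1, lanes):   # k lanes for "taken", lanes - k for the other path
        t = max(ceil(n_taken / k) * cost_taken,
                ceil(n_not / (lanes - k)) * cost_not_taken)
        if best is None or t < best[0]:
            best = (t, k)
    return best


# Hypothetical workload: 16 iterations alternating between the two paths,
# 4 lanes, 10 cycles for the taken path and 30 for the other.
outcomes = [True, False] * 8
print(serialized_time(outcomes, 4, 10, 30))
print(partitioned_time(outcomes, 4, 10, 30))
```

In this model the cheaper path receives fewer lanes, so both paths finish at roughly the same time instead of one group idling while the other runs.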

      • Published in

        ACM Transactions on Architecture and Code Optimization, Volume 12, Issue 4
        January 2016
        848 pages
        ISSN: 1544-3566
        EISSN: 1544-3973
        DOI: 10.1145/2836331
        Copyright © 2016 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 4 January 2016
        • Accepted: 1 November 2015
        • Revised: 1 October 2015
        • Received: 1 July 2015

        Qualifiers

        • research-article
        • Research
        • Refereed
