research-article

iGPU: exception support and speculative execution on GPUs

Published: 9 June 2012

Abstract

Since the introduction of fully programmable vertex shader hardware, GPU computing has made tremendous advances. Exception support and speculative execution are the next steps to expand the scope and improve the usability of GPUs. However, traditional mechanisms for supporting exceptions and speculative execution are highly intrusive to GPU hardware design. This paper builds on two related insights to provide a unified, lightweight mechanism for supporting exceptions and speculation on GPUs.

First, we observe that GPU programs can be broken into code regions that contain little or no live register state at their entry points. Second, we recognize that these regions are simple to generate in such a way that they are idempotent (re-executing a region from its entry point produces the same result), allowing their entry points to serve as program recovery points and enabling support for exception handling, fast context switches, and speculation, all with very low overhead. We call the architecture of GPUs executing these idempotent regions the iGPU architecture. The required hardware extensions are minimal, and the construction of idempotent code regions is fully transparent under the dynamic compilation framework typical of GPUs. We demonstrate how iGPU exception support enables virtual memory paging with very low overhead (1% to 4%), and how speculation support enables circuit-speculation techniques that can reduce energy by over 25%.
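The recovery argument above can be illustrated with a toy Python model. This is a sketch under our own assumptions, not the paper's compiler or hardware: instructions are straight-line operations on named memory locations, and `Instr`, `split_idempotent`, `run`, and `PageFault` are hypothetical names. A region is cut just before any write that would overwrite a value the region has already read as input (a clobber antidependence), and an exception is handled by simply re-executing the faulting region from its entry point. No state is checkpointed at region boundaries, loosely mirroring the observation that region entries carry little live register state.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence, Set, Tuple

# Toy model (assumption): all program state lives in named memory locations.


@dataclass(frozen=True)
class Instr:
    dest: str               # location this instruction writes
    srcs: Tuple[str, ...]   # locations it reads
    op: Callable[..., int]  # computes the new value of dest from the source values


class PageFault(Exception):
    """Stands in for any recoverable exception raised partway through a region."""


def _clobbers(ins: Instr, live_in: Set[str], written: Set[str]) -> bool:
    # Unsafe if the write overwrites a value the region consumes as input:
    # re-executing the region would then read the new value instead of the old one.
    reads = {s for s in ins.srcs if s not in written}
    return ins.dest in live_in or ins.dest in reads


def split_idempotent(prog: Sequence[Instr]) -> List[List[Instr]]:
    """Greedily cut straight-line code into regions free of clobber antidependences."""
    regions: List[List[Instr]] = [[]]
    live_in: Set[str] = set()
    written: Set[str] = set()
    for ins in prog:
        if _clobbers(ins, live_in, written):
            regions.append([])  # start a new region just before the clobbering write
            live_in, written = set(), set()
            if _clobbers(ins, live_in, written):
                raise ValueError(f"'{ins.dest}' overwrites its own input; rename via a temporary")
        live_in |= {s for s in ins.srcs if s not in written}
        written.add(ins.dest)
        regions[-1].append(ins)
    return [r for r in regions if r]


def run(regions: Sequence[Sequence[Instr]], mem: Dict[str, int],
        fault_at: Optional[Tuple[int, int]] = None) -> Dict[str, int]:
    """Execute regions in order. A fault injected once before instruction
    fault_at=(region, index) is recovered by restarting that region from its
    entry point; no register or memory checkpoint is taken."""
    mem = dict(mem)
    pending = fault_at
    for r, region in enumerate(regions):
        while True:
            try:
                for i, ins in enumerate(region):
                    if pending == (r, i):
                        pending = None  # the "handler" fixes the cause, e.g. maps the page
                        raise PageFault()
                    mem[ins.dest] = ins.op(*(mem[s] for s in ins.srcs))
                break  # region completed without a fault
            except PageFault:
                continue  # re-execute from the region's entry point


    return mem


prog = [
    Instr("t", ("a",), lambda a: a + 1),
    Instr("a", ("t",), lambda t: t * 2),  # overwrites live-in 'a', so a cut goes here
    Instr("c", ("a", "t"), lambda a, t: a + t),
]

regions = split_idempotent(prog)
print([[ins.dest for ins in r] for r in regions])  # [['t'], ['a', 'c']]
print(run(regions, {"a": 3}, fault_at=(1, 1)))     # {'a': 8, 't': 4, 'c': 12}
print(run([prog], {"a": 3}, fault_at=(0, 2)))      # {'a': 18, 't': 9, 'c': 27}
```

With idempotent regions, the faulted run matches the fault-free result. Treating the whole program as one region and restarting it after the fault corrupts `a`, because the region has already overwritten its own input before the exception.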



  • Published in

    ACM SIGARCH Computer Architecture News, Volume 40, Issue 3
    ISCA '12
    June 2012
    559 pages
    ISSN:0163-5964
    DOI:10.1145/2366231
    • ACM Conferences
      ISCA '12: Proceedings of the 39th Annual International Symposium on Computer Architecture
      June 2012
      584 pages
      ISBN:9781450316422

    Copyright © 2012 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

