Abstract
Since the introduction of fully programmable vertex shader hardware, GPU computing has made tremendous advances. Exception support and speculative execution are the next steps to expand the scope and improve the usability of GPUs. However, traditional mechanisms to support exceptions and speculative execution are highly intrusive to GPU hardware design. This paper builds on two related insights to provide a unified lightweight mechanism for supporting exceptions and speculation on GPUs.
First, we observe that GPU programs can be broken into code regions that contain little or no live register state at their entry point. Second, we recognize that these regions are simple to generate in such a way that they are idempotent, which allows their entry points to function as program recovery points and enables support for exception handling, fast context switches, and speculation, all with very low overhead. We call the architecture of GPUs executing these idempotent regions the iGPU architecture. The hardware extensions required are minimal, and the construction of idempotent code regions is fully transparent under the typical dynamic compilation framework of GPUs. We demonstrate how iGPU exception support enables virtual memory paging with very low overhead (1% to 4%), and how speculation support enables circuit-speculation techniques that can provide over 25% reduction in energy.
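To make the recovery-point idea concrete, the following is a minimal toy sketch in Python, not the paper's compiler algorithm or hardware mechanism. It assumes a simplified model in which a region is idempotent when it never overwrites a value it reads, and in which a `PageFault` exception stands in for any mid-region exception. All names (`PageFault`, `idempotent_region`, `non_idempotent_region`, `run_with_recovery`) are hypothetical and chosen only for illustration.

```python
class PageFault(Exception):
    """Hypothetical stand-in for an exception raised partway through a region."""


def idempotent_region(src, dst, fault_at=None):
    # Reads only src and writes only dst: no input is overwritten,
    # so re-running from the entry point yields the same result.
    for i, x in enumerate(src):
        if i == fault_at:
            raise PageFault(i)
        dst[i] = 2 * x


def non_idempotent_region(buf, fault_at=None):
    # Overwrites its own input in place: a partial run changes
    # what a re-execution would read.
    for i in range(len(buf)):
        if i == fault_at:
            raise PageFault(i)
        buf[i] = 2 * buf[i]


def run_with_recovery(region, *args, fault_at):
    # Recovery is a jump back to the region entry; nothing is checkpointed.
    try:
        region(*args, fault_at=fault_at)
    except PageFault:
        # Assume the fault has been serviced, then re-execute the whole region.
        region(*args, fault_at=None)


src = [1, 2, 3, 4]
dst = [0, 0, 0, 0]
run_with_recovery(idempotent_region, src, dst, fault_at=2)

buf = [1, 2, 3, 4]
run_with_recovery(non_idempotent_region, buf, fault_at=2)

print(dst)  # [2, 4, 6, 8]  correct after re-execution
print(buf)  # [4, 8, 6, 8]  first two elements were doubled twice
```

In this sketch the idempotent region recovers correctly by simple re-execution, while the in-place variant double-applies the update to the elements written before the fault. That difference is why a region's entry point can act as a recovery point only when the region is idempotent.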
iGPU: exception support and speculative execution on GPUs
ISCA '12: Proceedings of the 39th Annual International Symposium on Computer Architecture