research-article

iGPU: exception support and speculative execution on GPUs

Authors:
Jaikrishnan Menon, Marc De Kruijf, Karthikeyan Sankaralingam
University of Wisconsin-Madison

ACM SIGARCH Computer Architecture News, Volume 40, Issue 3, June 2012, pp. 72–83. https://doi.org/10.1145/2366231.2337168

Published: 09 June 2012

Abstract

Since the introduction of fully programmable vertex shader hardware, GPU computing has made tremendous advances. Exception support and speculative execution are the next steps to expand the scope and improve the usability of GPUs. However, traditional mechanisms to support exceptions and speculative execution are highly intrusive to GPU hardware design. This paper builds on two related insights to provide a unified lightweight mechanism for supporting exceptions and speculation on GPUs.

First, we observe that GPU programs can be broken into code regions that contain little or no live register state at their entry point. We then also recognize that it is simple to generate these regions in such a way that they are idempotent, allowing their entry points to function as program recovery points and enabling support for exception handling, fast context switches, and speculation, all with very low overhead. We call the architecture of GPUs executing these idempotent regions the iGPU architecture. The hardware extensions required are minimal and the construction of idempotent code regions is fully transparent under the typical dynamic compilation framework of GPUs. We demonstrate how iGPU exception support enables virtual memory paging with very low overhead (1% to 4%), and how speculation support enables circuit-speculation techniques that can provide over 25% reduction in energy.
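
To make the recovery-point idea concrete, the following is a minimal sketch in Python and not the paper's compiler or hardware mechanism. It contrasts a region that overwrites its own input with one that writes only to a separate output, and re-executes each from its entry point after a simulated fault. The names `PageFault`, `scale_in_place`, `scale_to_output`, and `run_with_recovery` are hypothetical and introduced only for illustration.

```python
class PageFault(Exception):
    """Hypothetical stand-in for an exception raised partway through a region."""


def scale_in_place(mem, fault_at=None):
    # Not idempotent: each element of mem["x"] is read and then overwritten,
    # so a re-run from the entry point sees already-modified inputs.
    for i in range(len(mem["x"])):
        if i == fault_at:
            raise PageFault(i)
        mem["x"][i] = mem["x"][i] * 2


def scale_to_output(mem, fault_at=None):
    # Idempotent: mem["x"] is only read and mem["y"] is only written,
    # so running the region again from its entry gives the same result.
    for i in range(len(mem["x"])):
        if i == fault_at:
            raise PageFault(i)
        mem["y"][i] = mem["x"][i] * 2


def run_with_recovery(region, mem, fault_at):
    """Run a region; on a fault, restart it from its entry point.

    No state is saved or restored here. Recovery relies entirely on the
    region being safe to re-execute (an assumption of this sketch).
    """
    try:
        region(mem, fault_at)
    except PageFault:
        region(mem, None)  # the fault is assumed serviced; re-execute from entry
    return mem
```

With input `[1, 2, 3]` and a fault injected at index 2, restarting `scale_to_output` yields `[2, 4, 6]` in `mem["y"]`, while restarting `scale_in_place` doubles the first two elements twice and leaves `[4, 8, 6]`. Only the first region's entry point can serve as a recovery point without checkpointing.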



Published in

ACM SIGARCH Computer Architecture News, Volume 40, Issue 3 (ISCA '12)
June 2012, 559 pages
ISSN: 0163-5964
DOI: 10.1145/2366231

ISCA '12: Proceedings of the 39th Annual International Symposium on Computer Architecture
June 2012, 584 pages
ISBN: 9781450316422
General Chair: Shih-Lien Lu (Intel)
Program Chair: Josep Torrellas (University of Illinois)

Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.
Publisher: Association for Computing Machinery, New York, NY, United States
