Enabling GPGPU Low-Level Hardware Explorations with MIAOW: An Open-Source RTL Implementation of a GPGPU

Abstract
Graphics processing unit (GPU)-based general-purpose computing is developing as a viable alternative to CPU-based computing in many domains. Today’s tools for GPU analysis include simulators such as GPGPU-Sim, Multi2Sim, and Barra. While useful for modeling first-order effects, these tools do not provide a detailed view of GPU microarchitecture and physical design. Further, as GPGPU research evolves, design ideas and modifications demand detailed estimates of their impact on overall area and power. Fueled by this need, we introduce MIAOW (Many-core Integrated Accelerator Of Wisconsin), an open-source RTL implementation of the AMD Southern Islands GPGPU ISA, capable of running unmodified OpenCL-based applications. We present a design motivated by our goals of creating a realistic, flexible, OpenCL-compatible GPGPU capable of emulating a full system. We first explore whether MIAOW is realistic and then use four case studies to show that MIAOW enables the following: a physical design perspective on “traditional” microarchitecture, new types of research exploration, and validation/calibration of simulator-based characterization of hardware. The findings and ideas are contributions in their own right, in addition to MIAOW’s utility as a tool for others’ research.
- 2009. Barrasim: NVIDIA G80 Functional Simulator. Retrieved from https://code.google.com/p/barra-sim/.
- 2012a. AMD Graphics Cores Next Architecture. Retrieved from http://www.amd.com/la/Documents/GCN_Architecture_whitepaper.pdf.
- 2012b. Reference Guide: Southern Islands Series Instruction Set Architecture. Retrieved from http://developer.amd.com/wordpress/media/2012/10/AMD_Southern_Islands_Instruction_Set_Architecture.pdf.
- 2013. AMD APP 3.0 SDK, Kernels and Documentation. Retrieved from http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk.
- M. Abdel-Majeed and M. Annavaram. 2013. Warped register file: A power efficient register file for GPGPUs. In HPCA’13.
- A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt. 2009. Analyzing CUDA workloads using a detailed GPU simulator. In ISPASS’09.
- R. Balasubramanian and K. Sankaralingam. 2013. Virtually-aged sampling DMR: Unifying circuit failure prediction and circuit failure detection. In Proceedings of the 46th International Symposium on Microarchitecture (MICRO’13).
- R. Balasubramanian and K. Sankaralingam. 2014. Understanding the impact of gate-level physical reliability effects on whole program execution. In Proceedings of the 20th International Symposium on High Performance Computer Architecture (HPCA’14).
- P. Bernardi, M. Grosso, and M. S. Reorda. 2007. Hardware-accelerated path-delay fault grading of functional test programs for processor-based systems. In GLSVLSI’07.
- D. Bouvier and B. Sander. 2014. Applying AMD’s Kaveri APU for heterogeneous computing. In HotChips 2014.
- S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron. 2009. Rodinia: A benchmark suite for heterogeneous computing. In Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC’09). IEEE Computer Society, Washington, DC, 44--54.
- J. Y. Chen. 2009. GPU technology trends and future requirements. In IEDM’09.
- N. K. Choudhary, S. V. Wadhavkar, T. A. Shah, H. Mayukh, J. Gandhi, B. H. Dwiel, S. Navada, H. H. Najaf-abadi, and E. Rotenberg. 2011. FabScalar: Composing synthesizable RTL designs of arbitrary cores within a canonical superscalar template. In ISCA’11.
- V. M. del Barrio, C. Gonzalez, J. Roca, A. Fernandez, and R. Espasa. 2006. ATTILA: A cycle-level execution-driven simulator for modern GPU architectures. In ISPASS’06.
- G. Diamos, A. Kerr, S. Yalamanchili, and N. Clark. 2010. Ocelot: A dynamic compiler for bulk-synchronous applications in heterogeneous systems. In PACT’10.
- D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin, K. Flautner, and T. Mudge. 2003. Razor: A low-power pipeline based on circuit-level timing speculation. In MICRO’03.
- M. Fried. 2012. GPGPU Architecture Comparison of ATI and NVIDIA GPUs. Retrieved from http://www.microway.com/pdfs/GPGPU_Architecture_and_Performance_Comparison.pdf.
- W. W. L. Fung and T. M. Aamodt. 2011. Thread block compaction for efficient SIMT control flow. In Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture (HPCA’11). IEEE Computer Society, Washington, DC, 25--36.
- J. Gaisler. 2001. LEON SPARC Processor.
- B. Greskamp, L. Wan, U. R. Karpuzcu, J. J. Cook, J. Torrellas, D. Chen, and C. Zilles. 2009. Blueshift: Designing processors for timing speculation from the ground up. In HPCA’09.
- B. A. Hechtman and D. J. Sorin. 2013. Exploring memory consistency for massively-threaded throughput-oriented processors. In ISCA’13.
- S. Hong and H. Kim. 2009. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In ISCA’09.
- S. Hong and H. Kim. 2010. An integrated GPU power and performance model. In ISCA’10.
- H. Jeon and M. Annavaram. 2012. Warped-DMR: Light-weight error detection for GPGPU. In MICRO’12.
- A. Jog, O. Kayiran, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das. 2013a. Orchestrated scheduling and prefetching for GPGPUs. In ISCA’13.
- A. Jog, O. Kayiran, N. C. Nachiappan, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das. 2013b. OWL: Cooperative thread array aware scheduling techniques for improving GPGPU performance. In ASPLOS’13.
- H. Kim, R. Vuduc, S. Baghsorkhi, J. Choi, and W. Hwu. 2012. Performance Analysis and Tuning for General Purpose Graphics Processing Units (GPGPUs). Morgan & Claypool.
- Y. Lee, R. Avizienis, A. Bishara, R. Xia, D. Lockhart, C. Batten, and K. Asanović. 2011. Exploring the tradeoffs between programmability and efficiency in data-parallel accelerators. In ISCA’11.
- J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi. 2013. GPUWattch: Enabling energy optimizations in GPGPUs. In ISCA’13.
- llvmcuda. 2009. User Guide for NVPTX Back-end. Retrieved from http://llvm.org/docs/NVPTXUsage.html.
- A. Meixner, M. E. Bauer, and D. Sorin. 2007. Argus: Low-cost, comprehensive error detection in simple cores. In MICRO’07.
- J. Meng, D. Tarjan, and K. Skadron. 2010. Dynamic warp subdivision for integrated branch and memory divergence tolerance. In ISCA’10.
- J. Menon, M. De Kruijf, and K. Sankaralingam. 2012. iGPU: Exception support and speculative execution on GPUs. In ISCA’12.
- S. S. Muchnick. 1997. Advanced Compiler Design and Implementation. Morgan Kaufmann.
- J. Nakano, P. Montesinos, K. Gharachorloo, and J. Torrellas. 2006. ReViveI/O: Efficient handling of I/O in highly-available rollback-recovery servers. In HPCA’06.
- V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt. 2011. Improving GPU performance via large warps and two-level warp scheduling. In MICRO’11.
- S. Nomura, K. Sankaralingam, and R. Sankaralingam. 2011. A fast and highly accurate path delay emulation framework for logic-emulation of timing speculation. In ITC’11.
- nvprof. 2008. NVIDIA CUDA Profiler User Guide. Retrieved from http://docs.nvidia.com/cuda/profiler-users-guide/index.html.
- openrisc. 2010. OpenRISC Project. Retrieved from http://opencores.org/project,or1k.
- opensparc. 2006. OpenSPARC T1. Retrieved from http://www.opensparc.net.
- A. Pellegrini, K. Constantinides, D. Zhang, S. Sudhakar, V. Bertacco, and T. Austin. 2008. CrashTest: A fast high-fidelity FPGA-based resiliency analysis framework. In CICC’08.
- M. Prvulovic, Z. Zhang, and J. Torrellas. 2002. ReVive: Cost-effective architectural support for rollback recovery in shared-memory multiprocessors. In ISCA’02.
- P. Rech, C. Aguiar, R. Ferreira, C. Frost, and L. Carro. 2012. Neutron radiation test of graphic processing units. In IOLTS’12.
- M. Rhu and M. Erez. 2012. CAPRI: Prediction of compaction-adequacy for handling control-divergence in GPGPU architectures. In ISCA’12.
- T. G. Rogers, M. O’Connor, and T. M. Aamodt. 2012. Cache-conscious wavefront scheduling. In MICRO’12.
- R. M. Russell. 1978. The CRAY-1 computer system. Communications of the ACM 21, 1 (January 1978), 63--72.
- J. Sartori, B. Ahrens, and R. Kumar. 2012. Power balanced pipelines. In HPCA’12.
- J. W. Sim, A. Dasgupta, H. Kim, and R. Vuduc. 2012. A performance analysis framework for identifying performance benefits in GPGPU applications. In PPoPP’12.
- I. Singh, A. Shriraman, W. W. L. Fung, M. O’Connor, and T. M. Aamodt. 2013. Cache coherence for GPU architectures. In HPCA’13.
- B. J. Smith. 1981. Architecture and applications of the HEP multiprocessor computer system. In SPIE Real Time Signal Processing IV, 241--248.
- J. C. Smolens, B. T. Gold, B. Falsafi, and J. C. Hoe. 2006. Reunion: Complexity-effective multicore redundancy. In MICRO’06.
- D. J. Sorin, M. M. K. Martin, M. D. Hill, and D. A. Wood. 2002. SafetyNet: Improving the availability of shared memory multiprocessors with global checkpoint/recovery. In ISCA’02.
- J. Tan, N. Goswami, T. Li, and X. Fu. 2011. Analyzing soft-error vulnerability on GPGPU microarchitecture. In IISWC’11.
- R. Ubal, B. Jang, P. Mistry, D. Schaa, and D. Kaeli. 2012. Multi2Sim: A simulation framework for CPU-GPU computing. In PACT’12.
- W. J. van der Laan. 2010. Decuda SM 1.1 (G80) disassembler. Retrieved from https://github.com/laanwj/decuda.
- N. J. Wang and S. J. Patel. 2006. ReStore: Symptom-based soft error detection in microprocessors. IEEE Transactions on Dependable and Secure Computing 3, 3, 188--201. DOI:http://dx.doi.org/10.1109/TDSC.2006.40
- N. J. Wang, J. Quek, T. M. Rafacz, and S. J. Patel. 2004. Characterizing the effects of transient faults on a high-performance processor pipeline. In DSN’04.
- Y. Zhang, L. Peng, B. Li, J.-K. Peir, and J. Chen. 2011. Architecture comparisons between Nvidia and ATI GPUs: Computation parallelism and data communications. In IISWC’11.