Enabling GPGPU Low-Level Hardware Explorations with MIAOW: An Open-Source RTL Implementation of a GPGPU

Abstract
Graphics processing unit (GPU)-based general-purpose computing is developing as a viable alternative to CPU-based computing in many domains. Today’s tools for GPU analysis include simulators such as GPGPU-Sim, Multi2Sim, and Barra. While useful for modeling first-order effects, these tools do not provide a detailed view of GPU microarchitecture and physical design. Further, as GPGPU research evolves, design ideas and modifications demand detailed estimates of their impact on overall area and power. Fueled by this need, we introduce MIAOW (Many-core Integrated Accelerator Of Wisconsin), an open-source RTL implementation of the AMD Southern Islands GPGPU ISA, capable of running unmodified OpenCL-based applications. We present a design motivated by our goals of creating a realistic, flexible, OpenCL-compatible GPGPU capable of emulating a full system. We first explore whether MIAOW is realistic and then use four case studies to show that MIAOW enables the following: a physical design perspective on “traditional” microarchitecture, new types of research exploration, and validation/calibration of simulator-based characterization of hardware. The findings and ideas are contributions in their own right, in addition to MIAOW’s utility as a tool for others’ research.
- 2009. Barrasim: NVIDIA G80 Functional Simulator. Retrieved from https://code.google.com/p/barra-sim/.
- 2012a. AMD Graphics Cores Next Architecture. Retrieved from http://www.amd.com/la/Documents/GCN_Architecture_whitepaper.pdf.
- 2012b. Reference Guide: Southern Islands Series Instruction Set Architecture. Retrieved from http://developer.amd.com/wordpress/media/2012/10/AMD_Southern_Islands_Instruction_Set_Architecture.pdf.
- 2013. AMD APP 3.0 SDK, Kernels and Documentation. Retrieved from http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk.
- M. Abdel-Majeed and M. Annavaram. 2013. Warped register file: A power efficient register file for GPGPUs. In HPCA’13.
- A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt. 2009. Analyzing CUDA workloads using a detailed GPU simulator. In ISPASS’09.
- R. Balasubramanian and K. Sankaralingam. 2013. Virtually-aged sampling DMR: Unifying circuit failure prediction and circuit failure detection. In Proceedings of the 46th International Symposium on Microarchitecture (MICRO’13).
- R. Balasubramanian and K. Sankaralingam. 2014. Understanding the impact of gate-level physical reliability effects on whole program execution. In Proceedings of the 20th International Symposium on High Performance Computer Architecture (HPCA’14).
- P. Bernardi, M. Grosso, and M. S. Reorda. 2007. Hardware-accelerated path-delay fault grading of functional test programs for processor-based systems. In GLSVLSI’07.
- D. Bouvier and B. Sander. 2014. Applying AMD’s Kaveri APU for heterogeneous computing. In HotChips 2014.
- S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron. 2009. Rodinia: A benchmark suite for heterogeneous computing. In Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC’09). IEEE Computer Society, Washington, DC, 44--54.
- J. Y. Chen. 2009. GPU technology trends and future requirements. In IEDM’09.
- N. K. Choudhary, S. V. Wadhavkar, T. A. Shah, H. Mayukh, J. Gandhi, B. H. Dwiel, S. Navada, H. H. Najaf-abadi, and E. Rotenberg. 2011. FabScalar: Composing synthesizable RTL designs of arbitrary cores within a canonical superscalar template. In ISCA’11.
- V. M. del Barrio, C. Gonzalez, J. Roca, A. Fernandez, and R. Espasa. 2006. ATTILA: A cycle-level execution-driven simulator for modern GPU architectures. In ISPASS’06.
- G. Diamos, A. Kerr, S. Yalamanchili, and N. Clark. 2010. Ocelot: A dynamic compiler for bulk-synchronous applications in heterogeneous systems. In PACT’10.
- D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin, K. Flautner, and T. Mudge. 2003. Razor: A low-power pipeline based on circuit-level timing speculation. In MICRO’03.
- M. Fried. 2012. GPGPU Architecture Comparison of ATI and NVIDIA GPUs. Retrieved from http://www.microway.com/pdfs/GPGPU_Architecture_and_Performance_Comparison.pdf.
- W. W. L. Fung and T. M. Aamodt. 2011. Thread block compaction for efficient SIMT control flow. In Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture (HPCA’11). IEEE Computer Society, Washington, DC, 25--36.
- J. Gaisler. 2001. LEON SPARC Processor.
- B. Greskamp, L. Wan, U. R. Karpuzcu, J. J. Cook, J. Torrellas, D. Chen, and C. Zilles. 2009. Blueshift: Designing processors for timing speculation from the ground up. In HPCA’09.
- B. A. Hechtman and D. J. Sorin. 2013. Exploring memory consistency for massively-threaded throughput-oriented processors. In ISCA’13.
- S. Hong and H. Kim. 2009. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In ISCA’09.
- S. Hong and H. Kim. 2010. An integrated GPU power and performance model. In ISCA’10.
- H. Jeon and M. Annavaram. 2012. Warped-DMR: Light-weight error detection for GPGPU. In MICRO’12.
- A. Jog, O. Kayiran, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das. 2013a. Orchestrated scheduling and prefetching for GPGPUs. In ISCA’13.
- A. Jog, O. Kayiran, N. C. Nachiappan, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das. 2013b. OWL: Cooperative thread array aware scheduling techniques for improving GPGPU performance. In ASPLOS’13.
- H. Kim, R. Vuduc, S. Baghsorkhi, J. Choi, and W. Hwu. 2012. Performance Analysis and Tuning for General Purpose Graphics Processing Units (GPGPUs). Morgan & Claypool.
- Y. Lee, R. Avizienis, A. Bishara, R. Xia, D. Lockhart, C. Batten, and K. Asanović. 2011. Exploring the tradeoffs between programmability and efficiency in data-parallel accelerators. In ISCA’11.
- J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi. 2013. GPUWattch: Enabling energy optimizations in GPGPUs. In ISCA’13.
- llvmcuda. 2009. User Guide for NVPTX Back-end. Retrieved from http://llvm.org/docs/NVPTXUsage.html.
- A. Meixner, M. E. Bauer, and D. Sorin. 2007. Argus: Low-cost, comprehensive error detection in simple cores. In MICRO’07.
- J. Meng, D. Tarjan, and K. Skadron. 2010. Dynamic warp subdivision for integrated branch and memory divergence tolerance. In ISCA’10.
- J. Menon, M. De Kruijf, and K. Sankaralingam. 2012. iGPU: Exception support and speculative execution on GPUs. In ISCA’12.
- S. S. Muchnick. 1997. Advanced Compiler Design and Implementation. Morgan Kaufmann.
- J. Nakano, P. Montesinos, K. Gharachorloo, and J. Torrellas. 2006. ReViveI/O: Efficient handling of I/O in highly-available rollback-recovery servers. In HPCA’06.
- V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt. 2011. Improving GPU performance via large warps and two-level warp scheduling. In MICRO’11.
- S. Nomura, K. Sankaralingam, and R. Sankaralingam. 2011. A fast and highly accurate path delay emulation framework for logic-emulation of timing speculation. In ITC’11.
- nvprof. 2008. NVIDIA CUDA Profiler User Guide. Retrieved from http://docs.nvidia.com/cuda/profiler-users-guide/index.html.
- openrisc. 2010. OpenRISC Project. Retrieved from http://opencores.org/project,or1k.
- opensparc. 2006. OpenSPARC T1. Retrieved from http://www.opensparc.net.
- A. Pellegrini, K. Constantinides, D. Zhang, S. Sudhakar, V. Bertacco, and T. Austin. 2008. CrashTest: A fast high-fidelity FPGA-based resiliency analysis framework. In CICC’08.
- M. Prvulovic, Z. Zhang, and J. Torrellas. 2002. ReVive: Cost-effective architectural support for rollback recovery in shared-memory multiprocessors. In ISCA’02.
- P. Rech, C. Aguiar, R. Ferreira, C. Frost, and L. Carro. 2012. Neutron radiation test of graphic processing units. In IOLTS’12.
- M. Rhu and M. Erez. 2012. CAPRI: Prediction of compaction-adequacy for handling control-divergence in GPGPU architectures. In ISCA’12.
- T. G. Rogers, M. O’Connor, and T. M. Aamodt. 2012. Cache-conscious wavefront scheduling. In MICRO’12.
- R. M. Russell. 1978. The CRAY-1 computer system. Communications of the ACM 21, 1 (January 1978), 63--72.
- J. Sartori, B. Ahrens, and R. Kumar. 2012. Power balanced pipelines. In HPCA’12.
- J. W. Sim, A. Dasgupta, H. Kim, and R. Vuduc. 2012. A performance analysis framework for identifying performance benefits in GPGPU applications. In PPoPP’12.
- I. Singh, A. Shriraman, W. W. L. Fung, M. O’Connor, and T. M. Aamodt. 2013. Cache coherence for GPU architectures. In HPCA’13.
- B. J. Smith. 1981. Architecture and applications of the HEP multiprocessor computer system. In SPIE Real Time Signal Processing IV, 241--248.
- J. C. Smolens, B. T. Gold, B. Falsafi, and J. C. Hoe. 2006. Reunion: Complexity-effective multicore redundancy. In MICRO’06.
- D. J. Sorin, M. M. K. Martin, M. D. Hill, and D. A. Wood. 2002. SafetyNet: Improving the availability of shared memory multiprocessors with global checkpoint/recovery. In ISCA’02.
- J. Tan, N. Goswami, T. Li, and X. Fu. 2011. Analyzing soft-error vulnerability on GPGPU microarchitecture. In IISWC’11.
- R. Ubal, B. Jang, P. Mistry, D. Schaa, and D. Kaeli. 2012. Multi2Sim: A simulation framework for CPU-GPU computing. In PACT’12.
- W. J. van der Laan. 2010. Decuda SM 1.1 (G80) disassembler. Retrieved from https://github.com/laanwj/decuda.
- N. J. Wang and S. J. Patel. 2006. ReStore: Symptom-based soft error detection in microprocessors. IEEE Transactions on Dependable and Secure Computing 3, 3, 188--201. DOI:http://dx.doi.org/10.1109/TDSC.2006.40
- N. J. Wang, J. Quek, T. M. Rafacz, and S. J. Patel. 2004. Characterizing the effects of transient faults on a high-performance processor pipeline. In DSN’04.
- Y. Zhang, L. Peng, B. Li, J.-K. Peir, and J. Chen. 2011. Architecture comparisons between Nvidia and ATI GPUs: Computation parallelism and data communications. In IISWC’11.