Enabling GPGPU Low-Level Hardware Explorations with MIAOW: An Open-Source RTL Implementation of a GPGPU

Published: 24 June 2015

Abstract

Graphics processing unit (GPU)-based general-purpose computing is developing as a viable alternative to CPU-based computing in many domains. Today’s tools for GPU analysis include simulators such as GPGPU-Sim, Multi2Sim, and Barra. While useful for modeling first-order effects, these tools do not provide a detailed view of GPU microarchitecture and physical design. Further, as GPGPU research evolves, design ideas and modifications demand detailed estimates of their impact on overall area and power. Fueled by this need, we introduce MIAOW (Many-core Integrated Accelerator Of Wisconsin), an open-source RTL implementation of the AMD Southern Islands GPGPU ISA, capable of running unmodified OpenCL-based applications. We present our design, motivated by our goals of creating a realistic, flexible, OpenCL-compatible GPGPU capable of emulating a full system. We first explore whether MIAOW is realistic and then use four case studies to show that MIAOW enables the following: a physical-design perspective on “traditional” microarchitecture, new types of research exploration, and validation/calibration of simulator-based characterization of hardware. The findings and ideas are contributions in their own right, in addition to MIAOW’s utility as a tool for others’ research.

    • Published in

      ACM Transactions on Architecture and Code Optimization  Volume 12, Issue 2
      July 2015
      410 pages
      ISSN:1544-3566
      EISSN:1544-3973
      DOI:10.1145/2775085

      Copyright © 2015 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 24 June 2015
      • Accepted: 1 April 2015
      • Revised: 1 March 2015
      • Received: 1 September 2014

      Qualifiers

      • research-article
      • Research
      • Refereed
