Research Article • Open Access

ITAP: Idle-Time-Aware Power Management for GPU Execution Units

Published: 27 February 2019

Abstract

Graphics Processing Units (GPUs) are widely used as the accelerator of choice for applications with massively data-parallel tasks. However, recent studies show that GPUs suffer heavily from resource underutilization, which, combined with their large static power consumption, imposes a significant power overhead. The execution units, among the most power-hungry components of a GPU, frequently experience idleness when (1) an underutilized warp is issued to the execution units, leading to partial lane idleness, and (2) no active warp can be issued for execution due to warp stalls (e.g., waiting for memory accesses or synchronization). Although large in total, the idle time of the execution units comes from short but frequent stall periods, leaving little opportunity for common power-saving techniques such as power-gating.

In this article, we propose ITAP, a novel idle-time-aware power management technique that aims to effectively reduce the static energy consumption of GPU execution units. By taking advantage of different power management techniques (i.e., power-gating and different levels of voltage scaling), ITAP employs three static power reduction modes with different overheads and static power reduction capabilities. ITAP estimates the idle period length of the execution units using prediction and peek-ahead techniques in a synergistic way, and then applies the most appropriate static power reduction mode based on the estimated idle period length. We design ITAP to be either power-aggressive or performance-aggressive, not both at the same time. Our experimental results on several workloads show that the power-aggressive design of ITAP outperforms the state-of-the-art solution by an average of 27.6% in terms of static energy savings, with less than 2.1% performance overhead. In contrast, the performance-aggressive design of ITAP improves static energy savings by an average of 16.9% while keeping GPU performance almost unaffected (i.e., at most 0.4% performance overhead) compared to the state-of-the-art static energy-saving mechanism.
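The abstract states that ITAP selects among three static power reduction modes according to the estimated idle-period length. As a rough illustration of that idea only (not the paper's actual mechanism), the C sketch below picks the deepest mode whose transition cost a given idle length can amortize; the mode names and break-even thresholds are assumptions made for this example.

/*
 * Minimal sketch of ITAP-style mode selection (illustrative only):
 * given an estimated idle-period length for an execution unit, choose
 * the deepest static-power-reduction mode whose wakeup/transition cost
 * is amortized over that idle period. Mode names and break-even values
 * are hypothetical, not the paper's parameters.
 */
#include <stdio.h>

typedef enum {
    MODE_ACTIVE,        /* idle period too short to act on            */
    MODE_VS_LIGHT,      /* shallow voltage scaling: cheap to wake up   */
    MODE_VS_DEEP,       /* deeper voltage scaling: larger savings      */
    MODE_POWER_GATE     /* power-gating: largest savings, costliest    */
} pm_mode_t;

/* Hypothetical break-even idle lengths, in cycles. */
#define BREAKEVEN_VS_LIGHT   4
#define BREAKEVEN_VS_DEEP   16
#define BREAKEVEN_PGATE     64

/* Pick the most aggressive mode that the estimated idle length justifies. */
static pm_mode_t select_mode(unsigned estimated_idle_cycles)
{
    if (estimated_idle_cycles >= BREAKEVEN_PGATE)
        return MODE_POWER_GATE;
    if (estimated_idle_cycles >= BREAKEVEN_VS_DEEP)
        return MODE_VS_DEEP;
    if (estimated_idle_cycles >= BREAKEVEN_VS_LIGHT)
        return MODE_VS_LIGHT;
    return MODE_ACTIVE;
}

int main(void)
{
    unsigned samples[] = { 2, 10, 30, 200 };
    for (unsigned i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
        printf("idle %3u cycles -> mode %d\n", samples[i],
               (int)select_mode(samples[i]));
    return 0;
}

A power-aggressive variant would favor deeper modes on borderline estimates, whereas a performance-aggressive variant would fall back to shallower modes when the estimate is uncertain, mirroring the two design points described above.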



Published in

ACM Transactions on Architecture and Code Optimization, Volume 16, Issue 1
March 2019, 157 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/3313806

        Copyright © 2019 ACM


        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 27 February 2019
        • Accepted: 1 November 2018
        • Revised: 1 October 2018
        • Received: 1 June 2018
Published in ACM TACO Volume 16, Issue 1
