Research Article • Open Access

ITAP: Idle-Time-Aware Power Management for GPU Execution Units

Published: 27 February 2019

Abstract

Graphics Processing Units (GPUs) are widely used as the accelerator of choice for applications with massively data-parallel tasks. However, recent studies show that GPUs suffer heavily from resource underutilization, which, combined with their large static power consumption, imposes a significant power overhead. The execution units, among the most power-hungry components of a GPU, frequently experience idleness when (1) an underutilized warp is issued to the execution units, leading to partial lane idleness, and (2) no active warp can be issued for execution due to warp stalls (e.g., waiting for memory accesses or synchronization). Although large in total, the idle time of the execution units comes from short but frequent stall periods, leaving little opportunity for common power-saving techniques such as power-gating.

In this article, we propose ITAP, a novel idle-time-aware power management technique that aims to effectively reduce the static energy consumption of GPU execution units. By taking advantage of different power management techniques (i.e., power-gating and different levels of voltage scaling), ITAP employs three static power reduction modes with different overheads and static power reduction capabilities. ITAP estimates the idle period length of the execution units using prediction and peek-ahead techniques in a synergistic way, and then applies the most appropriate static power reduction mode based on the estimated idle period length. We design ITAP to be either power-aggressive or performance-aggressive, not both at the same time. Our experimental results on several workloads show that the power-aggressive design of ITAP outperforms the state-of-the-art solution by an average of 27.6% in terms of static energy savings, with less than 2.1% performance overhead. In contrast, the performance-aggressive design of ITAP improves static energy savings by an average of 16.9% while keeping GPU performance almost unaffected (i.e., at most 0.4% performance overhead) compared to the state-of-the-art static energy-saving mechanism.
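The abstract states that ITAP selects among three static power reduction modes according to the estimated idle-period length. As a rough illustration of that idea only (not the paper's actual mechanism), the C sketch below picks the deepest mode whose transition cost a given idle length can amortize; the mode names and break-even thresholds are assumptions made for this example.

/*
 * Minimal sketch of ITAP-style mode selection (illustrative only):
 * given an estimated idle-period length for an execution unit, choose
 * the deepest static-power-reduction mode whose wakeup/transition cost
 * is amortized over that idle period. Mode names and break-even values
 * are hypothetical, not the paper's parameters.
 */
#include <stdio.h>

typedef enum {
    MODE_ACTIVE,        /* idle period too short to act on            */
    MODE_VS_LIGHT,      /* shallow voltage scaling: cheap to wake up   */
    MODE_VS_DEEP,       /* deeper voltage scaling: larger savings      */
    MODE_POWER_GATE     /* power-gating: largest savings, costliest    */
} pm_mode_t;

/* Hypothetical break-even idle lengths, in cycles. */
#define BREAKEVEN_VS_LIGHT   4
#define BREAKEVEN_VS_DEEP   16
#define BREAKEVEN_PGATE     64

/* Pick the most aggressive mode that the estimated idle length justifies. */
static pm_mode_t select_mode(unsigned estimated_idle_cycles)
{
    if (estimated_idle_cycles >= BREAKEVEN_PGATE)
        return MODE_POWER_GATE;
    if (estimated_idle_cycles >= BREAKEVEN_VS_DEEP)
        return MODE_VS_DEEP;
    if (estimated_idle_cycles >= BREAKEVEN_VS_LIGHT)
        return MODE_VS_LIGHT;
    return MODE_ACTIVE;
}

int main(void)
{
    unsigned samples[] = { 2, 10, 30, 200 };
    for (unsigned i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
        printf("idle %3u cycles -> mode %d\n", samples[i],
               (int)select_mode(samples[i]));
    return 0;
}

A power-aggressive variant would favor deeper modes on borderline estimates, whereas a performance-aggressive variant would fall back to shallower modes when the estimate is uncertain, mirroring the two design points described above.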



Published in

ACM Transactions on Architecture and Code Optimization, Volume 16, Issue 1
March 2019, 157 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/3313806

        Copyright © 2019 ACM


        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 27 February 2019
        • Accepted: 1 November 2018
        • Revised: 1 October 2018
        • Received: 1 June 2018
Published in ACM TACO Volume 16, Issue 1
