Abstract
Graphics Processing Units (GPUs) are widely used as the accelerator of choice for applications with massively data-parallel tasks. However, recent studies show that GPUs suffer heavily from resource underutilization, which, combined with their large static power consumption, imposes a significant power overhead. The execution units, one of the most power-hungry components of a GPU, are frequently idle in two cases: (1) an underutilized warp is issued to the execution units, leading to partial lane idleness, and (2) no active warp is available to issue for execution due to warp stalls (e.g., waiting for memory access and synchronization). Although large in total, the idle time of execution units actually comes from short but frequent stalls, leaving little potential for common power saving techniques, such as power-gating.
In this article, we propose ITAP, a novel idle-time-aware power management technique, which aims to effectively reduce the static energy consumption of GPU execution units. By taking advantage of different power management techniques (i.e., power-gating and different levels of voltage scaling), ITAP employs three static power reduction modes with different overheads and capabilities of static power reduction. ITAP estimates the idle period length of execution units using prediction and peek-ahead techniques in a synergistic way and then applies the most appropriate static power reduction mode based on the estimated idle period length. We design ITAP to be either power-aggressive or performance-aggressive, but not both at the same time. Our experimental results on several workloads show that the power-aggressive design of ITAP outperforms the state-of-the-art solution by an average of 27.6% in terms of static energy savings, with less than 2.1% performance overhead. The performance-aggressive design of ITAP, in contrast, improves the static energy savings by an average of 16.9% compared to the state-of-the-art static energy savings mechanism, while keeping the GPU performance almost unaffected (i.e., up to 0.4% performance overhead).
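The mode-selection step described above can be illustrated with a minimal Python sketch. It assumes that the three modes are power-gating plus two levels of voltage scaling, that deeper modes only pay off for longer idle periods, and that the idle-length estimate is already available. The mode names, the `min_idle_cycles` thresholds, and the `select_mode` function are hypothetical placeholders chosen for illustration, not values or interfaces taken from the article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mode:
    name: str
    # Hypothetical break-even point: the shortest idle period (in cycles)
    # for which entering this mode is assumed to be worthwhile.
    min_idle_cycles: int


# Ordered from the deepest mode (largest saving, largest overhead)
# to the shallowest. All thresholds are illustrative assumptions.
MODES = (
    Mode("power_gating", 14),
    Mode("deep_voltage_scaling", 6),
    Mode("light_voltage_scaling", 2),
)


def select_mode(estimated_idle_cycles: int) -> Optional[Mode]:
    """Return the deepest mode whose threshold the estimate meets.

    Returns None when the idle period is too short for any mode,
    in which case the execution unit simply stays fully powered.
    """
    for mode in MODES:
        if estimated_idle_cycles >= mode.min_idle_cycles:
            return mode
    return None


if __name__ == "__main__":
    for cycles in (1, 3, 8, 20):
        chosen = select_mode(cycles)
        print(cycles, chosen.name if chosen else "stay_active")
```

In this sketch, a power-aggressive variant would correspond to lower thresholds (entering deep modes more readily), while a performance-aggressive variant would correspond to higher ones; that mapping is an interpretation of the abstract rather than a stated design detail.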
Index Terms
- ITAP: Idle-Time-Aware Power Management for GPU Execution Units