Research Article | Public Access

Identifying Power-Efficient Multicore Cache Hierarchies via Reuse Distance Analysis

Published: 06 April 2016

Abstract

To enable performance improvements in a power-efficient manner, computer architects have been building CPUs that exploit greater amounts of thread-level parallelism. A key consideration in such CPUs is properly designing the on-chip cache hierarchy. Unfortunately, this can be hard to do, especially for CPUs with high core counts and large amounts of cache. The enormous design space formed by the combinatorial number of ways in which to organize the cache hierarchy makes it difficult to identify power-efficient configurations. Moreover, the problem is exacerbated by the slow speed of architectural simulation, which is the primary means for conducting such design space studies.

A powerful tool that can help architects optimize CPU cache hierarchies is reuse distance (RD) analysis. Recent work has extended uniprocessor RD techniques, introducing concurrent RD and private-stack RD profiling, to enable analysis of different types of caches in multicore CPUs. Once acquired, parallel locality profiles can predict the performance of numerous cache configurations, permitting highly efficient design space exploration. To date, existing work on multicore RD analysis has focused on developing the profiling techniques and assessing their accuracy. However, there has been no work on using RD analysis to optimize CPU performance or power consumption.
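To see why one locality profile can stand in for many cache configurations, consider the standard RD property: in a fully associative LRU cache of C blocks, a reference hits exactly when its reuse distance is less than C, so a single RD histogram yields predicted miss counts for any capacity. The sketch below illustrates this; the histogram values and candidate capacities are illustrative assumptions, not data from the article.

```python
# Minimal sketch (not the authors' profiler): estimating miss counts for
# arbitrary cache capacities from one reuse distance (RD) histogram.
# Property used: in a fully associative LRU cache of C blocks, a reference
# hits iff its reuse distance is < C; cold (first-touch) references always miss.

def predicted_misses(rd_histogram, capacity_blocks):
    """rd_histogram: dict mapping reuse distance -> number of references.
    Use float('inf') as the key for cold references."""
    return sum(count for rd, count in rd_histogram.items()
               if rd >= capacity_blocks)

# Illustrative (made-up) profile: reference counts at a few reuse distances.
profile = {4: 50_000, 64: 30_000, 512: 15_000, 8_192: 4_000, float('inf'): 1_000}

# The same profile prices out several candidate capacities at once.
for blocks in (256, 1_024, 16_384):
    print(f"{blocks:>6} blocks -> {predicted_misses(profile, blocks):>6} misses")
```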

This article investigates applying multicore RD analysis to identify the most power-efficient cache configurations for a multicore CPU. First, we develop analytical models that use the cache-miss counts from parallel locality profiles to estimate CPU performance and power consumption. Although future scalable CPUs will likely employ multithreaded (and even out-of-order) cores, our current study assumes single-threaded in-order cores to simplify the models, allowing us to focus on the cache hierarchy and our RD-based techniques. Second, to demonstrate the utility of our techniques, we apply our models to optimize a large-scale tiled CPU architecture with a two-level cache hierarchy. We show that the most power-efficient configuration varies considerably across different benchmarks, and that our locality profiles provide deep insights into why certain configurations are power efficient. We also show that picking the best configuration can provide significant gains, as there is a 2.01x power-efficiency spread across our tiled CPU design space. Finally, we validate the accuracy of our techniques using detailed simulation. Among several simulated configurations, our techniques can usually pick the most power-efficient configuration, or one that is very close to the best. In addition, across all simulated configurations, we can predict power efficiency with 15.2% error.
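To make the flow from predicted miss counts to a power-efficiency estimate concrete, here is a hypothetical back-of-the-envelope model in the same spirit: an in-order core whose miss latencies add directly to execution time, plus a simple per-event energy sum. All latencies, per-event energies, static power, the candidate configurations, and the instructions-per-joule efficiency metric are assumptions for illustration; they are not the article's calibrated models or results.

```python
# Hypothetical sketch of an analytical performance/power model driven by
# predicted miss counts (in the spirit of the article's approach, but with
# made-up parameters). For a single-threaded in-order core, L1 and L2 miss
# latencies add directly to the cycle count; energy is a per-event sum plus
# static power over the run time.

def evaluate(name, instructions, l1_misses, l2_misses,
             freq_hz=1e9, base_cpi=1.0,
             l2_latency=10, mem_latency=200,            # cycles (assumed)
             e_inst=0.1e-9, e_l2=0.5e-9, e_mem=5e-9,    # joules per event (assumed)
             p_static=0.5):                             # watts (assumed)
    cycles = instructions * base_cpi + l1_misses * l2_latency + l2_misses * mem_latency
    time_s = cycles / freq_hz
    energy_j = (instructions * e_inst + l1_misses * e_l2 + l2_misses * e_mem
                + p_static * time_s)
    # Instructions per joule doubles as performance per watt (ops/s per W).
    return name, time_s, energy_j, instructions / energy_j

# Rank two hypothetical tile configurations whose miss counts were predicted
# from the locality profiles (all numbers invented for illustration).
configs = [evaluate("32KB L1 / 512KB L2 slice", 1_000_000, 20_000, 8_000),
           evaluate("64KB L1 / 256KB L2 slice", 1_000_000, 12_000, 15_000)]
best = max(configs, key=lambda r: r[-1])
print("most power-efficient candidate:", best[0])
```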



• Published in

  ACM Transactions on Computer Systems, Volume 34, Issue 1 (April 2016), 91 pages
  ISSN: 0734-2071
  EISSN: 1557-7333
  DOI: 10.1145/2912578

        Copyright © 2016 ACM


        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 6 April 2016
        • Accepted: 1 November 2015
        • Revised: 1 October 2015
        • Received: 1 October 2014
Published in TOCS Volume 34, Issue 1
