Research Article | Public Access

Identifying Power-Efficient Multicore Cache Hierarchies via Reuse Distance Analysis

Published: 06 April 2016

Abstract

To enable performance improvements in a power-efficient manner, computer architects have been building CPUs that exploit greater amounts of thread-level parallelism. A key consideration in such CPUs is properly designing the on-chip cache hierarchy. Unfortunately, this can be hard to do, especially for CPUs with high core counts and large amounts of cache. The enormous design space formed by the combinatorial number of ways in which to organize the cache hierarchy makes it difficult to identify power-efficient configurations. Moreover, the problem is exacerbated by the slow speed of architectural simulation, which is the primary means for conducting such design space studies.

A powerful tool that can help architects optimize CPU cache hierarchies is reuse distance (RD) analysis. Recent work has extended uniprocessor RD techniques, introducing concurrent RD and private-stack RD profiling, to enable analysis of different types of caches in multicore CPUs. Once acquired, parallel locality profiles can predict the performance of numerous cache configurations, permitting highly efficient design space exploration. To date, existing work on multicore RD analysis has focused on developing the profiling techniques and assessing their accuracy. However, there has been no work on using RD analysis to optimize CPU performance or power consumption.
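To see why one locality profile can stand in for many cache configurations, consider the standard RD property: in a fully associative LRU cache of C blocks, a reference hits exactly when its reuse distance is less than C, so a single RD histogram yields predicted miss counts for any capacity. The sketch below illustrates this; the histogram values and candidate capacities are illustrative assumptions, not data from the article.

```python
# Minimal sketch (not the authors' profiler): estimating miss counts for
# arbitrary cache capacities from one reuse distance (RD) histogram.
# Property used: in a fully associative LRU cache of C blocks, a reference
# hits iff its reuse distance is < C; cold (first-touch) references always miss.

def predicted_misses(rd_histogram, capacity_blocks):
    """rd_histogram: dict mapping reuse distance -> number of references.
    Use float('inf') as the key for cold references."""
    return sum(count for rd, count in rd_histogram.items()
               if rd >= capacity_blocks)

# Illustrative (made-up) profile: reference counts at a few reuse distances.
profile = {4: 50_000, 64: 30_000, 512: 15_000, 8_192: 4_000, float('inf'): 1_000}

# The same profile prices out several candidate capacities at once.
for blocks in (256, 1_024, 16_384):
    print(f"{blocks:>6} blocks -> {predicted_misses(profile, blocks):>6} misses")
```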

This article investigates applying multicore RD analysis to identify the most power-efficient cache configurations for a multicore CPU. First, we develop analytical models that use the cache-miss counts from parallel locality profiles to estimate CPU performance and power consumption. Although future scalable CPUs will likely employ multithreaded (and even out-of-order) cores, our current study assumes single-threaded in-order cores to simplify the models, allowing us to focus on the cache hierarchy and our RD-based techniques. Second, to demonstrate the utility of our techniques, we apply our models to optimize a large-scale tiled CPU architecture with a two-level cache hierarchy. We show that the most power-efficient configuration varies considerably across different benchmarks, and that our locality profiles provide deep insights into why certain configurations are power efficient. We also show that picking the best configuration can provide significant gains, as there is a 2.01x power-efficiency spread across our tiled CPU design space. Finally, we validate the accuracy of our techniques using detailed simulation. Among several simulated configurations, our techniques can usually pick the most power-efficient configuration, or one that is very close to the best. In addition, across all simulated configurations, we can predict power efficiency with 15.2% error.
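To make the flow from predicted miss counts to a power-efficiency estimate concrete, here is a hypothetical back-of-the-envelope model in the same spirit: an in-order core whose miss latencies add directly to execution time, plus a simple per-event energy sum. All latencies, per-event energies, static power, the candidate configurations, and the instructions-per-joule efficiency metric are assumptions for illustration; they are not the article's calibrated models or results.

```python
# Hypothetical sketch of an analytical performance/power model driven by
# predicted miss counts (in the spirit of the article's approach, but with
# made-up parameters). For a single-threaded in-order core, L1 and L2 miss
# latencies add directly to the cycle count; energy is a per-event sum plus
# static power over the run time.

def evaluate(name, instructions, l1_misses, l2_misses,
             freq_hz=1e9, base_cpi=1.0,
             l2_latency=10, mem_latency=200,            # cycles (assumed)
             e_inst=0.1e-9, e_l2=0.5e-9, e_mem=5e-9,    # joules per event (assumed)
             p_static=0.5):                             # watts (assumed)
    cycles = instructions * base_cpi + l1_misses * l2_latency + l2_misses * mem_latency
    time_s = cycles / freq_hz
    energy_j = (instructions * e_inst + l1_misses * e_l2 + l2_misses * e_mem
                + p_static * time_s)
    # Instructions per joule doubles as performance per watt (ops/s per W).
    return name, time_s, energy_j, instructions / energy_j

# Rank two hypothetical tile configurations whose miss counts were predicted
# from the locality profiles (all numbers invented for illustration).
configs = [evaluate("32KB L1 / 512KB L2 slice", 1_000_000, 20_000, 8_000),
           evaluate("64KB L1 / 256KB L2 slice", 1_000_000, 12_000, 15_000)]
best = max(configs, key=lambda r: r[-1])
print("most power-efficient candidate:", best[0])
```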



• Published in

  ACM Transactions on Computer Systems, Volume 34, Issue 1 (April 2016), 91 pages
  ISSN: 0734-2071
  EISSN: 1557-7333
  DOI: 10.1145/2912578

        Copyright © 2016 ACM


        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 6 April 2016
        • Accepted: 1 November 2015
        • Revised: 1 October 2015
        • Received: 1 October 2014
Published in TOCS Volume 34, Issue 1
