skip to main content
10.1145/2485922.2485949acmotherconferencesArticle/Chapter ViewAbstractPublication PagesiscaConference Proceedingsconference-collections
research-article

A hardware evaluation of cache partitioning to improve utilization and energy-efficiency while preserving responsiveness

Published:23 June 2013Publication History

ABSTRACT

Computing workloads often contain a mix of interactive, latency-sensitive foreground applications and recurring background computations. To guarantee responsiveness, interactive and batch applications are often run on disjoint sets of resources, but this incurs additional energy, power, and capital costs. In this paper, we evaluate the potential of hardware cache partitioning mechanisms and policies to improve efficiency by allowing background applications to run simultaneously with interactive foreground applications, while avoiding degradation in interactive responsiveness. We evaluate these tradeoffs using commercial x86 multicore hardware that supports cache partitioning, and find that real hardware measurements with full applications provide different observations than past simulation-based evaluations. Co-scheduling applications without LLC partitioning leads to a 10% energy improvement and average throughput improvement of 54% compared to running tasks separately, but can result in foreground performance degradation of up to 34% with an average of 6%. With optimal static LLC partitioning, the average energy improvement increases to 12% and the average throughput improvement to 60%, while the worst case slowdown is reduced noticeably to 7% with an average slowdown of only 2%. We also evaluate a practical low-overhead dynamic algorithm to control partition sizes, and are able to realize the potential performance guarantees of the optimal static approach, while increasing background throughput by an additional 19%.

References

  1. Apple Inc. iOS App Programming Guide. http://developer.apple.com/library/ios/DOCUMENTATION/iPhone/Conceptual/iPhoneOSProgrammingGuide/iPhoneAppProgrammingGuide.pdf.Google ScholarGoogle Scholar
  2. L. A. Barroso and U. Hölzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. S. Beamer, K. Asanovic, and D. A. Patterson. Searching for a parent instead of fighting over children: A fast breadth-first search implementation for graph500. Technical Report UCB/EECS-2011-117, EECS Department, University of California, Berkeley, Nov 2011.Google ScholarGoogle Scholar
  4. C. Bienia. Benchmarking Modern Multiprocessors. PhD thesis, Princeton University, January 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. S. Bird, B. Smith, K. Asanović, and D. A. Patterson. PACORA: Dynamically Optimizing Resource Allocations for Interactive Applications. Technical report, University of California, Berkeley, April 2013.Google ScholarGoogle Scholar
  6. S. M. Blackburn et al. The DaCapo benchmarks: Java benchmarking development and analysis. In OOPSLA, pages 169--190, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. F. J. Cazorla, P. M. W. Knijnenburg, R. Sakellariou, E. Fernandez, A. Ramirez, and M. Valero. Predictable Performance in SMT Processors: Synergy between the OS and SMTs. IEEE Trans. Computers, 55(7):785--799, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. D. Chandra, F. Guo, S. Kim, and Y. Solihin. Predicting inter-thread cache contention on a chip multi-processor architecture. In HPCA, pages 340--351, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. S. Cho and L. Jin. Managing distributed, shared l2 caches through os-level page allocation. In MICRO, pages 455--468, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. J. Chong, G. Friedland, A. Janin, N. Morgan, and C. Oei. Opportunities and challenges of parallelizing speech recognition. In HotPar, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. S. Eranian. Perfmon2: a flexible performance monitoring interface for linux. In Ottawa Linux Symposium, pages 269--288, 2006.Google ScholarGoogle Scholar
  12. H. Esmaeilzadeh, T. Cao, X. Yang, S. M. Blackburn, and K. S. McKinley. Looking back and looking forward: power, performance, and upheaval. Commun. ACM, 55(7):105--114, July 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. A. Fedorova, S. Blagodurov, and S. Zhuravlev. Managing contention for shared resources on multicore processors. Commun. ACM, 53(2):49--57, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. M. Ferdman, A. Adileh, Y. O. Koçberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi. Clearing the clouds: a study of emerging scale-out workloads on modern hardware. In ASPLOS, pages 37--48, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. L. Gidra, G. Thomas, J. Sopena, and M. Shapiro. Assessing the scalability of garbage collectors on many cores. In PLOS, pages 1--5, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. F. Guo, Y. Solihin, L. Zhao, and R. Iyer. A framework for providing quality of service in chip multi-processors. In MICRO, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. J. L. Hennessy and D. A. Patterson. Computer Architecture - A Quantitative Approach (5. ed.). Morgan Kaufmann, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Intel Corp. Intel 64 and ia-32 architectures optimization reference manual, June 2011.Google ScholarGoogle Scholar
  19. Intel Corp. Intel 64 and ia-32 architectures software developer's manual, March 2012.Google ScholarGoogle Scholar
  20. R. R. Iyer, L. Zhao, F. Guo, R. Illikkal, S. Makineni, D. Newell, Y. Solihin, L. R. Hsu, and S. K. Reinhardt. QoS policies and architecture for cache/memory in CMP platforms. In SIGMETRICS, pages 25--36, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. A. Jaleel. Memory characterization of workloads using instrumentation-driven simulation -- a pin-based memory characterization of the spec cpu2000 and spec cpu2006 benchmark suites. Technical report, VSSAD, Intel Corporation, 2007.Google ScholarGoogle Scholar
  22. S. Kamil. Stencil probe, 2012. http://www.cs.berkeley.edu/~skamil/projects/stencilprobe/.Google ScholarGoogle Scholar
  23. J. W. Lee, M. C. Ng, and K. Asanovic. Globally-synchronized frames for guaranteed quality-of-service in on-chip networks. In ISCA, pages 89--100, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. J. Lin, Q. Lu, X. Ding, Z. Zhang, X. Zhang, and P. Sadayappan. Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems. In HPCA, pages 367--378, feb. 2008.Google ScholarGoogle Scholar
  25. L. A. Meyerovich, M. E. Torok, E. Atkinson, and R. Bodik. Parallel schedule synthesis for attribute grammars. In PPoPP, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. M. Moreto, F. J. Cazorla, A. Ramirez, R. Sakellariou, and M. Valero. FlexDCP: a QoS framework for CMP architectures. SIGOPS Oper. Syst. Rev., 43(2):86--96, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Perfmon2 webpage. perfmon2.sourceforge.net/.Google ScholarGoogle Scholar
  28. A. Phansalkar, A. Joshi, and L. K. John. Analysis of redundancy and application balance in the spec cpu2006 benchmark suite. In ISCA, pages 412--423, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. M. K. Qureshi and Y. N. Patt. Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches. In MICRO, pages 423--432, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. D. Sanchez and C. Kozyrakis. Vantage: Scalable and Efficient Fine-Grain Cache Partitioning. In ISCA), June 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. E. Schurman and J. Brutlag. The user and business impact of server delays, additional bytes, and http chunking in web search. In Velocity, 2009.Google ScholarGoogle Scholar
  32. Standard Performance Evaluation Corporation. SPEC CPU 2006 benchmark suite. http://www.spec.org.Google ScholarGoogle Scholar
  33. G. E. Suh, S. Devadas, and L. Rudolph. A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning. In HPCA, pages 117--128, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. D. Tam, R. Azimi, L. Soares, and M. Stumm. Managing shared l2 caches on multicore systems in software. In WIOSCA, 2007.Google ScholarGoogle Scholar
  35. D. K. Tam, R. Azimi, L. Soares, and M. Stumm. Rapidmrc: approximating l2 miss rate curves on commodity systems for online optimizations. In ASPLOS, pages 121--132, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. L. Tang, J. Mars, N. Vachharajani, R. Hundt, and M. L. Soffa. The impact of memory subsystem resource sharing on datacenter applications. In ISCA, pages 283--294, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. C.-J. Wu and M. Martonosi. Characterization and dynamic mitigation of intra-application cache interference. In ISPASS, pages 2--11, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Y. Xie and G. H. Loh. Scalable shared-cache management by containing thrashing workloads. In HiPEAC, pages 262--276, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. E. Z. Zhang, Y. Jiang, and X. Shen. Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs? In PPoPP, pages 203--212, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in
  • Published in

    cover image ACM Other conferences
    ISCA '13: Proceedings of the 40th Annual International Symposium on Computer Architecture
    June 2013
    686 pages
    ISBN:9781450320795
    DOI:10.1145/2485922
    • cover image ACM SIGARCH Computer Architecture News
      ACM SIGARCH Computer Architecture News  Volume 41, Issue 3
      ICSA '13
      June 2013
      666 pages
      ISSN:0163-5964
      DOI:10.1145/2508148
      Issue’s Table of Contents

    Copyright © 2013 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    • Published: 23 June 2013

    Permissions

    Request permissions about this article.

    Request Permissions

    Check for updates

    Qualifiers

    • research-article

    Acceptance Rates

    ISCA '13 Paper Acceptance Rate56of288submissions,19%Overall Acceptance Rate543of3,203submissions,17%

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader