ABS: A low-cost adaptive controller for prefetching in a banked shared last-level cache

Published: 26 January 2012

Abstract

Hardware data prefetching is a well-known technique for hiding memory latency. However, in a multicore system with a shared Last-Level Cache (LLC), prefetches issued on behalf of one core consume common resources such as shared cache space and main memory bandwidth. This can degrade the performance of other cores, and even overall system performance, unless the prefetch aggressiveness of each core is controlled from a system-wide standpoint. At the same time, LLCs in commercial chip multiprocessors are increasingly organized in independent banks. In this contribution, we target, for the first time, prefetching in a banked LLC organization and propose ABS, a low-cost controller with a hill-climbing approach that runs stand-alone at each LLC bank without requiring inter-bank communication. Using multiprogrammed SPEC2K6 workloads, our analysis shows that the mechanism improves both user-oriented metrics (Harmonic Mean of Speedups by 27% and Fairness by 11%) and system-oriented metrics (Weighted Speedup increases by 22% and Memory Bandwidth Consumption decreases by 14%) over an eight-core baseline system that uses aggressive sequential prefetching with a fixed degree. Similar conclusions hold when varying the number of cores or the LLC size, when running parallel applications, or when other prefetch engines are controlled.
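To make the idea of a bank-local hill-climbing controller more concrete, the following is a minimal Python sketch of generic hill climbing over per-core prefetch degrees. It is not the ABS algorithm itself, whose details the abstract does not give. The degree set `DEGREES`, the class `BankController`, the epoch-based cost feedback, and the made-up `toy_cost` function are all assumptions introduced for illustration. Each controller uses only a cost observed at its own bank, which loosely mirrors the "no inter-bank communication" property.

```python
DEGREES = [0, 1, 2, 4, 8]  # hypothetical prefetch degrees a core may use


class BankController:
    """Hill climbing over per-core prefetch levels, local to one LLC bank.

    Illustrative only; this is not the published ABS algorithm.
    """

    def __init__(self, num_cores, start_level=2):
        self.levels = [start_level] * num_cores  # index into DEGREES per core
        self.direction = [-1] * num_cores        # next move to try per core
        self.best_cost = None                    # lowest cost accepted so far
        self.trial = None                        # (core, previous level) under test
        self.next_core = 0                       # round-robin pointer

    def end_of_epoch(self, cost):
        """Consume this bank's cost for the last epoch (lower is better)."""
        if self.trial is None or cost < self.best_cost:
            self.best_cost = cost                # first sample or improvement: keep
        else:
            core, previous = self.trial          # no gain: undo and reverse direction
            self.levels[core] = previous
            self.direction[core] = -self.direction[core]
        # Perturb one core's level for the next epoch, clamped to the valid range.
        core = self.next_core
        self.next_core = (core + 1) % len(self.levels)
        self.trial = (core, self.levels[core])
        moved = self.levels[core] + self.direction[core]
        self.levels[core] = min(max(moved, 0), len(DEGREES) - 1)

    def committed_degrees(self):
        """Return the degrees with the pending trial move undone."""
        levels = list(self.levels)
        if self.trial is not None:
            core, previous = self.trial
            levels[core] = previous
        return [DEGREES[level] for level in levels]


def toy_cost(levels, targets=(4, 0)):
    """Made-up cost: distance of each core's level from a preferred level."""
    return sum(abs(level - target) for level, target in zip(levels, targets))


# One controller per bank; here a single bank shared by two cores.
ctrl = BankController(num_cores=2)
for _ in range(20):
    ctrl.end_of_epoch(toy_cost(ctrl.levels))
print(ctrl.committed_degrees())
```

In this toy run, core 0 benefits from aggressive prefetching and core 1 from none, and the committed degrees settle at those preferences while the controller keeps probing neighbouring settings. A real controller would replace `toy_cost` with counters measured at the bank; which counters ABS uses is not stated in the abstract.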

    • Published in

      ACM Transactions on Architecture and Code Optimization, Volume 8, Issue 4
      Special Issue on High-Performance Embedded Architectures and Compilers
      January 2012
      765 pages
      ISSN: 1544-3566
      EISSN: 1544-3973
      DOI: 10.1145/2086696

      Copyright © 2012 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 26 January 2012
      • Accepted: 1 November 2011
      • Revised: 1 October 2011
      • Received: 1 July 2011

      Qualifiers

      • research-article
      • Research
      • Refereed
