ABS: A low-cost adaptive controller for prefetching in a banked shared last-level cache

Authors:
Jorge Albericio, University of Zaragoza
Rubén Gran, University of Zaragoza
Pablo Ibáñez, University of Zaragoza
Víctor Viñals, University of Zaragoza
Jose María Llabería, UPC Barcelona Tech

ACM Transactions on Architecture and Code Optimization, Volume 8, Issue 4, Article No. 19, pp. 1–20. https://doi.org/10.1145/2086696.2086698

Published: 26 January 2012


Abstract

Hardware data prefetch is a well-known technique for hiding memory latencies. However, in a multicore system fitted with a shared Last-Level Cache (LLC), the prefetches induced by one core consume common resources such as shared cache space and main memory bandwidth. This may degrade the performance of other cores, and even overall system performance, unless the prefetch aggressiveness of each core is controlled from a system standpoint. At the same time, LLCs in commercial chip multiprocessors are increasingly organized in independent banks. In this contribution, we target, for the first time, prefetching in a banked LLC organization and propose ABS, a low-cost controller with a hill-climbing approach that runs stand-alone at each LLC bank without requiring inter-bank communication. Using multiprogrammed SPEC2K6 workloads, our analysis shows that the mechanism improves both user-oriented metrics (Harmonic Mean of Speedups by 27% and Fairness by 11%) and system-oriented metrics (Weighted Speedup increases by 22% and Memory Bandwidth Consumption decreases by 14%) over an eight-core baseline system that uses aggressive sequential prefetch with a fixed degree. Similar conclusions can be drawn when varying the number of cores or the LLC size, when running parallel applications, or when other prefetch engines are controlled.
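As an illustration of the multiprogram metrics named in the abstract, the following minimal Python sketch computes them from per-core IPC values. It uses commonly cited definitions, which the abstract itself does not spell out: Weighted Speedup as the sum of per-core speedups, Harmonic Mean of Speedups as the harmonic mean of the same values, and Fairness as the ratio of the smallest to the largest speedup. The fairness definition in particular is an assumption and may differ from the one used in the article. All function and parameter names are hypothetical.

```python
from typing import List, Sequence


def speedups(ipc: Sequence[float], ipc_ref: Sequence[float]) -> List[float]:
    """Per-core speedup: IPC in the evaluated system over a reference IPC.

    The choice of reference (for example, each program running alone)
    is an assumption of this sketch.
    """
    if len(ipc) != len(ipc_ref) or len(ipc) == 0:
        raise ValueError("need one reference IPC per core, and at least one core")
    return [a / b for a, b in zip(ipc, ipc_ref)]


def weighted_speedup(s: Sequence[float]) -> float:
    """System-oriented metric: sum of per-core speedups."""
    return sum(s)


def harmonic_mean_of_speedups(s: Sequence[float]) -> float:
    """User-oriented metric: penalizes workloads where one core is starved."""
    return len(s) / sum(1.0 / x for x in s)


def fairness_min_max(s: Sequence[float]) -> float:
    """Assumed fairness definition: smallest speedup over largest speedup."""
    return min(s) / max(s)


if __name__ == "__main__":
    # Made-up IPC values for a two-core example; not data from the article.
    s = speedups(ipc=[1.0, 0.5], ipc_ref=[2.0, 2.0])
    print(weighted_speedup(s), harmonic_mean_of_speedups(s), fairness_min_max(s))
```

Because the harmonic mean is dominated by its smallest term, a prefetch policy that speeds up one core at the expense of another can raise Weighted Speedup while lowering the Harmonic Mean of Speedups, which is why the abstract reports both kinds of metric.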


Index Terms

1. Hardware
  1. Integrated circuits
    1. Semiconductor memory


Published in

ACM Transactions on Architecture and Code Optimization Volume 8, Issue 4
Special Issue on High-Performance Embedded Architectures and Compilers
January 2012
765 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/2086696

Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 26 January 2012
- Accepted: 1 November 2011
- Revised: 1 October 2011
- Received: 1 July 2011
Author Tags
Prefetch
shared resources management
Qualifiers
- research-article
- Research
- Refereed