Abstract
Hardware data prefetching is a well-known technique for hiding memory latency. However, in a multicore system with a shared Last-Level Cache (LLC), prefetches issued on behalf of one core consume common resources such as shared cache space and main memory bandwidth. This may degrade the performance of other cores, and even overall system performance, unless the prefetch aggressiveness of each core is controlled from a system-wide standpoint. Meanwhile, LLCs in commercial chip multiprocessors are increasingly organized in independent banks. In this contribution, we target, for the first time, prefetching in a banked LLC organization and propose ABS, a low-cost controller with a hill-climbing approach that runs stand-alone at each LLC bank without requiring inter-bank communication. Using multiprogrammed SPEC2K6 workloads, our analysis shows that the mechanism improves both user-oriented metrics (Harmonic Mean of Speedups by 27% and Fairness by 11%) and system-oriented metrics (Weighted Speedup increases by 22% and Memory Bandwidth Consumption decreases by 14%) over an eight-core baseline system that uses aggressive sequential prefetch with a fixed degree. Similar conclusions hold when varying the number of cores or the LLC size, when running parallel applications, or when other prefetch engines are controlled.
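The abstract reports Harmonic Mean of Speedups, Fairness, and Weighted Speedup without defining them. The sketch below shows one common way these multiprogram metrics are computed from per-core IPC values. It assumes the usual definitions (per-core speedup as shared-mode IPC over stand-alone IPC, fairness as the ratio of the smallest to the largest speedup); the paper's exact formulas may differ, and all function names here are hypothetical.

```python
from typing import List, Sequence


def speedups(ipc_shared: Sequence[float], ipc_alone: Sequence[float]) -> List[float]:
    """Per-core speedup: IPC in the shared run divided by IPC when running alone."""
    if not ipc_shared or len(ipc_shared) != len(ipc_alone):
        raise ValueError("need two non-empty IPC sequences of equal length")
    return [shared / alone for shared, alone in zip(ipc_shared, ipc_alone)]


def weighted_speedup(ipc_shared: Sequence[float], ipc_alone: Sequence[float]) -> float:
    """System-oriented throughput metric: sum of per-core speedups (assumed definition)."""
    return sum(speedups(ipc_shared, ipc_alone))


def harmonic_mean_of_speedups(ipc_shared: Sequence[float], ipc_alone: Sequence[float]) -> float:
    """User-oriented metric: harmonic mean of per-core speedups (assumed definition)."""
    s = speedups(ipc_shared, ipc_alone)
    return len(s) / sum(1.0 / x for x in s)


def fairness(ipc_shared: Sequence[float], ipc_alone: Sequence[float]) -> float:
    """Ratio of smallest to largest speedup; 1.0 means all cores slow down equally (assumed definition)."""
    s = speedups(ipc_shared, ipc_alone)
    return min(s) / max(s)


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the paper.
    alone = [2.0, 1.0]
    shared = [1.0, 0.8]
    print(weighted_speedup(shared, alone))
    print(harmonic_mean_of_speedups(shared, alone))
    print(fairness(shared, alone))
```

Under these definitions, a percentage improvement such as those quoted in the abstract would be the relative change of a metric between the baseline and the controlled system, each measured over the same workload mix.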
Index Terms
- ABS: A low-cost adaptive controller for prefetching in a banked shared last-level cache