Abstract
Most processors employ hardware data prefetching techniques to hide memory access latencies. However, on a multicore processor, prefetching requests from one thread can interfere severely with the prefetching and demand requests of other threads. On shared-memory multicore systems, this contention for shared resources can significantly degrade performance. This article proposes a thread-aware data prefetching mechanism that uses low-overhead runtime information to tune prefetching modes and aggressiveness, mitigating resource contention in the memory system. Our solution has three new components: (1) a self-tuning prefetcher that uses runtime feedback to dynamically adjust the data prefetching modes and arguments of each thread, (2) a filtering mechanism that informs the hardware about which prefetching requests can cause shared data invalidation and should be discarded, and (3) a limiter thread acceleration mechanism that estimates and accelerates the critical thread, that is, the thread with the longest completion time in the parallel region of execution. On a set of multithreaded parallel benchmarks, our thread-aware data prefetching mechanism improves the overall performance of a 64-core system by 13% over a multimode prefetch baseline system with a two-level cache organization and a conventional directory-based MESI (modified, exclusive, shared, invalid) coherence protocol. Compared with the feedback directed prefetching technique, our approach improves performance by 9% on multicore systems while reducing memory bandwidth consumption.
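The three components named in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the mechanism evaluated in the article: the abstract gives no thresholds, table sizes, or estimation formulas, so the aggressiveness levels, the accuracy thresholds, the rule of discarding a prefetch when another core holds the line in M or E state, and the names `ThreadStats`, `tune`, `should_discard`, and `limiter_thread` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical aggressiveness levels: (prefetch degree, prefetch distance).
LEVELS = [(0, 0), (1, 4), (2, 8), (4, 16)]

# Assumed accuracy thresholds; the abstract does not give concrete values.
ACC_HIGH = 0.75
ACC_LOW = 0.40


@dataclass
class ThreadStats:
    issued: int = 0               # prefetches issued in the last interval
    useful: int = 0               # prefetched lines later hit by a demand access
    level: int = 1                # index into LEVELS
    remaining_work: float = 0.0   # estimated work left in the parallel region
    progress_rate: float = 1.0    # estimated work completed per cycle


def tune(stats: ThreadStats) -> int:
    """Component 1 (sketch): adjust one thread's level from its prefetch accuracy."""
    if stats.issued == 0:
        return stats.level
    accuracy = stats.useful / stats.issued
    if accuracy >= ACC_HIGH:
        stats.level = min(stats.level + 1, len(LEVELS) - 1)
    elif accuracy < ACC_LOW:
        stats.level = max(stats.level - 1, 0)
    # Reset the counters for the next sampling interval.
    stats.issued = 0
    stats.useful = 0
    return stats.level


def should_discard(requester: int, holders: Dict[int, str]) -> bool:
    """Component 2 (sketch): drop a prefetch if another core holds the line in M or E.

    `holders` maps core id to that core's MESI state for the line (assumed layout).
    """
    return any(core != requester and state in ("M", "E")
               for core, state in holders.items())


def limiter_thread(threads: Dict[int, ThreadStats]) -> int:
    """Component 3 (sketch): pick the thread with the longest estimated completion time."""
    return max(threads,
               key=lambda t: threads[t].remaining_work / threads[t].progress_rate)
```

In this model, a controller would call `tune` once per sampling interval for each thread, consult `should_discard` before issuing a prefetch to a shared line, and give the thread returned by `limiter_thread` a more favorable prefetching configuration.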
Index Terms
- Thread-Aware Adaptive Prefetcher on Multicore Systems: Improving the Performance for Multithreaded Workloads