Thread-Aware Adaptive Prefetcher on Multicore Systems: Improving the Performance for Multithreaded Workloads

Authors:
Peng Liu
Zhejiang University, State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi, China

Jiyang Yu
Huawei Technologies Co., Ltd., Hangzhou, China

Michael C. Huang
University of Rochester, Rochester, NY

ACM Transactions on Architecture and Code Optimization, Volume 13, Issue 1, Article No. 13, pp 1–25. https://doi.org/10.1145/2890505

Published: 28 March 2016

Abstract

Most processors employ hardware data prefetching to hide memory access latency. On a multicore processor, however, prefetch requests from one thread can interfere severely with the prefetch and demand requests of other threads. On shared-memory multicore systems, the resulting contention for shared resources can cause data prefetching to degrade performance significantly. This article proposes a thread-aware data prefetching mechanism that uses low-overhead runtime information to tune prefetching modes and aggressiveness, mitigating resource contention in the memory system. Our solution has three new components: (1) a self-tuning prefetcher that uses runtime feedback to dynamically adjust the data prefetching mode and parameters of each thread, (2) a filtering mechanism that informs the hardware which prefetch requests can cause shared data invalidation and should be discarded, and (3) a limiter thread acceleration mechanism that estimates and accelerates the critical thread, that is, the thread with the longest completion time in the parallel region of execution. On a set of multithreaded parallel benchmarks, our thread-aware data prefetching mechanism improves the overall performance of a 64-core system by 13% over a multimode prefetch baseline system with a two-level cache organization and a conventional directory-based MESI (modified, exclusive, shared, invalid) coherence protocol. Compared with the feedback directed prefetching technique, our approach provides a 9% performance improvement on multicore systems while reducing memory bandwidth consumption.
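
The three components named in the abstract can be illustrated with a rough software sketch. This is not the article's hardware design: the abstract gives no thresholds, counters, or formulas, so every name and constant below (`ThreadStats`, `ACCURACY_HIGH`, `tune_level`, `should_drop`, `limiter_thread`) is a hypothetical choice made only for illustration. The sketch assumes prefetch accuracy as the runtime feedback signal and remaining work divided by progress rate as the completion-time estimate.

```python
from dataclasses import dataclass
from typing import Dict, Set

# Hypothetical thresholds and levels; the abstract does not state the article's values.
ACCURACY_HIGH = 0.75
ACCURACY_LOW = 0.40
MAX_LEVEL = 4  # level 0 means prefetching is switched off for the thread


@dataclass
class ThreadStats:
    """Per-thread counters sampled over one interval (illustrative names)."""
    issued: int = 0             # prefetches issued in the interval
    useful: int = 0             # prefetched lines later hit by a demand access
    remaining_work: int = 0     # estimated work left in the parallel region
    progress_rate: float = 1.0  # work retired per cycle in the interval
    level: int = 2              # current prefetch aggressiveness


def tune_level(stats: ThreadStats) -> int:
    """Self-tuning step: raise aggressiveness when prefetches are accurate,
    lower it when they are mostly wasted, otherwise keep the current level."""
    if stats.issued == 0:
        return stats.level
    accuracy = stats.useful / stats.issued
    if accuracy >= ACCURACY_HIGH:
        return min(stats.level + 1, MAX_LEVEL)
    if accuracy < ACCURACY_LOW:
        return max(stats.level - 1, 0)
    return stats.level


def should_drop(prefetch_line: int, modified_elsewhere: Set[int]) -> bool:
    """Filtering step (assumed interpretation): discard a prefetch whose target
    line is held in modified state by another core, because fetching it would
    trigger a coherence action on shared data."""
    return prefetch_line in modified_elsewhere


def limiter_thread(threads: Dict[int, ThreadStats]) -> int:
    """Limiter estimation: return the id of the thread with the longest
    estimated completion time (remaining work / progress rate)."""
    def estimated_time(tid: int) -> float:
        s = threads[tid]
        return s.remaining_work / max(s.progress_rate, 1e-9)  # avoid divide-by-zero
    return max(threads, key=estimated_time)
```

In a sketch like this, the thread returned by `limiter_thread` would be the one given a higher aggressiveness level or priority for its requests, while `tune_level` would run once per sampling interval for every thread.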

Index Terms

1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multiple instruction, multiple data
2. Hardware
  1. Integrated circuits
    1. Semiconductor memory

Published in
ACM Transactions on Architecture and Code Optimization Volume 13, Issue 1
April 2016
347 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/2899032
Editor: Koen De Bosschere, Ghent University
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 28 March 2016
- Revised: 1 January 2016
- Accepted: 1 January 2016
- Received: 1 April 2015

Author Tags
Data prefetcher
multicore
multithreaded
self-tuning
thread-aware
Qualifiers
- research-article
- Research
- Refereed