Thread-Aware Adaptive Prefetcher on Multicore Systems: Improving the Performance for Multithreaded Workloads

Published: 28 March 2016

Abstract

Most processors employ hardware data prefetching to hide memory access latency. On a multicore processor, however, prefetch requests from one thread can severely interfere with the prefetch and demand requests of other threads. On shared-memory multicore systems, the resulting contention for shared resources can cause significant performance degradation. This article proposes a thread-aware data prefetching mechanism that uses low-overhead runtime information to tune prefetching modes and aggressiveness, mitigating resource contention in the memory system. Our solution has three new components: (1) a self-tuning prefetcher that uses runtime feedback to dynamically adjust the data prefetching mode and parameters of each thread, (2) a filtering mechanism that informs the hardware which prefetch requests can cause shared data invalidation and should be discarded, and (3) a limiter thread acceleration mechanism that estimates and accelerates the critical thread, the one with the longest completion time in the parallel region of execution. On a set of multithreaded parallel benchmarks, our thread-aware data prefetching mechanism improves the overall performance of a 64-core system by 13% over a multimode prefetch baseline with a two-level cache organization and a conventional directory-based MESI (modified, exclusive, shared, invalid) coherence protocol. Compared with the feedback directed prefetching technique, our approach provides a 9% performance improvement on multicore systems while also reducing memory bandwidth consumption.
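
To make the idea of per-thread runtime feedback concrete, the following is a minimal sketch and not the mechanism evaluated in the article. It assumes a simple accuracy metric (useful prefetches divided by issued prefetches) sampled once per interval; the class name `ThreadPrefetchState`, the degree table `DEGREES`, and the thresholds `high` and `low` are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass

# Hypothetical prefetch degrees (lines fetched ahead); not taken from the article.
DEGREES = [1, 2, 4, 8]


@dataclass
class ThreadPrefetchState:
    """Per-thread feedback counters and the current aggressiveness level."""

    level: int = 1   # index into DEGREES
    issued: int = 0  # prefetches issued in the current interval
    useful: int = 0  # prefetched lines later hit by a demand access

    def record(self, issued: int, useful: int) -> None:
        """Accumulate feedback observed during the current interval."""
        self.issued += issued
        self.useful += useful

    def end_interval(self, high: float = 0.75, low: float = 0.40) -> int:
        """Raise or lower aggressiveness from accuracy, reset, return the degree.

        The thresholds are assumed values, chosen only for this sketch.
        """
        if self.issued > 0:
            accuracy = self.useful / self.issued
            if accuracy >= high:
                # Accurate prefetching: allow this thread to prefetch further ahead.
                self.level = min(self.level + 1, len(DEGREES) - 1)
            elif accuracy < low:
                # Inaccurate prefetching: throttle to reduce shared-resource contention.
                self.level = max(self.level - 1, 0)
        self.issued = 0
        self.useful = 0
        return DEGREES[self.level]


# One state object per thread, so each thread is tuned independently.
threads = {tid: ThreadPrefetchState() for tid in range(4)}
threads[0].record(issued=100, useful=90)  # accurate thread
threads[1].record(issued=100, useful=10)  # inaccurate thread
degrees = {tid: state.end_interval() for tid, state in threads.items()}
```

Keeping one state object per thread is what lets an inaccurate thread be throttled without penalizing an accurate one. A fuller model would also weigh the other feedback the abstract mentions, such as invalidations of shared data and the identity of the critical thread.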

        Published in

        ACM Transactions on Architecture and Code Optimization, Volume 13, Issue 1
        April 2016
        347 pages
        ISSN: 1544-3566
        EISSN: 1544-3973
        DOI: 10.1145/2899032

        Copyright © 2016 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 28 March 2016
        • Revised: 1 January 2016
        • Accepted: 1 January 2016
        • Received: 1 April 2015

        Qualifiers

        • research-article
        • Research
        • Refereed
