Tolerating latency through software-controlled data prefetching
Publisher:
  • Stanford University
  • 408 Panama Mall, Suite 217
  • Stanford
  • CA
  • United States
Order Number: UMI Order No. GAX94-29983
Abstract

The large latency of memory accesses in modern computer systems is a key obstacle to achieving high processor utilization. Furthermore, technology trends indicate that the gap between processor and memory speeds is likely to widen in the future. While increased latency affects all computer systems, the problem is magnified in large-scale shared-memory multiprocessors, where physical dimensions make latency an inherent problem. To cope with memory latency, nearly all computer systems rely on a cache hierarchy as their basic solution. While caches are useful, they are not a panacea.

Software-controlled prefetching is a technique for tolerating memory latency by explicitly executing prefetch instructions to move data close to the processor before it is actually needed. This technique is attractive because it can hide both read and write latency within a single thread of execution while requiring relatively little hardware support. Software-controlled prefetching, however, presents two major challenges. First, some sophistication is required on the part of either the programmer, runtime system, or (preferably) the compiler to insert prefetches into the code. Second, care must be taken that the overheads of prefetching, which include additional instructions and increased memory queueing delays, do not outweigh the benefits.
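As a rough illustration of what "explicitly executing prefetch instructions" can look like in source code, the sketch below sums an array while requesting data a fixed number of elements ahead. It assumes a GCC- or Clang-compatible compiler, whose `__builtin_prefetch` builtin stands in for a hardware prefetch instruction; the function name `sum_with_prefetch` and the constant `PREFETCH_DISTANCE` are hypothetical and do not come from the dissertation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning constant: how many elements ahead to prefetch.
   A suitable value depends on memory latency and per-iteration work. */
#define PREFETCH_DISTANCE 16

/* Sum an array, issuing a read prefetch for the element that will be
   needed PREFETCH_DISTANCE iterations from now. The prefetch is only a
   hint; the result is the same with or without it. */
double sum_with_prefetch(const double *a, size_t n)
{
    assert(a != NULL || n == 0);
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0 /* read */, 3 /* keep in cache */);
        total += a[i];
    }
    return total;
}
```

This naive version issues one prefetch (plus a bounds test) per iteration, even though consecutive elements usually share a cache line. That is a small-scale instance of the instruction overhead the abstract warns about.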

This dissertation proposes and evaluates a new compiler algorithm for inserting prefetches into code. The proposed algorithm attempts to minimize overheads by only issuing prefetches for references that are predicted to suffer cache misses. The algorithm can prefetch both dense-matrix and sparse-matrix codes, thus covering a large fraction of scientific applications. It also works for both uniprocessor and large-scale shared-memory multiprocessor architectures. We have implemented our algorithm in the SUIF (Stanford University Intermediate Form) optimizing compiler. The results of our detailed architectural simulations demonstrate that the speed of some applications can be improved by as much as a factor of two, both on uniprocessor and multiprocessor systems. This dissertation also compares software-controlled prefetching with other latency-hiding techniques (e.g., locality optimizations, relaxed consistency models, and multithreading), and investigates the architectural support necessary to make prefetching effective.
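One way to picture "only issuing prefetches for references that are predicted to suffer cache misses" is to prefetch once per cache line rather than once per element. The sketch below does this by hand for a simple array sum. It is not the dissertation's compiler algorithm, only an illustration under stated assumptions: 64-byte cache lines, a GCC- or Clang-compatible `__builtin_prefetch`, and hypothetical names (`sum_selective_prefetch`, `LINE_BYTES`, `LINES_AHEAD`).

```c
#include <assert.h>
#include <stddef.h>

/* Assumptions for illustration only; neither value comes from the source. */
#define LINE_BYTES 64                                   /* assumed cache line size */
#define ELEMS_PER_LINE (LINE_BYTES / sizeof(double))    /* elements per assumed line */
#define LINES_AHEAD 8                                   /* assumed prefetch distance, in lines */

double sum_selective_prefetch(const double *a, size_t n)
{
    assert(a != NULL || n == 0);
    const size_t ahead = LINES_AHEAD * ELEMS_PER_LINE;
    double total = 0.0;
    size_t i = 0;

    /* Steady state: a single prefetch covers a whole line-sized strip,
       so the inner loop runs without any prefetch overhead. */
    for (; i + ahead + ELEMS_PER_LINE <= n; i += ELEMS_PER_LINE) {
        __builtin_prefetch(&a[i + ahead], 0 /* read */, 3 /* keep in cache */);
        for (size_t j = 0; j < ELEMS_PER_LINE; j++)
            total += a[i + j];
    }

    /* Epilogue: the data needed here was already requested above,
       so no further prefetches are issued. */
    for (; i < n; i++)
        total += a[i];

    return total;
}
```

The sketch omits a prologue, so the first `LINES_AHEAD` lines are fetched on demand, and it assumes the array starts on a line boundary; if it does not, each strip straddles two lines and the prefetch covers only part of it.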

Cited By

  1. Ainsworth S and Jones T (2019). Software Prefetching for Indirect Memory Accesses, ACM Transactions on Computer Systems, 36:3, (1-34), Online publication date: 16-Aug-2019.
  2. Zhou B, Huang Y, Xu J, Guo S and Qi H (2019). Memory latency optimizations for the elementary functions on the Sunway architecture, The Journal of Supercomputing, 75:7, (3917-3944), Online publication date: 1-Jul-2019.
  3. Ainsworth S and Jones T Software prefetching for indirect memory accesses Proceedings of the 2017 International Symposium on Code Generation and Optimization, (305-317)
  4. Ntafam P, Paire E, Clouard A and Petrot F Simulation driven insertion of data prefetching instructions for early software-on-SoC optimization Proceedings of the 27th International Symposium on Rapid System Prototyping: Shortening the Path from Specification to Prototype, (93-99)
  5. Nimako G, Otoo E and Ohene-Kwofie D Cache-sensitive MapReduce DGEMM algorithms for shared memory architectures Proceedings of the South African Institute for Computer Scientists and Information Technologists Conference, (100-110)
  6. Cong J, Huang H, Liu C and Zou Y A reuse-aware prefetching scheme for scratchpad memory Proceedings of the 48th Design Automation Conference, (960-965)
  7. Gu J, Kumar R, Lumetta S and Sun Y Accelerating data movement on future chip multi-processors Proceedings of the Second International Forum on Next-Generation Multicore/Manycore Technologies, (1-12)
  8. Askitis N and Zobel J (2011). Redesigning the string hash table, burst trie, and BST to exploit cache, ACM Journal of Experimental Algorithmics, 15, (1.1-1.61), Online publication date: 1-Mar-2010.
  9. Chen S, Ailamaki A, Gibbons P and Mowry T (2007). Improving hash join performance through prefetching, ACM Transactions on Database Systems, 32:3, (17-es), Online publication date: 1-Aug-2007.
  10. Zhao Y and Kennedy K Dependence-based code generation for a CELL processor Proceedings of the 19th international conference on Languages and compilers for parallel computing, (64-79)
  11. Jeong J, Stenström P and Dubois M Simple penalty-sensitive replacement policies for caches Proceedings of the 3rd conference on Computing frontiers, (341-352)
  12. Chen J, Dong Y, Yi H and Yang X Compiler-Directed energy-aware prefetching optimization for embedded applications Proceedings of the Second international conference on Embedded Software and Systems, (230-243)
  13. Chen J, Dong Y, Yi H and Yang X Energy-Constrained prefetching optimization in embedded applications Proceedings of the 2005 international conference on Embedded and Ubiquitous Computing, (267-280)
  14. Zhang W (2005). Replication Cache, IEEE Transactions on Computers, 54:12, (1547-1555), Online publication date: 1-Dec-2005.
  15. Guo Y, Naser M and Moritz C PARE Proceedings of the 2005 international symposium on Low power electronics and design, (339-344)
  16. Kadayif I, Kandemir M, Chen G, Ozturk O, Karakoy M and Sezer U (2005). Optimizing Array-Intensive Applications for On-Chip Multiprocessors, IEEE Transactions on Parallel and Distributed Systems, 16:5, (396-411), Online publication date: 1-May-2005.
  17. Guo Y, Chheda S, Koren I, Krishna C and Moritz C Energy-aware data prefetching for general-purpose programs Proceedings of the 4th international conference on Power-Aware Computer Systems, (78-94)
  18. Wang T, Blagojevic F and Nikolopoulos D Runtime support for integrating precomputation and thread-level parallelism on simultaneous multithreaded processors Proceedings of the 7th workshop on Workshop on languages, compilers, and run-time support for scalable systems, (1-12)
  19. Chen S, Ailamaki A, Gibbons P and Mowry T Improving Hash Join Performance through Prefetching Proceedings of the 20th International Conference on Data Engineering
  20. Guo Y, Chheda S and Moritz C Runtime biased pointer reuse analysis and its application to energy efficiency Proceedings of the Third international conference on Power - Aware Computer Systems, (1-12)
  21. Caşcaval C and Padua D Estimating cache misses and locality using stack distances Proceedings of the 17th annual international conference on Supercomputing, (150-159)
  22. Stephenson M, Amarasinghe S, Martin M and O'Reilly U Meta optimization Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation, (77-90)
  23. Stephenson M, Amarasinghe S, Martin M and O'Reilly U (2003). Meta optimization, ACM SIGPLAN Notices, 38:5, (77-90), Online publication date: 9-May-2003.
  24. Cahoon B and McKinley K Simple and effective array prefetching in Java Proceedings of the 2002 joint ACM-ISCOPE conference on Java Grande, (86-95)
  25. Manegold S, Boncz P and Kersten M (2002). Optimizing Main-Memory Join on Modern Hardware, IEEE Transactions on Knowledge and Data Engineering, 14:4, (709-730), Online publication date: 1-Jul-2002.
  26. Chang C, Sheu J and Chen H (2002). Reducing Cache Conflicts by Multi-Level Cache Partitioning and Array Elements Mapping, The Journal of Supercomputing, 22:2, (197-219), Online publication date: 1-Jun-2002.
  27. Sarkar V (2001). Optimized Unrolling of Nested Loops, International Journal of Parallel Programming, 29:5, (545-581), Online publication date: 1-Oct-2001.
  28. Badawy A, Aggarwal A, Yeung D and Tseng C Evaluating the impact of memory system performance on software prefetching and locality optimizations Proceedings of the 15th international conference on Supercomputing, (486-500)
  29. Shuf Y, Serrano M, Gupta M and Singh J (2001). Characterizing the memory behavior of Java workloads, ACM SIGMETRICS Performance Evaluation Review, 29:1, (194-205), Online publication date: 1-Jun-2001.
  30. Luk C Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors Proceedings of the 28th annual international symposium on Computer architecture, (40-51)
  31. Shuf Y, Serrano M, Gupta M and Singh J Characterizing the memory behavior of Java workloads Proceedings of the 2001 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, (194-205)
  32. Luk C (2001). Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors, ACM SIGARCH Computer Architecture News, 29:2, (40-51), Online publication date: 1-May-2001.
  33. Milenkovic A (2000). Achieving High Performance in Bus-Based Shared-Memory Multiprocessors, IEEE Concurrency, 8:3, (36-44), Online publication date: 1-Jul-2000.
  34. Chou Y and Shen J Instruction path coprocessors Proceedings of the 27th annual international symposium on Computer architecture, (270-281)
  35. Pirvu M and Bhuyan L Hardware spatial forwarding for widely shared data Proceedings of the 14th international conference on Supercomputing, (264-273)
  36. Sarkar V Optimized unrolling of nested loops Proceedings of the 14th international conference on Supercomputing, (153-166)
  37. Chou Y and Shen J (2000). Instruction path coprocessors, ACM SIGARCH Computer Architecture News, 28:2, (270-281), Online publication date: 1-May-2000.
  38. Shrewsbury D and Norris C Reducing the impact of software prefetching on register pressure Proceedings of the 2000 ACM symposium on Applied computing - Volume 2, (767-773)
  39. Eiron N, Rodeh M and Steinwarts I (1999). Matrix multiplication, ACM Journal of Experimental Algorithmics, 4, (3-es), Online publication date: 31-Dec-2000.
  40. Pai V and Adve S Code transformations to improve memory parallelism Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture, (147-155)
  41. Boncz P, Manegold S and Kersten M Database Architecture Optimized for the New Bottleneck Proceedings of the 25th International Conference on Very Large Data Bases, (54-65)
  42. Ranganathan P, Adve S and Jouppi N Performance of image and video processing with general-purpose processors and media ISA extensions Proceedings of the 26th annual international symposium on Computer architecture, (124-135)
  43. Ranganathan P, Adve S and Jouppi N (1999). Performance of image and video processing with general-purpose processors and media ISA extensions, ACM SIGARCH Computer Architecture News, 27:2, (124-135), Online publication date: 1-May-1999.
  44. Pai V, Ranganathan P, Abdel-Shafi H and Adve S (1999). The Impact of Exploiting Instruction-Level Parallelism on Shared-Memory Multiprocessors, IEEE Transactions on Computers, 48:2, (218-226), Online publication date: 1-Feb-1999.
  45. Luk C and Mowry T (1999). Automatic Compiler-Inserted Prefetching for Pointer-Based Applications, IEEE Transactions on Computers, 48:2, (134-141), Online publication date: 1-Feb-1999.
  46. Chi C and Cheung C Hardware-driven prefetching for pointer data references Proceedings of the 12th international conference on Supercomputing, (377-384)
  47. Mukherjee S and Hill M (1998). Using prediction to accelerate coherence protocols, ACM SIGARCH Computer Architecture News, 26:3, (179-190), Online publication date: 1-Jun-1998.
  48. Wong D, Davis E and Young J (1998). A Software Approach to Avoiding Spatial Cache Collisions in Parallel Processor Systems, IEEE Transactions on Parallel and Distributed Systems, 9:6, (601-608), Online publication date: 1-Jun-1998.
  49. Mukherjee S and Hill M Using prediction to accelerate coherence protocols Proceedings of the 25th annual international symposium on Computer architecture, (179-190)
  50. Manjikian N Combining Loop Fusion with Prefetching on Shared-memory Multiprocessors Proceedings of the international Conference on Parallel Processing
  51. Skeppstedt J and Dubois M Hybrid compiler/hardware prefetching for multiprocessors using low-overhead cache miss traps Proceedings of the international Conference on Parallel Processing, (298-305)
  52. Ranganathan P, Pai V, Abdel-Shafi H and Adve S The interaction of software prefetching with ILP processors in shared-memory systems Proceedings of the 24th annual international symposium on Computer architecture, (144-156)
  53. Ranganathan P, Pai V, Abdel-Shafi H and Adve S (1997). The interaction of software prefetching with ILP processors in shared-memory systems, ACM SIGARCH Computer Architecture News, 25:2, (144-156), Online publication date: 1-May-1997.
  54. Lim H and Yew P A Compiler-Directed Cache Coherence Scheme Using Data Prefetching Proceedings of the 11th International Symposium on Parallel Processing, (643-649)
  55. Grahn H and Stenström P Relative Performance of Hardware and Software-Only Directory Protocols Under Latency Tolerating and Reducing Techniques Proceedings of the 11th International Symposium on Parallel Processing
  56. Bugnion E, Anderson J, Mowry T, Rosenblum M and Lam M (1996). Compiler-directed page coloring for multiprocessors, ACM SIGOPS Operating Systems Review, 30:5, (244-255), Online publication date: 1-Dec-1996.
  57. Luk C and Mowry T (1996). Compiler-based prefetching for recursive data structures, ACM SIGOPS Operating Systems Review, 30:5, (222-233), Online publication date: 1-Dec-1996.
  58. Bugnion E, Anderson J, Mowry T, Rosenblum M and Lam M Compiler-directed page coloring for multiprocessors Proceedings of the seventh international conference on Architectural support for programming languages and operating systems, (244-255)
  59. Luk C and Mowry T Compiler-based prefetching for recursive data structures Proceedings of the seventh international conference on Architectural support for programming languages and operating systems, (222-233)
  60. Bugnion E, Anderson J, Mowry T, Rosenblum M and Lam M (1996). Compiler-directed page coloring for multiprocessors, ACM SIGPLAN Notices, 31:9, (244-255), Online publication date: 1-Sep-1996.
  61. Luk C and Mowry T (1996). Compiler-based prefetching for recursive data structures, ACM SIGPLAN Notices, 31:9, (222-233), Online publication date: 1-Sep-1996.
  62. Dahlgren F and Stenström P (1996). Evaluation of Hardware-Based Stride and Sequential Prefetching in Shared-Memory Multiprocessors, IEEE Transactions on Parallel and Distributed Systems, 7:4, (385-398), Online publication date: 1-Apr-1996.
  63. Landin A and Dahlgren F Bus-based COMA-reducing traffic in shared-bus multiprocessors Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
  64. Harrison L Examination of a memory access classification scheme for pointer-intensive and numeric programs Proceedings of the 10th international conference on Supercomputing, (133-140)
  65. Navarro J, García-Diego E and Herrero J Data prefetching and multilevel blocking for linear algebra operations Proceedings of the 10th international conference on Supercomputing, (109-116)
  66. Lipasti M, Schmidt W, Kunkel S and Roediger R SPAID Proceedings of the 28th annual international symposium on Microarchitecture, (231-236)
  67. Luk C Memory disambiguation for general-purpose applications Proceedings of the 1995 conference of the Centre for Advanced Studies on Collaborative research
  68. Bordawekar R, Choudhary A, Kennedy K, Koelbel C and Paleczny M (1995). A model and compilation strategy for out-of-core data parallel programs, ACM SIGPLAN Notices, 30:8, (1-10), Online publication date: 1-Aug-1995.
  69. Bordawekar R, Choudhary A, Kennedy K, Koelbel C and Paleczny M A model and compilation strategy for out-of-core data parallel programs Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming, (1-10)
  70. Zhang Z and Torrellas J Speeding up irregular applications in shared-memory multiprocessors Proceedings of the 22nd annual international symposium on Computer architecture, (188-199)
  71. Bernstein D, Cohen D and Freund A Compiler techniques for data prefetching on the PowerPC Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques, (19-26)
  72. Zhang Z and Torrellas J (1995). Speeding up irregular applications in shared-memory multiprocessors, ACM SIGARCH Computer Architecture News, 23:2, (188-199), Online publication date: 1-May-1995.
  73. Skeppstedt J and Stenström P (1994). Simple compiler algorithms to reduce ownership overhead in cache coherence protocols, ACM SIGOPS Operating Systems Review, 28:5, (286-296), Online publication date: 1-Dec-1994.
  74. Skeppstedt J and Stenström P Simple compiler algorithms to reduce ownership overhead in cache coherence protocols Proceedings of the sixth international conference on Architectural support for programming languages and operating systems, (286-296)
  75. Skeppstedt J and Stenström P (1994). Simple compiler algorithms to reduce ownership overhead in cache coherence protocols, ACM SIGPLAN Notices, 29:11, (286-296), Online publication date: 1-Nov-1994.
  76. Lee J, Lee M, Choi S and Park M (1994). Reducing cache conflicts in data cache prefetching, ACM SIGARCH Computer Architecture News, 22:4, (71-77), Online publication date: 1-Sep-1994.
Contributors
  • Carnegie Mellon University
