The large latency of memory accesses in modern computer systems is a key obstacle to achieving high processor utilization. Furthermore, technology trends indicate that the gap between processor and memory speeds is likely to increase in the future. While increased latency affects all computer systems, the problem is magnified in large-scale shared-memory multiprocessors, where physical dimensions make latency an inherent problem. To cope with memory latency, nearly all computer systems rely on a cache hierarchy as their basic solution. While caches are useful, they are not a panacea.
Software-controlled prefetching is a technique for tolerating memory latency by explicitly executing prefetch instructions to move data close to the processor before it is actually needed. This technique is attractive because it can hide both read and write latency within a single thread of execution while requiring relatively little hardware support. Software-controlled prefetching, however, presents two major challenges. First, some sophistication is required on the part of the programmer, the runtime system, or (preferably) the compiler to insert prefetches into the code. Second, care must be taken that the overheads of prefetching, which include additional instructions and increased memory queueing delays, do not outweigh the benefits.
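To make the idea of an explicit prefetch instruction concrete, the following is a minimal hand-written sketch in C. It assumes a GCC- or Clang-compatible compiler, which provides the `__builtin_prefetch` builtin; the function name `sum_with_prefetch` and the constant `PREFETCH_DISTANCE` are hypothetical and chosen only for illustration. It is not taken from the dissertation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning knob: how many elements ahead to prefetch.
   A real value would depend on memory latency and loop body cost. */
#define PREFETCH_DISTANCE 16

/* Sum an array while requesting a[i + PREFETCH_DISTANCE] on every
   iteration, so the data is (ideally) in cache when the loop reaches it. */
double sum_with_prefetch(const double *a, size_t n)
{
    assert(a != NULL || n == 0);
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* Arguments: address, 0 = prefetch for read,
               3 = high expected temporal locality. */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
        }
        total += a[i];
    }
    return total;
}
```

Note that this naive version issues one prefetch per element, even though consecutive elements usually share a cache line. That is exactly the kind of instruction overhead described above as the second challenge.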
This dissertation proposes and evaluates a new compiler algorithm for inserting prefetches into code. The proposed algorithm attempts to minimize overheads by issuing prefetches only for references that are predicted to suffer cache misses. The algorithm handles both dense-matrix and sparse-matrix codes, thus covering a large fraction of scientific applications. It also works for both uniprocessor and large-scale shared-memory multiprocessor architectures. We have implemented our algorithm in the SUIF (Stanford University Intermediate Form) optimizing compiler. The results of our detailed architectural simulations demonstrate that the speed of some applications can be improved by as much as a factor of two on both uniprocessor and multiprocessor systems. This dissertation also compares software-controlled prefetching with other latency-hiding techniques (e.g., locality optimizations, relaxed consistency models, and multithreading), and investigates the architectural support necessary to make prefetching effective.
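The abstract does not say how the algorithm predicts which references will miss, so the sketch below only illustrates the general idea of selective prefetching and should not be read as the dissertation's algorithm. It assumes a GCC- or Clang-compatible compiler (`__builtin_prefetch`), a 64-byte cache line holding eight `double` values, and an array that starts on a line boundary. The names `sum_selective`, `LINE_ELEMS`, and `AHEAD_LINES` are hypothetical. For a sequential scan, only the first access to each line can miss, so one prefetch per line is enough.

```c
#include <assert.h>
#include <stddef.h>

#define LINE_ELEMS  8   /* assumption: 64-byte line / 8-byte double */
#define AHEAD_LINES 4   /* hypothetical prefetch distance, in lines */

/* Sum an array, issuing at most one prefetch per cache line.
   If 'prefetches' is non-NULL, it receives the number issued. */
double sum_selective(const double *a, size_t n, size_t *prefetches)
{
    assert(a != NULL || n == 0);
    double total = 0.0;
    size_t issued = 0;
    size_t i = 0;

    /* Main loop: each outer iteration covers one full line. */
    for (; i + LINE_ELEMS <= n; i += LINE_ELEMS) {
        size_t target = i + (size_t)AHEAD_LINES * LINE_ELEMS;
        if (target < n) {
            __builtin_prefetch(&a[target], 0, 3);
            issued++;
        }
        for (size_t j = 0; j < LINE_ELEMS; j++) {
            total += a[i + j];
        }
    }

    /* Remainder: fewer than LINE_ELEMS elements left, no prefetch. */
    for (; i < n; i++) {
        total += a[i];
    }

    if (prefetches != NULL) {
        *prefetches = issued;
    }
    return total;
}
```

For a 100-element array this issues 9 prefetches instead of one per element. The first `AHEAD_LINES` lines are never prefetched in this simplified version; a fuller treatment would add a prologue to cover them.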