Software methods for improvement of cache performance on supercomputer applications
Publisher:
  • Rice University
  • 6100 S. Main Houston, TX
  • United States
Order Number: AAI9012855
Pages: 159
Abstract

Measurements of actual supercomputer cache performance have not previously been undertaken. PFC-Sim is a program-driven event tracing facility that can simulate the data cache performance of very long programs. PFC-Sim simulates the cache concurrently with program execution, allowing very long traces to be used. Programs with traces in excess of 4 billion entries have been used to measure the performance of various cache structures.
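The idea of simulating a cache while the program runs, rather than storing a trace first, can be sketched as follows. This is only an illustrative model, not PFC-Sim itself: the direct-mapped organization, the 32-byte line size, and the names `hit_ratio` and `sweep` are assumptions made for the example.

```python
def hit_ratio(addresses, cache_bytes=16 * 1024, line_bytes=32):
    """Direct-mapped cache model (an assumed organization).

    Addresses are consumed lazily, one at a time, so the trace is never
    stored and can be arbitrarily long.
    """
    num_lines = cache_bytes // line_bytes
    tags = [None] * num_lines          # memory block currently held by each line
    hits = total = 0
    for addr in addresses:
        block = addr // line_bytes     # memory block containing this address
        index = block % num_lines      # cache line that block maps to
        if tags[index] == block:
            hits += 1
        else:
            tags[index] = block        # miss: fill the line
        total += 1
    return hits / total if total else 0.0


def sweep(n_elems, passes, elem_bytes=8):
    """Hypothetical 'program': repeated sequential sweeps over one array."""
    for _ in range(passes):
        for i in range(n_elems):
            yield i * elem_bytes


print(hit_ratio(sweep(1024, 2)))   # 8 KB array, fits in the 16K cache
print(hit_ratio(sweep(4096, 2)))   # 32 KB array, larger than the cache
```

Because `sweep` is a generator, each address is produced and consumed in the same step, which mirrors the concurrent style of simulation described above.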

PFC-Sim was used to measure the cache performance of array references in a benchmark set of supercomputer applications, RiCEPS. Average data cache hit ratios ranged from 70% for a 16K cache to 91% for a 256K cache. Programs with very large working sets perform poorly even with large caches. The hit ratios of individual references are measured to be either 0% or 100%.
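A toy illustration of why individual references tend toward the extremes: in a loop that reads A(i-1) and then A(i), the first reference reuses what the second one loaded an iteration earlier. The sketch below assumes a fully associative LRU cache with one array element per line; the helper names and the loop are hypothetical and are not taken from RiCEPS.

```python
from collections import OrderedDict


def per_reference_hits(trace, capacity_lines, line_bytes=8):
    """trace yields (ref_id, address); returns {ref_id: hit ratio} under LRU."""
    cache = OrderedDict()              # blocks in LRU order, oldest first
    stats = {}                         # ref_id -> (hits, accesses)
    for ref, addr in trace:
        block = addr // line_bytes
        hit = block in cache
        if hit:
            cache.move_to_end(block)   # mark as most recently used
        else:
            cache[block] = True
            if len(cache) > capacity_lines:
                cache.popitem(last=False)   # evict least recently used
        h, n = stats.get(ref, (0, 0))
        stats[ref] = (h + hit, n + 1)
    return {ref: h / n for ref, (h, n) in stats.items()}


def stencil_trace(n, elem_bytes=8):
    """Models: for i = 1 .. n-1, read A(i-1) then A(i)."""
    for i in range(1, n):
        yield ("A(i-1)", (i - 1) * elem_bytes)
        yield ("A(i)", i * elem_bytes)


ratios = per_reference_hits(stencil_trace(1001), capacity_lines=64)
print(ratios)
```

In this model A(i) always touches a new element and never hits, while A(i-1) hits on every iteration except the first.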

By locating the references that miss, attempts to improve memory performance can focus on references where improvement is possible. The compiler can estimate the number of loop iterations that can execute without filling the cache, called the overflow iteration. Combined with the dependence graph, the overflow iteration can be used to determine, for each reference, whether execution will result in hits or misses.
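One simplified way to read the overflow iteration is as the cache capacity divided by the amount of new data each iteration brings in; a reuse whose dependence distance is shorter than that is predicted to hit. The formula and function names below are assumptions for illustration only, since the abstract does not give the thesis's exact definition.

```python
def overflow_iteration(cache_bytes, line_bytes, new_lines_per_iter):
    """Estimated number of iterations that run before the cache is full.

    new_lines_per_iter is the (possibly fractional) number of previously
    untouched cache lines each iteration references. Assumed model.
    """
    cache_lines = cache_bytes // line_bytes
    return int(cache_lines / new_lines_per_iter)


def predicts_hit(dependence_distance, overflow):
    """A reuse is predicted to hit if it occurs before the cache overflows."""
    return dependence_distance < overflow


overflow = overflow_iteration(16 * 1024, 32, new_lines_per_iter=2)
print(overflow)                       # iterations before a 16K cache fills
print(predicts_hit(100, overflow))    # reuse after 100 iterations
print(predicts_hit(1000, overflow))   # reuse after 1000 iterations
```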

Program transformation can be used to improve cache performance by reordering computation to move references to the same memory location closer together, thereby eliminating cache misses. Using the overflow iteration, the compiler can often do this transformation automatically. Standard blocking transformations cannot be used on many loop nests that contain transformation-preventing dependences. Wavefront blocking allows any loop nest to be blocked when the components of dependence vectors are bounded.
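As a minimal sketch of a standard blocking transformation (not wavefront blocking), the loop nest below is a matrix multiply rewritten so that a small tile of the second operand is reused across all rows before moving on. The tile size and function names are arbitrary choices for the example.

```python
def matmul(a, b, n):
    """Reference i-j-k loop nest."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, n, tile=4):
    """Same computation with the j and k loops blocked into tiles.

    Each tile-by-tile piece of b is reused for every row i before the next
    tile is touched, which moves reuses of the same elements closer together.
    """
    c = [[0] * n for _ in range(n)]
    for jj in range(0, n, tile):
        for kk in range(0, n, tile):
            for i in range(n):
                for j in range(jj, min(jj + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        c[i][j] += a[i][k] * b[k][j]
    return c


n = 6
a = [[i + j for j in range(n)] for i in range(n)]
b = [[i - j for j in range(n)] for i in range(n)]
print(matmul(a, b, n) == matmul_blocked(a, b, n))
```

The blocked version performs the same multiply-adds in a different order, so it is legal here only because this loop nest has no dependences that prevent the reordering.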

When cache misses cannot be eliminated, software prefetching can overlap the miss delays with computation. Software prefetching uses a special instruction to preload values into the cache. A cache load resembles a register load in structure, but it does not block computation and only moves the addressed data into the cache, where a later register load is still required. The compiler can inform the cache (on average) over 100 cycles before a load is required. Cache misses can be serviced in parallel with computation.
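The benefit of issuing the cache load early can be seen in a toy timing model. It assumes one missing load per loop iteration, a fixed miss latency, unlimited outstanding prefetches, and no cost for the prefetch instruction itself; `run_cycles` and its parameters are hypothetical.

```python
def run_cycles(n_iters, work, latency, distance=0):
    """Total cycles for a loop with one cache-missing load per iteration.

    distance=0 means no prefetching: every load stalls for the full latency.
    Otherwise the data for iteration i is requested when iteration
    i - distance begins (or at cycle 0, in a prologue, for the first few).
    """
    if distance == 0:
        return n_iters * (work + latency)
    begin = []                          # cycle at which each iteration begins
    t = 0
    for i in range(n_iters):
        begin.append(t)
        requested = begin[i - distance] if i >= distance else 0
        t = max(t, requested + latency)  # stall only if data has not arrived
        t += work                        # compute for this iteration
    return t


print(run_cycles(100, work=10, latency=100))               # no prefetching
print(run_cycles(100, work=10, latency=100, distance=10))  # prefetch 10 ahead
```

With 10 cycles of work per iteration and a 100-cycle miss, a prefetch distance of 10 iterations requests each value 100 cycles before it is needed, so in this model only the initial miss is exposed.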

Cited By

  1. Rafique M and Zhu Z CAMPS Proceedings of the 47th International Conference on Parallel Processing, (1-9)
  2. Bjørnseth B, Meyer J and Natvig L Efficient array slicing on the Intel Xeon Phi coprocessor Proceedings of the 4th ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, (40-47)
  3. Gornish E, Granston E and Veidenbaum A Compiler-directed data prefetching in multiprocessors with memory hierarchies ACM International Conference on Supercomputing 25th Anniversary Volume, (128-142)
  4. Kennedy K and McKinley K Optimizing for parallelism and data locality ACM International Conference on Supercomputing 25th Anniversary Volume, (151-162)
  5. Zhuang X and Lee H (2007). Reducing Cache Pollution via Dynamic Data Prefetch Filtering, IEEE Transactions on Computers, 56:1, (18-31), Online publication date: 1-Jan-2007.
  6. Zhong Y, Dropsho S, Shen X, Studer A and Ding C (2007). Miss Rate Prediction Across Program Inputs and Cache Configurations, IEEE Transactions on Computers, 56:3, (328-343), Online publication date: 1-Mar-2007.
  7. Xue J and Vera X (2004). Efficient and Accurate Analytical Modeling of Whole-Program Data Cache Behavior, IEEE Transactions on Computers, 53:5, (547-566), Online publication date: 1-May-2004.
  8. Callahan D, Carr S and Kennedy K (2004). Improving register allocation for subscripted variables, ACM SIGPLAN Notices, 39:4, (328-342), Online publication date: 1-Apr-2004.
  9. Lam M and Wolf M (2004). A data locality optimizing algorithm, ACM SIGPLAN Notices, 39:4, (442-459), Online publication date: 1-Apr-2004.
  10. Ding C and Kennedy K (2004). Improving effective bandwidth through compiler enhancement of global cache reuse, Journal of Parallel and Distributed Computing, 64:1, (108-134), Online publication date: 1-Jan-2004.
  11. Caşcaval C and Padua D Estimating cache misses and locality using stack distances Proceedings of the 17th annual international conference on Supercomputing, (150-159)
  12. Fraguela B, Doallo R and Zapata E (2003). Probabilistic Miss Equations, IEEE Transactions on Computers, 52:3, (321-336), Online publication date: 1-Mar-2003.
  13. Mellor-Crummey J, Whalley D and Kennedy K (2001). Improving Memory Hierarchy Performance for Irregular Applications Using Data and Computation Reorderings, International Journal of Parallel Programming, 29:3, (217-247), Online publication date: 1-Jun-2001.
  14. Sarkar V (2001). Optimized Unrolling of Nested Loops, International Journal of Parallel Programming, 29:5, (545-581), Online publication date: 1-Oct-2001.
  15. Manjikian N and Abdelrahman T (2001). Exploiting Wavefront Parallelism on Large-Scale Shared-Memory Multiprocessors, IEEE Transactions on Parallel and Distributed Systems, 12:3, (259-271), Online publication date: 1-Mar-2001.
  16. Chatterjee S, Parker E, Hanlon P and Lebeck A Exact analysis of the cache behavior of nested loops Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation, (286-297)
  17. Chatterjee S, Parker E, Hanlon P and Lebeck A (2001). Exact analysis of the cache behavior of nested loops, ACM SIGPLAN Notices, 36:5, (286-297), Online publication date: 1-May-2001.
  18. Kandemir M, Banerjee P, Choudhary A, Ramanujam J and Ayguadé E (2001). Static and Dynamic Locality Optimizations Using Integer Linear Programming, IEEE Transactions on Parallel and Distributed Systems, 12:9, (922-941), Online publication date: 1-Sep-2001.
  19. Sarkar V Optimized unrolling of nested loops Proceedings of the 14th international conference on Supercomputing, (153-166)
  20. Yang C and Lebeck A Push vs. pull Proceedings of the 14th international conference on Supercomputing, (176-186)
  21. Vanderwiel S and Lilja D (2000). Data prefetch mechanisms, ACM Computing Surveys, 32:2, (174-199), Online publication date: 1-Jun-2000.
  22. Johnson T, Connors D, Merten M and Hwu W (1999). Run-Time Cache Bypassing, IEEE Transactions on Computers, 48:12, (1338-1354), Online publication date: 1-Dec-1999.
  23. Lebeck A Cache conscious programming in undergraduate computer science The proceedings of the thirtieth SIGCSE technical symposium on Computer science education, (247-251)
  24. Mellor-Crummey J, Whalley D and Kennedy K Improving memory hierarchy performance for irregular applications Proceedings of the 13th international conference on Supercomputing, (425-433)
  25. Chatterjee S, Jain V, Lebeck A, Mundhra S and Thottethodi M Nonlinear array layouts for hierarchical memory systems Proceedings of the 13th international conference on Supercomputing, (444-453)
  26. Lebeck A (1999). Cache conscious programming in undergraduate computer science, ACM SIGCSE Bulletin, 31:1, (247-251), Online publication date: 1-Mar-1999.
  27. Skadron K, Ahuja P, Martonosi M and Clark D (1999). Branch Prediction, Instruction-Window Size, and Cache Size, IEEE Transactions on Computers, 48:11, (1260-1281), Online publication date: 1-Nov-1999.
  28. Ghosh S, Martonosi M and Malik S (1999). Cache miss equations, ACM Transactions on Programming Languages and Systems, 21:4, (703-746), Online publication date: 1-Jul-1999.
  29. Manjikian N and Abdelrahman T (1997). Fusion of Loops for Parallelism and Locality, IEEE Transactions on Parallel and Distributed Systems, 8:2, (193-209), Online publication date: 1-Feb-1997.
  30. Johnson T and Hwu W Run-time adaptive cache hierarchy management via reference analysis Proceedings of the 24th annual international symposium on Computer architecture, (315-326)
  31. Johnson T and Hwu W (1997). Run-time adaptive cache hierarchy management via reference analysis, ACM SIGARCH Computer Architecture News, 25:2, (315-326), Online publication date: 1-May-1997.
  32. Johnson T, Merten M and Hwu W Run-time spatial locality detection and optimization Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture, (57-64)
  33. Manjikian N Combining Loop Fusion with Prefetching on Shared-memory Multiprocessors Proceedings of the international Conference on Parallel Processing
  34. Lim H and Yew P A Compiler-Directed Cache Coherence Scheme Using Data Prefetching Proceedings of the 11th International Symposium on Parallel Processing, (643-649)
  35. Horowitz M, Martonosi M, Mowry T and Smith M Informing memory operations Proceedings of the 23rd annual international symposium on Computer architecture, (260-270)
  36. Horowitz M, Martonosi M, Mowry T and Smith M (1996). Informing memory operations, ACM SIGARCH Computer Architecture News, 24:2, (260-270), Online publication date: 1-May-1996.
  37. Saavedra-Barrera R, Mao W, Park D, Chame J and Moon S The Combined Effectiveness of Unimodular Transformations, Tiling, and Software Prefetching Proceedings of the 10th International Parallel Processing Symposium, (39-45)
  38. Cierniak M and Li W Unifying data and control transformations for distributed shared-memory machines Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation, (205-217)
  39. Cierniak M and Li W (1995). Unifying data and control transformations for distributed shared-memory machines, ACM SIGPLAN Notices, 30:6, (205-217), Online publication date: 1-Jun-1995.
  40. Lebeck A and Wood D (1995). Active memory, ACM SIGMETRICS Performance Evaluation Review, 23:1, (220-230), Online publication date: 1-May-1995.
  41. Lebeck A and Wood D Active memory Proceedings of the 1995 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems, (220-230)
  42. Kavi K, Hurson A, Patadia P, Abraham E and Shanmugam P Design of cache memories for multi-threaded dataflow architecture Proceedings of the 22nd annual international symposium on Computer architecture, (253-264)
  43. Li W Compiler cache optimizations for banded matrix problems Proceedings of the 9th international conference on Supercomputing, (21-30)
  44. Kavi K, Hurson A, Patadia P, Abraham E and Shanmugam P (1995). Design of cache memories for multi-threaded dataflow architecture, ACM SIGARCH Computer Architecture News, 23:2, (253-264), Online publication date: 1-May-1995.
  45. Lipasti M, Schmidt W, Kunkel S and Roediger R SPAID Proceedings of the 28th annual international symposium on Microarchitecture, (231-236)
  46. Mckee S and Wulf W Access ordering and memory-conscious cache utilization Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
  47. John L, Reddy V, Hulina P and Coraor L Program balance and its impact on high performance RISC architectures Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
  48. Lebeck A and Wood D (1994). Cache Profiling and the SPEC Benchmarks, Computer, 27:10, (15-26), Online publication date: 1-Oct-1994.
  49. Fahringer T and Zima H A static parameter based performance prediction tool for parallel programs Proceedings of the 7th international conference on Supercomputing, (207-219)
  50. Öner K and Dubois M Effects of memory latencies on non-blocking processor/cache architectures Proceedings of the 7th international conference on Supercomputing, (338-347)
  51. Gharachorloo K, Gupta A and Hennessy J Hiding memory latency using dynamic scheduling in shared-memory multiprocessors Proceedings of the 19th annual international symposium on Computer architecture, (22-33)
  52. Maslov V Delinearization Proceedings of the ACM SIGPLAN 1992 conference on Programming language design and implementation, (152-161)
  53. Maslov V (1992). Delinearization, ACM SIGPLAN Notices, 27:7, (152-161), Online publication date: 1-Jul-1992.
  54. Mowry T, Lam M and Gupta A Design and evaluation of a compiler algorithm for prefetching Proceedings of the fifth international conference on Architectural support for programming languages and operating systems, (62-73)
  55. Li W and Pingali K Access normalization Proceedings of the fifth international conference on Architectural support for programming languages and operating systems, (285-295)
  56. Kennedy K and McKinley K Optimizing for parallelism and data locality Proceedings of the 6th international conference on Supercomputing, (323-334)
  57. Mowry T, Lam M and Gupta A (1992). Design and evaluation of a compiler algorithm for prefetching, ACM SIGPLAN Notices, 27:9, (62-73), Online publication date: 1-Sep-1992.
  58. Li W and Pingali K (1992). Access normalization, ACM SIGPLAN Notices, 27:9, (285-295), Online publication date: 1-Sep-1992.
  59. Gharachorloo K, Gupta A and Hennessy J (1992). Hiding memory latency using dynamic scheduling in shared-memory multiprocessors, ACM SIGARCH Computer Architecture News, 20:2, (22-33), Online publication date: 1-May-1992.
  60. Carr S and Kennedy K Compiler blockability of numerical algorithms Proceedings of the 1992 ACM/IEEE conference on Supercomputing, (114-124)
  61. Hsieh B, Hind M and Cytron R Loop distribution with multiple exits Proceedings of the 1992 ACM/IEEE conference on Supercomputing, (204-213)
  62. Havlak P and Kennedy K (1991). An Implementation of Interprocedural Bounded Regular Section Analysis, IEEE Transactions on Parallel and Distributed Systems, 2:3, (350-360), Online publication date: 1-Jul-1991.
  63. Wolf M and Lam M (1991). A Loop Transformation Theory and an Algorithm to Maximize Parallelism, IEEE Transactions on Parallel and Distributed Systems, 2:4, (452-471), Online publication date: 1-Oct-1991.
  64. Callahan D, Kennedy K and Porterfield A Software prefetching Proceedings of the fourth international conference on Architectural support for programming languages and operating systems, (40-52)
  65. Lam M, Rothberg E and Wolf M The cache performance and optimizations of blocked algorithms Proceedings of the fourth international conference on Architectural support for programming languages and operating systems, (63-74)
  66. Callahan D, Kennedy K and Porterfield A (1991). Software prefetching, ACM SIGPLAN Notices, 26:4, (40-52), Online publication date: 2-Apr-1991.
  67. Lam M, Rothberg E and Wolf M (1991). The cache performance and optimizations of blocked algorithms, ACM SIGPLAN Notices, 26:4, (63-74), Online publication date: 2-Apr-1991.
  68. Callahan D, Kennedy K and Porterfield A (1991). Software prefetching, ACM SIGOPS Operating Systems Review, 25:Special Issue, (40-52), Online publication date: 2-Apr-1991.
  69. Lam M, Rothberg E and Wolf M (1991). The cache performance and optimizations of blocked algorithms, ACM SIGOPS Operating Systems Review, 25:Special Issue, (63-74), Online publication date: 2-Apr-1991.
  70. Callahan D, Kennedy K and Porterfield A (1991). Software prefetching, ACM SIGARCH Computer Architecture News, 19:2, (40-52), Online publication date: 2-Apr-1991.
  71. Lam M, Rothberg E and Wolf M (1991). The cache performance and optimizations of blocked algorithms, ACM SIGARCH Computer Architecture News, 19:2, (63-74), Online publication date: 2-Apr-1991.
  72. Goff G, Kennedy K and Tseng C Practical dependence testing Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation, (15-29)
  73. Wolf M and Lam M A data locality optimizing algorithm Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation, (30-44)
  74. Goff G, Kennedy K and Tseng C (1991). Practical dependence testing, ACM SIGPLAN Notices, 26:6, (15-29), Online publication date: 1-Jun-1991.
  75. Wolf M and Lam M (1991). A data locality optimizing algorithm, ACM SIGPLAN Notices, 26:6, (30-44), Online publication date: 1-Jun-1991.
  76. Klaiber A and Levy H An architecture for software-controlled data prefetching Proceedings of the 18th annual international symposium on Computer architecture, (43-53)
  77. Gupta A, Hennessy J, Gharachorloo K, Mowry T and Weber W Comparative evaluation of latency reducing and tolerating techniques Proceedings of the 18th annual international symposium on Computer architecture, (254-263)
  78. Klaiber A and Levy H (1991). An architecture for software-controlled data prefetching, ACM SIGARCH Computer Architecture News, 19:3, (43-53), Online publication date: 1-May-1991.
  79. Gupta A, Hennessy J, Gharachorloo K, Mowry T and Weber W (1991). Comparative evaluation of latency reducing and tolerating techniques, ACM SIGARCH Computer Architecture News, 19:3, (254-263), Online publication date: 1-May-1991.
  80. Gornish E, Granston E and Veidenbaum A (1990). Compiler-directed data prefetching in multiprocessors with memory hierarchies, ACM SIGARCH Computer Architecture News, 18:3b, (354-368), Online publication date: 1-Sep-1990.
  81. Gornish E, Granston E and Veidenbaum A Compiler-directed data prefetching in multiprocessors with memory hierarchies Proceedings of the 4th international conference on Supercomputing, (354-368)
  82. Callahan D, Carr S and Kennedy K Improving register allocation for subscripted variables Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation, (53-65)
  83. Callahan D, Carr S and Kennedy K (1990). Improving register allocation for subscripted variables, ACM SIGPLAN Notices, 25:6, (53-65), Online publication date: 1-Jun-1990.
  84. Kennedy K and McKinley K Loop distribution with arbitrary control flow Proceedings of the 1990 ACM/IEEE conference on Supercomputing, (407-416)
  85. Callahan D and Porterfield A Data cache performance of supercomputer applications Proceedings of the 1990 ACM/IEEE conference on Supercomputing, (564-572)
  86. Havlak P and Kennedy K Experience with interprocedural analysis of array side effects Proceedings of the 1990 ACM/IEEE conference on Supercomputing, (952-961)
Contributors
  • The University of North Carolina at Chapel Hill
  • Rice University
