Software methods for improvement of cache performance on supercomputer applications

January 1989

Author: Allan Kennedy Porterfield
Chairman: K. W. Kennedy

Publisher: Rice University, 6100 S. Main, Houston, TX, United States

Order Number: AAI9012855

Pages: 159

Abstract

Measurement of actual supercomputer cache performance has not previously been undertaken. PFC-Sim is a program-driven event tracing facility that can simulate the data cache performance of very long programs. PFC-Sim simulates the cache concurrently with program execution, which allows very long traces to be used. Programs with traces in excess of 4 billion entries have been used to measure the performance of various cache structures.
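The idea of simulating a cache while the program runs, rather than from a stored trace, can be illustrated with a minimal sketch. This is not PFC-Sim itself: the `DirectMappedCache` class, the `traced_sum` routine, and the cache geometry (16K bytes, 32-byte lines) are all assumptions chosen for illustration.

```python
class DirectMappedCache:
    """Hypothetical direct-mapped data cache that only counts hits and misses."""

    def __init__(self, size_bytes, line_bytes):
        self.line_bytes = line_bytes
        self.num_lines = size_bytes // line_bytes
        self.tags = [None] * self.num_lines  # memory line currently held in each slot
        self.hits = 0
        self.misses = 0

    def access(self, address):
        line = address // self.line_bytes   # memory line containing the address
        index = line % self.num_lines       # cache slot that line maps to
        if self.tags[index] == line:
            self.hits += 1
        else:
            self.misses += 1
            self.tags[index] = line         # replace whatever was in the slot

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def traced_sum(values, base_address, elem_bytes, cache):
    """The 'program': sums an array and reports each array reference
    to the cache model as it executes, so no trace is ever stored."""
    total = 0
    for i, v in enumerate(values):
        cache.access(base_address + i * elem_bytes)
        total += v
    return total


cache = DirectMappedCache(size_bytes=16 * 1024, line_bytes=32)
traced_sum(list(range(1024)), base_address=0, elem_bytes=8, cache=cache)
print(cache.hits, cache.misses, cache.hit_ratio())
```

Because each reference is consumed as soon as it is generated, memory use is independent of the length of the reference stream, which is what makes traces of billions of entries practical.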

PFC-Sim was used to measure the cache performance of array references in RiCEPS, a benchmark set of supercomputer applications. Average data cache hit ratios ranged from 70% for a 16K cache to 91% for a 256K cache. Programs with very large working sets exhibit poor cache performance even with large caches. The hit ratios of individual references are measured to be either 0% or 100%.

By locating the references that miss, attempts to improve memory performance can focus on references where improvement is possible. The compiler can estimate the overflow iteration, the number of loop iterations that can execute without filling the cache. Combined with the dependence graph, the overflow iteration can be used to determine, for each reference, whether execution will result in hits or misses.
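A rough way to picture the overflow iteration is to count how many iterations of a loop can run before the distinct cache lines touched exceed the cache capacity. The sketch below is a simplified stand-in, not the analysis from the thesis: it enumerates iterations instead of reasoning symbolically as a compiler would, it assumes a fully associative cache, and the function name `overflow_iteration` and the address functions are hypothetical.

```python
def overflow_iteration(refs, cache_bytes, line_bytes, max_iters):
    """Return the largest number of iterations (at most max_iters) whose
    references touch no more distinct cache lines than the cache can hold.

    refs is a list of functions mapping an iteration number to a byte address.
    """
    capacity = cache_bytes // line_bytes
    touched = set()
    for i in range(max_iters):
        new_lines = {ref(i) // line_bytes for ref in refs}
        if len(touched | new_lines) > capacity:
            return i  # iteration i would overflow the cache
        touched |= new_lines
    return max_iters


# Assumed loop body: a[i] + b[i], with 8-byte elements and
# a at byte address 0, b at byte address 1,000,000.
refs = [lambda i: 8 * i, lambda i: 1_000_000 + 8 * i]
print(overflow_iteration(refs, cache_bytes=16 * 1024, line_bytes=32, max_iters=10_000))
```

Under these assumptions, a value reused within fewer iterations than the estimate would be expected to hit, and one reused after more iterations would be expected to miss.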

Program transformation can be used to improve cache performance by reordering computation to move references to the same memory location closer together, thereby eliminating cache misses. Using the overflow iteration, the compiler can often perform this transformation automatically. Standard blocking transformations cannot be used on many loop nests that contain transformation-preventing dependences. Wavefront blocking allows any loop nest to be blocked when the components of its dependence vectors are bounded.
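The effect of moving reuses closer together can be seen in a small sketch of standard blocking (strip-mining the inner loop and interchanging), not wavefront blocking. The example assumes a loop nest with no dependences that prevent the interchange, a fully associative LRU cache holding one array element per line, and hypothetical names (`LRUCache`, `misses_unblocked`, `misses_blocked`).

```python
from collections import OrderedDict


class LRUCache:
    """Hypothetical fully associative cache with LRU replacement; counts misses."""

    def __init__(self, capacity_lines):
        self.capacity = capacity_lines
        self.lines = OrderedDict()
        self.misses = 0

    def access(self, line):
        if line in self.lines:
            self.lines.move_to_end(line)        # mark as most recently used
        else:
            self.misses += 1
            self.lines[line] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict least recently used


def misses_unblocked(n, m, capacity):
    # for i: for j: ... b[j] ...   (each b[j] is reused only after m references)
    cache = LRUCache(capacity)
    for i in range(n):
        for j in range(m):
            cache.access(j)
    return cache.misses


def misses_blocked(n, m, capacity, block):
    # Strip-mine j by `block` and move the strip loop outermost, so each
    # b[j] is reused after at most `block` references.
    cache = LRUCache(capacity)
    for jj in range(0, m, block):
        for i in range(n):
            for j in range(jj, min(jj + block, m)):
                cache.access(j)
    return cache.misses


print(misses_unblocked(8, 256, 64))    # 2048: every reference misses
print(misses_blocked(8, 256, 64, 64))  # 256: only the first touch of each element misses
```

In this toy model the block size is chosen so that one block fits in the cache, which is the role an overflow-iteration estimate would play in selecting it.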

When cache misses cannot be eliminated, software prefetching can overlap the miss delays with computation. Software prefetching uses a special instruction, a cache load, to preload values into the cache. A cache load resembles a register load in structure, but it does not block computation and only moves the addressed value into the cache, so a later register load is still required. The compiler can inform the cache, on average, over 100 cycles before a load is required, so cache misses can be serviced in parallel with computation.
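How issuing a cache load ahead of time hides miss latency can be shown with a toy timing model. Everything here is an assumption for illustration: each iteration loads one value that would miss, memory can service any number of outstanding prefetches, and the function `run_cycles` and its parameters are hypothetical rather than taken from the thesis.

```python
def run_cycles(n_iters, compute_cycles, miss_latency, prefetch_distance=0):
    """Total cycles for a loop in which every iteration needs one value
    that is not yet in the cache.

    prefetch_distance = 0 models no prefetching: each load stalls for the
    full miss latency. Otherwise, iteration i issues a non-blocking cache
    load for the value needed by iteration i + prefetch_distance.
    """
    now = 0
    ready_at = {}  # iteration -> cycle at which its prefetched value arrives
    for i in range(n_iters):
        if prefetch_distance > 0 and i + prefetch_distance < n_iters:
            ready_at[i + prefetch_distance] = now + miss_latency
        # The register load waits only if the value has not arrived yet.
        arrival = ready_at.get(i, now + miss_latency)
        now = max(now, arrival)
        now += compute_cycles
    return now


print(run_cycles(100, compute_cycles=10, miss_latency=100))                        # 11000
print(run_cycles(100, compute_cycles=10, miss_latency=100, prefetch_distance=10))  # 2000
```

With 10 cycles of computation per iteration, prefetching 10 iterations ahead gives the memory system 100 cycles of notice, so in this model only the first 10 iterations, which nothing prefetched for, still stall.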

Contributors

Allan Kennedy Porterfield
The University of North Carolina at Chapel Hill
Ken Wade Kennedy
Rice University

Recommendations

Evaluating the performance of software cache coherence
Special issue: Proceedings of ASPLOS-III: the third international conference on architecture support for programming languages and operating systems

In a shared-memory multiprocessor with private caches, cached copies of a data item must be kept consistent. This is called cache coherence. Both hardware and software coherence schemes have been proposed. Software techniques are attractive because they ...

A Performance Study of Instruction Cache Prefetching Methods

Prefetching methods for instruction caches are studied via trace-driven simulation. The two primary methods are "fall-through" prefetch (sometimes referred to as "one block lookahead") and "target" prefetch. Fall-through prefetches are for sequential ...

Data cache performance of supercomputer applications
Supercomputing '90: Proceedings of the 1990 ACM/IEEE conference on Supercomputing

Processor speed has been increasing faster than mass memory speed. One method of matching a processor's speed to memory's is high-speed caches. This paper examines the data cache performance of a set of computationally intensive programs. Our interset ...
