Multiprocessor cache memory performance: characterization and optimization
Publisher:
  • Stanford University
  • 408 Panama Mall, Suite 217
  • Stanford
  • CA
  • United States
Order Number: UMI Order No. GAX93-02325
Abstract

Good cache memory performance is critical to achieving high CPU utilization in shared-memory multiprocessors. Reliably characterizing the performance of multiprocessor caches is hard, however, for it often requires experimental measurements on real machines across several workload domains. In this dissertation, we characterize some of the major sources of cache performance degradation, namely data sharing, operating system activity, and poor reuse of cache state in multiprogrammed workloads. We use data from a hardware performance monitor in a high-performance 4-CPU multiprocessor running scientific, engineering, software-development, and database workloads.

While some of the misses on shared data result from the intrinsic inter-CPU communication required by the application, the rest, false sharing misses, are a consequence of the way data sharing interacts with multi-word cache blocks. We separate false sharing misses from the remaining, true sharing misses. We find that, while applications suffer from false sharing, their miss rate is also affected by the poor spatial locality of true sharing. To reduce the miss rate, we then evaluate optimizations of the layout of shared data in cache blocks.

We discover three major sources of operating system misses: instruction fetches, block operations (copy and clear), and process migration. Instruction misses are more common than suspected. They are often caused by operating system self-interference in the cache. Hence, we propose optimizing the layout of the operating system code and consider increasing the cache associativity. The effect of misses in block operations can be partially eliminated by using special support for these operations. Finally, process migration misses are a consequence of the poor reuse of cache state in multiprogrammed workloads.

In multiprogrammed workloads, the cache state built up by a process may be lost when the process is preempted, either because intervening processes destroy the state or because the process migrates to another CPU. We evaluate affinity scheduling, a technique that increases the reuse of cache state by encouraging processes to run on the CPUs whose caches keep useful state. We show that affinity scheduling attains most of the increase in cache state reuse possible in the workloads. Overall, affinity scheduling produces moderate speedups at nearly no cost.

Contributors
  • University of Illinois Urbana-Champaign
