As clock frequencies have tapered off and the number of cores on a chip has taken off, effectively utilizing these multicore systems has become an increasingly important challenge. However, the diversity of multicore machines in today's market compels us to tune for each platform individually. This is especially true for problems with low computational intensity, since memory latency and bandwidth improve much more slowly than computational rates.
One such kernel is a stencil, a regular nearest-neighbor operation over the points in a structured grid. Stencils often arise from solving partial differential equations, which are found in almost every scientific discipline. In this thesis, we analyze three common three-dimensional stencils: the 7-point stencil, the 27-point stencil, and the Gauss-Seidel Red-Black Helmholtz kernel.
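The 7-point stencil can be sketched as follows. This is a minimal illustration of the access pattern, not the tuned implementation from the thesis; the `alpha`/`beta` weights and the naive pure-Python loop nest are assumptions made for clarity:

```python
def stencil_7pt(grid, alpha, beta, n):
    """One sweep of a 7-point stencil on an n*n*n grid: each interior
    point is updated from itself (weight alpha) and its six face
    neighbors (weight beta). Boundary points are left unchanged."""
    # Copy the grid so every read sees the old time step (Jacobi-style).
    out = [[[grid[i][j][k] for k in range(n)] for j in range(n)]
           for i in range(n)]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                out[i][j][k] = (alpha * grid[i][j][k]
                                + beta * (grid[i - 1][j][k] + grid[i + 1][j][k]
                                          + grid[i][j - 1][k] + grid[i][j + 1][k]
                                          + grid[i][j][k - 1] + grid[i][j][k + 1]))
    return out
```

Each output point touches seven inputs but performs only eight flops, which is why stencils have the low computational intensity discussed above.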
We examine the performance of these stencil codes across a spectrum of multicore architectures: the Intel Clovertown, the Intel Nehalem, the AMD Barcelona, the highly multithreaded Sun Victoria Falls, and the low-power IBM Blue Gene/P. These platforms not only differ significantly in core architecture, but also exhibit a 32× range in available hardware threads, a 4.5× range in attained DRAM bandwidth, and a 6.3× range in peak flop rates. Clearly, designing optimal code for such a diverse set of platforms is a serious challenge.
Unfortunately, compilers alone do not achieve satisfactory stencil code performance on this varied set of platforms. Instead, we have created an automatic stencil code tuner, or auto-tuner, that incorporates several optimizations into a single software framework. These optimizations hide memory latency, account for non-uniform memory access times, reduce the volume of data transferred, and exploit special instructions. The auto-tuner then searches over the space of optimizations, allowing for much greater productivity than hand-tuning. The fully auto-tuned code runs up to 5.4× faster than a straightforward implementation and scales better across cores.
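The search step can be illustrated with a minimal timing loop. This is a sketch only: the `kernel` callback and candidate list are hypothetical stand-ins, and the real auto-tuner explores a much richer optimization space than a single block-size parameter:

```python
import time

def autotune(kernel, candidates, trials=3):
    """Empirically time `kernel` at each candidate parameter value
    (e.g. a cache-block size) and return the fastest one. Taking the
    minimum over several trials reduces timing noise."""
    best, best_t = candidates[0], float("inf")
    for b in candidates:
        t = min(_time_once(kernel, b) for _ in range(trials))
        if t < best_t:
            best, best_t = b, t
    return best

def _time_once(kernel, b):
    t0 = time.perf_counter()
    kernel(b)
    return time.perf_counter() - t0
```

Because the winner is chosen by measurement rather than by a static cost model, the same search code adapts to each platform's memory hierarchy without per-machine analysis.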
Using performance models to identify each platform's attainable limits, we determined that our auto-tuner achieves over 95% of the attainable performance for all three stencils in our study. This demonstrates that auto-tuning is an important technique for fully exploiting available multicore resources.