Auto-tuning stencil codes for cache-based multicore platforms
Publisher:
  • University of California at Berkeley
  • Computer Science Division 571 Evans Hall Berkeley, CA
  • United States
ISBN:978-1-124-03708-0
Order Number:AAI3411221
Pages: 197
Abstract

As clock frequencies have plateaued and the number of cores per chip has soared, effectively utilizing multicore systems has become increasingly important. However, the diversity of multicore machines on today's market compels us to tune for each platform individually. This is especially true for problems with low computational intensity, since memory latency and bandwidth improve far more slowly than computational rates.

One such kernel is a stencil, a regular nearest neighbor operation over the points in a structured grid. Stencils often arise from solving partial differential equations, which are found in almost every scientific discipline. In this thesis, we analyze three common three-dimensional stencils: the 7-point stencil, the 27-point stencil, and the Gauss-Seidel Red-Black Helmholtz kernel.
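
To make the operation concrete, here is a minimal pure-Python sketch of one sweep of a 7-point stencil (this is illustrative code, not code from the thesis; the function name and the coefficients `alpha` and `beta` are hypothetical). Each interior point is updated from its own value plus its six face neighbors in a 3-D grid stored as a flat array:

```python
def stencil_7pt(grid, n, alpha, beta):
    """One out-of-place sweep of a 7-point stencil over the interior of an
    n x n x n grid. `grid` is a flat list indexed as grid[i*n*n + j*n + k]."""
    out = grid[:]                      # copy; boundary points stay untouched
    n2 = n * n
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                c = i * n2 + j * n + k
                out[c] = (alpha * grid[c]
                          + beta * (grid[c - 1] + grid[c + 1]        # k neighbors
                                    + grid[c - n] + grid[c + n]      # j neighbors
                                    + grid[c - n2] + grid[c + n2]))  # i neighbors
    return out
```

The flat indexing with strides of 1, `n`, and `n*n` mirrors how real stencil codes lay out 3-D grids in memory, which is what makes cache behavior the dominant performance concern.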

We examine the performance of these stencil codes across a spectrum of multicore architectures, including the Intel Clovertown, Intel Nehalem, AMD Barcelona, the highly multithreaded Sun Victoria Falls, and the low-power IBM Blue Gene/P. These platforms not only differ significantly in their core architectures but also exhibit a 32× range in available hardware threads, a 4.5× range in attained DRAM bandwidth, and a 6.3× range in peak flop rates. Clearly, designing optimal code for such a diverse set of platforms is a serious challenge.

Unfortunately, compilers alone do not achieve satisfactory stencil code performance on this varied set of platforms. Instead, we have created an automatic stencil code tuner, or auto-tuner, that incorporates several optimizations into a single software framework. These optimizations hide memory latency, account for non-uniform memory access times, reduce the volume of data transferred, and exploit special instructions. The auto-tuner then searches over the space of optimizations, allowing for much greater productivity than hand-tuning. The fully auto-tuned code runs up to 5.4× faster than a straightforward implementation and scales better across cores.
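
The search idea can be illustrated with a toy sketch (hypothetical, not the thesis's actual framework): benchmark a cache-blocked stencil sweep at several candidate block sizes and keep the fastest. A real auto-tuner explores a far larger space of optimizations and parameters, typically with pruning heuristics:

```python
import time

def blocked_sweep(grid, n, bi, bj):
    """One 7-point sweep with (bi, bj) cache blocking in the i/j dimensions.
    Coefficients 0.25 and 0.125 are illustrative; they sum (with the six
    neighbors) to 1, so an all-ones grid maps to an all-ones grid."""
    out = grid[:]
    n2 = n * n
    for ii in range(1, n - 1, bi):
        for jj in range(1, n - 1, bj):
            for i in range(ii, min(ii + bi, n - 1)):
                for j in range(jj, min(jj + bj, n - 1)):
                    for k in range(1, n - 1):
                        c = i * n2 + j * n + k
                        out[c] = 0.25 * grid[c] + 0.125 * (
                            grid[c - 1] + grid[c + 1]
                            + grid[c - n] + grid[c + n]
                            + grid[c - n2] + grid[c + n2])
    return out

def autotune(kernel, grid, n, candidate_blocks):
    """Time the kernel at every candidate block size and return the fastest
    configuration -- a toy version of an auto-tuner's exhaustive search."""
    timings = {}
    for bi, bj in candidate_blocks:
        t0 = time.perf_counter()
        kernel(grid, n, bi, bj)
        timings[(bi, bj)] = time.perf_counter() - t0
    return min(timings, key=timings.get)
```

Which block size wins depends on the machine's cache hierarchy, which is exactly why the search must be rerun per platform rather than hard-coded.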

By using performance models to identify performance limits, we determined that our auto-tuner can achieve over 95% of the attainable performance for all three stencils in our study. This demonstrates that auto-tuning is an important technique for fully exploiting available multicore resources.
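
The notion of "attainable performance" can be sketched with a simple roofline-style bound: attainable GFlop/s is the minimum of the peak compute rate and memory bandwidth times arithmetic intensity. The numbers below are hypothetical, not measurements from the thesis:

```python
def roofline_bound(peak_gflops, bandwidth_gbs, flops, bytes_moved):
    """Attainable GFlop/s under a basic roofline model:
    min(peak compute rate, memory bandwidth * arithmetic intensity)."""
    intensity = flops / bytes_moved          # flops per byte of DRAM traffic
    return min(peak_gflops, bandwidth_gbs * intensity)

# Hypothetical example: a 7-point stencil performing 8 flops per point while
# moving 16 bytes per point (one double read + one double write, ideal caching)
# on a machine with 74.7 GFlop/s peak and 21.3 GB/s of DRAM bandwidth.
bound = roofline_bound(peak_gflops=74.7, bandwidth_gbs=21.3,
                       flops=8, bytes_moved=16)
```

Because the bandwidth term is the smaller of the two here, the kernel is memory-bound, which is why the thesis measures auto-tuned performance against this kind of limit rather than against peak flops.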

Cited By

  1. Plotnitskii P, Beaurepaire L, Qu L, Akbudak K, Ltaief H and Keyes D. Leveraging the High Bandwidth of Last-Level Cache for HPC Seismic Imaging Applications. Proceedings of the Platform for Advanced Scientific Computing Conference, (1-13).
  2. Qu L, Abdelkhalak R, Ltaief H, Said I and Keyes D (2023). Exploiting temporal data reuse and asynchrony in the reverse time migration, International Journal of High Performance Computing Applications, 37:2, (132-150), Online publication date: 1-Mar-2023.
  3. Kronawitter S and Lengauer C (2018). Polyhedral Search Space Exploration in the ExaStencils Code Generator, ACM Transactions on Architecture and Code Optimization, 15:4, (1-25), Online publication date: 31-Dec-2019.
  4. Malas T, Hager G, Ltaief H and Keyes D (2017). Multidimensional Intratile Parallelization for Memory-Starved Stencil Computations, ACM Transactions on Parallel Computing, 4:3, (1-32), Online publication date: 27-Apr-2018.
  5. Cattaneo R, Natale G, Sicignano C, Sciuto D and Santambrogio M (2015). On How to Accelerate Iterative Stencil Loops, ACM Transactions on Architecture and Code Optimization, 12:4, (1-26), Online publication date: 7-Jan-2016.
  6. Rawat P, Kong M, Henretty T, Holewinski J, Stock K, Pouchet L, Ramanujam J, Rountev A and Sadayappan P. SDSLc. Proceedings of the 5th International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing, (1-10).
  7. Hammouda A, Siegel A and Siegel S (2015). Noise-Tolerant Explicit Stencil Computations for Nonuniform Process Execution Rates, ACM Transactions on Parallel Computing, 2:1, (1-33), Online publication date: 21-May-2015.
  8. Satish N, Kim C, Chhugani J, Saito H, Krishnaiyer R, Smelyanskiy M, Girkar M and Dubey P (2015). Can traditional programming bridge the ninja performance gap for parallel computing applications?, Communications of the ACM, 58:5, (77-86), Online publication date: 23-Apr-2015.
  9. Shrestha S, Gao G, Manzano J, Marquez A and Feo J. Locality aware concurrent start for stencil applications. Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization, (157-166).
  10. Stock K, Kong M, Grosser T, Pouchet L, Rastello F, Ramanujam J and Sadayappan P. A framework for enhancing data reuse via associative reordering. Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, (65-76).
  11. Stock K, Kong M, Grosser T, Pouchet L, Rastello F, Ramanujam J and Sadayappan P (2014). A framework for enhancing data reuse via associative reordering, ACM SIGPLAN Notices, 49:6, (65-76), Online publication date: 5-Jun-2014.
  12. Bandishti V, Pananilath I and Bondhugula U. Tiling stencil computations to maximize parallelism. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, (1-11).
  13. Christen M, Schenk O and Cui Y. Patus for convenient high-performance stencils. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, (1-10).
  14. Satish N, Kim C, Chhugani J, Saito H, Krishnaiyer R, Smelyanskiy M, Girkar M and Dubey P (2012). Can traditional programming bridge the Ninja performance gap for parallel computing applications?, ACM SIGARCH Computer Architecture News, 40:3, (440-451), Online publication date: 5-Sep-2012.
  15. Holewinski J, Pouchet L and Sadayappan P. High-performance code generation for stencil computations on GPU architectures. Proceedings of the 26th ACM International Conference on Supercomputing, (311-320).
  16. Satish N, Kim C, Chhugani J, Saito H, Krishnaiyer R, Smelyanskiy M, Girkar M and Dubey P. Can traditional programming bridge the Ninja performance gap for parallel computing applications? Proceedings of the 39th Annual International Symposium on Computer Architecture, (440-451).
  17. Udupa A, Rajan K and Thies W. ALTER. Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, (480-491).
  18. Udupa A, Rajan K and Thies W (2011). ALTER, ACM SIGPLAN Notices, 46:6, (480-491), Online publication date: 4-Jun-2011.
  19. Tang Y, Chowdhury R, Kuszmaul B, Luk C and Leiserson C. The Pochoir stencil compiler. Proceedings of the Twenty-Third Annual ACM Symposium on Parallelism in Algorithms and Architectures, (117-128).
  20. Nguyen A, Satish N, Chhugani J, Kim C and Dubey P. 3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs. Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, (1-13).
Contributors
  • University of California, Berkeley
  • University of California, Berkeley
