As clock frequencies have tapered off and the number of cores on a chip has taken off, effectively utilizing these multicore systems has become an increasingly important challenge. However, the diversity of multicore machines in today's market compels us to tune for each platform individually. This is especially true for problems with low computational intensity, since memory latency and bandwidth improve much more slowly than computational rates.
One such kernel is a stencil, a regular nearest-neighbor operation over the points in a structured grid. Stencils often arise from solving partial differential equations, which are found in almost every scientific discipline. In this thesis, we analyze three common three-dimensional stencils: the 7-point stencil, the 27-point stencil, and the Gauss-Seidel Red-Black Helmholtz kernel.
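The 7-point stencil can be sketched as follows. This is a minimal illustration of the access pattern, not the tuned implementation from the thesis; the `alpha`/`beta` weights and the naive pure-Python loop nest are assumptions made for clarity:

```python
def stencil_7pt(grid, alpha, beta, n):
    """One sweep of a 7-point stencil on an n*n*n grid: each interior
    point is updated from itself (weight alpha) and its six face
    neighbors (weight beta). Boundary points are left unchanged."""
    # Copy the grid so every read sees the old time step (Jacobi-style).
    out = [[[grid[i][j][k] for k in range(n)] for j in range(n)]
           for i in range(n)]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                out[i][j][k] = (alpha * grid[i][j][k]
                                + beta * (grid[i - 1][j][k] + grid[i + 1][j][k]
                                          + grid[i][j - 1][k] + grid[i][j + 1][k]
                                          + grid[i][j][k - 1] + grid[i][j][k + 1]))
    return out
```

Each output point touches seven inputs but performs only eight flops, which is why stencils have the low computational intensity discussed above.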
We examine the performance of these stencil codes across a spectrum of multicore architectures: the Intel Clovertown, the Intel Nehalem, the AMD Barcelona, the highly multithreaded Sun Victoria Falls, and the low-power IBM Blue Gene/P. These platforms not only differ significantly in core architecture, but also exhibit a 32× range in available hardware threads, a 4.5× range in attained DRAM bandwidth, and a 6.3× range in peak flop rates. Clearly, designing optimal code for such a diverse set of platforms is a serious challenge.
Unfortunately, compilers alone do not achieve satisfactory stencil code performance on this varied set of platforms. Instead, we have created an automatic stencil code tuner, or auto-tuner, that incorporates several optimizations into a single software framework. These optimizations hide memory latency, account for non-uniform memory access times, reduce the volume of data transferred, and exploit special instructions. The auto-tuner then searches over the space of optimizations, allowing for much greater productivity than hand-tuning. The fully auto-tuned code runs up to 5.4× faster than a straightforward implementation and scales better across cores.
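The search step can be illustrated with a minimal timing loop. This is a sketch only: the `kernel` callback and candidate list are hypothetical stand-ins, and the real auto-tuner explores a much richer optimization space than a single block-size parameter:

```python
import time

def autotune(kernel, candidates, trials=3):
    """Empirically time `kernel` at each candidate parameter value
    (e.g. a cache-block size) and return the fastest one. Taking the
    minimum over several trials reduces timing noise."""
    best, best_t = candidates[0], float("inf")
    for b in candidates:
        t = min(_time_once(kernel, b) for _ in range(trials))
        if t < best_t:
            best, best_t = b, t
    return best

def _time_once(kernel, b):
    t0 = time.perf_counter()
    kernel(b)
    return time.perf_counter() - t0
```

Because the winner is chosen by measurement rather than by a static cost model, the same search code adapts to each platform's memory hierarchy without per-machine analysis.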
Using performance models to identify each platform's attainable limits, we determined that our auto-tuner achieves over 95% of the attainable performance for all three stencils in our study. This demonstrates that auto-tuning is an important technique for fully exploiting available multicore resources.