ABSTRACT
As processor clock rates become more dynamic and workloads become more adaptive, the vulnerability to global synchronization that already complicates programming for performance in today's petascale environment will be exacerbated. Algebraic multigrid (AMG), the solver of choice in many large-scale PDE-based simulations, scales well in the weak sense, with fixed problem size per node, on tightly coupled systems when loads are well balanced and core performance is reliable. However, its strong scaling to many cores within a node is challenging. Reducing synchronization and increasing concurrency are vital adaptations of AMG to hybrid architectures. Recent communication-reducing improvements to classical additive AMG by Vassilevski and Yang improve concurrency and increase communication-computation overlap, while retaining convergence properties close to those of standard multiplicative AMG, but remain bulk synchronous. We extend the Vassilevski and Yang additive AMG to asynchronous task-based parallelism using a hybrid MPI+OmpSs (from the Barcelona Supercomputer Center) within a node, along with MPI for internode communications. We implement a tiling approach to decompose the grid hierarchy into parallel units within task containers. We compare against the MPI-only BoomerAMG and the Auxiliary-space Maxwell Solver (AMS) in the hypre library for the 3D Laplacian operator and the electromagnetic diffusion, respectively. In time to solution for a full solve an MPI-OmpSs hybrid improves over an all-MPI approach in strong scaling at full core count (32 threads per single Haswell node of the Cray XC40) and maintains this per node advantage as both weak scale to thousands of cores, with MPI between nodes.
- 2016. Intel Xeon Processor E5 v3 Product Family: Processor Specification Update. Technical Report 330785-010US. Intel Corporation.Google Scholar
- Ahmad Abdelfattah, David Keyes, and Hatem Ltaief. 2016. KBLAS: An Optimized Library for Dense Matrix-Vector Multiplication on GPU Accelerators. ACM Trans. Math. Softw. 42, 3, Article 18 (May 2016), 31 pages. Google ScholarDigital Library
- Emmanuel Agullo, Jim Demmel, Jack Dongarra, Bilel Hadri, Jakub Kurzak, Julien Langou, Hatem Ltaief, Piotr Luszczek, and Stanimire Tomov. 2009. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. Journal of Physics: Conference Series 180, 1 (2009), 012037. http://stacks.iop.org/1742-6596/180/i=1/a=012037Google ScholarCross Ref
- Cédric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre-André Wacrenier. 2011. StarPU: a unified platform for task scheduling on heterogeneous multicore architectures. Concurrency and Computation: Practice and Experience 23, 2 (2011), 187--198. Google ScholarDigital Library
- Allison H. Baker, Robert D. Falgout, Todd Gamblin, Tzanio V. Kolev, Martin Schulz, and Ulrike Meier Yang. 2012. Scaling Algebraic Multigrid Solvers: On the Road to Exascale. Springer Berlin Heidelberg, Berlin, Heidelberg, 215--226.Google Scholar
- Barcelona Supercomputing Center 2016. Extrae. Barcelona Supercomputing Center. https://www.bsc.es/computer-sciences/extraeGoogle Scholar
- Barcelona Supercomputing Center 2016. Paraver: Performance Analysis Tools. Barcelona Supercomputing Center. https://www.bsc.es/computer-sciences/performance-tools/paraverGoogle Scholar
- Peter Bastian, Markus Blatt, Andreas Dedner, Christian Engwer, Robert Klöfkorn, Ralf Kornhuber, Mario Ohlberger, and Oliver Sander. 2008. A generic grid interface for parallel and adaptive scientific computing. Part II: Implementation and tests in DUNE. Computing 82, 2 (2008), 121--138. Google ScholarDigital Library
- Michael Bauer, Sean Treichler, Elliott Slaughter, and Alex Aiken. 2012. Legion: Expressing Locality and Independence with Logical Regions. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC '12). IEEE Computer Society Press, Los Alamitos, CA, USA, Article 66, 11 pages. http://dl.acm.org/citation.cfm?id=2388996.2389086 Google ScholarDigital Library
- Wieb Bosma, John Cannon, and Catherine Playoust. 1997. The MAGMA algebra system. I. The user language. J. Symbolic Comput. 24, 3--4 (1997), 235--265. Computational algebra and number theory (London, 1993). Google ScholarDigital Library
- Achi Brandt. 1977. Multi-level adaptive solutions to boundary-value problems. Math. Comp. 31 (1977), 333--390.Google ScholarCross Ref
- Marian Brezina, R Falgout, Scott MacLachlan, T Manteuffel, S McCormick, and John Ruge. 2006. Adaptive Algebraic Multigrid. SIAM Journal on Scientific Computing 27, 4 (2006), 1261--1286. Google ScholarDigital Library
- Center for Applied Scientific Computing, Lawrence Livermore National Laboratory 2016. hypre: High Performnce Preconditioners. Center for Applied Scientific Computing, Lawrence Livermore National Laboratory.Google Scholar
- Hans De Sterck, Robert D. Falgout, Joshua W. Nolting, and Ulrike Meier Yang. 2008. Distance-two interpolation for parallel algebraic multigrid. Numerical Linear Algebra with Applications 15, 2--3 (2008), 115--139.Google ScholarCross Ref
- Jack Dongarra, Jakub Kurzak, Asim YarKhan, Mathieu Faverge, and Vijay Joshi. 2011. QUARK (QUeuing And Runtime for Kernels) Library. (2011). http://icl.cs.utk.edu/quark/index.htmlGoogle Scholar
- Gábor Dózsa, Sameer Kumar, Pavan Balaji, Darius Buntinas, David Goodell, William Gropp, Joe Ratterman, and Rajeev Thakur. 2010. Enabling Concurrent Multithreaded MPI Communication on Multicore Petascale Systems. Springer Berlin Heidelberg, Berlin, Heidelberg, 11--20. Google ScholarDigital Library
- Alejandro Duran, Eduard Ayguad, Rosa M. Badia, Jesus Labarta, Luis Martinell, Xavier Martorell, and Judit Planas. 2011. OmpSs: A PROPOSAL FOR PROGRAMMING HETEROGENEOUS MULTI-CORE ARCHITECTURES. Parallel Processing Letters 21, 02 (2011), 173--193.Google ScholarCross Ref
- Hormozd Gahvari, Allison H. Baker, Martin Schulz, Ulrike Meier Yang, Kirk E. Jordan, and William Gropp. 2011. Modeling the Performance of an Algebraic Multigrid Cycle on HPC Platforms. In Proceedings of the International Conference on Supercomputing (ICS '11). ACM, New York, NY, USA, 172--181. Google ScholarDigital Library
- M.W. Gee, C.M. Siefert, J.J. Hu, R.S. Tuminaro, and M.G. Sala. 2006. ML 5.0 Smoothed Aggregation User's Guide. Technical Report SAND 2006-2649. Sandia National Laboratories.Google Scholar
- W. Hackbusch. 1985. Multigrid Methods and Applications. Springer-Verlag.Google Scholar
- Van Emden Henson and Ulrike Meier Yang. 2002. BoomerAMG: A parallel algebraic multigrid solver and preconditioner. Applied Numerical Mathematics 41, 1 (2002), 155--177. Google ScholarDigital Library
- Torsten Hoefler, James Dinan, Darius Buntinas, Pavan Balaji, Brian Barrett, Ron Brightwell, William Gropp, Vivek Kale, and Rajeev Thakur. 2013. MPI+ MPI: a new hybrid approach to parallel programming with MPI plus shared memory. Computing 95, 12 (2013), 1121--1136. Google ScholarDigital Library
- Tzanio V. Kolev and Panayot S. Vassilevski. 2008. Auxiliary Space AMG for H(curl) Problems. Springer Berlin Heidelberg, Berlin, Heidelberg, 147--154.Google Scholar
- Jesús Labarta. 2010. StarSs: A Programming Model for the Multicore Era. (2010-03-01 2010).Google Scholar
- X. Lacoste. 2015. Scheduling and memory optimizations for sparse direct solver on multi-core/multi-GPU cluster systems. Ph.D. Dissertation. LaBRI, UniversitÃl' Bordeaux, Talence, France.Google Scholar
- Jan Mandel, Steve McCormick, and John Ruge. 1988. An Algebraic Theory for Multigrid Methods for Variational Problems. SIAM J. Numer. Anal. 25, 1 (Feb. 1988), 91--110. Google ScholarDigital Library
- S. McCormick. 1989. Multilevel Adaptive Methods for Partial Differential Equations. Society for Industrial and Applied Mathematics. Google ScholarDigital Library
- William F. Mitchell. 1999. The Full Domain Partition Approach to Parallel Adaptive Refinement. Springer New York, New York, NY, 151--161.Google Scholar
- Philip J. Mucci, Shirley Browne, Christine Deane, and George Ho. 1999. PAPI: A Portable Interface to Hardware Performance Counters. In In Proceedings of the Department of Defense HPCMP Users Group Conference. 7--10.Google Scholar
- Jongsoo Park, Mikhail Smelyanskiy, Ulrike Meier Yang, Dheevatsa Mudigere, and Pradeep Dubey. 2015. High-performance Algebraic Multigrid Solver Optimized for Multi-core Based Distributed Parallel Systems. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '15). ACM, New York, NY, USA, Article 54, 12 pages. Google ScholarDigital Library
- Sebastian Reiter, Andreas Vogel, Ingo Heppner, Martin Rupp, and Gabriel Wittum. 2013. A Massively Parallel Geometric Multigrid Solver on Hierarchically Distributed Grids. Comput. Vis. Sci. 16, 4 (Aug. 2013), 151--164. Google ScholarDigital Library
- Hans De Sterck, Ulrike Meier Yang, and Jeffrey J. Heys. 2006. Reducing Complexity in Parallel Algebraic Multigrid Preconditioners. SIAM J. Matrix Anal. Appl. 27, 4 (2006), 1019--1039. arXiv:http://dx.doi.org/10.1137/040615729 Google ScholarDigital Library
- Didem Unat, Tan Nguyen, Weiqun Zhang, Muhammed Nufail Farooqi, Burak Bastem, George Michelogiannakis, Ann Almgren, and John Shalf. 2016. TiDA: High-Level Programming Abstractions for Data Locality Management. Springer International Publishing, Cham, 116--135.Google Scholar
- Panayot S. Vassilevski and Ulrike M. Yang. 2014. Reducing communication in algebraic multigrid using additive variants. Numerical Lin. Alg. with Applic. 21, 2 (2014), 275--296.Google ScholarCross Ref
- P. Wesseling. 1982. A robust and efficient multigrid method. Springer Berlin Heidelberg, Berlin, Heidelberg, 614--630.Google Scholar
- Ulrike Meier Yang. 2006. Parallel Algebraic Multigrid Methods --- High Performance Preconditioners. Springer Berlin Heidelberg, Berlin, Heidelberg, 209--236.Google Scholar
Index Terms
- Asynchronous Task-Based Parallelization of Algebraic Multigrid
Recommendations
Parallelization of stochastic bounds for Markov chains on multicore and manycore platforms
The author demonstrates the methodology for parallelizing of finding stochastic bounds for Markov chains on multicore and manycore platforms. The stochastic bounds algorithm for Markov chains with the sparse matrices is investigated, thus needing a lot ...
Higher-level parallelization for local and distributed asynchronous task-based programming
ESPM '15: Proceedings of the First International Workshop on Extreme Scale Programming Models and MiddlewareOne of the biggest challenges on the way to exascale computing is programmability in the context of performance portability. The efficient utilization of the prospective architectures of exascale supercomputers will be challenging in many ways, very ...
Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers
Highlights- Generate parallel CUDA code from sequential C input code using a compiler-based tool for key operators in Geometric Multigrid.
AbstractGPUs, with their high bandwidths and computational capabilities are an increasingly popular target for scientific computing. Unfortunately, to date, harnessing the power of the GPU has required use of a GPU-specific programming model ...
Comments