research-article

Asynchronous Task-Based Parallelization of Algebraic Multigrid

Authors:
Amani AlOnazi

Extreme Computing Research Center, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia

Extreme Computing Research Center, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
View Profile

,
George S. Markomanolis

KAUST Supercomputing Laboratory, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia

KAUST Supercomputing Laboratory, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
View Profile

,
David Keyes

Extreme Computing Research Center, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia

Extreme Computing Research Center, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
View Profile

PASC '17: Proceedings of the Platform for Advanced Scientific Computing ConferenceJune 2017Article No.: 5Pages 1–11https://doi.org/10.1145/3093172.3093230

Published:26 June 2017Publication History

PASC '17: Proceedings of the Platform for Advanced Scientific Computing Conference

Pages 1–11

ABSTRACT

As processor clock rates become more dynamic and workloads become more adaptive, the vulnerability to global synchronization that already complicates programming for performance in today's petascale environment will be exacerbated. Algebraic multigrid (AMG), the solver of choice in many large-scale PDE-based simulations, scales well in the weak sense, with fixed problem size per node, on tightly coupled systems when loads are well balanced and core performance is reliable. However, its strong scaling to many cores within a node is challenging. Reducing synchronization and increasing concurrency are vital adaptations of AMG to hybrid architectures. Recent communication-reducing improvements to classical additive AMG by Vassilevski and Yang improve concurrency and increase communication-computation overlap, while retaining convergence properties close to those of standard multiplicative AMG, but remain bulk synchronous. We extend the Vassilevski and Yang additive AMG to asynchronous task-based parallelism using a hybrid MPI+OmpSs (from the Barcelona Supercomputer Center) within a node, along with MPI for internode communications. We implement a tiling approach to decompose the grid hierarchy into parallel units within task containers. We compare against the MPI-only BoomerAMG and the Auxiliary-space Maxwell Solver (AMS) in the hypre library for the 3D Laplacian operator and the electromagnetic diffusion, respectively. In time to solution for a full solve an MPI-OmpSs hybrid improves over an all-MPI approach in strong scaling at full core count (32 threads per single Haswell node of the Cray XC40) and maintains this per node advantage as both weak scale to thousands of cores, with MPI between nodes.

References

2016. Intel Xeon Processor E5 v3 Product Family: Processor Specification Update. Technical Report 330785-010US. Intel Corporation.Google Scholar
Ahmad Abdelfattah, David Keyes, and Hatem Ltaief. 2016. KBLAS: An Optimized Library for Dense Matrix-Vector Multiplication on GPU Accelerators. ACM Trans. Math. Softw. 42, 3, Article 18 (May 2016), 31 pages. Google ScholarDigital Library
Emmanuel Agullo, Jim Demmel, Jack Dongarra, Bilel Hadri, Jakub Kurzak, Julien Langou, Hatem Ltaief, Piotr Luszczek, and Stanimire Tomov. 2009. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. Journal of Physics: Conference Series 180, 1 (2009), 012037. http://stacks.iop.org/1742-6596/180/i=1/a=012037Google ScholarCross Ref
Cédric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre-André Wacrenier. 2011. StarPU: a unified platform for task scheduling on heterogeneous multicore architectures. Concurrency and Computation: Practice and Experience 23, 2 (2011), 187--198. Google ScholarDigital Library
Allison H. Baker, Robert D. Falgout, Todd Gamblin, Tzanio V. Kolev, Martin Schulz, and Ulrike Meier Yang. 2012. Scaling Algebraic Multigrid Solvers: On the Road to Exascale. Springer Berlin Heidelberg, Berlin, Heidelberg, 215--226.Google Scholar
Barcelona Supercomputing Center 2016. Extrae. Barcelona Supercomputing Center. https://www.bsc.es/computer-sciences/extraeGoogle Scholar
Barcelona Supercomputing Center 2016. Paraver: Performance Analysis Tools. Barcelona Supercomputing Center. https://www.bsc.es/computer-sciences/performance-tools/paraverGoogle Scholar
Peter Bastian, Markus Blatt, Andreas Dedner, Christian Engwer, Robert Klöfkorn, Ralf Kornhuber, Mario Ohlberger, and Oliver Sander. 2008. A generic grid interface for parallel and adaptive scientific computing. Part II: Implementation and tests in DUNE. Computing 82, 2 (2008), 121--138. Google ScholarDigital Library
Michael Bauer, Sean Treichler, Elliott Slaughter, and Alex Aiken. 2012. Legion: Expressing Locality and Independence with Logical Regions. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC '12). IEEE Computer Society Press, Los Alamitos, CA, USA, Article 66, 11 pages. http://dl.acm.org/citation.cfm?id=2388996.2389086 Google ScholarDigital Library
Wieb Bosma, John Cannon, and Catherine Playoust. 1997. The MAGMA algebra system. I. The user language. J. Symbolic Comput. 24, 3--4 (1997), 235--265. Computational algebra and number theory (London, 1993). Google ScholarDigital Library
Achi Brandt. 1977. Multi-level adaptive solutions to boundary-value problems. Math. Comp. 31 (1977), 333--390.Google ScholarCross Ref
Marian Brezina, R Falgout, Scott MacLachlan, T Manteuffel, S McCormick, and John Ruge. 2006. Adaptive Algebraic Multigrid. SIAM Journal on Scientific Computing 27, 4 (2006), 1261--1286. Google ScholarDigital Library
Center for Applied Scientific Computing, Lawrence Livermore National Laboratory 2016. hypre: High Performnce Preconditioners. Center for Applied Scientific Computing, Lawrence Livermore National Laboratory.Google Scholar
Hans De Sterck, Robert D. Falgout, Joshua W. Nolting, and Ulrike Meier Yang. 2008. Distance-two interpolation for parallel algebraic multigrid. Numerical Linear Algebra with Applications 15, 2--3 (2008), 115--139.Google ScholarCross Ref
Jack Dongarra, Jakub Kurzak, Asim YarKhan, Mathieu Faverge, and Vijay Joshi. 2011. QUARK (QUeuing And Runtime for Kernels) Library. (2011). http://icl.cs.utk.edu/quark/index.htmlGoogle Scholar
Gábor Dózsa, Sameer Kumar, Pavan Balaji, Darius Buntinas, David Goodell, William Gropp, Joe Ratterman, and Rajeev Thakur. 2010. Enabling Concurrent Multithreaded MPI Communication on Multicore Petascale Systems. Springer Berlin Heidelberg, Berlin, Heidelberg, 11--20. Google ScholarDigital Library
Alejandro Duran, Eduard Ayguad, Rosa M. Badia, Jesus Labarta, Luis Martinell, Xavier Martorell, and Judit Planas. 2011. OmpSs: A PROPOSAL FOR PROGRAMMING HETEROGENEOUS MULTI-CORE ARCHITECTURES. Parallel Processing Letters 21, 02 (2011), 173--193.Google ScholarCross Ref
Hormozd Gahvari, Allison H. Baker, Martin Schulz, Ulrike Meier Yang, Kirk E. Jordan, and William Gropp. 2011. Modeling the Performance of an Algebraic Multigrid Cycle on HPC Platforms. In Proceedings of the International Conference on Supercomputing (ICS '11). ACM, New York, NY, USA, 172--181. Google ScholarDigital Library
M.W. Gee, C.M. Siefert, J.J. Hu, R.S. Tuminaro, and M.G. Sala. 2006. ML 5.0 Smoothed Aggregation User's Guide. Technical Report SAND 2006-2649. Sandia National Laboratories.Google Scholar
W. Hackbusch. 1985. Multigrid Methods and Applications. Springer-Verlag.Google Scholar
Van Emden Henson and Ulrike Meier Yang. 2002. BoomerAMG: A parallel algebraic multigrid solver and preconditioner. Applied Numerical Mathematics 41, 1 (2002), 155--177. Google ScholarDigital Library
Torsten Hoefler, James Dinan, Darius Buntinas, Pavan Balaji, Brian Barrett, Ron Brightwell, William Gropp, Vivek Kale, and Rajeev Thakur. 2013. MPI+ MPI: a new hybrid approach to parallel programming with MPI plus shared memory. Computing 95, 12 (2013), 1121--1136. Google ScholarDigital Library
Tzanio V. Kolev and Panayot S. Vassilevski. 2008. Auxiliary Space AMG for H(curl) Problems. Springer Berlin Heidelberg, Berlin, Heidelberg, 147--154.Google Scholar
Jesús Labarta. 2010. StarSs: A Programming Model for the Multicore Era. (2010-03-01 2010).Google Scholar
X. Lacoste. 2015. Scheduling and memory optimizations for sparse direct solver on multi-core/multi-GPU cluster systems. Ph.D. Dissertation. LaBRI, UniversitÃl' Bordeaux, Talence, France.Google Scholar
Jan Mandel, Steve McCormick, and John Ruge. 1988. An Algebraic Theory for Multigrid Methods for Variational Problems. SIAM J. Numer. Anal. 25, 1 (Feb. 1988), 91--110. Google ScholarDigital Library
S. McCormick. 1989. Multilevel Adaptive Methods for Partial Differential Equations. Society for Industrial and Applied Mathematics. Google ScholarDigital Library
William F. Mitchell. 1999. The Full Domain Partition Approach to Parallel Adaptive Refinement. Springer New York, New York, NY, 151--161.Google Scholar
Philip J. Mucci, Shirley Browne, Christine Deane, and George Ho. 1999. PAPI: A Portable Interface to Hardware Performance Counters. In In Proceedings of the Department of Defense HPCMP Users Group Conference. 7--10.Google Scholar
Jongsoo Park, Mikhail Smelyanskiy, Ulrike Meier Yang, Dheevatsa Mudigere, and Pradeep Dubey. 2015. High-performance Algebraic Multigrid Solver Optimized for Multi-core Based Distributed Parallel Systems. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '15). ACM, New York, NY, USA, Article 54, 12 pages. Google ScholarDigital Library
Sebastian Reiter, Andreas Vogel, Ingo Heppner, Martin Rupp, and Gabriel Wittum. 2013. A Massively Parallel Geometric Multigrid Solver on Hierarchically Distributed Grids. Comput. Vis. Sci. 16, 4 (Aug. 2013), 151--164. Google ScholarDigital Library
Hans De Sterck, Ulrike Meier Yang, and Jeffrey J. Heys. 2006. Reducing Complexity in Parallel Algebraic Multigrid Preconditioners. SIAM J. Matrix Anal. Appl. 27, 4 (2006), 1019--1039. arXiv:http://dx.doi.org/10.1137/040615729 Google ScholarDigital Library
Didem Unat, Tan Nguyen, Weiqun Zhang, Muhammed Nufail Farooqi, Burak Bastem, George Michelogiannakis, Ann Almgren, and John Shalf. 2016. TiDA: High-Level Programming Abstractions for Data Locality Management. Springer International Publishing, Cham, 116--135.Google Scholar
Panayot S. Vassilevski and Ulrike M. Yang. 2014. Reducing communication in algebraic multigrid using additive variants. Numerical Lin. Alg. with Applic. 21, 2 (2014), 275--296.Google ScholarCross Ref
P. Wesseling. 1982. A robust and efficient multigrid method. Springer Berlin Heidelberg, Berlin, Heidelberg, 614--630.Google Scholar
Ulrike Meier Yang. 2006. Parallel Algebraic Multigrid Methods --- High Performance Preconditioners. Springer Berlin Heidelberg, Berlin, Heidelberg, 209--236.Google Scholar

Index Terms

Asynchronous Task-Based Parallelization of Algebraic Multigrid
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel algorithms
      1. Massively parallel algorithms
      2. Shared memory algorithms
2. Mathematics of computing
  1. Mathematical software
    1. Mathematical software performance

Recommendations

Parallelization of stochastic bounds for Markov chains on multicore and manycore platforms

The author demonstrates the methodology for parallelizing of finding stochastic bounds for Markov chains on multicore and manycore platforms. The stochastic bounds algorithm for Markov chains with the sparse matrices is investigated, thus needing a lot ...
Read More
Higher-level parallelization for local and distributed asynchronous task-based programming
ESPM '15: Proceedings of the First International Workshop on Extreme Scale Programming Models and Middleware

One of the biggest challenges on the way to exascale computing is programmability in the context of performance portability. The efficient utilization of the prospective architectures of exascale supercomputers will be challenging in many ways, very ...
Read More
Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers
Highlights
- Generate parallel CUDA code from sequential C input code using a compiler-based tool for key operators in Geometric Multigrid.
Abstract
GPUs, with their high bandwidths and computational capabilities are an increasingly popular target for scientific computing. Unfortunately, to date, harnessing the power of the GPU has required use of a GPU-specific programming model ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in

PASC '17: Proceedings of the Platform for Advanced Scientific Computing Conference
June 2017
136 pages
ISBN:9781450350624
DOI:10.1145/3093172

Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 26 June 2017
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Multigrid
additive multigrid
hybrid MPI OmpSs implementation
task-based parallelism
Qualifiers
- research-article
- Research
- Refereed limited
Conference

Acceptance Rates
PASC '17 Paper Acceptance Rate13of33submissions,39%Overall Acceptance Rate83of185submissions,45%
More
Upcoming Conference
PASC '24

Sponsor:

sighpc

Platform for Advanced Scientific Computing Conference

June 3 - 5, 2024

Zurich , Switzerland
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 4
  Total Citations
  View Citations
- 161
  Total Downloads
- Downloads (Last 12 months)6
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Asynchronous Task-Based Parallelization of Algebraic Multigrid

PASC '17: Proceedings of the Platform for Advanced Scientific Computing Conference

ABSTRACT

References

Cited By

Index Terms

Recommendations

Parallelization of stochastic bounds for Markov chains on multicore and manycore platforms

Higher-level parallelization for local and distributed asynchronous task-based programming

Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Asynchronous Task-Based Parallelization of Algebraic Multigrid

PASC '17: Proceedings of the Platform for Advanced Scientific Computing Conference

ABSTRACT

References

Cited By

Index Terms

Recommendations

Parallelization of stochastic bounds for Markov chains on multicore and manycore platforms

Higher-level parallelization for local and distributed asynchronous task-based programming

Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media