ABSTRACT
We present a patch-based approach for tsunami simulation with parallel adaptive mesh refinement on the Salomon supercomputer. The special architecture of Salomon, with two Intel Xeon CPUs (Haswell architecture) and two Intel Xeon Phi coprocessors (Knights Corner) per compute node, suggests truly heterogeneous load balancing instead of offload approaches, because host and accelerator achieve comparable performance for our simulations.
We use a tree-structured mesh refinement strategy resulting from newest-vertex bisection of triangular grid cells, but introduce small uniform grid patches into the leaves of the tree to allow vectorisation of the Finite Volume solver over grid cells. In particular, we implemented vectorised versions of the approximate Riemann solvers, exploiting Fortran's array notation where possible. While large patches increase computational performance due to vectorisation, improved memory access and reduced meshing overhead, they also increase the overall number of processed cells. Thus, a trade-off must be found regarding the patch size. We experimented with different patch sizes in a study of the time-to-solution of a simulation of the 2011 Tohoku tsunami, and found that relatively small patches with 8² cells resulted in the smallest execution times.
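To illustrate what vectorising a Riemann solver over the cells of a patch can look like, the following is a minimal sketch in Python with NumPy, whose whole-array expressions play a role similar to Fortran's array notation. It is not the authors' solver: it swaps in the much simpler Rusanov (local Lax-Friedrichs) flux for the one-dimensional shallow water equations, assumes all cells are wet (water height greater than zero), and ignores bathymetry. The names `rusanov_flux` and `G` are hypothetical.

```python
import numpy as np

G = 9.81  # gravitational acceleration in m/s^2 (assumed constant)


def rusanov_flux(h_l, hu_l, h_r, hu_r):
    """Rusanov flux across many edges at once.

    Each argument is a 1D array with one entry per edge of a patch:
    water height h and momentum hu on the left and right side.
    Returns an array of shape (2, n_edges): mass and momentum flux.
    Assumes h > 0 everywhere (no dry cells).
    """
    u_l = hu_l / h_l
    u_r = hu_r / h_r

    # Physical fluxes f(q) = (hu, hu*u + g*h^2/2) on both sides.
    f_l = np.stack([hu_l, hu_l * u_l + 0.5 * G * h_l**2])
    f_r = np.stack([hu_r, hu_r * u_r + 0.5 * G * h_r**2])

    # Largest wave speed per edge, used as numerical viscosity.
    s_max = np.maximum(np.abs(u_l) + np.sqrt(G * h_l),
                       np.abs(u_r) + np.sqrt(G * h_r))

    q_l = np.stack([h_l, hu_l])
    q_r = np.stack([h_r, hu_r])
    return 0.5 * (f_l + f_r) - 0.5 * s_max * (q_r - q_l)


# Usage: one call handles every edge of a (hypothetical) patch.
h_left = np.array([2.0, 2.0, 2.0])
h_right = np.array([2.0, 1.5, 1.0])
zero = np.zeros(3)
fluxes = rusanov_flux(h_left, zero, h_right, zero)
```

Because every operation acts on whole arrays, there is no per-edge branching in the kernel, which is the property that lets a compiler or array library map the work onto SIMD units.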
We use the Xeon Phis in symmetric mode and apply heterogeneous load balancing between hosts and coprocessors, identifying the relative load distribution either from on-the-fly runtime measurements or from a priori exhaustive testing. Both approaches perform better than homogeneous load balancing and better than using only the CPUs or only the Xeon Phi coprocessors in native mode. In all set-ups, however, the absolute speedups are impeded by the slow MPI communication between Xeon Phi coprocessors.
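The idea of deriving the relative load distribution from runtime measurements can be sketched as follows. This is an illustrative Python example under stated assumptions, not the scheme used in the paper: it assumes each rank reports how many cells it processed and how long that took, and that the workload is an ordered list of per-section costs (for instance along a space-filling curve) to be cut into contiguous chunks. The function names `relative_speeds` and `split_by_weight` are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate


def relative_speeds(cells_processed, seconds):
    """Turn measured throughputs (cells per second) into weights summing to 1."""
    rates = [c / t for c, t in zip(cells_processed, seconds)]
    total = sum(rates)
    return [r / total for r in rates]


def split_by_weight(costs, weights):
    """Cut an ordered cost list into len(weights) contiguous chunks.

    Each chunk's total cost is roughly proportional to its weight.
    Returns (start, end) index pairs with end exclusive.
    """
    prefix = list(accumulate(costs))
    total = prefix[-1]
    bounds = [0]
    share = 0.0
    for w in weights[:-1]:
        share += w
        # First position where the accumulated cost reaches the target share.
        cut = bisect_left(prefix, share * total) + 1
        # Keep the cuts monotone and inside the list.
        cut = min(max(cut, bounds[-1]), len(costs))
        bounds.append(cut)
    bounds.append(len(costs))
    return list(zip(bounds[:-1], bounds[1:]))


# Usage: a host that was measured three times faster than a coprocessor
# receives about three quarters of twelve equally expensive sections.
weights = relative_speeds([300, 100], [1.0, 1.0])
chunks = split_by_weight([1] * 12, weights)
```

Homogeneous load balancing corresponds to passing equal weights; the a priori variant mentioned above would replace the measured weights with values fixed from earlier test runs.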
Index Terms
- Load Balancing and Patch-Based Parallel Adaptive Mesh Refinement for Tsunami Simulation on Heterogeneous Platforms Using Xeon Phi Coprocessors