Abstract
In this article, we present key techniques for optimizing the High Performance Conjugate Gradient (HPCG) benchmark on Sunway TaihuLight and demonstrate how to achieve high performance in memory-bound applications by exploiting specific characteristics of the hardware architecture. In particular, we use a block multicoloring approach for parallelization and propose methods such as requirement-based data mapping and a customized gather collective to increase the effective memory bandwidth. Experiments indicate that the optimized HPCG code sustains 77% of the theoretical memory bandwidth and scales to the full system of more than 10 million cores, with an aggregate performance of 480.8 Tflop/s and a weak-scaling efficiency of 87.3%.
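The abstract names block multicoloring without detailing it. The sketch below is a generic illustration of the idea and not necessarily the scheme used in the article. It assumes a 3D grid split into equal blocks and a 27-point stencil of radius 1, as in HPCG. Blocks are colored by the parity of their block coordinates, so two blocks of the same color never touch and can be relaxed in parallel in a Gauss-Seidel sweep. All function names (`block_color`, `color_blocks`, `blocks_adjacent`) are hypothetical.

```python
from itertools import product


def block_color(bx, by, bz):
    """Return one of 8 colors from the parity of each block coordinate."""
    return (bx % 2) + 2 * (by % 2) + 4 * (bz % 2)


def color_blocks(nbx, nby, nbz):
    """Group all block coordinates of an nbx x nby x nbz block grid by color."""
    groups = {c: [] for c in range(8)}
    for bx, by, bz in product(range(nbx), range(nby), range(nbz)):
        groups[block_color(bx, by, bz)].append((bx, by, bz))
    return groups


def blocks_adjacent(a, b):
    """True if two distinct blocks share a face, an edge, or a corner."""
    return a != b and all(abs(p - q) <= 1 for p, q in zip(a, b))


# Assumed example size: a 4 x 4 x 4 grid of blocks.
groups = color_blocks(4, 4, 4)

# Colors are processed one after another; blocks within a color are
# independent, so each could be assigned to a different core.
for color in range(8):
    independent_blocks = groups[color]
    # relax(independent_blocks) would run here in parallel (not shown).
```

Because same-color blocks differ by at least two block indices along some axis, no radius-1 stencil point in one block reads a point in another block of the same color. Points inside a block are still processed in order, which is what distinguishes block multicoloring from point-wise multicoloring.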
Index Terms
- Performance Optimization of the HPCG Benchmark on the Sunway TaihuLight Supercomputer