research-article

Performance Optimization of the HPCG Benchmark on the Sunway TaihuLight Supercomputer

Published: 22 March 2018

Abstract

In this article, we present some key techniques for optimizing HPCG on Sunway TaihuLight and demonstrate how to achieve high performance in memory-bound applications by exploiting specific characteristics of the hardware architecture. In particular, we utilize a block multicoloring approach for parallelization and propose methods such as requirement-based data mapping and customized gather collective to enhance the effective memory bandwidth. Experiments indicate that the optimized HPCG code can sustain 77% of the theoretical memory bandwidth and scale to the full system of more than 10 million cores, with an aggregated performance of 480.8 Tflop/s and a weak scaling efficiency of 87.3%.
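
The abstract does not describe the block multicoloring scheme in detail, so the following is only a generic sketch of the idea, not the authors' implementation. It assumes a 3D grid with the 27-point stencil that HPCG uses, partitioned into rectangular blocks, and assigns each block one of 8 colors from the parity of its block indices. Under that assumption, no two blocks of the same color touch, so a Gauss-Seidel-style sweep can process all blocks of one color in parallel. The function names (`block_color`, `blocks_by_color`, `same_color_blocks_are_independent`) are hypothetical.

```python
from itertools import product


def block_color(bx, by, bz):
    """Return one of 8 colors for block (bx, by, bz), based on index parity."""
    return (bx % 2) + 2 * (by % 2) + 4 * (bz % 2)


def blocks_by_color(nbx, nby, nbz):
    """Group all block coordinates of an nbx x nby x nbz block grid by color."""
    groups = {color: [] for color in range(8)}
    for block in product(range(nbx), range(nby), range(nbz)):
        groups[block_color(*block)].append(block)
    return groups


def same_color_blocks_are_independent(nbx, nby, nbz):
    """Check that no two same-colored blocks share a face, edge, or corner.

    With a 27-point stencil, a grid point only couples to points in its own
    block or in a touching block, so this property means same-colored blocks
    have no data dependence on each other within one sweep.
    """
    for group in blocks_by_color(nbx, nby, nbz).values():
        for a in group:
            for b in group:
                touching = all(abs(i - j) <= 1 for i, j in zip(a, b))
                if a != b and touching:
                    return False
    return True


if __name__ == "__main__":
    groups = blocks_by_color(4, 4, 4)
    # A sweep would visit colors one after another; blocks inside a color
    # could be handed to different cores concurrently.
    for color, blocks in groups.items():
        print(color, len(blocks))
    print(same_color_blocks_are_independent(4, 4, 4))
```

Coloring blocks rather than individual grid points keeps each core working on a contiguous chunk of data, which matters for a memory-bound kernel. The block shape and number of colors used on Sunway TaihuLight may differ from this sketch.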



        • Published in

          ACM Transactions on Architecture and Code Optimization  Volume 15, Issue 1
          March 2018
          401 pages
          ISSN:1544-3566
          EISSN:1544-3973
          DOI:10.1145/3199680

          Copyright © 2018 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 22 March 2018
          • Accepted: 1 December 2017
          • Revised: 1 November 2017
          • Received: 1 September 2017


          Qualifiers

          • research-article
          • Research
          • Refereed
