KBLAS: An Optimized Library for Dense Matrix-Vector Multiplication on GPU Accelerators

Published: 10 May 2016

Abstract

KBLAS is an open-source, high-performance library that provides optimized kernels for a subset of Level 2 BLAS functionalities on CUDA-enabled GPUs. Because the performance of dense matrix-vector multiplication is bound by the cost of memory accesses, a double-buffering optimization technique is employed to overlap data motion with computation. After identifying a proper set of tuning parameters, KBLAS runs efficiently on various GPU architectures without code rewriting, while retaining compliance with the standard BLAS API. A further optimization ensures coalesced memory access when operating on submatrices, which is especially important for high-level dense linear algebra algorithms. All KBLAS kernels have been extended to multi-GPU environments, which required the introduction of new APIs. For general matrices, KBLAS is very competitive with existing state-of-the-art kernels and delivers smoother performance across a wide range of matrix dimensions. For symmetric and Hermitian matrices, KBLAS outperforms existing state-of-the-art implementations at all matrix sizes, achieving asymptotic speedups of up to 50% and 60% over the best competitor on single-GPU and multi-GPU systems, respectively. The performance results also validate our performance model. A subset of the KBLAS high-performance kernels has been integrated into NVIDIA's standard BLAS implementation (cuBLAS), starting from version 6.0, for wider dissemination.



• Published in

  ACM Transactions on Mathematical Software, Volume 42, Issue 3
  June 2016, 208 pages
  ISSN: 0098-3500
  EISSN: 1557-7295
  DOI: 10.1145/2935754

Copyright © 2016 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

Publication History

• Received: 1 September 2014
• Revised: 1 May 2015
• Accepted: 1 August 2015
• Published: 10 May 2016


Qualifiers

• Refereed research article
