ABSTRACT
We dispel the "street wisdom" regarding the practical implementation of Strassen's algorithm for matrix-matrix multiplication (DGEMM). Conventional wisdom: it is only practical for very large matrices. Our implementation is practical for small matrices. Conventional wisdom: the matrices being multiplied should be relatively square. Our implementation is practical for rank-k updates, where k is relatively small (a shape of importance for libraries like LAPACK). Conventional wisdom: it inherently requires substantial workspace. Our implementation requires no workspace beyond the buffers already incorporated into conventional high-performance DGEMM implementations. Conventional wisdom: a Strassen DGEMM interface must pass in workspace. Our implementation requires no such workspace and can be plug-compatible with the standard DGEMM interface. Conventional wisdom: it is hard to demonstrate speedup on multi-core architectures. Our implementation demonstrates speedup over conventional DGEMM even on an Intel® Xeon Phi™ coprocessor utilizing 240 threads. We show how a distributed memory matrix-matrix multiplication also benefits from these advances.
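To make the workspace claim concrete, the sketch below shows the conventional one-level Strassen formulation that the abstract argues against, not the paper's implementation. It partitions each operand into four quadrants and computes seven products M1..M7, each held in an explicit temporary; those temporaries are precisely the workspace that the paper's approach avoids by folding the matrix additions into the packing buffers of a high-performance DGEMM. Square, even-dimensioned matrices are assumed for simplicity.

```python
import numpy as np

def strassen_one_level(A, B):
    """One level of Strassen's recursion, in its textbook form.

    Seven multiplications replace the eight of the classical blocked
    algorithm (Strassen, 1969). Each product M1..M7 requires an explicit
    temporary -- the workspace the paper's implementation eliminates.
    Assumes square matrices with even dimensions.
    """
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]

    # The seven Strassen products.
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)

    # Recombine into the quadrants of C.
    C = np.empty_like(A)
    C[:n, :n] = M1 + M4 - M5 + M7
    C[:n, n:] = M3 + M5
    C[n:, :n] = M2 + M4
    C[n:, n:] = M1 - M2 + M3 + M6
    return C

# Quick correctness check against a direct multiply.
rng = np.random.default_rng(0)
A = rng.standard_normal((256, 256))
B = rng.standard_normal((256, 256))
assert np.allclose(strassen_one_level(A, B), A @ B)
```

Counting the temporaries makes the paper's point: one recursion level already needs storage for seven half-sized products plus the operand sums, which is why conventional Strassen implementations demand extra workspace and only pay off for large, square matrices.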