ABSTRACT
We dispel the "street wisdom" regarding the practical implementation of Strassen's algorithm for matrix-matrix multiplication (DGEMM). Conventional wisdom: it is only practical for very large matrices. Our implementation is practical for small matrices. Conventional wisdom: the matrices being multiplied should be relatively square. Our implementation is practical for rank-k updates, where k is relatively small (a shape of importance for libraries like LAPACK). Conventional wisdom: it inherently requires substantial workspace. Our implementation requires no workspace beyond the buffers already incorporated into conventional high-performance DGEMM implementations. Conventional wisdom: a Strassen DGEMM interface must pass in workspace. Our implementation requires no such workspace and can be plug-compatible with the standard DGEMM interface. Conventional wisdom: it is hard to demonstrate speedup on multi-core architectures. Our implementation demonstrates speedup over conventional DGEMM even on an Intel® Xeon Phi™ coprocessor utilizing 240 threads. We show how a distributed memory matrix-matrix multiplication also benefits from these advances.
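To make the workspace claim concrete, the sketch below shows the conventional one-level Strassen formulation that the abstract argues against, not the paper's implementation. It partitions each operand into four quadrants and computes seven products M1..M7, each held in an explicit temporary; those temporaries are precisely the workspace that the paper's approach avoids by folding the matrix additions into the packing buffers of a high-performance DGEMM. Square, even-dimensioned matrices are assumed for simplicity.

```python
import numpy as np

def strassen_one_level(A, B):
    """One level of Strassen's recursion, in its textbook form.

    Seven multiplications replace the eight of the classical blocked
    algorithm (Strassen, 1969). Each product M1..M7 requires an explicit
    temporary -- the workspace the paper's implementation eliminates.
    Assumes square matrices with even dimensions.
    """
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]

    # The seven Strassen products.
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)

    # Recombine into the quadrants of C.
    C = np.empty_like(A)
    C[:n, :n] = M1 + M4 - M5 + M7
    C[:n, n:] = M3 + M5
    C[n:, :n] = M2 + M4
    C[n:, n:] = M1 - M2 + M3 + M6
    return C

# Quick correctness check against a direct multiply.
rng = np.random.default_rng(0)
A = rng.standard_normal((256, 256))
B = rng.standard_normal((256, 256))
assert np.allclose(strassen_one_level(A, B), A @ B)
```

Counting the temporaries makes the paper's point: one recursion level already needs storage for seven half-sized products plus the operand sums, which is why conventional Strassen implementations demand extra workspace and only pay off for large, square matrices.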