DOI: 10.5555/3014904.3014983

Strassen's algorithm reloaded

Published: 13 November 2016

ABSTRACT

We dispel "street wisdom" regarding the practical implementation of Strassen's algorithm for matrix-matrix multiplication (DGEMM). Conventional wisdom: it is only practical for very large matrices. Our implementation is practical for small matrices. Conventional wisdom: the matrices being multiplied should be relatively square. Our implementation is practical for rank-k updates, where k is relatively small (a shape of importance for libraries like LAPACK). Conventional wisdom: it inherently requires substantial workspace. Our implementation requires no workspace beyond buffers already incorporated into conventional high-performance DGEMM implementations. Conventional wisdom: a Strassen DGEMM interface must pass in workspace. Our implementation requires no such workspace and can be plug-compatible with the standard DGEMM interface. Conventional wisdom: it is hard to demonstrate speedup on multi-core architectures. Our implementation demonstrates speedup over conventional DGEMM even on an Intel® Xeon Phi™ coprocessor utilizing 240 threads. We show how a distributed memory matrix-matrix multiplication also benefits from these advances.
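For context on what these claims are about: Strassen's insight is that a 2×2 blocked matrix product can be formed with seven submatrix multiplications instead of eight, at the cost of extra additions. The sketch below is a minimal NumPy illustration with a hypothetical helper name, not the paper's implementation; it shows one level of the recursion on even-dimensioned square operands. The explicit temporaries M1 through M7 are precisely the kind of workspace the abstract says can be avoided by folding these additions into the packing buffers of a conventional high-performance DGEMM.

```python
import numpy as np

def strassen_2x2(A, B):
    """One level of Strassen's recursion for C = A @ B.

    Illustrative sketch only (hypothetical helper, not the paper's code):
    assumes square matrices of even dimension, and allocates the
    temporaries M1..M7 explicitly, i.e. the workspace the paper eliminates.
    """
    n = A.shape[0] // 2
    A00, A01, A10, A11 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B00, B01, B10, B11 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]

    # Strassen's seven products (Strassen, 1969): 7 multiplies, 18 adds.
    M1 = (A00 + A11) @ (B00 + B11)
    M2 = (A10 + A11) @ B00
    M3 = A00 @ (B01 - B11)
    M4 = A11 @ (B10 - B00)
    M5 = (A00 + A01) @ B11
    M6 = (A10 - A00) @ (B00 + B01)
    M7 = (A01 - A11) @ (B10 + B11)

    # Reassemble the four quadrants of C.
    C = np.empty_like(A)
    C[:n, :n] = M1 + M4 - M5 + M7
    C[:n, n:] = M3 + M5
    C[n:, :n] = M2 + M4
    C[n:, n:] = M1 - M2 + M3 + M6
    return C

# Quick sanity check against a conventional product.
A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
assert np.allclose(strassen_2x2(A, B), A @ B)
```

Applied recursively, the seven-product scheme lowers the arithmetic complexity from O(n^3) to roughly O(n^2.807); the abstract's claim is that, once these temporaries are absorbed into the buffers an optimized DGEMM already uses, the crossover at which Strassen's algorithm beats conventional DGEMM occurs even for small matrices and skinny rank-k shapes.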

Published in

SC '16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2016, 1034 pages
ISBN: 9781467388153
Conference Chair: John West

Publisher

IEEE Press

Qualifiers

research-article

Acceptance Rates

SC '16 paper acceptance rate: 81 of 442 submissions, 18%. Overall acceptance rate: 1,516 of 6,373 submissions, 24%.
