64-bit floating-point FPGA matrix multiplication

DOI: 10.1145/1046192.1046204

Published: 20 February 2005

ABSTRACT

We introduce a 64-bit ANSI/IEEE Std 754-1985 floating-point design of a hardware matrix multiplier optimized for FPGA implementation. A general block matrix multiplication algorithm, applicable to arbitrary matrix sizes, is proposed. The algorithm potentially enables optimal performance by exploiting the data locality and reusability inherent in the general matrix multiplication scheme, while respecting the limitations of the I/O bandwidth and the local storage volume. We implement a scalable linear array of processing elements (PEs) supporting the proposed algorithm in Xilinx Virtex-II Pro technology. Synthesis results confirm a superior performance-area ratio compared to related recent works. Assuming the same FPGA chip, the same amount of local memory, and the same I/O bandwidth, our design outperforms related proposals by at least 1.7x and up to 18x while consuming the fewest reconfigurable resources. A total of 39 PEs can be integrated into the xc2vp125-7 FPGA, reaching a performance of, e.g., 15.6 GFLOPS with 1600 KB of local memory and 400 MB/s external memory bandwidth.
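The blocking idea the abstract describes can be illustrated in software. The sketch below is not the paper's hardware algorithm; it is a minimal, generic blocked matrix multiplication in Python, where the block sizes (here the hypothetical names Sm, Sn, Sk) play the role the local-storage and I/O-bandwidth constraints play in the FPGA design: each block of A, B, and C is reused across the inner loops before new data must be fetched.

```python
def block_matmul(A, B, Sm=2, Sn=2, Sk=2):
    """Blocked C = A * B for row-major lists of lists.

    Sm, Sn, Sk are illustrative block sizes; in a hardware design they
    would be chosen so that one block of each operand fits in local
    memory, maximizing reuse per word fetched over the I/O interface.
    """
    m, k = len(A), len(A[0])
    n = len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, Sm):            # block row of A and C
        for j0 in range(0, n, Sn):        # block column of B and C
            for k0 in range(0, k, Sk):    # block along the reduction dim
                # Within one (i0, j0, k0) block triple, the operand
                # sub-blocks are reused Sm*Sn*Sk times before moving on.
                for i in range(i0, min(i0 + Sm, m)):
                    for j in range(j0, min(j0 + Sn, n)):
                        s = 0.0
                        for kk in range(k0, min(k0 + Sk, k)):
                            s += A[i][kk] * B[kk][j]
                        C[i][j] += s
    return C
```

The result is identical to the naive triple loop; blocking only reorders the accumulation so that data fetched into fast storage is reused, which is the locality property the proposed algorithm exploits under limited bandwidth.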


Published in

FPGA '05: Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays
February 2005, 288 pages
ISBN: 1595930299
DOI: 10.1145/1046192
Copyright © 2005 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall acceptance rate: 125 of 627 submissions, 20%
