ABSTRACT
We introduce a 64-bit ANSI/IEEE Std 754-1985 floating-point design of a hardware matrix multiplier optimized for FPGA implementation. A general block matrix multiplication algorithm, applicable to arbitrary matrix sizes, is proposed. The algorithm potentially enables optimal performance by exploiting the data locality and reuse inherent in the general matrix multiplication scheme, while accounting for the limitations of I/O bandwidth and local storage capacity. We implement a scalable linear array of processing elements (PEs) supporting the proposed algorithm in the Xilinx Virtex-II Pro technology. Synthesis results confirm a superior performance-area ratio compared to related recent works. Assuming the same FPGA chip, the same amount of local memory, and the same I/O bandwidth, our design outperforms related proposals by at least 1.7X and up to 18X, while consuming the fewest reconfigurable resources. A total of 39 PEs can be integrated into the xc2vp125-7 FPGA, reaching, for example, 15.6 GFLOPS with 1600 KB of local memory and 400 MB/s of external memory bandwidth.
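The data locality and reuse exploited by blocked matrix multiplication can be illustrated with a short software sketch. This is not the paper's hardware algorithm, only a generic tiled triple loop, with an assumed block size `Sb`: each element of an A block is loaded once and reused across an entire row of a B block, which is the kind of reuse that lets a hardware design amortize limited I/O bandwidth against on-chip storage.

```python
# Blocked (tiled) matrix multiplication: a generic software illustration of
# the data-reuse principle, NOT the paper's linear-array hardware algorithm.
# Sb is an assumed block (tile) size; min() handles sizes not divisible by Sb.

def block_matmul(A, B, n, Sb):
    """Multiply n x n matrices A and B (lists of lists) using Sb x Sb blocks."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, Sb):            # block row of C
        for kk in range(0, n, Sb):        # shared dimension, blocked
            for jj in range(0, n, Sb):    # block column of C
                # Multiply-accumulate one Sb x Sb tile pair into C.
                for i in range(ii, min(ii + Sb, n)):
                    for k in range(kk, min(kk + Sb, n)):
                        a = A[i][k]       # fetched once, reused Sb times below
                        for j in range(jj, min(jj + Sb, n)):
                            C[i][j] += a * B[k][j]
    return C
```

As a sanity check on the headline figure: 15.6 GFLOPS is consistent with 39 PEs each sustaining one multiply and one add per cycle at 200 MHz (39 x 2 FLOPs x 200 MHz = 15.6 GFLOPS), though the exact clock rate is an assumption here, not a claim from the abstract.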
Index Terms
- 64-bit floating-point FPGA matrix multiplication