ABSTRACT
We introduce a 64-bit ANSI/IEEE Std 754-1985 floating-point design of a hardware matrix multiplier optimized for FPGA implementation. A general block matrix multiplication algorithm, applicable to arbitrary matrix sizes, is proposed. The algorithm potentially enables optimal performance by exploiting the data locality and reuse inherent in the general matrix multiplication scheme, while accounting for the limitations of I/O bandwidth and local storage capacity. We implement a scalable linear array of processing elements (PEs) supporting the proposed algorithm in the Xilinx Virtex-II Pro technology. Synthesis results confirm a superior performance-area ratio compared to related recent works. Assuming the same FPGA chip, the same amount of local memory, and the same I/O bandwidth, our design outperforms related proposals by at least 1.7X and up to 18X, while consuming the fewest reconfigurable resources. A total of 39 PEs can be integrated into the xc2vp125-7 FPGA, reaching, for example, 15.6 GFLOPS with 1600 KB of local memory and 400 MB/s of external memory bandwidth.
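The data locality and reuse exploited by blocked matrix multiplication can be illustrated with a short software sketch. This is not the paper's hardware algorithm, only a generic tiled triple loop, with an assumed block size `Sb`: each element of an A block is loaded once and reused across an entire row of a B block, which is the kind of reuse that lets a hardware design amortize limited I/O bandwidth against on-chip storage.

```python
# Blocked (tiled) matrix multiplication: a generic software illustration of
# the data-reuse principle, NOT the paper's linear-array hardware algorithm.
# Sb is an assumed block (tile) size; min() handles sizes not divisible by Sb.

def block_matmul(A, B, n, Sb):
    """Multiply n x n matrices A and B (lists of lists) using Sb x Sb blocks."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, Sb):            # block row of C
        for kk in range(0, n, Sb):        # shared dimension, blocked
            for jj in range(0, n, Sb):    # block column of C
                # Multiply-accumulate one Sb x Sb tile pair into C.
                for i in range(ii, min(ii + Sb, n)):
                    for k in range(kk, min(kk + Sb, n)):
                        a = A[i][k]       # fetched once, reused Sb times below
                        for j in range(jj, min(jj + Sb, n)):
                            C[i][j] += a * B[k][j]
    return C
```

As a sanity check on the headline figure: 15.6 GFLOPS is consistent with 39 PEs each sustaining one multiply and one add per cycle at 200 MHz (39 x 2 FLOPs x 200 MHz = 15.6 GFLOPS), though the exact clock rate is an assumption here, not a claim from the abstract.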
Index Terms
- 64-bit floating-point FPGA matrix multiplication