article

Free Access

Area and performance tradeoffs in floating-point divide and square-root implementations

Authors:
Peter Soderquist

Cornell Univ, Ithaca, NY

Cornell Univ, Ithaca, NY
View Profile

,
Miriam Leeser

Northeastern Univ., Boston, MA

Northeastern Univ., Boston, MA
View Profile

Authors Info & Claims

ACM Computing Surveys Volume 28 Issue 3pp 518–564https://doi.org/10.1145/243439.243481

Published:01 September 1996Publication History

ACM Computing Surveys

Abstract

Floating-point divide and square-root operations are essential to many scientific and engineering applications, and are required in all computer systems that support the IEEE floating-point standard. Yet many current microprocessors provide only weak support for these operations. The latency and throughput of division are typically far inferior to those of floating-point addition and multiplication, and square-root performance is often even lower. This article argues the case for high-performance division and square root. It also explains the algorithms and implementations of the primary techniques, subtractive and multiplicative methods, employed in microprocessor floating-point units with their associated area/performance tradeoffs. Case studies of representative floating-point unit configurations are presented, supported by simulation results using a carefully selected benchmark, Givens rotation, to show the dynamic performance impact of the various implementation alternatives. The topology of the implementation is found to be an important performance factor. Multiplicative algorithms, such as the Newton-Raphson method and Goldschmidt's algorithm, can achieve low latencies. However, these implementations serialize multiply, divide, and square root operations through a single pipeline, which can lead to low throughput. While this hardware sharing yields low size requirements for baseline implementations, lower-latency versions require many times more area. For these reasons, multiplicative implementations are best suited to cases where subtractive methods are precluded by area constraints, and modest performance on divide and square root operations is tolerable. Subtractive algorithms, exemplified by radix-4 SRT and radix-16 SRT, can be made to execute in parallel with other floating-point operations.

References

ALPERT, D. AND AVNON, D. 1993.Architecture of the Pentium microprocessor. IEEE Micro (June) 11-21.]] Google Scholar
ANDERSON, S. F. ET AL. 1967. The IBM System/ 360 Model 91: Floating-point execution unit. IBM J. Res. (Jan.) 34-53.]]Google Scholar
ASPREY, T. ET AL. 1993. Performance features of the PA7100 microprocessor. IEEE Micro (June) 22-35.]] Google Scholar
ATKINS, D.E. 1968. Higher-radix division using estimates of the divisor and partial remainders. IEEE Trans. Comput. (Oct.) 925-934.]]Google Scholar
BANNON, P. AND KELLER, J. 1995. Internal architecture of Alpha 21164 microprocessor. In Digest of Papers: COMPCON Spring 1995, IEEE, Feb. 79-87.]] Google Scholar
BECKER, M.C. ET AL. 1993. The PowerPC 601 microprocessor. IEEE Micro (Oct.) 54-68.]] Google Scholar
BERRY, M., CHEN, D., KOSS, P., AND KUCK, D. 1988. The Perfect Club benchmarks: Effective performance evaluation of supercomputers. CSRD Rep. No. 827, Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign, Nov.]]Google Scholar
BERRY, M., CYBENKO, G., AND LARSON, J. 1991. Scientific benchmark characterizations. Parallel Comput. (Dec.) 1173-1194.]]Google Scholar
BRIGGS, W. S. AND MATULA, D.W. 1993. A 17 ~ 69 multiply and add unit with redundant binary feedback and single cycle latency. In Proceedings of the 11th IEEE Symposium on Computer Arithmetic, June 163-170.]]Google Scholar
BURGESS, B. ET AL. 1994. The PowerPC 603 RISC microprocessor. Commun. ACM (June) 34-42.]] Google Scholar
CASE, B. 1995. SPEC95 retires SPEC92. Microprocessor Rep. (Aug.) 11-14.]]Google Scholar
COL, T. AND TANG, P. T.P. 1995. It takes six ones to reach a flaw. In Proceedings of the 12th IEEE Symposium on Computer Arithmetic, IEEE, July 140-146.]] Google Scholar
CONTE, T.M. 1993. Architectural resource requirements of contemporary benchmarks: A wish list. In Proceedings of the 26th Annual Hawaii International Conference on System Sciences, IEEE, Jan. 517-529.]]Google Scholar
DARLEY, H. M. ET AL. 1989. Floating-point/integer processor with divide and square root functions. U.S. Patent 4,878,190, Oct.]]Google Scholar
DARLEY, M. ET AL. 1990. The TMS390C602A floating-point coprocessor for Sparc systems. IEEE Micro, (June) 36-47.]] Google Scholar
DIXIT, K.M. 1991. The SPEC benchmarks. Parallel Computing, (Dec.) 1195-1209.]]Google Scholar
ERCEGOVAC, M. D. AND LANG, T. 1994. Division and Square Root: Digit Recurrence Algorithms and Implementations. Kluwer Academic Publishers; Norwell, MA.]] Google Scholar
ERCEGOVAC, M.D., LANG, T., AND MONTUSCHI, P. 1993. Very high radix division with selection by rounding and prescaling. In Proceedings of the 11th IEEE Symposium on Computer Arithmetic, IEEE, June 112-119.]]Google Scholar
FOWLER, D. L. AND SMITH, J.E. 1989. An accurate, high speed implementation of division by reciprocal approximation. In Proceedings of the 9th IEEE Symposium on Computer Arithmetic, Sept. 60-67.]]Google Scholar
FRANTZESKAKIS, E. N. AND LIU, K. J.R. 1994. A class of square root and division free algorithms and architectures for QRD-based adaptive signal processing. IRE Trans. Signal Process. (Sept.) 2455-2469.]]Google Scholar
GOLUB, G. H. AND VAN LOAN, C.F. 1989. Matrix Computations. The Johns Hopkins University Press, Baltimore, Second Ed.]]Google Scholar
GREENLY, D. ET AL. 1995. UltraSPARC(tm): The next generation superscalar 64-bit SPARC. In Digest of Papers: COMPCON Spring 1995, IEEE (Feb.) 442-451.]] Google Scholar
GROSS, T. 1985. Software implementation of floating-point arithmetic on a reduced-instruction-set processor. J. Parallel Distrib. Comput. 2, 362-375.]] Google Scholar
GWENNAP, L. 1994. PA-7200 enables inexpensive MP systems: HP's next-generation PA- RISC also contains unique "assist" cache. Microprocessor Rep. (Jan.).]]Google Scholar
GWENNAP, L. 1995a. Hal reveals multichip SPARC processor: High-performance CPU for Hal systems only--no merchant sales. Microprocessor Rep. 9, 3 (Mar.) 1,6-11.]]Google Scholar
GWENNAP, L. 1995b. Intel's P6 uses decoupled superscalar design: Next generation of x86 integrates L2 cache in package with CPU. Microprocessor Rep. 9, 2 (Feb.) 9-15.]]Google Scholar
HENNESSY, J.L. AND PATTERSON, D.A. 1990a. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, San Mateo, CA.]] Google Scholar
HENNESSY, J.L. AND PATTERSON, D.A. 1990b. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, San Mateo, CA. Appendix A: Computer Arithmetic by David Goldberg.]] Google Scholar
HUNT, D. 1995. Advanced performance features of the 64-bit PA-8000. In Digest of Papers: COMPCON Spring 1995, IEEE, Feb. 123- 128.]] Google Scholar
IEEE STANDARD FOR BINARY FLOATING-POINT ARITHMETIC. 1985. New York ANSI/IEEE Std. 754--1985, Aug.]]Google Scholar
JULIUSSEN, E. 1994. Which low-end workstation? IEEE Spectrum, (Apr.) 51-59.]] Google Scholar
KABUO, H. ET AL. 1994. Accurate rounding scheme for the Newton-Raphson method using redundant binary representation. IEEE Trans. Comput. (Jan.)43-50.]] Google Scholar
KAHAN, W. 1994. Using MathCAD 3.1 on a Mac, Aug.]]Google Scholar
KOHN, L. AND MARGULIS, N. 1989. Introducing the Intel i860 64-bit microprocessor. IEEE Micro, (Aug.) 15-30.]] Google Scholar
LANG, T. AND MONTUSCHI, P. 1995. Very-high radix combined division and square root with prescaling and selection by rounding. In Proceedngs of the 12th IEEE Symposium on Computer Arithmetic, IEEE, July 124-131.]] Google Scholar
Lu, P. Y. AND DAWALLU, K. 1989. A VLSI module for IEEE floating-pont multiplication/division/square root. In Proceedings of the IEEE International Conference on Computer Design: VLSI in Computers and Processors, 366-368.]]Google Scholar
MARKSTEIN, P.W. 1990. Computation of elementary functions on the IBM RISC System/ 6000 processor. IBM J. Res. Develop. (Jan.) 111-119.]] Google Scholar
MATSuBARA, G., ET AL. 1995. 30-ns 55-b shared radix-2 division and square root using a selftimed circuit. In Proceedings of the 12th IEEE Symposium on Computer Arithmetic, IEEE, July 98-105.]] Google Scholar
MCLELLAN, E. 1993. The Alpha AXP architecture and 21064 processor. IEEE Micro, (June) 36-47.]] Google Scholar
McQUILLAN, S. E. AND MCCANNY, J.V. 1991. A VLSI architecture for multiplication, division, and square root. In Proceedings of the 1991 International Conference on Acoustics, Speech and Signal Processing, IEEE, May 1205-1208.]] Google Scholar
McQUILLAN, S.E., MCCANNY, J.V., AND HAMILL, R. 1993. New algorithms and VLSI architectures for SRT division and square root. In Proceedings of the 11th IEEE Symposium on Computer Arithmetic, IEEE, June 80-86.]]Google Scholar
MIPS TECHNOLOGIES, INC. 1994a. Mountain View, CA. RIO000 Microprocessor: Product Overview, Oct.]]Google Scholar
MIPS TECHNOLOGIES, INC. 1994b. Mountain View, CA. R8000 Microprocessor Chip Set: Product Overview, Aug.]]Google Scholar
MIRAPURI, S., WOODACRE, M., AND VASSEGHI, N. 1992. The Mips R4000 processor. IEEE Micro Apr. 10-22.]] Google Scholar
MISRA, M. 1990. IBM RISC System~6000 Technology. IBM.]]Google Scholar
MOLER, C. 1995. A tale of two numbers. SIAM News 28, 1 Jan. 1,16.]]Google Scholar
MONTOYE, R.K., HOKENEK, E., AND RUNYON, S.L. 1990. Design of the IBM RISC System/6000 floating-point execution unit. IBM J. Res. Develop. (Jan.) 59-70.]] Google Scholar
OBERMAN, S.F. AND FLYNN, M.J. 1994. An analysis of division algorithms and implementations. Tech. Rep. CSL-TR-95-675, Stanford Univ. Departments of Electrical Engineering and Computer Science, Stanford, CA, Dec.]] Google Scholar
OBERMAN, S. F. AND FLYNN, M.J. 1994. Design issues in floating-point division. Tech. Rep. CSL-TR-94-647, Stanford University Departments of Electrical Engineering and Computer Science, Stanford, CA, Dec.]] Google Scholar
PENG, V., SAMUDRALA, S., AND GAVRIELOV, M. 1987. On the implementation of shifters, multipliers, and dividers in VLSI floating point units. In Proceedings of the 8th IEEE Symposium on Computer Arithmetic, IEEE, May 95-102.]]Google Scholar
SARMA, D. D. AND MATULA, D.W. 1993. Measuring the accuracy of ROM reciprocal tables. In Proceedings of the 11th IEEE Symposium on Computer Arithmetic, (June) 95-102.]]Google Scholar
SARMA, D. D. AND MATULA, D.W. 1995. Faithful bipartite ROM reciprocal tables. In Proceedings of the 12th IEEE Symposium on Computer Arithmetic, IEEE (July) 17-28.]] Google Scholar
SCOTT, N.R. 1985. Computer Number Systems and Arithmetic. Prentice Hall, Englewood Cliffs, NJ.]] Google Scholar
SHARANGPANI, H.P. AND BARTON, M.L. 1994. Statistical analysis of floating point flaw in the PentiumTM processor. Tech. Rep., Intel Corporation, Nov.]]Google Scholar
SIMHA, S. 1993. R4400 Microprocessor: Product Information. MIPS Technologies, Inc., Mountain View, CA. Sept.]]Google Scholar
SONG, S.P. ET AL. 1994. The PowerPC 604 RISC microprocessor. IEEE Micro (Oct.) 8-17.]] Google Scholar
STEARNS, C.C. 1989. Subtractive floating-point division and square root for VLSI DSP. In European Conference on Circuit Theory and Design. Sept. 405-409.]]Google Scholar
SUN MICROSYSTEMS COMPUTER CORPORATION. 1992. Mountain View, CA. The SuperSPARCTM Microprocessor, May.]]Google Scholar
TAYLOR, G.S. 1985. Radix 16 SRT dividers with overlapped quotient selection stages. In Proceedings of the 7th IEEE Symposium on Computer Arithmetic, IEEE, June 64-71.]]Google Scholar
WASER, S. AND FLYNN, M.J. 1982. An Introduction to Arithmetic for Digital System Designers. Holt, Rinehart and Winston, New York.]] Google Scholar
WEICKER, R.P. 1991. A detailed look at some popular benchmarks. Parallel Computing, (Dec.) 1153-1172.]]Google Scholar
WHITE, S. W., ET AL. 1993. How does processor MHz relate to end-user performance? Part 1: Pipelines and functional units. IEEE Micro, (Aug.) 8-16.]] Google Scholar
WHITE, S.W. 1994. POWER2: Architecture and performance. In Digest of Papers: COMPCON Spring 1994, IEEE, Feb. 384-388.]]Google Scholar
WILLIAMS, T.E. 1991. A zero-overhead selftimed 160-ns 54-b CMOS divider. IEEE J. Solid-State Circuits, Nov. 1651-1661.]]Google Scholar
TONG, D. AND FLYNN, M. 1992. Fast division using accurate quotient approximations to reduce the number of iterations. IEEE Trans. Comput. (Aug.) 981-995.]] Google Scholar

Recommendations

An Area/Performance Comparison of Subtractive and Multiplicative Divide/Square Root Implementations
ARITH '95: Proceedings of the 12th Symposium on Computer Arithmetic

The implementations of division and square root in the FPU's of current microprocessors are based on one of two categories of algorithms. Multiplicative techniques, exemplified by the Newton-Raphson method and Goldschmidt's algorithm, share ...
Read More
FPGA Implementations of Radix-10 Digit Recurrence Fixed-Point and Floating-Point Dividers
RECONFIG '11: Proceedings of the 2011 International Conference on Reconfigurable Computing and FPGAs

In this paper we present three different radix-10 digit recurrence division algorithms for FPGA architectures. The first one implements the simple shift-and-subtract algorithm, whereas the second and third implementations each perform digit recurrence ...
Read More
Radix-16 Combined Division and Square Root Unit
ARITH '11: Proceedings of the 2011 IEEE 20th Symposium on Computer Arithmetic

Division and square root, based on the digit-recurrence algorithm, can be implemented in a combined unit. Several implementations of combined division/square root units have been presented mostly for radices 2 and 4. Here, we present a combined radix-16 ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Computing Surveys Volume 28, Issue 3
Sept. 1996
127 pages
ISSN:0360-0300
EISSN:1557-7341
DOI:10.1145/243439
Issue’s Table of Contents

Copyright © 1996 ACM
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 September 1996
Published in csur Volume 28, Issue 3

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
FPU
SRT
area and performance tradeoffs
division
floating-point
square root
Qualifiers
- article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 56
  Total Citations
  View Citations
- 2,064
  Total Downloads
- Downloads (Last 12 months)191
- Downloads (Last 6 weeks)16
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Area and performance tradeoffs in floating-point divide and square-root implementations

ACM Computing Surveys

Abstract

References

Cited By

Recommendations

An Area/Performance Comparison of Subtractive and Multiplicative Divide/Square Root Implementations

FPGA Implementations of Radix-10 Digit Recurrence Fixed-Point and Floating-Point Dividers

Radix-16 Combined Division and Square Root Unit

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Area and performance tradeoffs in floating-point divide and square-root implementations

ACM Computing Surveys

Abstract

References

Cited By

Recommendations

An Area/Performance Comparison of Subtractive and Multiplicative Divide/Square Root Implementations

FPGA Implementations of Radix-10 Digit Recurrence Fixed-Point and Floating-Point Dividers

Radix-16 Combined Division and Square Root Unit

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media