Abstract
Floating-point divide and square-root operations are essential to many scientific and engineering applications, and are required in all computer systems that support the IEEE floating-point standard. Yet many current microprocessors provide only weak support for these operations. The latency and throughput of division are typically far inferior to those of floating-point addition and multiplication, and square-root performance is often even lower. This article argues the case for high-performance division and square root. It also explains the algorithms and implementations of the primary techniques, subtractive and multiplicative methods, employed in microprocessor floating-point units with their associated area/performance tradeoffs. Case studies of representative floating-point unit configurations are presented, supported by simulation results using a carefully selected benchmark, Givens rotation, to show the dynamic performance impact of the various implementation alternatives. The topology of the implementation is found to be an important performance factor. Multiplicative algorithms, such as the Newton-Raphson method and Goldschmidt's algorithm, can achieve low latencies. However, these implementations serialize multiply, divide, and square root operations through a single pipeline, which can lead to low throughput. While this hardware sharing yields low size requirements for baseline implementations, lower-latency versions require many times more area. For these reasons, multiplicative implementations are best suited to cases where subtractive methods are precluded by area constraints, and modest performance on divide and square root operations is tolerable. Subtractive algorithms, exemplified by radix-4 SRT and radix-16 SRT, can be made to execute in parallel with other floating-point operations.
- ALPERT, D. AND AVNON, D. 1993.Architecture of the Pentium microprocessor. IEEE Micro (June) 11-21.]] Google Scholar
- ANDERSON, S. F. ET AL. 1967. The IBM System/ 360 Model 91: Floating-point execution unit. IBM J. Res. (Jan.) 34-53.]]Google Scholar
- ASPREY, T. ET AL. 1993. Performance features of the PA7100 microprocessor. IEEE Micro (June) 22-35.]] Google Scholar
- ATKINS, D.E. 1968. Higher-radix division using estimates of the divisor and partial remainders. IEEE Trans. Comput. (Oct.) 925-934.]]Google Scholar
- BANNON, P. AND KELLER, J. 1995. Internal architecture of Alpha 21164 microprocessor. In Digest of Papers: COMPCON Spring 1995, IEEE, Feb. 79-87.]] Google Scholar
- BECKER, M.C. ET AL. 1993. The PowerPC 601 microprocessor. IEEE Micro (Oct.) 54-68.]] Google Scholar
- BERRY, M., CHEN, D., KOSS, P., AND KUCK, D. 1988. The Perfect Club benchmarks: Effective performance evaluation of supercomputers. CSRD Rep. No. 827, Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign, Nov.]]Google Scholar
- BERRY, M., CYBENKO, G., AND LARSON, J. 1991. Scientific benchmark characterizations. Parallel Comput. (Dec.) 1173-1194.]]Google Scholar
- BRIGGS, W. S. AND MATULA, D.W. 1993. A 17 ~ 69 multiply and add unit with redundant binary feedback and single cycle latency. In Proceedings of the 11th IEEE Symposium on Computer Arithmetic, June 163-170.]]Google Scholar
- BURGESS, B. ET AL. 1994. The PowerPC 603 RISC microprocessor. Commun. ACM (June) 34-42.]] Google Scholar
- CASE, B. 1995. SPEC95 retires SPEC92. Microprocessor Rep. (Aug.) 11-14.]]Google Scholar
- COL, T. AND TANG, P. T.P. 1995. It takes six ones to reach a flaw. In Proceedings of the 12th IEEE Symposium on Computer Arithmetic, IEEE, July 140-146.]] Google Scholar
- CONTE, T.M. 1993. Architectural resource requirements of contemporary benchmarks: A wish list. In Proceedings of the 26th Annual Hawaii International Conference on System Sciences, IEEE, Jan. 517-529.]]Google Scholar
- DARLEY, H. M. ET AL. 1989. Floating-point/integer processor with divide and square root functions. U.S. Patent 4,878,190, Oct.]]Google Scholar
- DARLEY, M. ET AL. 1990. The TMS390C602A floating-point coprocessor for Sparc systems. IEEE Micro, (June) 36-47.]] Google Scholar
- DIXIT, K.M. 1991. The SPEC benchmarks. Parallel Computing, (Dec.) 1195-1209.]]Google Scholar
- ERCEGOVAC, M. D. AND LANG, T. 1994. Division and Square Root: Digit Recurrence Algorithms and Implementations. Kluwer Academic Publishers; Norwell, MA.]] Google Scholar
- ERCEGOVAC, M.D., LANG, T., AND MONTUSCHI, P. 1993. Very high radix division with selection by rounding and prescaling. In Proceedings of the 11th IEEE Symposium on Computer Arithmetic, IEEE, June 112-119.]]Google Scholar
- FOWLER, D. L. AND SMITH, J.E. 1989. An accurate, high speed implementation of division by reciprocal approximation. In Proceedings of the 9th IEEE Symposium on Computer Arithmetic, Sept. 60-67.]]Google Scholar
- FRANTZESKAKIS, E. N. AND LIU, K. J.R. 1994. A class of square root and division free algorithms and architectures for QRD-based adaptive signal processing. IRE Trans. Signal Process. (Sept.) 2455-2469.]]Google Scholar
- GOLUB, G. H. AND VAN LOAN, C.F. 1989. Matrix Computations. The Johns Hopkins University Press, Baltimore, Second Ed.]]Google Scholar
- GREENLY, D. ET AL. 1995. UltraSPARC(tm): The next generation superscalar 64-bit SPARC. In Digest of Papers: COMPCON Spring 1995, IEEE (Feb.) 442-451.]] Google Scholar
- GROSS, T. 1985. Software implementation of floating-point arithmetic on a reduced-instruction-set processor. J. Parallel Distrib. Comput. 2, 362-375.]] Google Scholar
- GWENNAP, L. 1994. PA-7200 enables inexpensive MP systems: HP's next-generation PA- RISC also contains unique "assist" cache. Microprocessor Rep. (Jan.).]]Google Scholar
- GWENNAP, L. 1995a. Hal reveals multichip SPARC processor: High-performance CPU for Hal systems only--no merchant sales. Microprocessor Rep. 9, 3 (Mar.) 1,6-11.]]Google Scholar
- GWENNAP, L. 1995b. Intel's P6 uses decoupled superscalar design: Next generation of x86 integrates L2 cache in package with CPU. Microprocessor Rep. 9, 2 (Feb.) 9-15.]]Google Scholar
- HENNESSY, J.L. AND PATTERSON, D.A. 1990a. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, San Mateo, CA.]] Google Scholar
- HENNESSY, J.L. AND PATTERSON, D.A. 1990b. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, San Mateo, CA. Appendix A: Computer Arithmetic by David Goldberg.]] Google Scholar
- HUNT, D. 1995. Advanced performance features of the 64-bit PA-8000. In Digest of Papers: COMPCON Spring 1995, IEEE, Feb. 123- 128.]] Google Scholar
- IEEE STANDARD FOR BINARY FLOATING-POINT ARITHMETIC. 1985. New York ANSI/IEEE Std. 754--1985, Aug.]]Google Scholar
- JULIUSSEN, E. 1994. Which low-end workstation? IEEE Spectrum, (Apr.) 51-59.]] Google Scholar
- KABUO, H. ET AL. 1994. Accurate rounding scheme for the Newton-Raphson method using redundant binary representation. IEEE Trans. Comput. (Jan.)43-50.]] Google Scholar
- KAHAN, W. 1994. Using MathCAD 3.1 on a Mac, Aug.]]Google Scholar
- KOHN, L. AND MARGULIS, N. 1989. Introducing the Intel i860 64-bit microprocessor. IEEE Micro, (Aug.) 15-30.]] Google Scholar
- LANG, T. AND MONTUSCHI, P. 1995. Very-high radix combined division and square root with prescaling and selection by rounding. In Proceedngs of the 12th IEEE Symposium on Computer Arithmetic, IEEE, July 124-131.]] Google Scholar
- Lu, P. Y. AND DAWALLU, K. 1989. A VLSI module for IEEE floating-pont multiplication/division/square root. In Proceedings of the IEEE International Conference on Computer Design: VLSI in Computers and Processors, 366-368.]]Google Scholar
- MARKSTEIN, P.W. 1990. Computation of elementary functions on the IBM RISC System/ 6000 processor. IBM J. Res. Develop. (Jan.) 111-119.]] Google Scholar
- MATSuBARA, G., ET AL. 1995. 30-ns 55-b shared radix-2 division and square root using a selftimed circuit. In Proceedings of the 12th IEEE Symposium on Computer Arithmetic, IEEE, July 98-105.]] Google Scholar
- MCLELLAN, E. 1993. The Alpha AXP architecture and 21064 processor. IEEE Micro, (June) 36-47.]] Google Scholar
- McQUILLAN, S. E. AND MCCANNY, J.V. 1991. A VLSI architecture for multiplication, division, and square root. In Proceedings of the 1991 International Conference on Acoustics, Speech and Signal Processing, IEEE, May 1205-1208.]] Google Scholar
- McQUILLAN, S.E., MCCANNY, J.V., AND HAMILL, R. 1993. New algorithms and VLSI architectures for SRT division and square root. In Proceedings of the 11th IEEE Symposium on Computer Arithmetic, IEEE, June 80-86.]]Google Scholar
- MIPS TECHNOLOGIES, INC. 1994a. Mountain View, CA. RIO000 Microprocessor: Product Overview, Oct.]]Google Scholar
- MIPS TECHNOLOGIES, INC. 1994b. Mountain View, CA. R8000 Microprocessor Chip Set: Product Overview, Aug.]]Google Scholar
- MIRAPURI, S., WOODACRE, M., AND VASSEGHI, N. 1992. The Mips R4000 processor. IEEE Micro Apr. 10-22.]] Google Scholar
- MISRA, M. 1990. IBM RISC System~6000 Technology. IBM.]]Google Scholar
- MOLER, C. 1995. A tale of two numbers. SIAM News 28, 1 Jan. 1,16.]]Google Scholar
- MONTOYE, R.K., HOKENEK, E., AND RUNYON, S.L. 1990. Design of the IBM RISC System/6000 floating-point execution unit. IBM J. Res. Develop. (Jan.) 59-70.]] Google Scholar
- OBERMAN, S.F. AND FLYNN, M.J. 1994. An analysis of division algorithms and implementations. Tech. Rep. CSL-TR-95-675, Stanford Univ. Departments of Electrical Engineering and Computer Science, Stanford, CA, Dec.]] Google Scholar
- OBERMAN, S. F. AND FLYNN, M.J. 1994. Design issues in floating-point division. Tech. Rep. CSL-TR-94-647, Stanford University Departments of Electrical Engineering and Computer Science, Stanford, CA, Dec.]] Google Scholar
- PENG, V., SAMUDRALA, S., AND GAVRIELOV, M. 1987. On the implementation of shifters, multipliers, and dividers in VLSI floating point units. In Proceedings of the 8th IEEE Symposium on Computer Arithmetic, IEEE, May 95-102.]]Google Scholar
- SARMA, D. D. AND MATULA, D.W. 1993. Measuring the accuracy of ROM reciprocal tables. In Proceedings of the 11th IEEE Symposium on Computer Arithmetic, (June) 95-102.]]Google Scholar
- SARMA, D. D. AND MATULA, D.W. 1995. Faithful bipartite ROM reciprocal tables. In Proceedings of the 12th IEEE Symposium on Computer Arithmetic, IEEE (July) 17-28.]] Google Scholar
- SCOTT, N.R. 1985. Computer Number Systems and Arithmetic. Prentice Hall, Englewood Cliffs, NJ.]] Google Scholar
- SHARANGPANI, H.P. AND BARTON, M.L. 1994. Statistical analysis of floating point flaw in the PentiumTM processor. Tech. Rep., Intel Corporation, Nov.]]Google Scholar
- SIMHA, S. 1993. R4400 Microprocessor: Product Information. MIPS Technologies, Inc., Mountain View, CA. Sept.]]Google Scholar
- SONG, S.P. ET AL. 1994. The PowerPC 604 RISC microprocessor. IEEE Micro (Oct.) 8-17.]] Google Scholar
- STEARNS, C.C. 1989. Subtractive floating-point division and square root for VLSI DSP. In European Conference on Circuit Theory and Design. Sept. 405-409.]]Google Scholar
- SUN MICROSYSTEMS COMPUTER CORPORATION. 1992. Mountain View, CA. The SuperSPARCTM Microprocessor, May.]]Google Scholar
- TAYLOR, G.S. 1985. Radix 16 SRT dividers with overlapped quotient selection stages. In Proceedings of the 7th IEEE Symposium on Computer Arithmetic, IEEE, June 64-71.]]Google Scholar
- WASER, S. AND FLYNN, M.J. 1982. An Introduction to Arithmetic for Digital System Designers. Holt, Rinehart and Winston, New York.]] Google Scholar
- WEICKER, R.P. 1991. A detailed look at some popular benchmarks. Parallel Computing, (Dec.) 1153-1172.]]Google Scholar
- WHITE, S. W., ET AL. 1993. How does processor MHz relate to end-user performance? Part 1: Pipelines and functional units. IEEE Micro, (Aug.) 8-16.]] Google Scholar
- WHITE, S.W. 1994. POWER2: Architecture and performance. In Digest of Papers: COMPCON Spring 1994, IEEE, Feb. 384-388.]]Google Scholar
- WILLIAMS, T.E. 1991. A zero-overhead selftimed 160-ns 54-b CMOS divider. IEEE J. Solid-State Circuits, Nov. 1651-1661.]]Google Scholar
- TONG, D. AND FLYNN, M. 1992. Fast division using accurate quotient approximations to reduce the number of iterations. IEEE Trans. Comput. (Aug.) 981-995.]] Google Scholar
Recommendations
An Area/Performance Comparison of Subtractive and Multiplicative Divide/Square Root Implementations
ARITH '95: Proceedings of the 12th Symposium on Computer ArithmeticThe implementations of division and square root in the FPU's of current microprocessors are based on one of two categories of algorithms. Multiplicative techniques, exemplified by the Newton-Raphson method and Goldschmidt's algorithm, share ...
FPGA Implementations of Radix-10 Digit Recurrence Fixed-Point and Floating-Point Dividers
RECONFIG '11: Proceedings of the 2011 International Conference on Reconfigurable Computing and FPGAsIn this paper we present three different radix-10 digit recurrence division algorithms for FPGA architectures. The first one implements the simple shift-and-subtract algorithm, whereas the second and third implementations each perform digit recurrence ...
Radix-16 Combined Division and Square Root Unit
ARITH '11: Proceedings of the 2011 IEEE 20th Symposium on Computer ArithmeticDivision and square root, based on the digit-recurrence algorithm, can be implemented in a combined unit. Several implementations of combined division/square root units have been presented mostly for radices 2 and 4. Here, we present a combined radix-16 ...
Comments