skip to main content
article
Free Access

Area and performance tradeoffs in floating-point divide and square-root implementations

Published:01 September 1996Publication History
Skip Abstract Section

Abstract

Floating-point divide and square-root operations are essential to many scientific and engineering applications, and are required in all computer systems that support the IEEE floating-point standard. Yet many current microprocessors provide only weak support for these operations. The latency and throughput of division are typically far inferior to those of floating-point addition and multiplication, and square-root performance is often even lower. This article argues the case for high-performance division and square root. It also explains the algorithms and implementations of the primary techniques, subtractive and multiplicative methods, employed in microprocessor floating-point units with their associated area/performance tradeoffs. Case studies of representative floating-point unit configurations are presented, supported by simulation results using a carefully selected benchmark, Givens rotation, to show the dynamic performance impact of the various implementation alternatives. The topology of the implementation is found to be an important performance factor. Multiplicative algorithms, such as the Newton-Raphson method and Goldschmidt's algorithm, can achieve low latencies. However, these implementations serialize multiply, divide, and square root operations through a single pipeline, which can lead to low throughput. While this hardware sharing yields low size requirements for baseline implementations, lower-latency versions require many times more area. For these reasons, multiplicative implementations are best suited to cases where subtractive methods are precluded by area constraints, and modest performance on divide and square root operations is tolerable. Subtractive algorithms, exemplified by radix-4 SRT and radix-16 SRT, can be made to execute in parallel with other floating-point operations.

References

  1. ALPERT, D. AND AVNON, D. 1993.Architecture of the Pentium microprocessor. IEEE Micro (June) 11-21.]] Google ScholarGoogle Scholar
  2. ANDERSON, S. F. ET AL. 1967. The IBM System/ 360 Model 91: Floating-point execution unit. IBM J. Res. (Jan.) 34-53.]]Google ScholarGoogle Scholar
  3. ASPREY, T. ET AL. 1993. Performance features of the PA7100 microprocessor. IEEE Micro (June) 22-35.]] Google ScholarGoogle Scholar
  4. ATKINS, D.E. 1968. Higher-radix division using estimates of the divisor and partial remainders. IEEE Trans. Comput. (Oct.) 925-934.]]Google ScholarGoogle Scholar
  5. BANNON, P. AND KELLER, J. 1995. Internal architecture of Alpha 21164 microprocessor. In Digest of Papers: COMPCON Spring 1995, IEEE, Feb. 79-87.]] Google ScholarGoogle Scholar
  6. BECKER, M.C. ET AL. 1993. The PowerPC 601 microprocessor. IEEE Micro (Oct.) 54-68.]] Google ScholarGoogle Scholar
  7. BERRY, M., CHEN, D., KOSS, P., AND KUCK, D. 1988. The Perfect Club benchmarks: Effective performance evaluation of supercomputers. CSRD Rep. No. 827, Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign, Nov.]]Google ScholarGoogle Scholar
  8. BERRY, M., CYBENKO, G., AND LARSON, J. 1991. Scientific benchmark characterizations. Parallel Comput. (Dec.) 1173-1194.]]Google ScholarGoogle Scholar
  9. BRIGGS, W. S. AND MATULA, D.W. 1993. A 17 ~ 69 multiply and add unit with redundant binary feedback and single cycle latency. In Proceedings of the 11th IEEE Symposium on Computer Arithmetic, June 163-170.]]Google ScholarGoogle Scholar
  10. BURGESS, B. ET AL. 1994. The PowerPC 603 RISC microprocessor. Commun. ACM (June) 34-42.]] Google ScholarGoogle Scholar
  11. CASE, B. 1995. SPEC95 retires SPEC92. Microprocessor Rep. (Aug.) 11-14.]]Google ScholarGoogle Scholar
  12. COL, T. AND TANG, P. T.P. 1995. It takes six ones to reach a flaw. In Proceedings of the 12th IEEE Symposium on Computer Arithmetic, IEEE, July 140-146.]] Google ScholarGoogle Scholar
  13. CONTE, T.M. 1993. Architectural resource requirements of contemporary benchmarks: A wish list. In Proceedings of the 26th Annual Hawaii International Conference on System Sciences, IEEE, Jan. 517-529.]]Google ScholarGoogle Scholar
  14. DARLEY, H. M. ET AL. 1989. Floating-point/integer processor with divide and square root functions. U.S. Patent 4,878,190, Oct.]]Google ScholarGoogle Scholar
  15. DARLEY, M. ET AL. 1990. The TMS390C602A floating-point coprocessor for Sparc systems. IEEE Micro, (June) 36-47.]] Google ScholarGoogle Scholar
  16. DIXIT, K.M. 1991. The SPEC benchmarks. Parallel Computing, (Dec.) 1195-1209.]]Google ScholarGoogle Scholar
  17. ERCEGOVAC, M. D. AND LANG, T. 1994. Division and Square Root: Digit Recurrence Algorithms and Implementations. Kluwer Academic Publishers; Norwell, MA.]] Google ScholarGoogle Scholar
  18. ERCEGOVAC, M.D., LANG, T., AND MONTUSCHI, P. 1993. Very high radix division with selection by rounding and prescaling. In Proceedings of the 11th IEEE Symposium on Computer Arithmetic, IEEE, June 112-119.]]Google ScholarGoogle Scholar
  19. FOWLER, D. L. AND SMITH, J.E. 1989. An accurate, high speed implementation of division by reciprocal approximation. In Proceedings of the 9th IEEE Symposium on Computer Arithmetic, Sept. 60-67.]]Google ScholarGoogle Scholar
  20. FRANTZESKAKIS, E. N. AND LIU, K. J.R. 1994. A class of square root and division free algorithms and architectures for QRD-based adaptive signal processing. IRE Trans. Signal Process. (Sept.) 2455-2469.]]Google ScholarGoogle Scholar
  21. GOLUB, G. H. AND VAN LOAN, C.F. 1989. Matrix Computations. The Johns Hopkins University Press, Baltimore, Second Ed.]]Google ScholarGoogle Scholar
  22. GREENLY, D. ET AL. 1995. UltraSPARC(tm): The next generation superscalar 64-bit SPARC. In Digest of Papers: COMPCON Spring 1995, IEEE (Feb.) 442-451.]] Google ScholarGoogle Scholar
  23. GROSS, T. 1985. Software implementation of floating-point arithmetic on a reduced-instruction-set processor. J. Parallel Distrib. Comput. 2, 362-375.]] Google ScholarGoogle Scholar
  24. GWENNAP, L. 1994. PA-7200 enables inexpensive MP systems: HP's next-generation PA- RISC also contains unique "assist" cache. Microprocessor Rep. (Jan.).]]Google ScholarGoogle Scholar
  25. GWENNAP, L. 1995a. Hal reveals multichip SPARC processor: High-performance CPU for Hal systems only--no merchant sales. Microprocessor Rep. 9, 3 (Mar.) 1,6-11.]]Google ScholarGoogle Scholar
  26. GWENNAP, L. 1995b. Intel's P6 uses decoupled superscalar design: Next generation of x86 integrates L2 cache in package with CPU. Microprocessor Rep. 9, 2 (Feb.) 9-15.]]Google ScholarGoogle Scholar
  27. HENNESSY, J.L. AND PATTERSON, D.A. 1990a. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, San Mateo, CA.]] Google ScholarGoogle Scholar
  28. HENNESSY, J.L. AND PATTERSON, D.A. 1990b. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, San Mateo, CA. Appendix A: Computer Arithmetic by David Goldberg.]] Google ScholarGoogle Scholar
  29. HUNT, D. 1995. Advanced performance features of the 64-bit PA-8000. In Digest of Papers: COMPCON Spring 1995, IEEE, Feb. 123- 128.]] Google ScholarGoogle Scholar
  30. IEEE STANDARD FOR BINARY FLOATING-POINT ARITHMETIC. 1985. New York ANSI/IEEE Std. 754--1985, Aug.]]Google ScholarGoogle Scholar
  31. JULIUSSEN, E. 1994. Which low-end workstation? IEEE Spectrum, (Apr.) 51-59.]] Google ScholarGoogle Scholar
  32. KABUO, H. ET AL. 1994. Accurate rounding scheme for the Newton-Raphson method using redundant binary representation. IEEE Trans. Comput. (Jan.)43-50.]] Google ScholarGoogle Scholar
  33. KAHAN, W. 1994. Using MathCAD 3.1 on a Mac, Aug.]]Google ScholarGoogle Scholar
  34. KOHN, L. AND MARGULIS, N. 1989. Introducing the Intel i860 64-bit microprocessor. IEEE Micro, (Aug.) 15-30.]] Google ScholarGoogle Scholar
  35. LANG, T. AND MONTUSCHI, P. 1995. Very-high radix combined division and square root with prescaling and selection by rounding. In Proceedngs of the 12th IEEE Symposium on Computer Arithmetic, IEEE, July 124-131.]] Google ScholarGoogle Scholar
  36. Lu, P. Y. AND DAWALLU, K. 1989. A VLSI module for IEEE floating-pont multiplication/division/square root. In Proceedings of the IEEE International Conference on Computer Design: VLSI in Computers and Processors, 366-368.]]Google ScholarGoogle Scholar
  37. MARKSTEIN, P.W. 1990. Computation of elementary functions on the IBM RISC System/ 6000 processor. IBM J. Res. Develop. (Jan.) 111-119.]] Google ScholarGoogle Scholar
  38. MATSuBARA, G., ET AL. 1995. 30-ns 55-b shared radix-2 division and square root using a selftimed circuit. In Proceedings of the 12th IEEE Symposium on Computer Arithmetic, IEEE, July 98-105.]] Google ScholarGoogle Scholar
  39. MCLELLAN, E. 1993. The Alpha AXP architecture and 21064 processor. IEEE Micro, (June) 36-47.]] Google ScholarGoogle Scholar
  40. McQUILLAN, S. E. AND MCCANNY, J.V. 1991. A VLSI architecture for multiplication, division, and square root. In Proceedings of the 1991 International Conference on Acoustics, Speech and Signal Processing, IEEE, May 1205-1208.]] Google ScholarGoogle Scholar
  41. McQUILLAN, S.E., MCCANNY, J.V., AND HAMILL, R. 1993. New algorithms and VLSI architectures for SRT division and square root. In Proceedings of the 11th IEEE Symposium on Computer Arithmetic, IEEE, June 80-86.]]Google ScholarGoogle Scholar
  42. MIPS TECHNOLOGIES, INC. 1994a. Mountain View, CA. RIO000 Microprocessor: Product Overview, Oct.]]Google ScholarGoogle Scholar
  43. MIPS TECHNOLOGIES, INC. 1994b. Mountain View, CA. R8000 Microprocessor Chip Set: Product Overview, Aug.]]Google ScholarGoogle Scholar
  44. MIRAPURI, S., WOODACRE, M., AND VASSEGHI, N. 1992. The Mips R4000 processor. IEEE Micro Apr. 10-22.]] Google ScholarGoogle Scholar
  45. MISRA, M. 1990. IBM RISC System~6000 Technology. IBM.]]Google ScholarGoogle Scholar
  46. MOLER, C. 1995. A tale of two numbers. SIAM News 28, 1 Jan. 1,16.]]Google ScholarGoogle Scholar
  47. MONTOYE, R.K., HOKENEK, E., AND RUNYON, S.L. 1990. Design of the IBM RISC System/6000 floating-point execution unit. IBM J. Res. Develop. (Jan.) 59-70.]] Google ScholarGoogle Scholar
  48. OBERMAN, S.F. AND FLYNN, M.J. 1994. An analysis of division algorithms and implementations. Tech. Rep. CSL-TR-95-675, Stanford Univ. Departments of Electrical Engineering and Computer Science, Stanford, CA, Dec.]] Google ScholarGoogle Scholar
  49. OBERMAN, S. F. AND FLYNN, M.J. 1994. Design issues in floating-point division. Tech. Rep. CSL-TR-94-647, Stanford University Departments of Electrical Engineering and Computer Science, Stanford, CA, Dec.]] Google ScholarGoogle Scholar
  50. PENG, V., SAMUDRALA, S., AND GAVRIELOV, M. 1987. On the implementation of shifters, multipliers, and dividers in VLSI floating point units. In Proceedings of the 8th IEEE Symposium on Computer Arithmetic, IEEE, May 95-102.]]Google ScholarGoogle Scholar
  51. SARMA, D. D. AND MATULA, D.W. 1993. Measuring the accuracy of ROM reciprocal tables. In Proceedings of the 11th IEEE Symposium on Computer Arithmetic, (June) 95-102.]]Google ScholarGoogle Scholar
  52. SARMA, D. D. AND MATULA, D.W. 1995. Faithful bipartite ROM reciprocal tables. In Proceedings of the 12th IEEE Symposium on Computer Arithmetic, IEEE (July) 17-28.]] Google ScholarGoogle Scholar
  53. SCOTT, N.R. 1985. Computer Number Systems and Arithmetic. Prentice Hall, Englewood Cliffs, NJ.]] Google ScholarGoogle Scholar
  54. SHARANGPANI, H.P. AND BARTON, M.L. 1994. Statistical analysis of floating point flaw in the PentiumTM processor. Tech. Rep., Intel Corporation, Nov.]]Google ScholarGoogle Scholar
  55. SIMHA, S. 1993. R4400 Microprocessor: Product Information. MIPS Technologies, Inc., Mountain View, CA. Sept.]]Google ScholarGoogle Scholar
  56. SONG, S.P. ET AL. 1994. The PowerPC 604 RISC microprocessor. IEEE Micro (Oct.) 8-17.]] Google ScholarGoogle Scholar
  57. STEARNS, C.C. 1989. Subtractive floating-point division and square root for VLSI DSP. In European Conference on Circuit Theory and Design. Sept. 405-409.]]Google ScholarGoogle Scholar
  58. SUN MICROSYSTEMS COMPUTER CORPORATION. 1992. Mountain View, CA. The SuperSPARCTM Microprocessor, May.]]Google ScholarGoogle Scholar
  59. TAYLOR, G.S. 1985. Radix 16 SRT dividers with overlapped quotient selection stages. In Proceedings of the 7th IEEE Symposium on Computer Arithmetic, IEEE, June 64-71.]]Google ScholarGoogle Scholar
  60. WASER, S. AND FLYNN, M.J. 1982. An Introduction to Arithmetic for Digital System Designers. Holt, Rinehart and Winston, New York.]] Google ScholarGoogle Scholar
  61. WEICKER, R.P. 1991. A detailed look at some popular benchmarks. Parallel Computing, (Dec.) 1153-1172.]]Google ScholarGoogle Scholar
  62. WHITE, S. W., ET AL. 1993. How does processor MHz relate to end-user performance? Part 1: Pipelines and functional units. IEEE Micro, (Aug.) 8-16.]] Google ScholarGoogle Scholar
  63. WHITE, S.W. 1994. POWER2: Architecture and performance. In Digest of Papers: COMPCON Spring 1994, IEEE, Feb. 384-388.]]Google ScholarGoogle Scholar
  64. WILLIAMS, T.E. 1991. A zero-overhead selftimed 160-ns 54-b CMOS divider. IEEE J. Solid-State Circuits, Nov. 1651-1661.]]Google ScholarGoogle Scholar
  65. TONG, D. AND FLYNN, M. 1992. Fast division using accurate quotient approximations to reduce the number of iterations. IEEE Trans. Comput. (Aug.) 981-995.]] Google ScholarGoogle Scholar

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in

Full Access

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader