skip to main content
article
Free Access

What every computer scientist should know about floating-point arithmetic

Published:01 March 1991Publication History
Skip Abstract Section

Abstract

Floating-point arithmetic is considered as esoteric subject by many people. This is rather surprising, because floating-point is ubiquitous in computer systems: Almost every language has a floating-point datatype; computers from PCs to supercomputers have floating-point accelerators; most compilers will be called upon to compile floating-point algorithms from time to time; and virtually every operating system must respond to floating-point exceptions such as overflow. This paper presents a tutorial on the aspects of floating-point that have a direct impact on designers of computer systems. It begins with background on floating-point representation and rounding error, continues with a discussion of the IEEE floating point standard, and concludes with examples of how computer system builders can better support floating point.

References

  1. AHO, A. V., SETHI, R., AND ULLMAN, J. D. 1986. Compilers: Principles, Techniques and Tools. Addison-Wesley, Reading, Mass. Google ScholarGoogle Scholar
  2. ANSI 1978. American National Standard Programming Language FORTRAN, ANSI Standard X3.9-1978. American National Standards Institute, New York.Google ScholarGoogle Scholar
  3. BARNETT, D. 1987. A portable floating-point environment. Unpublished manuscript.Google ScholarGoogle Scholar
  4. BROWN, W. S. 1981. A simple but realistic model of floating-point computation. ACM Trans. Math. Softw. 7, 4, 445-480. Google ScholarGoogle Scholar
  5. CARDELLI, L., DONAHUE, J., GLASSMAN, L., JORDAN, M., KASLOW, B., AND NELSON, G. 1989. Modula-3 Report (revised). Digital Systems Research Center Report *~52, Palo Alto, Calif.Google ScholarGoogle Scholar
  6. CODY, W. J. et al. 1984. A proposed radix- and word-length-independent standard for floatingpoint arithmetic. IEEE Micro 4, 4, 86-100.Google ScholarGoogle Scholar
  7. CODY, W. J. 1988. Floating-point standards--Theory and practice. In Reliability in Computing: The Role of lnterval Methods on Scientific Computing, Ramon E. Moore, Ed. Academic Press, Boston, Mass., pp. 99-107. Google ScholarGoogle Scholar
  8. COONEN, J. 1984. Contributions to a proposed standard for binary floating-point arithmetic. PhD dissertation, Univ. of California, Berkeley. Google ScholarGoogle Scholar
  9. DEKKER, T. J. 1971. A floating-point technique for extending the available precision. Numer. Math. 18, 3, 224-242.Google ScholarGoogle Scholar
  10. DEMMEL, J. 1984. Underflow and the reliability of numerical software. SIAM J. Sci. Stat. Cornput. 5, 4, 887-919.Google ScholarGoogle Scholar
  11. FARNUM, C. 1988. Compiler support for floatingpoint computation. Softw. Pract. Experi. 18, 7, 701-709. Google ScholarGoogle Scholar
  12. FORSYTHE, G. E., AND MOLER, C. B. 1967. Computer Solutmn of Linear Algebraic Systems. Prentice-Hall, Englewood Cliffs, N.J.Google ScholarGoogle Scholar
  13. GOLDBERG, I. B. 1967. 27 Bits are not enough for 8-digit accuracy. Commum. ACM 10, 2, 105-106. Google ScholarGoogle Scholar
  14. GOLDBERC, D. 1990. Computer arithmetic. In Computer Architecture: A Quantitative Approach, David Patterson and John L. Hennessy, Eds. Morgan Kaufmann, Los Altos, Calif., Appendix A.Google ScholarGoogle Scholar
  15. GOLUB, G. H., AND VAN LOAN, C. F. 1989. Matrix Computations. The Johns Hopkins University Press, Baltimore, MD.Google ScholarGoogle Scholar
  16. HEWLETT PACKARD 1982. HP-15C Advanced Functions Handbook.Google ScholarGoogle Scholar
  17. IEEE 1987. IEEE Standard 754-1985 for Binary Floating-Point Arithmetic, IEEE. Reprinted in SIGPLAN 22, 2, 9-25.Google ScholarGoogle Scholar
  18. KASAN, W. 1972. A Survey of Error Analysis. In Information Processing 71, (Ljubljana, Yugoslavia), North Holland, Amsterdam, vol. 2, pp. 1214-1239.Google ScholarGoogle Scholar
  19. KAHAN, W. 1986. Calculating Area and Angle of a Needle-like Triangle. Unpublished manuscript.Google ScholarGoogle Scholar
  20. KAHAN, W. 1987. Branch cuts for complex elementary functions. In The State of the Art in Numerical Analyszs, M. J. D. Powell and A Iserles, Eds., Oxford University Press, N.Y., Chap. 7.Google ScholarGoogle Scholar
  21. KAnAN, W. 1988. Unpublished lectures given at Sun Microsystems, Mountain View, Calif.Google ScholarGoogle Scholar
  22. KAHAN, W., AND COONEN, J. T. 1982. The near orthogonality of syntax, semantics, and diagnostics in numerical programming environments. In The Relationship between Numerical Computation and Programmi,g Languages, J. K. Reid~ Ed. North-Holland~ Amsterdam~ pp 103 115.Google ScholarGoogle Scholar
  23. KAI~AN, W., AND LEBLANC, E. 1985. Anomalies in the IBM acrith package. In Proceedings of the 7th IEEE Symposium on Computer Arithmetic (Urbana, Ill.), pp. 322-331.Google ScholarGoogle Scholar
  24. KERXIGHAN, B. W., AND RITCHm, D. M. 1978. The C Programming Language. Prentice~Hall, Englewood Cliffs, N.J. Google ScholarGoogle Scholar
  25. KIRCHNER, R., AND KULISCH, U 1987 Arithmetic for vector processors. In Proceedings of the 8th IEEE Symposium on Computer Arithmetic (Como, Italy), pp. 256-269.Google ScholarGoogle Scholar
  26. KNUT~, D. E. 1981. The Art of Computer Programming Addison-Wesley, Reading, Mass., vol. II, 2nd ed. Google ScholarGoogle Scholar
  27. KULISH, U. W., ArCD MmANKER W. L. 1986. The Arithmetic of the Digital Computer: A new approach. SIAM Rev 28, 1, 1-36. Google ScholarGoogle Scholar
  28. MATULA, D. W., AND KORNERUP, P. 1985. Finite Precision Rational Arithmetic: Slash Number Systems. IEEE Trans. Comput. C-34, 1, 3-18.Google ScholarGoogle Scholar
  29. REISER, J. F., A2CD KNUTH, D E. 1975. Evading the drift in floating-point addition. Inf. Process. Lett 3, 3, 84-87Google ScholarGoogle Scholar
  30. STERBETZ, P. H. 1974. Floating-Point Computation. Prentice-Hall, Englewood Cliffs, N.J.Google ScholarGoogle Scholar
  31. SWARTZLANDER, E. E , AND ALEXOPOULOS, G. 1975. The sign/logarithm number system. IEEE Trans. Comput. C-24, 12, 1238-1242Google ScholarGoogle Scholar
  32. WALTHER, J. S. 1971. A unified algorithm for elementary functions. Proceedings of the AFIP Spr~ng Joint Computer Conference, pp. 379- 385.Google ScholarGoogle Scholar

Index Terms

  1. What every computer scientist should know about floating-point arithmetic

                    Recommendations

                    Reviews

                    Louis W. Ehrlich

                    The title of this paper is appropriate. In fact, the words “user of computers for computation” could replace the words “computer scientist.” In Section 1, “Rounding Errors,” the details of rounding are described. Some of the topics discussed are floating-point formats, relative error and ulps (units in the last place), guard digits, and cancellation. Goldberg discusses the relations between ulps, relative error, and machine epsilon, with examples illustrating “wobble” in ulps. (Recent correspondence over NA-NET was appropriately concerned with new measures of precision in floating-point arithmetic, which fits in here.) He points out the significance of guard digits. Catastrophic cancellation (between computed numbers) and benign cancellation (between exact numbers) are described with some examples showing how some of this may be avoided by rewriting formulas. Section 2, “IEEE Standard,” describes the two different standards, 754 (for binary) and 854 (for binary or decimal). Subtopics include formats and operations (including base and precision), special quantities (such as NaNs and infinity), and exceptions, flags, and trap handlers. Not only are the standards described, but the reasons they were chosen are discussed and examples are given to illustrate the need or desire for them. This section should be kept in mind when one is considering purchasing a new computer. Section 3, “Systems Aspects,” contains discussion on such topics as instruction sets, languages and compilers (ambiguity and optimizers), and exception handling. Section 4, “Details,” presents proofs of some of the statements made earlier (such as “a single guard digit is enough to guarantee that addition and subtraction will always be accurate”). Finally, in Section 5, “Summary,” the author again states that rigorous reasoning can be applied to floating-point algorithms, consistent with underlying hardware and efficient algorithms. The paper contains an appendix in which several theorems are proven, including a highly accurate summation formula due to W. Kahan. Everyone who uses a computer to compute should read or at least peruse this paper. Many of the examples given are things we do without thinking of possible results. We should at least be aware of what can happen.

                    Access critical reviews of Computing literature here

                    Become a reviewer for Computing Reviews.

                    Comments

                    Login options

                    Check if you have access through your login credentials or your institution to get full access on this article.

                    Sign in

                    Full Access

                    PDF Format

                    View or Download as a PDF file.

                    PDF

                    eReader

                    View online with eReader.

                    eReader