article

Free Access

What every computer scientist should know about floating-point arithmetic

Author:
David Goldberg

Xerox Palo Alto Research Center, Palo Alto, CA

Xerox Palo Alto Research Center, Palo Alto, CA
View Profile

Authors Info & Claims

ACM Computing Surveys Volume 23 Issue 1pp 5–48https://doi.org/10.1145/103162.103163

Published:01 March 1991Publication History

ACM Computing Surveys

Abstract

Floating-point arithmetic is considered as esoteric subject by many people. This is rather surprising, because floating-point is ubiquitous in computer systems: Almost every language has a floating-point datatype; computers from PCs to supercomputers have floating-point accelerators; most compilers will be called upon to compile floating-point algorithms from time to time; and virtually every operating system must respond to floating-point exceptions such as overflow. This paper presents a tutorial on the aspects of floating-point that have a direct impact on designers of computer systems. It begins with background on floating-point representation and rounding error, continues with a discussion of the IEEE floating point standard, and concludes with examples of how computer system builders can better support floating point.

References

AHO, A. V., SETHI, R., AND ULLMAN, J. D. 1986. Compilers: Principles, Techniques and Tools. Addison-Wesley, Reading, Mass. Google Scholar
ANSI 1978. American National Standard Programming Language FORTRAN, ANSI Standard X3.9-1978. American National Standards Institute, New York.Google Scholar
BARNETT, D. 1987. A portable floating-point environment. Unpublished manuscript.Google Scholar
BROWN, W. S. 1981. A simple but realistic model of floating-point computation. ACM Trans. Math. Softw. 7, 4, 445-480. Google Scholar
CARDELLI, L., DONAHUE, J., GLASSMAN, L., JORDAN, M., KASLOW, B., AND NELSON, G. 1989. Modula-3 Report (revised). Digital Systems Research Center Report *~52, Palo Alto, Calif.Google Scholar
CODY, W. J. et al. 1984. A proposed radix- and word-length-independent standard for floatingpoint arithmetic. IEEE Micro 4, 4, 86-100.Google Scholar
CODY, W. J. 1988. Floating-point standards--Theory and practice. In Reliability in Computing: The Role of lnterval Methods on Scientific Computing, Ramon E. Moore, Ed. Academic Press, Boston, Mass., pp. 99-107. Google Scholar
COONEN, J. 1984. Contributions to a proposed standard for binary floating-point arithmetic. PhD dissertation, Univ. of California, Berkeley. Google Scholar
DEKKER, T. J. 1971. A floating-point technique for extending the available precision. Numer. Math. 18, 3, 224-242.Google Scholar
DEMMEL, J. 1984. Underflow and the reliability of numerical software. SIAM J. Sci. Stat. Cornput. 5, 4, 887-919.Google Scholar
FARNUM, C. 1988. Compiler support for floatingpoint computation. Softw. Pract. Experi. 18, 7, 701-709. Google Scholar
FORSYTHE, G. E., AND MOLER, C. B. 1967. Computer Solutmn of Linear Algebraic Systems. Prentice-Hall, Englewood Cliffs, N.J.Google Scholar
GOLDBERG, I. B. 1967. 27 Bits are not enough for 8-digit accuracy. Commum. ACM 10, 2, 105-106. Google Scholar
GOLDBERC, D. 1990. Computer arithmetic. In Computer Architecture: A Quantitative Approach, David Patterson and John L. Hennessy, Eds. Morgan Kaufmann, Los Altos, Calif., Appendix A.Google Scholar
GOLUB, G. H., AND VAN LOAN, C. F. 1989. Matrix Computations. The Johns Hopkins University Press, Baltimore, MD.Google Scholar
HEWLETT PACKARD 1982. HP-15C Advanced Functions Handbook.Google Scholar
IEEE 1987. IEEE Standard 754-1985 for Binary Floating-Point Arithmetic, IEEE. Reprinted in SIGPLAN 22, 2, 9-25.Google Scholar
KASAN, W. 1972. A Survey of Error Analysis. In Information Processing 71, (Ljubljana, Yugoslavia), North Holland, Amsterdam, vol. 2, pp. 1214-1239.Google Scholar
KAHAN, W. 1986. Calculating Area and Angle of a Needle-like Triangle. Unpublished manuscript.Google Scholar
KAHAN, W. 1987. Branch cuts for complex elementary functions. In The State of the Art in Numerical Analyszs, M. J. D. Powell and A Iserles, Eds., Oxford University Press, N.Y., Chap. 7.Google Scholar
KAnAN, W. 1988. Unpublished lectures given at Sun Microsystems, Mountain View, Calif.Google Scholar
KAHAN, W., AND COONEN, J. T. 1982. The near orthogonality of syntax, semantics, and diagnostics in numerical programming environments. In The Relationship between Numerical Computation and Programmi,g Languages, J. K. Reid~ Ed. North-Holland~ Amsterdam~ pp 103 115.Google Scholar
KAI~AN, W., AND LEBLANC, E. 1985. Anomalies in the IBM acrith package. In Proceedings of the 7th IEEE Symposium on Computer Arithmetic (Urbana, Ill.), pp. 322-331.Google Scholar
KERXIGHAN, B. W., AND RITCHm, D. M. 1978. The C Programming Language. Prentice~Hall, Englewood Cliffs, N.J. Google Scholar
KIRCHNER, R., AND KULISCH, U 1987 Arithmetic for vector processors. In Proceedings of the 8th IEEE Symposium on Computer Arithmetic (Como, Italy), pp. 256-269.Google Scholar
KNUT~, D. E. 1981. The Art of Computer Programming Addison-Wesley, Reading, Mass., vol. II, 2nd ed. Google Scholar
KULISH, U. W., ArCD MmANKER W. L. 1986. The Arithmetic of the Digital Computer: A new approach. SIAM Rev 28, 1, 1-36. Google Scholar
MATULA, D. W., AND KORNERUP, P. 1985. Finite Precision Rational Arithmetic: Slash Number Systems. IEEE Trans. Comput. C-34, 1, 3-18.Google Scholar
REISER, J. F., A2CD KNUTH, D E. 1975. Evading the drift in floating-point addition. Inf. Process. Lett 3, 3, 84-87Google Scholar
STERBETZ, P. H. 1974. Floating-Point Computation. Prentice-Hall, Englewood Cliffs, N.J.Google Scholar
SWARTZLANDER, E. E , AND ALEXOPOULOS, G. 1975. The sign/logarithm number system. IEEE Trans. Comput. C-24, 12, 1238-1242Google Scholar
WALTHER, J. S. 1971. A unified algorithm for elementary functions. Proceedings of the AFIP Spr~ng Joint Computer Conference, pp. 379- 385.Google Scholar

Index Terms

Recommendations

A Combined Decimal and Binary Floating-Point Multiplier
ASAP '09: Proceedings of the 2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors

In this paper, we describe the first hardware design of a combined binary and decimal floating-point multiplier, based on specifications in the IEEE 754-2008 Floating-point Standard. The multiplier design operates on either (1) 64-bit binary encoded ...
Read More
A Decimal Floating-Point Divider Using Newton---Raphson Iteration

Increasing chip densities and transistor counts provide more room for designers to add functionality for important application domains into future microprocessors. As a result of rapid growth in financial, commercial, and Internet-based applications, ...
Read More
Options for Denormal Representation in Logarithmic Arithmetic
Abstract
Economical hardware often uses a FiXed-point Number System (FXNS), whose constant absolute precision is acceptable for many signal-processing algorithms. The almost-constant relative precision of the more expensive Floating-Point (FP) number ...
Read More

Reviews

Reviewer: Louis W. Ehrlich

The title of this paper is appropriate. In fact, the words “user of computers for computation” could replace the words “computer scientist.” In Section 1, “Rounding Errors,” the details of rounding are described. Some of the topics discussed are floating-point formats, relative error and ulps (units in the last place), guard digits, and cancellation. Goldberg discusses the relations between ulps, relative error, and machine epsilon, with examples illustrating “wobble” in ulps. (Recent correspondence over NA-NET was appropriately concerned with new measures of precision in floating-point arithmetic, which fits in here.) He points out the significance of guard digits. Catastrophic cancellation (between computed numbers) and benign cancellation (between exact numbers) are described with some examples showing how some of this may be avoided by rewriting formulas. Section 2, “IEEE Standard,” describes the two different standards, 754 (for binary) and 854 (for binary or decimal). Subtopics include formats and operations (including base and precision), special quantities (such as NaNs and infinity), and exceptions, flags, and trap handlers. Not only are the standards described, but the reasons they were chosen are discussed and examples are given to illustrate the need or desire for them. This section should be kept in mind when one is considering purchasing a new computer. Section 3, “Systems Aspects,” contains discussion on such topics as instruction sets, languages and compilers (ambiguity and optimizers), and exception handling. Section 4, “Details,” presents proofs of some of the statements made earlier (such as “a single guard digit is enough to guarantee that addition and subtraction will always be accurate”). Finally, in Section 5, “Summary,” the author again states that rigorous reasoning can be applied to floating-point algorithms, consistent with underlying hardware and efficient algorithms. The paper contains an appendix in which several theorems are proven, including a highly accurate summation formula due to W. Kahan. Everyone who uses a computer to compute should read or at least peruse this paper. Many of the examples given are things we do without thinking of possible results. We should at least be aware of what can happen.

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Computing Surveys Volume 23, Issue 1
March 1991
123 pages
ISSN:0360-0300
EISSN:1557-7341
DOI:10.1145/103162
Issue’s Table of Contents

Copyright © 1991 ACM
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 March 1991
Published in csur Volume 23, Issue 1

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
NaN
denormalized number
exception
floating-point
floating-point standard
gradual underflow
guard digit
overflow
relative error
rounding error
rounding mode
ulp
underflow
Qualifiers
- article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 1,102
  Total Citations
  View Citations
- 26,651
  Total Downloads
- Downloads (Last 12 months)5,745
- Downloads (Last 6 weeks)709
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

What every computer scientist should know about floating-point arithmetic

ACM Computing Surveys

Abstract

References

Cited By

Index Terms

Recommendations

A Combined Decimal and Binary Floating-Point Multiplier

A Decimal Floating-Point Divider Using Newton---Raphson Iteration

Options for Denormal Representation in Logarithmic Arithmetic

Reviews

Access critical reviews of Computing literature here

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

What every computer scientist should know about floating-point arithmetic

ACM Computing Surveys

Abstract

References

Cited By

Index Terms

Recommendations

A Combined Decimal and Binary Floating-Point Multiplier

A Decimal Floating-Point Divider Using Newton---Raphson Iteration

Options for Denormal Representation in Logarithmic Arithmetic

Reviews

Access critical reviews of Computing literature here

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media