
What Every Computer Scientist Should Know About Floating-Point Arithmetic

Note: This document is an edited reprint of the paper What Every Computer Scientist Should Know About Floating-Point Arithmetic, by David Goldberg, published in the March, 1991 issue of Computing Surveys. Copyright 1991, Association for Computing Machinery, Inc., reprinted by permission.

This appendix has the following organization:

Abstract page 172
Introduction page 172
Rounding Error page 173
The IEEE Standard page 189
Systems Aspects page 211
The Details page 225
Summary page 239
Acknowledgments page 240
References page 240
Theorem 14 and Theorem 8 page 243
Differences Among IEEE 754 Implementations page 248


Abstract

Floating-point arithmetic is considered an esoteric subject by many people. This is rather surprising because floating-point is ubiquitous in computer systems. Almost every language has a floating-point datatype; computers from PCs to supercomputers have floating-point accelerators; most compilers will be called upon to compile floating-point algorithms from time to time; and virtually every operating system must respond to floating-point exceptions such as overflow. This paper presents a tutorial on those aspects of floating-point that have a direct impact on designers of computer systems. It begins with background on floating-point representation and rounding error, continues with a discussion of the IEEE floating-point standard, and concludes with numerous examples of how computer builders can better support floating-point.

Categories and Subject Descriptors: (Primary) C.0 [Computer Systems Organization]: General -- instruction set design; D.3.4 [Programming Languages]: Processors -- compilers, optimization; G.1.0 [Numerical Analysis]: General -- computer arithmetic, error analysis, numerical algorithms (Secondary) D.2.1 [Software Engineering]: Requirements/Specifications -- languages; D.3.4 [Programming Languages]: Formal Definitions and Theory -- semantics; D.4.1 [Operating Systems]: Process Management -- synchronization.

General Terms: Algorithms, Design, Languages

Additional Key Words and Phrases: Denormalized number, exception, floating-point, floating-point standard, gradual underflow, guard digit, NaN, overflow, relative error, rounding error, rounding mode, ulp, underflow.

Introduction

Builders of computer systems often need information about floating-point arithmetic. There are, however, remarkably few sources of detailed information about it. One of the few books on the subject, Floating-Point Computation by Pat Sterbenz, is long out of print. This paper is a tutorial on those aspects of floating-point arithmetic (floating-point hereafter) that have a direct connection to systems building. It consists of three loosely connected parts. The first section, Rounding Error on page 173, discusses the implications of using different rounding strategies for the basic operations of addition, subtraction, multiplication and division. It also contains background information on the


two methods of measuring rounding error, ulps and relative error. The second part discusses the IEEE floating-point standard, which is becoming rapidly accepted by commercial hardware manufacturers. Included in the IEEE standard is the rounding method for basic operations. The discussion of the standard draws on the material in the section Rounding Error on page 173. The third part discusses the connections between floating-point and the design of various aspects of computer systems. Topics include instruction set design, optimizing compilers and exception handling.

I have tried to avoid making statements about floating-point without also giving reasons why the statements are true, especially since the justifications involve nothing more complicated than elementary calculus. Those explanations that are not central to the main argument have been grouped into a section called The Details, so that they can be skipped if desired. In particular, the proofs of many of the theorems appear in this section. The end of each proof is marked with the symbol ∎; when a proof is not included, the ∎ appears immediately following the statement of the theorem.

Rounding Error

Squeezing infinitely many real numbers into a finite number of bits requires an approximate representation. Although there are infinitely many integers, in most programs the result of integer computations can be stored in 32 bits. In contrast, given any fixed number of bits, most calculations with real numbers will produce quantities that cannot be exactly represented using that many bits. Therefore the result of a floating-point calculation must often be rounded in order to fit back into its finite representation. This rounding error is the characteristic feature of floating-point computation. Relative Error and Ulps on page 176 describes how it is measured.

Since most floating-point calculations have rounding error anyway, does it matter if the basic arithmetic operations introduce a little bit more rounding error than necessary? That question is a main theme throughout this section. Guard Digits on page 178 discusses guard digits, a means of reducing the error when subtracting two nearby numbers. Guard digits were considered sufficiently important by IBM that in 1968 it added a guard digit to the double precision format in the System/360 architecture (single precision already had a guard digit), and retrofitted all existing machines in the field. Two examples are given to illustrate the utility of guard digits.


The IEEE standard goes further than just requiring the use of a guard digit. It gives an algorithm for addition, subtraction, multiplication, division and square root, and requires that implementations produce the same result as that algorithm. Thus, when a program is moved from one machine to another, the results of the basic operations will be the same in every bit if both machines support the IEEE standard. This greatly simplifies the porting of programs. Other uses of this precise specification are given in Exactly Rounded Operations on page 185.

    Floating-point Formats

Several different representations of real numbers have been proposed, but by far the most widely used is the floating-point representation.¹ Floating-point representations have a base β (which is always assumed to be even) and a precision p. If β = 10 and p = 3 then the number 0.1 is represented as 1.00 × 10⁻¹. If β = 2 and p = 24, then the decimal number 0.1 cannot be represented exactly but is approximately 1.10011001100110011001101 × 2⁻⁴. In general, a floating-point number will be represented as ± d.dd…d × βᵉ, where d.dd…d is called the significand² and has p digits. More precisely, ± d₀.d₁d₂…dₚ₋₁ × βᵉ represents the number

± (d₀ + d₁β⁻¹ + … + dₚ₋₁β⁻⁽ᵖ⁻¹⁾) βᵉ,  0 ≤ dᵢ < β    (1)

The term floating-point number will be used to mean a real number that can be exactly represented in the format under discussion. Two other parameters associated with floating-point representations are the largest and smallest allowable exponents, emax and emin. Since there are βᵖ possible significands, and emax − emin + 1 possible exponents, a floating-point number can be encoded in ⌈log₂(emax − emin + 1)⌉ + ⌈log₂(βᵖ)⌉ + 1 bits, where the final +1 is for the sign bit. The precise encoding is not important for now.

1. Examples of other representations are floating slash and signed logarithm [Matula and Kornerup 1985; Swartzlander and Alexopoulos 1975].

2. This term was introduced by Forsythe and Moler [1967], and has generally replaced the older term mantissa.



There are two reasons why a real number might not be exactly representable as a floating-point number. The most common situation is illustrated by the decimal number 0.1. Although it has a finite decimal representation, in binary it has an infinite repeating representation. Thus when β = 2, the number 0.1 lies strictly between two floating-point numbers and is exactly representable by neither of them. A less common situation is that a real number is out of range, that is, its absolute value is larger than β × β^emax or smaller than 1.0 × β^emin. Most of this paper discusses issues due to the first reason. However, numbers that are out of range will be discussed in Infinity on page 199 and Denormalized Numbers on page 202.
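For readers who want to see this concretely, the exact value stored for 0.1 in an IEEE double can be printed in any modern language; a quick Python illustration (mine, not part of the original paper):

    from decimal import Decimal

    # The double nearest 0.1 is a binary fraction with a repeating pattern,
    # visible in its hex significand ...9999a (the final 'a' is the rounding).
    print((0.1).hex())      # 0x1.999999999999ap-4
    print(Decimal(0.1))     # the exact decimal value of that double:
                            # 0.1000000000000000055511151231257827021181583404541015625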

Floating-point representations are not necessarily unique. For example, both 0.01 × 10¹ and 1.00 × 10⁻¹ represent 0.1. If the leading digit is nonzero (d₀ ≠ 0 in equation (1) above), then the representation is said to be normalized. The floating-point number 1.00 × 10⁻¹ is normalized, while 0.01 × 10¹ is not. When β = 2, p = 3, emin = −1 and emax = 2 there are 16 normalized floating-point numbers, as shown in Figure D-1. The bold hash marks correspond to numbers whose significand is 1.00. Requiring that a floating-point representation be normalized makes the representation unique. Unfortunately, this restriction makes it impossible to represent zero! A natural way to represent 0 is with 1.0 × β^(emin−1), since this preserves the fact that the numerical ordering of nonnegative real numbers corresponds to the lexicographic ordering of their floating-point representations.¹ When the exponent is stored in a k bit field, that means that only 2ᵏ − 1 values are available for use as exponents, since one must be reserved to represent 0.

Note that the × in a floating-point number is part of the notation, and different from a floating-point multiply operation. The meaning of the × symbol should be clear from the context. For example, the expression (2.5 × 10⁻³) × (4.0 × 10²) involves only a single floating-point multiplication.

1. This assumes the usual arrangement where the exponent is stored to the left of the significand.


Figure D-1 Normalized numbers when β = 2, p = 3, emin = −1, emax = 2 [figure: a number line from 0 to 7 with hash marks at the representable values]

    Relative Error and Ulps

Since rounding error is inherent in floating-point computation, it is important to have a way to measure this error. Consider the floating-point format with β = 10 and p = 3, which will be used throughout this section. If the result of a floating-point computation is 3.12 × 10⁻², and the answer when computed to infinite precision is .0314, it is clear that this is in error by 2 units in the last place. Similarly, if the real number .0314159 is represented as 3.14 × 10⁻², then it is in error by .159 units in the last place. In general, if the floating-point number d.d…d × βᵉ is used to represent z, then it is in error by |d.d…d − (z/βᵉ)| βᵖ⁻¹ units in the last place.¹ ² The term ulps will be used as shorthand for units in the last place. If the result of a calculation is the floating-point number nearest to the correct result, it still might be in error by as much as .5 ulp. Another way to measure the difference between a floating-point number and the real number it is approximating is relative error, which is simply the difference between the two numbers divided by the real number. For example the relative error committed when approximating 3.14159 by 3.14 × 10⁰ is .00159/3.14159 ≈ .0005. To compute the relative error that corresponds to .5 ulp, observe that when a real number is approximated by the closest possible floating-point number d.dd…dd × βᵉ, the error can be as large as 0.00…00β′ × βᵉ, where β′ is the digit β/2, there are p units in the significand of the floating-point number, and p units of 0 in the significand of the error. This error is ((β/2)β⁻ᵖ) × βᵉ. Since

1. Unless the number z is larger than β^(emax+1) or smaller than β^emin. Numbers which are out of range in this fashion will not be considered until further notice.

2. Let z′ be the floating-point number that approximates z. Then |d.d…d − (z/βᵉ)| βᵖ⁻¹ is equivalent to |z′ − z|/ulp(z′). (See Numerical Computation Guide for the definition of ulp(z)). A more accurate formula for measuring error is |z′ − z|/ulp(z). -- Ed.


numbers of the form d.dd…dd × βᵉ all have the same absolute error, but have values that range between βᵉ and β × βᵉ, the relative error ranges between ((β/2)β⁻ᵖ) × βᵉ/βᵉ and ((β/2)β⁻ᵖ) × βᵉ/βᵉ⁺¹. That is,

(1/2)β⁻ᵖ ≤ (1/2) ulp ≤ (β/2)β⁻ᵖ    (2)

In particular, the relative error corresponding to .5 ulp can vary by a factor of β. This factor is called the wobble. Setting ε = (β/2)β⁻ᵖ to the largest of the bounds in (2) above, we can say that when a real number is rounded to the closest floating-point number, the relative error is always bounded by ε, which is referred to as machine epsilon.

In the example above, the relative error was .00159/3.14159 ≈ .0005. In order to avoid such small numbers, the relative error is normally written as a factor times ε, which in this case is ε = (β/2)β⁻ᵖ = 5(10)⁻³ = .005. Thus the relative error would be expressed as ((.00159/3.14159)/.005)ε ≈ 0.1ε.

To illustrate the difference between ulps and relative error, consider the real number x = 12.35. It is approximated by x̃ = 1.24 × 10¹. The error is 0.5 ulps, the relative error is 0.8ε. Next consider the computation 8x. The exact value is 8x = 98.8, while the computed value is 8x̃ = 9.92 × 10¹. The error is now 4.0 ulps, but the relative error is still 0.8ε. The error measured in ulps is 8 times larger, even though the relative error is the same. In general, when the base is β, a fixed relative error expressed in ulps can wobble by a factor of up to β. And conversely, as equation (2) above shows, a fixed error of .5 ulps results in a relative error that can wobble by β.

The most natural way to measure rounding error is in ulps. For example rounding to the nearest floating-point number corresponds to an error of less than or equal to .5 ulp. However, when analyzing the rounding error caused by various formulas, relative error is a better measure. A good illustration of this is the analysis on page 226. Since ε can overestimate the effect of rounding to the nearest floating-point number by the wobble factor of β, error estimates of formulas will be tighter on machines with a small β.

When only the order of magnitude of rounding error is of interest, ulps and ε may be used interchangeably, since they differ by at most a factor of β. For example, when a floating-point number is in error by n ulps, that means that the number of contaminated digits is log_β n. If the relative error in a computation is nε, then

contaminated digits ≈ log_β n    (3)


Guard Digits

One method of computing the difference between two floating-point numbers is to compute the difference exactly and then round it to the nearest floating-point number. This is very expensive if the operands differ greatly in size. Assuming p = 3, 2.15 × 10¹² − 1.25 × 10⁻⁵ would be calculated as

x = 2.15 × 10¹²
y = .0000000000000000125 × 10¹²
x − y = 2.1499999999999999875 × 10¹²

which rounds to 2.15 × 10¹². Rather than using all these digits, floating-point hardware normally operates on a fixed number of digits. Suppose that the number of digits kept is p, and that when the smaller operand is shifted right, digits are simply discarded (as opposed to rounding). Then 2.15 × 10¹² − 1.25 × 10⁻⁵ becomes

x = 2.15 × 10¹²
y = 0.00 × 10¹²
x − y = 2.15 × 10¹²

The answer is exactly the same as if the difference had been computed exactly and then rounded. Take another example: 10.1 − 9.93. This becomes

x = 1.01 × 10¹
y = 0.99 × 10¹
x − y = .02 × 10¹

The correct answer is .17, so the computed difference is off by 30 ulps and is wrong in every digit! How bad can the error be?

Theorem 1

Using a floating-point format with parameters β and p, and computing differences using p digits, the relative error of the result can be as large as β − 1.

Proof

A relative error of β − 1 in the expression x − y occurs when x = 1.00…0 and y = .ρρ…ρ, where ρ = β − 1. Here y has p digits (all equal to ρ). The exact difference is x − y = β⁻ᵖ. However, when computing the answer using only p


digits, the rightmost digit of y gets shifted off, and so the computed difference is β⁻ᵖ⁺¹. Thus the error is β⁻ᵖ − β⁻ᵖ⁺¹ = β⁻ᵖ(β − 1), and the relative error is β⁻ᵖ(β − 1)/β⁻ᵖ = β − 1. ∎

When β = 2, the relative error can be as large as the result, and when β = 10, it can be 9 times larger. Or to put it another way, when β = 2, equation (3) above shows that the number of contaminated digits is log₂(1/ε) = log₂(2ᵖ) = p. That is, all of the p digits in the result are wrong! Suppose that one extra digit is added to guard against this situation (a guard digit). That is, the smaller number is truncated to p + 1 digits, and then the result of the subtraction is rounded to p digits. With a guard digit, the previous example becomes

x = 1.010 × 10¹
y = 0.993 × 10¹
x − y = .017 × 10¹

and the answer is exact. With a single guard digit, the relative error of the result may be greater than ε, as in 110 − 8.59.

x = 1.10 × 10²
y = .085 × 10²
x − y = 1.015 × 10²

This rounds to 102, compared with the correct answer of 101.41, for a relative error of .006, which is greater than ε = .005. In general, the relative error of the result can be only slightly larger than ε. More precisely,

Theorem 2

If x and y are floating-point numbers in a format with parameters β and p, and if subtraction is done with p + 1 digits (i.e. one guard digit), then the relative rounding error in the result is less than 2ε.

This theorem will be proven in Rounding Error on page 225. Addition is included in the above theorem since x and y can be positive or negative.
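The effect of dropping shifted-off digits can be simulated with Python's decimal module; the sub_no_guard helper below is a sketch of mine (the name and the truncation strategy are illustrative, not from the paper):

    from decimal import Decimal, getcontext

    def sub_no_guard(x, y, p):
        # Truncate y to the p digit positions used by x, i.e. discard any
        # digits shifted past position p (no guard digit), then subtract
        # and round the result to p digits.
        q = Decimal(1).scaleb(x.adjusted() - p + 1)   # weight of x's last digit
        y_trunc = y.quantize(q, rounding="ROUND_DOWN")
        getcontext().prec = p
        return +(x - y_trunc)                          # unary + rounds to p digits

    # The paper's example: beta = 10, p = 3, 10.1 - 9.93.
    print(sub_no_guard(Decimal("10.1"), Decimal("9.93"), 3))  # 0.2 -- 30 ulps off
    getcontext().prec = 3
    print(+(Decimal("10.1") - Decimal("9.93")))               # 0.17, exactly rounded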

Cancellation

The last section can be summarized by saying that without a guard digit, the relative error committed when subtracting two nearby quantities can be very large. In other words, the evaluation of any expression containing a subtraction (or an addition of quantities with opposite signs) could result in a relative error


so large that all the digits are meaningless (Theorem 1). When subtracting nearby quantities, the most significant digits in the operands match and cancel each other. There are two kinds of cancellation: catastrophic and benign.

Catastrophic cancellation occurs when the operands are subject to rounding errors. For example in the quadratic formula, the expression b² − 4ac occurs. The quantities b² and 4ac are subject to rounding errors since they are the results of floating-point multiplications. Suppose that they are rounded to the nearest floating-point number, and so are accurate to within .5 ulp. When they are subtracted, cancellation can cause many of the accurate digits to disappear, leaving behind mainly digits contaminated by rounding error. Hence the difference might have an error of many ulps. For example, consider b = 3.34, a = 1.22, and c = 2.28. The exact value of b² − 4ac is .0292. But b² rounds to 11.2 and 4ac rounds to 11.1, hence the final answer is .1 which is an error by 70 ulps, even though 11.2 − 11.1 is exactly equal to .1.¹ The subtraction did not introduce any error, but rather exposed the error introduced in the earlier multiplications.
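This example is easy to reproduce in 3-digit decimal arithmetic with Python's decimal module (an illustrative sketch):

    from decimal import Decimal, getcontext

    getcontext().prec = 3                 # beta = 10, p = 3, round to nearest even
    a, b, c = Decimal("1.22"), Decimal("3.34"), Decimal("2.28")

    b2  = b * b                           # 11.1556 rounds to 11.2
    ac4 = 4 * a * c                       # 11.1264 rounds to 11.1
    print(b2 - ac4)                       # 0.1: only rounding-error digits remain

    getcontext().prec = 28
    print(b * b - 4 * a * c)              # 0.0292, the exact value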

Benign cancellation occurs when subtracting exactly known quantities. If x and y have no rounding error, then by Theorem 2 if the subtraction is done with a guard digit, the difference x − y has a very small relative error (less than 2ε). A formula that exhibits catastrophic cancellation can sometimes be rearranged to eliminate the problem. Again consider the quadratic formula

r₁ = (−b + √(b² − 4ac)) / (2a),  r₂ = (−b − √(b² − 4ac)) / (2a)    (4)

When b² ≫ ac, then b² − 4ac does not involve a cancellation and √(b² − 4ac) ≈ |b|. But the other addition (subtraction) in one of the formulas will have a catastrophic cancellation. To avoid this, multiply the numerator and denominator of r₁ by −b − √(b² − 4ac)

1. 700, not 70. Since .1 − .0292 = .0708, the error in terms of ulp(0.0292) is 708 ulps. -- Ed.


(and similarly for r₂) to obtain

r₁ = 2c / (−b − √(b² − 4ac)),  r₂ = 2c / (−b + √(b² − 4ac))    (5)

If b² ≫ ac and b > 0, then computing r₁ using formula (4) will involve a cancellation. Therefore, use (5) for computing r₁ and (4) for r₂. On the other hand, if b < 0, use (4) for computing r₁ and (5) for r₂.
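In code, the rule "pick the non-cancelling formula for each root" looks like the sketch below (Python, assuming real roots and a ≠ 0):

    import math

    def quadratic_roots(a, b, c):
        d = math.sqrt(b * b - 4 * a * c)
        if b >= 0:
            return (2 * c) / (-b - d), (-b - d) / (2 * a)   # (5) for r1, (4) for r2
        else:
            return (-b + d) / (2 * a), (2 * c) / (-b + d)   # (4) for r1, (5) for r2

    a, b, c = 1.0, 1e8, 1.0
    print(quadratic_roots(a, b, c))                 # (-1e-08, -1e+08), both accurate
    print((-b + math.sqrt(b*b - 4*a*c)) / (2 * a))  # naive (4) for r1: ~ -7.45e-09,
                                                    # half the digits lost to cancellation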

The expression x² − y² is another formula that exhibits catastrophic cancellation. It is more accurate to evaluate it as (x − y)(x + y).¹ Unlike the quadratic formula, this improved form still has a subtraction, but it is a benign cancellation of quantities without rounding error, not a catastrophic one. By Theorem 2, the relative error in x − y is at most 2ε. The same is true of x + y. Multiplying two quantities with a small relative error results in a product with a small relative error (see Rounding Error on page 225).

In order to avoid confusion between exact and computed values, the following notation is used. Whereas x − y denotes the exact difference of x and y, x ⊖ y denotes the computed difference (i.e., with rounding error). Similarly ⊕, ⊗, and ⊘ denote computed addition, multiplication, and division, respectively. All caps indicate the computed value of a function, as in LN(x) or SQRT(x). Lowercase functions and traditional mathematical notation denote their exact values as in ln(x) and √x.

Although (x ⊖ y) ⊗ (x ⊕ y) is an excellent approximation to x² − y², the floating-point numbers x and y might themselves be approximations to some true quantities x̂ and ŷ. For example, x̂ and ŷ might be exactly known decimal numbers that cannot be expressed exactly in binary. In this case, even though x ⊖ y is a good approximation to x − y, it can have a huge relative error compared to the true expression x̂ − ŷ, and so the advantage of (x + y)(x − y) over x² − y² is not as dramatic. Since computing (x + y)(x − y) is about the same amount of work as computing x² − y², it is clearly the preferred form in this case. In general, however, replacing a catastrophic cancellation by a benign one is not worthwhile if the expense is large, because the input is often (but not

1. Although the expression (x − y)(x + y) does not cause a catastrophic cancellation, it is slightly less accurate than x² − y² if x ≫ y or x ≪ y. In this case, (x − y)(x + y) has three rounding errors, but x² − y² has only two since the rounding error committed when computing the smaller of x² and y² does not affect the final subtraction.


always) an approximation. But eliminating a cancellation entirely (as in the quadratic formula) is worthwhile even if the data are not exact. Throughout this paper, it will be assumed that the floating-point inputs to an algorithm are exact and that the results are computed as accurately as possible.

The expression x² − y² is more accurate when rewritten as (x − y)(x + y) because a catastrophic cancellation is replaced with a benign one. We next present more interesting examples of formulas exhibiting catastrophic cancellation that can be rewritten to exhibit only benign cancellation.
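The two forms are easy to compare in this section's 3-digit decimal arithmetic (a quick sketch of mine using Python's decimal module):

    from decimal import Decimal, getcontext

    getcontext().prec = 3                 # beta = 10, p = 3
    x, y = Decimal("3.34"), Decimal("3.33")

    print(x*x - y*y)                      # 11.2 - 11.1 = 0.1: catastrophic
    print((x - y) * (x + y))              # 0.01 * 6.67 = 0.0667: exact here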

The area of a triangle can be expressed directly in terms of the lengths of its sides a, b, and c as

A = √(s(s − a)(s − b)(s − c)), where s = (a + b + c)/2    (6)

Suppose the triangle is very flat; that is, a ≈ b + c. Then s ≈ a, and the term (s − a) in eq. (6) subtracts two nearby numbers, one of which may have rounding error. For example, if a = 9.0, b = c = 4.53, then the correct value of s is 9.03 and A is 2.342… . Even though the computed value of s (9.05) is in error by only 2 ulps, the computed value of A is 3.04, an error of 70 ulps.

There is a way to rewrite formula (6) so that it will return accurate results even for flat triangles [Kahan 1986]. It is

A = √((a + (b + c))(c − (a − b))(c + (a − b))(a + (b − c))) / 4,  a ≥ b ≥ c    (7)

If a, b and c do not satisfy a ≥ b ≥ c, simply rename them before applying (7). It is straightforward to check that the right-hand sides of (6) and (7) are algebraically identical. Using the values of a, b, and c above gives a computed area of 2.35, which is 1 ulp in error and much more accurate than the first formula.

Although formula (7) is much more accurate than (6) for this example, it would be nice to know how well (7) performs in general.

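A direct transcription of (6) and (7) into Python (a sketch of mine; the sorting step implements the "rename them" instruction, and the parenthesization in (7) must be kept exactly as written):

    import math

    def heron(a, b, c):
        s = (a + b + c) / 2                        # formula (6)
        return math.sqrt(s * (s - a) * (s - b) * (s - c))

    def kahan_area(a, b, c):
        a, b, c = sorted((a, b, c), reverse=True)  # ensure a >= b >= c
        return math.sqrt((a + (b + c)) * (c - (a - b))
                         * (c + (a - b)) * (a + (b - c))) / 4

    # The paper's flat triangle. In 53-bit doubles both results are close;
    # in low precision formats (6) degrades badly while (7) stays accurate.
    print(heron(9.0, 4.53, 4.53), kahan_area(9.0, 4.53, 4.53))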

Theorem 3

The rounding error incurred when using (7) to compute the area of a triangle is at most 11ε, provided that subtraction is performed with a guard digit, ε ≤ .005, and that square roots are computed to within 1/2 ulp.

The condition that ε < .005 is met in virtually every actual floating-point system. For example when β = 2, p ≥ 8 ensures that ε < .005, and when β = 10, p ≥ 3 is enough.

In statements like Theorem 3 that discuss the relative error of an expression, it is understood that the expression is computed using floating-point arithmetic. In particular, the relative error is actually of the expression

SQRT((a ⊕ (b ⊕ c)) ⊗ (c ⊖ (a ⊖ b)) ⊗ (c ⊕ (a ⊖ b)) ⊗ (a ⊕ (b ⊖ c))) ⊘ 4    (8)

Because of the cumbersome nature of (8), in the statement of theorems we will usually say the computed value of E rather than writing out E with circle notation.

Error bounds are usually too pessimistic. In the numerical example given above, the computed value of (7) is 2.35, compared with a true value of 2.34216 for a relative error of 0.7ε, which is much less than 11ε. The main reason for computing error bounds is not to get precise bounds but rather to verify that the formula does not contain numerical problems.

A final example of an expression that can be rewritten to use benign cancellation is (1 + x)ⁿ, where x ≪ 1. This expression arises in financial calculations. Consider depositing $100 every day into a bank account that earns an annual interest rate of 6%, compounded daily. If n = 365 and i = .06, the amount of money accumulated at the end of one year is

100 [(1 + i/n)ⁿ − 1] / (i/n)

dollars. If this is computed using β = 2 and p = 24, the result is $37615.45 compared to the exact answer of $37614.05, a discrepancy of $1.40. The reason for the problem is easy to see. The expression 1 + i/n involves adding 1 to .0001643836, so the low order bits of i/n are lost. This rounding error is amplified when (1 + i/n) is raised to the nth power.
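The p = 24 result can be approximated with single precision floats, e.g. via numpy (a sketch of mine; the exact digits printed may vary slightly by platform and math library):

    import numpy as np

    i = np.float32(0.06)
    n = np.float32(365)
    rate = i / n                     # ~ .0001643836; low bits vanish in 1 + rate
    one = np.float32(1)

    amount = np.float32(100) * ((one + rate) ** n - one) / rate
    print(amount)                    # ~ 37615.45 in single precision (p = 24)

    # The same expression in double precision lands near the exact $37614.05:
    print(100 * ((1 + 0.06 / 365) ** 365 - 1) / (0.06 / 365))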


The troublesome expression (1 + i/n)ⁿ can be rewritten as e^(n ln(1 + i/n)), where now the problem is to compute ln(1 + x) for small x. One approach is to use the approximation ln(1 + x) ≈ x, in which case the payment becomes $37617.26, which is off by $3.21 and even less accurate than the obvious formula. But there is a way to compute ln(1 + x) very accurately, as Theorem 4 shows [Hewlett-Packard 1982]. This formula yields $37614.07, accurate to within two cents!

Theorem 4 assumes that LN(x) approximates ln(x) to within 1/2 ulp. The problem it solves is that when x is small, LN(1 ⊕ x) is not close to ln(1 + x) because 1 ⊕ x has lost the information in the low order bits of x. That is, the computed value of ln(1 + x) is not close to its actual value when x ≪ 1.

Theorem 4

If ln(1 + x) is computed using the formula

ln(1 + x) = x, if 1 ⊕ x = 1
ln(1 + x) = x ln(1 + x) / ((1 + x) − 1), if 1 ⊕ x ≠ 1

the relative error is at most 5ε when 0 ≤ x < 3/4, provided subtraction is performed with a guard digit, ε < 0.1, and ln is computed to within 1/2 ulp.

This formula will work for any value of x but is only interesting for x ≪ 1, which is where catastrophic cancellation occurs in the naive formula ln(1 + x). Although the formula may seem mysterious, there is a simple explanation for why it works. Write ln(1 + x) as x (ln(1 + x)/x) = xμ(x). The left hand factor can be computed exactly, but the right hand factor μ(x) = ln(1 + x)/x will suffer a large rounding error when adding 1 to x. However, μ is almost constant, since ln(1 + x) ≈ x. So changing x slightly will not introduce much error. In other words, if x̃ ≈ x, computing xμ(x̃) will be a good approximation to xμ(x) = ln(1 + x). Is there a value for x̃ for which x̃ and x̃ + 1 can be computed


accurately? There is; namely x̃ = (1 ⊕ x) ⊖ 1, because then 1 + x̃ is exactly equal to 1 ⊕ x.
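Theorem 4's formula is only a few lines in any language; a Python sketch of mine (real code should simply call math.log1p, which exists for exactly this reason):

    import math

    def ln1p(x):
        w = 1.0 + x                      # w = 1 (+) x, rounded
        if w == 1.0:
            return x                     # ln(1 + x) ~ x in this regime
        # w - 1.0 is exact (benign cancellation); it equals xtilde = (1 (+) x) (-) 1.
        return x * math.log(w) / (w - 1.0)

    x = 1e-10
    print(math.log(1.0 + x))             # naive: ~1.0000000827e-10, way off in the low digits
    print(ln1p(x))                       # ~9.9999999995e-11
    print(math.log1p(x))                 # library reference, same digits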

The results of this section can be summarized by saying that a guard digit guarantees accuracy when nearby precisely known quantities are subtracted (benign cancellation). Sometimes a formula that gives inaccurate results can be rewritten to have much higher numerical accuracy by using benign cancellation; however, the procedure only works if subtraction is performed using a guard digit. The price of a guard digit is not high, because it merely requires making the adder one bit wider. For a 54 bit double precision adder, the additional cost is less than 2%. For this price, you gain the ability to run many algorithms such as the formula (6) for computing the area of a triangle and the expression ln(1 + x). Although most modern computers have a guard digit, there are a few (such as Cray systems) that do not.

    Exactly Rounded Operations

When floating-point operations are done with a guard digit, they are not as accurate as if they were computed exactly then rounded to the nearest floating-point number. Operations performed in this manner will be called exactly rounded.¹ The example immediately preceding Theorem 2 shows that a single guard digit will not always give exactly rounded results. The previous section gave several examples of algorithms that require a guard digit in order to work properly. This section gives examples of algorithms that require exact rounding.

So far, the definition of rounding has not been given. Rounding is straightforward, with the exception of how to round halfway cases; for example, should 12.5 round to 12 or 13? One school of thought divides the 10 digits in half, letting {0, 1, 2, 3, 4} round down, and {5, 6, 7, 8, 9} round up; thus 12.5 would round to 13. This is how rounding works on Digital Equipment Corporation's VAX computers. Another school of thought says that since numbers ending in 5 are halfway between two possible roundings, they should round down half the time and round up the other half. One way of obtaining this 50% behavior is to require that the rounded result have its least significant

1. Also commonly referred to as correctly rounded. -- Ed.


digit be even. Thus 12.5 rounds to 12 rather than 13 because 2 is even. Which of these methods is best, round up or round to even? Reiser and Knuth [1975] offer the following reason for preferring round to even.

Theorem 5

Let x and y be floating-point numbers, and define x₀ = x, x₁ = (x₀ ⊖ y) ⊕ y, …, xₙ = (xₙ₋₁ ⊖ y) ⊕ y. If ⊕ and ⊖ are exactly rounded using round to even, then either xₙ = x for all n or xₙ = x₁ for all n ≥ 1.

To clarify this result, consider β = 10, p = 3 and let x = 1.00, y = −.555. When rounding up, the sequence becomes x₀ ⊖ y = 1.56, x₁ = 1.56 ⊖ .555 = 1.01, x₁ ⊖ y = 1.01 ⊕ .555 = 1.57, and each successive value of xₙ increases by .01, until xₙ = 9.45 (n ≤ 845).¹ Under round to even, xₙ is always 1.00. This example suggests that when using the round up rule, computations can gradually drift upward, whereas when using round to even the theorem says this cannot happen. Throughout the rest of this paper, round to even will be used.
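The drift is visible immediately with Python's decimal module, which lets us run the iteration under both tie-breaking rules (an illustrative sketch of mine):

    from decimal import Decimal, getcontext, ROUND_HALF_UP, ROUND_HALF_EVEN

    def iterate(rounding, steps=20):
        getcontext().prec = 3                 # beta = 10, p = 3
        getcontext().rounding = rounding
        x, y = Decimal("1.00"), Decimal("-0.555")
        for _ in range(steps):
            x = (x - y) + y                   # x_n = (x_{n-1} (-) y) (+) y
        return x

    print(iterate(ROUND_HALF_UP))             # 1.20 -- drifts up .01 per step
    print(iterate(ROUND_HALF_EVEN))           # 1.00 -- stays put, as Theorem 5 says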

One application of exact rounding occurs in multiple precision arithmetic. There are two basic approaches to higher precision. One approach represents floating-point numbers using a very large significand, which is stored in an array of words, and codes the routines for manipulating these numbers in assembly language. The second approach represents higher precision floating-point numbers as an array of ordinary floating-point numbers, where adding the elements of the array in infinite precision recovers the high precision floating-point number. It is this second approach that will be discussed here. The advantage of using an array of floating-point numbers is that it can be coded portably in a high level language, but it requires exactly rounded arithmetic.

The key to multiplication in this system is representing a product xy as a sum, where each summand has the same precision as x and y. This can be done by splitting x and y. Writing x = xₕ + xₗ and y = yₕ + yₗ, the exact product is xy = xₕyₕ + xₕyₗ + xₗyₕ + xₗyₗ. If x and y have p bit significands, the summands will also have p bit significands provided that xₗ, xₕ, yₕ, yₗ can be represented using ⌊p/2⌋ bits. When p is even, it is easy to find a splitting. The number x₀.x₁ … x_{p−1} can be written as the sum of x₀.x₁ … x_{p/2−1} and 0.0 … 0 x_{p/2} … x_{p−1}. When p is odd, this simple splitting method won't work.

1. When n = 845, xₙ = 9.45, xₙ + 0.555 = 10.0, and 10.0 − 0.555 = 9.45. Therefore, xₙ = x₈₄₅ for n > 845.


An extra bit can, however, be gained by using negative numbers. For example, if β = 2, p = 5, and x = .10111, x can be split as xₕ = .11 and xₗ = −.00001. There is more than one way to split a number. A splitting method that is easy to compute is due to Dekker [1971], but it requires more than a single guard digit.

Theorem 6

Let p be the floating-point precision, with the restriction that p is even when β > 2, and assume that floating-point operations are exactly rounded. Then if k = ⌈p/2⌉ is half the precision (rounded up) and m = βᵏ + 1, x can be split as x = xₕ + xₗ, where xₕ = (m ⊗ x) ⊖ (m ⊗ x ⊖ x), xₗ = x ⊖ xₕ, and each xᵢ is representable using ⌊p/2⌋ bits of precision.
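For IEEE doubles (β = 2, p = 53, so k = 27 and m = 2²⁷ + 1) the splitting is three ordinary operations; a Python sketch of mine, before turning to the paper's decimal example:

    def split(x):
        # Theorem 6 with beta = 2, p = 53, k = 27: m = 2**27 + 1.
        m = 2.0 ** 27 + 1.0
        mx = m * x
        xh = mx - (mx - x)       # high half: the top bits of the significand
        xl = x - xh              # low half, exact; may be negative
        return xh, xl

    x = 0.1
    xh, xl = split(x)
    print(xh + xl == x)          # True: the split is exact
    print(xh * xh)               # products of halves incur no rounding error,
                                 # which is what exact double-width multiply needs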

To see how this theorem works in an example, let β = 10, p = 4, b = 3.476, a = 3.463, and c = 3.479. Then b² − ac rounded to the nearest floating-point number is .03480, while b ⊗ b = 12.08, a ⊗ c = 12.05, and so the computed value of b² − ac is .03. This is an error of 480 ulps. Using Theorem 6 to write b = 3.5 − .024, a = 3.5 − .037, and c = 3.5 − .021, b² becomes 3.5² − 2 × 3.5 × .024 + .024². Each summand is exact, so b² = 12.25 − .168 + .000576, where the sum is left unevaluated at this point. Similarly, ac = 3.5² − (3.5 × .037 + 3.5 × .021) + .037 × .021 = 12.25 − .2030 + .000777. Finally, subtracting these two series term by term gives an estimate for b² − ac of 0 ⊖ .0350 ⊕ .000201 = .03480, which is identical to the exactly rounded result. To show that Theorem 6 really requires exact rounding, consider p = 3, β = 2, and x = 7. Then m = 5, mx = 35, and m ⊗ x = 32. If subtraction is performed with a single guard digit, then (m ⊗ x) ⊖ x = 28.

Therefore, xₕ = 4 and xₗ = 3, hence xₗ is not representable with ⌊p/2⌋ = 1 bit. As a final example of exact rounding, consider dividing m by 10. The result is a floating-point number that will in general not be equal to m/10. When β = 2, multiplying m/10 by 10 will miraculously restore m, provided exact rounding is being used. Actually, a more general fact (due to Kahan) is true. The proof is ingenious, but readers not interested in such details can skip ahead to The IEEE Standard on page 189.

Theorem 7

When β = 2, if m and n are integers with |m| < 2ᵖ⁻¹ and n has the special form n = 2ⁱ + 2ʲ, then (m ⊘ n) ⊗ n = m, provided floating-point operations are exactly rounded.


Proof

Scaling by a power of two is harmless, since it changes only the exponent, not the significand. If q = m/n, then scale n so that 2ᵖ⁻¹ ≤ n < 2ᵖ and scale m so that 1/2 < q < 1. Thus, 2ᵖ⁻² < m < 2ᵖ. Since m has p significant bits, it has at most one bit to the right of the binary point. Changing the sign of m is harmless, so assume that q > 0.

If q̄ = m ⊘ n, to prove the theorem requires showing that

|n q̄ − m| ≤ 1/4    (9)

That is because m has at most 1 bit right of the binary point, so n q̄ will round to m. To deal with the halfway case when |n q̄ − m| = 1/4, note that since the initial unscaled m had |m| < 2ᵖ⁻¹, its low-order bit was 0, so the low-order bit of the scaled m is also 0. Thus, halfway cases will round to m.

Suppose that q = .q₁q₂…, and let q̂ = .q₁q₂…qₚ1. To estimate |n q̄ − m|, first compute |q̂ − q| = |N/2ᵖ⁺¹ − m/n|, where N is an odd integer. Since n = 2ⁱ + 2ʲ and 2ᵖ⁻¹ ≤ n < 2ᵖ, it must be that n = 2ᵖ⁻¹ + 2ᵏ for some k ≤ p − 2, and thus

|q̂ − q| = |nN − 2ᵖ⁺¹m| / (n2ᵖ⁺¹) = |(2ᵖ⁻¹⁻ᵏ + 1)N − 2ᵖ⁺¹⁻ᵏm| / (n2ᵖ⁺¹⁻ᵏ)

The numerator is an integer, and since N is odd, it is in fact an odd integer. Thus, |q̂ − q| ≥ 1/(n2ᵖ⁺¹⁻ᵏ). Assume q < q̂ (the case q > q̂ is similar).¹ Then n q̄ < m, and

|m − n q̄| = m − n q̄ = n(q − q̄) ≤ n(q − (q̂ − 2⁻ᵖ⁻¹)) ≤ n(2⁻ᵖ⁻¹ − 1/(n2ᵖ⁺¹⁻ᵏ)) = (2ᵖ⁻¹ + 2ᵏ)2⁻ᵖ⁻¹ − 2⁻ᵖ⁻¹⁺ᵏ = 1/4

This establishes (9) and proves the theorem.² ∎

1. Notice that in binary, q cannot equal q̂. -- Ed.

2. Left as an exercise to the reader: extend the proof to bases other than 2. -- Ed.


The theorem holds true for any base β, as long as 2ⁱ + 2ʲ is replaced by βⁱ + βʲ. As β gets larger, however, denominators of the form βⁱ + βʲ are farther and farther apart.

We are now in a position to answer the question, does it matter if the basic arithmetic operations introduce a little more rounding error than necessary? The answer is that it does matter, because accurate basic operations enable us to prove that formulas are "correct" in the sense they have a small relative error. Cancellation on page 179 discussed several algorithms that require guard digits to produce correct results in this sense. If the input to those formulas are numbers representing imprecise measurements, however, the bounds of Theorems 3 and 4 become less interesting. The reason is that the benign cancellation x − y can become catastrophic if x and y are only approximations to some measured quantity. But accurate operations are useful even in the face of inexact data, because they enable us to establish exact relationships like those discussed in Theorems 6 and 7. These are useful even if every floating-point variable is only an approximation to some actual value.

The IEEE Standard

There are two different IEEE standards for floating-point computation. IEEE 754 is a binary standard that requires β = 2, p = 24 for single precision and p = 53 for double precision [IEEE 1987]. It also specifies the precise layout of bits in a single and double precision. IEEE 854 allows either β = 2 or β = 10 and unlike 754, does not specify how floating-point numbers are encoded into bits [Cody et al. 1984]. It does not require a particular value for p, but instead it specifies constraints on the allowable values of p for single and double precision. The term IEEE Standard will be used when discussing properties common to both standards.

This section provides a tour of the IEEE standard. Each subsection discusses one aspect of the standard and why it was included. It is not the purpose of this paper to argue that the IEEE standard is the best possible floating-point standard but rather to accept the standard as given and provide an introduction to its use. For full details consult the standards themselves [IEEE 1987; Cody et al. 1984].



    Formats and Operations

    Base

It is clear why IEEE 854 allows β = 10. Base ten is how humans exchange and think about numbers. Using β = 10 is especially appropriate for calculators, where the result of each operation is displayed by the calculator in decimal.

There are several reasons why IEEE 854 requires that if the base is not 10, it must be 2. Relative Error and Ulps on page 176 mentioned one reason: the results of error analyses are much tighter when β is 2 because a rounding error of .5 ulp wobbles by a factor of β when computed as a relative error, and error analyses are almost always simpler when based on relative error. A related reason has to do with the effective precision for large bases. Consider β = 16, p = 1 compared to β = 2, p = 4. Both systems have 4 bits of significand. Consider the computation of 15/8. When β = 2, 15 is represented as 1.111 × 2³, and 15/8 as 1.111 × 2⁰. So 15/8 is exact. However, when β = 16, 15 is represented as F × 16⁰, where F is the hexadecimal digit for 15. But 15/8 is represented as 1 × 16⁰, which has only one bit correct. In general, base 16 can lose up to 3 bits, so that a precision of p hexadecimal digits can have an effective precision as low as 4p − 3 rather than 4p binary bits. Since large values of β have these problems, why did IBM choose β = 16 for its System/370? Only IBM knows for sure, but there are two possible reasons. The first is increased exponent range. Single precision on the System/370 has β = 16, p = 6. Hence the significand requires 24 bits. Since this must fit into 32 bits, this leaves 7 bits for the exponent and one for the sign bit. Thus the magnitude of representable numbers ranges from about 16^(−2⁶) to about 16^(2⁶) = 2^(2⁸). To get a similar exponent range when β = 2 would require 9 bits of exponent, leaving only 22 bits for the significand. However, it was just pointed out that when β = 16, the effective precision can be as low as 4p − 3 = 21 bits. Even worse, when β = 2 it is possible to gain an extra bit of precision (as explained later in this section), so the β = 2 machine has 23 bits of precision to compare with a range of 21 to 24 bits for the β = 16 machine.

Another possible explanation for choosing β = 16 has to do with shifting. When adding two floating-point numbers, if their exponents are different, one of the significands will have to be shifted to make the radix points line up, slowing down the operation. In the β = 16, p = 1 system, all the numbers between 1 and 15 have the same exponent, and so no shifting is required when adding any of


the (15 choose 2) = 105 possible pairs of distinct numbers from this set. However, in the β = 2, p = 4 system, these numbers have exponents ranging from 0 to 3, and shifting is required for 70 of the 105 pairs.

In most modern hardware, the performance gained by avoiding a shift for a subset of operands is negligible, and so the small wobble of β = 2 makes it the preferable base. Another advantage of using β = 2 is that there is a way to gain an extra bit of significance.¹ Since floating-point numbers are always normalized, the most significant bit of the significand is always 1, and there is no reason to waste a bit of storage representing it. Formats that use this trick are said to have a hidden bit. It was already pointed out in Floating-point Formats on page 174 that this requires a special convention for 0. The method given there was that an exponent of emin − 1 and a significand of all zeros represents not 1.0 × 2^(emin−1), but rather 0.

IEEE 754 single precision is encoded in 32 bits using 1 bit for the sign, 8 bits for the exponent, and 23 bits for the significand. However, it uses a hidden bit, so the significand is 24 bits (p = 24), even though it is encoded using only 23 bits.
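The layout is easy to inspect by reinterpreting a float's bytes as an integer; a Python sketch (the helper name is mine):

    import struct

    def decode_single(x):
        # Reinterpret the 32 bits of an IEEE 754 single: 1 sign bit,
        # 8 exponent bits (bias 127), 23 stored significand bits.
        (bits,) = struct.unpack(">I", struct.pack(">f", x))
        sign = bits >> 31
        exponent = ((bits >> 23) & 0xFF) - 127
        fraction = bits & 0x7FFFFF
        # The leading 1 is the hidden bit: stored nowhere, prepended here.
        return sign, exponent, "1." + format(fraction, "023b")

    print(decode_single(1.0))      # (0, 0, '1.00000000000000000000000')
    print(decode_single(0.15625))  # (0, -3, '1.01000000000000000000000')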

    Precision

The IEEE standard defines four different precisions: single, double, single-extended, and double-extended. In 754, single and double precision correspond roughly to what most floating-point hardware provides. Single precision occupies a single 32 bit word, double precision two consecutive 32 bit words. Extended precision is a format that offers at least a little extra precision and exponent range (Table D-1).

1. This appears to have first been published by Goldberg [1967], although Knuth ([1981], page 211) attributes this idea to Konrad Zuse.

Table D-1 IEEE 754 Format Parameters

Parameter                Single    Single-Extended    Double    Double-Extended
p                        24        ≥ 32               53        ≥ 64
emax                     +127      ≥ 1023             +1023     > 16383
emin                     −126      ≤ −1022            −1022     ≤ −16382
Exponent width in bits   8         ≥ 11               11        ≥ 15
Format width in bits     32        ≥ 43               64        ≥ 79


The IEEE standard only specifies a lower bound on how many extra bits extended precision provides. The minimum allowable double-extended format is sometimes referred to as 80-bit format, even though the table shows it using 79 bits. The reason is that hardware implementations of extended precision normally don't use a hidden bit, and so would use 80 rather than 79 bits.¹

The standard puts the most emphasis on extended precision, making no recommendation concerning double precision, but strongly recommending that "Implementations should support the extended format corresponding to the widest basic format supported, …"

One motivation for extended precision comes from calculators, which will often display 10 digits, but use 13 digits internally. By displaying only 10 of the 13 digits, the calculator appears to the user as a black box that computes exponentials, cosines, etc. to 10 digits of accuracy. For the calculator to compute functions like exp, log and cos to within 10 digits with reasonable efficiency, it needs a few extra digits to work with. It isn't hard to find a simple rational expression that approximates log with an error of 500 units in the last place. Thus computing with 13 digits gives an answer correct to 10 digits. By keeping these extra 3 digits hidden, the calculator presents a simple model to the operator.

1. According to Kahan, extended precision has 64 bits of significand because that was the widest precision across which carry propagation could be done on the Intel 8087 without increasing the cycle time [Kahan 1988].


Extended precision in the IEEE standard serves a similar function. It enables libraries to efficiently compute quantities to within about .5 ulp in single (or double) precision, giving the user of those libraries a simple model, namely that each primitive operation, be it a simple multiply or an invocation of log, returns a value accurate to within about .5 ulp. However, when using extended precision, it is important to make sure that its use is transparent to the user. For example, on a calculator, if the internal representation of a displayed value is not rounded to the same precision as the display, then the result of further operations will depend on the hidden digits and appear unpredictable to the user.

To illustrate extended precision further, consider the problem of converting between IEEE 754 single precision and decimal. Ideally, single precision numbers will be printed with enough digits so that when the decimal number is read back in, the single precision number can be recovered. It turns out that 9 decimal digits are enough to recover a single precision binary number (see Binary to Decimal Conversion on page 236). When converting a decimal number back to its unique binary representation, a rounding error as small as 1 ulp is fatal, because it will give the wrong answer. Here is a situation where extended precision is vital for an efficient algorithm. When single-extended is available, a very straightforward method exists for converting a decimal number to a single precision binary one. First read in the 9 decimal digits as an integer N, ignoring the decimal point. From Table D-1, p ≥ 32, and since 10⁹ < 2³² ≈ 4.3 × 10⁹, N can be represented exactly in single-extended. Next find the appropriate power 10^|P| necessary to scale N. This will be a combination of the exponent of the decimal number, together with the position of the (up until now) ignored decimal point. Compute 10^|P|. If |P| ≤ 13, then this is also represented exactly, because 10¹³ = 2¹³5¹³, and 5¹³ < 2³². Finally multiply (or divide if P < 0) N and 10^|P|. If this last operation is done exactly, then the closest binary number is recovered. Binary to Decimal Conversion on page 236 shows how to do the last multiply (or divide) exactly. Thus for |P| ≤ 13, the use of the single-extended format enables 9 digit decimal numbers to be converted to the closest binary number (i.e. exactly rounded). If |P| > 13, then single-extended is not enough for the above algorithm to always compute the exactly rounded binary equivalent, but Coonen [1984] shows that it is enough to guarantee that the conversion of binary to decimal and back will recover the original binary number.
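The "9 digits suffice" claim can be spot-checked with numpy's float32 (a sketch of mine; the scan range [10, 16) is chosen because there the binary spacing, 2⁻²⁰, is finer than an 8-significant-digit decimal grid):

    import numpy as np

    xs = np.float32(10.0) + np.arange(10_000, dtype=np.float32) * np.float32(2.0**-20)
    print(all(np.float32(f"{x:.9g}") == x for x in xs))   # True: 9 digits round-trip
    print(all(np.float32(f"{x:.8g}") == x for x in xs))   # False: 8 digits lose some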


If double precision is supported, then the algorithm above would be run in double precision rather than single-extended, but to convert double precision to a 17 digit decimal number and back would require the double-extended format.

    Exponent

Since the exponent can be positive or negative, some method must be chosen to represent its sign. Two common methods of representing signed numbers are sign/magnitude and two's complement. Sign/magnitude is the system used for the sign of the significand in the IEEE formats: one bit is used to hold the sign, the rest of the bits represent the magnitude of the number. The two's complement representation is often used in integer arithmetic. In this scheme, a number in the range [−2ᵖ⁻¹, 2ᵖ⁻¹ − 1] is represented by the smallest nonnegative number that is congruent to it modulo 2ᵖ.

The IEEE binary standard does not use either of these methods to represent the exponent, but instead uses a biased representation. In the case of single precision, where the exponent is stored in 8 bits, the bias is 127 (for double precision it is 1023). What this means is that if k̃ is the value of the exponent bits interpreted as an unsigned integer, then the exponent of the floating-point number is k̃ − 127. This is often called the unbiased exponent to distinguish from the biased exponent k̃.

Referring to Table D-1 on page 191, single precision has emax = 127 and emin = −126. The reason for having |emin| < emax is so that the reciprocal of the smallest number, 1/2^emin, will not overflow. Although it is true that the reciprocal of the largest number will underflow, underflow is usually less serious than overflow. Base on page 190 explained that emin − 1 is used for representing 0, and Special Quantities on page 197 will introduce a use for emax + 1. In IEEE single precision, this means that the unbiased exponents range between emin − 1 = −127 and emax + 1 = 128, whereas the biased exponents range between 0 and 255, which are exactly the nonnegative numbers that can be represented using 8 bits.


Operations

The IEEE standard requires that the result of addition, subtraction, multiplication and division be exactly rounded. That is, the result must be computed exactly and then rounded to the nearest floating-point number (using round to even). Guard Digits on page 178 pointed out that computing the exact difference or sum of two floating-point numbers can be very expensive when their exponents are substantially different. That section introduced guard digits, which provide a practical way of computing differences while guaranteeing that the relative error is small. However, computing with a single guard digit will not always give the same answer as computing the exact result and then rounding. By introducing a second guard digit and a third sticky bit, differences can be computed at only a little more cost than with a single guard digit, but the result is the same as if the difference were computed exactly and then rounded [Goldberg 1990]. Thus the standard can be implemented efficiently.

One reason for completely specifying the results of arithmetic operations is to improve the portability of software. When a program is moved between two machines and both support IEEE arithmetic, then if any intermediate result differs, it must be because of software bugs, not from differences in arithmetic. Another advantage of precise specification is that it makes it easier to reason about floating-point. Proofs about floating-point are hard enough, without having to deal with multiple cases arising from multiple kinds of arithmetic. Just as integer programs can be proven to be correct, so can floating-point programs, although what is proven in that case is that the rounding error of the result satisfies certain bounds. Theorem 4 is an example of such a proof. These proofs are made much easier when the operations being reasoned about are precisely specified. Once an algorithm is proven to be correct for IEEE arithmetic, it will work correctly on any machine supporting the IEEE standard.

Brown [1981] has proposed axioms for floating-point that include most of the existing floating-point hardware. However, proofs in this system cannot verify the algorithms of the sections Cancellation on page 179 and Exactly Rounded Operations on page 185, which require features not present on all hardware. Furthermore, Brown's axioms are more complex than simply defining operations to be performed exactly and then rounded. Thus proving theorems from Brown's axioms is usually more difficult than proving them assuming operations are exactly rounded.


There is not complete agreement on what operations a floating-point standard should cover. In addition to the basic operations +, −, × and /, the IEEE standard also specifies that square root, remainder, and conversion between integer and floating-point be correctly rounded. It also requires that conversion between internal formats and decimal be correctly rounded (except for very large numbers). Kulisch and Miranker [1986] have proposed adding inner product to the list of operations that are precisely specified. They note that when inner products are computed in IEEE arithmetic, the final answer can be quite wrong. For example sums are a special case of inner products, and the sum ((2 × 10⁻³⁰ + 10³⁰) − 10³⁰) − 10⁻³⁰ is exactly equal to 10⁻³⁰, but on a machine with IEEE arithmetic the computed result will be −10⁻³⁰. It is possible to compute inner products to within 1 ulp with less hardware than it takes to implement a fast multiplier [Kirchner and Kulisch 1987].¹ ²
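The sum example runs unchanged in IEEE double arithmetic (a one-line Python check of mine):

    # Left-to-right evaluation absorbs 2e-30 into 1e30, leaving only -1e-30:
    print(((2e-30 + 1e30) - 1e30) - 1e-30)   # -1e-30, though the exact sum is +1e-30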

All the operations mentioned in the standard are required to be exactly rounded except conversion between decimal and binary. The reason is that efficient algorithms for exactly rounding all the operations are known, except conversion. For conversion, the best known efficient algorithms produce results that are slightly worse than exactly rounded ones [Coonen 1984].

The IEEE standard does not require transcendental functions to be exactly rounded because of the table maker's dilemma. To illustrate, suppose you are making a table of the exponential function to 4 places. Then exp(1.626) = 5.0835. Should this be rounded to 5.083 or 5.084? If exp(1.626) is computed more carefully, it becomes 5.08350. And then 5.083500. And then 5.0835000. Since exp is transcendental, this could go on arbitrarily long before distinguishing whether exp(1.626) is 5.0835000…0ddd or 5.0834999…9ddd. Thus it is not practical to specify that the precision of transcendental functions be the same as if they were computed to infinite precision and then rounded. Another approach would be to specify transcendental functions algorithmically. But there does not appear to be a single algorithm that works well across all hardware architectures. Rational approximation, CORDIC,³ and large tables

1. Some arguments against including inner product as one of the basic operations are presented by Kahan and LeBlanc [1985].

2. Kirchner writes: It is possible to compute inner products to within 1 ulp in hardware in one partial product per clock cycle. The additionally needed hardware compares to the multiplier array needed anyway for that speed.

3. CORDIC is an acronym for Coordinate Rotation Digital Computer and is a method of computing transcendental functions that uses mostly shifts and adds (i.e., very few multiplications and divisions) [Walther 1971]. It is the method used on both the Intel 8087 and the Motorola 68881.


are three different techniques that are used for computing transcendentals on contemporary machines. Each is appropriate for a different class of hardware, and at present no single algorithm works acceptably over the wide range of current hardware.

Special Quantities

On some floating-point hardware every bit pattern represents a valid floating-point number. The IBM System/370 is an example of this. On the other hand, the VAX reserves some bit patterns to represent special numbers called reserved operands. This idea goes back to the CDC 6600, which had bit patterns for the special quantities INDEFINITE and INFINITY.

The IEEE standard continues in this tradition and has NaNs (Not a Number) and infinities. Without any special quantities, there is no good way to handle exceptional situations like taking the square root of a negative number, other than aborting computation. Under IBM System/370 FORTRAN, the default action in response to computing the square root of a negative number like -4 results in the printing of an error message. Since every bit pattern represents a valid number, the return value of square root must be some floating-point number. In the case of System/370 FORTRAN, √|−4| = 2 is returned. In IEEE arithmetic, a NaN is returned in this situation.

The IEEE standard specifies the following special values (see Table D-2): ±0, denormalized numbers, ±∞ and NaNs (there is more than one NaN, as explained in the next section). These special values are all encoded with exponents of either emax + 1 or emin − 1 (it was already pointed out that 0 has an exponent of emin − 1).

Table D-2   IEEE 754 Special Values

    Exponent           Fraction   Represents
    e = emin − 1       f = 0      ±0
    e = emin − 1       f ≠ 0      0.f × 2^(emin)
    emin ≤ e ≤ emax    —          1.f × 2^e
    e = emax + 1       f = 0      ±∞
    e = emax + 1       f ≠ 0      NaN


NaNs

Traditionally, the computation of 0/0 or √−1 has been treated as an unrecoverable error which causes a computation to halt. However, there are examples where it makes sense for a computation to continue in such a situation. Consider a subroutine that finds the zeros of a function f, say zero(f). Traditionally, zero finders require the user to input an interval [a, b] on which the function is defined and over which the zero finder will search. That is, the subroutine is called as zero(f, a, b). A more useful zero finder would not require the user to input this extra information. This more general zero finder is especially appropriate for calculators, where it is natural to simply key in a function, and awkward to then have to specify the domain. However, it is easy to see why most zero finders require a domain. The zero finder does its work by probing the function f at various values. If it probed for a value outside the domain of f, the code for f might well compute 0/0 or √−1, and the computation would halt, unnecessarily aborting the zero finding process.

This problem can be avoided by introducing a special value called NaN, and specifying that the computation of expressions like 0/0 and √−1 produce NaN, rather than halting. A list of some of the situations that can cause a NaN are given in Table D-3. Then when zero(f) probes outside the domain of f, the code for f will return NaN, and the zero finder can continue. That is, zero(f) is not punished for making an incorrect guess. With this example in mind, it is easy to see what the result of combining a NaN with an ordinary floating-point number should be. Suppose that the final statement of f is return(-b + sqrt(d))/(2*a). If d < 0, then f should return a NaN. Since d < 0, sqrt(d) is a NaN, and -b + sqrt(d) will be a NaN, if the sum of a NaN



and any other number is a NaN. Similarly, if one operand of a division operation is a NaN, the quotient should be a NaN. In general, whenever a NaN participates in a floating-point operation, the result is another NaN.
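These rules are easy to observe from C (a minimal sketch, assuming IEEE arithmetic; note that sqrt(-4.0) is exactly the invalid operation discussed above):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double nan1 = sqrt(-4.0);      /* invalid operation: result is NaN */
        double nan2 = -2.0 + nan1;     /* the NaN propagates through arithmetic */
        printf("%f %f\n", nan1, nan2); /* both print nan */
        printf("%d\n", nan1 == nan1);  /* 0: a NaN compares unequal to everything */
        return 0;
    }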

Another approach to writing a zero solver that doesn't require the user to input a domain is to use signals. The zero-finder could install a signal handler for floating-point exceptions. Then if f was evaluated outside its domain and raised an exception, control would be returned to the zero solver. The problem with this approach is that every language has a different method of handling signals (if it has a method at all), and so it has no hope of portability.

In IEEE 754, NaNs are often represented as floating-point numbers with the exponent emax + 1 and nonzero significands. Implementations are free to put system-dependent information into the significand. Thus there is not a unique NaN, but rather a whole family of NaNs. When a NaN and an ordinary floating-point number are combined, the result should be the same as the NaN operand. Thus if the result of a long computation is a NaN, the system-dependent information in the significand will be the information that was generated when the first NaN in the computation was generated. Actually, there is a caveat to the last statement. If both operands are NaNs, then the result will be one of those NaNs, but it might not be the NaN that was generated first.

Infinity

Just as NaNs provide a way to continue a computation when expressions like 0/0 or √−1 are encountered, infinities provide a way to continue when an overflow occurs. This is much safer than simply returning the largest representable number.

Table D-3   Operations That Produce a NaN

    Operation   NaN Produced By
    +           ∞ + (−∞)
    ×           0 × ∞
    /           0/0, ∞/∞
    REM         x REM 0, ∞ REM y
    √           √x (when x < 0)


As an example, consider computing √(x² + y²) when β = 10, p = 3, and emax = 98. If x = 3 × 10⁷⁰ and y = 4 × 10⁷⁰, then x² will overflow, and be replaced by 9.99 × 10⁹⁸. Similarly y², and x² + y² will each overflow in turn, and be replaced by 9.99 × 10⁹⁸. So the final result will be √(9.99 × 10⁹⁸) = 3.16 × 10⁴⁹, which is drastically wrong: the correct answer is 5 × 10⁷⁰. In IEEE arithmetic, the result of x² is ∞, as is y², x² + y² and √(x² + y²). So the final result is ∞, which is safer than returning an ordinary floating-point number that is nowhere near the correct answer.

The division of 0 by 0 results in a NaN. A nonzero number divided by 0, however, returns infinity: 1/0 = ∞, −1/0 = −∞. The reason for the distinction is this: if f(x) → 0 and g(x) → 0 as x approaches some limit, then f(x)/g(x) could have any value. For example, when f(x) = sin x and g(x) = x, then f(x)/g(x) → 1 as x → 0. But when f(x) = 1 − cos x, f(x)/g(x) → 0. When thinking of 0/0 as the limiting situation of a quotient of two very small numbers, 0/0 could represent anything. Thus in the IEEE standard, 0/0 results in a NaN. But when c > 0, f(x) → c, and g(x) → 0, then f(x)/g(x) → ±∞, for any analytic functions f and g: the quotient tends to −∞ if g(x) approaches 0 through negative values, and to +∞ if through positive values. This is why the IEEE standard defines c/0 = ±∞ for c ≠ 0, with the sign of the infinity determined by the usual sign rules.


Here is a practical example that makes use of the rules for infinity arithmetic. Consider computing the function x/(x² + 1). This is a bad formula, because not only will it overflow when x is larger than √β × β^(emax/2), but infinity arithmetic will give the wrong answer because it will yield 0, rather than a number near 1/x. However, x/(x² + 1) can be rewritten as 1/(x + x⁻¹). This improved expression will not overflow prematurely and because of infinity arithmetic will have the correct value when x = 0: 1/(0 + 0⁻¹) = 1/(0 + ∞) = 1/∞ = 0. Without infinity arithmetic, the expression 1/(x + x⁻¹) requires a test for x = 0, which not only adds extra instructions, but may also disrupt a pipeline. This example illustrates a general fact, namely that infinity arithmetic often avoids the need for special case checking; however, formulas need to be carefully inspected to make sure they do not have spurious behavior at infinity (as x/(x² + 1) did).
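Both behaviors can be checked directly (a minimal C sketch, assuming IEEE binary64; the function names are illustrative):

    #include <stdio.h>

    /* The naive formula overflows for large x; the rewritten one relies
       on infinity arithmetic to be correct even at x = 0. */
    static double naive(double x)     { return x / (x * x + 1); }
    static double rewritten(double x) { return 1 / (x + 1 / x); }

    int main(void) {
        printf("%g %g\n", naive(1e200), rewritten(1e200)); /* 0 vs 1e-200 */
        printf("%g %g\n", naive(0.0),   rewritten(0.0));   /* 0 vs 0, via 1/(0 + inf) */
        return 0;
    }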

Signed Zero

Zero is represented by the exponent emin − 1 and a zero significand. Since the sign bit can take on two different values, there are two zeros, +0 and −0. If a distinction were made when comparing +0 and −0, simple tests like if (x = 0) would have very unpredictable behavior, depending on the sign of x. Thus the IEEE standard defines comparison so that +0 = −0, rather than −0 < +0.

Although it would be possible always to ignore the sign of zero, the IEEE standard does not do so. When a multiplication or division involves a signed zero, the usual sign rules apply in computing the sign of the answer. Thus 3 × (+0) = +0, and +0/−3 = −0. If zero did not have a sign, then the relation 1/(1/x) = x would fail to hold when x = ±∞. The reason is that 1/−∞ and 1/+∞ both result in 0, and 1/0 results in +∞, the sign information having been lost. One way to restore the identity 1/(1/x) = x is to only have one kind of infinity, however that would result in the disastrous consequence of losing the sign of an overflowed quantity.
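In C these rules look as follows (a minimal sketch, assuming IEEE arithmetic; signbit is the C99 classification macro from math.h):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double pz = +0.0, nz = -0.0;
        printf("%d\n", pz == nz);           /* 1: the standard defines +0 = -0 */
        printf("%g %g\n", 1 / pz, 1 / nz);  /* inf -inf: the sign survives division */
        printf("%d %d\n", signbit(pz), signbit(nz)); /* 0 1: how to tell them apart */
        return 0;
    }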

Another example of the use of signed zero concerns underflow and functions that have a discontinuity at 0, such as log. In IEEE arithmetic, it is natural to define log 0 = −∞ and log x to be a NaN when x < 0. Suppose that x represents a small negative number that has underflowed to zero. Thanks to signed zero, x will be negative, so log can return a NaN. However, if there were no signed zero, the log function could not distinguish an underflowed negative number from 0, and would therefore have to return −∞. Another example of a function with a discontinuity at zero is the signum function, which returns the sign of a number.


Probably the most interesting use of signed zero occurs in complex arithmetic. To take a simple example, consider the equation √(1/z) = 1/√z. This is certainly true when z ≥ 0. If z = −1, the obvious computation gives √(1/−1) = √−1 = i, and 1/√−1 = 1/i = −i. Thus, √(1/z) ≠ 1/√z! The problem can be traced to the fact that square root is multi-valued, and there is no way to select the values so that it is continuous in the entire complex plane. However, square root is continuous if a branch cut consisting of all negative real numbers is excluded from consideration. This leaves the problem of what to do for the negative real numbers, which are of the form −x + i0, where x > 0. Signed zero provides a perfect way to resolve this problem. Numbers of the form −x + i(+0) have square root i√x, and numbers of the form −x + i(−0) on the other side of the branch cut have the square root of the other sign, −i√x. In fact, the natural formulas for computing √ will give these results.

Back to √(1/z) = 1/√z. If z = −1 = −1 + i0, then 1/z = 1/(−1 + i0) = [(−1 − i0)]/[(−1 + i0)(−1 − i0)] = (−1 − i0)/((−1)² − 0²) = −1 + i(−0), and so √(1/z) = √(−1 + i(−0)) = −i, while 1/√z = 1/i = −i. Thus IEEE arithmetic preserves this identity for all z. Some more sophisticated examples are given by Kahan [1987]. Although distinguishing between +0 and −0 has advantages, it can occasionally be confusing. For example, signed zero destroys the relation x = y ⇔ 1/x = 1/y, which is false when x = +0 and y = −0. However, the IEEE committee decided that the advantages of utilizing the sign of zero outweighed the disadvantages.
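The branch-cut behavior is visible through C99 complex arithmetic, where csqrt honors the sign of the zero imaginary part (a minimal sketch; the CMPLX macro is C11, and on older compilers the signed zero would have to be built some other way):

    #include <stdio.h>
    #include <complex.h>

    int main(void) {
        double complex above = CMPLX(-1.0, +0.0);  /* -1 approached from above the cut */
        double complex below = CMPLX(-1.0, -0.0);  /* -1 approached from below the cut */
        printf("%g%+gi\n", creal(csqrt(above)), cimag(csqrt(above))); /* 0+1i */
        printf("%g%+gi\n", creal(csqrt(below)), cimag(csqrt(below))); /* 0-1i */
        return 0;
    }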

    Denormalized Numbers

Consider normalized floating-point numbers with β = 10, p = 3, and emin = −98. The numbers x = 6.87 × 10⁻⁹⁷ and y = 6.81 × 10⁻⁹⁷ appear to be perfectly ordinary floating-point numbers, which are more than a factor of 10 larger than the smallest floating-point number 1.00 × 10⁻⁹⁸. They have a strange property, however: x ⊖ y = 0 even though x ≠ y!



The reason is that x − y = .06 × 10⁻⁹⁷ = 6.0 × 10⁻⁹⁹ is too small to be represented as a normalized number, and so must be flushed to zero. How important is it to preserve the property

x = y ⇔ x − y = 0 ?   (10)

It's very easy to imagine writing the code fragment if (x ≠ y) then z = 1/(x-y), and much later having a program fail due to a spurious division by zero. Tracking down bugs like this is frustrating and time consuming. On a more philosophical level, computer science textbooks often point out that even though it is currently impractical to prove large programs correct, designing programs with the idea of proving them often results in better code. For example, introducing invariants is quite useful, even if they aren't going to be used as part of a proof. Floating-point code is just like any other code: it helps to have provable facts on which to depend. For example, when analyzing formula (6), it was very helpful to know that x/2 < y < 2x ⇒ x ⊖ y = x − y. Similarly, knowing that (10) is true makes writing reliable floating-point code easier. If it is only true for most numbers, it cannot be used to prove anything.

The IEEE standard uses denormalized¹ numbers, which guarantee (10), as well as other useful relations. They are the most controversial part of the standard and probably accounted for the long delay in getting 754 approved. Most high performance hardware that claims to be IEEE compatible does not support denormalized numbers directly, but rather traps when consuming or producing denormals, and leaves it to software to simulate the IEEE standard.²

The idea behind denormalized numbers goes back to Goldberg [1967] and is very simple. When the exponent is emin, the significand does not have to be normalized, so that when β = 10, p = 3 and emin = −98, 1.00 × 10⁻⁹⁸ is no longer the smallest floating-point number, because 0.98 × 10⁻⁹⁸ is also a floating-point number.

    1. They are called subnormal in 854, denormal in 754.

2. This is the cause of one of the most troublesome aspects of the standard. Programs that frequently underflow often run noticeably slower on hardware that uses software traps.


There is a small snag when β = 2 and a hidden bit is being used, since a number with an exponent of emin will always have a significand greater than or equal to 1.0 because of the implicit leading bit. The solution is similar to that used to represent 0, and is summarized in Table D-2 on page 197. The exponent emin is used to represent denormals. More formally, if the bits in the significand field are b1, b2, …, bp−1, and the value of the exponent is e, then when e > emin − 1, the number being represented is 1.b1b2…bp−1 × 2^e, whereas when e = emin − 1, the number being represented is 0.b1b2…bp−1 × 2^(e+1). The +1 in the exponent is needed because denormals have an exponent of emin, not emin − 1.
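These two formulas can be transcribed directly into a decoder for IEEE single precision (a C sketch, assuming float is IEEE binary32 and shares endianness with uint32_t; the function name decode_float is illustrative). For single precision, emin = −126 and the bias is 127:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <math.h>

    double decode_float(float f) {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);
        int sign = bits >> 31;
        int e = (bits >> 23) & 0xff;      /* biased exponent field */
        uint32_t frac = bits & 0x7fffff;  /* 23 fraction bits */
        double significand = (e == 0)
            ? frac / 8388608.0            /* e = emin - 1: 0.f, no hidden bit */
            : 1.0 + frac / 8388608.0;     /* otherwise 1.f with the hidden bit */
        /* Denormals use emin itself: this is the "+1 in the exponent". */
        int unbiased = (e == 0 ? 1 : e) - 127;
        double value = ldexp(significand, unbiased);
        return sign ? -value : value;
    }

    int main(void) {
        printf("%g\n", decode_float(1.5f));    /* 1.5 */
        printf("%g\n", decode_float(1e-45f));  /* smallest denormal, about 1.4e-45 */
        return 0;
    }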

Recall the example of β = 10, p = 3, emin = −98, x = 6.87 × 10⁻⁹⁷ and y = 6.81 × 10⁻⁹⁷ presented at the beginning of this section. With denormals, x − y does not flush to zero but is instead represented by the denormalized number .6 × 10⁻⁹⁸. This behavior is called gradual underflow. It is easy to verify that (10) always holds when using gradual underflow.
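The same digits, moved down to double's range (where the smallest normalized number is about 2.2 × 10⁻³⁰⁸), show gradual underflow preserving (10) on ordinary hardware (a minimal C sketch):

    #include <stdio.h>

    int main(void) {
        /* Two ordinary normalized doubles just above the underflow threshold. */
        double x = 6.87e-308, y = 6.81e-308;
        printf("%d\n", x == y);        /* 0: they differ */
        printf("%g\n", x - y);         /* about 6e-310, a denormalized number */
        printf("%d\n", (x - y) == 0);  /* 0: property (10) holds */
        return 0;
    }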

Figure D-2   Flush To Zero Compared With Gradual Underflow

Figure D-2 illustrates denormalized numbers. The top number line in the figure shows normalized floating-point numbers. Notice the gap between 0 and the smallest normalized number 1.0 × β^(emin). If the result of a floating-point calculation falls into this gulf, it is flushed to zero. The bottom number line shows what happens when denormals are added to the set of floating-point numbers. The gulf is filled in, and when the result of a calculation is less than 1.0 × β^(emin), it is represented by the nearest denormal. When denormalized numbers are added to the number line, the spacing between adjacent floating-point numbers varies in a regular way: adjacent spacings are either the same length or differ by a factor of β. Without denormals, the spacing abruptly changes from β^(−p+1) × β^(emin) to β^(emin), which is a factor of β^(p−1), rather than the orderly change by a factor of β.


Because of this, many algorithms that can have large relative error for normalized numbers close to the underflow threshold are well-behaved in this range when gradual underflow is used.

Without gradual underflow, the simple expression x − y can have a very large relative error for normalized inputs, as was seen above for x = 6.87 × 10⁻⁹⁷ and y = 6.81 × 10⁻⁹⁷. Large relative errors can happen even without cancellation, as the following example shows [Demmel 1984]. Consider dividing two complex numbers, a + ib and c + id. The obvious formula

    (a + ib)/(c + id) = (ac + bd)/(c² + d²) + i·(bc − ad)/(c² + d²)

suffers from the problem that if either component of the denominator c + id is larger than √β × β^(emax/2), the formula will overflow, even though the final result may be well within range. A better method of computing the quotients is to use Smith's formula:

    (a + ib)/(c + id) = (a + b(d/c))/(c + d(d/c)) + i·(b − a(d/c))/(c + d(d/c))    if |d| ≤ |c|
    (a + ib)/(c + id) = (b + a(c/d))/(d + c(c/d)) + i·(−a + b(c/d))/(d + c(c/d))   if |d| > |c|    (11)

Applying Smith's formula to (2 × 10⁻⁹⁸ + i·10⁻⁹⁸)/(4 × 10⁻⁹⁸ + i(2 × 10⁻⁹⁸)) gives the correct answer of 0.5 with gradual underflow. It yields 0.4 with flush to zero, an error of 100 ulps. It is typical for denormalized numbers to guarantee error bounds for arguments all the way down to 1.0 × β^(emin).
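A direct C transcription of (11) might look as follows (a sketch assuming IEEE binary64 doubles; the struct and function names are illustrative). The test in main uses inputs near double's underflow threshold, where the obvious formula's c² + d² would underflow to 0:

    #include <stdio.h>
    #include <math.h>

    typedef struct { double re, im; } cdouble;

    /* Smith's formula (11) for (a+ib)/(c+id): scale by whichever of
       d/c or c/d has magnitude at most 1, avoiding premature overflow. */
    static cdouble smith_div(double a, double b, double c, double d) {
        cdouble q;
        if (fabs(d) <= fabs(c)) {
            double r = d / c, den = c + d * r;
            q.re = (a + b * r) / den;
            q.im = (b - a * r) / den;
        } else {
            double r = c / d, den = d + c * r;
            q.re = (b + a * r) / den;
            q.im = (-a + b * r) / den;
        }
        return q;
    }

    int main(void) {
        /* The textbook formula would compute c*c + d*d = 0 here. */
        cdouble q = smith_div(2e-308, 1e-308, 4e-308, 2e-308);
        printf("%g %g\n", q.re, q.im);  /* 0.5 0 */
        return 0;
    }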

    Exceptions, Flags and Trap Handlers

When an exceptional condition like division by zero or overflow occurs in IEEE arithmetic, the default is to deliver a result and continue. Typical of the default results are NaN for 0/0 and √−1, and ∞ for 1/0 and overflow.


The preceding sections gave examples where proceeding from an exception with these default values was the reasonable thing to do. When any exception occurs, a status flag is also set. Implementations of the IEEE standard are required to provide users with a way to read and write the status flags. The flags are sticky in that once set, they remain set until explicitly cleared. Testing the flags is the only way to distinguish 1/0, which is a genuine infinity, from an overflow.
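In C99 the status flags are exposed through <fenv.h> (a minimal sketch; the FENV_ACCESS pragma is part of C99 but is unsupported, and harmlessly ignored, by some compilers, and the volatile qualifiers keep the compiler from folding the operations away):

    #include <stdio.h>
    #include <fenv.h>

    #pragma STDC FENV_ACCESS ON

    int main(void) {
        volatile double zero = 0.0, big = 1e308;
        feclearexcept(FE_ALL_EXCEPT);
        volatile double a = 1.0 / zero;               /* genuine infinity */
        printf("%d\n", !!fetestexcept(FE_DIVBYZERO)); /* 1 */
        feclearexcept(FE_ALL_EXCEPT);
        volatile double b = big * big;                /* infinity from overflow */
        printf("%d\n", !!fetestexcept(FE_OVERFLOW));  /* 1 */
        (void)a; (void)b;
        return 0;
    }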

Sometimes continuing execution in the face of exception conditions is not appropriate. Infinity on page 199 gave the example of x/(x² + 1). When x > √β × β^(emax/2), the denominator is infinite, resulting in a final answer of 0, which is totally wrong. Although for this formula the problem can be solved by rewriting it as 1/(x + x⁻¹), rewriting may not always solve the problem. The IEEE standard strongly recommends that implementations allow trap handlers to be installed. Then when an exception occurs, the trap handler is called instead of setting the flag. The value returned by the trap handler will be used as the result of the operation. It is the responsibility of the trap handler to either clear or set the status flag; otherwise, the value of the flag is allowed to be undefined.

The IEEE standard divides exceptions into 5 classes: overflow, underflow, division by zero, invalid operation and inexact. There is a separate status flag for each class of exception. The meaning of the first three exceptions is self-evident. Invalid operation covers the situations listed in Table D-3 on page 199, and any comparison that involves a NaN. The default result of an operation that causes an invalid exception is to return a NaN, but the converse is not true. When one of the operands to an operation is a NaN, the result is a NaN but no invalid exception is raised unless the operation also satisfies one of the conditions in Table D-3 on page 199.¹

Table D-4   Exceptions in IEEE 754*

    Exception        Result when traps disabled    Argument to trap handler
    overflow         ±∞ or ±xmax                   round(x·2^−α)
    underflow        0, ±2^(emin) or denormal      round(x·2^α)
    divide by zero   ±∞                            operands
    invalid          NaN                           operands
    inexact          round(x)                      round(x)


* x is the exact result of the operation, α = 192 for single precision, 1536 for double, and xmax = 1.11…11 × 2^(emax).

The inexact exception is raised when the result of a floating-point operation is not exact. In the β = 10, p = 3 system, 3.5 ⊗ 4.2 = 14.7 is exact, but 3.5 ⊗ 4.3 = 15.0 is not exact (since 3.5 × 4.3 = 15.05), and raises an inexact exception. Binary to Decimal Conversion on page 236 discusses an algorithm that uses the inexact exception. A summary of the behavior of all five exceptions is given in Table D-4.

There is an implementation issue connected with the fact that the inexact exception is raised so often. If floating-point hardware does not have flags of its own, but instead interrupts the operating system to signal a floating-point exception, the cost of inexact exceptions could be prohibitive. This cost can be avoided by having the status flags maintained by software. The first time an exception is raised, set the software flag for the appropriate class, and tell the floating-point hardware to mask off that class of exceptions. Then all further exceptions will run without interrupting the operating system. When a user resets that status flag, the hardware mask is re-enabled.

    Trap Handlers

One obvious use for trap handlers is for backward compatibility. Old codes that expect to be aborted when exceptions occur can install a trap handler that aborts the process. This is especially useful for codes with a loop like do S until (x >= 100). Since comparing a NaN to a number with <, ≤, >, ≥, or = (but not ≠) always returns false, this code will go into an infinite loop if x ever becomes a NaN.

1. No invalid exception is raised unless a trapping NaN is involved in the operation. See section 6.2 of IEEE Std 754-1985. -- Ed.




There is a more interesting use for trap handlers that comes up when computing products such as ∏_{i=1}^{n} x_i that could potentially overflow. One solution is to use logarithms, and compute exp(∑ log x_i) instead. The problem with this approach is that it is less accurate, and that it costs more than the simple expression ∏ x_i, even if there is no overflow. There is another solution using trap handlers called over/underflow counting that avoids both of these problems [Sterbenz 1974].

    The idea is as follows. There is a global counter initialized to zero. When everthe partial p rodu ct overow s for some k , the trap hand lerincrements the counter by one and returns the overowed qu antity with theexponent w rapp ed a roun d. In IEEE 754 single precision, ema x = 127, so if

    pk = 1.45 2130, it will overow and cause the trap handler to be called, whichwill wrap the exponent back into range, changing pk to 1.45 2-62 (see below).Similarly, if p k und erows, the counter would be d ecremented, and negativeexponent w ould get wrap ped around into a positive one. When all themultiplications are done, if the counter is zero then the nal product is p n . If the counter is positive, the product overowed, if the counter is negative, itund erowed. If none of the partial products are out of range, the trap h andleris never called an d the comp utation incur s no extra cost. Even if there areover/ un derow s, the calculation is more accura te than if it had beencomputed with logarithms, because each pk was compu ted from pk - 1 using afull precision multiply. Barnett [1987] discusses a formula where the fullaccuracy of over/ und erow counting turned up an error in earlier tables of that formula.

IEEE 754 specifies that when an overflow or underflow trap handler is called, it is passed the wrapped-around result as an argument. The definition of wrapped-around for overflow is that the result is computed as if to infinite precision, then divided by 2^α, and then rounded to the relevant precision. For underflow, the result is multiplied by 2^α. The exponent α is 192 for single precision and 1536 for double precision. This is why 1.45 × 2¹³⁰ was transformed into 1.45 × 2⁻⁶² in the example above.
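ISO C provides no portable way to install IEEE trap handlers, but the counting idea can be emulated by wrapping the exponent by hand after every multiply (a sketch; frexp splits off the running product's exponent exactly, so no partial product can overflow or underflow as long as each x_i is itself in range, and the names are illustrative):

    #include <stdio.h>
    #include <math.h>

    /* Returns p in [0.5, 1) with *exponent set so the true product
       of the n inputs is p * 2^(*exponent). */
    static double counted_product(const double *x, int n, int *exponent) {
        double p = 1.0;
        int count = 0;
        for (int i = 0; i < n; i++) {
            int e;
            p = frexp(p * x[i], &e);  /* one full-precision multiply per step */
            count += e;               /* the wrapped-off exponent accumulates here */
        }
        *exponent = count;
        return p;
    }

    int main(void) {
        double x[] = { 1e200, 1e200, 1e-300 };  /* the naive product overflows midway */
        int e;
        double p = counted_product(x, 3, &e);
        printf("%g * 2^%d\n", p, e);            /* represents 1e100 */
        return 0;
    }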



Rounding Modes

In the IEEE standard, rounding occurs whenever an operation has a result that is not exact, since (with the exception of binary decimal conversion) each operation is computed exactly and then rounded. By default, rounding means round toward nearest. The standard requires that three other rounding modes be provided, namely round toward 0, round toward +∞, and round toward −∞. When used with the convert to integer operation, round toward −∞ causes the convert to become the floor function, while round toward +∞ is ceiling. The rounding mode affects overflow, because when round toward 0 or round toward −∞ is in effect, an overflow of positive magnitude causes the default result to be the largest representable number, not +∞. Similarly, overflows of negative magnitude will produce the largest negative number when round toward +∞ or round toward 0 is in effect.

One application of rounding modes occurs in interval arithmetic (another is mentioned in Binary to Decimal Conversion on page 236). When using interval arithmetic, the sum of two numbers x and y is an interval [z̲, z̄], where z̲ is x ⊕ y rounded toward −∞, and z̄ is x ⊕ y rounded toward +∞. The exact result of the addition is contained within the interval [z̲, z̄]. Without rounding modes, interval arithmetic is usually implemented by computing z̲ = (x ⊕ y)(1 − ε) and z̄ = (x ⊕ y)(1 + ε), where ε is machine epsilon.¹ This results in overestimates for the size of the intervals. Since the result of an operation in interval arithmetic is an interval, in general the input to an operation will also be an interval. If two intervals [x̲, x̄] and [y̲, ȳ] are added, the result is [z̲, z̄], where z̲ is x̲ ⊕ y̲ with the rounding mode set to round toward −∞, and z̄ is x̄ ⊕ ȳ with the rounding mode set to round toward +∞.

When a floating-point calculation is performed using interval arithmetic, the final answer is an interval that contains the exact result of the calculation. This is not very helpful if the interval turns out to be large (as it often does), since the correct answer could be anywhere in that interval. Interval arithmetic makes more sense when used in conjunction with a multiple precision floating-point package. The calculation is first performed with some precision p. If interval arithmetic suggests that the final answer may be inaccurate, the computation is redone with higher and higher precisions until the final interval is a reasonable size.
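With C99's <fenv.h>, directed rounding gives the rigorous endpoints directly (a minimal sketch; FE_DOWNWARD and FE_UPWARD are optional in C and may be absent on some targets, the FENV_ACCESS pragma may be ignored by some compilers, and the volatile copies discourage the compiler from folding the sums at compile time):

    #include <stdio.h>
    #include <fenv.h>

    #pragma STDC FENV_ACCESS ON

    /* Computes a rigorous interval [*lo, *hi] containing the exact x + y. */
    static void interval_add(double x, double y, double *lo, double *hi) {
        volatile double vx = x, vy = y;
        const int old = fegetround();
        fesetround(FE_DOWNWARD);
        *lo = vx + vy;              /* the sum rounded toward -infinity */
        fesetround(FE_UPWARD);
        *hi = vx + vy;              /* the same sum rounded toward +infinity */
        fesetround(old);
    }

    int main(void) {
        double lo, hi;
        interval_add(1.0, 1e-20, &lo, &hi);  /* 1 + 1e-20 is not representable */
        printf("%.17g %.17g\n", lo, hi);     /* endpoints straddle the exact sum */
        return 0;
    }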

1. z̲ may be greater than z̄ if both x and y are negative. -- Ed.



Flags

The IEEE standard has a number of flags and modes. As discussed above, there is one status flag for each of the five exceptions: underflow, overflow, division by zero, invalid operation and inexact. There are four rounding modes: round toward nearest, round toward +∞, round toward 0, and round toward −∞. It