UCSD Mathematics
TRANSCRIPT
Floating-Point Arithmetic on Computers
• Scientific notation: 5.1671 × 10⁻³
• Floating-point notation: 5.1671e-3
• Advantage: easy to tell the magnitude of the quantity and fix the precision of approximations:
777777 ≈ 7.77777 × 10⁵, 881.916 ≈ 8.81916 × 10²
• Keywords: mantissa, exponent, base, precision
• Applications in physics and engineering require huge amounts of numerical computations
• Need to fix a particular format to write down those huge numbers of numbers
• Solution: floating-point format with fixed precision
• Trade-off: effort for storing and calculation vs. accuracy of computations
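In code, e-notation is the standard way to write such literals; a small Python sketch (Python's built-in `float` is a floating-point type):

```python
# A float literal in e-notation: mantissa, 'e', decimal exponent.
x = 5.1671e-3
# Formatting with ".4e" prints a one-digit integer part and four
# fractional mantissa digits, making magnitude and precision explicit.
print(format(x, ".4e"))  # 5.1671e-03
```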
Floating-Point Arithmetic on Computers
• Floating-point format: scientific notation, but we restrict the precision in all computations
• Mainboards and graphics cards have dedicated floating-point units only for such computations
• Phones, tablets, laptops, supercomputers
• Many programming languages (C, C++, Java, …) have datatypes only for specific floating-point formats
• Most popular floating-point formats:
  • Single precision (4 bytes, 23-bit mantissa, 8-bit exponent)
  • Double precision (8 bytes, 52-bit mantissa, 11-bit exponent)
• Rounding errors are typically negligible, but accumulate after many calculations
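The precision gap between the two formats is easy to see from Python (a sketch: Python's `float` is double precision, and the `struct` module can round-trip a value through the 4-byte single-precision format):

```python
import struct

x = 0.1
# Store x in 4-byte single precision and read it back as a double.
single = struct.unpack("<f", struct.pack("<f", x))[0]
# Printing 17 significant digits exposes the actually stored values.
print(format(x, ".17g"))       # 0.10000000000000001 (double)
print(format(single, ".17g"))  # 0.10000000149011612 (single)
```

Roughly 15–16 significant decimal digits survive the double format, but only 7–8 survive single precision.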
Floating-Point Arithmetic on Computers
• Floating-point units implement computation with certain floating-point formats
• Single and double precision format, and possibly other formats such as quadruple precision
• Built-in capacity for addition, subtraction, multiplication, division, square root, and possibly trigonometric functions
Approximating numbers
Suppose that x is some number and x̂ is an approximation to that number. The absolute error is
|x̂ − x|
The relative error is
|x̂ − x| / |x|
In practice, only the relative error is a good measure of approximation quality. An absolute error of 10² is unimportant if x = 10¹², but it is quite notable if x = 10⁴.
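A minimal Python sketch of the two error measures (the helper name `abs_rel_error` is our own):

```python
def abs_rel_error(x, xhat):
    """Return the absolute and relative error of the approximation xhat to x."""
    abs_err = abs(xhat - x)
    return abs_err, abs_err / abs(x)

# The same absolute error of 1e2 is negligible against 1e12,
# but quite notable against 1e4.
print(abs_rel_error(1e12, 1e12 + 1e2))  # (100.0, 1e-10)
print(abs_rel_error(1e4, 1e4 + 1e2))    # (100.0, 0.01)
```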
Approximating numbers
For the discussion of the relative error, we sometimes write
x̂ = x(1 + δ)
for some δ that indicates the fraction by which the approximation x̂ differs from the true value x.
In that case, |δ| equals the relative error:
(x̂ − x) / x = δ,
|x̂ − x| / |x| = |δ|
Approximating numbers (Example)
The quantity that we want to compute and the quantity that we use are typically different:
π = 3.1415926535… vs. π̂ = 3.14159
Absolute and relative error:
3.14159265359… − 3.14159 ≈ 2.65 × 10⁻⁶
|3.14159265359… − 3.14159| / |3.14159265359…| ≈ 8.44 × 10⁻⁷
The relative error tells us how the magnitude of the error compares to the magnitude of the actual value. The relative error stays the same if we change the scale (e.g., units).
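The same numbers in Python (`math.pi` is the double closest to π):

```python
import math

pi_hat = 3.14159  # five-fractional-digit approximation of pi
abs_err = abs(pi_hat - math.pi)
rel_err = abs_err / abs(math.pi)
print(f"{abs_err:.2e}")  # 2.65e-06
print(f"{rel_err:.2e}")  # 8.45e-07

# Rescaling (e.g., a change of units) leaves the relative error unchanged:
km = 1e-3
scaled = abs(pi_hat * km - math.pi * km) / abs(math.pi * km)
print(math.isclose(rel_err, scaled))  # True
```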
Unit Roundoff
The unit roundoff u of a floating-point format is the maximum relative error of a number when approximated in the floating-point format:
|x_float − x| / |x| ≤ u,
where x_float is the closest approximation to x in the floating-point format. The unit roundoff depends only on the number of fractional digits that we keep in the floating-point format (and the base that we use).
For example, if we use a decimal floating-point system with 5 fractional digits, then the closest approximation to π is
π_float = 3.14159
The unit roundoff when using five fractional digits is then u = ½ × 10⁻⁵ = 5 × 10⁻⁶.
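This toy decimal format can be sketched in Python (the rounding helper `to_float5` is our own and assumes x ≠ 0):

```python
import math

def to_float5(x):
    """Round x to a decimal mantissa with 5 fractional digits."""
    exp = math.floor(math.log10(abs(x)))  # decimal exponent
    mant = x / 10**exp                    # mantissa in [1, 10)
    return round(mant, 5) * 10**exp

u = 0.5e-5  # unit roundoff: half a unit in the last mantissa place
pi_float = to_float5(math.pi)
rel_err = abs(pi_float - math.pi) / abs(math.pi)
print(pi_float)     # 3.14159
print(rel_err < u)  # True
```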
Machine Epsilon
The machine epsilon ε of a floating-point format is the difference between 1 and the next larger floating-point number.
Like the unit roundoff, the machine epsilon depends only on the precision (and the base) of the floating-point format.
The machine epsilon and the unit roundoff are comparable to each other.
Terminology may vary: some authors use machine epsilon and unit roundoff interchangeably. This is of little consequence for practical purposes.
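For Python's double-precision `float`, both constants are directly accessible (a sketch using only the standard library):

```python
import sys

# Machine epsilon: the gap between 1.0 and the next larger double, 2**-52.
print(sys.float_info.epsilon == 2.0**-52)  # True

# The same value found by halving: the last power of two that is
# still visible when added to 1.0.
eps = 1.0
while 1.0 + eps / 2 > 1.0:
    eps /= 2
print(eps == sys.float_info.epsilon)       # True
```

Under round-to-nearest, the unit roundoff of this format is u = ε/2 = 2⁻⁵³, which is one way the two quantities are "comparable".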
Basic arithmetic operations
We express all calculations in a few "basic" arithmetic operations +, −, ×, ÷, sqrt, exp, …
These are replaced by approximate basic operations +̂, −̂, ×̂, ÷̂, sqrt̂, exp̂, …
That's because the result is always approximated by some number in the floating-point format.
Modern implementations of floating-point arithmetic require, e.g., that the result of a +̂ b is the same as a + b after being rounded to the floating-point format:
|(a +̂ b) − (a + b)| / |a + b| ≤ u
Similarly for the other elementary operations: the floating-point operation gives the best approximation to the exact result within the floating-point format.
Basic arithmetic operations
Modern implementations of floating-point arithmetic require, e.g., that the result of a +̂ b is the same as a + b after being rounded to the floating-point format:
|(a +̂ b) − (a + b)| / |a + b| ≤ u
So there exists −u ≤ δ ≤ u such that
a +̂ b = (a + b)(1 + δ).
Analogously for the other operations.
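The existence of such a δ can be checked numerically, since every double is an exact rational number (a small Python sketch using `fractions`):

```python
from fractions import Fraction

# One rounded operation: a (+^) b = (a + b)(1 + delta), |delta| <= u.
a, b = 0.1, 0.2
fl_sum = a + b                        # the rounded floating-point result
exact = Fraction(a) + Fraction(b)     # the exact sum of the stored values
delta = (Fraction(fl_sum) - exact) / exact
u = Fraction(1, 2**53)                # unit roundoff of double precision
print(delta != 0)                     # True: the addition really did round
print(abs(delta) <= u)                # True: within the guaranteed bound
```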
Basic arithmetic operations
In long calculations, inexact intermediate results are used as input for other inexact calculations, and the errors accumulate in the long run:
(a +̂ b) ×̂ (c +̂ d) = ((a + b)(1 + δ₁)) ×̂ ((c + d)(1 + δ₂))
= ((a + b)(1 + δ₁) × (c + d)(1 + δ₂))(1 + δ₃)
= (a + b) × (c + d) × (1 + δ₁)(1 + δ₂)(1 + δ₃)
where −u ≤ δ₁, δ₂, δ₃ ≤ u indicate the approximation errors at each step. The relative error of the final result is
((a +̂ b) ×̂ (c +̂ d) − (a + b) × (c + d)) / ((a + b) × (c + d)) = δ₁ + (δ₂ + δ₃ + δ₂δ₃)(1 + δ₁)
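The compounding of per-operation errors over many steps can be observed by using exact rational arithmetic as a reference (a sketch; it sums the double closest to 0.1 a million times):

```python
from fractions import Fraction

# Each += rounds once; each rounding has relative size at most
# u ~ 1.1e-16, but over many additions the errors can accumulate.
n = 10**6
s = 0.0
for _ in range(n):
    s += 0.1
exact = n * Fraction(0.1)  # exact sum of the value actually stored for 0.1
drift = float(abs(Fraction(s) - exact) / exact)
print(drift)  # on the order of 1e-11: far larger than one rounding error
```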
Basic arithmetic operations
Another example:
a ×̂ b −̂ c = (a × b)(1 + δ₁) −̂ c
= ((a × b)(1 + δ₁) − c)(1 + δ₂)
where −u ≤ δ₁, δ₂ ≤ u are the approximation errors at each step. We isolate the true result from the last expression:
a ×̂ b −̂ c = (a × b − c) + δ₂(a × b − c) + (a × b)δ₁(1 + δ₂)
Hence
(a ×̂ b −̂ c − (a × b − c)) / (a × b − c) = δ₂ + (a × b)δ₁(1 + δ₂) / (a × b − c)
Basic arithmetic operations
This effect is called cancellation: when subtracting two numbers x and y that are almost equal, small relative errors in x and y lead to relatively big errors in x − y.
For the difference of the perturbed values x(1 + δ₁) and y(1 + δ₂):
x(1 + δ₁) − y(1 + δ₂) = (x − y) + δ₁x − δ₂y
Thus, for the relative error we find:
((x(1 + δ₁) − y(1 + δ₂)) − (x − y)) / (x − y) = (δ₁x − δ₂y) / (x − y),
|δ₁x − δ₂y| / |x − y| ≤ u (|x| + |y|) / |x − y|
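A classic Python illustration of cancellation (the specific numbers are our own choice):

```python
# x and y agree in their first ~10 digits, so their difference
# retains only the remaining digits of precision.
x = 1.0 + 1e-10   # the addition already rounds 1e-10 into the double grid
y = 1.0
diff = x - y      # the subtraction itself is exact here (Sterbenz lemma),
                  # but the rounding hidden in x now dominates the result
rel_err = abs(diff - 1e-10) / 1e-10
print(rel_err)    # on the order of 1e-8 to 1e-7, versus u ~ 1.1e-16
```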
Underflow/Overflow of Floating-Point Numbers
• Most implementations are binary and have the exponent constrained to a certain range.
• If an intermediate calculation produces a very small non-zero number, then the result may be rounded to zero.
• If an intermediate calculation produces a very large number, then the result may be rounded to ±∞.
• Lastly, certain operations such as 0/0, ∞/∞, √−1, … produce NaN (Not-a-Number), which indicates invalid computations.
This has numerous consequences for the design of algorithms. For example, it is good practice to check the range of the operands involved in every division.
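These behaviors are easy to trigger in Python (note: Python raises an exception for 0.0 / 0.0 on floats instead of returning NaN, so we provoke NaN with ∞ − ∞):

```python
import math

print(1e308 * 10)            # inf: overflow past the largest double (~1.8e308)
print(1e-308 / 1e100)        # 0.0: underflow to zero
print(math.inf - math.inf)   # nan: invalid operation
print(math.nan == math.nan)  # False: NaN compares unequal even to itself
```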
Basic arithmetic operations
• The discussion of relative errors applies to input errors (e.g., measurement errors) as well as to roundoff errors (e.g., from intermediate calculations).
• Roundoff errors are constrained by the machine epsilon/unit roundoff.
• The combination of roundoff errors over a sequence of operations may accumulate. The analysis can be quite difficult.
• The analysis of relative errors is very complex even for algorithms that are otherwise well understood.
• Most important problem in practice: loss of relative precision when taking the difference of numbers that are almost equal.