UCSD Mathematics
TRANSCRIPT
Floating-Point Arithmetic on Computers
• Scientific notation: 5.1671 × 10⁻³
• Floating-point notation: 5.1671e-3
• Advantage: easy to tell the magnitude of the quantity and fix the precision of approximations:
777777 ≈ 7.77777 × 10⁵, 881.916 ≈ 8.81916 × 10²
• Keywords: mantissa, exponent, base, precision
• Applications in physics and engineering require huge amounts of numerical computations
• Need to fix a particular format to write down those huge numbers of numbers
• Solution: floating-point format with fixed precision
• Trade-off: effort for storing and calculation vs. accuracy of computations
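In code, e-notation is the standard way to write such literals; a small Python sketch (Python's built-in `float` is a floating-point type):

```python
# A float literal in e-notation: mantissa, 'e', decimal exponent.
x = 5.1671e-3
# Formatting with ".4e" prints a one-digit integer part and four
# fractional mantissa digits, making magnitude and precision explicit.
print(format(x, ".4e"))  # 5.1671e-03
```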
Floating-Point Arithmetic on Computers
• Floating-point format: scientific notation, but we restrict the precision in all computations
• Mainboards and graphics cards have dedicated floating-point units only for such computations
• Phones, tablets, laptops, supercomputers
• Many programming languages (C, C++, Java, …) have datatypes only for specific floating-point formats
• Most popular floating-point formats:
  • Single precision (4 bytes, 23-bit mantissa, 8-bit exponent)
  • Double precision (8 bytes, 52-bit mantissa, 11-bit exponent)
• Rounding errors are typically negligible, but accumulate after many calculations
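The precision gap between the two formats is easy to see from Python (a sketch: Python's `float` is double precision, and the `struct` module can round-trip a value through the 4-byte single-precision format):

```python
import struct

x = 0.1
# Store x in 4-byte single precision and read it back as a double.
single = struct.unpack("<f", struct.pack("<f", x))[0]
# Printing 17 significant digits exposes the actually stored values.
print(format(x, ".17g"))       # 0.10000000000000001 (double)
print(format(single, ".17g"))  # 0.10000000149011612 (single)
```

Roughly 15–16 significant decimal digits survive the double format, but only 7–8 survive single precision.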
Floating-Point Arithmetic on Computers
• Floating-point units implement computation with certain floating-point formats
• Single and double precision format, and possibly other formats such as quadruple precision
• Built-in capacity for addition, subtraction, multiplication, division, square root, and possibly trigonometric functions
Approximating numbers
Suppose that x is some number and x̂ is an approximation to that number. The absolute error is
|x̂ − x|
The relative error is
|x̂ − x| / |x|
In practice, only the relative error is a good measure of approximation quality. An absolute error of 10² is unimportant if x = 10¹², but it is quite notable if x = 10⁴.
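A minimal Python sketch of the two error measures (the helper name `abs_rel_error` is our own):

```python
def abs_rel_error(x, xhat):
    """Return the absolute and relative error of the approximation xhat to x."""
    abs_err = abs(xhat - x)
    return abs_err, abs_err / abs(x)

# The same absolute error of 1e2 is negligible against 1e12,
# but quite notable against 1e4.
print(abs_rel_error(1e12, 1e12 + 1e2))  # (100.0, 1e-10)
print(abs_rel_error(1e4, 1e4 + 1e2))    # (100.0, 0.01)
```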
Approximating numbers
For the discussion of the relative error, we sometimes write
x̂ = x(1 + δ)
for some δ that indicates the fraction by which the approximation x̂ differs from the true value x.
In that case, |δ| equals the relative error:
(x̂ − x) / x = δ,
|x̂ − x| / |x| = |δ|
Approximating numbers (Example)
The quantity that we want to compute and the quantity that we use are typically different:
π = 3.1415926535… vs. π̂ = 3.14159
Absolute and relative error:
3.14159265359… − 3.14159 ≈ 2.65 × 10⁻⁶
|3.14159265359… − 3.14159| / |3.14159265359…| ≈ 8.44 × 10⁻⁷
The relative error tells us how the magnitude of the error compares to the magnitude of the actual value. The relative error stays the same if we change the scale (e.g., units).
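The same numbers in Python (`math.pi` is the double closest to π):

```python
import math

pi_hat = 3.14159  # five-fractional-digit approximation of pi
abs_err = abs(pi_hat - math.pi)
rel_err = abs_err / abs(math.pi)
print(f"{abs_err:.2e}")  # 2.65e-06
print(f"{rel_err:.2e}")  # 8.45e-07

# Rescaling (e.g., a change of units) leaves the relative error unchanged:
km = 1e-3
scaled = abs(pi_hat * km - math.pi * km) / abs(math.pi * km)
print(math.isclose(rel_err, scaled))  # True
```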
Unit Roundoff
The unit roundoff u of a floating-point format is the maximum relative error of a number when approximated in the floating-point format:
|x_float − x| / |x| ≤ u,
where x_float is the closest approximation to x in the floating-point format. The unit roundoff depends only on the number of fractional digits that we keep in the floating-point format (and the base that we use).
For example, if we use a decimal floating-point system with 5 fractional digits, then the closest approximation to π is
π_float = 3.14159
The unit roundoff when using five fractional digits is then u = ½ × 10⁻⁵ = 5 × 10⁻⁶.
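This toy decimal format can be sketched in Python (the rounding helper `to_float5` is our own and assumes x ≠ 0):

```python
import math

def to_float5(x):
    """Round x to a decimal mantissa with 5 fractional digits."""
    exp = math.floor(math.log10(abs(x)))  # decimal exponent
    mant = x / 10**exp                    # mantissa in [1, 10)
    return round(mant, 5) * 10**exp

u = 0.5e-5  # unit roundoff: half a unit in the last mantissa place
pi_float = to_float5(math.pi)
rel_err = abs(pi_float - math.pi) / abs(math.pi)
print(pi_float)     # 3.14159
print(rel_err < u)  # True
```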
Machine Epsilon
The machine epsilon ε of a floating-point format is the difference between 1 and the next larger floating-point number.
Like the unit roundoff, the machine epsilon depends only on the precision (and the base) of the floating-point format.
The machine epsilon and the unit roundoff are comparable to each other.
Terminology may vary: some authors use machine epsilon and unit roundoff interchangeably. This is of little consequence for practical purposes.
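For Python's double-precision `float`, both constants are directly accessible (a sketch using only the standard library):

```python
import sys

# Machine epsilon: the gap between 1.0 and the next larger double, 2**-52.
print(sys.float_info.epsilon == 2.0**-52)  # True

# The same value found by halving: the last power of two that is
# still visible when added to 1.0.
eps = 1.0
while 1.0 + eps / 2 > 1.0:
    eps /= 2
print(eps == sys.float_info.epsilon)       # True
```

Under round-to-nearest, the unit roundoff of this format is u = ε/2 = 2⁻⁵³, which is one way the two quantities are "comparable".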
Basic arithmetic operations
We express all calculations in a few "basic" arithmetic operations +, −, ×, ÷, sqrt, exp, …
These are replaced by approximate basic operations +̂, −̂, ×̂, ÷̂, sqrt̂, exp̂, …
That's because the result is always approximated by some number in the floating-point format.
Modern implementations of floating-point arithmetic require, e.g., that the result of a +̂ b is the same as a + b after being rounded to the floating-point format:
|(a +̂ b) − (a + b)| / |a + b| ≤ u
Similarly for the other elementary operations: the floating-point operation gives the best approximation to the exact result within the floating-point format.
Basic arithmetic operations
Modern implementations of floating-point arithmetic require, e.g., that the result of a +̂ b is the same as a + b after being rounded to the floating-point format:
|(a +̂ b) − (a + b)| / |a + b| ≤ u
So there exists −u ≤ δ ≤ u such that
a +̂ b = (a + b)(1 + δ).
Analogously for the other operations.
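The existence of such a δ can be checked numerically, since every double is an exact rational number (a small Python sketch using `fractions`):

```python
from fractions import Fraction

# One rounded operation: a (+^) b = (a + b)(1 + delta), |delta| <= u.
a, b = 0.1, 0.2
fl_sum = a + b                        # the rounded floating-point result
exact = Fraction(a) + Fraction(b)     # the exact sum of the stored values
delta = (Fraction(fl_sum) - exact) / exact
u = Fraction(1, 2**53)                # unit roundoff of double precision
print(delta != 0)                     # True: the addition really did round
print(abs(delta) <= u)                # True: within the guaranteed bound
```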
Basic arithmetic operations
In long calculations, inexact intermediate results are used as input for other inexact calculations, and the errors accumulate in the long run:
(a +̂ b) ×̂ (c +̂ d) = ((a + b)(1 + δ₁)) ×̂ ((c + d)(1 + δ₂))
= ((a + b)(1 + δ₁) × (c + d)(1 + δ₂))(1 + δ₃)
= (a + b) × (c + d) × (1 + δ₁)(1 + δ₂)(1 + δ₃)
where −u ≤ δ₁, δ₂, δ₃ ≤ u indicate the approximation errors at each step. The relative error of the final result is
((a +̂ b) ×̂ (c +̂ d) − (a + b) × (c + d)) / ((a + b) × (c + d)) = δ₁ + (δ₂ + δ₃ + δ₂δ₃)(1 + δ₁)
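The compounding of per-operation errors over many steps can be observed by using exact rational arithmetic as a reference (a sketch; it sums the double closest to 0.1 a million times):

```python
from fractions import Fraction

# Each += rounds once; each rounding has relative size at most
# u ~ 1.1e-16, but over many additions the errors can accumulate.
n = 10**6
s = 0.0
for _ in range(n):
    s += 0.1
exact = n * Fraction(0.1)  # exact sum of the value actually stored for 0.1
drift = float(abs(Fraction(s) - exact) / exact)
print(drift)  # on the order of 1e-11: far larger than one rounding error
```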
Basic arithmetic operations
Another example:
a ×̂ b −̂ c = (a × b)(1 + δ₁) −̂ c
= ((a × b)(1 + δ₁) − c)(1 + δ₂)
where −u ≤ δ₁, δ₂ ≤ u are the approximation errors at each step. We isolate the true result from the last expression:
a ×̂ b −̂ c = (a × b − c) + δ₂(a × b − c) + (a × b)δ₁(1 + δ₂)
Hence
(a ×̂ b −̂ c − (a × b − c)) / (a × b − c) = δ₂ + (a × b)δ₁(1 + δ₂) / (a × b − c)
Basic arithmetic operations
This effect is called cancellation: when subtracting two numbers x and y that are almost equal, small relative errors in x and y lead to relatively big errors in x − y.
For the difference of the perturbed values x(1 + δ₁) and y(1 + δ₂):
x(1 + δ₁) − y(1 + δ₂) = (x − y) + δ₁x − δ₂y
Thus, for the relative error we find:
((x(1 + δ₁) − y(1 + δ₂)) − (x − y)) / (x − y) = (δ₁x − δ₂y) / (x − y),
|δ₁x − δ₂y| / |x − y| ≤ u (|x| + |y|) / |x − y|
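A classic Python illustration of cancellation (the specific numbers are our own choice):

```python
# x and y agree in their first ~10 digits, so their difference
# retains only the remaining digits of precision.
x = 1.0 + 1e-10   # the addition already rounds 1e-10 into the double grid
y = 1.0
diff = x - y      # the subtraction itself is exact here (Sterbenz lemma),
                  # but the rounding hidden in x now dominates the result
rel_err = abs(diff - 1e-10) / 1e-10
print(rel_err)    # on the order of 1e-8 to 1e-7, versus u ~ 1.1e-16
```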
Underflow/Overflow of Floating-Point Numbers
• Most implementations are binary and have the exponent constrained to a certain range.
• If an intermediate calculation produces a very small non-zero number, then the result may be rounded to zero.
• If an intermediate calculation produces a very large number, then the result may be rounded to ±∞.
• Lastly, certain operations such as 0/0, ∞/∞, √−1, … produce NaN (Not-a-Number), which indicates invalid computations.
This has numerous consequences for the design of algorithms. For example, it is good practice to check the range of the operands involved in every division.
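These behaviors are easy to trigger in Python (note: Python raises an exception for 0.0 / 0.0 on floats instead of returning NaN, so we provoke NaN with ∞ − ∞):

```python
import math

print(1e308 * 10)            # inf: overflow past the largest double (~1.8e308)
print(1e-308 / 1e100)        # 0.0: underflow to zero
print(math.inf - math.inf)   # nan: invalid operation
print(math.nan == math.nan)  # False: NaN compares unequal even to itself
```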
Basic arithmetic operations
• The discussion of relative errors applies to input errors (e.g., measurement errors) as well as to roundoff errors (e.g., from intermediate calculations).
• Roundoff errors are constrained by the machine epsilon/unit roundoff.
• The combination of roundoff errors over a sequence of operations may accumulate. The analysis can be quite difficult.
• The analysis of relative errors is very complex even for algorithms that are otherwise well understood.
• Most important problem in practice: loss of relative precision when taking the difference of numbers that are almost equal.