
Page 1: Numerical Mathematical Analysis

Numerical Mathematical Analysis

Catalin Trenchea

Department of Mathematics, University of Pittsburgh

September 20, 2010

Numerical Mathematical Analysis Math 1070

Page 2: Numerical Mathematical Analysis

Numerical Mathematical Analysis

Outline

1 Introduction, Matlab notes

2 Chapter 1: Taylor polynomials

3 Chapter 2: Error and Computer Arithmetic

4 Chapter 4: 4.1 - 4.3 Interpolation

5 Chapter 5: Numerical integration and differentiation

6 Chapter 3: Rootfinding

7 Chapter 4: Approximation of functions

Numerical Mathematical Analysis Math 1070

Page 3: Numerical Mathematical Analysis

> Introduction

Numerical Analysis

Numerical Analysis: This refers to the analysis of mathematical problems by numerical means, especially mathematical problems arising from models based on calculus. Effective numerical analysis requires several things:

An understanding of the computational tool being used, be it a calculator or a computer.

An understanding of the problem to be solved.

Construction of an algorithm which will solve the given mathematical problem to a given desired accuracy and within the limits of the resources (time, memory, etc.) that are available.

1. Introduction, Matlab notes Math 1070

Page 4: Numerical Mathematical Analysis

> Introduction

This is a complex undertaking. Numerous people make this their life's work, usually working on only a limited variety of mathematical problems. Within this course, we attempt to show the spirit of the subject.

Most of our time will be taken up with looking at algorithms for solving basic problems such as rootfinding and numerical integration; but we will also look at the structure of computers and the implications of using them in numerical calculations.

We begin by looking at the relationship of numerical analysis to the larger world of science and engineering.

1. Introduction, Matlab notes Math 1070

Page 5: Numerical Mathematical Analysis

> Introduction > Science

Traditionally, engineering and science had a two-sided approach to understanding a subject: the theoretical and the experimental. More recently, a third approach has become equally important: the computational.

Traditionally we would build an understanding by constructing theoretical mathematical models, and we would solve these for special cases.

For example, we would study the flow of an incompressible irrotational fluid past a sphere, obtaining some idea of the nature of fluid flow.

But more practical situations could seldom be handled by direct means, because the needed equations were too difficult to solve.

Thus we also used the experimental approach to obtain better information about the flow of practical fluids.

The theory would suggest ideas to be tried in the laboratory, and the experimental results would often suggest directions for a further development of theory.

1. Introduction, Matlab notes Math 1070

Page 6: Numerical Mathematical Analysis

> Introduction > Science

With the rapid advance in powerful computers, we now can augment the study of fluid flow by directly solving the theoretical models of fluid flow as applied to more practical situations; this area is often referred to as computational fluid dynamics.

At the heart of computational science is numerical analysis; to effectively carry out a computational science approach to studying a physical problem, we must understand the numerical analysis being used, especially if improvements are to be made to the computational techniques being used.

1. Introduction, Matlab notes Math 1070

Page 7: Numerical Mathematical Analysis

> Introduction > Science

Mathematical Models

A mathematical model is a mathematical description of a physical situation. By means of studying the model, we hope to understand more about the physical situation. Such a model might be very simple. For example,

A = 4π R_e^2,   R_e ≈ 6,371 km

is a formula for the surface area of the earth. How accurate is it?

1 First, it assumes the earth is a sphere, which is only an approximation. At the equator, the radius is approximately 6,378 km; and at the poles, the radius is approximately 6,357 km.

2 Next, there is experimental error in determining the radius; and in addition, the earth is not perfectly smooth.

Therefore, there are limits on the accuracy of this model for the surface area of the earth.

1. Introduction, Matlab notes Math 1070

Page 8: Numerical Mathematical Analysis

> Introduction > Science

An infectious disease model

For rubella measles, we have the following model for the spread of the infection in a population (subject to certain assumptions).

ds/dt = −a s i
di/dt = a s i − b i
dr/dt = b i

In this, s, i, and r refer, respectively, to the proportions of a total population that are susceptible, infectious, and removed (from the susceptible and infectious pool of people). All variables are functions of time t. The constants can be taken as

a = 6.8/11,   b = 1/11.
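As a concrete illustration, the system can be solved numerically in MATLAB (a minimal sketch, not from the slides; the time span and the initial proportions, with 1% initially infectious, are hypothetical):

a = 6.8/11; b = 1/11;
f = @(t, y) [-a*y(1)*y(2); a*y(1)*y(2) - b*y(2); b*y(2)];   % y = [s; i; r]
[t, y] = ode45(f, [0 100], [0.99; 0.01; 0]);
plot(t, y)                                                  % columns: s, i, r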

1. Introduction, Matlab notes Math 1070

Page 9: Numerical Mathematical Analysis

> Introduction > Science

Mathematical Models

The same model works for some other diseases (e.g., flu), with a suitable change of the constants a and b. Again, this is an approximation of reality (and a useful one).

But it has its limits. Solving a bad model will not give good results, no matter how accurately it is solved; and the person solving this model and using the results must know enough about the formation of the model to be able to correctly interpret the numerical results.

1. Introduction, Matlab notes Math 1070

Page 10: Numerical Mathematical Analysis

> Introduction > Science

The logistic equation

This is the simplest model for population growth. Let N(t) denote the number of individuals in a population (rabbits, people, bacteria, etc.). Then we model its growth by

N′(t) = c N(t),   t ≥ 0,   N(t0) = N0

The constant c is the growth constant, and it usually must be determined empirically.

Over short periods of time, this is often an accurate model for population growth. For example, it accurately models the growth of the US population over the period 1790 to 1860, with c = 0.2975.
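Separating variables gives the closed-form solution N(t) = N0 e^{c(t − t0)}: the population is multiplied by the fixed factor e^c in each unit of time, which is why this is also called the exponential growth model.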

1. Introduction, Matlab notes Math 1070

Page 11: Numerical Mathematical Analysis

> Introduction > Science

The predator-prey model

Let F(t) denote the number of foxes at time t, and let R(t) denote the number of rabbits at time t. A simple model for these populations is called the Lotka-Volterra predator-prey model:

dR/dt = a[1 − bF(t)]R(t)
dF/dt = c[−1 + dR(t)]F(t)

with a, b, c, d positive constants. If one looks carefully at this, then one can see how it is built from the logistic equation. In some cases, this is a very useful model and agrees with physical experiments.

Of course, we can substitute other interpretations, replacing foxes and rabbits with other predator and prey. The model will fail, however, when there are other populations that affect the first two populations in a significant way.

1. Introduction, Matlab notes Math 1070

Page 12: Numerical Mathematical Analysis

> Introduction > Science

1. Introduction, Matlab notes Math 1070

Page 13: Numerical Mathematical Analysis

> Introduction > Science

Newton’s second law

Newton's second law states that the force acting on an object is directly proportional to the product of its mass and acceleration,

F ∝ ma

With a suitable choice of physical units, we usually write this in its scalar form as

F = ma

Newton's law of gravitation for a two-body situation, say the earth and an object moving about the earth, is then

m d^2 r(t)/dt^2 = −(G m m_e/|r(t)|^2) · r(t)/|r(t)|

with r(t) the vector from the center of the earth to the center of the object moving about the earth. The constant G is the gravitational constant, not dependent on the earth; and m and m_e are the masses, respectively, of the object and the earth.

1. Introduction, Matlab notes Math 1070

Page 14: Numerical Mathematical Analysis

> Introduction > Science

This is an accurate model for many purposes. But what are some physical situations under which it will fail? When the object is very close to the surface of the earth and does not move far from one spot, we take |r(t)| to be the radius of the earth and obtain the new model

m d^2 r(t)/dt^2 = −mg k

with k the unit vector pointing directly upward from the earth's surface at the location of the object. The gravitational constant is

g ≈ 9.8 meters/second^2

Again this is a model; it is not physical reality.

1. Introduction, Matlab notes Math 1070

Page 15: Numerical Mathematical Analysis

> Matlab notes

Matlab notes

Matlab is designed for numerical computing.

Strongly oriented towards use of arrays, one- and two-dimensional.

Excellent graphics that are easy to use.

Powerful interactive facilities; and programs can also be written in it.

It is a procedural language, not an object-oriented language.

It has facilities for working with both Fortran and C language programs.

1. Introduction, Matlab notes Math 1070

Page 16: Numerical Mathematical Analysis

> Matlab notes

At the prompt in Unix or Linux, type Matlab. Or click the Red Hat, then DIVMS, then Mathematics, then Matlab. Run the demo program (simply type demo), then select one of the many available demos. To seek help on any command, simply type

help command

or use the online Help command. To seek information on Matlab commands that involve a given word in their description, type

lookfor word

Look at the various online manuals available through the help page.

1. Introduction, Matlab notes Math 1070

Page 17: Numerical Mathematical Analysis

> Matlab notes

MATLAB is an interactive computer language. For example, to evaluate

y = 6 − 4x + 7x^2 − 3x^5 + 3/(x + 2),

use

y = 6 - 4*x + 7*x*x - 3*x^5 + 3/(x+2);

There are many built-in functions, e.g.

exp(x), cos(x), sqrt(x), log(x)

The default arithmetic used in MATLAB is double precision (16 decimal digits, with magnitude range 10^{−308} to 10^{+308}) and real. However, complex arithmetic appears automatically when needed:

sqrt(-4) results in an answer of 2i.

1. Introduction, Matlab notes Math 1070

Page 18: Numerical Mathematical Analysis

> Matlab notes

The default output to the screen shows 4 digits to the right of the decimal point. To control the formatting of output to the screen, use the command format. The default formatting is obtained using

format short

To obtain the full accuracy available in a number, you can use

format long

The commands

format short e
format long e

will use 'scientific notation' for the output. Other format options are also available.

1. Introduction, Matlab notes Math 1070

Page 19: Numerical Mathematical Analysis

> Matlab notes

SEE plot_trig.m

MATLAB works very efficiently with arrays, and many tasks are best done with arrays. For example, plot sin x and cos x on the interval 0 ≤ x ≤ 10:

t = 0 : .1 : 10;
x = cos(t); y = sin(t);
plot(t, x, t, y, 'LineWidth', 4)

Figure: the two curves produced by plot_trig.m on [0, 10]

1. Introduction, Matlab notes Math 1070

Page 20: Numerical Mathematical Analysis

> Matlab notes

The statement

t = a : h : b;

with h > 0 creates a row vector of the form

t = [a, a+h, a+2h, ...]

giving all values a + jh that do not exceed b. When h is omitted, it is assumed to be 1. Thus

n = 1 : 5

creates the row vector

n = [1, 2, 3, 4, 5]

1. Introduction, Matlab notes Math 1070

Page 21: Numerical Mathematical Analysis

> Matlab notes

Arrays

b = [1, 2, 3]

creates a row vector of length 3.

A = [1 2 3; 4 5 6; 7 8 9]

creates the square matrix

A = [ 1  2  3
      4  5  6
      7  8  9 ]

Spaces or commas can be used as delimiters in giving the components of an array; a semicolon separates the rows of a matrix. For a column vector,

b = [1 3 -6]'

results in the column vector

[  1
   3
  -6 ]

1. Introduction, Matlab notes Math 1070

Page 22: Numerical Mathematical Analysis

> Matlab notes

Array Operations

Addition: do componentwise addition.

A = [1, 2; 3, -2; -6, 1];
B = [2, 3; -3, 2; 2, -2];
C = A + B;

results in the answer

C = [  3   5
       0   0
      -4  -1 ]

Multiplication by a constant: multiply the constant times each component of the array.

D = 2*A;

results in the answer

D = [  2   4
       6  -4
     -12   2 ]

1. Introduction, Matlab notes Math 1070

Page 23: Numerical Mathematical Analysis

> Matlab notes

Array Operations

Matrix multiplication: This has the standard meaning.

E = [1, -2; 2, -1; -3, 2];
F = [2, -1, 3; -1, 2, 3];
G = E*F;

results in the answer

G = [  4  -5  -3
       5  -4   3
      -8   7  -3 ]

A nonstandard notation:

H = 3 + F;

results in the computation H = 3*ones(2,3) + F, i.e., 3 is added to every component:

H = [ 5  2  6
      2  5  6 ]

1. Introduction, Matlab notes Math 1070

Page 24: Numerical Mathematical Analysis

> Matlab notes

Componentwise operations

Matlab also has componentwise operations for multiplication, division and exponentiation. These three operations are denoted by using a period preceding the usual symbol for the operation. With

a = [1 2 3]; b = [2 -1 4];

we have

a.*b = [2 -2 12]
a./b = [0.5 -2.0 0.75]
a.^3 = [1 8 27]
2.^a = [2 4 8]
b.^a = [2 1 64]

The expression

y = 6 − 4x + 7x^2 − 3x^5 + 3/(x + 2)

can be evaluated at all of the elements of an array x using the command

y = 6 - 4*x + 7*x.*x - 3*x.^5 + 3./(x+2);

The output y is then an array of the same size as x.

1. Introduction, Matlab notes Math 1070

Page 25: Numerical Mathematical Analysis

> Matlab notes

Special arrays

A = zeros(2, 3)

produces an array with 2 rows and 3 columns, with all components set to zero:

[ 0  0  0
  0  0  0 ]

B = ones(2, 3)

produces an array with 2 rows and 3 columns, with all components set to 1:

[ 1  1  1
  1  1  1 ]

eye(3) results in the 3×3 identity matrix:

[ 1  0  0
  0  1  0
  0  0  1 ]

1. Introduction, Matlab notes Math 1070

Page 26: Numerical Mathematical Analysis

> Matlab notes

Array functions

There are many MATLAB commands that operate on arrays; we include only a very few here. For a vector x, row or column, of length n, we have the following functions:

max(x) = maximum component of x
min(x) = minimum component of x
abs(x) = vector of absolute values of the components of x
sum(x) = sum of the components of x
norm(x) = sqrt(|x_1|^2 + ... + |x_n|^2)

1. Introduction, Matlab notes Math 1070

Page 27: Numerical Mathematical Analysis

> Matlab notes

Script files

A list of interactive commands can be stored as a script file. For example, store

t = 0 : .1 : 10;
x = cos(t); y = sin(t);
plot(t, x, t, y)

with the file name plot_trig.m. Then to run the program, give the command

plot_trig

The variables used in the script file will be stored locally, and parameters given locally are available for use by the script file.

1. Introduction, Matlab notes Math 1070

Page 28: Numerical Mathematical Analysis

> Matlab notes

Functions

To create a function, we proceed similarly, but now there are input and output parameters. Consider a function for evaluating the polynomial

p(x) = a_1 + a_2 x + a_3 x^2 + ... + a_n x^{n−1}

(MATLAB does not allow zero subscripts for arrays, so the constant term is a_1.) The following function would be stored under the name polyeval.m. The coefficients {a_j} are given to the function in the array named coeff, and the polynomial is to be evaluated at all of the components of the array x.

1. Introduction, Matlab notes Math 1070

Page 29: Numerical Mathematical Analysis

> Matlab notes

Functions

function value = polyeval(x,coeff);
%
% function value = polyeval(x,coeff)
%
% Evaluate a polynomial at the points given in x.
% The coefficients are to be given in coeff.
% The constant term in the polynomial is coeff(1).

n = length(coeff)
value = coeff(n)*ones(size(x));
for i = n-1:-1:1
    value = coeff(i) + x.*value;
end

>> polyeval(3,[1,2])

yields

n = 2
ans = 7

1. Introduction, Matlab notes Math 1070

Page 30: Numerical Mathematical Analysis

> 1. Taylor polynomials

Taylor polynomials

1 Taylor polynomials

1 The Taylor polynomial
2 Error in Taylor's polynomial
3 Polynomial evaluation

1. Taylor polynomials Math 1070

Page 31: Numerical Mathematical Analysis

> 1. Taylor polynomials > 1.1 The Taylor polynomial

Let f(x) be a given function, for example

e^x, sin x, log(x).

The Taylor polynomial mimics the behavior of f(x) near x = a:

T(x) ≈ f(x), for all x "close" to a.

Example

Find a linear polynomial p_1(x) for which

p_1(a) = f(a),   p_1′(a) = f′(a).

p_1 is uniquely given by

p_1(x) = f(a) + (x − a) f′(a).

The graph of y = p_1(x) is tangent to that of y = f(x) at x = a.

1. Taylor polynomials Math 1070

Page 32: Numerical Mathematical Analysis

> 1. Taylor polynomials > 1.1 The Taylor polynomial

Example

Let f(x) = e^x, a = 0. Then

p_1(x) = 1 + x.

Figure: Linear Taylor approximation to e^x

1. Taylor polynomials Math 1070

Page 33: Numerical Mathematical Analysis

> 1. Taylor polynomials > 1.1 The Taylor polynomial

Example

Find a quadratic polynomial p_2(x) to approximate f(x) near x = a.

Since

p_2(x) = b_0 + b_1 x + b_2 x^2,

we impose three conditions on p_2(x) to determine the coefficients. To better mimic f(x) at x = a we require

p_2(a) = f(a)
p_2′(a) = f′(a)
p_2′′(a) = f′′(a)

p_2 is uniquely given by

p_2(x) = f(a) + (x − a) f′(a) + (1/2)(x − a)^2 f′′(a).

1. Taylor polynomials Math 1070

Page 34: Numerical Mathematical Analysis

> 1. Taylor polynomials > 1.1 The Taylor polynomial

Example

Let f(x) = e^x, a = 0. Then

p_2(x) = 1 + x + (1/2)x^2.

Figure: Linear and quadratic Taylor approximations to e^x (see eval_exp_simple.m)

1. Taylor polynomials Math 1070

Page 35: Numerical Mathematical Analysis

> 1. Taylor polynomials > 1.1 The Taylor polynomial

Let p_n(x) be a polynomial of degree n that mimics the behavior of f(x) at x = a. We require

p_n^{(j)}(a) = f^{(j)}(a),   j = 0, 1, ..., n

where f^{(j)} is the jth derivative of f(x). The Taylor polynomial of degree n for the function f(x) at the point a is

p_n(x) = f(a) + (x−a) f′(a) + ((x−a)^2/2!) f′′(a) + ... + ((x−a)^n/n!) f^{(n)}(a)
       = Σ_{j=0}^{n} ((x−a)^j/j!) f^{(j)}(a).    (3.1)

Recall the notations: f^{(0)}(a) = f(a), and the "factorial"

j! = 1 for j = 0,   j! = j·(j−1)···2·1 for j = 1, 2, 3, 4, ...

1. Taylor polynomials Math 1070

Page 36: Numerical Mathematical Analysis

> 1. Taylor polynomials > 1.1 The Taylor polynomial

Example

Let f(x) = e^x, a = 0. Since f^{(j)}(x) = e^x and f^{(j)}(0) = 1 for all j ≥ 0,

p_n(x) = 1 + x + x^2/2! + ... + x^n/n! = Σ_{j=0}^{n} x^j/j!    (3.2)

For a fixed x, the accuracy improves as the degree n increases.

For a fixed degree n, the accuracy improves as x gets closer to a = 0.

1. Taylor polynomials Math 1070

Page 37: Numerical Mathematical Analysis

> 1. Taylor polynomials > 1.1 The Taylor polynomial

Table. Taylor approximations to e^x

  x     p1(x)   p2(x)   p3(x)    e^x
 −1.0   0       0.500   0.33333  0.36788
 −0.5   0.5     0.625   0.60417  0.60653
 −0.1   0.9     0.905   0.90483  0.90484
  0     1.0     1.000   1.00000  1.00000
  0.1   1.1     1.105   1.10517  1.10517
  0.5   1.5     1.625   1.64583  1.64872
  1.0   2.0     2.500   2.66667  2.71828
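The table can be reproduced with a few array operations (a minimal MATLAB sketch, not from the slides):

x  = [-1.0 -0.5 -0.1 0 0.1 0.5 1.0]';
p1 = 1 + x;                 % degree-1 Taylor polynomial at a = 0
p2 = p1 + x.^2/2;           % add the quadratic term
p3 = p2 + x.^3/6;           % add the cubic term
disp([x p1 p2 p3 exp(x)])   % columns as in the table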

1. Taylor polynomials Math 1070

Page 38: Numerical Mathematical Analysis

> 1. Taylor polynomials > 1.1 The Taylor polynomial

Example

Let f(x) = e^x, with a arbitrary, not necessarily 0. Since f^{(j)}(x) = e^x and f^{(j)}(a) = e^a for all j ≥ 0,

p_n(x; a) = e^a (1 + (x−a) + (x−a)^2/2! + ... + (x−a)^n/n!) = e^a Σ_{j=0}^{n} (x−a)^j/j!

1. Taylor polynomials Math 1070

Page 39: Numerical Mathematical Analysis

> 1. Taylor polynomials > 1.1 The Taylor polynomial

Example

Let f(x) = ln(x), a = 1.

Since f(1) = ln(1) = 0, and

f^{(j)}(x) = (−1)^{j−1}(j−1)!/x^j,   f^{(j)}(1) = (−1)^{j−1}(j−1)!,

the Taylor polynomial is

p_n(x) = (x−1) − (1/2)(x−1)^2 + (1/3)(x−1)^3 − ... + (−1)^{n−1}(1/n)(x−1)^n
       = Σ_{j=1}^{n} ((−1)^{j−1}/j)(x−1)^j

1. Taylor polynomials Math 1070

Page 40: Numerical Mathematical Analysis

> 1. Taylor polynomials > 1.1 The Taylor polynomial

Figure: Taylor approximations p_1, p_2, p_3 of ln(x) about x = 1 (see plot_log.m)

1. Taylor polynomials Math 1070

Page 41: Numerical Mathematical Analysis

> 1. Taylor polynomials > 1.2 The error in Taylor’s polynomial

Theorem (Lagrange's form)

Assume f ∈ C^{n+1}[α, β] and x ∈ [α, β]. The remainder R_n(x) ≡ f(x) − p_n(x), i.e., the error in approximating f(x) by p_n(x) = Σ_{j=0}^{n} ((x−a)^j/j!) f^{(j)}(a), satisfies

R_n(x) = ((x−a)^{n+1}/(n+1)!) f^{(n+1)}(c_x),   α ≤ x ≤ β    (3.3)

where c_x is an unknown point between a and x.

[Exercise.] Derive the formal Taylor series for f(x) = ln(1+x) at a = 0, and determine the range of positive x for which the series represents the function.

Hint: f^{(k)}(x) = (−1)^{k−1}(k−1)!/(1+x)^k, f^{(k)}(0) = (−1)^{k−1}(k−1)!, and 0 ≤ x/(1+c_x) ≤ 1 if x ∈ [0, 1].

ln(1+x) = Σ_{k=1}^{n} (−1)^{k−1} x^k/k + ((−1)^n/(n+1)) x^{n+1}/(1+c_x)^{n+1}

1. Taylor polynomials Math 1070

Page 42: Numerical Mathematical Analysis

> 1. Taylor polynomials > 1.2 The error in Taylor’s polynomial

Example

Let f(x) = e^x and a = 0.

Recall that p_n(x) = Σ_{j=0}^{n} x^j/j! = 1 + x + x^2/2! + ... + x^n/n!.

The approximation error is

e^x − p_n(x) = (x^{n+1}/(n+1)!) e^c,   n ≥ 0    (3.4)

with c an unknown point between 0 and x.

[Exercise.] It can be proved that for each fixed x,

lim_{n→∞} R_n(x) = 0,   i.e.,   e^x = lim_{n→∞} Σ_{k=0}^{n} x^k/k! = Σ_{k=0}^{∞} x^k/k!.

(See the case |x| ≤ 1.)

For each fixed n, R_n becomes larger as x moves away from 0.

1. Taylor polynomials Math 1070

Page 43: Numerical Mathematical Analysis

> 1. Taylor polynomials > 1.2 The error in Taylor’s polynomial

Figure: Errors R_1(x), ..., R_4(x) in the Taylor polynomial approximations to e^x (see plot_exp_simple.m)

1. Taylor polynomials Math 1070

Page 44: Numerical Mathematical Analysis

> 1. Taylor polynomials > 1.2 The error in Taylor’s polynomial

Example

Let x = 1 in (3.4); then from (3.2):

e ≈ p_n(1) = 1 + 1 + 1/2! + 1/3! + ... + 1/n!

and from (3.4),

R_n(1) = e − p_n(1) = e^c/(n+1)!,   0 < c < 1

Since e < 3 and e^0 ≤ e^c ≤ e^1, we have

1/(n+1)! ≤ R_n(1) ≤ e/(n+1)! < 3/(n+1)!

1. Taylor polynomials Math 1070

Page 45: Numerical Mathematical Analysis

> 1. Taylor polynomials > 1.2 The error in Taylor’s polynomial

Suppose we want to approximate e by p_n(1) with

R_n(1) ≤ 10^{−9}.

We have to take n ≥ 12 to guarantee

3/(n+1)! ≤ 10^{−9},

i.e., to have p_12 as a sufficiently good approximation to e.
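The required degree is easy to confirm numerically (a minimal MATLAB sketch, not from the slides):

n = 0;
while 3/factorial(n+1) > 1e-9
    n = n + 1;    % increase the degree until the bound is met
end
n                 % displays n = 12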

1. Taylor polynomials Math 1070

Page 46: Numerical Mathematical Analysis

> 1. Taylor polynomials > 1.2 The error in Taylor’s polynomial

Exercise

Expand √(1+h) in powers of h, then compute √1.00001.

If f(x) = x^{1/2}, then f′(x) = (1/2)x^{−1/2}, f′′(x) = −(1/4)x^{−3/2}, f′′′(x) = (3/8)x^{−5/2}, ...

√(1+h) = 1 + (1/2)h − (1/8)h^2 + (1/16)h^3 ε^{−5/2},   1 < ε < 1 + h, if h > 0.

Let h = 10^{−5}. Then

√1.00001 ≈ 1 + 0.5×10^{−5} − 0.125×10^{−10} = 1.00000 49999 87500

Since 1 < ε < 1 + h, the absolute error does not exceed

(1/16) h^3 ε^{−5/2} < (1/16) · 10^{−15} = 0.00000 00000 00000 0625

and the numerical value is correct to all 15 decimal places shown.

1. Taylor polynomials Math 1070

Page 47: Numerical Mathematical Analysis

> 1. Taylor polynomials > 1.2 The error in Taylor’s polynomial

Approximations and remainder formulae

e^x = 1 + x + x^2/2! + ... + x^n/n! + (x^{n+1}/(n+1)!) e^c    (3.5)

sin(x) = x − x^3/3! + x^5/5! − ... + (−1)^{n−1} x^{2n−1}/(2n−1)! + (−1)^n (x^{2n+1}/(2n+1)!) cos(c)    (3.6)

cos(x) = 1 − x^2/2! + x^4/4! − ... + (−1)^n x^{2n}/(2n)! + (−1)^{n+1} (x^{2n+2}/(2n+2)!) cos(c)    (3.7)

1/(1−x) = 1 + x + x^2 + ... + x^n + x^{n+1}/(1−x),   x ≠ 1    (3.8)

(1+x)^α = 1 + (α choose 1) x + (α choose 2) x^2 + ... + (α choose n) x^n + (α choose n+1) x^{n+1} (1+c)^{α−n−1},   α ∈ R    (3.9)

Recall that c is between 0 and x, and the binomial coefficients are

(α choose κ) = α(α−1)···(α−κ+1)/κ!

1. Taylor polynomials Math 1070

Page 48: Numerical Mathematical Analysis

> 1. Taylor polynomials > 1.2 The error in Taylor’s polynomial

Example

Approximate cos(x) for |x| ≤ π/4 with an error no greater than 10^{−5}.

Since |cos(c)| ≤ 1 and

|R_{2n+1}(x)| ≤ |x|^{2n+2}/(2n+2)! ≤ 10^{−5},

we must have

(π/4)^{2n+2}/(2n+2)! ≤ 10^{−5}

which is satisfied when n ≥ 3. Hence

cos(x) ≈ 1 − x^2/2! + x^4/4! − x^6/6!
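The smallest such n can be checked directly (a minimal MATLAB sketch, not from the slides):

for n = 1:4
    bound = (pi/4)^(2*n+2)/factorial(2*n+2);
    fprintf('n = %d   bound = %.2e\n', n, bound)   % first drops below 1e-5 at n = 3
end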

1. Taylor polynomials Math 1070

Page 49: Numerical Mathematical Analysis

> 1. Taylor polynomials > 1.2 The error in Taylor’s polynomial

Indirect construction of a Taylor polynomial approximation

Remark

From (3.5), by replacing x with −t^2, we obtain

e^{−t^2} = 1 − t^2 + t^4/2! − t^6/3! + ... + (−1)^n t^{2n}/n! + ((−1)^{n+1} t^{2n+2}/(n+1)!) e^c,    (3.10)

with −t^2 ≤ c ≤ 0.

If we attempt to construct the Taylor approximation directly, the derivatives of e^{−t^2} quickly become too complicated.

1. Taylor polynomials Math 1070

Page 50: Numerical Mathematical Analysis

> 1. Taylor polynomials > 1.2 The error in Taylor’s polynomial

Indirect construction of a Taylor polynomial approximation

From (3.8) we can easily get:

−∫_0^t dx/(1−x) = −∫_0^t (1 + x + x^2 + ... + x^n) dx − ∫_0^t (x^{n+1}/(1−x)) dx

Applying the Integral Mean Value Theorem (3.12) below to the last integral,

ln(1−t) = −(t + (1/2)t^2 + ... + (1/(n+1))t^{n+1}) − (1/(1−c_t)) ∫_0^t x^{n+1} dx
        = −(t + (1/2)t^2 + ... + (1/(n+1))t^{n+1}) − (1/(1−c_t)) · t^{n+2}/(n+2),    (3.11)

where c_t is a number between 0 and t, −1 ≤ t ≤ 1.

Theorem (Integral Mean Value Theorem)

Let w(x) be a nonnegative integrable function on (a, b), and f ∈ C[a, b]. Then there exists at least one point c ∈ [a, b] for which

∫_a^b f(x) w(x) dx = f(c) ∫_a^b w(x) dx    (3.12)

1. Taylor polynomials Math 1070

Page 51: Numerical Mathematical Analysis

> 1. Taylor polynomials > 1.2 The error in Taylor’s polynomial

Infinite Series

By rearranging the terms in (3.8) we obtain the sum of the finite geometric series

1 + x + x^2 + ... + x^n = (1 − x^{n+1})/(1 − x),   x ≠ 1.    (3.13)

For |x| < 1, letting n → ∞ we obtain the infinite geometric series

1/(1−x) = 1 + x + x^2 + x^3 + ... = Σ_{j=0}^{∞} x^j,   |x| < 1.    (3.14)

1. Taylor polynomials Math 1070

Page 52: Numerical Mathematical Analysis

> 1. Taylor polynomials > 1.2 The error in Taylor’s polynomial

Infinite Series

Definition

The infinite series

Σ_{j=0}^{∞} c_j

is convergent if the partial sums

S_n = Σ_{j=0}^{n} c_j,   n ≥ 0

form a convergent sequence, i.e.,

∃ S = lim_{n→∞} S_n,

and we then write

S = Σ_{j=0}^{∞} c_j

1. Taylor polynomials Math 1070

Page 53: Numerical Mathematical Analysis

> 1. Taylor polynomials > 1.2 The error in Taylor’s polynomial

Infinite Series

For the infinite series (3.14) with |x| ≠ 1, the partial sums are given by (3.13):

S_n = (1 − x^{n+1})/(1 − x)

For |x| < 1, S_n → 1/(1−x) as n → ∞.

S_n diverges when |x| > 1.

What happens when |x| = 1?

1. Taylor polynomials Math 1070

Page 54: Numerical Mathematical Analysis

> 1. Taylor polynomials > 1.2 The error in Taylor’s polynomial

Infinite Series

Definition

Assume that f(x) has derivatives of every order at x = a. The infinite series

Σ_{j=0}^{∞} ((x−a)^j/j!) f^{(j)}(a)

is called the Taylor series expansion of the function f(x) about the point x = a.

The partial sum

Σ_{j=0}^{n} ((x−a)^j/j!) f^{(j)}(a)

is simply the Taylor polynomial p_n(x).

1. Taylor polynomials Math 1070

Page 55: Numerical Mathematical Analysis

> 1. Taylor polynomials > 1.2 The error in Taylor’s polynomial

Infinite Series

If the sequence {p_n(x)} has the limit f(x), i.e., the error tends to zero as n → ∞,

lim_{n→∞} (f(x) − p_n(x)) = 0,

then we can write

f(x) = Σ_{j=0}^{∞} ((x−a)^j/j!) f^{(j)}(a)

1. Taylor polynomials Math 1070

Page 56: Numerical Mathematical Analysis

> 1. Taylor polynomials > 1.2 The error in Taylor’s polynomial

Infinite series

Actually, it can be shown that the error terms in (3.5)-(3.9) and (3.11) tend to 0 as n → ∞ for suitable values of x. Hence the Taylor expansions

e^x = Σ_{j=0}^{∞} x^j/j!,   −∞ < x < ∞

sin x = Σ_{j=0}^{∞} (−1)^j x^{2j+1}/(2j+1)!,   −∞ < x < ∞

cos x = Σ_{j=0}^{∞} (−1)^j x^{2j}/(2j)!,   −∞ < x < ∞    (3.15)

(1+x)^α = Σ_{j=0}^{∞} (α choose j) x^j,   −1 < x < 1

ln(1−t) = −Σ_{j=1}^{∞} t^j/j,   −1 ≤ t < 1

1. Taylor polynomials Math 1070

Page 57: Numerical Mathematical Analysis

> 1. Taylor polynomials > 1.2 The error in Taylor’s polynomial

Infinite series

Definition

Infinite series of the form

Σ_{j=0}^{∞} a_j (x−a)^j    (3.16)

are called power series.

They can arise from Taylor's formulae or in other ways. Their convergence can be examined directly.

Theorem (Comparison criterion)

Assume the series (3.16) converges for some value x_0. Then the series (3.16) converges for all x satisfying |x − a| ≤ |x_0 − a|.

Theorem (Quotient criterion)

For the series (3.16), assume that the limit

R = lim_{n→∞} |a_{n+1}/a_n|

exists. Then for x satisfying |x − a| < 1/R, the series (3.16) converges to a limit S(x). When R = 0, the series (3.16) converges for any x ∈ R.

1. Taylor polynomials Math 1070

Page 58: Numerical Mathematical Analysis

> 1. Taylor polynomials > 1.2 The error in Taylor’s polynomial

Infinite series

Example

Consider the power series for cos in (3.15). Letting t = x^2, we obtain the series

Σ_{j=0}^{∞} (−1)^j t^j/(2j)!    (3.17)

Applying the quotient criterion with

a_j = (−1)^j/(2j)!

we find R = 0, so the series (3.17) converges for any value of t; hence the series in formula (3.15) converges for any value of x.

1. Taylor polynomials Math 1070

Page 59: Numerical Mathematical Analysis

> 1. Taylor polynomials > 1.3 Polynomial evaluation

Consider the evaluation of the polynomial

p(x) = 3 − 4x − 5x^2 − 6x^3 + 7x^4 − 8x^5

1 simplest method: compute each term independently, i.e., as c*x^k (or c*x**k), yielding

1 + 2 + 3 + 4 + 5 = 15 multiplications

2 a more efficient method: compute each power of x using the preceding one:

x^3 = x·(x^2),  x^4 = x·(x^3),  x^5 = x·(x^4)    (3.18)

Since each term takes two multiplications for k > 1, the result will be

1 + 2 + 2 + 2 + 2 = 9 multiplications

3 nested multiplication:

p(x) = 3 + x(−4 + x(−5 + x(−6 + x(7 − 8x))))

with only 5 multiplications

1. Taylor polynomials Math 1070

Page 60: Numerical Mathematical Analysis

> 1. Taylor polynomials > 1.3 Polynomial evaluation

Consider now the general polynomial of degree n:

p(x) = a_0 + a_1 x + ... + a_n x^n,   a_n ≠ 0

If we use the 2nd method, with the powers of x computed as in (3.18), the number of multiplications in evaluating p(x) is 2n − 1.

For the nested multiplication, we write and evaluate p(x) in the form

p(x) = a_0 + x(a_1 + x(a_2 + ... + x(a_{n−1} + a_n x) ... ))    (3.19)

using only n multiplications, saving about 50% over the 2nd method.

All methods use n additions.

1. Taylor polynomials Math 1070

Page 61: Numerical Mathematical Analysis

> 1. Taylor polynomials > 1.3 Polynomial evaluation

Example

Evaluate the Taylor polynomial p5(x) for ln(x) about a = 1.

A general formula is (3.11), with t replaced by −(x−1), yielding

p_5(x) = (x−1) − (1/2)(x−1)^2 + (1/3)(x−1)^3 − (1/4)(x−1)^4 + (1/5)(x−1)^5

With w = x − 1, the nested form is

p_5(x) = w(1 + w(−1/2 + w(1/3 + w(−1/4 + (1/5)w)))).

1. Taylor polynomials Math 1070

Page 62: Numerical Mathematical Analysis

> 1. Taylor polynomials > 1.3 Polynomial evaluation

A more formal algorithm than (3.19)

Suppose we want to evaluate p(x) at the number z. Define the sequence of coefficients b_i as:

b_n = a_n    (3.20)
b_{n−1} = a_{n−1} + z b_n    (3.21)
b_{n−2} = a_{n−2} + z b_{n−1}    (3.22)
...    (3.23)
b_0 = a_0 + z b_1    (3.24)

Now the nested multiplication is Horner's method:

p(z) = a_0 + z(a_1 + z(a_2 + ... + z(a_{n−1} + a_n z) ... ))

where, working outward from the innermost parentheses, the successive partial results are b_n, b_{n−1}, ..., b_1, and the final value is b_0.
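The recursion translates directly into a short MATLAB function (a minimal sketch, not from the slides; the name hornereval is hypothetical):

function p = hornereval(a, z)
% a(1) = a_0, ..., a(n+1) = a_n, since MATLAB arrays start at index 1
n = length(a);
p = a(n);              % this is b_n
for i = n-1:-1:1
    p = a(i) + z*p;    % b_{i-1} = a_{i-1} + z*b_i
end

For the degree-5 polynomial above, hornereval([3 -4 -5 -6 7 -8], 2) returns p(2) = -217 using 5 multiplications.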

1. Taylor polynomials Math 1070

Page 63: Numerical Mathematical Analysis

> 1. Taylor polynomials > 1.3 Polynomial evaluation

Hence

p(z) = b_0

With the coefficients from (3.20)-(3.24), define the polynomial

q(x) = b_1 + b_2 x + b_3 x^2 + ... + b_n x^{n−1}

It can be shown that

p(x) = b_0 + (x − z) q(x),

i.e., q(x) is the quotient from dividing p(x) by x − z, and b_0 is the remainder.

Remark

We will use this property later in a polynomial rootfinding method, to reduce the degree of a polynomial once a root z has been found, since then b_0 = 0 and p(x) = (x − z) q(x).

1. Taylor polynomials Math 1070

Page 64: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic

Numerical analysis is concerned with how to solve a problem numerically, i.e., how to develop a sequence of numerical calculations that yields a satisfactory answer.

Part of this process is the consideration of the errors that arise in these calculations, from the errors in the arithmetic operations or from other sources.

2. Error and Computer Arithmetic Math 1070

Page 65: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic

Computers use binary arithmetic, representing each number as a binary number: a finite sum of integer powers of 2.

Some numbers can be represented exactly, but others, such as 1/10, 1/100, 1/1000, ..., cannot.

For example,

2.125 = 2^1 + 2^{−3}

has an exact representation in binary (base 2), but

3.1 ≈ 2^1 + 2^0 + 2^{−4} + 2^{−5} + 2^{−8} + ...

does not. And, of course, there are transcendental numbers like π that have no finite representation in either the decimal or the binary number system.

2. Error and Computer Arithmetic Math 1070

Page 66: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic

Computers use 2 formats for numbers.

Fixed-point numbers are used to store integers. Typically, each number is stored in a computer word of 32 binary digits (bits) with values of 0 and 1, so at most 2^32 different numbers can be stored. If we allow for negative numbers, we can represent integers in the range −2^31 ≤ x ≤ 2^31 − 1, since there are 2^32 such numbers. Since 2^31 ≈ 2.1×10^9, the range for fixed-point numbers is too limited for scientific computing:

we always get an integer answer;
the numbers that we can store are equally spaced;
very limited range of numbers.

Therefore they are used mostly for indices and counters.

An alternative to fixed-point: floating-point numbers approximate real numbers.

the numbers that we can store are NOT equally spaced;
a wide range of variably-spaced numbers can be represented exactly.

2. Error and Computer Arithmetic Math 1070

Page 67: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.1 Floating-point numbers

Numbers must be stored and used for arithmetic operations. Storing:

1 integer format
2 floating-point format

Definition (decimal floating-point representation)

Consider x ≠ 0 written in the decimal system. Then it can be written uniquely as

x = σ · x̄ · 10^e    (4.1)

where

σ = +1 or −1 is the sign,
e is an integer, the exponent,
1 ≤ x̄ < 10; x̄ is the significand or mantissa.

Example ( 124.62 = 1.2462 · 10^2 )

σ = +1, the exponent e = 2, the significand x̄ = 1.2462

2. Error and Computer Arithmetic Math 1070

Page 68: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.1 Floating-point numbers

The decimal floating-point representation of x ∈ R is given in (4.1), with limitations on the

1 number of digits in the mantissa x̄
2 size of e

Example

Suppose we limit

1 the number of digits in x̄ to 4
2 −99 ≤ e ≤ 99

We say that a computer with such a representation has four-digit decimal floating-point arithmetic. This implies that we cannot store accurately more than the first four digits of a number; and even the fourth digit may be changed by rounding.

What is the smallest number bigger than 1? What is the smallest number bigger than 100? What are the errors and relative errors?

What is the smallest positive number?

2. Error and Computer Arithmetic Math 1070

Page 69: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.1 Floating-point numbers

Definition (floating-point representation of a binary number x)

Consider x written in binary format. Analogous to (4.1),

x = σ · x̄ · 2^e    (4.2)

where

σ = +1 or −1 is the sign,
e is an integer, the exponent,
x̄ is a binary fraction satisfying (1)_2 ≤ x̄ < (10)_2 (in decimal: 1 ≤ x̄ < 2).

For example, if

x = (11011.0111)_2

then σ = +1, e = 4 = (100)_2 and x̄ = (1.10110111)_2.

2. Error and Computer Arithmetic Math 1070

Page 70: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.1 Floating-point numbers

The floating-point representation of a binary number x is given by (4.2) with a restriction on the

1 number of digits in x̄: the precision of the binary floating-point representation of x
2 size of e

The IEEE floating-point arithmetic standard is the format for floating-point numbers used in almost all computers.

The IEEE single precision floating-point representation of x has a precision of 24 binary digits, and the exponent e is limited by −126 ≤ e ≤ 127:

x = σ · (1.a_1 a_2 ... a_23) · 2^e

where, in binary,

−(1111110)_2 ≤ e ≤ (1111111)_2

The IEEE double precision floating-point representation of x has a precision of 53 binary digits, and the exponent e is limited by −1022 ≤ e ≤ 1023:

x = σ · (1.a_1 a_2 ... a_52) · 2^e

2. Error and Computer Arithmetic Math 1070

Page 71: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.1 Floating-point numbers

The IEEE single precision floating-point representation of x has a precision of 24 binary digits, and the exponent e is limited by −126 ≤ e ≤ 127:

x = σ · (1.a_1 a_2 ... a_23) · 2^e

stored in 4 bytes (32 bits):

bit b_1: the sign σ;  bits b_2 ... b_9: the biased exponent E = e + 127;  bits b_10 ... b_32: the significand x̄

The IEEE double precision floating-point representation of x has a precision of 53 binary digits, and the exponent e is limited by −1022 ≤ e ≤ 1023:

x = σ · (1.a_1 a_2 ... a_52) · 2^e

stored in 8 bytes (64 bits):

bit b_1: the sign σ;  bits b_2 ... b_12: the biased exponent E = e + 1023;  bits b_13 ... b_64: the significand x̄

2. Error and Computer Arithmetic Math 1070

Page 72: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.1.1 Accuracy of floating-point representation

Epsilon machine

How accurately can a number be stored in the floating-point representation? How can this be measured?

(1) Machine epsilon

For any format, the machine epsilon is the difference between 1 and the next larger number that can be stored in that format.

In single precision IEEE, the next larger binary number is

1.00...01   (a 1 in the last significand bit a_23, all earlier fraction bits 0; 1 + 2^{−24} cannot be stored exactly)

Then the machine epsilon in single precision IEEE format is

2^{−23} ≈ 1.19×10^{−7}

i.e., we can store approximately 7 decimal digits of a number x given in decimal format.

2. Error and Computer Arithmetic Math 1070

Page 73: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.1.1 Accuracy of floating-point representation

The machine epsilon in double precision IEEE format is

2^{−52} ≈ 2.22×10^{−16}

so the IEEE double precision format can be used to store approximately 16 decimal digits of a number x given in decimal format. In MATLAB, the machine epsilon is available as the constant eps.
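A quick console check (a minimal sketch, not from the slides; display details vary by MATLAB version):

>> eps
ans = 2.2204e-16
>> 1 + eps > 1
ans = 1          % eps is the gap between 1 and the next larger number
>> 1 + eps/2 == 1
ans = 1          % anything smaller than eps is rounded away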

2. Error and Computer Arithmetic Math 1070

Page 74: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.1.1 Accuracy of floating-point representation

Another way to measure the accuracy of a floating-point format:

(2) look for the largest integer M such that any integer x with 0 ≤ x ≤ M can be stored and represented exactly in floating-point form.

If n is the number of binary digits in the significand, all integers less than or equal to

(1.11...1)_2 · 2^{n−1} = (1 + 2^{−1} + 2^{−2} + ... + 2^{−(n−1)}) · 2^{n−1} = ((1 − 2^{−n})/(1 − 1/2)) · 2^{n−1} = 2^n − 1

can be represented exactly.

In IEEE single precision format,

M = 2^24 = 16777216

and all 7-digit decimal integers will store exactly.

In IEEE double precision format,

M = 2^53 ≈ 9.0×10^15

and all 15-digit decimal integers and most 16-digit ones will store exactly.

2. Error and Computer Arithmetic Math 1070

Page 75: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.1.2 Rounding and chopping

Suppose a number x has the significand

x̄ = 1.a_1 a_2 ... a_{n−1} a_n a_{n+1} ...

but the floating-point representation may contain only n binary digits. Then x must be shortened when stored.

Definition

We denote the machine floating-point version of x by fl(x). There are two ways to produce it:

1 truncate or chop x to n binary digits, ignoring the remaining digits;
2 round x to n binary digits, based on the size of the part of x following digit n:

(a) if digit n+1 is 0, chop x to n digits;
(b) if digit n+1 is 1, chop x to n digits and add 1 to the last digit of the result.

2. Error and Computer Arithmetic Math 1070

Page 76: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.1.2 Rounding and chopping

It can be shown that

fl(x) = x · (1 + ε)    (4.3)

where ε is a small number depending on x.

(a) If chopping is used,

−2^{−n+1} ≤ ε ≤ 0    (4.4)

(b) If rounding is used,

−2^{−n} ≤ ε ≤ 2^{−n}    (4.5)

Characteristics of chopping:
1 the worst possible error is twice as large as when rounding is used;
2 the sign of the error x − fl(x) is the same as the sign of x.

The worst of the two: no possibility of cancellation of errors.

Characteristics of rounding:
1 the worst possible error is only half as large as when chopping is used;
2 more important: the error x − fl(x) is negative for only half the cases, which leads to better error-propagation behavior.

2. Error and Computer Arithmetic Math 1070

Page 77: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.1.2 Rounding and chopping

For single precision IEEE floating-point arithmetic (there are n = 24 digits in the significand):

1 chopping ("rounding towards zero"):

−2^{−23} ≤ ε ≤ 0    (4.6)

2 standard rounding:

−2^{−24} ≤ ε ≤ 2^{−24}    (4.7)

For double precision IEEE floating-point arithmetic:

1 chopping:

−2^{−52} ≤ ε ≤ 0    (4.8)

2 rounding:

−2^{−53} ≤ ε ≤ 2^{−53}    (4.9)

2. Error and Computer Arithmetic Math 1070

Page 78: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic >Conseq. for programming of floating-point arithm.

Numbers that have finite decimal expressions may have infinite binary expansions. For example,

(0.1)_10 = (0.000110011001100110011...)_2

Hence (0.1)_10 cannot be represented exactly in binary floating-point arithmetic. Possible problems:

Run into infinite loops.

Pay attention to the language used:

Fortran and C have both single and double precision; specify double precision constants correctly.
MATLAB does all computations in double precision.
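The practical symptom is that equality tests on computed decimals fail (a minimal MATLAB sketch, not from the slides):

>> 0.1 + 0.1 + 0.1 == 0.3
ans = 0                                % the two sides differ in the last bits
>> abs(0.1 + 0.1 + 0.1 - 0.3) < 1e-12
ans = 1                                % compare with a tolerance instead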

2. Error and Computer Arithmetic Math 1070

Page 79: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.2 Errors: Definitions, Sources and Examples

Definition

The error in a computed quantity is defined as

Error(x_A) = x_T − x_A

where x_T = true value and x_A = approximate value. This is also called the absolute error.

The relative error Rel(x_A) is a measure of error related to the size of the true value:

Rel(x_A) = error / true value = (x_T − x_A)/x_T

For example, for the approximation π ≈ 22/7, we have x_T = π = 3.14159265... and x_A = 22/7 = 3.1428571:

Error(22/7) = π − 22/7 ≈ −0.00126

Rel(22/7) = (π − 22/7)/π ≈ −0.000402

2. Error and Computer Arithmetic Math 1070

Page 80: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.2 Errors: Definitions, Sources and Examples

The notion of relative error is a more intrinsic error measure.

1 The exact distance between 2 cities is x_T^1 = 100 km and the measured distance is x_A^1 = 99 km:

Error(x_A^1) = x_T^1 − x_A^1 = 1 km
Rel(x_A^1) = Error(x_A^1)/x_T^1 = 0.01 = 1%

2 The exact distance between 2 cities is x_T^2 = 2 km and the measured distance is x_A^2 = 1 km:

Error(x_A^2) = x_T^2 − x_A^2 = 1 km
Rel(x_A^2) = Error(x_A^2)/x_T^2 = 0.5 = 50%

2. Error and Computer Arithmetic Math 1070

Page 81: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.2 Errors: Definitions, Sources and Examples

Definition (significant digits)

The number of significant digits in x_A is the number of its leading digits that are correct relative to the corresponding digits in the true value x_T.

More precisely, write x_A and x_T in decimal form and align the error against the digits of x_T:

x_T = a_1 a_2 . a_3 ... a_m a_{m+1} a_{m+2} ...
|x_T − x_A| = 0 0 . 0 ... 0 b_{m+1} b_{m+2} ...

If the error is ≤ 5 units in the (m+1)st digit of x_T, counting rightward from the first nonzero digit, then we say that x_A has, at least, m significant digits of accuracy relative to x_T.

Example

1 x_A = 0.222, x_T = 2/9:

x_T = 0.222222...
|x_T − x_A| = 0.000222...

⇒ 3 significant digits

2. Error and Computer Arithmetic Math 1070

Page 82: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.2 Errors: Definitions, Sources and Examples

Example

1 x_A = 23.496, x_T = 23.494:

|x_T − x_A| = 0.002  ⇒ 4 significant digits

2 x_A = 0.02138, x_T = 0.02144:

|x_T − x_A| = 0.00006  ⇒ 2 significant digits

3 x_A = 22/7 = 3.1428571..., x_T = π = 3.14159265...:

|x_T − x_A| = 0.00126448...  ⇒ 3 significant digits

2. Error and Computer Arithmetic Math 1070

Page 83: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.2.1 Sources of error

Errors in a scientific-mathematical-computational problem:

1 Original errors

(E1) Modeling errors
(E2) Blunders and mistakes
(E3) Physical measurement errors
(E4) Machine representation and arithmetic errors
(E5) Mathematical approximation errors

2 Consequences of errors

(F1) Loss-of-significance errors
(F2) Noise in function evaluation
(F3) Underflow and overflow errors

2. Error and Computer Arithmetic Math 1070

Page 84: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.2.1 Sources of error

(E1) Modeling errors: mathematical equations are used to represent physical reality - a mathematical model.

The Malthusian growth model (it can be accurate for some stages of growth of a population with unlimited resources) is

N(t) = N_0 e^{kt},   N_0, k ≥ 0

where N(t) = population at time t. For large t the model overestimates the actual population. It

accurately models the growth of the US population for 1790 ≤ t ≤ 1860, with k = 0.02975, N_0 = 3,929,000 · e^{−1790k},

but considerably overestimates the actual population from 1870 on.

2. Error and Computer Arithmetic Math 1070

Page 85: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.2.1 Sources of error

(E2) Blunders and mistakes: mostly programming errors.

Test by using cases where you know the solution.
Break the program into small subprograms that can be tested separately.

(E3) Physical measurement errors. For example, the speed of light in vacuum is

c = (2.997925 + ε) · 10^10 cm/sec,   |ε| ≤ 0.000003

Due to the error in the data, the calculations will contain the effect of this observational error. Numerical analysis cannot remove the error in the data, but it can

look at its propagated effect in a calculation, and
suggest the best form for a calculation that will minimize the propagated effect of errors in data.

2. Error and Computer Arithmetic Math 1070

Page 86: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.2.1 Sources of error

(E4) Machine representation and arithmetic errors: for example, errors from rounding and chopping.

They are inevitable when using floating-point arithmetic.
They form the main source of errors with some problems, e.g., solving systems of linear equations. We will look at the effect of rounding errors for some summation procedures.

(E5) Mathematical approximation errors: the major form of error that we will look at.

For example, consider evaluating the integral

I = ∫_0^1 e^{−x^2} dx.

Since there is no elementary antiderivative for e^{−x^2}, I cannot be evaluated explicitly. Instead, we approximate it with a quantity that can be computed.

2. Error and Computer Arithmetic Math 1070

Page 87: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.2.1 Sources of error

Using the Taylor approximation

e^{−x^2} ≈ 1 − x^2 + x^4/2! − x^6/3! + x^8/4!

we can easily evaluate

I ≈ ∫_0^1 (1 − x^2 + x^4/2! − x^6/3! + x^8/4!) dx

with the truncation error being estimated by (3.10).
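Integrating term by term gives a concrete value (a worked check, not from the slides):

I ≈ 1 − 1/3 + 1/10 − 1/42 + 1/216 ≈ 0.74749,

while the true value is I = 0.746824...; the discrepancy of about 6.6×10^{−4} is consistent with the first neglected term, whose integral is 1/(11·5!) ≈ 7.6×10^{−4}.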

2. Error and Computer Arithmetic Math 1070

Page 88: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.2.2 Loss-of-significance errors

Loss of significant digits

Example

Consider the evaluation of

f(x) = x(√(x+1) − √x)

for an increasing sequence of values of x.

x         Computed f(x)   True f(x)
1         0.414210        0.414214
10        1.54340         1.54347
100       4.99000         4.98756
1000      15.8000         15.8074
10,000    50.0000         49.9988
100,000   100.000         158.113

Table: results of using a 6-digit decimal calculator

As x increases, there are fewer digits of accuracy in the computed value f(x).

2. Error and Computer Arithmetic Math 1070

Page 89: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.2.2 Loss-of-significance errors

What happens? For x = 100:

√100 = 10.0000 (exact),   √101 = 10.0499 (rounded)

where √101 is correctly rounded to 6 significant digits of accuracy. Then

√(x+1) − √x = √101 − √100 = 0.0499000

while the true value should be 0.0498756. The calculation has a loss-of-significance error. Three digits of accuracy in √(x+1) = √101 were canceled by subtraction of the corresponding digits in √x = √100. The loss of accuracy was a by-product of

the form of f(x) and
the finite-precision 6-digit decimal arithmetic being used.

2. Error and Computer Arithmetic Math 1070

Page 90: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.2.2 Loss-of-significance errors

For this particular f, there is a simple way to reformulate it and avoid the loss-of-significance error: rationalizing gives

f(x) = x/(√(x+1) + √x)

which on a 6-digit decimal calculator will imply

f(100) = 4.98756

the correct answer to six digits.
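Even in double precision the same cancellation appears for large enough x (a minimal MATLAB sketch, not from the slides):

x = 1e15;
x*(sqrt(x+1) - sqrt(x))     % cancellation-prone form: several digits wrong
x/(sqrt(x+1) + sqrt(x))     % stable form: approximately 1.58113883e7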

2. Error and Computer Arithmetic Math 1070

Page 91: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.2.2 Loss-of-significance errors

Example

Consider the evaluation of

f(x) = (1 − cos(x))/x^2

for a sequence of values of x approaching 0.

x         Computed f(x)   True f(x)
0.1       0.4995834700    0.4995834722
0.01      0.4999960000    0.4999958333
0.001     0.5000000000    0.4999999583
0.0001    0.5000000000    0.4999999996
0.00001   0.0             0.5000000000

Table: results of using a 10-digit decimal calculator

2. Error and Computer Arithmetic Math 1070

Page 92: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.2.2 Loss-of-significance errors

Look at the calculation when x = 0.01:

cos(0.01) = 0.9999500004   (= 0.999950000416665)

has nine significant digits of accuracy, being off in the tenth digit by two units. Next,

1 − cos(0.01) = 0.0000499996   (= 4.999958333495869e-05)

which has only five significant digits, with four digits being lost in the subtraction. To avoid the loss of significant digits due to the subtraction of nearly equal quantities, we use the Taylor approximation (3.7) for cos(x) about x = 0:

cos(x) = 1 − x^2/2! + x^4/4! − x^6/6! + R_6(x)

R_6(x) = (x^8/8!) cos(ξ),   ξ an unknown number between 0 and x.

2. Error and Computer Arithmetic Math 1070

Page 93: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.2.2 Loss-of-significance errors

Hence

f(x) = (1/x^2){1 − [1 − x^2/2! + x^4/4! − x^6/6! + R_6(x)]} = 1/2! − x^2/4! + x^4/6! − (x^6/8!) cos(ξ)

giving f(0) = 1/2. For |x| ≤ 0.1,

|(x^6/8!) cos(ξ)| ≤ 10^{−6}/8! ≈ 2.5×10^{−11}

Therefore, with this accuracy,

f(x) ≈ 1/2! − x^2/4! + x^4/6!,   |x| ≤ 0.1
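The difference between the two evaluations is easy to see near 0 (a minimal MATLAB sketch, not from the slides):

x = 1e-5;
(1 - cos(x))/x^2            % direct form: noisy trailing digits
0.5 - x^2/24 + x^4/720      % series form: accurate to full precision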

2. Error and Computer Arithmetic Math 1070

Page 94: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.2.2 Loss-of-significance errors

Remark

When two nearly equal quantities are subtracted, leading significant digits will be lost.

In the previous two examples, this was easy to recognize, and we found ways to avoid the loss of significance. More often, the loss of significance is subtle and difficult to detect, as in calculating sums (for example, in approximating a function f(x) by a Taylor polynomial). If the value of the sum is relatively small compared to the terms being summed, then there are probably some significant digits of accuracy being lost in the summation process.

2. Error and Computer Arithmetic Math 1070

Page 95: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.2.2 Loss-of-significance errors

Example

Consider using the Taylor series approximation for e^x to evaluate e^{−5}:

e^{−5} = 1 + (−5)/1! + (−5)^2/2! + (−5)^3/3! + (−5)^4/4! + ...

Degree  Term      Sum        Degree  Term        Sum
0        1.000     1.000     13      -0.1960     -0.04230
1       -5.000    -4.000     14       0.7001E-1   0.02771
2       12.50      8.500     15      -0.2334E-1   0.004370
3      -20.83    -12.33      16       0.7293E-2   0.01166
4       26.04     13.71      17      -0.2145E-2   0.009518
5      -26.04    -12.33      18       0.5958E-3   0.01011
6       21.70      9.370     19      -0.1568E-3   0.009957
7      -15.50     -6.130     20       0.3920E-4   0.009996
8        9.688     3.558     21      -0.9333E-5   0.009987
9       -5.382    -1.824     22       0.2121E-5   0.009989
10       2.691     0.8670    23      -0.4611E-6   0.009989
11      -1.223    -0.3560    24       0.9607E-7   0.009989
12       0.5097    0.1537    25      -0.1921E-7   0.009989

Table. Calculation of e^{−5} = 0.006738 using four-digit decimal arithmetic

2. Error and Computer Arithmetic Math 1070

Page 96: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.2.2 Loss-of-significance errors

There are loss-of-significance errors in the calculation of the sum. Avoiding the loss of significance is simple in this case: compute

e^{−5} = 1/e^5 = 1/(series for e^5)

and form e^5 with a series involving no cancellation of positive and negative terms.

2. Error and Computer Arithmetic Math 1070

Page 97: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.2.3 Noise in function evaluation

Consider evaluating a continuous function f for all x ∈ [a, b]. The graph is a continuous curve. When evaluating f on a computer using floating-point arithmetic (with rounding or chopping), the errors from the arithmetic operations cause the graph to cease being a continuous curve. Let us look at

f(x) = (x − 1)^3 = −1 + x(3 + x(−3 + x))

on [0, 2].

2. Error and Computer Arithmetic Math 1070

Page 98: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.2.3 Noise in function evaluation

Figure: f(x) = x^3 − 3x^2 + 3x − 1

2. Error and Computer Arithmetic Math 1070

Page 99: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.2.3 Noise in function evaluation

Figure: Detailed graph of f(x) = x^3 − 3x^2 + 3x − 1 near x = 1

Here is a plot of the computed values of f(x) for 81 evenly spaced values of x ∈ [0.99998, 1.00002]. A rootfinding program might consider f(x) to have a very large number of solutions near 1, based on the many sign changes!

2. Error and Computer Arithmetic Math 1070

Page 100: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.2.4 Underflow and overflow errors

From the definition of floating-point numbers, there are upper and lower limits for the magnitudes of the numbers that can be expressed in floating-point form. Attempts to create numbers

that are too small lead to underflow errors: the default option is to set the number to zero and proceed;

that are too large lead to overflow errors: these are generally fatal errors on most computers. With the IEEE floating-point format, overflow errors can be carried along as having a value of ±∞ or NaN, depending on the context. Usually, an overflow error is an indication of a more significant problem or error in the program.

2. Error and Computer Arithmetic Math 1070

Page 101: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.2.4 Underflow and overflow errors

Example (underflow errors)

Consider evaluating f(x) = x^10 for x near 0.

With IEEE single precision arithmetic, the smallest nonzero positive number expressible in normalized floating-point form is

m = 2^{−126} ≈ 1.18×10^{−38}

So f(x) is set to zero if

x^10 < m,   i.e.,   |x| < m^{1/10} ≈ 1.61×10^{−4},   i.e.,   −0.000161 < x < 0.000161

2. Error and Computer Arithmetic Math 1070

Page 102: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.2.4 Underflow and overflow errors

Sometimes it is possible to eliminate the overflow error by just reformulating the expression being evaluated.

Example (overflow errors)

Consider evaluating z = √(x^2 + y^2).

If x or y is very large, then x^2 + y^2 might create an overflow error, even though z might be within the floating-point range of the machine. Instead, compute

z = |x| √(1 + (y/x)^2)   if 0 ≤ |y| ≤ |x|,
z = |y| √(1 + (x/y)^2)   if 0 ≤ |x| ≤ |y|.

In both cases, the argument of √(1 + w^2) has |w| ≤ 1, which will not cause any overflow error (except when z is too large to be expressed in the floating-point format used).
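The rescaling packages naturally as a function (a minimal MATLAB sketch, not from the slides; the name safehypot is hypothetical, and MATLAB's built-in hypot performs the same job):

function z = safehypot(x, y)
% scale by the larger magnitude so the squared ratio never exceeds 1
m = max(abs(x), abs(y));
if m == 0
    z = 0;                            % avoid 0/0 when x = y = 0
else
    z = m*sqrt((x/m)^2 + (y/m)^2);
end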

2. Error and Computer Arithmetic Math 1070

Page 103: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.3 Propagation of Error

When doing calculations with numbers that contain an error, the result will be affected by these errors.

If x_A, y_A denote numbers used in a calculation, corresponding to x_T, y_T, we wish to bound the propagated error

E = (x_T ω y_T) − (x_A ω y_A)

where ω denotes any of the operations "+", "−", "·", "÷".

The first technique used to bound E is interval arithmetic. Suppose we know bounds on x_T − x_A and y_T − y_A. Using these bounds and x_A ω y_A, we look for an interval guaranteed to contain x_T ω y_T.

2. Error and Computer Arithmetic Math 1070

Page 104: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.3 Propagation of Error

Example

Let x_A = 3.14 and y_A = 2.651 be correctly rounded from x_T and y_T, to the number of digits shown. Then

|x_A − x_T| ≤ 0.005,   |y_A − y_T| ≤ 0.0005
3.135 ≤ x_T ≤ 3.145,   2.6505 ≤ y_T ≤ 2.6515    (4.10)

For addition (ω = "+"):

x_A + y_A = 5.791    (4.11)

The true value, from (4.10):

5.7855 = 3.135 + 2.6505 ≤ x_T + y_T ≤ 3.145 + 2.6515 = 5.7965    (4.12)

To bound E, subtract (4.11) from (4.12) to get

−0.0055 ≤ (x_T + y_T) − (x_A + y_A) ≤ 0.0055

2. Error and Computer Arithmetic Math 1070

Page 105: Numerical Mathematical Analysis

> 2. Error and Computer Arithmetic > 2.3 Propagation of Error

Example

For division (ω = "÷"):

xA/yA = 3.14/2.651 ·= 1.184459   (4.13)

The true value, from (4.10),

3.135/2.6515 ≤ xT/yT ≤ 3.145/2.6505;

dividing and rounding to 7 digits:

1.182350 ≤ xT/yT ≤ 1.186569.

The bound on E:

−0.002109 ≤ xT/yT − 1.184459 ≤ 0.002110.


Propagated error in multiplication

The relative error in xA yA compared to xT yT is

Rel(xA yA) = (xT yT − xA yA)/(xT yT).

If xT = xA + ε, yT = yA + η, then (or use xA = xT(1 − Rel(xA)))

Rel(xA yA) = (xT yT − (xT − ε)(yT − η))/(xT yT)
           = (η xT + ε yT − ε η)/(xT yT)
           = ε/xT + η/yT − (ε/xT)(η/yT)
           = Rel(xA) + Rel(yA) − Rel(xA) Rel(yA).

If Rel(xA) ≪ 1 and Rel(yA) ≪ 1, then

Rel(xA yA) ≈ Rel(xA) + Rel(yA).


Propagated error in division

The relative error in xA/yA compared to xT/yT is

Rel(xA/yA) = (xT/yT − xA/yA)/(xT/yT).

If xT = xA + ε, yT = yA + η, then

Rel(xA/yA) = (xT yA − xA yT)/(xT yA)
           = (xT(yT − η) − (xT − ε)yT)/(xT(yT − η))
           = (−xT η + ε yT)/(xT(yT − η))
           = (ε/xT − η/yT)/(1 − η/yT)
           = (Rel(xA) − Rel(yA))/(1 − Rel(yA)).

If Rel(yA) ≪ 1, then

Rel(xA/yA) ≈ Rel(xA) − Rel(yA).


Propagated error in addition and subtraction

(xT ± yT) − (xA ± yA) = (xT − xA) ± (yT − yA) = ε ± η,
Error(xA ± yA) = Error(xA) ± Error(yA).

This can be misleading: we can have a much larger Rel(xA ± yA) for small values of Rel(xA) and Rel(yA) (this source of error is closely connected to loss-of-significance errors).

Example

1. xT = xA = 13; yT = √168, yA = 12.961:
   Rel(xA) = 0, Rel(yA) ·= 0.0000371,
   Error(xA − yA) ·= −0.0004814, Rel(xA − yA) ·= −0.0125.

2. xT = π, xA = 3.1416; yT = 22/7, yA = 3.1429:
   xT − xA ·= −7.35 × 10^(−6), Rel(xA) ·= −2.34 × 10^(−6),
   yT − yA ·= −4.29 × 10^(−5), Rel(yA) ·= −1.36 × 10^(−5),
   (xT − yT) − (xA − yA) ·= −0.0012645 − (−0.0013) ·= 3.55 × 10^(−5),
   Rel(xA − yA) ·= −0.028.


Total calculation error

With floating-point arithmetic on a computer, xA ω yA involves an additional rounding or chopping error, as in (4.13). Hence the total error in computing xA ω yA (the complete operation on the computer, involving the propagated error plus the rounding or chopping error) is

xT ω yT − fl(xA ω yA) = [xT ω yT − xA ω yA] + [xA ω yA − fl(xA ω yA)],

where the first bracket is the propagated error and the second is the error in computing xA ω yA.

When using IEEE arithmetic with the basic arithmetic operations,

fl(xA ω yA) = (1 + ε)(xA ω yA)   by (4.3),

where ε is as in (4.4)–(4.5), we get

xA ω yA − fl(xA ω yA) = −ε(xA ω yA), i.e., (xA ω yA − fl(xA ω yA))/(xA ω yA) = −ε.

Hence the process of rounding or chopping introduces a relatively small new error into fl(xA ω yA) as compared with xA ω yA.


> 2. Error and Computer Arithmetic > 2.3.1 Propagated error in function evaluation

Evaluate f(x) ∈ C^1[a,b] at the approximate value xA instead of xT. Using the mean value theorem,

f(xT) − f(xA) = f′(c)(xT − xA), c between xT and xA,
             ≈ f′(xT)(xT − xA) ≈ f′(xA)(xT − xA),   (4.14)

and

Rel(f(xA)) ≈ [f′(xT)/f(xT)](xT − xA) = [f′(xT)/f(xT)] xT · Rel(xA).   (4.15)


Example: ideal gas law

PV = nRT

with R a constant for all gases:

R = 8.3143 + ε, |ε| ≤ 0.0012.

Let us evaluate T assuming P = V = n = 1: T = 1/R, i.e., evaluate f(x) = 1/x at x = R. Let

xT = R, xA = 8.3143, |xT − xA| ≤ 0.0012.

For the error E = 1/R − 1/8.3143 we have, using (4.14),

|E| = |f(xT) − f(xA)| ≈ |f′(xA)| |xT − xA| ≤ (1/xA²)(0.0012) ·= 0.0000174.

Hence the uncertainty ε in R leads to a relatively small error in the computed value 1/R ·= 1/8.3143.

Example: evaluate b^x, b > 0

Since f′(x) = (ln b) b^x, by (4.15):

b^(xT) − b^(xA) ≈ (ln b) b^(xT) (xT − xA),
Rel(b^(xA)) ≈ (ln b) xT · Rel(xA),

where K = (ln b) xT is the conditioning number. If the conditioning number is large, K ≫ 1, then

Rel(b^(xA)) ≫ Rel(xA).

For example, if Rel(xA) = 10^(−7) and K = 10^4, then Rel(b^(xA)) ·= 10^(−3), independent of how b^(xA) is actually computed.


> 2. Error and Computer Arithmetic > 2.4 Summation

Let S denote the sum

S = a1 + a2 + · · · + an = Σ_{j=1}^n aj,

where each aj is a floating-point number. It takes n − 1 additions, each of which will probably involve a rounding or chopping error:

S2 = fl(a1 + a2) = (a1 + a2)(1 + ε2)   by (4.3)
S3 = fl(a3 + S2) = (a3 + S2)(1 + ε3)
S4 = fl(a4 + S3) = (a4 + S3)(1 + ε4)
...
Sn = fl(an + S_{n−1}) = (an + S_{n−1})(1 + εn)

with Sn the computed version of S. Each εj satisfies the bounds (4.6)–(4.9), assuming IEEE arithmetic is used. Then

S − Sn ≈ −a1(ε2 + · · · + εn) − a2(ε2 + · · · + εn) − a3(ε3 + · · · + εn) − a4(ε4 + · · · + εn) − · · · − an εn.   (4.16)


Trying to minimize the error S − Sn:

1. before summing, arrange the terms a1, a2, ..., an so they are increasing in size:

|a1| ≤ |a2| ≤ |a3| ≤ · · · ≤ |an|;

2. then sum.

Then the terms in (4.16) with the largest numbers of εj's are multiplied by the smaller values among the aj's.

Example: let aj be the decimal fraction 1/j rounded to 4 significant digits, on a decimal machine with 4 significant digits. SL = computing S by adding from smallest to largest; LS = computing S by adding from largest to smallest. (A MATLAB sketch of the ordering effect follows.)
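Exact four-digit decimal arithmetic is awkward to emulate, so the sketch below illustrates the same ordering effect in IEEE single precision (our own construction, not the book's experiment):

% Sum 1/j, j = 1..n, in single precision: smallest-to-largest (SL)
% versus largest-to-smallest (LS), compared with a double precision sum.
n = 100000;
a = single(1 ./ (1:n));                               % terms rounded to single
SL = single(0); for j = n:-1:1, SL = SL + a(j); end   % small terms first
LS = single(0); for j = 1:n,    LS = LS + a(j); end   % large terms first
ref = sum(1 ./ (1:n));                                % double precision value
fprintf('SL error = %.3g, LS error = %.3g\n', ref-double(SL), ref-double(LS))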


n     True   SL     Error   LS     Error
10    2.929  2.929  0.001   2.927  0.002
25    3.816  3.813  0.003   3.806  0.010
50    4.499  4.491  0.008   4.479  0.020
100   5.187  5.170  0.017   5.142  0.045
200   5.878  5.841  0.037   5.786  0.092
500   6.793  6.692  0.101   6.569  0.224
1000  7.486  7.284  0.202   7.069  0.417

Table: Calculating S on a machine using chopping

n     True   SL      Error    LS     Error
10    2.929  2.929   0        2.929  0
25    3.816  3.816   0        3.817  −0.001
50    4.499  4.500   −0.001   4.498  0.001
100   5.187  5.187   0        5.187  0
200   5.878  5.878   0        5.876  0.002
500   6.793  6.794   0.001    6.783  0.010
1000  7.486  7.486   0        7.449  0.037

Table: Calculating S on a machine using rounding

> 2. Error and Computer Arithmetic > 2.4.1 Rounding versus chopping

A more important difference in the errors in the previous tables is that between rounding and chopping: rounding leads to a far smaller error in the calculated sum than does chopping. Let us go back to the first term in (4.16):

T ≡ −a1(ε2 + · · · + εn).   (4.17)

(1) Assuming that the previous example used rounding on a four-digit decimal machine, we know that all εj satisfy

−0.0005 ≤ εj ≤ 0.0005.   (4.18)

Treating the rounding errors as random, the positive and negative values of the εj's will tend to cancel, and the sum T will be nearly zero. Moreover, from probability theory,

|T| ≤ 1.49 · 0.0005 · √n · |a1|

with high probability, hence T is proportional to √n and remains small until n becomes quite large. Similarly for the total error in (4.16).


(2) For chopping with the four-digit decimal machine, (4.18) is replaced by

−0.001 ≤ εj ≤ 0,

and all errors have one sign. Again, the errors will be random in this interval, with average −0.0005, so (4.17) will likely be approximately

−a1 · (n − 1) · (−0.0005),

hence T is proportional to n, which increases more rapidly than √n.

Thus the errors (4.17) and (4.16) grow more rapidly when chopping is used rather than rounding.


Example (difference between rounding and chopping on summation)

Consider

S = Σ_{j=1}^n 1/j

in single precision accuracy of six decimal digits. Errors occur both in the calculation of 1/j and in the summation process.

n     True        Rounding error   Chopping error
10    2.92896825  −1.76E−7         3.01E−7
50    4.49920534   7.00E−7         3.56E−6
100   5.18737752  −4.12E−7         6.26E−6
500   6.79282343  −1.32E−6         3.59E−5
1000  7.48547086   8.88E−8         7.35E−5

Table: Calculation of S: rounding versus chopping

The true values were calculated using double precision arithmetic to evaluate S; all sums were performed from the smallest term to the largest.


> 2. Error and Computer Arithmetic > 2.4.2 A loop error

Suppose we wish to calculate

x = a + jh,   (4.19)

for j = 0, 1, 2, ..., n, for given h > 0; x is used later to evaluate a function f(x).

Question

Should we compute x using (4.19), or by using

x = x + h   (4.20)

inside the loop, having initially set x = a before beginning the loop?

Remark: These are mathematically equivalent ways to compute x, but they are usually not computationally equivalent. The difficulty arises when h does not have a finite binary expansion that can be stored in the given floating-point significand; for example, h = 0.1.

The computation (4.19), x = a + jh, involves two arithmetic operations, hence only two chopping or rounding errors, for each value of x. In contrast, the repeated use of (4.20), x = x + h, involves a succession of j additions to reach the x of (4.19). As x increases in size, the use of (4.20) accumulates a larger number of rounding or chopping errors, leading to a different quantity than (4.19). Usually (4.19) is the preferred way to evaluate x.

Example

Compute x in the two ways, with a = 0 and h = 0.1, and compare against the true value computed in double precision. (A small MATLAB sketch follows; the results are tabulated below.)
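A minimal sketch of this experiment in MATLAB, using single precision so the effect is visible (double precision plays the role of the true value; variable names ours):

% Compare x = a + j*h (two roundings per x) with the running update
% x = x + h (j accumulated roundings), for a = 0, h = 0.1.
a = single(0); h = single(0.1); n = 100;
x_loop = a;
for j = 1:n
   x_loop   = x_loop + h;             % (4.20): accumulates rounding errors
   x_direct = a + single(j)*h;        % (4.19): two operations only
end
x_true = n*0.1;                       % double precision reference
fprintf('error (4.19) = %.3g, error (4.20) = %.3g\n', ...
        x_true - double(x_direct), x_true - double(x_loop))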


j     x    Error using (4.19)   Error using (4.20)
10    1    1.49E−8              −1.04E−7
20    2    2.98E−8              −2.09E−7
30    3    4.47E−8               7.60E−7
40    4    5.96E−8               1.73E−6
50    5    7.45E−8               2.46E−6
60    6    8.94E−8               3.43E−6
70    7    1.04E−7               4.40E−6
80    8    1.19E−7               5.36E−6
90    9    1.34E−7               2.04E−6
100   10   1.49E−7              −1.76E−6

Table: Evaluation of x = j·h, h = 0.1


> 2. Error and Computer Arithmetic > 2.4.3 Calculation of inner products

Definition (inner product)

A sum of the form

S = a1 b1 + a2 b2 + · · · + an bn = Σ_{j=1}^n aj bj   (4.21)

is called a dot product or inner product.

(1) If we calculate S in single precision, there is a single precision rounding error for each multiplication and each addition, i.e., 2n − 1 single precision rounding errors to calculate S. The consequences of these errors can be analyzed as for (4.16), and an optimal strategy for calculating (4.21) can be derived.


(2) Using double precision:

1. convert each aj and bj to double precision by extending their significands with zeros;
2. multiply in double precision;
3. sum in double precision;
4. round to single precision to obtain the calculated value of S.

For machines with IEEE arithmetic, this is a simple and rapid procedure to obtain more accurate inner products in single precision. There is no increase in storage space for the arrays A = [a1, a2, ..., an] and B. The accuracy is improved since there is only one single precision rounding error, regardless of the size n. (A small sketch of the effect follows.)
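The gain is easy to demonstrate; a short MATLAB sketch (our own illustration):

% Inner product of single-precision data: accumulate in single
% (2n-1 single roundings) versus accumulate in double and round once.
n = 100000;
a = single(rand(1,n)); b = single(rand(1,n));
s1 = single(0);
for j = 1:n, s1 = s1 + a(j)*b(j); end           % all-single accumulation
s2 = single(sum(double(a).*double(b)));         % one final single rounding
ref = sum(double(a).*double(b));                % reference value
fprintf('single-accum error = %.3g, double-accum error = %.3g\n', ...
        ref - double(s1), ref - double(s2))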


> 4. Interpolation and Approximation

Interpolation


Most functions cannot be evaluated exactly:

√x, e^x, ln x, trigonometric functions,

since using a computer we are limited to the elementary arithmetic operations

+, −, ×, ÷.

With these operations we can only evaluate polynomials and rational functions (polynomials divided by polynomials).


Interpolation

Given points

x0, x1, ..., xn

and corresponding values

y0, y1, ..., yn,

find a function f(x) such that

f(xi) = yi, i = 0, ..., n.

The interpolation function f is usually taken from a restricted class of functions: polynomials.


> 4. Interpolation and Approximation > 4.1 Polynomial Interpolation Theory

Interpolation of functions

Given a function f(x), nodes x0, x1, ..., xn, and the values f(x0), f(x1), ..., f(xn), find a polynomial (or other special function) p such that

p(xi) = f(xi), i = 0, ..., n.

What is the error f(x) − p(x)?


> 4. Interpolation and Approximation > 4.1.1 Linear interpolation

Linear interpolation

Given two data points (x0, y0) and (x1, y1) with x0 ≠ x1, draw a line through them, i.e., the graph of the linear polynomial

ℓ(x) = (x − x1)/(x0 − x1) · y0 + (x − x0)/(x1 − x0) · y1,

or equivalently

ℓ(x) = ((x1 − x) y0 + (x − x0) y1)/(x1 − x0).   (5.1)

We say that ℓ(x) interpolates the value yi at the point xi, i = 0, 1, i.e., ℓ(xi) = yi, i = 0, 1.

Figure: Linear interpolation


Example

Let the data points be (1, 1) and (4, 2). The polynomial P1(x) is given by

P1(x) = ((4 − x) · 1 + (x − 1) · 2)/3.   (5.2)

The figure shows the graphs of y = P1(x) and y = √x, from which the data points were taken.

Figure: y = √x and its linear interpolating polynomial (5.2)


Example

Obtain an estimate of e^0.826 using the function values

e^0.82 ·= 2.270500, e^0.83 ·= 2.293319.

Denote x0 = 0.82, x1 = 0.83. The polynomial P1(x) interpolating e^x at x0 and x1 is

P1(x) = ((0.83 − x) · 2.270500 + (x − 0.82) · 2.293319)/0.01,   (5.3)

and P1(0.826) = 2.2841914, while the true value is

e^0.826 ·= 2.2841638

to eight significant digits.
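The computation is a one-line check in MATLAB (a direct transcription of (5.3)):

% Linear interpolation of e^x at x0 = 0.82, x1 = 0.83, from (5.3).
x0 = 0.82; x1 = 0.83; x = 0.826;
P1 = ((x1 - x)*exp(x0) + (x - x0)*exp(x1))/(x1 - x0);
fprintf('P1(0.826) = %.7f, error = %.3g\n', P1, exp(x) - P1)
% prints approximately 2.2841914 and -2.76e-5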


> 4. Interpolation and Approximation > 4.1.2 Quadratic Interpolation

Assume three data points (x0, y0), (x1, y1), (x2, y2), with x0, x1, x2 distinct. We construct the quadratic polynomial passing through these points using Lagrange's formula

P2(x) = y0 L0(x) + y1 L1(x) + y2 L2(x),   (5.4)

with the Lagrange interpolation basis functions for the quadratic interpolating polynomial

L0(x) = (x − x1)(x − x2)/((x0 − x1)(x0 − x2)),
L1(x) = (x − x0)(x − x2)/((x1 − x0)(x1 − x2)),
L2(x) = (x − x0)(x − x1)/((x2 − x0)(x2 − x1)).   (5.5)

Each Li(x) has degree 2, so P2(x) has degree ≤ 2. Moreover, for 0 ≤ i, j ≤ 2,

Li(xj) = δij = { 1, i = j; 0, i ≠ j },

the Kronecker delta. Thus P2(x) interpolates the data: P2(xi) = yi, i = 0, 1, 2.


Example

Construct P2(x) for the data points (0, −1), (1, −1), (2, 7). Then

P2(x) = ((x−1)(x−2)/2) · (−1) + (x(x−2)/(−1)) · (−1) + (x(x−1)/2) · 7.   (5.6)

Figure: The quadratic interpolating polynomial (5.6)


1. With linear interpolation it is obvious that there is only one straight line passing through two given data points.

2. With three data points there is also only one quadratic interpolating polynomial whose graph passes through the points.

Indeed: assume there exists Q2(x) with deg(Q2) ≤ 2 passing through (xi, yi), i = 0, 1, 2; then it is equal to P2(x). The polynomial

R(x) = P2(x) − Q2(x)

has deg(R) ≤ 2 and

R(xi) = P2(xi) − Q2(xi) = yi − yi = 0, for i = 0, 1, 2.

So R(x) is a polynomial of degree ≤ 2 with three roots, hence R(x) ≡ 0.


Example

Calculate a quadratic interpolate to e^0.826 from the function values

e^0.82 ·= 2.270500, e^0.83 ·= 2.293319, e^0.84 ·= 2.316367.

With x0 = 0.82, x1 = 0.83, x2 = 0.84, we have

P2(0.826) ·= 2.2841639

to eight digits, while the true answer is

e^0.826 ·= 2.2841638,

and P1(0.826) ·= 2.2841914.


> 4. Interpolation and Approximation > 4.1.3 Higher-degree interpolation

Lagrange’s Formula

Given n+1 data points (x0, y0), (x1, y1), ..., (xn, yn) with all xi distinct, there exists a unique Pn with deg(Pn) ≤ n such that

Pn(xi) = yi, i = 0, ..., n,

given by Lagrange's formula

Pn(x) = Σ_{i=0}^n yi Li(x) = y0 L0(x) + y1 L1(x) + · · · + yn Ln(x),   (5.7)

where

Li(x) = Π_{j=0, j≠i}^n (x − xj)/(xi − xj)
      = ((x − x0) · · · (x − x_{i−1})(x − x_{i+1}) · · · (x − xn)) / ((xi − x0) · · · (xi − x_{i−1})(xi − x_{i+1}) · · · (xi − xn)),

so that Li(xj) = δij.
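A short MATLAB sketch evaluating (5.7) directly (the function name lagrange_eval is ours; for repeated evaluation the divided difference form of Section 4.1.6 is preferred):

function p = lagrange_eval(xn, yn, x)
% Evaluate the Lagrange form (5.7): nodes xn, values yn, points x.
n = length(xn);
p = zeros(size(x));
for i = 1:n
   Li = ones(size(x));                 % build the basis function L_i(x)
   for j = [1:i-1, i+1:n]
      Li = Li .* (x - xn(j)) / (xn(i) - xn(j));
   end
   p = p + yn(i)*Li;
end
end

For the earlier quadratic example, lagrange_eval([0 1 2], [-1 -1 7], 2) returns 7, as it must.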


> 4. Interpolation and Approximation > 4.1.4 Divided differences

Remark. Lagrange's formula (5.7) is well suited for theoretical uses, but is impractical for computing the value of an interpolating polynomial: knowing P2(x) does not lead to a less expensive way to compute P3(x). For that we need some preliminaries, and we start with a discrete version of the derivative of a function f(x).

Definition (first-order divided difference)

Let x0 ≠ x1; we define the first-order divided difference of f(x) by

f[x0, x1] = (f(x1) − f(x0))/(x1 − x0).   (5.8)

If f(x) is differentiable on an interval containing x0 and x1, then the mean value theorem gives

f[x0, x1] = f′(c), for some c between x0 and x1.

Also, if x0 and x1 are close together, then

f[x0, x1] ≈ f′((x0 + x1)/2),

usually a very good approximation.


Example

Let f(x) = cos(x), x0 = 0.2, x1 = 0.3. Then

f[x0, x1] = (cos(0.3) − cos(0.2))/(0.3 − 0.2) ·= −0.2473009,   (5.9)

while

f′((x0 + x1)/2) = −sin(0.25) ·= −0.2474040,

so f[x0, x1] is a very good approximation of this derivative.


Higher-order divided differences are defined recursively.

Let x0, x1, x2 ∈ R be distinct. The second-order divided difference is

f[x0, x1, x2] = (f[x1, x2] − f[x0, x1])/(x2 − x0).   (5.10)

Let x0, x1, x2, x3 ∈ R be distinct. The third-order divided difference is

f[x0, x1, x2, x3] = (f[x1, x2, x3] − f[x0, x1, x2])/(x3 − x0).   (5.11)

In general, let x0, x1, ..., xn ∈ R be n+1 distinct numbers. The divided difference of order n is

f[x0, ..., xn] = (f[x1, ..., xn] − f[x0, ..., x_{n−1}])/(xn − x0),   (5.12)

also called the Newton divided difference.


Theorem

Let n ≥ 1, f ∈ C^n[α, β], and x0, x1, ..., xn be n+1 distinct numbers in [α, β]. Then

f[x0, x1, ..., xn] = (1/n!) f^(n)(c)   (5.13)

for some unknown point c between the maximum and the minimum of x0, ..., xn.

Example

Let f(x) = cos(x), x0 = 0.2, x1 = 0.3, x2 = 0.4. Then f[x0, x1] is given by (5.9), and

f[x1, x2] = (cos(0.4) − cos(0.3))/(0.4 − 0.3) ·= −0.3427550,

hence from (5.10)

f[x0, x1, x2] ·= (−0.3427550 − (−0.2473009))/(0.4 − 0.2) = −0.4772705.   (5.14)

For n = 2, (5.13) becomes

f[x0, x1, x2] = (1/2) f″(c) = −(1/2) cos(c) ≈ −(1/2) cos(0.3) ·= −0.4776682,

which is nearly equal to the result in (5.14).


> 4. Interpolation and Approximation > 4.1.5 Properties of divided differences

The divided differences (5.12) have special properties that help simplify work with them.

(1) Let (i0, i1, ..., in) be a permutation (rearrangement) of the integers (0, 1, ..., n). It can be shown that

f[x_{i0}, x_{i1}, ..., x_{in}] = f[x0, x1, ..., xn].   (5.15)

The original definition (5.12) seems to imply that the order of x0, x1, ..., xn is important, but (5.15) asserts that it is not.

1. For n = 1:

f[x0, x1] = (f(x0) − f(x1))/(x0 − x1) = (f(x1) − f(x0))/(x1 − x0) = f[x1, x0].

2. For n = 2 we can expand (5.10) to get

f[x0, x1, x2] = f(x0)/((x0−x1)(x0−x2)) + f(x1)/((x1−x0)(x1−x2)) + f(x2)/((x2−x0)(x2−x1)).

(2) The definitions (5.8) and (5.10)–(5.12) extend to the case where some or all of the xi coincide, provided that f(x) is sufficiently differentiable.


For example, define

f[x0, x0] = lim_{x1→x0} f[x0, x1] = lim_{x1→x0} (f(x1) − f(x0))/(x1 − x0) = f′(x0).

For arbitrary n ≥ 1, letting all xi → x0 leads to the definition

f[x0, ..., x0] = (1/n!) f^(n)(x0).   (5.16)

For cases where only some of the nodes coincide, we can extend the definition of the divided difference using (5.15) and (5.16). For example,

f[x0, x1, x0] = f[x0, x0, x1] = (f[x0, x1] − f[x0, x0])/(x1 − x0) = (f[x0, x1] − f′(x0))/(x1 − x0).


MATLAB program - evaluating divided differences: divdif.m

Given a set of values f(x0), ..., f(xn), we need to calculate the set of divided differences

f[x0, x1], f[x0, x1, x2], ..., f[x0, x1, ..., xn].

We can use the MATLAB function divdif with the function call

divdif_y = divdif(x_nodes, y_values)

Note that MATLAB does not allow zero subscripts, hence x_nodes and y_values have to be defined as vectors containing n+1 components:

x_nodes = [x0, x1, ..., xn],       x_nodes(i) = x_{i−1}, i = 1, ..., n+1
y_values = [f(x0), f(x1), ..., f(xn)],   y_values(i) = f(x_{i−1}), i = 1, ..., n+1


function divdif_y = divdif(x_nodes,y_values)
%
% This is a function
%   divdif_y = divdif(x_nodes,y_values)
% It calculates the divided differences of the function
% values given in the vector y_values, which are the values of
% some function f(x) at the nodes given in x_nodes. On exit,
%   divdif_y(i) = f[x_1,...,x_i], i=1,...,m
% with m the length of x_nodes. The input values x_nodes and
% y_values are not changed by this program.
%
divdif_y = y_values;
m = length(x_nodes);
for i=2:m
   for j=m:-1:i
      divdif_y(j) = (divdif_y(j)-divdif_y(j-1)) ...
                    /(x_nodes(j)-x_nodes(j-i+1));
   end
end


>> x = [0.0 0.2 0.4 0.6 0.8 1.0 1.2];
>> y = cos(x);
>> divdif(x,y)

This program is illustrated in this table.

i   xi    cos(xi)    Di
0   0.0   1.000000    0.1000000E+1
1   0.2   0.980067   −0.9966711E−1
2   0.4   0.921061   −0.4884020E+0
3   0.6   0.825336    0.4900763E−1
4   0.8   0.696707    0.3812246E−1
5   1.0   0.540302   −0.3962047E−2
6   1.2   0.362358   −0.1134890E−2

Table: Values and divided differences for cos(x)


> 4. Interpolation and Approximation > 4.1.6 Newton’s Divided Differences

Newton interpolation formula

Interpolation of f at x0, x1, ..., xn. Idea: use P_{n−1}(x) in the definition of Pn(x):

Pn(x) = P_{n−1}(x) + C(x),

where C(x) ∈ Pn is a correction term with C(xi) = 0, i = 0, ..., n−1. Hence

C(x) = a(x − x0)(x − x1) · · · (x − x_{n−1}),
Pn(x) = a x^n + · · ·


Definition: The divided difference of f(x) at the points x0, ..., xn is

f[x0, x1, ..., xn] := the coefficient of x^n in Pn(x; f).

Then

C(x) = f[x0, x1, ..., xn](x − x0) · · · (x − x_{n−1}),
pn(x; f) = p_{n−1}(x; f) + f[x0, x1, ..., xn](x − x0) · · · (x − x_{n−1}),   (5.17)

p1(x; f) = f(x0) + f[x0, x1](x − x0),   (5.18)
p2(x; f) = f(x0) + f[x0, x1](x − x0) + f[x0, x1, x2](x − x0)(x − x1),   (5.19)
...
pn(x; f) = f(x0) + f[x0, x1](x − x0) + · · · + f[x0, ..., xn](x − x0) · · · (x − x_{n−1}),   (5.20)

⇒ the Newton interpolation formula.


1. For (5.18), consider p1(x0) and p1(x1). Easily, p1(x0) = f(x0), and

p1(x1) = f(x0) + (x1 − x0)·[(f(x1) − f(x0))/(x1 − x0)] = f(x0) + [f(x1) − f(x0)] = f(x1).

So deg(p1) ≤ 1 and p1 satisfies the interpolation conditions. Then by the uniqueness of polynomial interpolation, (5.18) is the linear interpolation polynomial to f(x) at x0, x1.

2. For (5.19), note that

p2(x) = p1(x) + (x − x0)(x − x1) f[x0, x1, x2].

It satisfies deg(p2) ≤ 2 and

p2(xi) = p1(xi) + 0 = f(xi), i = 0, 1,
p2(x2) = f(x0) + (x2 − x0) f[x0, x1] + (x2 − x0)(x2 − x1) f[x0, x1, x2]
       = f(x0) + (x2 − x0) f[x0, x1] + (x2 − x1){ f[x1, x2] − f[x0, x1] }
       = f(x0) + (x1 − x0) f[x0, x1] + (x2 − x1) f[x1, x2]
       = f(x0) + { f(x1) − f(x0) } + { f(x2) − f(x1) } = f(x2).

By the uniqueness of polynomial interpolation, this is the quadratic interpolating polynomial to f(x) at {x0, x1, x2}.


Example

Find p(x) ∈ P2 such that p(−1) = 0, p(0) = 1, p(1) = 4. Use

p(x) = p(x0) + p[x0, x1](x − x0) + p[x0, x1, x2](x − x0)(x − x1)

with the divided difference table

xi    p(xi)   p[xi, xi+1]   p[xi, xi+1, xi+2]
−1    0       1             1
 0    1       3
 1    4

so

p(x) = 0 + 1·(x + 1) + 1·(x + 1)(x − 0) = x² + 2x + 1.


Example

Let f(x) = cos(x). The previous table contains a set of nodes xi, the values f(xi), and the divided differences computed with divdif.m:

Di = f[x0, ..., xi], i ≥ 0.

n     pn(0.1)     pn(0.3)     pn(0.5)
1     0.9900333   0.9700999   0.9501664
2     0.9949173   0.9554478   0.8769061
3     0.9950643   0.9553008   0.8776413
4     0.9950071   0.9553351   0.8775841
5     0.9950030   0.9553369   0.8775823
6     0.9950041   0.9553365   0.8775825
True  0.9950042   0.9553365   0.8775826

Table: Interpolation to cos(x) using (5.20)

This table contains the values of pn(x) for various values of n, computed with interp.m, and the true values of f(x).


In general, the interpolation node points xi need not be evenly spaced, nor arranged in any particular order, to use the divided difference interpolation formula (5.20).

To evaluate (5.20) efficiently we can use a nested multiplication algorithm:

Pn(x) = D0 + (x−x0)D1 + (x−x0)(x−x1)D2 + · · · + (x−x0) · · · (x−x_{n−1})Dn,

with D0 = f(x0) and Di = f[x0, ..., xi] for i ≥ 1, so that

Pn(x) = D0 + (x−x0)[D1 + (x−x1)[D2 + · · · + (x−x_{n−2})[D_{n−1} + (x−x_{n−1})Dn] · · · ]].   (5.21)

For example,

P3(x) = D0 + (x−x0)[D1 + (x−x1)[D2 + (x−x2)D3]].

(5.21) uses only n multiplications to evaluate Pn(x) and is more convenient for a fixed degree n. To compute a sequence of interpolation polynomials of increasing degree, it is more efficient to use the original form (5.20).


MATLAB - evaluating Newton divided difference for polynomial interpolation: interp.m

function p_eval = interp(x_nodes,divdif_y,x_eval)
%
% This is a function
%   p_eval = interp(x_nodes,divdif_y,x_eval)
% It calculates the Newton divided difference form of
% the interpolation polynomial of degree m-1, where the
% nodes are given in x_nodes, m is the length of x_nodes,
% and the divided differences are given in divdif_y. The
% points at which the interpolation is to be carried out
% are given in x_eval; and on exit, p_eval contains the
% corresponding values of the interpolation polynomial.
%
m = length(x_nodes);
p_eval = divdif_y(m)*ones(size(x_eval));
for i=m-1:-1:1
   p_eval = divdif_y(i) + (x_eval - x_nodes(i)).*p_eval;
end
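A hypothetical session combining the two programs (assuming divdif.m and interp.m are on the MATLAB path):

>> x_nodes = [0.0 0.2 0.4 0.6 0.8 1.0 1.2];    % nodes of the earlier table
>> divdif_y = divdif(x_nodes, cos(x_nodes));   % divided differences D_i
>> interp(x_nodes, divdif_y, [0.1 0.3 0.5])
ans =
    0.9950    0.9553    0.8776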


> 4. Interpolation and Approximation > 4.2 Error in polynomial interpolation

Formula for the error E(t) = f(t) − pn(t; f), where

Pn(x) = Σ_{j=0}^n f(xj) Lj(x).

Theorem

Let n ≥ 0, f ∈ C^{n+1}[a,b], and x0, x1, ..., xn distinct points in [a,b]. Then

f(x) − Pn(x) = Ψn(x) · f[x0, x1, ..., xn, x]   (5.22)
             = Ψn(x) · f^{(n+1)}(cx)/(n+1)!   (5.23)

with Ψn(x) = (x − x0)(x − x1) · · · (x − xn), for x ∈ [a,b], where cx is an unknown point between the minimum and maximum of x0, x1, ..., xn, and x.


Proof of Theorem

Fix t ∈ R. Consider

p_{n+1}(x; f), which interpolates f at x0, ..., xn, t,
pn(x; f), which interpolates f at x0, ..., xn.

From the Newton formula,

p_{n+1}(x; f) = pn(x; f) + f[x0, ..., xn, t](x − x0) · · · (x − xn).

Let x = t. Then

f(t) = p_{n+1}(t; f) = pn(t; f) + f[x0, ..., xn, t](t − x0) · · · (t − xn),
E(t) = f[x0, x1, ..., xn, t](t − x0) · · · (t − xn). □


Examples

f(x) = x^n: pn(x; f) = 1 · x^n, so f[x0, ..., xn] = 1; equivalently, f[x0, ..., xn] = n!/n! = 1 by (5.13).

f(x) ∈ P_{n−1}: f[x0, x1, ..., xn] = 0.

f(x) = x^{n+1}: f[x0, ..., xn] = x0 + x1 + · · · + xn. Indeed,

R(x) = x^{n+1} − pn(x; f) ∈ P_{n+1}

has leading coefficient 1 and R(xi) = 0, i = 0, ..., n, so

R(x) = (x − x0) · · · (x − xn) = x^{n+1} − (x0 + · · · + xn)x^n + · · · ,

and comparing the coefficients of x^n gives f[x0, ..., xn] = x0 + x1 + · · · + xn.

If f ∈ Pm, then as a function of x,

f[x0, ..., xn, x] = { a polynomial of degree m − n − 1, if n ≤ m − 1;  0, if n > m − 1. }


Example

Take f(x) = e^x, x ∈ [0, 1], and consider the error in linear interpolation to f(x) using x0, x1 satisfying 0 ≤ x0 ≤ x1 ≤ 1.

From the Lagrange form of the remainder (5.23) we have

e^x − P1(x) = ((x − x0)(x − x1)/2) e^{cx},

for some cx between the max and min of x0, x1, and x. Assuming x0 ≤ x ≤ x1, the error is negative and approximately a quadratic polynomial:

e^x − P1(x) = −((x1 − x)(x − x0)/2) e^{cx}.

Since x0 ≤ cx ≤ x1,

((x1 − x)(x − x0)/2) e^{x0} ≤ |e^x − P1(x)| ≤ ((x1 − x)(x − x0)/2) e^{x1}.


For a bound independent of x,

max_{x0≤x≤x1} (x1 − x)(x − x0)/2 = h²/8, h = x1 − x0,

and e^{x1} ≤ e on [0, 1], so

|e^x − P1(x)| ≤ (h²/8) e, 0 ≤ x0 ≤ x ≤ x1 ≤ 1,

independent of x, x0, x1. Recall that we estimated e^0.826 ·= 2.2841914 using e^0.82 and e^0.83, i.e., h = 0.01. Then

|e^x − P1(x)| ≤ (h²/8) e ≤ (0.01²/8) · 2.72 = 0.0000340.

The actual error, −0.0000276, satisfies this bound.


Example

Again let f(x) = e^x on [0, 1], but consider quadratic interpolation:

e^x − P2(x) = ((x − x0)(x − x1)(x − x2)/6) e^{cx}

for some cx between the min and max of x0, x1, x2, and x. Assuming evenly spaced points, h = x1 − x0 = x2 − x1, and 0 ≤ x0 ≤ x ≤ x2 ≤ 1, we have as before

|e^x − P2(x)| ≤ |(x − x0)(x − x1)(x − x2)/6| · e^1,

while

max_{x0≤x≤x2} |(x − x0)(x − x1)(x − x2)/6| = h³/(9√3),   (5.24)

hence

|e^x − P2(x)| ≤ (h³/(9√3)) e ≈ 0.174 h³.

For h = 0.01 and 0 ≤ x ≤ 1:

|e^x − P2(x)| ≤ 1.74 × 10^(−7).


Let

w2(x) = (x + h)x(x − h)/6 = (x³ − x h²)/6,

a translation along the x-axis of the polynomial in (5.24). Then x = ±h/√3 satisfies

0 = w2′(x) = (3x² − h²)/6

and gives

|w2(±h/√3)| = h³/(9√3).

Figure: y = w2(x)


> 4. Interpolation and Approximation > 4.2.2 Behaviour of the error

When we consider the error formula (5.22) or (5.23), the polynomial

Ψn(x) = (x − x0) · · · (x − xn)

is crucial in determining the behaviour of the error. Let us assume that x0, ..., xn are evenly spaced and x0 ≤ x ≤ xn.

Figure: y = Ψ6(x)


Note that the interpolation error

is relatively larger in [x0, x1] and [x5, x6], and

is likely to be smaller near the middle of the node points.

⇒ In practical interpolation problems, high-degree polynomial interpolation with evenly spaced nodes is seldom used. But when the set of nodes is suitably chosen, high-degree polynomials can be very useful in obtaining polynomial approximations to functions.

Example

Let f(x) = cos(x), h = 0.2, n = 8, and interpolate at x = 0.9.

Case (i): x0 = 0.8, x8 = 2.4, so x = 0.9 ∈ [x0, x1]. By direct calculation of P8(0.9),

cos(0.9) − P8(0.9) ·= −5.51 × 10^(−9).

Case (ii): x0 = 0.2, x8 = 1.8, so x = 0.9 ∈ [x3, x4], where x4 is the midpoint. Then

cos(0.9) − P8(0.9) ·= 2.26 × 10^(−10),

a factor of 24 smaller than in the first case.


Example: the Runge example

Let f(x) = 1/(1 + x²), x ∈ [−5, 5], n > 0 an even integer, h = 10/n, and

xj = −5 + jh, j = 0, 1, 2, ..., n.

It can be shown that for many points x in [−5, 5], the sequence {Pn(x)} does not converge to f(x) as n → ∞. (A short MATLAB illustration follows.)

Figure: The interpolation to 1/(1 + x²)
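The divergence is easy to observe numerically; a sketch using the earlier divdif.m and interp.m (assumed to be on the path):

% Runge phenomenon: interpolate 1/(1+x^2) on [-5,5] at evenly spaced
% nodes; the maximum error grows as the degree n increases.
f = @(x) 1./(1+x.^2);
t = linspace(-5,5,1001);
for n = [4 8 16]
   xj = -5 + (0:n)*(10/n);              % n+1 evenly spaced nodes
   D  = divdif(xj, f(xj));              % divided differences
   fprintf('n = %2d, max error = %.3g\n', n, max(abs(f(t) - interp(xj,D,t))))
end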


> 4. Interpolation and Approximation > 4.3 Interpolation using spline functions

x   0     1     2     2.5   3     3.5     4
y   2.5   0.5   0.5   1.5   1.5   1.125   0

The simplest method of interpolating data in a table: connect the node points by straight lines. The curve y = ℓ(x) is not very smooth.

Figure: y = ℓ(x): piecewise linear interpolation

We want to construct a smooth curve that interpolates the given data points and follows the shape of y = ℓ(x).


Next choice: use polynomial interpolation. With 7 data points, we consider the interpolating polynomial P6(x).

Figure: y = P6(x): polynomial interpolation

Smooth, but quite different from y = ℓ(x)!


A third choice: connect the data in the table using a succession of quadratic interpolating polynomials, one on each of [0, 2], [2, 3], [3, 4].

Figure: y = q(x): piecewise quadratic interpolation

It is smoother than y = ℓ(x) and follows it more closely than y = P6(x), but at x = 2 and x = 3 the graph has corners, i.e., q′(x) is discontinuous.


> 4. Interpolation and Approximation > 4.3.1 Spline interpolation

Cubic spline interpolation

Suppose n data points (xi, yi), i = 1, ..., n, are given, with

a = x1 < ... < xn = b.

We seek a function S(x) defined on [a,b] that interpolates the data:

S(xi) = yi, i = 1, ..., n.

Find S(x) ∈ C²[a,b], a natural cubic spline, such that:
S(x) is a polynomial of degree ≤ 3 on each subinterval [x_{j−1}, xj], j = 2, 3, ..., n;
S(xi) = yi, i = 1, ..., n;
S″(a) = S″(b) = 0.

On each [x_{i−1}, xi]: 4 degrees of freedom (cubic) ⇒ 4(n − 1) DOF in total.


S(xi) = yi:                                n constraints
S′(xi − 0) = S′(xi + 0), i = 2, ..., n−1   (continuity)
S″(xi − 0) = S″(xi + 0), i = 2, ..., n−1
S(xi − 0) = S(xi + 0),   i = 2, ..., n−1   } 3(n − 2) = 3n − 6 constraints

In total: n + (3n − 6) = 4n − 6 constraints versus 4n − 4 unknowns (DOF).

For a natural cubic spline, the 2 extra constraints are the boundary conditions

S″ = 0 at x1 = a and xn = b.

(This leads to a linear system with a symmetric, positive definite, diagonally dominant matrix.)


> 4. Interpolation and Approximation > 4.3.2 Construction of the cubic spline

Introduce the variables M1, ..., Mn with Mi ≡ S″(xi), i = 1, 2, ..., n. Since S(x) is cubic on each [x_{j−1}, xj], S″(x) is linear there, hence determined by its values at the two endpoints:

S″(x_{j−1}) = M_{j−1}, S″(xj) = Mj

⇒ S″(x) = ((xj − x)M_{j−1} + (x − x_{j−1})Mj)/(xj − x_{j−1}), x_{j−1} ≤ x ≤ xj.   (5.25)

Forming the second antiderivative of S″(x) on [x_{j−1}, xj] and applying the interpolating conditions

S(x_{j−1}) = y_{j−1}, S(xj) = yj,

we get

S(x) = ((xj − x)³ M_{j−1} + (x − x_{j−1})³ Mj)/(6(xj − x_{j−1}))
     + ((xj − x) y_{j−1} + (x − x_{j−1}) yj)/(xj − x_{j−1})
     − (1/6)(xj − x_{j−1})[(xj − x)M_{j−1} + (x − x_{j−1})Mj]   (5.26)

for x ∈ [x_{j−1}, xj], j = 2, ..., n.


Formula (5.26) implies that

S(x) ∈ C[a,b] and S(xi) = yi (the interpolating conditions hold).

Similarly, formula (5.25) for S″(x) implies that

S″ ∈ C[a,b] (is continuous).

To ensure

S′ ∈ C[a,b],

we require that S′(x) on [x_{j−1}, xj] and on [xj, x_{j+1}] give the same value at their common point xj, j = 2, 3, ..., n−1:

((xj − x_{j−1})/6) M_{j−1} + ((x_{j+1} − x_{j−1})/3) Mj + ((x_{j+1} − xj)/6) M_{j+1}
= (y_{j+1} − yj)/(x_{j+1} − xj) − (yj − y_{j−1})/(xj − x_{j−1}), j = 2, 3, ..., n−1.   (5.27)

These n − 2 equations together with the assumption S″(a) = S″(b) = 0, i.e.,

M1 = Mn = 0,   (5.28)

determine the values M1, ..., Mn, hence the function S(x).


Example

Calculate the natural cubic spline interpolating the data

{(1, 1), (2, 1/2), (3, 1/3), (4, 1/4)}.

Here n = 4, and all xj − x_{j−1} = 1. The system (5.27) becomes

(1/6)M1 + (2/3)M2 + (1/6)M3 = 1/3,
(1/6)M2 + (2/3)M3 + (1/6)M4 = 1/12,

and with (5.28) this yields

M2 = 1/2, M3 = 0,

which by (5.26) gives

S(x) =
  (1/12)x³ − (1/4)x² − (1/3)x + 3/2,      1 ≤ x ≤ 2,
  −(1/12)x³ + (3/4)x² − (7/3)x + 17/6,    2 ≤ x ≤ 3,
  −(1/12)x + 7/12,                        3 ≤ x ≤ 4.

Are S′(x) and S″(x) continuous?


Example

Calculate the natural cubic spline interpolating the data

x   0     1     2     2.5   3     3.5     4
y   2.5   0.5   0.5   1.5   1.5   1.125   0

Here n = 7, and the system (5.27) has 5 equations.

Figure: Natural cubic spline interpolation y = S(x)

Compared to the graphs of the linear and piecewise quadratic interpolants y = ℓ(x) and y = q(x), the cubic spline S(x) no longer contains corners.


> 4. Interpolation and Approximation > 4.3.3 Other interpolation spline functions

So far we have only interpolated data points, wanting a smooth curve. When we seek a spline to interpolate a known function, we are also interested in the accuracy.

Let f(x) be given on [a,b], to be interpolated at evenly spaced values of x. For n > 1, let

h = (b − a)/(n − 1), xj = a + (j − 1)h, j = 1, 2, ..., n,

and let Sn(x) be the natural cubic spline interpolating f(x) at x1, ..., xn. It can be shown that

max_{a≤x≤b} |f(x) − Sn(x)| ≤ c h²,   (5.29)

where c depends on f″(a), f″(b), and max_{a≤x≤b} |f⁽⁴⁾(x)|. Sn(x) does not converge more rapidly (i.e., have an error bound with a higher power of h) when

f″(a) ≠ 0 ≠ f″(b),

since by definition Sn″(a) = Sn″(b) = 0.

For functions f(x) with f″(a) = f″(b) = 0, the right side of (5.29) can be replaced with c h⁴.


To improve on Sn(x), we look for other interpolating functions S(x) that interpolate f(x) on

a = x1 < x2 < ... < xn = b.

Recall the definition of the natural cubic spline:

1. S(x) is cubic on each subinterval [x_{j−1}, xj];
2. S(x), S′(x), and S″(x) are continuous on [a,b];
3. S″(x1) = S″(xn) = 0.

We say that S(x) is a cubic spline on [a,b] if

1. S(x) is cubic on each subinterval [x_{j−1}, xj];
2. S(x), S′(x), and S″(x) are continuous on [a,b].

With the interpolating conditions

S(xi) = yi, i = 1, ..., n,

the representation formula (5.26) and the tridiagonal system (5.27) are still valid.


This system has n − 2 equations and n unknowns: M1, ..., Mn. By replacing the end conditions (5.28), S″(x1) = S″(xn) = 0, we can obtain other interpolating cubic splines.

If the data (xi, yi) are obtained by evaluating a function f(x),

yi = f(xi), i = 1, ..., n,

then we choose endpoint (boundary) conditions for S(x) that yield a better approximation to f(x). We require

S′(x1) = f′(x1), S′(xn) = f′(xn),
or S″(x1) = f″(x1), S″(xn) = f″(xn).

When combined with (5.26)–(5.27), either of these conditions leads to a unique interpolating spline S(x), depending on which condition is used. In both cases, the right side of (5.29) can be replaced by c h⁴, where c depends on max_{x∈[a,b]} |f⁽⁴⁾(x)|.


If the derivatives of f(x) are not known, then extra interpolating conditions can be used to ensure that the error bound of (5.29) is proportional to h⁴. In particular, suppose that

x1 < z1 < x2, x_{n−1} < z2 < xn,

and f(z1), f(z2) are known. Then use the formula for S(x) in (5.26) and

S(z1) = f(z1), S(z2) = f(z2).   (5.30)

This adds two new equations to the system (5.27): one relating M1 and M2, and another relating M_{n−1} and Mn. This form is preferable to the interpolating natural cubic spline and is almost equally easy to produce. It is the default form of spline interpolation implemented in MATLAB; splines formed in this way are said to satisfy the not-a-knot interpolation boundary conditions.


Interpolating cubic spline functions are a popular way to represent data analytically because they

1. are relatively smooth: C²;
2. do not have the rapid oscillation that sometimes occurs with high-degree polynomial interpolation;
3. are relatively easy to work with on a computer.

They do not replace polynomials, but are a very useful extension of them.


> 4. Interpolation and Approximation > 4.3.4 The MATLAB program spline

The standard MATLAB distribution contains the function spline. The standard calling sequence is

y = spline(x_nodes, y_nodes, x)

which produces the cubic spline function S(x) whose graph passes through the points {(ξi, ηi) : i = 1, ..., n}, with

(ξi, ηi) = (x_nodes(i), y_nodes(i))

and n the length of x_nodes (and y_nodes). The not-a-knot interpolation conditions of (5.30) are used: the point (ξ2, η2) plays the role of (z1, f(z1)) in (5.30), and (ξ_{n−1}, η_{n−1}) that of (z2, f(z2)).


x = [0 1 2 2.5 3 3.5 4];
y = [2.5 0.5 .5 1.5 1.5 1.125 0];
t = 0:.1:4;
s = spline(x,y,t);
plot(x,y,'o',t,s,'LineWidth',1.5)

Figure: the data points and the not-a-knot cubic spline produced by spline


Example

Approximate the function f(x) = e^x on the interval [a,b] = [0,1]. For n > 0, define h = 1/n and the interpolation nodes

x1 = 0, x2 = h, x3 = 2h, ..., x_{n+1} = nh = 1.

Using spline, we produce the cubic interpolating spline S_{n,1} to f(x). With the not-a-knot interpolation conditions, the nodes x2 and xn are the points z1 and z2 in (5.30). For a general smooth function f(x), it turns out that the magnitude of the error f(x) − S_{n,1}(x) is largest around the endpoints of the interval of approximation.


function Cubic_spline_matlab_3(n)
x = 0:1/n:1;
y = exp(x);
t = 0:.01:1;
s = spline(x,y,t);
plot(x,y,'o',t,s,'LineWidth',1.5)

Figure: the nodes and the cubic spline interpolant to e^x on [0, 1]


Example

Two interpolation nodes are inserted: the midpoints of the subintervals [0, h] and [1 − h, 1]:

x1 = 0, x2 = h/2, x3 = h, x4 = 2h, ..., x_{n+1} = (n−1)h, x_{n+2} = 1 − h/2, x_{n+3} = 1.

Using spline results in a cubic spline function S_{n,2}(x); with the not-a-knot interpolation conditions, the nodes x2 and x_{n+2} are the points z1 and z2 of (5.30). Generally, S_{n,2}(x) is a more accurate approximation than S_{n,1}(x). The cubic polynomials produced for S_{n,2}(x) by spline on the intervals [x1, x2] and [x2, x3] are the same; thus we can use the polynomial for [0, h/2] on the entire interval [0, h]. Similarly for [1 − h, 1].

n    E_n^(1)   Ratio   E_n^(2)   Ratio
5    1.01E−4           1.11E−5
10   6.92E−6   14.6    7.88E−7   14.1
20   4.56E−7   15.2    5.26E−8   15.0
40   2.92E−8   15.6    3.39E−9   15.5

Table: Cubic spline approximation to f(x) = e^x


> 5. Numerical Integration

Numerical Integration


Review of Interpolation

Find pn(x) with pn(xj) = yj, j = 0, 1, 2, ..., n. Solution:

pn(x) = y0 ℓ0(x) + y1 ℓ1(x) + · · · + yn ℓn(x), ℓk(x) = Π_{j=0, j≠k}^n (x − xj)/(xk − xj).

Theorem

Let yj = f(xj), with f(x) smooth, and let pn interpolate f(x) at x0 < x1 < · · · < xn. For any x ∈ (x0, xn) there is a ξ such that

f(x) − pn(x) = (f^{(n+1)}(ξ)/(n+1)!) ψ(x), ψ(x) = (x − x0)(x − x1) · · · (x − xn).

Example (n = 1, linear):

f(x) − p1(x) = (f″(ξ)/2)(x − x0)(x − x1),
p1(x) = ((y1 − y0)/(x1 − x0))(x − x0) + y0.


> 5. Numerical Integration > 5.1 The Trapezoidal Rule

Mixed rule

Find the area under the curve y = f(x):

∫_a^b f(x)dx = Σ_{j=0}^{N−1} ∫_{xj}^{x_{j+1}} f(x)dx ≈ Σ_{j=0}^{N−1} (x_{j+1} − xj)(f(xj) + f(x_{j+1}))/2,

where on each (xj, x_{j+1}) we interpolate f(x) and integrate the interpolant exactly.


Trapezoidal Rule

∫_{xj}^{x_{j+1}} f(x)dx ≈ ∫_{xj}^{x_{j+1}} p1(x)dx (and integrate exactly)
= ∫_{xj}^{x_{j+1}} { ((f(x_{j+1}) − f(xj))/(x_{j+1} − xj))(x − xj) + f(xj) } dx
= (x_{j+1} − xj)(f(xj) + f(x_{j+1}))/2.

On a single interval [a, b] this is

T1(f) = (b − a)(f(a) + f(b))/2.   (6.1)


Another derivation of the trapezoidal rule: seek

∫_{xj}^{x_{j+1}} f(x)dx ≈ wj f(xj) + w_{j+1} f(x_{j+1}),   (6.2)

exact on polynomials of degree 1 (i.e., on 1 and x):

f(x) ≡ 1: ∫_{xj}^{x_{j+1}} dx = x_{j+1} − xj = wj · 1 + w_{j+1} · 1,
f(x) = x: ∫_{xj}^{x_{j+1}} x dx = x_{j+1}²/2 − xj²/2 = wj xj + w_{j+1} x_{j+1}.

Solving these two equations gives wj = w_{j+1} = (x_{j+1} − xj)/2.


Example

Approximate the integral

I = ∫_0^1 dx/(1 + x).

The true value is I = ln 2 ·= 0.693147. Using (6.1), we obtain

T1(f) = (1/2)(1 + 1/2) = 3/4 = 0.75,

and the error is

I − T1(f) ·= −0.0569.   (6.3)


To improve on the approximation (6.1) when f(x) is not nearly linear on [a, b]:

1. break the interval [a, b] into smaller subintervals, and
2. apply (6.1) on each subinterval.

If the subintervals are small enough, then f(x) will be nearly linear on each one.

Example

Evaluate the preceding example using T1(f) on 2 equal subintervals. For two subintervals,

I = ∫_0^{1/2} dx/(1+x) + ∫_{1/2}^1 dx/(1+x) ·= (1/2)·(1 + 2/3)/2 + (1/2)·(2/3 + 1/2)/2,

so

T2(f) = 17/24 ·= 0.70833,

and the error

I − T2(f) ·= −0.0152 (compare I − T1(f) ·= −0.0569)   (6.4)

is about 1/4 of that for T1 in (6.3).


General trapezoidal rule

We derive the general formula for calculations using n subintervals of equal length h = (b − a)/n. The endpoints of each subinterval are then

xj = a + jh, j = 0, 1, ..., n.

Breaking the integral into n subintegrals,

I(f) = ∫_a^b f(x)dx = ∫_{x0}^{xn} f(x)dx
     = ∫_{x0}^{x1} f(x)dx + ∫_{x1}^{x2} f(x)dx + · · · + ∫_{x_{n−1}}^{xn} f(x)dx
     ≈ h(f(x0) + f(x1))/2 + h(f(x1) + f(x2))/2 + · · · + h(f(x_{n−1}) + f(xn))/2.

The trapezoidal numerical integration rule:

Tn(f) = h( (1/2)f(x0) + f(x1) + f(x2) + · · · + f(x_{n−1}) + (1/2)f(xn) ).   (6.5)

(A MATLAB sketch of this rule follows.)
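A minimal sketch of a routine implementing (6.5), in the spirit of the trapezoidal.m used in the example below (the actual course file may differ):

function T = trapezoidal(f, a, b, n)
% Composite trapezoidal rule (6.5) on n equal subintervals:
% T = h*( f(x0)/2 + f(x1) + ... + f(x_{n-1}) + f(xn)/2 ).
h  = (b - a)/n;
fx = f(a + (0:n)*h);                      % values at the nodes x_j = a + j*h
T  = h*( sum(fx) - (fx(1) + fx(end))/2 );
end

For instance, trapezoidal(@(x) 1./(1+x), 0, 1, 2) returns 17/24 ·= 0.70833, reproducing T2(f) above.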


With a sequence of increasing values of n, Tn(f) will usually be an increasingly accurate approximation of I(f). But which sequence of n should be used?

If n is doubled repeatedly, then the function values used in each T_{2n}(f) include all earlier function values used in the preceding Tn(f). For example,

T2(f) = h( f(x0)/2 + f(x1) + f(x2)/2 ), with h = (b−a)/2, x0 = a, x1 = (a+b)/2, x2 = b.

Also

T4(f) = h( f(x0)/2 + f(x1) + f(x2) + f(x3) + f(x4)/2 ), with
h = (b−a)/4, x0 = a, x1 = (3a+b)/4, x2 = (a+b)/2, x3 = (a+3b)/4, x4 = b.

Only f(x1) and f(x3) need to be newly evaluated.


Example: use trapezoidal.m

We give calculations of Tn(f) for three integrals:

I^(1) = ∫_0^1 e^(−x²) dx ·= 0.746824132812427
I^(2) = ∫_0^4 dx/(1+x²) = tan⁻¹(4) ·= 1.32581766366803
I^(3) = ∫_0^{2π} dx/(2+cos(x)) = 2π/√3 ·= 3.62759872846844

n     I^(1) Error  Ratio   I^(2) Error  Ratio   I^(3) Error  Ratio
2     1.55E−2              −1.33E−1             −5.61E−1
4     3.84E−3      4.02    −3.59E−3     37.0    −3.76E−2     14.9
8     9.59E−4      4.01     5.64E−4    −6.37    −1.93E−4     195.0
16    2.40E−4      4.00     1.44E−4     3.92    −5.19E−9     37,600.0
32    5.99E−5      4.00     3.60E−5     4.00    *
64    1.50E−5      4.00     9.01E−6     4.00    *
128   3.74E−6      4.00     2.25E−6     4.00    *

The errors for I^(1) and I^(2) decrease by a factor of about 4 when n doubles; for I^(3), the answers for n = 32, 64, 128 were correct up to the limits of rounding error on the computer (16 decimal digits).


> 5. Numerical Integration > 5.1.1 The Cavalieri-Simpson rule

Cavalieri-Simpson rule [Bonaventura Cavalieri in 1635]

To improve on T1(f) in (6.1), use quadratic interpolation. Let P2(x) be the quadratic polynomial interpolating f(x) at a, c = (a+b)/2, and b:

I(f) ≈ ∫_a^b P2(x)dx (and integrate exactly)   (6.6)
     = ∫_a^b ( ((x−c)(x−b))/((a−c)(a−b)) f(a) + ((x−a)(x−b))/((c−a)(c−b)) f(c) + ((x−a)(x−c))/((b−a)(b−c)) f(b) ) dx.

This can be evaluated directly, or with a change of variables. (Is there another way? Think of (6.2)!)

Let h = (b−a)/2 and u = x − a. Then

∫_a^b ((x−c)(x−b))/((a−c)(a−b)) dx = (1/(2h²)) ∫_a^{a+2h} (x−c)(x−b) dx
= (1/(2h²)) ∫_0^{2h} (u−h)(u−2h) du = (1/(2h²)) ( u³/3 − (3/2)u²h + 2h²u ) |_0^{2h} = h/3,

and

S2(f) = (h/3)( f(a) + 4f((a+b)/2) + f(b) ).   (6.7)


Example

I = ∫_0^1 dx/(1+x).

Then h = (b−a)/2 = 1/2 and

S2(f) = ((1/2)/3)( 1 + 4·(2/3) + 1/2 ) = 25/36 ·= 0.69444,   (6.8)

and the error is

I − S2(f) = ln 2 − S2(f) ·= −0.00130,

while the error for the trapezoidal rule was (the number of function evaluations is the same for both S2 and T2)

I − T2(f) ·= −0.0152.

The error in S2 is smaller than that in (6.4) for T2 by a factor of about 12, a significant increase in accuracy.


Figure: An illustration of Simpson's rule (6.7): y = f(x) and y = P2(x)


The rule S2(f) will be an accurate approximation to I(f) if f(x) is nearly quadratic on [a,b]. For other cases, proceed in the same manner as for the trapezoidal rule.

Let n be an even integer, h = (b−a)/n, and define the evaluation points for f(x) by

xj = a + jh, j = 0, 1, ..., n.

Following the idea of the trapezoidal rule, we break [a,b] = [x0, xn] into larger intervals, each containing three interpolation node points:

I(f) = ∫_a^b f(x)dx = ∫_{x0}^{x2} f(x)dx + ∫_{x2}^{x4} f(x)dx + · · · + ∫_{x_{n−2}}^{xn} f(x)dx
≈ (h/3)[f(x0) + 4f(x1) + f(x2)] + (h/3)[f(x2) + 4f(x3) + f(x4)]
  + · · · + (h/3)[f(x_{n−2}) + 4f(x_{n−1}) + f(xn)].


General Simpson rule

Simpson's rule:

Sn(f) = (h/3)( f(x0) + 4f(x1) + 2f(x2) + 4f(x3) + 2f(x4) + · · · + 2f(x_{n−2}) + 4f(x_{n−1}) + f(xn) ).   (6.9)

It has been among the most popular numerical integration methods for more than two centuries. (A MATLAB sketch follows.)
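A minimal sketch of (6.9), in the spirit of the simpson.m used in the next example (n must be even; the actual course file may differ):

function S = simpson(f, a, b, n)
% Composite Simpson rule (6.9) on n equal subintervals, n even:
% weights 1,4,2,4,...,2,4,1 times h/3.
h  = (b - a)/n;
fx = f(a + (0:n)*h);                      % values at the nodes
S  = (h/3)*( fx(1) + fx(end) + 4*sum(fx(2:2:end-1)) + 2*sum(fx(3:2:end-2)) );
end

For instance, simpson(@(x) 1./(1+x), 0, 1, 2) returns 25/36 ·= 0.69444, matching (6.8).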


Example: use simpson.m

I^(1) = ∫_0^1 e^(−x²) dx ·= 0.746824132812427
I^(2) = ∫_0^4 dx/(1+x²) = tan⁻¹(4) ·= 1.32581766366803
I^(3) = ∫_0^{2π} dx/(2+cos(x)) = 2π/√3 ·= 3.62759872846844

n     I^(1) Error   Ratio   I^(2) Error  Ratio    I^(3) Error  Ratio
2     −3.56E−4              8.66E−2               −1.26
4     −3.12E−5      11.4    3.95E−2      2.2       1.37E−1     −9.2
8     −1.99E−6      15.7    1.95E−3      20.3      1.23E−2      11.2
16    −1.25E−7      15.9    4.02E−6      485.0     6.43E−5      191.0
32    −7.79E−9      16.0    2.33E−8      172.0     1.71E−9      37,600.0
64    −4.87E−10     16.0    1.46E−9      16.0     *
128   −3.04E−11     16.0    9.15E−11     16.0     *

For I^(1) and I^(2), the ratio by which the error decreases approaches 16. For I^(3), the errors converge to zero much more rapidly.


> 5. Numerical Integration > 5.2 Error formulae

The integration error for the trapezoidal rule

Theorem

Let f ∈ C2[a, b], n ∈ N. The error in integrating

I(f) =∫ b

a

f(x)dx

using the trapezoidal ruleTn(f) = h

[12f(x0) + f(x1) + f(x2) + · · ·+ f(xn−1) + 1

2f(xn)]

is given by

ETn ≡ I(f)− Tn(f) =−h2(b− a)

12f ′′(cn) (6.10)

where cn is some unknown point in a, b, and h = b−an .


Theorem

Suppose that f ∈ C²[a,b], and let h = max_j (x_{j+1} − xj). Then

| ∫_a^b f(x)dx − Σ_j (x_{j+1} − xj)(f(xj) + f(x_{j+1}))/2 | ≤ ((b−a)/12) h² max_{a≤x≤b} |f″(x)|.

Proof: Let Ij be the j-th subinterval, and let p1 be the linear interpolant on Ij at xj, x_{j+1}:

f(x) − p1(x) = (f″(ξ)/2)(x − xj)(x − x_{j+1}).

The local error is

| ∫_{xj}^{x_{j+1}} f(x)dx − (x_{j+1} − xj)(f(xj) + f(x_{j+1}))/2 | =


proof

= | ∫_{xj}^{x_{j+1}} (f″(ξj)/2!)(x − xj)(x − x_{j+1}) dx |
≤ (1/2) ∫_{xj}^{x_{j+1}} |f″(ξj)| |x − xj| |x − x_{j+1}| dx
≤ (1/2) max_{a≤x≤b} |f″(x)| ∫_{xj}^{x_{j+1}} (x − xj)(x_{j+1} − x) dx
= (1/2) max_{a≤x≤b} |f″(x)| · (x_{j+1} − xj)³/6.

Hence

|local error| ≤ (1/12) max_{a≤x≤b} |f″(x)| · hj³, hj = x_{j+1} − xj.


proof

Finally,

|global error| ≡ | ∫_a^b f(x)dx − Σ_j (x_{j+1} − xj)(f(xj) + f(x_{j+1}))/2 |
= | Σ_j ( ∫_{xj}^{x_{j+1}} f(x)dx − (x_{j+1} − xj)(f(xj) + f(x_{j+1}))/2 ) |
≤ Σ_j |local error| ≤ Σ_j (1/12) max_{a≤x≤b} |f″(x)| (x_{j+1} − xj)³
≤ (h²/12) max_{a≤x≤b} |f″(x)| Σ_j (x_{j+1} − xj) = ((b − a) h²/12) max_{a≤x≤b} |f″(x)|,

using (x_{j+1} − xj)³ ≤ h²(x_{j+1} − xj) and Σ_j (x_{j+1} − xj) = b − a. □


Recall the example

I(f) = ∫_0^1 dx/(1+x) = ln 2.

Here f(x) = 1/(1+x), [a,b] = [0,1], and f″(x) = 2/(1+x)³. Then by (6.10),

E_n^T(f) = −(h²/12) f″(cn), 0 ≤ cn ≤ 1, h = 1/n.

This cannot be computed exactly, since cn is unknown. But

max_{0≤x≤1} |f″(x)| = max_{0≤x≤1} 2/(1+x)³ = 2,

and therefore

|E_n^T(f)| ≤ (h²/12)·2 = h²/6.

For n = 1 and n = 2 we have

|E_1^T(f)| ≤ 1/6 ·= 0.167 (actual error −0.0569),
|E_2^T(f)| ≤ (1/2)²/6 ·= 0.0417 (actual error −0.0152).

5. Numerical Integration Math 1070

Page 202: Numerical Mathematical Analysis

> 5. Numerical Integration > 5.2 Error formulae

A possible weakness in the trapezoidal rule can be inferred from the assumption of the theorem for the error.

If f(x) does not have two continuous derivatives on [a,b], does T_n(f) converge more slowly?

YES, for some functions, especially if the first derivative is not continuous.

> 5. Numerical Integration > 5.2.1 Asymptotic estimate of Tn(f)

The error formula (6.10)

    E_n^T(f) ≡ I(f) - T_n(f) = -[h^2 (b-a)/12] f''(c_n)

can only be used to bound the error, because f''(c_n) is unknown.

This can be improved by a more careful consideration of the error formula.

A central element of the proof of (6.10) lies in the local error

    ∫_α^{α+h} f(x) dx - h [f(α) + f(α+h)]/2 = -(h^3/12) f''(c)      (6.11)

for some c ∈ [α, α+h].

Recall the derivation of the trapezoidal rule T_n(f) and use the local error (6.11):

    E_n^T(f) = ∫_a^b f(x) dx - T_n(f) = ∫_{x_0}^{x_n} f(x) dx - T_n(f)
    = [ ∫_{x_0}^{x_1} f(x) dx - h (f(x_0) + f(x_1))/2 ]
      + [ ∫_{x_1}^{x_2} f(x) dx - h (f(x_1) + f(x_2))/2 ]
      + ... + [ ∫_{x_{n-1}}^{x_n} f(x) dx - h (f(x_{n-1}) + f(x_n))/2 ]
    = -(h^3/12) f''(γ_1) - (h^3/12) f''(γ_2) - ... - (h^3/12) f''(γ_n)

with γ_1 ∈ [x_0,x_1], γ_2 ∈ [x_1,x_2], ..., γ_n ∈ [x_{n-1},x_n], and

    E_n^T(f) = -(h^2/12) ( h f''(γ_1) + ... + h f''(γ_n) ),

where h f''(γ_1) + ... + h f''(γ_n) = (b-a) f''(c_n) for some c_n ∈ [a,b].

To estimate the trapezoidal error, observe that h f''(γ_1) + ... + h f''(γ_n) is a Riemann sum for the integral

    ∫_a^b f''(x) dx = f'(b) - f'(a)                                  (6.12)

The Riemann sum is based on the partition [x_0,x_1], [x_1,x_2], ..., [x_{n-1},x_n] of [a,b].

As n → ∞, this sum will approach the integral (6.12). With (6.12), we find an asymptotic estimate (which improves as n increases)

    E_n^T(f) ≈ -(h^2/12) (f'(b) - f'(a)) =: Ẽ_n^T(f).                (6.13)

As long as f'(x) is computable, Ẽ_n^T(f) will be very easy to compute.

Example

Again consider I = ∫_0^1 dx/(1+x).

Then f'(x) = -1/(1+x)^2, and the asymptotic estimate (6.13) yields

    Ẽ_n^T = -(h^2/12) [ -1/(1+1)^2 + 1/(1+0)^2 ] = -h^2/16,    h = 1/n

and for n = 1 and n = 2

    Ẽ_1^T = -1/16 = -0.0625,    Ẽ_2^T ≈ -0.0156,
    I - T_1 ≈ -0.0569,          I - T_2 ≈ -0.0152.

The estimate (6.13)

    Ẽ_n^T(f) = -(h^2/12) (f'(b) - f'(a))

has several practical advantages over the earlier formula (6.10)

    E_n^T(f) = -[h^2 (b-a)/12] f''(c_n).

1  It confirms that when n is doubled (or h is halved), the error decreases by a factor of about 4, provided that f'(b) - f'(a) ≠ 0. This agrees with the results for I(1) and I(2).

2  (6.13) implies that the convergence of T_n(f) will be more rapid when f'(b) - f'(a) = 0. This is a partial explanation of the very rapid convergence of I(3).

3  (6.13) leads to a more accurate numerical integration formula:

       I(f) - T_n(f) ≈ -(h^2/12) (f'(b) - f'(a))
       I(f) ≈ T_n(f) - (h^2/12) (f'(b) - f'(a)) := CT_n(f),         (6.14)

   the corrected trapezoidal rule.
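A minimal MATLAB sketch of (6.14), assuming f' is available analytically; the function name is illustrative:

    function CT = corrected_trap(f, df, a, b, n)
    % Corrected trapezoidal rule (6.14): T_n(f) - (h^2/12)(f'(b) - f'(a)).
    % f, df are vectorized function handles for f and f'.
      h = (b - a)/n;
      x = a + h*(0:n);
      T = h * ( sum(f(x)) - 0.5*(f(a) + f(b)) );   % trapezoidal rule T_n(f)
      CT = T - (h^2/12) * (df(b) - df(a));         % asymptotic error correction
    end

For instance, corrected_trap(@(x) 1./(1+x), @(x) -1./(1+x).^2, 0, 1, 2) applies (6.14) to the example above.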


Example

Recall the integral I(1):  I = ∫_0^1 e^{-x^2} dx ≈ 0.74682413281243.

    n     I - T_n(f)   Ẽ_n(f)     CT_n(f)          I - CT_n(f)   Ratio
    2     1.545E-2     1.533E-2   0.746698561877   1.26E-4
    4     3.840E-3     3.832E-3   0.746816175313   7.96E-6       15.8
    8     9.585E-4     9.580E-4   0.746823634224   4.99E-7       16.0
    16    2.395E-4     2.395E-4   0.746824101633   3.12E-8       16.0
    32    5.988E-5     5.988E-5   0.746824130863   1.95E-9       16.0
    64    1.497E-5     1.497E-5   0.746824132690   2.22E-10      16.0

    Table: Example of CT_n(f) and Ẽ_n(f)

Note that the estimate

    Ẽ_n^T(f) = h^2 e^{-1}/6,    h = 1/n

is a very accurate estimator of the true error.

The error in CT_n(f) → 0 at a more rapid rate than does the error for T_n(f).

When n is doubled, the error in CT_n(f) decreases by a factor of about 16.

> 5. Numerical Integration > 5.2.2 Error formulae for Simpson’s rule

Theorem

Assume f ∈ C^4[a,b], n ∈ N. The error in using Simpson's rule is

    E_n^S(f) = I(f) - S_n(f) = -[h^4 (b-a)/180] f^{(4)}(c_n)        (6.15)

with c_n ∈ [a,b] an unknown point, and h = (b-a)/n.

Moreover, this error can be estimated with the asymptotic error formula

    Ẽ_n^S(f) = -(h^4/180) (f'''(b) - f'''(a))                        (6.16)

Note that (6.15) says that Simpson's rule is exact for all f(x) that are polynomials of degree ≤ 3, whereas the quadratic interpolation on which Simpson's rule is based is exact only for f(x) a polynomial of degree ≤ 2.

The degree of precision being 3 leads to the power h^4 in the error, rather than the power h^3, which would have been produced on the basis of the error in quadratic interpolation.

It is the higher power h^4, together with the simple form of the method, that historically caused Simpson's rule to become the most popular numerical integration rule.

Example

Recall (6.8), where S_2(f) was applied to I = ∫_0^1 dx/(1+x):

    S_2(f) = [(1/2)/3] [ 1 + 4(2/3) + 1/2 ] = 25/36 ≈ 0.69444

    f(x) = 1/(1+x),   f'''(x) = -6/(1+x)^4,   f^{(4)}(x) = 24/(1+x)^5.

The exact error is given by

    E_n^S(f) = -(h^4/180) f^{(4)}(c_n),    h = 1/n

for some 0 ≤ c_n ≤ 1. We can bound it by

    |E_n^S(f)| ≤ (h^4/180)(24) = 2h^4/15.

The asymptotic error is given by

    Ẽ_n^S(f) = -(h^4/180) [ -6/(1+1)^4 + 6/(1+0)^4 ] = -h^4/32.

For n = 2, Ẽ_n^S ≈ -0.00195; the actual error is -0.00130.

The behavior of I(f) - S_n(f) can be derived from (6.16):

    Ẽ_n^S(f) = -(h^4/180) (f'''(b) - f'''(a)),

i.e., when n is doubled, h is halved, and h^4 decreases by a factor of 16.

Thus, the error E_n^S(f) should decrease by the same factor, provided that f'''(a) ≠ f'''(b). This is the behavior observed with the integrals I(1) and I(2).

When f'''(a) = f'''(b), the error will decrease more rapidly, which is a partial explanation of the rapid convergence for I(3).

The theory of asymptotic error formulae

    E_n(f) ≈ Ẽ_n(f)                                                  (6.17)

such as for Ẽ_n^T(f) and Ẽ_n^S(f), says that the accuracy of (6.17) will vary with the integrand f, which is illustrated by the two cases I(1) and I(2).

From (6.15) and (6.16) we infer that Simpson's rule will not perform as well if f(x) ∉ C^4[a,b].

Example

Use Simpson's rule to approximate

    I = ∫_0^1 √x dx = 2/3.

    n     Error      Ratio
    2     2.860E-2
    4     1.014E-2   2.82
    8     3.587E-3   2.83
    16    1.268E-3   2.83
    32    4.485E-4   2.83

    Table: Simpson's rule for √x

The column "Ratio" shows that the convergence is much slower.

As was done for the trapezoidal rule, a corrected Simpson's rule can be defined:

    CS_n(f) = S_n(f) - (h^4/180) (f'''(b) - f'''(a))                 (6.18)

This will usually be a more accurate approximation than S_n(f).

> 5. Numerical Integration > 5.2.3 Richardson extrapolation

Richardson extrapolation

The error estimates for the trapezoidal rule (6.13)

    Ẽ_n^T(f) ≈ -(h^2/12) (f'(b) - f'(a))

and Simpson's rule (6.16)

    Ẽ_n^S(f) = -(h^4/180) (f'''(b) - f'''(a))

are both of the form

    I - I_n ≈ c/n^p                                                  (6.19)

where I_n denotes the numerical integral and h = (b-a)/n.

The constants c and p vary with the method and the function. With most integrands f(x), p = 2 for the trapezoidal rule and p = 4 for Simpson's rule.

There are other numerical methods that satisfy (6.19), with other values of p and c. We use (6.19) to obtain a computable estimate of the error I - I_n, without needing to know c explicitly.

Replacing n by 2n gives

    I - I_{2n} ≈ c/(2^p n^p)                                         (6.20)

and comparing to (6.19),

    2^p (I - I_{2n}) ≈ c/n^p ≈ I - I_n.

Solving for I gives Richardson's extrapolation formula

    (2^p - 1) I ≈ 2^p I_{2n} - I_n
    I ≈ (2^p I_{2n} - I_n)/(2^p - 1) ≡ R_{2n}                        (6.21)

R_{2n} is an improved estimate of I, based on using I_n, I_{2n}, p, and the assumption (6.19). How much more accurate it is than I_{2n} depends on the validity of (6.19) and (6.20).

To estimate the error in I_{2n}, compare it with the more accurate value R_{2n}:

    I - I_{2n} ≈ R_{2n} - I_{2n} = (2^p I_{2n} - I_n)/(2^p - 1) - I_{2n}

    I - I_{2n} ≈ (I_{2n} - I_n)/(2^p - 1)                            (6.22)

This is Richardson's error estimate.
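In MATLAB, (6.21) and (6.22) amount to two lines; a sketch with illustrative names, assuming I_n and I_2n were computed by a rule of known order p:

    function [R, err_est] = richardson(I_n, I_2n, p)
    % Richardson extrapolation (6.21) and error estimate (6.22),
    % given two approximations of a rule with I - I_n ~ c/n^p.
      R = (2^p * I_2n - I_n) / (2^p - 1);     % improved estimate of I
      err_est = (I_2n - I_n) / (2^p - 1);     % estimate of I - I_2n
    end

For the trapezoidal values T_2, T_4 of the next example, richardson(T2, T4, 2) reproduces R_4 and the error estimate 0.00387.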


Example

Using the trapezoidal rule to approximate

    I = ∫_0^1 e^{-x^2} dx ≈ 0.74682413281243

we have

    T_2 ≈ 0.7313702518,    T_4 ≈ 0.7429840978.

Using (6.21), I ≈ (2^p I_{2n} - I_n)/(2^p - 1), with p = 2 and n = 2, we obtain

    I ≈ R_4 = (1/3)(4 I_4 - I_2) = (1/3)(4 T_4 - T_2) ≈ 0.7468553797.

The error in R_4 is -0.0000312; and from a previous table, R_4 is more accurate than T_32. To estimate the error in T_4, use (6.22) to get

    I - T_4 ≈ (1/3)(T_4 - T_2) ≈ 0.00387.

The actual error in T_4 is 0.00384, so (6.22) is a very accurate error estimate.

> 5. Numerical Integration > 5.2.4 Periodic Integrands

Definition

A function f(x) is periodic with period τ if

    f(x) = f(x + τ),    ∀x ∈ R                                       (6.23)

and this relation should not hold for any smaller positive value of τ.

For example, f(x) = e^{cos(πx)} is periodic with period τ = 2.

If f(x) is periodic and differentiable, then its derivatives are also periodic with period τ.

Consider integrating

    I = ∫_a^b f(x) dx

with the trapezoidal or Simpson's rule, and assume that b - a is an integer multiple of the period τ. Assume f(x) ∈ C^∞[a,b] (f has derivatives of any order).

Then for all derivatives of f(x), the periodicity of f(x) implies that

    f^{(k)}(a) = f^{(k)}(b),    k ≥ 0                                (6.24)

If we now look at the asymptotic error formulae for the trapezoidal and Simpson's rules, they become zero because of (6.24).

Thus, the errors E_n^T(f) and E_n^S(f) should converge to zero more rapidly when f(x) is a periodic function, provided b - a is an integer multiple of the period of f.

The asymptotic error formulae Ẽ_n^T(f) and Ẽ_n^S(f) can be extended to higher-order terms in h using the Euler-MacLaurin expansion; the higher-order terms are multiples of f^{(k)}(b) - f^{(k)}(a) for all odd integers k ≥ 1. Using this, we can prove that the errors E_n^T(f) and E_n^S(f) converge to zero even more rapidly than was implied by the earlier comments when f(x) is periodic.

Note that the trapezoidal rule is the preferred integration rule when we are dealing with smooth periodic integrands. The earlier results for the integral I(3) illustrate this.

Example

The ellipse with boundary

    (x/a)^2 + (y/b)^2 = 1

has area πab. For the case in which the area is π (and thus ab = 1), we study the variation of the perimeter of the ellipse as a and b vary.

The ellipse has the parametric representation

    (x, y) = (a cos θ, b sin θ),    0 ≤ θ ≤ 2π                       (6.25)

By using the standard formula for the perimeter, and using the symmetry of the ellipse about the x-axis, the perimeter is given by

    P = 2 ∫_0^π √( (dx/dθ)^2 + (dy/dθ)^2 ) dθ
      = 2 ∫_0^π √( a^2 sin^2 θ + b^2 cos^2 θ ) dθ.

Since ab = 1, we write this as

    P(b) = 2 ∫_0^π √( (1/b^2) sin^2 θ + b^2 cos^2 θ ) dθ
         = (2/b) ∫_0^π √( (b^4 - 1) cos^2 θ + 1 ) dθ                 (6.26)

We consider only the case with 1 ≤ b < ∞. Since the perimeters of the two ellipses

    (x/a)^2 + (y/b)^2 = 1    and    (x/b)^2 + (y/a)^2 = 1

are equal, we can always consider the case in which the y-axis of the ellipse is larger than or equal to its x-axis; and this also shows

    P(1/b) = P(b),    b > 0                                          (6.27)

The integrand of P(b),

    f(θ) = (2/b) [ (b^4 - 1) cos^2 θ + 1 ]^{1/2},

is periodic with period π. As discussed above, the trapezoidal rule is the natural choice for numerical integration of (6.26). Nonetheless, there is a variation in the behaviour of f(θ) as b varies, and this will affect the accuracy of the numerical integration.

    [Figure: The graph of the integrand f(θ) for b = 2, 5, 8, on 0 ≤ θ ≤ π.]

    n      b = 2       b = 5        b = 8
    8      8.575517    19.918814    31.690628
    16     8.578405    20.044483    31.953632
    32     8.578422    20.063957    32.008934
    64     8.578422    20.065672    32.018564
    128    8.578422    20.065716    32.019660
    256    8.578422    20.065717    32.019709

    Table: Trapezoidal rule approximation of (6.26)

Note that as b increases, the trapezoidal rule converges more slowly. This is due to the integrand f(θ) changing more rapidly as b increases. For large b, f(θ) changes very rapidly in the vicinity of θ = π/2; and this causes the trapezoidal rule to be less accurate than when b is smaller, near 1. To obtain a certain accuracy in the perimeter P(b), we must increase n as b increases.

    [Figure: The graph of the perimeter function z = P(b) for the ellipse.]

The graph of P(b) reveals that P(b) ≈ 4b for large b. Returning to (6.26), we have for large b

    P(b) ≈ (2/b) ∫_0^π ( b^4 cos^2 θ )^{1/2} dθ
         = (2/b) b^2 ∫_0^π |cos θ| dθ = 4b.

We need to estimate the error in the above approximation to know when we can use it to replace P(b); but it provides a way to avoid the integration of (6.26) for the most badly behaved cases.

> 5. Numerical Integration > Review and more

Review

    ∫_{x_j}^{x_{j+1}} f(x) dx ≈ ∫_{x_j}^{x_{j+1}} p_n(x) dx ≡ I_j

where p_n(x) interpolates f at the points x_j^{(0)}, x_j^{(1)}, ..., x_j^{(n)} on [x_j, x_{j+1}].

Local error:

    ∫_{x_j}^{x_{j+1}} f(x) dx - I_j = ∫_{x_j}^{x_{j+1}} [f^{(n+1)}(ξ)/(n+1)!] ψ(x) dx

(the integrand is the error in interpolation), where

    ψ(x) = (x - x_j^{(0)})(x - x_j^{(1)}) ··· (x - x_j^{(n)}).

    [Figure: The graph of ψ(x) on x_j ≤ x ≤ x_{j+1}.]

Conclusion: the rule is exact on P_n, and

1  |local error| ≤ C max |f^{(n+1)}(x)| h^{n+2}

2  |global error| ≤ C max |f^{(n+1)}(x)| h^{n+1} (b - a)

Observation: If ξ is a point in (x_j, x_{j+1}), then

    g(ξ) = g(x_{j+1/2}) + O(h)

(if g' is continuous), i.e.,

    g(ξ) = g(x_{j+1/2}) + (ξ - x_{j+1/2}) g'(η),

where |ξ - x_{j+1/2}| ≤ h, so the last term is O(h).

Local error:

    (1/(n+1)!) ∫_{x_j}^{x_{j+1}} f^{(n+1)}(ξ) ψ(x) dx,
        where f^{(n+1)}(ξ) = f^{(n+1)}(x_{j+1/2}) + O(h),

    = (1/(n+1)!) f^{(n+1)}(x_{j+1/2}) ∫_{x_j}^{x_{j+1}} ψ(x) dx
          [dominant term, O(h^{n+2})]
      + (1/(n+1)!) ∫_{x_j}^{x_{j+1}} f^{(n+2)}(η(x)) (ξ - x_{j+1/2}) ψ(x) dx
          [higher-order terms, O(h^{n+3}); taking the max of f^{(n+2)} out and
           integrating gives the bound (C/(n+1)!) max |f^{(n+2)}| h^{n+3}]

The dominant term: case N = 1, Trapezoidal Rule

    [Figure: ψ(x) = (x - x_j)(x - x_{j+1}) on x_j ≤ x ≤ x_{j+1}.]

The dominant term: case N = 2, Simpson’s Rule

    ψ(x) = (x - x_j)(x - x_{j+1/2})(x - x_{j+1})  ⇒  ∫_{x_j}^{x_{j+1}} ψ(x) dx = 0

    [Figure: ψ(x) = (x - x_j)(x - x_{j+1/2})(x - x_{j+1}) on x_j ≤ x ≤ x_{j+1}; the positive and negative lobes cancel.]

The dominant term: case N = 3, Simpson's 3/8 rule

Local error = O(h^5).

    [Figure: ψ(x) = (x - x_j)(x - x_{j+1/3})(x - x_{j+2/3})(x - x_{j+1}) on x_j ≤ x ≤ x_{j+1}.]

The dominant term: case N = 4

    ∫ ψ(x) dx = 0  ⇒  local error = O(h^7)

    [Figure: The graph of ψ(x) for the case N = 4 on x_j ≤ x ≤ x_{j+1}.]

Simpson’s Rule

Simpson's rule is exact on P_2 (and on P_3, actually). Let x_{j+1/2} = (x_j + x_{j+1})/2.

Seek:

    ∫_{x_j}^{x_{j+1}} f(x) dx ≈ w_j f(x_j) + w_{j+1/2} f(x_{j+1/2}) + w_{j+1} f(x_{j+1})

Exact on 1, x, x^2:

    1:    ∫_{x_j}^{x_{j+1}} 1 dx   = x_{j+1} - x_j
                                   = w_j · 1 + w_{j+1/2} · 1 + w_{j+1} · 1,
    x:    ∫_{x_j}^{x_{j+1}} x dx   = x_{j+1}^2/2 - x_j^2/2
                                   = w_j x_j + w_{j+1/2} x_{j+1/2} + w_{j+1} x_{j+1},
    x^2:  ∫_{x_j}^{x_{j+1}} x^2 dx = x_{j+1}^3/3 - x_j^3/3
                                   = w_j x_j^2 + w_{j+1/2} x_{j+1/2}^2 + w_{j+1} x_{j+1}^2.

This 3 × 3 linear system has the solution

    w_j = (1/6)(x_{j+1} - x_j);   w_{j+1/2} = (4/6)(x_{j+1} - x_j);   w_{j+1} = (1/6)(x_{j+1} - x_j).
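The system is easy to check numerically; a MATLAB sketch on the illustrative interval [x_j, x_{j+1}] = [0, 1]:

    % Solve the 3x3 moment system for the Simpson weights on [0,1],
    % with nodes 0, 1/2, 1; the exact solution is [1/6; 4/6; 1/6].
    xj = 0; xj1 = 1; xm = (xj + xj1)/2;
    V = [1 1 1; xj xm xj1; xj^2 xm^2 xj1^2];         % exactness on 1, x, x^2
    m = [xj1 - xj; (xj1^2 - xj^2)/2; (xj1^3 - xj^3)/3];
    w = V \ m                                        % returns [1/6; 2/3; 1/6]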


Theorem

Let h = max(x_{j+1} - x_j) and let I(f) denote the Simpson's rule approximation. Then

    | ∫_a^b f(x) dx - I(f) | ≤ [(b-a)/2880] h^4 max_{a≤x≤b} |f^{(4)}(x)|.

Trapezoid rule versus Simpson's rule

    Cost in TR / Cost in SR
      = (2 function evaluations per subinterval × no. of subintervals)
        / (3 function evaluations per subinterval × no. of subintervals) = 2/3

(reducible to 1/2 if storing the previously computed values).

    Accuracy in TR / Accuracy in SR = [h^2 (b-a)/12] / [h^4 (b-a)/2880] = 240/h^2.

E.g., for h = 1/100 this factor is 2.4 × 10^6, i.e., SR is more accurate than TR by a factor of 2.4 × 10^6.

What if there is round-off error?

Suppose we use the method with

    f(x_j)_computed = f(x_j)_true ± ε_j,    ε_j = O(machine precision ε).

Then

    ∫_a^b f(x) dx ≈ Σ_j (x_{j+1} - x_j) [f(x_{j+1})_computed + f(x_j)_computed]/2
    = Σ_j (x_{j+1} - x_j) [f(x_{j+1}) ± ε_{j+1} + f(x_j) ± ε_j]/2
    = Σ_j (x_{j+1} - x_j) [f(x_{j+1}) + f(x_j)]/2        (value in exact arithmetic)
      + Σ_j (x_{j+1} - x_j) (±ε_{j+1} ± ε_j)/2           (contribution of round-off error, ≤ ε(b-a))

> 5. Numerical Integration > 5.3 Gaussian Numerical Integration

The numerical methods studied in the first two sections were based on integrating

1  linear (trapezoidal rule) and

2  quadratic (Simpson's rule)

interpolating polynomials, and the resulting formulae were applied on subdivisions of ever smaller subintervals.

We now consider a numerical method based on exact integration of polynomials of increasing degree; no subdivision of the integration interval [a,b] is used. Recall Section 4.4 of Chapter 4 on approximation of functions.

Let f(x) ∈ C[a,b]. Then ρ_n(f) denotes the smallest error bound that can be attained in approximating f(x) with a polynomial p(x) of degree ≤ n on the given interval a ≤ x ≤ b. The polynomial m_n(x) that yields this approximation is called the minimax approximation of degree n for f(x):

    max_{a≤x≤b} |f(x) - m_n(x)| = ρ_n(f)                             (6.28)

and ρ_n(f) is called the minimax error.

Example

Let f(x) = e^{-x^2} for x ∈ [0,1].

    n   ρ_n(f)      n    ρ_n(f)
    1   5.30E-2     6    7.82E-6
    2   1.79E-2     7    4.62E-7
    3   6.63E-4     8    9.64E-8
    4   4.63E-4     9    8.05E-9
    5   1.62E-5     10   9.16E-10

    Table: Minimax errors for e^{-x^2}, 0 ≤ x ≤ 1

The minimax errors ρ_n(f) converge to zero rapidly, although not at a uniform rate.

If we have a numerical integration formula that integrates low- to moderate-degree polynomials exactly, then the hope is that the same formula will integrate other functions f(x) almost exactly, if f(x) is well approximable by such polynomials.

To illustrate the derivation of such integration formulae, we restrict ourselves to the integral

    I(f) = ∫_{-1}^{1} f(x) dx.

The integration formula is to have the general form (the Gaussian numerical integration method)

    I_n(f) = Σ_{j=1}^{n} w_j f(x_j)                                  (6.29)

and we require that the nodes {x_1, ..., x_n} and weights {w_1, ..., w_n} be so chosen that I_n(f) = I(f) for all polynomials f(x) of as large a degree as possible.

Case n = 1

The integration formula has the form

    ∫_{-1}^{1} f(x) dx ≈ w_1 f(x_1)                                  (6.30)

Using f(x) ≡ 1 and forcing equality in (6.30):   2 = w_1.
Using f(x) = x:                                   0 = w_1 x_1,

which implies x_1 = 0. Hence (6.30) becomes

    ∫_{-1}^{1} f(x) dx ≈ 2 f(0) ≡ I_1(f)                             (6.31)

This is the midpoint formula, and it is exact for all linear polynomials.

To see that (6.31) is not exact for quadratics, let f(x) = x^2. Then the error in (6.31) is

    ∫_{-1}^{1} x^2 dx - 2 (0)^2 = 2/3 ≠ 0,

hence (6.31) has degree of precision 1.

Case n = 2

The integration formula is

    ∫_{-1}^{1} f(x) dx ≈ w_1 f(x_1) + w_2 f(x_2)                     (6.32)

and it has four unspecified quantities: x_1, x_2, w_1, w_2. To determine these, we require (6.32) to be exact for the four monomials

    f(x) = 1, x, x^2, x^3,

obtaining 4 equations:

    2   = w_1 + w_2
    0   = w_1 x_1 + w_2 x_2
    2/3 = w_1 x_1^2 + w_2 x_2^2
    0   = w_1 x_1^3 + w_2 x_2^3

This is a nonlinear system with a solution

    w_1 = w_2 = 1,    x_1 = -√3/3,    x_2 = √3/3                     (6.33)

and another one based on reversing the signs of x_1 and x_2. This yields the integration formula

    ∫_{-1}^{1} f(x) dx ≈ f(-√3/3) + f(√3/3) ≡ I_2(f)                 (6.34)

which has degree of precision 3 (exact on all polynomials of degree ≤ 3 and not exact for f(x) = x^4).

Example

Approximate

    I = ∫_{-1}^{1} e^x dx = e - e^{-1} ≈ 2.3504024.

Using (6.34),

    ∫_{-1}^{1} f(x) dx ≈ f(-√3/3) + f(√3/3) ≡ I_2(f),

we get

    I_2 = e^{-√3/3} + e^{√3/3} ≈ 2.3426961,    I - I_2 ≈ 0.00771.

The error is quite small, considering we are using only 2 node points.

Case n > 2

We seek the formula (6.29)

    I_n(f) = Σ_{j=1}^{n} w_j f(x_j),

which has 2n unspecified parameters x_1, ..., x_n, w_1, ..., w_n, by forcing the integration formula to be exact for the 2n monomials

    f(x) = 1, x, x^2, ..., x^{2n-1}.

In turn, this forces I_n(f) = I(f) for all polynomials f of degree ≤ 2n-1. This leads to the following system of 2n nonlinear equations in 2n unknowns:

    2        = w_1 + w_2 + ... + w_n
    0        = w_1 x_1 + w_2 x_2 + ... + w_n x_n
    2/3      = w_1 x_1^2 + w_2 x_2^2 + ... + w_n x_n^2
    ...
    2/(2n-1) = w_1 x_1^{2n-2} + w_2 x_2^{2n-2} + ... + w_n x_n^{2n-2}
    0        = w_1 x_1^{2n-1} + w_2 x_2^{2n-1} + ... + w_n x_n^{2n-1}      (6.35)

The resulting formula I_n(f) has degree of precision 2n-1.

Solving this system is a formidable problem. The nodes {x_i} and weights {w_i} have been calculated and collected in tables for the most commonly used values of n.

    n    x_i                w_i
    2    ±0.5773502692      1.0
    3    ±0.7745966692      0.5555555556
          0.0               0.8888888889
    4    ±0.8611363116      0.3478548451
         ±0.3399810436      0.6521451549
    5    ±0.9061798459      0.2369268851
         ±0.5384693101      0.4786286705
          0.0               0.5688888889
    6    ±0.9324695142      0.1713244924
         ±0.6612093865      0.3607615730
         ±0.2386191861      0.4679139346
    7    ±0.9491079123      0.1294849662
         ±0.7415311856      0.2797053915
         ±0.4058451514      0.3818300505
          0.0               0.4179591837
    8    ±0.9602898565      0.1012285363
         ±0.7966664774      0.2223810345
         ±0.5255324099      0.3137066459
         ±0.1834346425      0.3626837834

    Table: Nodes and weights for Gaussian quadrature formulae

There is also another approach to the development of the numerical integration formula (6.29), using the theory of orthogonal polynomials. From that theory, it can be shown that the nodes {x_1, ..., x_n} are the zeros of the Legendre polynomial of degree n on the interval [-1,1]. Recall that these polynomials were introduced in Section 4.7. For example,

    P_2(x) = (1/2)(3x^2 - 1)

and its roots are the nodes given in (6.33), x_1 = -√3/3, x_2 = √3/3. Since the Legendre polynomials are well known, the nodes {x_j} can be found without any recourse to the nonlinear system (6.35).

The sequence of formulae (6.29) is called the Gaussian numerical integration method.

From its definition, I_n(f) uses n nodes, and it is exact for all polynomials of degree ≤ 2n-1.

I_n(f) is limited to ∫_{-1}^{1} f(x) dx, an integral over [-1,1]. Given an integral

    I(f) = ∫_a^b f(x) dx                                             (6.36)

introduce the linear change of variable

    x = [b + a + t(b-a)]/2,    -1 ≤ t ≤ 1                            (6.37)

transforming the integral to

    I(f) = [(b-a)/2] ∫_{-1}^{1} f̃(t) dt                              (6.38)

with

    f̃(t) = f( [b + a + t(b-a)]/2 ).

Now apply I_n to this new integral.
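A MATLAB sketch combining (6.29), (6.37) and (6.38), using the tabulated n = 3 nodes and weights; the function name is illustrative:

    function I = gauss3(f, a, b)
    % 3-point Gauss-Legendre rule, mapped from [-1,1] to [a,b] via (6.37).
      t = [-0.7745966692, 0.0, 0.7745966692];        % nodes on [-1,1]
      w = [ 0.5555555556, 0.8888888889, 0.5555555556];
      x = (b + a + t*(b - a))/2;                     % mapped nodes, eq. (6.37)
      I = (b - a)/2 * sum(w .* f(x));                % eq. (6.38) with (6.29)
    end

For example, gauss3(@(x) exp(-x.^2), 0, 1) returns about 0.746815, consistent with the n = 3 error for I(1) in the next example.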


Example

Apply Gaussian numerical integration to the three integrals

    I(1) = ∫_0^1 e^{-x^2} dx,    I(2) = ∫_0^4 dx/(1+x^2),    I(3) = ∫_0^{2π} dx/(2+cos x),

which were used as examples for the trapezoidal and Simpson's rules.

All are reformulated as integrals over [-1,1]. The error results are

    n     Error in I(1)   Error in I(2)   Error in I(3)
    2     2.29E-4         -2.33E-2        8.23E-1
    3     9.55E-6         -3.49E-2        -4.30E-1
    4     -3.35E-7        1.90E-3         1.77E-1
    5     6.05E-9         1.70E-3         -8.12E-2
    6     -7.77E-11       2.74E-4         3.55E-2
    7     7.89E-13        -6.45E-5        -1.58E-2
    10    *               1.27E-6         1.37E-3
    15    *               7.40E-10        -2.33E-5
    20    *               *               3.96E-7

If these results are compared to those of the trapezoidal and Simpson's rules, then Gaussian integration of I(1) and I(2) is much more efficient than those rules. But the integration of the periodic integrand I(3) is not as efficient as with the trapezoidal rule. These results are also true for most other integrals.

Except for periodic integrands, Gaussian numerical integration is usually much more accurate than the trapezoidal and Simpson rules.

This is even true for many integrals in which the integrand does not have a continuous derivative.

Example

Use Gaussian integration on

    I = ∫_0^1 √x dx = 2/3.

The results are

    n     I - I_n     Ratio
    2     -7.22E-3
    4     -1.16E-3    6.2
    8     -1.69E-4    6.9
    16    -2.30E-5    7.4
    32    -3.00E-6    7.6
    64    -3.84E-7    7.8

where n is the number of node points. The ratio column is defined as

    (I - I_{n/2}) / (I - I_n)

and it shows that the error behaves like

    I - I_n ≈ c/n^3                                                  (6.39)

for some c. The error using Simpson's rule has an empirical rate of convergence proportional to only 1/n^{1.5}, a much slower rate than (6.39).

A result that relates the minimax error to the Gaussian numerical integration error:

Theorem

Let f ∈ C[a,b], n ≥ 1. Then, if we apply Gaussian numerical integration to I = ∫_a^b f(x) dx, the error in I_n satisfies

    |I(f) - I_n(f)| ≤ 2(b-a) ρ_{2n-1}(f)                             (6.40)

where ρ_{2n-1}(f) is the minimax error of degree 2n-1 for f(x) on [a,b].

Example

Using the earlier table of minimax errors ρ_n(f) for f(x) = e^{-x^2}, apply (6.40) to

    I = ∫_0^1 e^{-x^2} dx.

For n = 3, the above bound implies

    |I - I_3| ≤ 2 ρ_5(e^{-x^2}) ≈ 3.24 × 10^{-5}.

The actual error is 9.55E-6.

Gaussian numerical integration is not as simple to use as the trapezoidal and Simpson rules, partly because the Gaussian nodes and weights do not have simple formulae and also because the error is harder to predict. Nonetheless, the increase in the speed of convergence is so rapid and dramatic in most instances that the method should always be considered seriously when one is doing many integrations. Estimating the error is quite difficult, and most people satisfy themselves by looking at two or more successive values. If n is doubled, then repeatedly comparing two successive values, I_n and I_2n, is almost always adequate for estimating the error in I_n:

    I - I_n ≈ I_2n - I_n.

This is somewhat inefficient, but the speed of convergence of I_n is so rapid that this will still not diminish its advantage over most other methods.

> 5. Numerical Integration > 5.3.1 Weighted Gaussian Quadrature

A common problem is the evaluation of integrals of the form

    I(f) = ∫_a^b w(x) f(x) dx                                        (6.41)

with

  - f(x) a "well-behaved" function, and
  - w(x) a possibly (and often) ill-behaved function.

Gaussian quadrature has been generalized to handle such integrals for many functions w(x). Examples include

    ∫_{-1}^{1} f(x)/√(1-x^2) dx,    ∫_0^1 √x f(x) dx,    ∫_0^1 f(x) ln(1/x) dx.

The function w(x) is called a weight function.

We begin by imitating the development given earlier in this section, and we do so for the special case of

    I(f) = ∫_0^1 f(x)/√x dx,

in which w(x) = 1/√x.

As before, we seek numerical integration formulae of the form

    I_n(f) = Σ_{j=1}^{n} w_j f(x_j)                                  (6.42)

and we require that the nodes {x_1, ..., x_n} and the weights {w_1, ..., w_n} be so chosen that I_n(f) = I(f) for polynomials f(x) of as large a degree as possible.

Case n = 1

The integration formula has the form

    ∫_0^1 f(x)/√x dx ≈ w_1 f(x_1).

We force equality for f(x) = 1 and f(x) = x. This leads to the equations

    w_1     = ∫_0^1 (1/√x) dx = 2
    w_1 x_1 = ∫_0^1 (x/√x) dx = 2/3

Solving for w_1 and x_1, we obtain the formula

    ∫_0^1 f(x)/√x dx ≈ 2 f(1/3)                                      (6.43)

and it has degree of precision 1.

Case n = 2

The integration formula has the form

    ∫_0^1 f(x)/√x dx ≈ w_1 f(x_1) + w_2 f(x_2)                       (6.44)

We force equality for f(x) = 1, x, x^2, x^3. This leads to the equations

    w_1 + w_2             = ∫_0^1 (1/√x) dx   = 2
    w_1 x_1 + w_2 x_2     = ∫_0^1 (x/√x) dx   = 2/3
    w_1 x_1^2 + w_2 x_2^2 = ∫_0^1 (x^2/√x) dx = 2/5
    w_1 x_1^3 + w_2 x_2^3 = ∫_0^1 (x^3/√x) dx = 2/7

This has the solution

    x_1 = 3/7 - (2/35)√30 ≈ 0.11559,    x_2 = 3/7 + (2/35)√30 ≈ 0.74156
    w_1 = 1 + (1/18)√30 ≈ 1.30429,      w_2 = 1 - (1/18)√30 ≈ 0.69571

The resulting formula (6.44) has degree of precision 3.
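A MATLAB sketch of (6.44) with an illustrative function name; the weight 1/√x is built into the nodes and weights:

    function I = wgauss2(f)
    % 2-point Gaussian rule (6.44) for integrals of the form
    % int_0^1 f(x)/sqrt(x) dx, with w(x) = 1/sqrt(x) absorbed in the weights.
      x = 3/7 + (2/35)*sqrt(30)*[-1, 1];     % nodes x_1, x_2
      w = 1 + (1/18)*sqrt(30)*[ 1, -1];      % weights w_1, w_2
      I = sum(w .* f(x));
    end

For example, wgauss2(@(x) cos(pi*x)) gives about 0.740519, the value quoted in the example below.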


Case n > 2

We seek the formula (6.42), which has 2n unspecified parameters x_1, ..., x_n, w_1, ..., w_n, by forcing the integration formula to be exact for the 2n monomials

    f(x) = 1, x, x^2, ..., x^{2n-1}.

In turn, this forces I_n(f) = I(f) for all polynomials f of degree ≤ 2n-1. This leads to the following system of 2n nonlinear equations in 2n unknowns:

    w_1 + w_2 + ... + w_n                         = 2
    w_1 x_1 + w_2 x_2 + ... + w_n x_n             = 2/3
    w_1 x_1^2 + w_2 x_2^2 + ... + w_n x_n^2       = 2/5
    ...
    w_1 x_1^{2n-1} + w_2 x_2^{2n-1} + ... + w_n x_n^{2n-1} = 2/(4n-1)      (6.45)

The resulting formula I_n(f) has degree of precision 2n-1.

As before, this system is very difficult to solve directly, but there are alternative methods of deriving {x_i} and {w_i}. The approach is based on looking at the polynomials that are orthogonal with respect to the weight function w(x) = 1/√x on the interval [0,1].

Example

We evaluate

    I = ∫_0^1 cos(πx)/√x dx ≈ 0.74796566683146

using (6.43):

    ∫_0^1 f(x)/√x dx ≈ 2 f(1/3) = 2 cos(π/3) = 1.0,

and (6.44):

    ∫_0^1 f(x)/√x dx ≈ w_1 f(x_1) + w_2 f(x_2) ≈ 0.740519.

I_2 is a reasonable estimate of I, with I - I_2 ≈ 0.00745.

A general theory can be developed for the weighted Gaussian quadrature

    I(f) = ∫_a^b w(x) f(x) dx ≈ Σ_{j=1}^{n} w_j f(x_j) = I_n(f)      (6.46)

It requires the following assumptions for the weight function w(x):

1  w(x) > 0 for a < x < b;

2  for all integers n ≥ 0,

       ∫_a^b w(x) |x|^n dx < ∞.

These hypotheses are the same as were assumed for the generalized least squares approximation theory following Section 4.7 of Chapter 4. This is not accidental, since both Gaussian quadrature and least squares approximation theory depend on the subject of orthogonal polynomials. The node points {x_j} solving the system (6.45) are the zeros of the degree n orthogonal polynomial on [0,1] with respect to the weight function w(x) = 1/√x.

For the generalization (6.46), the nodes {x_i} are the zeros of the degree n orthogonal polynomial on [a,b] with respect to the weight function w(x).

> 5. Numerical Integration > Supplement

Gauss’s idea:

The optimal abscissas of the κ-point Gaussian quadrature formulas are precisely the roots of the orthogonal polynomial for the same interval and weighting function.

    ∫_a^b f(x) dx = Σ_j ∫_{x_j}^{x_{j+1}} f(x) dx        (composite formula)

    = Σ_j ∫_{-1}^{1} f( [(x_{j+1}-x_j)/2] t + (x_{j+1}+x_j)/2 ) [(x_{j+1}-x_j)/2] dt,

where each summand has the form ∫_{-1}^{1} g(t) dt ≈ Σ_{ℓ=1}^{κ} w_ℓ g(q_ℓ), the κ-point Gauss rule for maximum accuracy,

with w_1, ..., w_κ the weights and q_1, ..., q_κ the quadrature points on (-1,1). The rule is exact on polynomials p ∈ P_{2κ-1}, i.e., on 1, t, t^2, ..., t^{2κ-1}.

Example: 3-point Gauss, exact on P_5 ⇔ exact on 1, t, t^2, t^3, t^4, t^5

    ∫_{-1}^{1} g(t) dt ≈ w_1 g(q_1) + w_2 g(q_2) + w_3 g(q_3)

    ∫_{-1}^{1} 1 dt   = 2   = w_1 + w_2 + w_3
    ∫_{-1}^{1} t dt   = 0   = w_1 q_1 + w_2 q_2 + w_3 q_3
    ∫_{-1}^{1} t^2 dt = 2/3 = w_1 q_1^2 + w_2 q_2^2 + w_3 q_3^2
    ∫_{-1}^{1} t^3 dt = 0   = w_1 q_1^3 + w_2 q_2^3 + w_3 q_3^3
    ∫_{-1}^{1} t^4 dt = 2/5 = w_1 q_1^4 + w_2 q_2^4 + w_3 q_3^4
    ∫_{-1}^{1} t^5 dt = 0   = w_1 q_1^5 + w_2 q_2^5 + w_3 q_3^5

Guess: q_1 = -q_3, q_2 = 0 (q_1 ≤ q_2 ≤ q_3), w_1 = w_3.

With this guess:

    2w_1 + w_2 = 2
    2w_1 q_1^2 = 2/3
    2w_1 q_1^4 = 2/5,

hence

    q_1 = -√(3/5),    q_3 = √(3/5),
    w_1 = 5/9,    w_3 = 5/9,    w_2 = 8/9.

A. H. Stroud and D. Secrest, "Gaussian Quadrature Formulas", Englewood Cliffs, NJ: Prentice-Hall, 1966.
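These values are easy to verify numerically; a MATLAB sketch that checks exactness on 1, t, ..., t^5:

    % Verify that the 3-point Gauss rule integrates t^k exactly for k = 0..5.
    q = [-sqrt(3/5), 0, sqrt(3/5)];        % quadrature points
    w = [5/9, 8/9, 5/9];                   % weights
    for k = 0:5
      exact = (1 - (-1)^(k+1))/(k+1);      % int_{-1}^{1} t^k dt
      approx = sum(w .* q.^k);
      fprintf('k=%d  exact=%8.5f  gauss=%8.5f\n', k, exact, approx);
    end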


The idea of Gauss extends to Gauss-Lobatto rules, where both endpoints are fixed as nodes:

    ∫_{-1}^{1} g(t) dt = w_1 g(-1) + w_2 g(q_2) + ··· + w_{k-1} g(q_{k-1}) + w_k g(1).

Only k - 2 nodes can be located freely in the k-point formula; it is exact on P_{2k-3} (the order is decreased by 2 compared with the Gauss quadrature formula).

Adaptive Quadrature

Problem

Given ∫_a^b f(x) dx and a preassigned tolerance ε, compute

    I(f) ≈ ∫_a^b f(x) dx

with

(a) assured accuracy:

    | ∫_a^b f(x) dx - I(f) | < ε

(b) at minimal / near-minimal cost (number of function evaluations).

Strategy: LOCALIZE!


Localization Theorem

Let I(f) = Σ_j I_j(f), where I_j(f) ≈ ∫_{x_j}^{x_{j+1}} f(x) dx. If

    | ∫_{x_j}^{x_{j+1}} f(x) dx - I_j(f) | < ε (x_{j+1} - x_j)/(b - a)    (= local tolerance),

then

    | ∫_a^b f(x) dx - I(f) | < ε    (= tolerance).

Proof:

    | ∫_a^b f(x) dx - I(f) | = | Σ_j ∫_{x_j}^{x_{j+1}} f(x) dx - Σ_j I_j(f) |
    = | Σ_j ( ∫_{x_j}^{x_{j+1}} f(x) dx - I_j(f) ) |
    ≤ Σ_j | ∫_{x_j}^{x_{j+1}} f(x) dx - I_j(f) |
    < Σ_j ε (x_{j+1} - x_j)/(b - a)
    = [ε/(b - a)] Σ_j (x_{j+1} - x_j) = [ε/(b - a)] (b - a) = ε.

Need:

  - an estimator for the local error, and
  - a strategy:
      when to cut h, to ensure accuracy?
      when to increase h, to ensure minimal cost?

One approach: halving and doubling!

Recall: Trapezoidal rule

    I_j ≈ (x_{j+1} - x_j) [f(x_j) + f(x_{j+1})]/2.

A priori estimate:

    ∫_{x_j}^{x_{j+1}} f(x) dx - I_j = [(x_{j+1} - x_j)^3/12] f''(s_j)

for some s_j in (x_j, x_{j+1}).

Step 1: compute I_j:

    I_j = [f(x_j) + f(x_{j+1})]/2 · (x_{j+1} - x_j).

Step 2: cut the interval in half and reuse the trapezoidal rule:

    Î_j = [f(x_j) + f(x_{j+1/2})]/2 · (x_{j+1/2} - x_j)
        + [f(x_{j+1/2}) + f(x_{j+1})]/2 · (x_{j+1} - x_{j+1/2}).

Error estimate:

    ∫_{x_j}^{x_{j+1}} f(x) dx - I_j = (h_j^3/12) f''(ξ_j) = e_j        (1st use of trapezoid rule)

    ∫_{x_j}^{x_{j+1}} f(x) dx - Î_j = [(h_j/2)^3/12] f''(η_1) + [(h_j/2)^3/12] f''(η_2)    (2nd use of TR)
        = (1/4)(h_j^3/12) f''(ξ_j) + O(h_j^4) = e_j/4 + O(h_j^4) =: ê_j.

    e_j = 4 ê_j + Higher Order Terms.

Subtracting,

    Î_j - I_j = 3 ê_j + O(h^4)  ⇒  ê_j = (Î_j - I_j)/3 + Higher Order Terms (O(h^4)).


4-point Gauss: exact on P_7

Local error: O(h^9); global error: O(h^8). A priori estimate:

    ∫_{x_j}^{x_{j+1}} f(x) dx - I_j = C (x_{j+1} - x_j)^9 f^{(8)}(ξ_j) = C h_j^9 f^{(8)}(ξ_j)

    ∫_{x_j}^{x_{j+1}} f(x) dx - Î_j = C (h_j/2)^9 f^{(8)}(ξ'_j) + C (h_j/2)^9 f^{(8)}(ξ''_j)
        = (C/2^8) h_j^9 f^{(8)}(ξ_j) + O(h^{10})

    ⇒ Î_j - I_j = 255 ê_j + O(h^{10})
    ⇒ ê_j = (Î_j - I_j)/255 + Higher Order Terms (O(h^{10})).

Algorithm

Input:  a, b, f(x),
        upper error tolerance: ε_max,
        initial mesh width: h.

Initialize:  Integral = 0.0,  x_L = a,  ε_min = ε_max / 2^{k+3}.

*   x_R = x_L + h
    (If x_R > b, set x_R ← b, do the integral one more time, and stop.)

    Compute on [x_L, x_R]:  I, Î and EST, where

        EST = | (I - Î) / (2^{k+1} - 1) |        (if the rule is exact on P_k)

'error is just right':
    If ε_min h/(b-a) < EST < ε_max h/(b-a):
        Integral ← Integral + Î
        x_L ← x_R
        go to *

'error is too small':
    If EST ≤ ε_min h/(b-a):
        Integral ← Integral + Î
        x_L ← x_R
        h ← 2h
        go to *

'error is too big':
    If EST ≥ ε_max h/(b-a):
        h ← h/2.0
        go to *

STOP
END
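A compact MATLAB sketch of this strategy for the trapezoidal rule (k = 1, so EST = |I - Î|/3); all names are illustrative and the accept/coarsen order is one reasonable reading of the algorithm above:

    function Integral = adapt_trap(f, a, b, epsmax, h)
    % Adaptive trapezoidal quadrature by interval halving/doubling.
      epsmin = epsmax/2^4;                 % = epsmax/2^(k+3) with k = 1
      Integral = 0.0; xL = a;
      while xL < b
        xR = min(xL + h, b);
        I1 = (xR - xL)*(f(xL) + f(xR))/2;                            % one trapezoid
        xm = (xL + xR)/2;
        I2 = (xm - xL)*(f(xL) + f(xm))/2 + (xR - xm)*(f(xm) + f(xR))/2;  % two halves
        EST = abs(I2 - I1)/3;              % Richardson-type local error estimate
        hloc = xR - xL;
        if EST >= epsmax*hloc/(b - a)
          h = h/2;                         % error too big: refine, retry
        else
          Integral = Integral + I2;  xL = xR;              % accept refined value
          if EST <= epsmin*hloc/(b - a), h = 2*h; end      % error too small: coarsen
        end
      end
    end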


Trapezium rule∫ xj+1

xj

f(x)dx− (xj+1 − xj)f(xj+1) + f(xj)

2∫ xj+1

xj

f(x)− p1(x)dx =∫ xj+1

xj

f ′′(ξ)2

(x− xj)(x− xj+1)︸ ︷︷ ︸ψ(x)

dx

=f ′′(x)

2

∫ xj+1

xj

ψ(x)dx︸ ︷︷ ︸integrate exactly

+O(h)∫ xj+1

xj

ψ(x)dx

5. Numerical Integration Math 1070

Page 278: Numerical Mathematical Analysis

> 5. Numerical Integration > Supplement

The mysteries of ψ(x)

    [Figure: ψ(x) = (x - q_1)(x - q_2) ··· (x - q_7) on x_j ≤ x ≤ x_{j+1}, with nodes q_1, ..., q_7.]

Error in k + 1 point quadrature

p_k(x) interpolates f(x)  ⇒  f(x) - p_k(x) = [f^{(k+1)}(ξ)/(k+1)!] ψ(x),

    (x_j ≤) q_1 < q_2 < ... < q_{k+1} (≤ x_{j+1}),

    ∫_{x_j}^{x_{j+1}} f(x) dx  -  ∫_{x_j}^{x_{j+1}} p_k(x) dx  =  ∫_{x_j}^{x_{j+1}} [ψ(x)/(k+1)!] f^{(k+1)}(ξ) dx
         (true)                     (approx)

1. A simple error bound

Ignoring the oscillation of ψ(x):

    |error| ≤ [max |f^{(k+1)}|/(k+1)!] ∫_{x_j}^{x_{j+1}} |ψ(x)| dx,

    ∫_{x_j}^{x_{j+1}} |ψ(x)| dx = ∫_{x_j}^{x_{j+1}} |x - q_1| ··· |x - q_{k+1}| dx ≤ h^{k+1} ∫_{x_j}^{x_{j+1}} dx,

so

    |error| ≤ [max |f^{(k+1)}|/(k+1)!] |x_{j+1} - x_j|^{k+2}.

    [Figure: ψ(x) and |ψ(x)| on x_j ≤ x ≤ x_{j+1}.]

2. Analysis without cancellation

Let x_j < ξ < x < x_{j+1}.

Lemma

Let ξ, x ∈ (x_j, x_{j+1}). Then

    f^{(k+1)}(ξ) = f^{(k+1)}(x) + (ξ - x) f^{(k+2)}(η)        (MVT)

for some η between ξ and x, and |ξ - x| ≤ x_{j+1} - x_j ≤ h.

    |error| = |true - approx|
    = | (1/(k+1)!) ∫_{x_j}^{x_{j+1}} ψ(x) [ f^{(k+1)}(x) + (ξ - x) f^{(k+2)}(η) ] dx |
          (f^{(k+1)}(x) fixed; ξ - x = O(h))
    ≤ [|f^{(k+1)}(x)|/(k+1)!] | ∫_{x_j}^{x_{j+1}} ψ(x) dx |        (= 0 if ∫ ψ(x) dx = 0)
      + (1/(k+1)!) | ∫_{x_j}^{x_{j+1}} f^{(k+2)}(η) (ξ - x) ψ(x) dx |
    ≤ [max |f^{(k+2)}|/(k+1)!] ∫_{x_j}^{x_{j+1}} |ξ - x| |ψ(x)| dx      (|ξ - x| ≤ h, |ψ(x)| ≤ h^{k+1})
    ≤ h^{k+3} max |f^{(k+2)}(x)| / (k+1)!.

This is the mechanism behind the error for Simpson's rule, i.e., cancellation.

ψ(x) interpolates zero at k+1 points (deg ψ(x) = k+1).

Lemma

If p ∈ P_{k+1} and p(q_ℓ) = 0 for ℓ = 1, ..., k+1, then p(x) = Constant · ψ(x).

Questions:

1) How do we pick the points q_1, ..., q_{k+1} so that

    ∫_{-1}^{1} g(x) dx ≈ w_1 g(q_1) + ... + w_{k+1} g(q_{k+1})       (6.47)

integrates P_{k+m} exactly?

2) What does this imply about the error?

Remark

If m ≥ 1, pick q_1, ..., q_{k+1} so that ∫_{-1}^{1} ψ(x) dx = 0; then the error converges as O(h^{k+3}).

m = 2

Step 1

Let r_1 be some fixed point in [-1,1]: -1 < q_1 < q_2 < ... < r_1 < ... < q_k < q_{k+1}.

    p_{k+1}(x) = p_k(x) + ψ(x) [g(r_1) - p_k(r_1)] / ψ(r_1)          (6.48)

p_k interpolates g(x) at q_1, ..., q_{k+1}.
Claim: p_{k+1} interpolates g(x) at the k+2 points q_1, ..., q_{k+1}, r_1.
Suppose now that (6.47) is exact on P_{k+1}; then from (6.48),

    ∫_{-1}^{1} g(x) dx - ∫_{-1}^{1} p_{k+1}(x) dx        (error in the (k+2)-point quadrature rule, E_{k+2})

    = ∫_{-1}^{1} g(x) dx - ∫_{-1}^{1} p_k(x) dx          (error in the (k+1)-point rule ≡ E_{k+1})
      - [g(r_1) - p_k(r_1)]/ψ(r_1) · ∫_{-1}^{1} ψ(x) dx.

So

    E_{k+2} = E_{k+1} - [g(r_1) - p_k(r_1)]/ψ(r_1) · ∫_{-1}^{1} ψ(x) dx.

Conclusion 1

If ∫_{-1}^{1} ψ(x) dx = 0, then the error in the (k+1)-point rule is exactly the same as if we had used k+2 points.

Step 2

Let r_1, r_2 be fixed points in [-1,1], and interpolate at k+3 points: q_1, ..., q_{k+1}, r_1, r_2:

    p_{k+2}(x) = p_k(x) + ψ(x)(x - r_1) [g(r_2) - p_k(r_2)] / [(r_2 - r_1) ψ(r_2)]
                        + ψ(x)(x - r_2) [g(r_1) - p_k(r_1)] / [(r_1 - r_2) ψ(r_1)]     (6.49)

Consider the error in a rule with k+1+2 points:

    error in (k+3)-point rule = ∫_{-1}^{1} g(x) dx - ∫_{-1}^{1} p_{k+2}(x) dx

    = ∫_{-1}^{1} g(x) dx - ∫_{-1}^{1} p_k(x) dx
      - [g(r_2) - p_k(r_2)]/[(r_2 - r_1) ψ(r_2)] · ∫_{-1}^{1} ψ(x)(x - r_1) dx
      - [g(r_1) - p_k(r_1)]/[(r_1 - r_2) ψ(r_1)] · ∫_{-1}^{1} ψ(x)(x - r_2) dx.

So

    E_{k+3} = E_{k+1} + Const · ∫_{-1}^{1} ψ(x)(x - r_1) dx + Const · ∫_{-1}^{1} ψ(x)(x - r_2) dx.

Conclusion 2

If ∫_{-1}^{1} ψ(x) dx = 0 and ∫_{-1}^{1} x ψ(x) dx = 0, then the (k+1)-point rule has the same error as the (k+3)-point rule.

Continuing in this way,

    E_{k+1+m} = E_{k+1} + C_0 ∫_{-1}^{1} ψ(x) dx + C_1 ∫_{-1}^{1} ψ(x) x dx + ...     (6.50)
                + C_m ∫_{-1}^{1} ψ(x) x^{m-1} dx                                      (6.51)

So:

Conclusion 3

If ∫_{-1}^{1} ψ(x) x^j dx = 0 for j = 0, ..., m-1, then the error is as good as using m extra points.

Overview

Interpolating quadrature: interpolate f(x) at q_0, q_1, q_2, ..., q_k ⇒ p_k(x),

    f(x) - p_k(x) = [f^{(k+1)}(ξ)/(k+1)!] (x - q_0)(x - q_1) ... (x - q_k),

    ∫_{-1}^{1} f(x) dx - ∫_{-1}^{1} p_k(x) dx = (1/(k+1)!) ∫_{-1}^{1} f^{(k+1)}(ξ) ψ(x) dx.

Gauss rules:

  - pick the q_ℓ to maximize exactness
  - what is the accuracy?
  - what are the q_ℓ's?

    Interpolate at k+1+m points:          Interpolate at k+1 points:
    q_0, ..., q_k, r_1, ..., r_m          q_0, ..., q_k
            error                                error
              ⇓                                    ⇓

    E_{k+m} = E_k + c_0 ∫_{-1}^{1} ψ(x) · 1 dx
                  + c_1 ∫_{-1}^{1} ψ(x) · x dx + ...
                  + c_m ∫_{-1}^{1} ψ(x) · x^{m-1} dx

Definition

p(x) is the (µ+1)st orthogonal polynomial on [-1,1] (weight w(x) ≡ 1) if p(x) ∈ P_{µ+1} and ∫_{-1}^{1} p(x) x^ℓ dx = 0 for ℓ = 0, ..., µ, i.e.,

    ∫_{-1}^{1} p(x) q(x) dx = 0    ∀ q ∈ P_µ.

Pick q_0, q_1, ..., q_k so that

    ∫_{-1}^{1} ψ(x) · 1 dx = 0
    ∫_{-1}^{1} ψ(x) · x dx = 0
    ...
    ∫_{-1}^{1} ψ(x) · x^{m-1} dx = 0
        ⇔    ∫_{-1}^{1} ψ(x) q(x) dx = 0,  ∀ q ∈ P_{m-1}
             (deg ψ = k+1, deg q = m-1).

So, maximum accuracy is obtained if ψ(x) is the orthogonal polynomial of degree k+1:

    ∫_{-1}^{1} ψ(x) q(x) dx = 0  ∀ q ∈ P_k  ⇒  m-1 = k,  m = k+1.

So, the Gauss quadrature points are the roots of the orthogonal polynomial.

Adaptivity

    Î = Î_1 + Î_2

Trapezium rule's local error = O(h^3):

    ∫_{x_j}^{x_{j+1}} f(x) dx - I = e = 4ê        (since e ≈ 8 e_1 and e ≈ 8 e_2, so e ≈ 4ê)

    ∫_{x_j}^{x_{j+1}} f(x) dx - Î = ê = e_1 + e_2

    Î - I = 3ê (+ Higher Order Terms)  ⇒  ê = (Î - I)/3 (+ Higher Order Terms)

Final Observation

    True - I ≈ 4ê
    True - Î ≈ ê      }  2 equations, 2 unknowns: ê, True.

So we can solve for ê and True. Solving for True:

    True ≈ Î + ê ≈ Î + (Î - I)/3 ≈ (4/3) Î - (1/3) I    (+ Higher Order Terms)

> 5. Numerical Integration > 5.4 Numerical Differentiation

Forward difference formula

To numerically calculate the derivative of f(x), begin by recalling the definition of the derivative

    f'(x) = lim_{h→0} [f(x+h) - f(x)]/h.

This justifies using

    f'(x) ≈ [f(x+h) - f(x)]/h ≡ D_h f(x)                             (6.52)

for small values of h. D_h f(x) is called a numerical derivative of f(x) with stepsize h.
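A MATLAB sketch of the experiment in the next example (the course script forward difference.m is not shown in these notes, so this is illustrative):

    % Forward difference approximation (6.52) of f'(x) for f = cos at x = pi/6.
    f = @cos; x = pi/6; dtrue = -sin(x);
    h = 0.1;
    for i = 1:6
      D = (f(x + h) - f(x))/h;                   % D_h f(x), eq. (6.52)
      fprintf('h=%9.6f  D_h f=%10.6f  error=%10.6f\n', h, D, dtrue - D);
      h = h/2;                                   % halve h: error should halve
    end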


Example: [use forward difference.m]

Use D_h f to approximate the derivative of f(x) = cos(x) at x = π/6.

    h          D_h(f)     Error     Ratio
    0.1        -0.54243   0.04243
    0.05       -0.52144   0.02144   1.98
    0.025      -0.51077   0.01077   1.99
    0.0125     -0.50540   0.00540   1.99
    0.00625    -0.50270   0.00270   2.00
    0.003125   -0.50135   0.00135   2.00

Looking at the error column, we see the error is nearly proportional to h; when h is halved, the error is almost halved.

To explain the behaviour in this example, Taylor's theorem can be used to find an error formula. Expanding f(x+h) about x, we get

    f(x+h) = f(x) + h f'(x) + (h^2/2) f''(c)

for some c between x and x+h. Substituting on the right side of (6.52), we obtain

    D_h f(x) = (1/h) { [f(x) + h f'(x) + (h^2/2) f''(c)] - f(x) } = f'(x) + (h/2) f''(c)

    f'(x) - D_h f(x) = -(h/2) f''(c)                                 (6.53)

The error is proportional to h, agreeing with the results in the table above.

For that example,

    f'(π/6) - D_h f(π/6) = (h/2) cos(c)                              (6.54)

where c is between π/6 and π/6 + h.

Check that if c is replaced by π/6, then the RHS of (6.54) agrees with the error column in the table.

As seen in the example, we use the formula (6.52) with a positive stepsize h > 0. The formula (6.52) is commonly known as the forward difference formula for the first derivative. We can formally replace h by -h in (6.52) to obtain the formula

    f'(x) ≈ [f(x) - f(x-h)]/h,    h > 0                              (6.55)

This is the backward difference formula for the first derivative. A derivation similar to that leading to (6.53) shows that

    f'(x) - [f(x) - f(x-h)]/h = (h/2) f''(c)                         (6.56)

for some c between x and x-h.

Thus, we expect the accuracy of the backward difference formula to be almost the same as that of the forward difference formula.

> 5. Numerical Integration > 5.4.1 Differentiation Using Interpolation

Let P_n(x) denote the degree n polynomial that interpolates f(x) at n+1 node points x_0, ..., x_n.

To calculate f'(x) at some point x = t, use

    f'(t) ≈ P'_n(t)                                                  (6.57)

Many different formulae can be obtained by

1  varying n, and by

2  varying the placement of the nodes x_0, ..., x_n relative to the point t of interest.

As an especially useful example of (6.57), take

n = 2, t = x1, x0 = x1 − h, x2 = x1 + h.

Then

    P_2(x) = [(x-x_1)(x-x_2)/(2h^2)] f(x_0) + [(x-x_0)(x-x_2)/(-h^2)] f(x_1)
             + [(x-x_0)(x-x_1)/(2h^2)] f(x_2)

    P'_2(x) = [(2x-x_1-x_2)/(2h^2)] f(x_0) + [(2x-x_0-x_2)/(-h^2)] f(x_1)
              + [(2x-x_0-x_1)/(2h^2)] f(x_2)

    P'_2(x_1) = [(x_1-x_2)/(2h^2)] f(x_0) + [(2x_1-x_0-x_2)/(-h^2)] f(x_1)
                + [(x_1-x_0)/(2h^2)] f(x_2)
              = [f(x_2) - f(x_0)]/(2h)                               (6.58)

The central difference formula

Replacing x_0 and x_2 by x_1 - h and x_1 + h, from (6.57) and (6.58) we obtain the central difference formula

    f'(x_1) ≈ [f(x_1+h) - f(x_1-h)]/(2h) ≡ D̃_h f(x_1),               (6.59)

another approximation to the derivative of f(x).

It will be shown below that this is a more accurate approximation to f'(x) than is the forward difference formula D_h f(x) of (6.52), i.e.,

    f'(x) ≈ [f(x+h) - f(x)]/h ≡ D_h f(x).
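A MATLAB sketch mirroring the forward-difference experiment, now with (6.59); illustrative, in the spirit of the course's central difference.m:

    % Central difference approximation (6.59) of f'(x) for f = cos at x = pi/6.
    f = @cos; x = pi/6; dtrue = -sin(x);
    h = 0.1;
    for i = 1:5
      D = (f(x + h) - f(x - h))/(2*h);           % eq. (6.59)
      fprintf('h=%9.6f  error=%12.4e\n', h, dtrue - D);
      h = h/2;                                   % error should drop by ~4
    end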


Theorem

Assume f ∈ C^{n+2}[a,b]. Let x_0, x_1, ..., x_n be n+1 distinct interpolation nodes in [a,b], and let t be an arbitrary given point in [a,b]. Then

    f'(t) - P'_n(t) = Ψ_n(t) f^{(n+2)}(c_1)/(n+2)! + Ψ'_n(t) f^{(n+1)}(c_2)/(n+1)!     (6.60)

with

    Ψ_n(t) = (t - x_0)(t - x_1) ··· (t - x_n).

The numbers c_1 and c_2 are unknown points located between the maximum and minimum of x_0, x_1, ..., x_n and t.

To illustrate this result, an error formula can be derived for the central difference formula (6.59). Since t = x_1 in deriving (6.59), we find that the first term on the RHS of (6.60) is zero. Also n = 2 and

    Ψ_2(x) = (x - x_0)(x - x_1)(x - x_2)
    Ψ'_2(x) = (x - x_1)(x - x_2) + (x - x_0)(x - x_2) + (x - x_0)(x - x_1)
    Ψ'_2(x_1) = (x_1 - x_0)(x_1 - x_2) = -h^2.

Using this in (6.60), we get

    f'(x_1) - [f(x_1+h) - f(x_1-h)]/(2h) = -(h^2/6) f'''(c_2)        (6.61)

with x_1 - h ≤ c_2 ≤ x_1 + h. This says that for small values of h, the central difference formula (6.59) should be more accurate than the earlier approximation (6.52), the forward difference formula, because the error term of (6.59) decreases more rapidly with h.

Example: [use central difference.m]

The earlier example f(x) = cos(x) is repeated using the central difference formula (6.59) (recall x_1 = π/6).

    h          D̃_h(f)        Error          Ratio
    0.1        -0.49916708   -0.0008329
    0.05       -0.49979169   -0.0002083     4.00
    0.025      -0.49994792   -0.00005208    4.00
    0.0125     -0.49998698   -0.00001302    4.00
    0.00625    -0.49999674   -0.000003255   4.00

The results confirm the rate of convergence given in (6.61), and they illustrate that the central difference formula (6.59) will usually be superior to the earlier approximation, the forward difference formula (6.52).

> 5. Numerical Integration > 5.4.2 The Method of Undetermined Coefficients

The method of undetermined coefficients is a procedure used in deriving formulae for numerical differentiation, interpolation and integration. We will explain the method by using it to derive an approximation for f''(x). To approximate f''(x) at some point x = t, write

    f''(t) ≈ D_h^{(2)} f(t) ≡ A f(t+h) + B f(t) + C f(t-h)           (6.62)

with A, B and C unspecified constants. Replace f(t-h) and f(t+h) by the Taylor polynomial approximations

    f(t-h) ≈ f(t) - h f'(t) + (h^2/2) f''(t) - (h^3/6) f'''(t) + (h^4/24) f^{(4)}(t)
    f(t+h) ≈ f(t) + h f'(t) + (h^2/2) f''(t) + (h^3/6) f'''(t) + (h^4/24) f^{(4)}(t)     (6.63)

Including more terms would give higher powers of h; and for small values of h, these additional terms should be much smaller than the terms included in (6.63).

Substituting these approximations into the formula for D_h^{(2)} f(t) and collecting together common powers of h gives us

    D_h^{(2)} f(t) ≈ (A+B+C) f(t) + h(A-C) f'(t) + (h^2/2)(A+C) f''(t)
                     + (h^3/6)(A-C) f'''(t) + (h^4/24)(A+C) f^{(4)}(t)     (6.64)

To have

    D_h^{(2)} f(t) ≈ f''(t)

for arbitrary functions f(x), it is necessary to require

    A + B + C      = 0;   coefficient of f(t)
    h (A - C)      = 0;   coefficient of f'(t)
    (h^2/2)(A + C) = 1;   coefficient of f''(t)

This system has the solution

    A = C = 1/h^2,    B = -2/h^2.

This determines

    D_h^{(2)} f(t) = [f(t+h) - 2f(t) + f(t-h)] / h^2                 (6.65)

To determine an error formula for D_h^{(2)} f(t), substitute A = C = 1/h^2, B = -2/h^2 into (6.64) to obtain

    D_h^{(2)} f(t) ≈ f''(t) + (h^2/12) f^{(4)}(t).

The approximation here arises from not including in the Taylor polynomials (6.63) the corresponding higher powers of h. Thus,

    f''(t) - [f(t+h) - 2f(t) + f(t-h)]/h^2 ≈ -(h^2/12) f^{(4)}(t)    (6.66)

This is an accurate estimate of the error for small values of h. Of course, in a practical situation we would not know f^{(4)}(t). But the error formula shows that the error decreases by a factor of about 4 when h is halved. This can be used to justify Richardson extrapolation to obtain an even more accurate estimate of the error and of f''(t).
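A MATLAB sketch of (6.65), in the spirit of the course's numerical 2nd Derivative.m (names illustrative):

    % Second-derivative approximation (6.65) for f = cos at t = pi/6.
    f = @cos; t = pi/6; d2true = -cos(t);
    h = 0.5;
    for i = 1:5
      D2 = (f(t + h) - 2*f(t) + f(t - h))/h^2;   % eq. (6.65)
      fprintf('h=%8.5f  error=%11.3e\n', h, d2true - D2);
      h = h/2;                                   % error should drop by ~4
    end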


Example: [use numerical_2nd_Derivative.m]

Let f(x) = cos(x), t = π/6, and use (6.65) to calculate f′′(t) = −cos(π/6).

h        D_h^(2)(f)   Error      Ratio
0.5      −0.84813289  −1.789E−2
0.25     −0.86152424  −4.501E−3  3.97
0.125    −0.86489835  −1.127E−3  3.99
0.0625   −0.86574353  −2.819E−4  4.00
0.03125  −0.86595493  −7.048E−5  4.00

The results shown (see the Ratio column) are consistent with the error formula (6.66)

    f′′(t) − [f(t+h) − 2f(t) + f(t−h)] / h² ≈ −(h²/12) f⁽⁴⁾(t).
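This computation is easy to reproduce. A minimal MATLAB sketch of the same experiment (an illustrative reconstruction, not the course script numerical_2nd_Derivative.m):

    f     = @(x) cos(x);
    t     = pi/6;
    exact = -cos(pi/6);                        % true f''(t)
    h     = 0.5;  errOld = NaN;
    for k = 1:5
        D2  = (f(t+h) - 2*f(t) + f(t-h))/h^2;  % formula (6.65)
        err = exact - D2;
        fprintf('%9.5f  %12.8f  %10.3e  %5.2f\n', h, D2, err, errOld/err);
        errOld = err;  h = h/2;                % halving h: the ratio tends to 4
    end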

In the derivation of (6.66), the form (6.62)

    f′′(t) ≈ D_h^(2) f(t) ≡ A f(t+h) + B f(t) + C f(t−h)

was assumed for the approximate derivative. We could equally well have chosen to evaluate f(x) at points other than those used there, for example,

    f′′(t) ≈ A f(t+2h) + B f(t+h) + C f(t)

Or, we could have chosen more evaluation points, as in

    f′′(t) ≈ A f(t+3h) + B f(t+2h) + C f(t+h) + D f(t)

The extra degree of freedom could have been used to obtain a more accurate approximation to f′′(t), by forcing the error term to be proportional to a higher power of h.

Many of the formulae derived by the method of undetermined coefficients can

also be derived by differentiating and evaluating a suitably chosen

interpolation polynomial. But often, it is easier to visualize the desired

formula as a combination of certain function values and to then derive the

proper combination, as was done above for (6.65).
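As a small illustration of the procedure itself, the 3×3 system for the one-sided stencil f′′(t) ≈ A f(t+2h) + B f(t+h) + C f(t) mentioned above can be solved numerically; the following sketch (stencil chosen here for illustration) recovers h²·[A, B, C] = [1, −2, 1], i.e. f′′(t) ≈ [f(t+2h) − 2f(t+h) + f(t)]/h².

    % Match the Taylor coefficients of f(t), h f'(t), h^2 f''(t):
    M = [1 1 1;        % A + B + C = 0   (coefficient of f)
         2 1 0;        % 2A + B    = 0   (coefficient of h f')
         2 1/2 0];     % 2A + B/2  = 1   (coefficient of h^2 f'')
    w = M \ [0; 0; 1]; % w = h^2*[A; B; C]
    disp(w')           % prints 1  -2  1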


> 5. Numerical Integration > 5.4.3 Effects of error in function evaluation

The formulae derived above are useful for differentiating functions that are known analytically and for setting up numerical methods for solving differential equations. Nonetheless, they are very sensitive to errors in the function values, especially if these errors are not sufficiently small compared with the stepsize h used in the differentiation formula. To explore this, we analyze the effect of such errors in the formula D_h^(2) f(t) approximating f′′(t).

Rewrite (6.65), D_h^(2) f(t) = [f(t+h) − 2f(t) + f(t−h)]/h², as

    D_h^(2) f(x1) = [f(x2) − 2f(x1) + f(x0)] / h² ≈ f′′(x1)

where x2 = x1 + h, x0 = x1 − h. Let the actual function values used in the computation be denoted by f0, f1 and f2, with

    f(xi) − fi = εi,   i = 0, 1, 2

the errors in the function values.

Thus, the actual quantity calculated is

    D_h^(2) f(x1) = (f2 − 2f1 + f0) / h²

For the error in this quantity, replace fj by f(xj) − εj, j = 0, 1, 2, to obtain

    f′′(x1) − D_h^(2) f(x1)
        = f′′(x1) − { [f(x2) − ε2] − 2[f(x1) − ε1] + [f(x0) − ε0] } / h²
        = [ f′′(x1) − (f(x2) − 2f(x1) + f(x0)) / h² ] + (ε2 − 2ε1 + ε0) / h²
        ≈ −(h²/12) f⁽⁴⁾(x1) + (ε2 − 2ε1 + ε0) / h²    (6.67)

where the bracketed term was evaluated with (6.66).

The errors ε0, ε1, ε2 are generally random in some interval [−δ, δ]. If the values f0, f1, f2 are experimental data, then δ is a bound on the experimental error. Also, if these function values fi are obtained from computing f(x) on a computer, then the errors εj are a combination of rounding or chopping errors and δ is a bound on these errors.

In either case, (6.67),

    f′′(x1) − D_h^(2) f(x1) ≈ −(h²/12) f⁽⁴⁾(x1) + (ε2 − 2ε1 + ε0) / h²,

yields the approximate inequality

    |f′′(x1) − D_h^(2) f(x1)| ≤ (h²/12) |f⁽⁴⁾(x1)| + 4δ/h²    (6.68)

This error bound suggests that as h → 0, the error will eventually increase, because of the final term 4δ/h².

Example

Calculate D_h^(2) f(x1) for f(x) = cos(x) at x1 = π/6. To show the effect of rounding errors, the values fi are obtained by rounding f(xi) to six significant digits, and the errors satisfy

    |εi| ≤ 5.0 × 10⁻⁷ = δ,   i = 0, 1, 2

Other than these rounding errors, the formula D_h^(2) f(x1) is calculated exactly. The results are

h           D_h^(2)(f)  Error
0.5         −0.848128   −0.017897
0.25        −0.861504   −0.004521
0.125       −0.864832   −0.001193
0.0625      −0.865536   −0.000489
0.03125     −0.865280   −0.000745
0.015625    −0.860160   −0.005865
0.0078125   −0.851968   −0.014057
0.00390625  −0.786432   −0.079593

In this example, the bound (6.68), i.e.,

    |f′′(x1) − D_h^(2) f(x1)| ≤ (h²/12) |f⁽⁴⁾(x1)| + 4δ/h²,

becomes

    |f′′(x1) − D_h^(2) f(x1)| ≤ (h²/12) cos(π/6) + (4/h²)(5 × 10⁻⁷)
                              ≐ 0.0722 h² + (2 × 10⁻⁶)/h² ≡ E(h)

For h = 0.125, the bound E(h) ≐ 0.00126, which is not too far off from the actual error given in the table.

The bound E(h) indicates that there is a smallest value of h, call it h∗, below which the error will begin to increase. To find it, set E′(h) = 0, with its root being h∗. This leads to h∗ ≐ 0.0726, which is consistent with the behaviour of the errors in the table.

One must be very cautious in using numerical differentiation, because of the sensitivity to errors in the function values. This is especially true if the function values are obtained empirically with relatively large experimental errors, as is common in practice. In this latter case, one should probably use a carefully prepared package program for numerical differentiation. Such programs take into account the error in the data, attempting to find numerical derivatives that are as accurate as can be justified by the data.

In the absence of such a program, one should consider producing a cubic spline function that approximates the data, and then use its derivative as a numerical derivative for the data. The cubic spline function could be based on interpolation; or, better for data with relatively large errors, construct a cubic spline that is a least squares approximation to the data. The concept of least squares approximation is introduced in Section 7.1.
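A minimal MATLAB sketch of the spline approach (the data xd, yd and the noise level are invented here for illustration; base MATLAB's spline() interpolates, so for data with large errors a least squares spline, as in Section 7.1, would be the better choice):

    xd = linspace(0, pi, 20);
    yd = cos(xd) + 1e-3*randn(size(xd));     % data with simulated noise
    pp = spline(xd, yd);                     % interpolating cubic spline, pp-form
    c  = pp.coefs;                           % cubic coefficients per piece
    dpp = mkpp(pp.breaks, [3*c(:,1), 2*c(:,2), c(:,3)]);  % derivative spline
    xq  = linspace(0.2, pi-0.2, 5);
    disp([ppval(dpp, xq); -sin(xq)])         % spline derivative vs. true f'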

Rootfinding for Nonlinear Equations


> 3. Rootfinding

Calculating the roots of an equation

f(x) = 0 (7.1)

is a common problem in applied mathematics.

We will explore some simple numerical methods for solving this equation, and also consider some possible difficulties.

The function f(x) of the equation (7.1)

will usually have at least one continuous derivative, and often

we will have some estimate of the root that is being sought.

By using this information, most numerical methods for (7.1) computea sequence of increasingly accurate estimates of the root.

These methods are called iteration methods.

We will study three different methods

1 the bisection method

2 Newton’s method

3 the secant method

and give a general theory for one-point iteration methods.


> 3. Rootfinding > 3.1 The bisection method

In this chapter we assume that f : R → R, i.e., f(x) is a function that is real valued and x is a real variable.

Suppose that

f(x) is continuous on an interval [a, b], and

f(a)f(b) < 0 (7.2)

Then f(x) changes sign on [a, b], and f(x) = 0 has at least one root on theinterval.

Definition

The simplest numerical procedure for finding a root is to repeatedly halve theinterval [a, b], keeping the half for which f(x) changes sign. This procedure iscalled the bisection method, and is guaranteed to converge to a root,denoted here by α.


Suppose that we are given an interval [a, b] satisfying (7.2) and an error tolerance ε > 0. The bisection method consists of the following steps:

B1 Define c = (a + b)/2.

B2 If b − c ≤ ε, then accept c as the root and stop.

B3 If sign[f(b)] · sign[f(c)] ≤ 0, then set a = c. Otherwise, set b = c. Return to step B1.

The interval [a, b] is halved with each loop through steps B1 to B3. The test B2 will be satisfied eventually, and with it the condition |α − c| ≤ ε will be satisfied.

Notice that in step B3 we test sign[f(b)] · sign[f(c)] rather than the product f(b)f(c), in order to avoid the possibility of underflow or overflow in the multiplication of f(b) and f(c).
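A minimal MATLAB sketch of steps B1–B3 (the course's bisect.m may differ in its interface and safeguards):

    function c = bisect_sketch(f, a, b, eps)
    % assumes f(a)*f(b) < 0 on entry
    while true
        c = (a + b)/2;                     % B1
        if b - c <= eps, return, end       % B2: then |alpha - c| <= eps
        if sign(f(b))*sign(f(c)) <= 0      % B3: sign test avoids under/overflow
            a = c;
        else
            b = c;
        end
    end
    end

For instance, bisect_sketch(@(x) x^6 - x - 1, 1, 2, 0.001) returns c ≐ 1.1338, matching the example that follows.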

Example

Find the largest root of

    f(x) ≡ x⁶ − x − 1 = 0    (7.3)

accurate to within ε = 0.001.

With a graph, it is easy to check that 1 < α < 2

We choose a = 1, b = 2; then f(a) = −1, f(b) = 61, and (7.2) is satisfied.


Use bisect.m

The results of the algorithm B1 to B3:

n   a       b       c       b − c    f(c)
1   1.0000  2.0000  1.5000  0.5000    8.8906
2   1.0000  1.5000  1.2500  0.2500    1.5647
3   1.0000  1.2500  1.1250  0.1250   −0.0977
4   1.1250  1.2500  1.1875  0.0625    0.6167
5   1.1250  1.1875  1.1562  0.0312    0.2333
6   1.1250  1.1562  1.1406  0.0156    0.0616
7   1.1250  1.1406  1.1328  0.0078   −0.0196
8   1.1328  1.1406  1.1367  0.0039    0.0206
9   1.1328  1.1367  1.1348  0.0020    0.0004
10  1.1328  1.1348  1.1338  0.00098  −0.0096

Table: Bisection Method for (7.3)

The entry n indicates that the associated row corresponds to iteration number n of steps B1 to B3.

> 3. Rootfinding > 3.1.1 Error bounds

Let a_n, b_n and c_n denote the nth computed values of a, b and c. Then

    b_{n+1} − a_{n+1} = (1/2)(b_n − a_n),   n ≥ 1

and

    b_n − a_n = (1/2^{n−1})(b − a)    (7.4)

where b − a denotes the length of the original interval with which we started. Since the root α ∈ [a_n, c_n] or α ∈ [c_n, b_n], we know that

    |α − c_n| ≤ c_n − a_n = b_n − c_n = (1/2)(b_n − a_n)    (7.5)

This is the error bound for c_n that is used in step B2. Combining it with (7.4), we obtain the further bound

    |α − c_n| ≤ (1/2^n)(b − a).

This shows that the iterates c_n → α as n → ∞.

To see how many iterations will be necessary, suppose we want to have |α − c_n| ≤ ε. This will be satisfied if

    (1/2^n)(b − a) ≤ ε

Taking logarithms of both sides, we can solve this to give

    n ≥ log((b − a)/ε) / log 2

For the previous example (7.3), this results in

    n ≥ log(1/0.001)/log 2 ≐ 9.97

i.e., we need n = 10 iterates, exactly the number computed.

There are several advantages to the bisection method

It is guaranteed to converge.

The error bound (7.5) is guaranteed to decrease by one-half with eachiteration

Many other numerical methods have variable rates of decrease for the error,and these may be worse than the bisection method for some equations.

The principal disadvantage of the bisection method is that it generally converges more slowly than most other methods.

For functions f(x) that have a continuous derivative, other methods are

usually faster. These methods may not always converge; when they do

converge, however, they are almost always much faster than the bisection

method.


> 3. Rootfinding > 3.2 Newton’s method

Figure: The schematic for Newton’s method


There is usually an estimate of the root α, denoted x0. To improve it, consider the line tangent to the graph at the point (x0, f(x0)). If x0 is near α, the tangent line nearly coincides with the graph of y = f(x) for points about α, so the root of the tangent line should nearly equal α. That root is denoted by x1.

The line tangent to the graph of y = f(x) at (x0, f(x0)) is the graph of the linear Taylor polynomial

    p1(x) = f(x0) + f′(x0)(x − x0)

The root x1 of p1(x) satisfies

    f(x0) + f′(x0)(x1 − x0) = 0

i.e.,

    x1 = x0 − f(x0)/f′(x0).

Since x1 is expected to be an improvement over x0 as an estimate of α, we repeat the procedure with x1 as the initial guess:

    x2 = x1 − f(x1)/f′(x1).

Repeating this process, we obtain a sequence of numbers, iterates, x1, x2, x3, . . ., hopefully approaching the root α. The iteration formula

    x_{n+1} = x_n − f(x_n)/f′(x_n),   n = 0, 1, 2, . . .    (7.6)

is referred to as Newton's method, or the Newton–Raphson method, for solving f(x) = 0.

Example

Using Newton's method, solve (7.3), used earlier for the bisection method. Here

    f(x) = x⁶ − x − 1,   f′(x) = 6x⁵ − 1

and the iteration is

    x_{n+1} = x_n − (x_n⁶ − x_n − 1)/(6x_n⁵ − 1),   n ≥ 0    (7.7)

The true root is α ≐ 1.134724138, and x6 ≐ α to nine significant digits. Newton's method may converge slowly at first. However, as the iterates come closer to the root, the speed of convergence increases.

Use newton.m

n  x_n          f(x_n)    x_n − x_{n−1}  α − x_{n−1}
0  1.500000000  8.89E+1
1  1.300490880  2.54E+0   −2.00E−1       −3.65E−1
2  1.181480420  5.38E−1   −1.19E−1       −1.66E−1
3  1.139455590  4.92E−2   −4.20E−2       −4.68E−2
4  1.134777630  5.50E−4   −4.68E−3       −4.73E−3
5  1.134724150  7.11E−8   −5.35E−5       −5.35E−5
6  1.134724140  1.55E−15  −6.91E−9       −6.91E−9

(true root α ≐ 1.134724138)

Table: Newton's Method for x⁶ − x − 1 = 0

Compare these results with the results for the bisection method.
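A minimal MATLAB sketch of the iteration (7.7) (an illustrative stand-in for newton.m):

    f  = @(x) x^6 - x - 1;
    df = @(x) 6*x^5 - 1;
    x  = 1.5;                                    % x0
    for n = 1:6
        dx = f(x)/df(x);
        x  = x - dx;                             % formula (7.6)
        fprintf('%d  %.9f  % .2e\n', n, x, -dx); % iterate and step x_n - x_{n-1}
    end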

Example

One way to compute a/b on early computers (that had hardware arithmetic for addition, subtraction and multiplication) was by multiplying a and 1/b, with 1/b approximated by Newton's method. Apply the method to

    f(x) ≡ b − 1/x = 0

where we assume b > 0. The root is α = 1/b, the derivative is

    f′(x) = 1/x²

and Newton's method is given by

    x_{n+1} = x_n − (b − 1/x_n)/(1/x_n²),

i.e.,

    x_{n+1} = x_n(2 − b x_n),   n ≥ 0    (7.8)

This involves only multiplication and subtraction. The initial guess should be chosen with x0 > 0. For the error, it can be shown that

    Rel(x_{n+1}) = [Rel(x_n)]²,   n ≥ 0    (7.9)

where

    Rel(x_n) = (α − x_n)/α

is the relative error when considering x_n as an approximation to α = 1/b. From (7.9) we must have

    |Rel(x0)| < 1

Otherwise, the error in x_n will not decrease to zero as n increases. This condition means

    −1 < (1/b − x0)/(1/b) < 1

or, equivalently,

    0 < x0 < 2/b    (7.10)

The iteration (7.8), x_{n+1} = x_n(2 − b x_n), n ≥ 0, converges to α = 1/b if and only if the initial guess x0 satisfies 0 < x0 < 2/b.

Figure: The iterative solution of b − 1/x = 0

If the condition on the initial guess is violated, the calculated value of x1 and all further iterates will be negative.

The result (7.9) shows that the convergence is very rapid once we have a somewhat accurate initial guess. For example, suppose |Rel(x0)| = 0.1, which corresponds to a 10% error in x0. Then from (7.9)

    Rel(x1) = 10⁻²,  Rel(x2) = 10⁻⁴,  Rel(x3) = 10⁻⁸,  Rel(x4) = 10⁻¹⁶    (7.11)

Thus, x3 or x4 should be sufficiently accurate for most purposes.
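A short MATLAB demonstration of (7.8); b = 3 and x0 = 0.5 (which satisfies (7.10), since 0 < 0.5 < 2/3) are chosen here for illustration:

    b = 3;  x = 0.5;
    for n = 1:5
        x = x*(2 - b*x);                    % only multiply and subtract
        fprintf('%d  %.16f  Rel = %.2e\n', n, x, (1/b - x)/(1/b));
    end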

> 3. Rootfinding > 3.2.1 Error Analysis

Error analysis

Assume that f ∈ C² in some interval about the root α, and

    f′(α) ≠ 0,    (7.12)

i.e., the graph of y = f(x) is not tangent to the x-axis when the graph intersects it at x = α. The case in which f′(α) = 0 is treated in Section 3.5. Note that combining (7.12) with the continuity of f′(x) implies that f′(x) ≠ 0 for all x near α.

By Taylor's theorem,

    f(α) = f(x_n) + (α − x_n) f′(x_n) + (1/2)(α − x_n)² f′′(c_n)

with c_n an unknown point between α and x_n. Note that f(α) = 0 by assumption; then divide by f′(x_n) to obtain

    0 = f(x_n)/f′(x_n) + α − x_n + (α − x_n)² f′′(c_n)/(2 f′(x_n)).

Quadratic convergence of Newton's method

Solving for α − x_{n+1}, we have

    α − x_{n+1} = (α − x_n)² [ −f′′(c_n)/(2 f′(x_n)) ]    (7.13)

This formula says that the error in x_{n+1} is nearly proportional to the square of the error in x_n. When the initial error is sufficiently small, this shows that the error in the succeeding iterates will decrease very rapidly, just as in (7.11).

Formula (7.13) can also be used to give a formal mathematical proof of the convergence of Newton's method.

Example

For the earlier iteration (7.7), i.e., x_{n+1} = x_n − (x_n⁶ − x_n − 1)/(6x_n⁵ − 1), n ≥ 0, we have f′′(x) = 30x⁴. If we are near the root α, then

    −f′′(c_n)/(2 f′(c_n)) ≈ −f′′(α)/(2 f′(α)) = −30α⁴/(2(6α⁵ − 1)) ≐ −2.42

Thus, for the error in (7.7),

    α − x_{n+1} ≈ −2.42 (α − x_n)²    (7.14)

This explains the rapid convergence of the final iterates in the table. For example, consider the case n = 3, with α − x3 ≐ −4.73E−3. Then (7.14) predicts

    α − x4 ≐ −2.42 (−4.73E−3)² ≐ −5.42E−5

which compares well with the actual error α − x4 ≐ −5.35E−5.

If we assume that the iterate x_n is near the root α, the multiplier on the right-hand side of (7.13), α − x_{n+1} = (α − x_n)²[−f′′(c_n)/(2 f′(x_n))], can be written as

    −f′′(c_n)/(2 f′(x_n)) ≈ −f′′(α)/(2 f′(α)) ≡ M.    (7.15)

Thus,

    α − x_{n+1} ≈ M(α − x_n)²,   n ≥ 0

Multiply both sides by M to get

    M(α − x_{n+1}) ≈ [M(α − x_n)]²

Assuming that all of the iterates are near α, we can then show inductively that

    M(α − x_n) ≈ [M(α − x0)]^(2^n),   n ≥ 0

Since we want α − x_n to converge to zero, this says that we must have

    |M(α − x0)| < 1,   i.e.,   |α − x0| < 1/|M| = |2 f′(α)/f′′(α)|    (7.16)

If the quantity |M| is very large, then x0 will have to be chosen very close to α to obtain convergence. In such a situation, the bisection method is probably an easier method to use. The choice of x0 can be very important in determining whether Newton's method will converge. Unfortunately, there is no single strategy that is always effective in choosing x0.

In most instances, a choice of x0 arises from the physical situation that led to the rootfinding problem. In other instances, graphing y = f(x) will probably be needed, possibly combined with the bisection method for a few iterates.

> 3. Rootfinding > 3.2.2 Error estimation

We are computing a sequence of iterates x_n, and we would like to estimate their accuracy to know when to stop the iteration. To estimate α − x_n, note that, since f(α) = 0, we have

    f(x_n) = f(x_n) − f(α) = f′(ξ_n)(x_n − α)

for some ξ_n between x_n and α, by the mean value theorem. Solving for the error, we obtain

    α − x_n = −f(x_n)/f′(ξ_n) ≈ −f(x_n)/f′(x_n)

provided that x_n is so close to α that f′(x_n) ≐ f′(ξ_n). From the Newton–Raphson formula (7.6), x_{n+1} = x_n − f(x_n)/f′(x_n), this becomes

    α − x_n ≈ x_{n+1} − x_n    (7.17)

This is the standard error estimation formula for Newton's method, and it is usually fairly accurate. However, the formula is not valid if f′(α) = 0, a case that is discussed in Section 3.5.

Example

Consider the error in the entry x3 of the previous table:

    α − x3 ≐ −4.73E−3,   x4 − x3 ≐ −4.68E−3

This illustrates the accuracy of (7.17) for that case.

> 3. Rootfinding > Error Analysis - linear convergence

Linear convergence of Newton's method

Example

Use Newton's method to find a root of f(x) = x². The iteration is

    x_{n+1} = x_n − f(x_n)/f′(x_n) = x_n − x_n²/(2x_n) = x_n/2.

So the method converges to the root α = 0, but the convergence is only linear:

    e_{n+1} = e_n/2.

Example

Use Newton's method to find a root of f(x) = x^m. The iteration is

    x_{n+1} = x_n − x_n^m/(m x_n^{m−1}) = ((m−1)/m) x_n.

The method converges to the root α = 0, again with linear convergence:

    e_{n+1} = ((m−1)/m) e_n.

Linear convergence of Newton's method

Theorem

Assume f ∈ C^{m+1}[a, b] and that f has a root α of multiplicity m. Then Newton's method is locally convergent to α, and the absolute error e_n satisfies

    lim_{n→∞} e_{n+1}/e_n = (m − 1)/m.    (7.18)

Linear convergence of Newton's method

Example

Find the multiplicity of the root α = 0 of f(x) = sin x + x² cos x − x² − x, and estimate the number of steps of Newton's method needed for convergence to 6 correct decimal places (use x0 = 1).

    f(x)    = sin x + x² cos x − x² − x                   ⇒ f(0) = 0
    f′(x)   = cos x + 2x cos x − x² sin x − 2x − 1        ⇒ f′(0) = 0
    f′′(x)  = −sin x + 2 cos x − 4x sin x − x² cos x − 2  ⇒ f′′(0) = 0
    f′′′(x) = −cos x − 6 sin x − 6x cos x + x² sin x      ⇒ f′′′(0) = −1

Hence α = 0 is a triple root, m = 3, so e_{n+1} ≈ (2/3) e_n. Since e0 = 1, we need to solve

    (2/3)^n < 0.5 × 10⁻⁶,   giving   n > (log₁₀ 0.5 − 6)/log₁₀(2/3) ≈ 35.78.

> 3. Rootfinding > Modified Newton’s Method

Modified Newton’s Method

If the multiplicity of a root is known in advance, convergence of Newton’sMethod can be improved.

Theorem

Assume f ∈ Cm+1[a, b] which contains a root α of multiplicity m > 1. ThenModified Newton’s Method

xn+1 = xn −mf(xn)f ′(xn)

(7.19)

converges locally and quadratically to α.

Proof. MNM: mf(xn) = (xn − xn+1)f ′(xn).Taylor’s formula:

0 = xn−xn+1m f ′(xn) + (α− xn)f ′(xn) + f ′′(c) (α−xn)2

2!

= α−xn+1m f ′(xn) + (α− xn)f ′(xn)

(1− 1

m

)+ f ′′(c) (α−xn)2

2!

= α−xn+1m f ′(xn) + (α− xn)2

(1− 1

m

)f ′′(ξ) + (α− xn)2 f

′′(c)2!

3. Rootfinding Math 1070

Page 348: Numerical Mathematical Analysis

> 3. Rootfinding > Nonconvergent behaviour of Newton’s Method

Failure of Newton’s Method

Example

Apply Newton’s Method to f(x) = −x4 + 3x2 + 2 with starting guess x0 = 1.

The Newton formula is

xn+1 = xn − −x4+3x2

n+2−4x3

n+6xn,

which gives

x1 = −1, x2 = 1, . . .

3. Rootfinding Math 1070

Page 349: Numerical Mathematical Analysis

> 3. Rootfinding > Nonconvergent behaviour of Newton’s Method

Failure of Newton’s Method

−2 −1.5 −1 −0.5 0 0.5 1 1.5 2

−1

−0.5

0

0.5

1

1.5

2

2.5

3

3.5

4

Failure of Newton"s Method for −x4 + 3x2 + 2=0

x1

x0

x1

x0

x1

x0

3. Rootfinding Math 1070

Page 350: Numerical Mathematical Analysis

> 3. Rootfinding > 3.3 The secant method

The Newton method is based on approximating the graph of y = f(x) with a tangent line and on then using a root of this straight line as an approximation to the root α of f(x). From this perspective, other straight-line approximations to y = f(x) would also lead to methods for approximating a root of f(x). One such straight-line approximation leads to the secant method.

Assume that two initial guesses to α are known and denote them byx0 and x1.

They may occur

on opposite sides of α, or

on the same side of α.


Figure: A schematic of the secant method: x1 < α < x0

Figure: A schematic of the secant method: α < x1 < x0

To derive a formula for x2, we proceed in a manner similar to that used to derive Newton's method: find the equation of the line and then find its root x2. The equation of the line is given by

    y = p(x) ≡ f(x1) + (x − x1) · [f(x1) − f(x0)]/(x1 − x0)

Solving p(x2) = 0, we obtain

    x2 = x1 − f(x1) · (x1 − x0)/[f(x1) − f(x0)].

Having found x2, we can drop x0 and use x1, x2 as a new set of approximate values for α. This leads to an improved value x3; and this process can be continued indefinitely. Doing so, we obtain the general formula for the secant method

    x_{n+1} = x_n − f(x_n) · (x_n − x_{n−1})/[f(x_n) − f(x_{n−1})],   n ≥ 1.    (7.20)

It is called a two-point method, since two approximate values are needed to obtain an improved value. The bisection method is also a two-point method, but the secant method will almost always converge faster than bisection.

Figure: Two steps of the secant method for f(x) = x³ + x − 1, x0 = 0, x1 = 1.

Use secant.m

Example

We solve the equation f(x) ≡ x⁶ − x − 1 = 0.

n  x_n         f(x_n)    x_n − x_{n−1}  α − x_{n−1}
0  2.0          61.0
1  1.0         −1.0      −1.0
2  1.01612903  −9.15E−1   1.61E−2        1.35E−1
3  1.19057777   6.57E−1   1.74E−1        1.19E−1
4  1.11765583  −1.68E−1  −7.29E−2       −5.59E−2
5  1.13253155  −2.24E−2   1.49E−2        1.71E−2
6  1.13481681   9.54E−4   2.29E−3        2.19E−3
7  1.13472365  −5.07E−6  −9.32E−5       −9.27E−5
8  1.13472414  −1.13E−9   4.92E−7        4.92E−7

The iterate x8 equals α rounded to nine significant digits. As with the Newton method (7.7) for this equation, the initial iterates do not converge rapidly. But as the iterates become closer to α, the speed of convergence increases.
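A minimal MATLAB sketch of the iteration (7.20) (an illustrative stand-in for secant.m; note that only one new function value is needed per step):

    f  = @(x) x^6 - x - 1;
    x0 = 2.0;  f0 = f(x0);
    x1 = 1.0;  f1 = f(x1);
    for n = 2:8
        x2 = x1 - f1*(x1 - x0)/(f1 - f0);   % formula (7.20)
        x0 = x1;  f0 = f1;                  % retain the previous value
        x1 = x2;  f1 = f(x1);               % one new evaluation per step
        fprintf('%d  %.8f  % .2e\n', n, x1, f1);
    end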


> 3. Rootfinding > 3.3.1 Error Analysis

By using techniques from calculus and some algebraic manipulation, it is possible to show that the iterates x_n of (7.20) satisfy

    α − x_{n+1} = (α − x_n)(α − x_{n−1}) · [−f′′(ξ_n)/(2 f′(ζ_n))].    (7.21)

The unknown number ζ_n is between x_n and x_{n−1}, and the unknown number ξ_n is between the largest and the smallest of the numbers α, x_n and x_{n−1}. The error formula closely resembles the Newton error formula (7.13). This should be expected, since the secant method can be considered an approximation of Newton's method, based on using

    f′(x_n) ≈ [f(x_n) − f(x_{n−1})]/(x_n − x_{n−1}).

Check that the use of this in the Newton formula (7.6) will yield (7.20).

The formula (7.21) can be used to obtain the further error result that if x0 and x1 are chosen sufficiently close to α, then we have convergence and

    lim_{n→∞} |α − x_{n+1}| / |α − x_n|^r = |f′′(α)/(2 f′(α))|^{r−1} ≡ c

where r = (√5 + 1)/2 ≐ 1.62. Thus,

    |α − x_{n+1}| ≈ c |α − x_n|^{1.62}    (7.22)

as x_n approaches α. Compare this with the Newton estimate (7.15), in which the exponent is 2 rather than 1.62. Thus, Newton's method converges more rapidly than the secant method. Also, the constant c in (7.22) plays the same role as M in (7.15), and they are related by

    c = |M|^{r−1}.

The restriction (7.16) on the initial guess for Newton's method can be replaced by a similar one for the secant iterates, but we omit it.

Finally, the result (7.22) can be used to justify the error estimate

    α − x_{n−1} ≈ x_n − x_{n−1}

for iterates x_n that are sufficiently close to the root.

Example

For the iterate x5 in the previous table,

    α − x5 ≐ 2.19E−3,   x6 − x5 ≐ 2.29E−3    (7.23)

> 3. Rootfinding > 3.3.2 Comparison of Newton and Secant methods

From the foregoing discussion, Newton's method converges more rapidly than the secant method. Thus, Newton's method should require fewer iterations to attain a given error tolerance.

However, Newton's method requires two function evaluations per iteration, that of f(x_n) and f′(x_n), while the secant method requires only one evaluation, f(x_n), if it is programmed carefully to retain the value of f(x_{n−1}) from the preceding iteration. Thus, the secant method will require less time per iteration than the Newton method.

The decision as to which method should be used will depend on the factorsjust discussed, including the difficulty or expense of evaluating f ′(xn);

and it will depend on intangible human factors, such as convenience of use.

Newton’s method is very simple to program and to understand; but for many

problems with a complicated f ′(x), the secant method will probably be faster

in actual running time on a computer.


General remarks

The derivation of both the Newton and secant methods illustrates a general principle of numerical analysis: when trying to solve a problem for which there is no direct or simple method of solution, approximate it by another problem that you can solve more easily. In both cases, we have replaced the solution of f(x) = 0 with the solution of a much simpler rootfinding problem for a linear equation.

GENERAL OBSERVATIONWhen dealing with problems involving differentiable functions f(x),move to a nearby problem by approximating each such f(x) with alinear problem.

The linearization of mathematical problems is common throughout applied

mathematics and numerical analysis.


> 3. Rootfinding > 3.3 The MATLAB function fzero

MATLAB contains the rootfinding routine fzero, which uses ideas involved in the bisection method and the secant method. As with many MATLAB programs, there are several possible calling sequences. The command

    root = fzero(f_name, [a, b])

produces a root within [a, b], where it is assumed that f(a)f(b) ≤ 0. The command

    root = fzero(f_name, x0)

tries to find a root of the function near x0.

The default error tolerance is the maximum precision of the machine, although this can be changed by the user. This is an excellent rootfinding routine, combining guaranteed convergence with high efficiency.
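For instance, applied to the earlier example x⁶ − x − 1 = 0, the two calling sequences read:

    f = @(x) x.^6 - x - 1;
    root1 = fzero(f, [1, 2])    % bracket [a,b] with f(1)*f(2) < 0
    root2 = fzero(f, 1.5)       % search near the guess x0 = 1.5
    % both return 1.134724138...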

> 3. Rootfinding > Method of False Position

There are three generalizations of the secant method that are also important. The Method of False Position, or Regula Falsi, is similar to the bisection method, but the midpoint is replaced by a secant-like approximation. Given an interval [a, b] that brackets a root (assume that f(a)f(b) < 0), define the next point

    c = [b f(a) − a f(b)] / [f(a) − f(b)]

as in the secant method; but unlike the secant method, the new point is guaranteed to lie in [a, b], since the points (a, f(a)) and (b, f(b)) lie on opposite sides of the x-axis. The new interval, either [a, c] or [c, b], is chosen according to whether f(a)f(c) < 0 or f(c)f(b) < 0, respectively, and still brackets a root.

Given an interval [a, b] such that f(a)f(b) < 0
for i = 1, 2, 3, . . .
    c = [b f(a) − a f(b)] / [f(a) − f(b)]
    if f(c) = 0, stop, end
    if f(a)f(c) < 0
        b = c
    else
        a = c
    end
end

The Method of False Position at first appears to be an improvement on both

the Bisection Method and the Secant Method, taking the best properties of

each. However, while the Bisection method guarantees cutting the

uncertainty by 1/2 on each step, False Position makes no such promise, and

for some examples can converge very slowly.
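A direct MATLAB transcription of the loop above (the iteration cap nmax is an added safeguard, not part of the algorithm as stated):

    function c = falsepos_sketch(f, a, b, nmax)
    % assumes f(a)*f(b) < 0 on entry
    for i = 1:nmax
        c = (b*f(a) - a*f(b))/(f(a) - f(b));
        if f(c) == 0, return, end
        if f(a)*f(c) < 0
            b = c;
        else
            a = c;
        end
    end
    end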


Example

Apply the Method of False Position on the initial interval [−1, 1] to find the root r = 0 of f(x) = x³ − 2x² + (3/2)x.

Given x0 = −1, x1 = 1 as the initial bracketing interval, we compute the new point

    x2 = [x1 f(x0) − x0 f(x1)] / [f(x0) − f(x1)] = [1 · (−9/2) − (−1) · (1/2)] / (−9/2 − 1/2) = 4/5.

Since f(−1)f(4/5) < 0, the new bracketing interval is [x0, x2] = [−1, 0.8]. This completes the first step. Note that the uncertainty in the solution has decreased by far less than a factor of 1/2. As seen in the figures below, further steps continue to make slow progress toward the root at x = 0. Both the secant method and the Method of False Position converge slowly to the root r = 0.

Figure (a): The secant method converges slowly to the root r = 0.

Figure (b): The Method of False Position converges slowly to the root r = 0.

> 3. Rootfinding > 3.4 Fixed point iteration

The Newton method (7.6),

    x_{n+1} = x_n − f(x_n)/f′(x_n),   n = 0, 1, 2, . . . ,

and the secant method (7.20),

    x_{n+1} = x_n − f(x_n) · (x_n − x_{n−1})/[f(x_n) − f(x_{n−1})],   n ≥ 1,

are examples of one-point and two-point iteration methods, respectively. In this section we give a more general introduction to iteration methods, presenting a general theory for one-point iteration formulae.

Solve the equation x = g(x) for a root α = g(α) by the iteration

    x0 given,   x_{n+1} = g(x_n),   n = 0, 1, 2, . . .

Example: Newton's method has this form,

    x_{n+1} = x_n − f(x_n)/f′(x_n) := g(x_n),

where g(x) = x − f(x)/f′(x).

Definition

The solution α is called a fixed point of g.

The solution of f(x) = 0 can always be rewritten as a fixed point problem, e.g.,

    x + f(x) = x   ⟹   g(x) = x + f(x).

Example

As a motivational example, consider solving the equation

    x² − 5 = 0    (7.24)

for the root α = √5 ≐ 2.2361.

We give four iteration methods to solve this equation:

    I1. x_{n+1} = 5 + x_n − x_n²           (from x = x + c(x² − a), c ≠ 0)
    I2. x_{n+1} = 5/x_n                    (from x = a/x)
    I3. x_{n+1} = 1 + x_n − (1/5) x_n²     (from x = x + c(x² − a), c ≠ 0)
    I4. x_{n+1} = (1/2)(x_n + 5/x_n)       (from x = (1/2)(x + a/x))

All four iterations have the property that if the sequence {x_n : n ≥ 0} has a limit α, then α is a root of (7.24). For each equation, check this as follows: replace x_n and x_{n+1} by α, and then show that this implies α = ±√5.

n  x_n: I1   x_n: I2  x_n: I3  x_n: I4
0  2.5       2.5      2.5      2.5
1  1.25      2.0      2.25     2.25
2  4.6875    2.5      2.2375   2.2361
3  −12.2852  2.0      2.2362   2.2361

Table: The iterations I1 to I4

To explain these numerical results, we present a general theory for one-point iteration formulae. The iterations I1 to I4 all have the form

    x_{n+1} = g(x_n)

for appropriate continuous functions g(x). For example, with I1, g(x) = 5 + x − x². If the iterates x_n converge to a point α, then

    lim_{n→∞} x_{n+1} = lim_{n→∞} g(x_n),   hence   α = g(α).

Thus α is a solution of the equation x = g(x), and α is called a fixed point of the function g.
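The table is easy to regenerate; a minimal MATLAB sketch running the four iterations from x0 = 2.5:

    gs = {@(x) 5 + x - x^2, ...             % I1
          @(x) 5/x, ...                     % I2
          @(x) 1 + x - x^2/5, ...           % I3
          @(x) (x + 5/x)/2};                % I4
    for j = 1:4
        x = 2.5;
        for n = 1:3, x = gs{j}(x); end
        fprintf('I%d: x3 = %.4f\n', j, x);  % compare with the table above
    end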


Existence of a fixed point

In this section, a general theory is given to explain when the iteration x_{n+1} = g(x_n) will converge to a fixed point of g. We begin with a lemma on the existence of solutions of x = g(x).

Lemma

Let g ∈ C[a, b]. Assume that g([a, b]) ⊂ [a, b], i.e., ∀x ∈ [a, b], g(x) ∈ [a, b]. Then x = g(x) has at least one solution α in the interval [a, b].

Proof. Define the function f(x) = x − g(x). It is continuous for a ≤ x ≤ b. Moreover,

    f(a) = a − g(a) ≤ 0,   f(b) = b − g(b) ≥ 0.

By the intermediate value theorem, ∃x ∈ [a, b] such that f(x) = 0, i.e., x = g(x). ∎

Figure: the graphs of y = x and y = g(x).

The solutions α are the x-coordinates of the intersection points of thegraphs of y = x and y = g(x).


Lipschitz continuity

Definition

g : [a, b] → R is called Lipschitz continuous with constant λ > 0 (denoted g ∈ Lip_λ[a, b]) if

    |g(x) − g(y)| ≤ λ|x − y|   ∀x, y ∈ [a, b].

Definition

g : [a, b] → R is called a contraction mapping if g ∈ Lip_λ[a, b] with λ < 1.

Existence and uniqueness of a fixed point

Lemma

Let g ∈ Lip_λ[a, b] with λ < 1 and g([a, b]) ⊂ [a, b]. Then x = g(x) has exactly one solution α. Moreover, for x_{n+1} = g(x_n), x_n → α for any x0 ∈ [a, b], and

    |α − x_n| ≤ (λ^n/(1 − λ)) |x1 − x0|.

Proof.
Existence: follows from the previous Lemma.
Uniqueness: assume there are two solutions, α = g(α) and β = g(β). Then

    |α − β| = |g(α) − g(β)| ≤ λ|α − β|,   so   (1 − λ)|α − β| ≤ 0,

and since 1 − λ > 0, this forces α = β.

Convergence of the iterates

If x_n ∈ [a, b], then g(x_n) = x_{n+1} ∈ [a, b]; hence {x_n}_{n≥0} ⊂ [a, b].

Linear convergence with rate λ:

    |α − x_n| = |g(α) − g(x_{n−1})| ≤ λ|α − x_{n−1}| ≤ . . . ≤ λ^n |α − x0|    (7.25)

Also,

    |x0 − α| = |x0 − x1 + x1 − α| ≤ |x0 − x1| + |x1 − α| ≤ |x0 − x1| + λ|x0 − α|

    ⟹ |x0 − α| ≤ |x1 − x0|/(1 − λ)    (7.26)

Combining (7.25) and (7.26),

    |x_n − α| ≤ λ^n |x0 − α| ≤ (λ^n/(1 − λ)) |x1 − x0|.  ∎

Error estimate

    |α − x_n| ≤ |α − x_{n+1}| + |x_{n+1} − x_n| ≤ λ|α − x_n| + |x_{n+1} − x_n|

    ⟹ |α − x_n| ≤ (1/(1 − λ)) |x_{n+1} − x_n|

Since |α − x_{n+1}| ≤ λ|α − x_n|, this gives

    |α − x_{n+1}| ≤ (λ/(1 − λ)) |x_{n+1} − x_n|

Assume g′(x) exists on [a, b]. By the mean value theorem,

    g(x) − g(y) = g′(ξ)(x − y),   ξ ∈ [a, b],   ∀x, y ∈ [a, b].

Define

    λ = max_{x∈[a,b]} |g′(x)|.

Then g ∈ Lip_λ[a, b]:

    |g(x) − g(y)| ≤ |g′(ξ)||x − y| ≤ λ|x − y|.

Theorem 2.6

Assume g ∈ C¹[a, b], g([a, b]) ⊂ [a, b], and λ := max_{x∈[a,b]} |g′(x)| < 1. Then

1. x = g(x) has a unique solution α in [a, b],
2. x_n → α for any x0 ∈ [a, b],
3. |α − x_n| ≤ (λ^n/(1 − λ)) |x1 − x0|,
4. lim_{n→∞} (α − x_{n+1})/(α − x_n) = g′(α).

Proof.

    α − x_{n+1} = g(α) − g(x_n) = g′(ξ_n)(α − x_n),   ξ_n between α and x_n. ∎

Theorem 2.7

Assume α solves x = g(x), with g ∈ C¹(I_α) for some interval I_α ∋ α and |g′(α)| < 1. Then Theorem 2.6 holds for x0 close enough to α.

Proof. Since |g′(α)| < 1, by continuity

    |g′(x)| < 1   for x ∈ I_α = [α − ε, α + ε].

Take x0 ∈ I_α; then x1 ∈ I_α, since

    |x1 − α| = |g(x0) − g(α)| = |g′(ξ)(x0 − α)| ≤ |g′(ξ)||x0 − α| < |x0 − α| < ε.

By induction, x_n ∈ I_α for all n, so Theorem 2.6 holds with [a, b] = I_α. ∎

Importance of |g′(α)| < 1:

If |g′(α)| > 1 and x_n is close to α, then

    |x_{n+1} − α| = |g′(ξ_n)||x_n − α|   ⟹   |x_{n+1} − α| > |x_n − α|

⟹ divergence.

When g′(α) = 1, no conclusion can be drawn; and even if convergence were to occur, the method would be far too slow to be practical.

Examples

Recall α = √5.

1. g(x) = 5 + x − x²; g′(x) = 1 − 2x, g′(α) = 1 − 2√5 ≐ −3.47. Thus the iteration will not converge to √5.

2. g(x) = 5/x; g′(x) = −5/x², g′(α) = −5/(√5)² = −1. We cannot conclude that the iteration converges or diverges. From the table, it is clear that the iterates will not converge to α.

3. g(x) = 1 + x − (1/5)x²; g′(x) = 1 − (2/5)x, g′(α) = 1 − (2/5)√5 ≐ 0.106, i.e., the iteration will converge. Also,

    |α − x_{n+1}| ≈ 0.106 |α − x_n|

when x_n is close to α. The errors will decrease by approximately a factor of 0.1 with each iteration.

4. g(x) = (1/2)(x + 5/x); g′(α) = 0: convergence. Note that this is Newton's method for computing √5.


Consider x = g(x) with g(x) = x + c(x² − 3). What value of c will give a convergent iteration?

    g′(x) = 1 + 2cx,   α = √3.

We need |g′(α)| < 1:

    −1 < 1 + 2c√3 < 1.

Optimal choice: 1 + 2c√3 = 0 ⟹ c = −1/(2√3).

The figures below show the possible behaviour of the fixed point iterates x_n for various sizes of g′(α). To see the convergence, consider the case of x1 = g(x0), the height of the graph of y = g(x) at x0. We bring the number x1 back to the x-axis by using the line y = x and the height y = x1. We continue this with each iterate, obtaining a stairstep behaviour when g′(α) > 0. When g′(α) < 0, the iterates oscillate around the fixed point α, as can be seen.

Figure: 0 < g′(α) < 1


Figure: −1 < g′(α) < 0


Figure: 1 < g′(α)


Figure: g′(α) < −1


The results from the iteration for

    g(x) = 1 + x − (1/5)x²,   g′(α) ≐ 0.106,

along with the ratios

    r_n = (α − x_n)/(α − x_{n−1}).    (7.27)

Empirically, the values of r_n converge to g′(α) ≐ 0.105573, which agrees with

    lim_{n→∞} (α − x_{n+1})/(α − x_n) = g′(α).

n  x_n         α − x_n   r_n
0  2.5         −2.64E−1
1  2.25        −1.39E−2  0.0528
2  2.2375      −1.43E−3  0.1028
3  2.23621875  −1.51E−4  0.1053
4  2.23608389  −1.59E−5  0.1055
5  2.23606966  −1.68E−6  0.1056
6  2.23606815  −1.77E−7  0.1056
7  2.23606800  −1.87E−8  0.1056

Table: The iteration x_{n+1} = 1 + x_n − (1/5) x_n²

We need a more precise way to deal with the concept of the speed of convergence of an iteration method.

Definition

We say that a sequence {x_n : n ≥ 0} converges to α with an order of convergence p ≥ 1 if

    |α − x_{n+1}| ≤ c |α − x_n|^p,   n ≥ 0

for some constant c ≥ 0.

The cases p = 1, p = 2, p = 3 are referred to as linear convergence, quadratic convergence and cubic convergence, respectively. Newton's method usually converges quadratically, and the secant method has order of convergence p = (1 + √5)/2. For linear convergence we make the additional requirement that c < 1; otherwise, the error α − x_n need not converge to zero.

If |g′(α)| < 1, then the formula

    |α − x_{n+1}| ≤ |g′(ξ_n)| |α − x_n|

shows that the iterates x_n are linearly convergent. If in addition g′(α) ≠ 0, then the formula

    |α − x_{n+1}| ≈ |g′(α)| |α − x_n|

proves that the convergence is exactly linear, with no higher order of convergence being possible. In this case, we call the value of g′(α) the linear rate of convergence.

High order one-point methods

Theorem 2.8

Assume g ∈ C^p(I_α) for some interval I_α containing α, and

    g′(α) = g′′(α) = . . . = g^(p−1)(α) = 0,   p ≥ 2.

Then, for x0 close enough to α, x_n → α and

    lim_{n→∞} (α − x_{n+1})/(α − x_n)^p = (−1)^{p−1} g^(p)(α)/p!,

i.e., the convergence is of order p.

Proof: Expanding g(x_n) about α and using the hypotheses,

    x_{n+1} = g(x_n)
            = g(α) + (x_n − α) g′(α) + . . . + [(x_n − α)^{p−1}/(p−1)!] g^(p−1)(α) + [(x_n − α)^p/p!] g^(p)(ξ_n)
            = α + [(x_n − α)^p/p!] g^(p)(ξ_n),

so

    α − x_{n+1} = −[(x_n − α)^p/p!] g^(p)(ξ_n)

and

    (α − x_{n+1})/(α − x_n)^p = (−1)^{p−1} g^(p)(ξ_n)/p!  →  (−1)^{p−1} g^(p)(α)/p!.  ∎

Example: Newton’s method

xn+1 = xn −f(xn)f ′(xn)

, n > 0

= g(xn), g(xn) = x− f(x)f ′(x)

;

g′(x) =ff ′′

(f ′)2g′(α) = 0

g′′(x) =f ′f ′′ + ff ′′′

(f ′)2− 2

ff ′′

(f ′)3, g′′(α) =

f ′′(α)f ′(α)

Theorem 2.8 with p = 2:

limn→∞

α− xn+1

(α− xn)2= −g

′′(α)2

= −12f ′′(α)f ′(α)

.

3. Rootfinding Math 1070

Page 396: Numerical Mathematical Analysis

> 3. Rootfinding > 3.4 Fixed point iteration

Parallel Chords Method (a fixed point method)

    x_{n+1} = x_n − f(x_n)/a

E.g., a = f′(x0):

    x_{n+1} = x_n − f(x_n)/f′(x0) = g(x_n).

We need |g′(α)| < 1 for convergence, i.e.,

    |1 − f′(α)/a| < 1.

This gives linear convergence with rate 1 − f′(α)/a (Theorem 2.6). If a = f′(x0) and x0 is close enough to α, then |1 − f′(α)/a| is small and the condition is satisfied.

> 3. Rootfinding > 3.4.1 Aitken Error Estimation and Extrapolation

Aitken extrapolation for linearly convergent sequences

Recall from Theorem 2.6 that for x_{n+1} = g(x_n) with x_n → α,

    (α − x_{n+1})/(α − x_n) → g′(α).

Assuming linear convergence, i.e., g′(α) ≠ 0, we derive an estimate for the error and use it to accelerate convergence.

    α − x_n = (α − x_{n−1}) + (x_{n−1} − x_n)    (7.28)

    α − x_n = g(α) − g(x_{n−1}) = g′(ξ_{n−1})(α − x_{n−1}),

so

    α − x_{n−1} = [1/g′(ξ_{n−1})](α − x_n)    (7.29)

From (7.28)–(7.29),

    α − x_n = [1/g′(ξ_{n−1})](α − x_n) + (x_{n−1} − x_n)

    α − x_n = [g′(ξ_{n−1})/(1 − g′(ξ_{n−1}))](x_n − x_{n−1})

    α − x_n = [g′(ξ_{n−1})/(1 − g′(ξ_{n−1}))](x_n − x_{n−1}),
    g′(ξ_{n−1})/(1 − g′(ξ_{n−1})) ≈ g′(α)/(1 − g′(α)).

We need an estimate for g′(α). Define

    λ_n = (x_n − x_{n−1})/(x_{n−1} − x_{n−2})

and recall

    α − x_{n+1} = g(α) − g(x_n) = g′(ξ_n)(α − x_n),   ξ_n between α and x_n,   n ≥ 0.

    λ_n = [(α − x_{n−1}) − (α − x_n)] / [(α − x_{n−2}) − (α − x_{n−1})]
        = [(α − x_{n−1}) − g′(ξ_{n−1})(α − x_{n−1})] / [(α − x_{n−1})/g′(ξ_{n−2}) − (α − x_{n−1})]
        = g′(ξ_{n−2}) [1 − g′(ξ_{n−1})] / [1 − g′(ξ_{n−2})]

    λ_n → g′(α) as ξ_n → α,   so   λ_n ≈ g′(α).

Aitken Error Formula

    α − x_n ≈ [λ_n/(1 − λ_n)](x_n − x_{n−1})    (7.30)

From (7.30),

    α ≈ x_n + [λ_n/(1 − λ_n)](x_n − x_{n−1})    (7.31)

Define the

Aitken Extrapolation Formula

    x̂_n = x_n + [λ_n/(1 − λ_n)](x_n − x_{n−1})    (7.32)

Example

Repeat the example for I3. The table contains the differences x_n − x_{n−1}, the ratios λ_n, and the estimated error from α − x_n ≈ [λ_n/(1 − λ_n)](x_n − x_{n−1}), given in the column Estimate. Compare the column Estimate with the error column in the previous table.

n  x_n         x_n − x_{n−1}  λ_n     Estimate
0  2.5
1  2.25        −2.50E−1
2  2.2375      −1.25E−2       0.0500  −6.58E−4
3  2.23621875  −1.28E−3       0.1025  −1.46E−4
4  2.23608389  −1.35E−4       0.1053  −1.59E−5
5  2.23606966  −1.42E−5       0.1055  −1.68E−6
6  2.23606815  −1.50E−6       0.1056  −1.77E−7
7  2.23606800  −1.59E−7       0.1056  −1.87E−8

Table: The iteration x_{n+1} = 1 + x_n − (1/5) x_n² and Aitken error estimation

Algorithm (Aitken)

Given g, x0, ε; assume |g′(α)| < 1 and x_n → α linearly.

1. x1 = g(x0), x2 = g(x1)
2. x̂2 = x2 + [λ2/(1 − λ2)](x2 − x1), where λ2 = (x2 − x1)/(x1 − x0)
3. if |x̂2 − x2| ≤ ε, then root = x̂2; exit
4. set x0 = x̂2, go to (1)
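A minimal MATLAB sketch of this algorithm, applied to g(x) = 1 + x − x²/5 from the earlier example (the tolerance and iteration cap are chosen here for illustration):

    g   = @(x) 1 + x - x^2/5;
    x0  = 2.5;  tol = 1e-10;
    for pass = 1:20
        x1  = g(x0);  x2 = g(x1);               % step (1)
        lam = (x2 - x1)/(x1 - x0);              % lambda_2
        xh  = x2 + lam/(1 - lam)*(x2 - x1);     % step (2): extrapolate
        if abs(xh - x2) <= tol, break, end      % step (3)
        x0  = xh;                               % step (4)
    end
    fprintf('root = %.10f\n', xh)               % approx sqrt(5) = 2.2360679775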


General remarks

There are a number of reasons to perform theoretical error analyses of numerical methods. We want to better understand the method:

when it will perform well,

when it will perform poorly, and perhaps,

when it may not work at all.

With a mathematical proof, we convince ourselves of the correctness of a numerical method under precisely stated hypotheses on the problem being solved. Finally, we often can improve on the performance of a numerical method. The use of the theorem to obtain the Aitken extrapolation formula is an illustration of the following: by understanding the behaviour of the error in a numerical method, it is often possible to improve on that method and to obtain another, more rapidly convergent method.

Quasi-Newton Iterates

To solve f(x) = 0, take x0 and iterate

    x_{k+1} = x_k − f(x_k)/a_k,   k = 0, 1, . . .

1. a_k = f′(x_k) ⇒ Newton's method

2. a_k = [f(x_k) − f(x_{k−1})]/(x_k − x_{k−1}) ⇒ secant method

3. a_k = a = constant (e.g. a_k = f′(x0)) ⇒ parallel chords method

4. a_k = [f(x_k + h_k) − f(x_k)]/h_k, h_k > 0 ⇒ finite difference Newton method

If |h_k| < c|f(x_k)|, then the convergence is quadratic. One needs h_k ≥ h ≈ √δ to control the noise in evaluating f.

Quasi-Newton Iterates

1. a_k = [f(x_k + f(x_k)) − f(x_k)]/f(x_k) ⇒ Steffensen's method. This is the finite difference method with h_k = f(x_k) ⇒ quadratic convergence.

2. a_k = [f(x_k) − f(x_{k′})]/(x_k − x_{k′}), where k′ is the largest index < k such that f(x_k)f(x_{k′}) < 0 ⇒ Regula Falsi. Need x0, x1 with f(x0)f(x1) < 0:

    x2 = x1 − f(x1)(x1 − x0)/[f(x1) − f(x0)]
    x3 = x2 − f(x2)(x2 − x0)/[f(x2) − f(x0)]

> 3. Rootfinding > 3.4.2 High-Order Iteration Methods

The convergence formula

    α − x_{n+1} ≈ g′(α)(α − x_n)

gives less information in the case g′(α) = 0, although the convergence is clearly quite good. To improve on the results in the theorem, consider the Taylor expansion of g(x_n) about α, assuming that g(x) is twice continuously differentiable:

    g(x_n) = g(α) + (x_n − α) g′(α) + (1/2)(x_n − α)² g′′(c_n)    (7.33)

with c_n between x_n and α. Using x_{n+1} = g(x_n), α = g(α), and g′(α) = 0, we have

    x_{n+1} = α + (1/2)(x_n − α)² g′′(c_n),
    α − x_{n+1} = −(1/2)(α − x_n)² g′′(c_n)    (7.34)

    lim_{n→∞} (α − x_{n+1})/(α − x_n)² = −(1/2) g′′(α)    (7.35)

If g′′(α) ≠ 0, then this formula shows that the iteration x_{n+1} = g(x_n) is of order 2, i.e., quadratically convergent.

If also g′′(α) = 0, and perhaps some higher-order derivatives are zero at α, then expand the Taylor series in (7.33) through higher-order terms, until the final error term contains a derivative of g that is nonzero at α. This leads to methods with an order of convergence greater than 2.

As an example, consider Newton's method as a fixed-point iteration:

    x_{n+1} = g(x_n),   g(x) = x − f(x)/f′(x).    (7.36)

Then,

    g′(x) = f(x) f′′(x)/[f′(x)]²,

and if f′(α) ≠ 0, then g′(α) = 0. Similarly, it can be shown that g′′(α) ≠ 0 if, moreover, f′′(α) ≠ 0. If we use (7.35), these results show that Newton's method is of order 2, provided that f′(α) ≠ 0 and f′′(α) ≠ 0.

> 3. Rootfinding > 3.5 The Numerical Evaluation of Multiple Roots

We will examine two classes of problems for which the methods of Sections 3.1 to 3.4 do not perform well. Often there is little that a numerical analyst can do to improve these problems, but one should be aware of their existence and of the reason for their ill-behaviour.

We begin with functions that have a multiple root. The root α of f(x) is said to be of multiplicity m if

    f(x) = (x − α)^m h(x),   h(α) ≠ 0    (7.37)

for some continuous function h(x), with m a positive integer. If we assume that f(x) is sufficiently differentiable, an equivalent definition is that

    f(α) = f′(α) = · · · = f^(m−1)(α) = 0,   f^(m)(α) ≠ 0.    (7.38)

A root of multiplicity m = 1 is called a simple root.

Example.

(a) f(x) = (x − 1)²(x + 2) has two roots. The root α = 1 has multiplicity 2, and α = −2 is a simple root.

(b) f(x) = x³ − 3x² + 3x − 1 has α = 1 as a root of multiplicity 3. To see this, note that

    f(1) = f′(1) = f′′(1) = 0,   f′′′(1) = 6.

The result follows from (7.38).

(c) f(x) = 1 − cos(x) has α = 0 as a root of multiplicity m = 2. To see this, write

    f(x) = x² [2 sin²(x/2)/x²] ≡ x² h(x)

with h(0) = 1/2. The function h(x) is continuous for all x.

When the Newton and secant methods are applied to the calculation of a multiple root α, the convergence of α − x_n to zero is much slower than it would be for a simple root. In addition, there is a large interval of uncertainty as to where the root actually lies, because of the noise in evaluating f(x). The large interval of uncertainty for a multiple root is the most serious problem associated with numerically finding such a root.

Figure: Detailed graph of f(x) = x³ − 3x² + 3x − 1 near x = 1

The noise in evaluating f(x) = (x− 1)3, which has α = 1 as a root ofmultiplicity 3. The graph also illustrates the large interval of uncertainty infinding α.

Example

To illustrate the effect of a multiple root on a rootfinding method, we useNewton’s method to calculate the root α = 1.1 of

f(x) = (x− 1.1)3(x− 2.1)2.7951 + x(−8.954 + x(10.56 + x(−5.4 + x))). (7.39)

The computer used is decimal with six digits in the significand, and it uses

rounding. The function f(x) is evaluated in the nested form of (7.39), and

f ′(x) is evaluated similarly. The results are given in the Table.


The column “Ratio” gives the values of

(α − x_n) / (α − x_{n−1}),   (7.40)

and we can see that these values equal about 2/3.

n    x_n        f(x_n)    α − x_n    Ratio
0    0.800000   0.03510   0.300000
1    0.892857   0.01073   0.207143   0.690
2    0.958176   0.00325   0.141824   0.685
3    1.00344    0.00099   0.09656    0.681
4    1.03486    0.00029   0.06514    0.675
5    1.05581    0.00009   0.04419    0.678
6    1.07028    0.00003   0.02972    0.673
7    1.08092    0.0       0.01908    0.642

Table: Newton's Method for (7.39)

The iteration is linearly convergent with a rate of 2/3.


It is possible to show that when we use Newton's method to calculate a root of multiplicity m, the ratios (7.40) will approach

λ = (m − 1)/m,  m ≥ 1.   (7.41)

Thus, as x_n approaches α,

α − x_n ≈ λ (α − x_{n−1}),   (7.42)

and the error decreases at about a constant rate. In our example, λ = 2/3, since the root has multiplicity m = 3; this corresponds to the values in the last column of the table. The error formula (7.42) implies a much slower rate of convergence than is usual for Newton's method. With any root of multiplicity m ≥ 2, the number λ ≥ 1/2; thus, the bisection method is always at least as fast as Newton's method for multiple roots. Of course, m must be an odd integer to have f(x) change sign at x = α, thus permitting the bisection method to be applied.
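To see this behavior numerically, here is a minimal MATLAB sketch (ours, not part of the original notes) that runs Newton's method on (7.39) and prints the error ratios, which settle near 2/3:

% Newton's method at the multiple root alpha = 1.1 of (7.39);
% the error ratio (alpha - x_{n+1})/(alpha - x_n) approaches (m-1)/m = 2/3.
f  = @(x) (x - 1.1).^3 .* (x - 2.1);
df = @(x) 3*(x - 1.1).^2 .* (x - 2.1) + (x - 1.1).^3;
alpha = 1.1;  x = 0.8;
for n = 1:8
    xnew = x - f(x)/df(x);                       % Newton step
    fprintf('%2d  x = %.6f  ratio = %.3f\n', n, xnew, (alpha - xnew)/(alpha - x));
    x = xnew;
end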


Newton's Method for Multiple Roots

x_{k+1} = x_k − f(x_k)/f′(x_k),
f(x) = (x − α)^p h(x),  p ≥ 1.

Apply the fixed point iteration theorem. Since

f′(x) = p(x − α)^(p−1) h(x) + (x − α)^p h′(x),

we get

g(x) = x − (x − α)^p h(x) / [ p(x − α)^(p−1) h(x) + (x − α)^p h′(x) ].


Canceling the common factor (x − α)^(p−1):

g(x) = x − (x − α) h(x) / [ p h(x) + (x − α) h′(x) ].

Differentiating,

g′(x) = 1 − h(x) / [ p h(x) + (x − α) h′(x) ] − (x − α) (d/dx)[ h(x) / ( p h(x) + (x − α) h′(x) ) ],

and

g′(α) = 1 − 1/p = (p − 1)/p.


Quasi-Newton Iterates

If p = 1, then g′(α) = 0, and by Theorem 2.8 we get quadratic convergence:

(x_{k+1} − α) / (x_k − α)^2 → g″(α)/2  as k → ∞.

If p > 1, then by fixed point theory (Theorem 2.6) we get linear convergence:

|x_{k+1} − α| ≤ ((p − 1)/p) |x_k − α|.

E.g., for p = 2: (p − 1)/p = 1/2.


Acceleration of Newton’s Method for Multiple Roots

f(x) = (x − α)^p h(x),  h(α) ≠ 0.

Assume p is known, and iterate

x_{k+1} = x_k − p f(x_k)/f′(x_k),  i.e.  x_{k+1} = g(x_k),  g(x) = x − p f(x)/f′(x).

Then

g′(α) = 1 − p/p = 0,

and

lim_{k→∞} (α − x_{k+1}) / (α − x_k)^2 = g″(α)/2.
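A minimal MATLAB sketch (ours) of this accelerated iteration on (7.39), whose root α = 1.1 has multiplicity p = 3; the quadratic convergence is restored:

% Accelerated Newton x_{k+1} = x_k - p*f(x_k)/f'(x_k) for a root of multiplicity p.
f  = @(x) (x - 1.1).^3 .* (x - 2.1);
df = @(x) 3*(x - 1.1).^2 .* (x - 2.1) + (x - 1.1).^3;
p = 3;  x = 0.8;
for k = 1:6
    x = x - p*f(x)/df(x);                        % modified Newton step
    fprintf('%2d  error = %.2e\n', k, abs(x - 1.1));
end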


One can run several Newton iterations to estimate p: look at

|α − x_{k+1}| / |α − x_k| ≈ (p − 1)/p.

(Since α is unknown, in practice one monitors the ratio of successive differences |x_{k+1} − x_k| / |x_k − x_{k−1}|, which has the same limit.)

One way to deal with the uncertainties in multiple roots: set

φ(x) = f^(p−1)(x),  so that  φ(x) = (x − α) ψ(x),  ψ(α) ≠ 0

⇒ α is a simple root of φ(x).


> 3. Rootfinding > Roots of polynomials

Roots of polynomials

Consider p(x) = 0, where

p(x) = a_0 + a_1 x + . . . + a_n x^n,  a_n ≠ 0.

Fundamental Theorem of Algebra:

p(x) = a_n (x − z_1)(x − z_2) · · · (x − z_n),  z_1, . . . , z_n ∈ C.


Location of real roots:

1. Descartes' rule of signs

Assume real coefficients, and let

ν = # changes in sign of the coefficients (ignoring zero coefficients),
k = # positive roots.

Then k ≤ ν and ν − k is even.

Example: p(x) = x^5 + 2x^4 − 3x^3 − 5x^2 − 1. Here ν = 1, so k ≤ 1, i.e. k = 0 or k = 1. Since

ν − k = 1 for k = 0 (not even),  ν − k = 0 for k = 1,

we must have k = 1.


For negative roots, consider q(x) = p(−x) and apply the rule to q(x).

Ex.: q(x) = −x^5 + 2x^4 + 3x^3 − 5x^2 − 1. Here ν = 2, so the number of negative roots of p is k = 0 or 2.


2. Cauchy

|ζ_i| ≤ 1 + max_{0≤i≤n−1} |a_i / a_n|.

Book: Householder, "The Numerical Treatment of a Single Nonlinear Equation", 1970.

Cauchy: given p(x), consider

p_1(x) = |a_n| x^n + |a_{n−1}| x^{n−1} + . . . + |a_1| x − |a_0| = 0,
p_2(x) = |a_n| x^n − |a_{n−1}| x^{n−1} − . . . − |a_1| x − |a_0| = 0.

By Descartes' rule, each p_i has a single positive root ρ_i, and

ρ_1 ≤ |ζ_j| ≤ ρ_2.
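As a quick illustration, the following MATLAB sketch (ours; the coefficient vector is the Descartes example above) compares the simple bound 1 + max |a_i/a_n| with the actual root moduli:

% Cauchy bound |zeta_i| <= 1 + max|a_i/a_n| for p(x) = x^5 + 2x^4 - 3x^3 - 5x^2 - 1.
a = [-1 0 -5 -3 2 1];                        % coefficients a_0, ..., a_5
bound = 1 + max(abs(a(1:end-1))/abs(a(end)));
r = roots(fliplr(a));                        % roots() expects highest degree first
fprintf('bound = %.3f, max|root| = %.3f\n', bound, max(abs(r)))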


Nested multiplication (Horner’s method)

p(x) = a_0 + a_1 x + a_2 x^2 + . . . + a_n x^n   (7.43)
p(x) = a_0 + x(a_1 + x(a_2 + . . . + x(a_{n−1} + a_n x)) . . .)   (7.44)

Evaluating (7.44) requires n multiplications and n additions.

Evaluating (7.43) directly, each term a_k x^k costs two multiplications (x · x^{k−1} and a_k · x^k), for a total of n additions and 2n − 1 multiplications.


For any ζ ∈ R define b_k, k = 0, . . . , n, by

b_n = a_n,
b_k = a_k + ζ b_{k+1},  k = n − 1, n − 2, . . . , 0.

Nesting as in (7.44),

p(ζ) = a_0 + ζ( a_1 + . . . + ζ( a_{n−1} + a_n ζ ) . . . ),

where the innermost parenthesis is b_{n−1}, the next one b_{n−2}, and so on; the full expression is b_0, so p(ζ) = b_0.
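A minimal MATLAB sketch (ours) of this recursion; it returns p(ζ) = b_0 together with b_1, . . . , b_n, which are used below for deflation:

% Horner's recursion: b_n = a_n; b_k = a_k + zeta*b_{k+1}, k = n-1, ..., 0.
% MATLAB indices are shifted by one: a(k+1) stores the coefficient a_k.
function [p, b] = horner_eval(a, zeta)       % a = [a_0 a_1 ... a_n]
n = length(a) - 1;
b = zeros(1, n+1);
b(n+1) = a(n+1);
for k = n:-1:1
    b(k) = a(k) + zeta*b(k+1);
end
p = b(1);                                    % p(zeta) = b_0
end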


Consider

q(x) = b_1 + b_2 x + . . . + b_n x^{n−1}.

Claim:

p(x) = b_0 + (x − ζ) q(x).

Proof.

b_0 + (x − ζ) q(x)
= b_0 + (x − ζ)(b_1 + b_2 x + . . . + b_n x^{n−1})
= (b_0 − ζ b_1) + (b_1 − ζ b_2) x + . . . + (b_{n−1} − ζ b_n) x^{n−1} + b_n x^n
= a_0 + a_1 x + . . . + a_n x^n = p(x),

since b_k = a_k + ζ b_{k+1} means a_k = b_k − ζ b_{k+1}, and a_n = b_n.  □

Note: if p(ζ) = 0, then b_0 = 0, so p(x) = (x − ζ) q(x).


Deflation

If ζ is found, continue with q(x) to find the rest of the roots.


Newton's method for p(x) = 0:

x_{k+1} = x_k − p(x_k)/p′(x_k),  k = 0, 1, 2, . . .

To evaluate p and p′ at x = ζ: p(ζ) = b_0, and differentiating p(x) = b_0 + (x − ζ) q(x) gives p′(x) = q(x) + (x − ζ) q′(x), hence p′(ζ) = q(ζ).


Algorithm (Newton’s method for p(x) = 0)

Given: coefficients a = (a_0, a_1, . . . , a_n), initial guess x_0, tolerance ε, iteration limit itmax.
Output: root; b = (b_1, b_2, . . . , b_n), the coefficients of the deflated polynomial q(x); error flag ierr.

Newton(a, n, x_0, ε, itmax, root, b, ierr)

itnum := 1
1. ζ := x_0; b_n := a_n; c := a_n
   for k = n − 1, . . . , 1: b_k := a_k + ζ b_{k+1}; c := b_k + ζ c   [on exit, c = p′(ζ)]
   b_0 := a_0 + ζ b_1   [= p(ζ)]
   if c = 0 then ierr := 2, exit   [p′(ζ) = 0]
   x_1 := x_0 − b_0/c   [x_1 = x_0 − p(x_0)/p′(x_0)]
   if |x_0 − x_1| ≤ ε then ierr := 0, root := x_1, exit
   if itnum = itmax then ierr := 1, exit
   itnum := itnum + 1, x_0 := x_1, go to 1
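A runnable MATLAB transcription of this algorithm (ours; the function name and error-flag convention follow the pseudocode above):

% Newton's method for p(x) = 0 with p, p' evaluated by Horner's rule.
% a = [a_0 a_1 ... a_n] (ascending powers). On success (ierr = 0), root holds
% the computed root and b = [b_1 ... b_n] the deflated polynomial q(x).
function [root, b, ierr] = polynewton(a, x0, tol, itmax)
n = length(a) - 1;
bb = zeros(1, n+1);                          % bb(k+1) stores b_k
for itnum = 1:itmax
    zeta = x0;
    bb(n+1) = a(n+1);  c = a(n+1);
    for k = n:-1:2                           % b_k = a_k + zeta*b_{k+1}
        bb(k) = a(k) + zeta*bb(k+1);
        c = bb(k) + zeta*c;                  % on loop exit, c = q(zeta) = p'(zeta)
    end
    bb(1) = a(1) + zeta*bb(2);               % b_0 = p(zeta)
    b = bb(2:end);
    if c == 0, root = x0; ierr = 2; return, end   % p'(zeta) = 0
    x1 = x0 - bb(1)/c;                       % Newton step
    if abs(x0 - x1) <= tol, root = x1; ierr = 0; return, end
    x0 = x1;
end
root = x0; ierr = 1;                         % itmax reached without convergence
end

For instance, polynewton([2.7951 -8.954 10.56 -5.4 1], 0.8, 1e-6, 100) seeks the multiple root near 1.1 of (7.39).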


Conditioning

p(x) = a_0 + a_1 x + . . . + a_n x^n, with roots ζ_1, . . . , ζ_n.
Perturbation polynomial: q(x) = b_0 + b_1 x + . . . + b_n x^n.
Perturbed polynomial:

p(x; ε) = p(x) + ε q(x) = (a_0 + ε b_0) + (a_1 + ε b_1) x + . . . + (a_n + ε b_n) x^n,

with roots ζ_j(ε), continuous functions of ε with ζ_j(0) = ζ_j.
(Absolute) condition number:

K_{ζ_j} = lim_{ε→0} |ζ_j(ε) − ζ_j| / |ε|.


Example

(x − 1)^3 = 0:  ζ_1 = ζ_2 = ζ_3 = 1.

Perturbed: (x − 1)^3 − ε = 0  (q(x) = −1).

Set y = x − 1 and a = ε^{1/3}:

p(x; ε) = y^3 − ε = y^3 − a^3 = (y − a)(y^2 + ya + a^2),
y_1 = a,  y_{2,3} = (−a ± √(−3a^2))/2 = a(−1 ± i√3)/2,

i.e. y_2 = aω, y_3 = aω^2, where ω = (−1 + i√3)/2, |ω| = 1.


Hence

ζ_1(ε) = 1 + ε^{1/3},
ζ_2(ε) = 1 + ω ε^{1/3},
ζ_3(ε) = 1 + ω^2 ε^{1/3},

so |ζ_j(ε) − 1| = ε^{1/3} for each j.

Condition number:

|ζ_j(ε) − 1| / ε = ε^{1/3} / ε → ∞  as ε → 0.

If ε = 0.001, then ε^{1/3} = 0.1: |ζ_j(ε) − 1| = 0.1.


General argument

Let ζ be a simple root of p(x): p(ζ) = 0, p′(ζ) ≠ 0, and let ζ(ε) be the corresponding root of p(x; ε). Expand

ζ(ε) = ζ + Σ_{ℓ=1}^{∞} γ_ℓ ε^ℓ = ζ + γ_1 ε + γ_2 ε^2 + . . . ,

where ζ is what matters and the terms beyond γ_1 ε are negligible if ε is small. Then

(ζ(ε) − ζ)/ε = γ_1 + γ_2 ε + . . . → γ_1  as ε → 0.


To find γ_1 = ζ′(0), differentiate the identity p(ζ(ε); ε) = 0, i.e.

p(ζ(ε)) + ε q(ζ(ε)) = 0,

with respect to ε:

p′(ζ(ε)) ζ′(ε) + q(ζ(ε)) + ε q′(ζ(ε)) ζ′(ε) = 0.

At ε = 0:

p′(ζ) γ_1 + q(ζ) = 0  ⟹  γ_1 = −q(ζ)/p′(ζ),

so

K_ζ = |γ_1| = |q(ζ)/p′(ζ)|,

and K_ζ is large if p′(ζ) is close to zero.


Example

p(x) = W_7 = ∏_{i=1}^{7} (x − i),  q(x) = x^6,  ε = −0.002.

The roots are ζ_j = j, and

p′(ζ_j) = ∏_{ℓ=1, ℓ≠j}^{7} (j − ℓ),

so

K_{ζ_j} = |q(ζ_j)/p′(ζ_j)| = j^6 / ∏_{ℓ=1, ℓ≠j}^{7} |j − ℓ|.

In particular,

ζ_j(ε) ≈ j + ε q(ζ_j)/p′(ζ_j) = j + δ(j).
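This sensitivity is easy to observe numerically; the following MATLAB sketch (ours) perturbs the x^6 coefficient of W_7 by ε = −0.002 and compares the computed roots with 1, . . . , 7:

% Sensitivity of the roots of W7 = (x-1)...(x-7) to a perturbation eps*x^6.
c = poly(1:7);                               % coefficients of W7, highest degree first
epsilon = -0.002;
cp = c;  cp(2) = cp(2) + epsilon;            % c(2) multiplies x^6
disp([roots(c), roots(cp)])                  % unperturbed vs perturbed roots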


> 3. Rootfinding > Systems of Nonlinear Equations

Systems of Nonlinear Equations

f_1(x_1, . . . , x_m) = 0,
f_2(x_1, . . . , x_m) = 0,
. . .
f_m(x_1, . . . , x_m) = 0.   (7.45)

If we denote

F(x) = ( f_1(x), f_2(x), . . . , f_m(x) )^T,  F : R^m → R^m,

then (7.45) is equivalent to writing

F(x) = 0.   (7.46)


Fixed Point Iteration

x = G(x),  G : R^m → R^m.

A solution α, i.e. α = G(α), is called a fixed point of G.

Example: F(x) = 0 can be rewritten as

x = x − A F(x) =: G(x),

for some nonsingular matrix A ∈ R^{m×m}.

Iteration: choose an initial guess x_0 and set

x_{n+1} = G(x_n),  n = 0, 1, 2, . . .
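A generic MATLAB driver for such an iteration might look as follows (a sketch of ours; G, x0, tol, nmax are illustrative names):

% Generic fixed point iteration x_{n+1} = G(x_n) with a simple stopping test.
function x = fixedpoint(G, x0, tol, nmax)
x = x0(:);
for n = 1:nmax
    xnew = G(x);
    if norm(xnew - x, inf) <= tol, x = xnew; return, end
    x = xnew;
end
warning('fixedpoint: no convergence within nmax iterations')
end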


Recall, for x ∈ R^m:

‖x‖_p = ( Σ_{i=1}^{m} |x_i|^p )^{1/p},  1 ≤ p < ∞,
‖x‖_∞ = max_{1≤i≤m} |x_i|.

Matrix norms (operator induced), for A ∈ R^{m×m}:

‖A‖_p = sup_{x∈R^m, x≠0} ‖Ax‖_p / ‖x‖_p,  1 ≤ p < ∞,
‖A‖_∞ = max_{1≤i≤m} ‖Row_i(A)‖_1 = max_{1≤i≤m} Σ_{j=1}^{m} |a_{ij}|.


Let ‖ · ‖ be any norm in Rm.

Definition

G : R^m → R^m is called a contractive mapping if

‖G(x) − G(y)‖ ≤ λ ‖x − y‖,  ∀ x, y ∈ R^m,

for some λ < 1.


Contractive mapping theorem

Theorem (Contractive mapping theorem)

Assume

1 D is a closed, bounded subset of R^m.
2 G : D → D is a contractive mapping.

Then

∃ a unique α ∈ D such that α = G(α) (unique fixed point).
For any x_0 ∈ D, the iteration x_{n+1} = G(x_n) converges linearly to α with rate λ.


Proof

We will show that the sequence {x_n} converges, and that its limit is the unique fixed point α.

‖x_{i+1} − x_i‖ = ‖G(x_i) − G(x_{i−1})‖ ≤ λ ‖x_i − x_{i−1}‖ ≤ . . . ≤ λ^i ‖x_1 − x_0‖  (by induction),

‖x_k − x_0‖ = ‖ Σ_{i=0}^{k−1} (x_{i+1} − x_i) ‖ ≤ Σ_{i=0}^{k−1} ‖x_{i+1} − x_i‖
≤ Σ_{i=0}^{k−1} λ^i ‖x_1 − x_0‖ = ((1 − λ^k)/(1 − λ)) ‖x_1 − x_0‖ < (1/(1 − λ)) ‖x_1 − x_0‖.


For all k, ℓ:

‖x_{k+ℓ} − x_k‖ = ‖G(x_{k+ℓ−1}) − G(x_{k−1})‖ ≤ λ ‖x_{k+ℓ−1} − x_{k−1}‖ ≤ . . . ≤ λ^k ‖x_ℓ − x_0‖
< (λ^k/(1 − λ)) ‖x_1 − x_0‖ → 0  as k → ∞

⇒ {x_n} is a Cauchy sequence ⇒ x_n → α for some α.


Letting n → ∞ in x_{n+1} = G(x_n) gives α = G(α), so α is a fixed point.

Uniqueness: assume β = G(β). Then

‖α − β‖ = ‖G(α) − G(β)‖ ≤ λ ‖α − β‖,

so (1 − λ) ‖α − β‖ ≤ 0 with 1 − λ > 0, hence ‖α − β‖ = 0, i.e. α = β.

Linear convergence with rate λ:

‖x_{n+1} − α‖ = ‖G(x_n) − G(α)‖ ≤ λ ‖x_n − α‖.  □


Jacobian matrix

Definition

F : R^m → R^m is continuously differentiable (F ∈ C^1(R^m)) if the partial derivatives

∂f_i(x)/∂x_j,  i, j = 1, . . . , m,

exist and are continuous for every x ∈ R^m.

The Jacobian matrix is

F′(x) :=
[ ∂f_1(x)/∂x_1  . . .  ∂f_1(x)/∂x_m ]
[      ...                 ...      ]
[ ∂f_m(x)/∂x_1  . . .  ∂f_m(x)/∂x_m ]   (m × m),

i.e. (F′(x))_{ij} = ∂f_i(x)/∂x_j,  i, j = 1, . . . , m.


Mean Value Theorem

Theorem (Mean Value Theorem)

For f : R^m → R,

f(x) − f(y) = ∇f(z)^T (x − y)

for some z on the line segment joining x and y, where

∇f(z) = ( ∂f(z)/∂x_1, . . . , ∂f(z)/∂x_m )^T.

Proof: follows immediately from Taylor's theorem (linear Taylor expansion), since

∇f(z)^T (x − y) = (∂f(z)/∂x_1)(x_1 − y_1) + . . . + (∂f(z)/∂x_m)(x_m − y_m).


No Mean Value Theorem for vector-valued functions

Note: for

F(x) = ( f_1(x), . . . , f_m(x) )^T,

the Mean Value Theorem applies componentwise,

f_i(x) − f_i(y) = ∇f_i(z_i)^T (x − y),  i = 1, . . . , m,

but the intermediate points z_i differ from component to component. It is not true in general that

F(x) − F(y) = F′(z)(x − y)

for a single point z.


Consider x = G(x) (x_{n+1} = G(x_n)) with solution α = G(α). By the MVT,

α_i − (x_{n+1})_i = g_i(α) − g_i(x_n) = ∇g_i(z_{i,n})^T (α − x_n),  i = 1, . . . , m,

with z_{i,n} on the segment joining α and x_n. Stacking the rows ∇g_1(z_{1,n})^T, . . . , ∇g_m(z_{m,n})^T into a matrix J_n:

α − x_{n+1} = J_n (α − x_n).   (7.47)

If x_n → α, then J_n → G′(α), the matrix with rows ∇g_1(α)^T, . . . , ∇g_m(α)^T.

The size of G′(α) will affect convergence.


Theorem 2.9

Assume

D is a closed, bounded, convex subset of R^m,
G ∈ C^1(D),
G(D) ⊂ D,
λ = max_{x∈D} ‖G′(x)‖_∞ < 1.

Then

(i) x = G(x) has a unique solution α ∈ D;
(ii) for all x_0 ∈ D, x_{n+1} = G(x_n) converges to α;
(iii) ‖α − x_{n+1}‖_∞ ≤ ( ‖G′(α)‖_∞ + ε_n ) ‖α − x_n‖_∞, where ε_n → 0 as n → ∞.


Proof: for all x, y ∈ D,

|g_i(x) − g_i(y)| = |∇g_i(z_i)^T (x − y)|,  z_i on the segment joining x and y,
= | Σ_{j=1}^{m} (∂g_i(z_i)/∂x_j)(x_j − y_j) | ≤ Σ_{j=1}^{m} |∂g_i(z_i)/∂x_j| |x_j − y_j|
≤ Σ_{j=1}^{m} |∂g_i(z_i)/∂x_j| ‖x − y‖_∞ ≤ ‖G′(z_i)‖_∞ ‖x − y‖_∞,

so

‖G(x) − G(y)‖_∞ ≤ λ ‖x − y‖_∞,

i.e. G is a contractive mapping, which gives (i) and (ii).

To show (iii), from (7.47):

‖α − x_{n+1}‖_∞ ≤ ‖J_n‖_∞ ‖α − x_n‖_∞ ≤ ( ‖J_n − G′(α)‖_∞ + ‖G′(α)‖_∞ ) ‖α − x_n‖_∞,

where ε_n := ‖J_n − G′(α)‖_∞ → 0 as n → ∞.  □


Example (p.104)

Solve

f_1 ≡ 3x_1^2 + 4x_2^2 − 1 = 0,
f_2 ≡ x_2^3 − 8x_1^3 − 1 = 0,

for α near (x_1, x_2) = (−0.5, 0.25). Iterate

[ x_{1,n+1} ]   [ x_{1,n} ]   [ 0.016  −0.17 ] [ 3x_{1,n}^2 + 4x_{2,n}^2 − 1 ]
[ x_{2,n+1} ] = [ x_{2,n} ] − [ 0.52   −0.26 ] [ x_{2,n}^3 − 8x_{1,n}^3 − 1  ]
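In MATLAB, this iteration can be run with a few lines (our sketch):

% Fixed point iteration x_{n+1} = x_n - A*F(x_n) for the 2x2 example above.
F = @(x) [3*x(1)^2 + 4*x(2)^2 - 1; x(2)^3 - 8*x(1)^3 - 1];
A = [0.016 -0.17; 0.52 -0.26];
x = [-0.5; 0.25];
for n = 1:10
    x = x - A*F(x);
end
disp(x')                                     % converges to the root near (-0.5, 0.25)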


Writing the iteration as

x_{n+1} = x_n − A F(x_n) = G(x_n),  A = (F′(x_0))^{−1},

one observes

‖G′(α)‖_∞ ≈ 0.04,  ‖α − x_{n+1}‖_∞ / ‖α − x_n‖_∞ → 0.04.

Why? Since

G′(x) = I − A F′(x),  G′(α) = I − A F′(α),

we need ‖G′(α)‖_∞ ≈ 0, i.e. A ≈ (F′(α))^{−1}; here A = (F′(x_0))^{−1}.

m-dimensional Parallel Chords Method:

x_{n+1} = x_n − (F′(x_0))^{−1} F(x_n)


Newton’s Method for F(x) = 0

x_{n+1} = x_n − (F′(x_n))^{−1} F(x_n),  n = 0, 1, 2, . . .

Given an initial guess x_0, expand

f_i(x) = f_i(x_0) + ∇f_i(x_0)^T (x − x_0) + O(‖x − x_0‖^2),

and neglect the quadratic term:

F(x) ≈ F(x_0) + F′(x_0)(x − x_0) ≡ M_0(x),

the linear model of F(x) around x_0. Define x_1 by M_0(x_1) = 0:

F(x_0) + F′(x_0)(x_1 − x_0) = 0,
x_1 = x_0 − (F′(x_0))^{−1} F(x_0).


In general, Newton's method is

x_{n+1} = x_n − (F′(x_n))^{−1} F(x_n).

Geometric interpretation: each

m_i(x) = f_i(x_0) + ∇f_i(x_0)^T (x − x_0),  i = 1, . . . , m,

is the tangent plane at x_0 to the surface f_i(x).

In practice (see the sketch below):

1 Solve the linear system F′(x_n) δ_n = −F(x_n).
2 Set x_{n+1} = x_n + δ_n.
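A minimal MATLAB sketch (ours) of these two steps, applied to the 2×2 example above; Ffun and Jfun are illustrative names:

% Newton's method for F(x) = 0: solve F'(x_n)*d = -F(x_n), set x_{n+1} = x_n + d.
Ffun = @(x) [3*x(1)^2 + 4*x(2)^2 - 1; x(2)^3 - 8*x(1)^3 - 1];
Jfun = @(x) [6*x(1), 8*x(2); -24*x(1)^2, 3*x(2)^2];
x = [-0.5; 0.25];
for n = 1:8
    d = -Jfun(x) \ Ffun(x);                  % linear solve, not an explicit inverse
    x = x + d;
    if norm(d, inf) <= 1e-12, break, end
end
disp(x')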


Convergence Analysis: 1. Use the fixed point iteration theorem

F(x) = 0,  x = x − (F′(x))^{−1} F(x) = G(x),  x_{n+1} = G(x_n).

Assume F(α) = 0 and F′(α) is nonsingular. Then G′(α) = 0 (exercise!), so ‖G′(α)‖_∞ = 0.

If G ∈ C^1(B_r(α)), where B_r(α) = {y : ‖y − α‖ ≤ r}, then by continuity ‖G′(x)‖_∞ < 1 for x ∈ B_r(α), for some r. By Theorem 2.9 with D = B_r(α), we get (at least) linear convergence.


Convergence Analysis: 2. Assume F′ ∈ Lip_γ(D)

(‖F′(x)− F′(y)‖ ≤ γ‖x− y‖ ∀x,y ∈ D)

Theorem

Assume

F ∈ C^1(D),
∃ α ∈ D such that F(α) = 0,
F′ ∈ Lip_γ(D),
∃ (F′(α))^{−1}, with ‖(F′(α))^{−1}‖ ≤ β.

Then ∃ ε > 0 such that if ‖x_0 − α‖ < ε, the iterates x_{n+1} = x_n − (F′(x_n))^{−1} F(x_n) converge to α, and

‖x_{n+1} − α‖ ≤ βγ ‖x_n − α‖^2.

(βγ is a measure of the nonlinearity.) So we need ε < 1/(βγ).

Reference: Dennis & Schnabel, SIAM.


Quasi-Newton Methods

x_{n+1} = x_n − A_n^{−1} F(x_n),  A_n ≈ F′(x_n).

Ex.: finite difference Newton:

(A_n)_{ij} = [ f_i(x_n + h_n e_j) − f_i(x_n) ] / h_n ≈ ∂f_i(x_n)/∂x_j,

with h_n ≈ √δ (δ being, roughly, the noise level in evaluating f, e.g. the unit roundoff) and e_j = (0 . . . 0 1 0 . . . 0)^T, the 1 in the j-th position.
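A sketch (ours) of such a finite-difference Jacobian in MATLAB, usable in place of an analytic F′:

% Forward-difference approximation of the Jacobian F'(x).
% F: function handle, x: column vector, h: difference step (e.g. sqrt(eps)).
function A = fdjac(F, x, h)
m = length(x);
Fx = F(x);
A = zeros(m, m);
for j = 1:m
    e = zeros(m, 1);  e(j) = 1;
    A(:, j) = (F(x + h*e) - Fx) / h;         % j-th column approximates dF/dx_j
end
end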


Global Convergence

Newton's method:

x_{n+1} = x_n + s_n d_n,

where

d_n = −(F′(x_n))^{−1} F(x_n),  s_n = 1.

If the Newton step is not satisfactory, e.g. ‖F(x_{n+1})‖_2 > ‖F(x_n)‖_2, set

s_n ← g s_n for some g < 1 (backtracking).

Alternatively, we can choose s_n such that

φ(s) = ‖F(x_n + s d_n)‖_2

is minimized: line search.
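A sketch (ours) of Newton with simple backtracking; the halving factor g = 1/2 and the lower cutoff on s are illustrative choices:

% Damped Newton: backtrack s <- g*s until the residual norm decreases.
Ffun = @(x) [3*x(1)^2 + 4*x(2)^2 - 1; x(2)^3 - 8*x(1)^3 - 1];
Jfun = @(x) [6*x(1), 8*x(2); -24*x(1)^2, 3*x(2)^2];
x = [-0.5; 0.25];  g = 0.5;
for n = 1:20
    d = -Jfun(x) \ Ffun(x);                  % Newton direction
    s = 1;
    while norm(Ffun(x + s*d)) > norm(Ffun(x)) && s > 1e-4
        s = g*s;                             % backtracking
    end
    x = x + s*d;
end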


In practice: minimize a quadratic model of ϕ(s).

Trust region

Set a region in which the model of the function is reliable. If the Newton step takes us outside this region, cut it back so that it stays inside the region.

(See the Optimization Toolbox of MATLAB.)


> 3. Rootfinding > Matlab’s function fsolve

The MATLAB instruction

zero = fsolve('fun', x0)

allows the computation of one zero of a nonlinear system

f_1(x_1, x_2, . . . , x_n) = 0,
f_2(x_1, x_2, . . . , x_n) = 0,
. . .
f_n(x_1, x_2, . . . , x_n) = 0,

defined through the user function fun, starting from the vector x0 as initial guess. The function fun returns the n values f_1(x), . . . , f_n(x) for any value of the input vector x.


For instance, let us consider the following system:

x^2 + y^2 = 1,
sin(πx/2) + y^3 = 0,

whose solutions are (0.4761, −0.8794) and (−0.4761, 0.8794). The corresponding MATLAB user function, called systemnl, is defined as:

function fx = systemnl(x)
fx(1) = x(1)^2 + x(2)^2 - 1;
fx(2) = sin(pi*0.5*x(1)) + x(2)^3;

The MATLAB instructions to solve this system are therefore:

>> x0 = [1 1];
>> options = optimset('Display','iter');
>> [alpha,fval] = fsolve('systemnl',x0,options)
alpha =
    0.4761   -0.8794

Using this procedure we have found only one of the two roots. The other can be computed starting from the initial datum -x0.


> 3. Rootfinding > Unconstrained Minimization

Consider f(x_1, x_2, . . . , x_n) : R^n → R and the problem

min_{x∈R^n} f(x).

Theorem (first order necessary condition for a minimizer)

If f ∈ C^1(D), D ⊂ R^n, and x ∈ D is a local minimizer, then ∇f(x) = 0.

Solve:

∇f(x) = 0,  with  ∇f = ( ∂f/∂x_1, . . . , ∂f/∂x_n )^T.   (7.48)


Hessian

Apply Newton's method to (7.48) with F(x) = ∇f(x). We need

F′(x) = ∇^2 f(x) = H(x),  H_{ij} = ∂^2 f / ∂x_i ∂x_j,

and iterate

x_{n+1} = x_n − H(x_n)^{−1} ∇f(x_n).

If H(α) is nonsingular and Lip_γ, then x_n → α quadratically.

Problems:

1 Not globally convergent.
2 Requires solving a linear system at each iteration.
3 Requires ∇f and H.
4 May not converge to a minimum: it could converge to a maximum or a saddle point.


Remedies:

1 Globalization strategy (Line Search, Trust Region).
2 Secant approximation to H.
3 Finite difference derivatives for ∇f, not for H.

Theorem (necessary and sufficient conditions for a minimizer)

Assume f ∈ C^2(D), D ⊂ R^n, and ∃ x ∈ D such that ∇f(x) = 0. Then x is a local minimum if and only if H(x) is symmetric positive semidefinite (v^T H v ≥ 0 ∀ v ∈ R^n).

(Here "x is a local minimum" means f(x) ≤ f(y) for all y ∈ B_r(x), for some r > 0.)


Quadratic model for f(x)

Taylor:

m_n(x) = f(x_n) + ∇f(x_n)^T (x − x_n) + (1/2)(x − x_n)^T H(x_n)(x − x_n),

so m_n(x) ≈ f(x) for x near x_n.

Newton's method: choose x_{n+1} such that ∇m_n(x_{n+1}) = 0.


We need to guarantee that the Hessian of the quadratic model,

∇^2 m_n(x_n) = H(x_n),

is symmetric positive definite. Modify:

x_{n+1} = x_n − H̃^{−1}(x_n) ∇f(x_n),

where

H̃(x_n) = H(x_n) + μ_n I,

for some μ_n ≥ 0. If λ_1, . . . , λ_n are the eigenvalues of H(x_n) and λ̃_1, . . . , λ̃_n the eigenvalues of H̃(x_n), then

λ̃_i = λ_i + μ_n.

We need μ_n such that λ_min + μ_n > 0. Gershgorin theorem: for A = (a_{ij}), the eigenvalues lie in the union of the circles with centers a_{ii} and radii r_i = Σ_{j=1, j≠i}^{n} |a_{ij}|.


Gershgorin Circles

[Figure: Gershgorin circles in the complex plane for a 3 × 3 example, with radii r_1 = 3, r_2 = 2, r_3 = 1.]
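A small MATLAB sketch (ours) of this safeguard: choose μ_n from Gershgorin's bound so that H + μ_n I is safely positive definite.

% Shift H by mu*I, with mu from Gershgorin's theorem, so that every
% eigenvalue of H + mu*I is at least tol > 0.
H = [2 -3 0; -3 1 1; 0 1 5];                 % an illustrative symmetric Hessian
tol = 1e-3;
radii = sum(abs(H), 2) - abs(diag(H));       % Gershgorin radii
lower = min(diag(H) - radii);                % lower bound on the eigenvalues
mu = max(0, tol - lower);
Htilde = H + mu*eye(size(H));
min(eig(Htilde))                             % check: >= tol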


Descent Methods

x_{n+1} = x_n − H(x_n)^{−1} ∇f(x_n)

Definition

d is a descent direction for f(x) at the point x_0 if f(x_0) > f(x_0 + αd) for 0 < α < α_0, for some α_0 > 0.

Lemma

d is a descent direction if and only if ∇f(x_0)^T d < 0.

Newton: x_{n+1} = x_n + d_n, with d_n = −H(x_n)^{−1} ∇f(x_n).
d_n is a descent direction if H(x_n) is symmetric positive definite:

∇f(x_n)^T d_n = −∇f(x_n)^T H(x_n)^{−1} ∇f(x_n) < 0,

since H(x_n)^{−1} is then also symmetric positive definite (assuming ∇f(x_n) ≠ 0).


Method of Steepest Descent

x_{n+1} = x_n + s_n d_n,  d_n = −∇f(x_n),  s_n = argmin_{s>0} f(x_n + s d_n).

Level curve: C = {x | f(x) = f(x_0)}.
If C is closed and contains α in its interior, then the method of steepest descent converges to α. Convergence is linear.


Weaknesses of gradient descent are:

1 The algorithm can take many iterations to converge towards a local minimum, if the curvature in different directions is very different.
2 Finding the optimal s_n per step can be time-consuming. Conversely, using a fixed s_n can yield poor results. Methods based on Newton's method and inversion of the Hessian using conjugate gradient techniques are often a better alternative.
