
Mathematics 449/549: Scientific computing

Boualem Khouider
University of Victoria

Lecture notes: Updated–September 2019


Contents

1 Preliminaries
1.1 What is Scientific Computing?
1.2 Vector and Matrix norms
1.3 Floating-point arithmetic
1.3.1 Rounding and danger of cancellation of significant digits
1.4 Notions of Stability and Conditioning
1.5 Problems

2 The Interpolation Polynomial
2.1 The interpolation problem
2.2 Lagrange Interpolation Method
2.2.1 Lagrange polynomials
2.2.2 Lagrange interpolation polynomial
2.3 Interpolation error
2.3.1 Error bound and convergence
2.4 The Runge phenomenon and piece-wise polynomial interpolation
2.4.1 The Runge phenomenon
2.4.2 Piecewise polynomial interpolation
2.5 Other methods for constructing the interpolation polynomial
2.5.1 Vandermonde's Matrix
2.5.2 Newton's divided differences
2.6 Problems

3 Initial value problems
3.1 Introduction
3.2 Euler's method
3.2.1 Error analysis
3.2.2 Round off errors
3.3 Higher order methods
3.3.1 One step methods: Runge-Kutta methods
3.3.2 Multistep methods
3.3.3 Dealing with implicit methods
3.4 Problems

4 Stability, convergence of multistep methods, and stiff equations
4.1 Introduction
4.2 Linear multistep methods
4.2.1 Truncation error
4.2.2 Difference equation
4.3 Zero-stability and convergence of LMMs
4.4 Stiff equations and the notion of absolute stability
4.4.1 Notion of Absolute Stability
4.5 A-stability, L-stability, and the BDF methods
4.6 Matlab ODE suite

5 Miscellaneous Linear Algebra
5.1 Diagonally Dominant and Positive Definite Matrices
5.2 Least Squares Approximation
5.2.1 General least squares and the Gram-Schmidt Orthogonalization Procedure
5.3 Matrix factorization
5.4 Condition number
5.5 Singular Value Decomposition and Empirical Orthogonal Functions
5.6 Problems

6 Finite difference and finite volume methods for transport and conservation laws
6.1 Introduction to finite differences: The heat equation
6.1.1 Explicit scheme for the heat equation
6.1.2 Stability of the forward scheme: von Neumann analysis
6.1.3 Implicit scheme for the heat equation
6.1.4 The Crank-Nicholson scheme
6.2 Time splitting methods
6.3 Introduction to quasi-linear equations and scalar conservation laws
6.3.1 Prototype examples
6.3.2 Solutions by the method of characteristics
6.3.3 Notion of shocks and weak solutions
6.3.4 Discontinuous initial data and the Riemann problem
6.3.5 Non-uniqueness of weak solutions and the entropy condition
6.4 Finite difference schemes for the advection equation
6.4.1 Some simple basic schemes
6.4.2 Accuracy and consistency
6.4.3 Stability and convergence: the CFL condition and the Lax equivalence theorem
6.4.4 More on the leap-frog scheme: the parasitic mode and the Robert-Asselin filter
6.4.5 The Lax-Friedrichs scheme
6.4.6 Second order schemes: the Lax-Wendroff scheme
6.4.7 Some numerical experiments
6.4.8 Numerical diffusion, dispersion, and the modified equation
6.5 Finite volume methods for scalar conservation laws
6.5.1 Wrong shock speed and importance of conservative form
6.5.2 Godunov's first order scheme
6.5.3 High resolution, TVD, and MUSCL schemes


Chapter 1

Preliminaries

1.1 What is Scientific Computing?

In a nutshell, scientific computing is the “science behind the collection of tools, techniques, and theories when using a computer to solve mathematical problems in science and engineering”.

“Scientific computing draws from mathematics (the areas of numerical analysis and mathematical modelling) and computer science to develop the best ways to use computer systems to solve problems from science and engineering”. See Figure 1.1.

A few key words:

Modelling is the key first step, in which scientists, engineers, and mathematicians draw the mathematical formulation (e.g. calculus or probability theory) of a scientific or engineering problem, in the form of a system of linear or nonlinear equations, an ordinary or partial differential equation, a complicated integration form, a stochastic process, an optimization problem, etc.

Algorithm is the mathematical model that goes directly into the computer code. This is usually based on “controlled” approximations of the initial mathematical formulation of the science or engineering problem. This step often involves a “discretization” of the (continuous) mathematical model, especially in the case of differential equations and integrals. The core of numerical analysis deals with the “art” of designing and constructing “discretizations” that are both accurate and computationally stable. Most of the course will deal with the meaning of the last two highlighted words.

Validation is the step where the computer solution is compared to an exact solution of the mathematical formulation, when such a solution exists for a certain parameter configuration, or to field measurements or experimental data. This is an important step before the computer code can be delivered to the general user for real-world applications such as weather and financial market forecasts or the assessment of the economic growth or the electric power potential of a hydroelectric plant. It is also important for the developer so that the model or algorithm can be refined accordingly, and this process is usually iterated until the validation step is successful.


Figure 1.1: Scientific computing is a branch of modern applied mathematics and blends multiple disciplines.



Sources of errors. The “scientific computing” process of “solving” a science or engineering problem on a computer involves three levels of errors: modelling errors arising during the mathematical formulation, truncation errors occurring during the discretization step (or algorithm construction step), and round off errors arising because of limited accuracy in computer arithmetic.

Efficiency refers to fast computer codes that can tackle the problem at hand in a reasonable amount of time. For example, a weather forecast model that would take more than a day to advance the equations of atmospheric dynamics for one day of real time is not very useful. The design of fast algorithms is an important part of numerical analysis and scientific programming. It depends on the type of computers and programming languages used. There is a whole area of computer science called “high performance computing (HPC)” that is specifically devoted to this issue, especially for large-scale computing problems such as climate and numerical weather prediction. The majority of HPC relies on parallel computing and very sophisticated hardware: large computer clusters, shared memory machines, vector machines, GPUs, etc.

1.2 Vector and Matrix norms

Definition 1 (Vector norms) Let X = (x_1, x_2, · · · , x_n) denote a vector in R^n. A vector norm is a function, denoted by the doubled vertical lines ||.||, from R^n to [0,+∞) that satisfies the following conditions:

1) ||X|| = 0 ⇐⇒ X = 0.

2) ||cX|| = |c|||X|| for all scalar c ∈ R (|.| is the absolute value) .

3) ||X + Y || ≤ ||X|| + ||Y ||, triangular inequality.

The following examples are the most commonly used vector norms.

i) Euclidean or L^2 norm:

$$||X||_2 = \left( \sum_{i=1}^{n} x_i^2 \right)^{1/2}$$

ii) L^1 norm:

$$||X||_1 = \sum_{i=1}^{n} |x_i|$$

iii) Max or L^∞ norm:

$$||X||_\infty = \max_{1\le i\le n} |x_i|$$

iv) L^p norm, for p > 1:

$$||X||_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}.$$

We can show that all these examples satisfy the properties 1), 2) and 3) of a vector norm. Here we do it for the case of the L^1 norm to illustrate. The remaining three cases are left as an exercise. Proving the triangular inequality for the L^2 and L^p norms is a bit tricky (see Pbs. 2, 3, and 4); they involve the Cauchy-Schwarz and Hölder inequalities, respectively. For the 1-norm, we have

$$||X||_1 = 0 \iff \sum_{i=1}^{n} |x_i| = 0 \iff |x_i| = 0, \ \forall i = 1,\cdots,n \iff X = (0, 0, \cdots, 0),$$

the zero vector, and

$$||cX||_1 = \sum_{i=1}^{n} |c x_i| = |c| \sum_{i=1}^{n} |x_i| = |c|\,||X||_1.$$

It remains to show that the triangular inequality 3) is also satisfied:

$$||X + Y||_1 = \sum_{i=1}^{n} |x_i + y_i| \le \sum_{i=1}^{n} \left(|x_i| + |y_i|\right) = \sum_{i=1}^{n} |x_i| + \sum_{i=1}^{n} |y_i| = ||X||_1 + ||Y||_1.$$

Example: Let X = (1, 0, 2, −1) be a vector in R^4. Then

$$||X||_2 = \sqrt{1 + 4 + 1} = \sqrt{6}; \qquad ||X||_1 = 4, \qquad ||X||_\infty = 2.$$

Vector norms in Matlab
In Matlab, there is a predefined function named norm: >>norm(V,p) returns the p-norm of a given vector V. If p is not specified, the default is the L^2 norm, i.e. >>norm(V) is equivalent to >>norm(V,2). Evidently, norm(V,Inf) returns the max norm of V.
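For instance, the norms of the example vector above can be checked directly (a minimal sketch of a Matlab session):

>> X=[1 0 2 -1];
>> norm(X,2)    % Euclidean norm, sqrt(6) = 2.4495...
>> norm(X,1)    % L1 norm, returns 4
>> norm(X,Inf)  % max norm, returns 2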

Definition 2 (Matrix norm) A matrix norm is a function, denoted by ||.||, defined on the “space of square matrices”, which takes values in [0,+∞) and is such that

1) ||A|| = 0 ⇐⇒ A = 0

2) ||cA|| = |c|||A||

3) ||A+B|| ≤ ||A|| + ||B||

4) ||AB|| ≤ ||A|| ||B||

Note that properties 1), 2), 3) are those of a vector norm. For simplicity in exposition we denote by ||.|| both the vector and matrix norms. When there is a risk of confusion, we use ||.||_v and ||.||_M to denote vector and matrix norms, respectively.


Definition 3 (Subordinate matrix norm) Given a vector norm, ||.||v, we can show that

$$||A||_M = \max_{||X||_v = 1} ||AX||_v \equiv \sup_{X \neq 0} \frac{||AX||_v}{||X||_v}$$

is a matrix norm, called the matrix norm subordinate to the given vector norm.

As a consequence, we have the following inequality (known as the compatibility condition) for any given subordinate matrix norm:

||AX||v ≤ ||A||M ||X||v . (1.1)

Examples
The common subordinate norms are those associated with the L^∞, L^1 and L^2 vector norms. They are denoted respectively by ||A||_∞, ||A||_1, and ||A||_2, though there is a risk of confusing them with the vector norms they originate from. Interestingly, however, all three of these subordinate norms are known in closed form and therefore there is no need to compute a maximum (see Problem 6). They are given as follows.

i) The L^∞ subordinate norm:

$$||A||_\infty \equiv \max_{||X||_\infty = 1} ||AX||_\infty = \max_{1\le i\le n} \sum_{j=1}^{n} |a_{ij}|$$

ii) The L^1 subordinate norm:

$$||A||_1 \equiv \max_{||X||_1 = 1} ||AX||_1 = \max_{1\le j\le n} \sum_{i=1}^{n} |a_{ij}|$$

iii) The L^2 subordinate norm:

$$||A||_2 \equiv \max_{||X||_2 = 1} ||AX||_2 = \left( \rho(AA^T) \right)^{1/2}, \qquad (1.2)$$

where ρ(AA^T) is the spectral radius of the matrix AA^T, given by

$$\rho(B) = \max\{ |\lambda| : \lambda \text{ is an eigenvalue of } B \}.$$

Note that it follows immediately from (1.2) that if A is symmetric, then ||A||_2 = ρ(A). However, to prove (1.2) it is easier to consider the symmetric case first, as illustrated next.

Exercise:

(a) Show that ||B||2 = ρ(B) if B is symmetric.

(b) Show that ||A||_2^2 = ||A^T A||_2 for any square matrix A.

(c) Deduce that $||A||_2 = \sqrt{\rho(A^T A)}$.


Solution:

(a) ||B||2 = ρ(B) if B is symmetric.

First, we show that ρ(B) ≤ ||B||_2 for any matrix B. Let (λ, V) be an eigenvalue-eigenvector pair of B (i.e., BV = λV) with ||V||_2 = 1. We have

$$|\lambda|^2 = |\lambda|^2 ||V||_2^2 = ||\lambda V||_2^2 = ||BV||_2^2 \le ||B||_2^2 \, ||V||_2^2 = ||B||_2^2.$$

Thus ρ(B)^2 ≤ ||B||_2^2. In fact, this shows that the inequality ρ(A) ≤ ||A|| is valid for any given subordinate matrix norm ||.||.

It remains to prove that ||B||_2 ≤ ρ(B) for any symmetric matrix B. Recall that a fundamental theorem of linear algebra states that the eigenvalues of a symmetric matrix are real and that the associated eigenvectors form an orthonormal basis of R^n. Let λ_1, λ_2, · · · , λ_n be the n real eigenvalues of B (counting multiplicities) and V_1, V_2, · · · , V_n the associated orthonormal basis of eigenvectors. Then any given vector X ∈ R^n can be uniquely written as a linear combination of the V_i's:

$$X = \sum_{i=1}^{n} \alpha_i V_i.$$

Assume ||X||2 = 1. We have

$$||BX||_2^2 = \Big|\Big|B \sum_{i=1}^{n} \alpha_i V_i\Big|\Big|_2^2 = \Big|\Big|\sum_{i=1}^{n} \alpha_i B V_i\Big|\Big|_2^2 = \Big|\Big|\sum_{i=1}^{n} \alpha_i \lambda_i V_i\Big|\Big|_2^2 = \sum_{i=1}^{n} ||\lambda_i \alpha_i V_i||_2^2 = \sum_{i=1}^{n} |\lambda_i|^2 ||\alpha_i V_i||_2^2 \le \max_{1\le i\le n} |\lambda_i|^2 \sum_{i=1}^{n} ||\alpha_i V_i||_2^2 = \rho(B)^2 ||X||_2^2 = \rho(B)^2,$$

i.e.,

$$\max_{||X||_2 = 1} ||BX||_2 \le \rho(B),$$

which is what we wanted. We note that in the process we used the fact that the vectors V_i, i = 1, · · · , n, are mutually orthogonal, through the use of the Pythagorean theorem:

$$\sum_{i=1}^{n} ||\beta_i V_i||_2^2 = \Big|\Big|\sum_{i=1}^{n} \beta_i V_i\Big|\Big|_2^2,$$

where β_i = λ_i α_i or β_i = α_i, i = 1, · · · , n.

(b) ||A||_2^2 = ||A^T A||_2.

Here A = (aij) denotes an n× n matrix and X = (xi) and Y = (yi) denote generic vectors in Rn.


Let

$$\langle X, Y \rangle = \sum_{i=1}^{n} x_i y_i$$

be the scalar (or dot) product in R^n. By definition of the scalar (or dot) product of vectors, we have

(i) $||X||_2^2 = \langle X, X\rangle \equiv X^T X \equiv \sum_{i=1}^{n} x_i^2$,

(ii) $\langle AX, Y\rangle = \sum_{i=1}^{n} y_i \sum_{j=1}^{n} a_{ij} x_j = \sum_{j=1}^{n} x_j \left( \sum_{i=1}^{n} a_{ij} y_i \right) = \langle X, A^T Y\rangle$,

(iii) $\langle X, Y\rangle \le ||X||_2 \, ||Y||_2$.   (1.3)

The last inequality is known as the Cauchy-Schwarz inequality (see Problem 2).

Using successively (i), (ii), and (iii) above, and the property (1.1) of a subordinate matrix norm, we have

$$||AX||_2^2 = \langle AX, AX\rangle = \langle X, A^T A X\rangle \le ||X||_2 \, ||A^T A X||_2 \le ||A^T A||_2 \, ||X||_2^2;$$

dividing both sides by ||X||_2^2 and taking the max over all X ≠ 0 yields

$$\frac{||AX||_2^2}{||X||_2^2} \le ||A^T A||_2 \implies ||A||_2^2 \le ||A^T A||_2.$$

It remains to show the inequality in the other direction, i.e., that ||A^T A||_2 ≤ ||A||_2^2. We will exploit the result in (a). Namely, we use the fact that ||A^T A||_2 = ρ(A^T A), because A^T A is symmetric.

Let λ_m be the eigenvalue of A^T A such that ρ(A^T A) = |λ_m|, and let V_m be the eigenvector of A^T A associated with λ_m: A^T A V_m = λ_m V_m. Assume that ||V_m||_2 = 1. We have

$$||A^T A||_2^2 = |\lambda_m|^2 = |\lambda_m|^2 ||V_m||_2^2 = \langle \lambda_m V_m, \lambda_m V_m\rangle = \langle A^T A V_m, \lambda_m V_m\rangle = \langle A V_m, \lambda_m A V_m\rangle = |\lambda_m| \, ||A V_m||_2^2 \le |\lambda_m| \, ||A||_2^2 \, ||V_m||_2^2 = |\lambda_m| \, ||A||_2^2,$$

i.e.,

$$\lambda_m^2 \le |\lambda_m| \, ||A||_2^2,$$

which implies that

$$|\lambda_m| \le ||A||_2^2,$$

if λ_m ≠ 0. (If λ_m = 0, then this statement becomes trivial, because it implies that A = 0 thanks to the previously shown inequality ||A||_2^2 ≤ ||A^T A||_2.)

(c) Conclusion:

We showed that ||A||_2^2 = ||A^T A||_2 and that ||A^T A||_2 = ρ(A^T A), since this matrix is symmetric. When combined, these two statements imply

$$||A||_2 = \sqrt{\rho(A^T A)}.$$


Frobenius norm

A well known (and widely used) matrix norm which is not a subordinate norm is the Frobenius norm, given by

$$||A||_F = \left( \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij}^2 \right)^{1/2},$$

which should not be confused with the L2 subordinate norm.

Definition 4 A matrix norm ||.||M is said to be compatible with a vector norm ||.||v if

||AX||v ≤ ||A||M ||X||v .

It is easy to see, from the definition, that a subordinate matrix norm is compatible with the vector norm it originated from. We have the following important theorem, whose proof is left as an exercise.

Theorem 1 If ||.|| is a matrix norm compatible with a vector norm, then

ρ(A) ≤ ||A||, for any square matrix A

Matrix norms in Matlab
The function norm works for matrices in the same way as for vectors: >>norm(A) returns the 2-norm of the matrix A and >>norm(A,p) returns the subordinate p-norm of A, while >>norm(A,'fro') returns the Frobenius norm of A.
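As a quick illustration, the closed forms above can be checked against the built-in function on an arbitrary test matrix (a minimal sketch; the matrix A below is not from the notes, any square matrix will do):

>> A=[1 -2 3; 0 4 -1; 2 1 5];
>> max(sum(abs(A),2))     % max row sum, same as norm(A,Inf)
>> max(sum(abs(A),1))     % max column sum, same as norm(A,1)
>> sqrt(max(eig(A'*A)))   % sqrt of spectral radius of A^T A, same as norm(A,2)
>> sqrt(sum(sum(A.^2)))   % Frobenius norm, same as norm(A,'fro')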

1.3 Floating-point arithmetic

While most of the course will focus on the numerical analysis aspect of scientific computing, here we give a brief overview of floating-point or computer arithmetic, just enough to appreciate the meaning (and especially the dangers) of round off errors. Here we follow loosely the beautiful notes written by M. Overton.

The floating-point system (fl-pt for short) consists of a (finite) set of (rational) numbers used to represent the whole real line, endowed with the operations of addition and multiplication. Right away, we see a big problem with this. It is just impossible to make this representation accurate, as it is impossible to “reach” arbitrarily large numbers with a finite set of rationals. The same problem arises for very small numbers that are too close to zero. While there are many possibilities to construct a floating point system with a sound representation of all real numbers for our daily scientific computations, many different computers use different systems, which hinders the portability of computer codes (scientific computing software). Fortunately, a more or less standard system exists. It was devised by the IEEE (Institute of Electrical and Electronics Engineers). According to the


IEEE standards, a good fl-pt system needs to provide a reasonably accurate representation for the (range of) real numbers involved in “typical” scientific operations and provide a “correct” and “stable” scheme on how to deal with basic arithmetic operations such as additions and multiplications, which are “close” to the ordinary operations.

A fl-pt number has the basic form

$$x = \pm\, m\, \beta^{E}$$

and occupies a certain number of bits in the computer's memory space. The sign ± takes one bit and the parameters m, E take the remaining bits: m is called the mantissa (or root), and E is the exponent. The parameter β is the base, a fixed integer. While the IEEE standard uses the binary base β = 2, bases such as 8, 12, and 16 are also used.

The mantissa m is a sequence of β-“symbols”, usually represented by the integers 0, 1, ..., β − 1: a sequence of 0's and 1's when β = 2. For β = 12, for example, the symbols 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B are used.

The sequence m = b0.b1b2 · · · bn refers to the series expansion

$$m = b_0 + b_1\beta^{-1} + b_2\beta^{-2} + \cdots + b_n\beta^{-n}$$

where 0 ≤ b_j ≤ β − 1. The exponent is an integer bounded from above and below by a positive and a negative limit, E_max and E_min, respectively: E_min ≤ E ≤ E_max.

The IEEE single precision system uses a total of 32 bits: one for the sign, 23 for the mantissa, and 8 for the exponent, with E_min = −126 and E_max = 127. The double precision mode (which is the mode used in Matlab) is based on 64 bits. See the Overton article for details. For example, the number 71 is stored as 71 = 1.00011100000000000000000 × 2^6 in single precision. See Table 1 of Overton.

The fl-pt representation is unique if we require 1 ≤ m < β. Real numbers that can be represented under this requirement are called normal numbers and their first mantissa coefficient is non-zero: 1 ≤ b_0 ≤ β − 1.

The largest single precision number is

$$x_{max} = 1.11111\cdots1 \times 2^{127} \approx 3.4 \times 10^{38}.$$

The smallest normal number represented in the IEEE single precision is given by

$$x_{min} = 1 \times 2^{-126} \approx 1.2 \times 10^{-38}.$$

Because in base 2, b_0 is fixed to b_0 = 1 for all normal numbers, this number doesn't need to be stored in memory. This frees up one extra bit for the mantissa to allow more precision.

Numbers that cannot be represented in this “normal fashion” are called subnormal numbers. Obviously, subnormal numbers are smaller (in magnitude) than x_min.


The exponent uses a bit-string E = a_1 a_2 · · · a_8 which stores the number E + 127; the 127 is called the exponent bias. When a_1 = a_2 = · · · = a_7 = 1, a_8 = 0 we get E + 127 = (11111110)_2 = (254)_10, i.e. E_max = 127, and when a_1 = a_2 = · · · = a_7 = 0, a_8 = 1 we get E + 127 = 1, i.e. E_min = −126.

The sequence a_1 = a_2 = · · · = a_8 = 1 (11111111) is reserved for the special numbers ±∞ and NaN. The sequence a_1 = a_2 = · · · = a_8 = 0 is reserved for subnormal numbers. This way the first coefficient of the mantissa, b_0, doesn't need to be stored and the 23 bits can all be used to store “coefficients” after the “decimal” point. The smallest subnormal number is

$$x_{submin} = 0.00\cdots01 \times 2^{-126} = 2^{-23} \times 2^{-126} = 2^{-149} \approx 1.4 \times 10^{-45}.$$

Machine Precision and Machine Epsilon

In any given fl-pt system, the machine epsilon (ε) is defined as the difference between the number one and the first fl-pt number larger than one. In base 2 with the single precision IEEE fl-pt system, we have

$$1 \leftrightarrow 1.00000000000000000000000 \times 2^{0}$$

and

$$1 + \epsilon \leftrightarrow 1.000000000000000000000001 \times 2^{0} = 1 + 2^{-24},$$

i.e.

$$\epsilon = 2^{-24} \approx 5.9605 \times 10^{-8}.$$

It follows immediately that ε is also the relative distance between any two consecutive normal fl-pt numbers in the same fl-pt system. If x is a positive normal fl-pt number, the next fl-pt number larger than x is x + εx. In this fashion, the number ε defines the precision of the machine. It sets an upper bound on the relative error between a real number x and its fl-pt representation (approximation) $\bar x$:

$$\frac{|x - \bar{x}|}{|x|} \le \epsilon,$$

except for some exceptions that are reported below.
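A minimal sketch of how the machine epsilon can be detected directly in fl-pt arithmetic, by halving a candidate value until adding half of it to 1 no longer changes the result (in Matlab's double precision this recovers 2^{-52}, the built-in constant eps):

e = 1;
while 1 + e/2 > 1       % stop once 1 + e/2 rounds back to 1
    e = e/2;
end
e                        % detected machine epsilon
eps                      % Matlab's predefined value, for comparison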

1.3.1 Rounding and danger of cancellation of significant digits

Computer arithmetic refers to the way a computer carries out additions and multiplications on fl-pt numbers. The difficulty arises because the sum or product of two fl-pt numbers is not necessarily a fl-pt number. Thus some rounding (up, down or to the nearest) is performed after almost each such operation. The mode of rounding (up, down or to the nearest) is something specific to each computer.

As pointed out above, the relative error between an arbitrary real number x and its fl-pt approximation (rounded up or down) $\bar x$ is bounded by ε:

$$\frac{|x - \bar{x}|}{|x|} \le \epsilon.$$


However, this is restricted to the set of real numbers excluding overflows (numbers that are larger than x_max), underflows (numbers that are smaller than x_submin) and subnormal numbers. In the IEEE standard, overflows are treated as ∞ while underflows are set to zero. In IEEE, ∞ is a well defined, legitimate number (or a set of numbers to be more precise) while NaN is the result of an operation with an undetermined result such as 0 × ∞ and ∞ − ∞.

Idealized fl-pt arithmetic refers to fl-pt operations where the main operation is first done exactly (in conventional real number arithmetic) and the result is then rounded (up, down or to the nearest) according to the rounding mode in use:

$$x \oplus y = \overline{(x + y)}.$$

Here ⊕ is the addition operator in fl-pt arithmetic. The IEEE standard fl-pt system uses guard bits to come as close as possible to this ideal situation!

Example
Consider a fl-pt system in base β = 10 with 6 significant digits. Let x = 1.92403 × 10^{-2} and y = 1.92275 × 10^{-2} be two fl-pt numbers in this system and consider the operation

$$x \ominus y = 0.00128 \times 10^{-2} = 1.28000 \times 10^{-5}.$$

We note that no rounding was necessary and the last step is the normalization step. The main glitch here is that, after normalization, the mantissa contains mostly zeros. This is in fact not a problem at all if x and y are precisely what they are; if they are actually meant to be approximations of other numbers, the result could be off by up to 3 significant digits, i.e. ≈ 1%. For simplicity, imagine that x is seen as the fl-pt approximation of x̃ = 1.924039999 × 10^{-2} and that ỹ = y. Then, the true result would have been x̃ − ỹ = 1.289999 × 10^{-5}.

Fl-pt addition is neither associative nor commutative. It is easy to make up examples where

$$x \oplus (y \oplus z) \neq (x \oplus y) \oplus z.$$

However, disproving commutativity is a bit harder. One way to see it is to consider adding many small numbers to one large number. If the large number comes first then the sum will be just it (by “it” I mean the large number itself!), but if one first adds all the small numbers together then they may sum up to a non-negligible amount that does change the result.

The example below demonstrates the non associativity.

Example: Let β = 10 and assume a mantissa with 4 digits in rounding-down mode, i.e., chopping. Let x = 0.1234, y = −0.5508 × 10^{-4} and z = −0.1232. We have

$$(x \oplus y) \oplus z = 0.12334492 \oplus z = 0.1233 - 0.1232 = 0.0001,$$

$$x \oplus (y \oplus z) = 0.1234 \oplus (-0.12325508) = 0.1234 - 0.1232 = 0.0002,$$

and

$$(x \oplus z) \oplus y = 0.0002 \oplus y = 0.2000 \times 10^{-3} \oplus (-0.5508 \times 10^{-4}) = 0.1449 \times 10^{-3}.$$

We note that the exact result is x + y + z = 1.4492 × 10^{-4}. Note that the last option yields a very accurate result but the first two are very far off. The problem arises because we are adding two


numbers that are very far from each other and then subtracting two numbers that are very close to each other. In principle, the loss of significant digits occurs when we encounter an addition of two numbers that are very far from each other or a subtraction of two numbers that are too close to each other.

While the round off error committed when adding numbers that are far from each other is often harmless, because it results in a small relative error, round off errors due to subtracting numbers that are very close to each other can lead to catastrophic results. To avoid such errors one has to be extremely careful during programming. The following example illustrates how the danger of cancellation of significant digits can be minimized.

Example
Assume β = 10 and a 5 digit mantissa with chopping mode of rounding. Consider the operation

$$y = \frac{1}{x+1} - \frac{1}{x+2}.$$

In fl-pt arithmetic this operation can be very inaccurate for large values of x. For x = 1000,

$$\bar y = \frac{1}{1001} \ominus \frac{1}{1002} = 9.9900 \times 10^{-4} - 9.9800 \times 10^{-4} = 1.0000 \times 10^{-6}.$$

The actual value is y ≈ 9.9701 × 10^{-7}!

To avoid this inaccuracy from happening, we should first transform this quantity into a different but equivalent expression that does not involve the subtraction of numbers that are too close to each other. For instance, we can proceed as follows:

$$y = \frac{1}{x+1} - \frac{1}{x+2} = \frac{1}{(x+1)(x+2)}.$$

In fl-pt arithmetic, we have

$$\bar y = \frac{1}{1001 \times 1002} = \frac{1}{1.0030 \times 10^{6}} = 9.9701 \times 10^{-7},$$

which is precisely the result found by exact arithmetic.
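The same cancellation is easy to observe on an actual machine by repeating the computation in single precision (a minimal sketch, taking x = 1000 as in the example; the reference value is computed in double precision):

x = single(1000);
y1 = 1/(x+1) - 1/(x+2);        % subtraction of two nearly equal numbers
y2 = 1/((x+1)*(x+2));          % algebraically equivalent form, no cancellation
yref = 1/(1001*1002);          % double precision reference value
abs(double(y1)-yref)/yref      % relative error: several orders of magnitude larger
abs(double(y2)-yref)/yref      % relative error: close to single-precision eps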

1.4 Notions of Stability and Conditioning

The goal of numerical methods is to solve mathematical problems on a digital computer. To do so, we first need to design or use an algorithm (or a numerical method) for the given problem, which often provides only an approximate solution to the problem at hand, in place of the full or exact solution. It is highly desirable to design an algorithm that leads to the most accurate approximation possible. Here we present some of the common problems that may arise, hinder this goal, and restrict the accuracy of the numerical solution. We provide some necessary conditions for a fair approximation. Precisely, we need both a stable algorithm and a well-conditioned problem to hope for an accurate numerical solution. The precise definitions of these notions are given next. A problem which is NOT well-conditioned is said to be ill-conditioned.


Definition 5 A given mathematical problem is said to be ill-conditioned if small changes in the data produce large deviations in the result.

Data −→ Solution
X −→ S
$\bar X$ = X + ε −→ $\bar S$

The problem is well-conditioned if: ε ≪ 1 =⇒ |S − $\bar S$|/|S| ≪ 1.

Definition 6 An algorithm is said to be stable if its approximate solution is close to the exact solution of the original problem with slightly perturbed data.

Exact data X, algorithm (computer code): −→ approximate solution S_n.
Perturbed data $\bar X$ = X + ε, exact computation: −→ exact solution $\bar S$.

The algorithm is stable if for all X there exists a small perturbation ε such that |S_n − $\bar S$|/|S_n| is small.

Example 1: The Hilbert matrix.

$$H_{ij} = \frac{1}{i + j - 1}, \qquad 1 \le i, j \le n.$$

Consider the linear system HX = b with n = 3:

$$\begin{pmatrix} 1 & 1/2 & 1/3 \\ 1/2 & 1/3 & 1/4 \\ 1/3 & 1/4 & 1/5 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} 11/6 \\ 13/12 \\ 47/60 \end{pmatrix}.$$

The exact solution is X = (1, 1, 1)T .

Consider the perturbed problem obtained by rounding the entries of both the matrix H and the right hand side vector b to three significant digits. Let $\bar H$ and $\bar b$ be the truncated matrix and truncated right hand side vector, respectively:

$$\bar H = \begin{pmatrix} 1.00 & 0.500 & 0.333 \\ 0.500 & 0.333 & 0.250 \\ 0.333 & 0.250 & 0.200 \end{pmatrix}, \qquad \bar b = \begin{pmatrix} 1.83 \\ 1.08 \\ 0.783 \end{pmatrix}.$$

The (exact) solution to the perturbed problem $\bar H \bar X = \bar b$ is

$$\bar X = (1.0895,\ 0.48797,\ 1.4910)^T.$$

Let's compare the perturbed solution to the solution of the original problem. The absolute error, using the L^1 norm, is

$$||X - \bar X||_1 = |x - \bar x| + |y - \bar y| + |z - \bar z| = 1.09253,$$


and the corresponding relative error is

$$\frac{||X - \bar X||_1}{||X||_1} = 0.364 = 36.4\%.$$

A small perturbation of the original problem (on the order of 1/1000) resulted in a deviation of 36% in the solution. The problem HX = b is therefore ill-conditioned. We will see later in the course that this is an issue with the Hilbert matrix itself: it is ill-conditioned.
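This experiment is straightforward to reproduce in Matlab (a minimal sketch; hilb(3) is the built-in 3 x 3 Hilbert matrix and the rounded entries are the ones given above):

H = hilb(3);
b = H*[1;1;1];                 % right hand side so that the exact solution is (1,1,1)^T
Hbar = [1.00 0.500 0.333; 0.500 0.333 0.250; 0.333 0.250 0.200];
bbar = [1.83; 1.08; 0.783];
Xbar = Hbar\bbar               % approximately (1.0895, 0.48797, 1.4910)^T
norm(Xbar-[1;1;1],1)/norm([1;1;1],1)   % relative error, approximately 0.36
cond(H)                        % condition number of H, already about 524 for n = 3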

Stability: Now, assume that we use “Gauss elimination” with 3 digits to approximate the solution of the system $\bar H X = \bar b$. This is our algorithm. The question is whether this algorithm is stable or not.

The approximate solution obtained by this algorithm (whose details are not shown here for streamlining) is given by

$$X_n = (0.480,\ 1.88,\ 1.22)^T.$$

First note that $X_n \neq \bar X$ (do you know why?). In fact, the two solutions are very far apart from each other, another pitfall of ill-conditioning. Ill-conditioned problems are very sensitive to round-off errors and thus are very tricky to handle in a fl-pt environment.

The question is whether the solution X_n is close to the exact solution of a perturbed problem. In other words, can we find a perturbation matrix E and a perturbation vector e to the original (truncated) matrix $\bar H$ and truncated vector $\bar b$, respectively, such that the solution of

$$(\bar H + E) X_p = \bar b + e$$

is close to X_n?

It is easy to check that the solution, X_p, of

$$\begin{pmatrix} 1.06 & 0.521 & 0.339 \\ 0.47 & 0.328 & 0.236 \\ 0.327 & 0.235 & 0.178 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} 1.82733 \\ 1.08507 \\ 0.783315 \end{pmatrix}$$

is

$$X_p = (0.4650,\ 1.800,\ 1.1700)^T,$$

which is indeed fairly close to X_n. Also, both the matrix and the right hand side vector of this last system are small perturbations (less than 1%) of the matrix $\bar H$ and vector $\bar b$. Therefore, our algorithm is stable.

Example 2: Consider the problem of computing the quantity

$$w = \frac{1000x}{x - y - z}, \qquad \text{where } x = 0.1276;\ y = 0.0004001;\ z = 0.1267.$$

The exact solution is given by w = 255,251.05.

To check if the problem is ill-conditioned or not, we consider the small perturbation of the data

$$\bar x = 0.1275, \qquad \bar y = y, \qquad \bar z = 0.1268.$$


The perturbed solution is

$$\bar w = \frac{1000\bar x}{\bar x - \bar y - \bar z} = 425,141.71.$$

The perturbed solution has no significant digits in common with the original solution. Thus, the problem is ill-conditioned.

Consider the algorithm of using fl-pt arithmetic to compute w in base β = 10 with precision k = 4 and chopping mode. We have

$$w_n = fl\left( \frac{1000x}{x - y - z} \right) = 319,000.$$

(The details of the fl-pt calculation of w_n are left as an exercise.) Can we find a perturbation to the data so that the solution to the perturbed problem is close to w_n?

Let $\bar x = x$, $\bar y = y$, $\bar z = 0.1267999$, which is a small perturbation of the original data. We have

$$\frac{1000\bar x}{\bar x - \bar y - \bar z} = 319,000 = w_n.$$

The algorithm is thus stable.

Example 3: Consider the approximation

$$e^{x} \approx 1 + x + \frac{x^2}{2!} + \cdots + \frac{x^n}{n!}; \qquad x = -5.5.$$

Using n = 24, we get

$$e^{-5.5} \approx 0.00408677 = y.$$

Consider the perturbed data: $\bar x = x + \epsilon$. Then

$$e^{\bar x} = e^{x+\epsilon} = e^{x} e^{\epsilon} = e^{x}\left(1 + \epsilon + \frac{\epsilon^2}{2!} + \cdots \right) \approx e^{x}(1 + \epsilon) \approx (1 + \epsilon)\left[1 + x + \frac{x^2}{2!} + \cdots + \frac{x^n}{n!}\right] = \bar y,$$

so that

$$\frac{|y - \bar y|}{|y|} = \epsilon \implies \text{the problem is well-conditioned.}$$

Assume we use 5 digits (base 10) in rounding mode to evaluate the given Taylor approximation. Is this stable?

Note that all terms of the form (−5.5)^n/n! with n ≥ 25 add no further change (improvement) to the Taylor approximation in this fl-pt arithmetic system. (In fact 5.5^26/26! ≈ 4.4036 × 10^{-8} = 0.000044036 × 10^{-3} is rounded to zero in 5 digit precision when added to the sum of the first 25 terms, given by y_n = 0.0055304 = 5.5304 × 10^{-3}.) This algorithm yields the approximate solution

$$y_n = 0.0055304$$


for all n ≥ 25. This approximate solution has no significant digits:

$$\frac{|y - y_n|}{y} = 0.3532 = 35.32\%!$$

Can we find a small perturbation $\bar x$ to the original data x so that $\bar y \equiv e^{\bar x} \approx y_n$? The answer is no. Otherwise, the solution $\bar y$ would be close to y, because we know that this problem is well-conditioned: $|y - \bar y|/y = \epsilon$; if we suppose that in addition $\bar y$ is close to y_n, we will get a contradiction. Suppose $|\bar y - y_n|/y_n < \delta$. Then

$$\frac{|y - y_n|}{y} = \frac{|y - \bar y + \bar y - y_n|}{y} \le \frac{|y - \bar y|}{y} + \frac{|\bar y - y_n|}{y} \le \epsilon + \frac{|\bar y - y_n|}{y_n}\,\frac{y_n}{y} \le \epsilon + \delta\,\frac{y_n}{y} = \alpha.$$

Here α is small, given that both ε and δ are small and that y_n/y is order one. This is a contradiction with the fact that |y − y_n|/y = 0.3532, which is very large compared to the unit round-off, which is on the order of 10^{-4}.

In fact, if we attempt to compute a perturbation ε that yields the solution y_n (exactly or approximately), we find

$$e^{-5.5+\epsilon} = 0.0026363 \implies -5.5 + \epsilon = \ln(0.0026363) \implies \epsilon = \ln(0.0026363) + 5.5 \approx -0.4383,$$

which is clearly a large perturbation: |0.4383|/5.5 = 0.08 = 8%. The algorithm is therefore unstable. Clearly this is due to a cancellation of significant digits in the Taylor expansion. This can be dealt with, for instance, by changing the order of summation in the Taylor series.
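The same cancellation mechanism can be observed in double precision by taking a more negative argument, say x = −20 (an assumption chosen so that the effect shows up without simulating 5-digit arithmetic). The last line uses the common alternative remedy of summing the series for |x| and then taking the reciprocal, rather than reordering the summation (a minimal sketch):

x = -20;
s = 1; term = 1;
for n = 1:100
    term = term*x/n;     % next Taylor term x^n/n!
    s = s + term;        % direct summation: severe cancellation
end
s                         % inaccurate: no correct significant digits
exp(x)                    % true value, about 2.061e-09
1/sum(abs(x).^(0:100)./factorial(0:100))   % sum for |x| then invert: accurate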

1.5 Problems

1. Show that for any given two vectors X, Y ∈ R^n, we have the Cauchy-Schwarz inequality:

$$\left( \sum_{i=1}^{n} x_i y_i \right)^2 \le \left( \sum_{i=1}^{n} x_i^2 \right) \left( \sum_{i=1}^{n} y_i^2 \right).$$

Hint: Consider the discriminant of the quadratic form f(t) = ||X + tY||_2^2.

2. Follow the steps below to show that for any given two vectors X, Y ∈ R^n, we have Hölder's inequality

$$\sum_{i=1}^{n} |x_i y_i| \le \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p} \left( \sum_{i=1}^{n} |y_i|^q \right)^{1/q}, \qquad \frac{1}{p} + \frac{1}{q} = 1.$$

(This one is in fact more involved.)

(a) Consider the function f(t) = t − t^p/p, t ≥ 0, p > 1. Prove that f(t) ≤ f(1), ∀t > 0.

(b) Deduce from 2a that

$$|x_i y_i| \le \frac{1}{p}|x_i|^p + \frac{1}{q}|y_i|^q \quad \text{for all } p, q \text{ with } \frac{1}{p} + \frac{1}{q} = 1.$$

Hint: let t = |x_i|/|y_i|^{q-1} when y_i ≠ 0.


(c) Deduce that

$$\sum_{i=1}^{n} |x_i y_i| \le \frac{1}{p} a^p ||x||_p^p + \frac{1}{q} \frac{1}{a^q} ||y||_q^q, \qquad \forall a > 0.$$

(d) Deduce Hölder's inequality. Hint: Let a = ||x||_p^{1/q} / ||y||_q^{1/p} in 2b.

3. Use the two steps below to prove the triangular inequality for the L^p norm, p ≥ 1, also known as Minkowski's inequality.

(a) Show that for all x, y ∈ R and p > 1 we have

$$|x + y|^p \le |x + y|^{p-1}|x| + |x + y|^{p-1}|y|.$$

(b) Use Hölder's inequality and the result above to show that

$$||X + Y||_p^p \le ||X + Y||_p^{p-1} ||X||_p + ||X + Y||_p^{p-1} ||Y||_p,$$

and deduce Minkowski's inequality.

4. Show that the following expressions form vector norms in Rn.

(a) $||X||_\infty = \max_{1\le i\le n} |x_i|$.

(b) $||X||_2 = \left( \sum_{i=1}^{n} x_i^2 \right)^{1/2}$.

(c) $||X||_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$, p > 1.

5. Let ||.|| be a matrix norm which is compatible with some vector norm. Show that

ρ(A) ≤ ||A||.

6. Show the following identities for the given subordinate matrix norms:

i) The L^∞ subordinate norm:

$$||A||_\infty \equiv \max_{||X||_\infty = 1} ||AX||_\infty = \max_{1\le i\le n} \sum_{j=1}^{n} |a_{ij}|$$

ii) The L^1 subordinate norm:

$$||A||_1 \equiv \max_{||X||_1 = 1} ||AX||_1 = \max_{1\le j\le n} \sum_{i=1}^{n} |a_{ij}|$$

iii) The L^2 subordinate norm:

$$||A||_2 \equiv \max_{||X||_2 = 1} ||AX||_2 = \left( \rho(AA^T) \right)^{1/2},$$

where ρ(AA^T) is the spectral radius of the matrix AA^T, given by

$$\rho(B) = \max\{ |\lambda| : \lambda \text{ is an eigenvalue of } B \}.$$


7. (a) Read the Overton notes and find x_max, x_min, x_submin and the machine epsilon for the double precision IEEE fl-pt system.

(b) Now open Matlab on your desktop (or laptop) computer and type realmin, realmax, eps; compare to the values found above and comment. realmin, realmax, eps are predefined Matlab constants.

(c) Write a small computer program (in Matlab or another language) that can automatically detect the value of the machine epsilon. Assume that the base is β = 2. Such programs are useful when trying to write portable codes.

Hint: If ε is the machine epsilon then in fl-pt arithmetic we have 1 ⊕ ε/2 = 1. In your program you should not make use of predefined constants such as eps of Matlab.

8. Consider the real number

$$\phi = \frac{\sqrt{5} - 1}{2}.$$

We propose to compute the successive powers, φ^n, of φ on a digital calculator which doesn't actually have the star key for making multiplications (a very, very old machine).

a) Verify that φ solves the quadratic equation

$$\phi^2 = 1 - \phi.$$

Find the other root of this quadratic equation.

b) Conclude that the powers, φ^n, of φ satisfy the recursive relation

$$x_n = x_{n-2} - x_{n-1}.$$

This relation can be used to compute the value φ^n from the values of φ^{n-1} and φ^{n-2} by a simple addition (or rather subtraction).

c) Start with x_0 = φ^0 = 1 and x_1 = 0.6180, which approximates φ to four digits (φ ≈ 0.61803398874989), and compute the values x_n approximating the powers φ^n up to n = 16 by using the recursive relation in (b). Use your calculator or the short Matlab program provided below to do the calculations.

(You don't need to restrict the precision to k = 4 or something like that; use the (full) precision of your machine, e.g. double precision.) Report your results in a comparative table. Your table should have three to four rows and a few columns to hold enough data (e.g. fill the table below). The first row will have the exact values of φ^n, n = 0, 1, · · · , 16, the second will have the results of the recursive relation and the last one(s) will contain the corresponding absolute and/or relative errors.

n                  0    1        2    5    8    11    14    16
φ^n                1    0.61803
Computed x_n       1    0.61803
|x_n − φ^n|        0    0
|x_n − φ^n|/φ^n    0    0

d) Repeat the calculations with x_1 = φ, i.e., the exact value, and proceed to n = 50.

e) Look at the behaviour of the errors as n grows and conclude whether the problem is ill-conditioned or not.

f) Provide an analytical justification for this behaviour, i.e., use analytical tools to explain


what's happening. Hint: recall the roots, φ_±, of the equation φ^2 = 1 − φ and prove that the sequence x_n satisfies x_n = c_1 φ_+^n + c_2 φ_-^n for some constants c_1, c_2.

Short Matlab program:

>> phi=(sqrt(5)-1)/2

phi =

6.1803e-01

>> x0=1;x1=phi;

>> X=[x0;x1]

X =

1.0000e+00

6.1803e-01

>> Phi=[1; phi]

Phi =

1.0000e+00

6.1803e-01

>> for I=2:16

x2=x0-x1;

X=[X;x2];

x0=x1;x1=x2;

Phi=[Phi;phi^I];end

>> display([(0:16)', Phi, X, abs(Phi-X), abs(Phi-X)./abs(Phi+eps)])

9. The order or rate of convergence of a numerical approximation process is defined as the ratio (in the logarithmic scale; ask me if this is not clear!) at which the approximate value approaches the desired target, often referred to as the exact value, with respect to some small parameter, ε, used to build the approximation. If for instance A_ε is the approximate value of some quantity A which satisfies A = A_ε + O(ε^α),^1 we say that A_ε is an approximation of order α or that the rate of convergence of A_ε to A is α.

Compute the rate of convergence of the following limits:

a) $\lim_{h \to 0} \frac{\sin h - h\cos h}{h} = 0$;  b) $\lim_{h \to 0} \frac{e^{h^2} - \cos h}{h^2} = \frac{3}{2}$;  c) $\lim_{n \to \infty} n\left(e^{1/n} - 1\right) = 1$.

Note that here it is assumed that we want to approximate the limit on the right hand side with the quantity on the left for a given (fixed) h or n. This is the typical situation in numerical approximations.

^1 Big O. Are you wondering what this is?


Chapter 2

The Interpolation Polynomial

2.1 The interpolation problem

Let x_0 < x_1 < x_2 < · · · < x_n, n ≥ 0, be points in some given interval [a, b]. Let y_0, y_1, · · · , y_n be (n+1) measurements of some physical quantity, say, the temperature at different stations along a major road, taken at the points x_0, x_1, · · · , x_n, respectively.

It is tempting to fit a function, let's call it T_approx(x), that passes through each one of these points, which can then give us an “approximate” value of the temperature at all points between the stations. How to construct T_approx(x) is referred to as the interpolation problem. There are many ways to do this, but the simplest way is to fit a polynomial function of the form

$$p_m(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_m x^m.$$

Question: What value should we take for the degree m? Should m be smaller or larger than n?

It is easy to see that if m < n, then unless the interpolation points (x_i, y_i), i = 0, · · · , n, are placed in a certain way, it is not always possible to find p_m(x) that satisfies p_m(x_i) = y_i, i = 0, 1, · · · , n. Try this with n = 2 (3 points) and m = 1, a line. On the other hand, if m > n, the problem can have many solutions. Try this with n = 0 (1 point) and m = 1, a line: there are many ways to pass a line through one single point. I.e., in both cases the problem is ill-posed. For the remaining case, m = n, it turns out, we have the following theorem.

Theorem 2 Let x_0 < x_1 < x_2 < · · · < x_n be (n+1) distinct points and let y_0, y_1, · · · , y_n be (n+1) corresponding values (measurements), n ≥ 0. Then, there exists a unique polynomial, p_n(x), of degree n such that

$$p_n(x_i) = y_i, \qquad i = 0, \cdots, n. \qquad (2.1)$$

This is called the interpolation polynomial.


Proof:

Uniqueness: Assume q_n(x) and p_n(x) are two polynomials of degree n such that p_n(x_i) = q_n(x_i) = y_i, i = 0, · · · , n. Then the polynomial r_n(x) ≡ p_n(x) − q_n(x) satisfies r_n(x_i) = 0, i = 0, · · · , n, i.e., r_n(x) is a polynomial of degree n with at least (n+1) roots. But we know from the fundamental theorem of algebra that a polynomial of degree n has at most n distinct roots. Therefore, r_n(x) = 0 for all x.

Existence: By construction. See, e.g., the Lagrange interpolation below.

2.2 Lagrange Interpolation Method

2.2.1 Lagrange polynomials

Given x_0 < x_1 < x_2 < · · · < x_n, n ≥ 0, we can construct n+1 different polynomials of degree n so that each one of them takes the value 1 at a fixed point x_j and is zero at all other points x_i, i ≠ j. These are given by

$$L_j(x) = \prod_{i=0,\ i\neq j}^{n} \frac{x - x_i}{x_j - x_i}, \qquad j = 0, 1, \cdots, n,$$

and are called the Lagrange polynomials. It is easy to verify that

$$L_j(x_i) = \delta_{ij} \equiv \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j. \end{cases} \qquad (2.2)$$

2.2.2 Lagrange interpolation polynomial

Consider (n+1) points x_0 < x_1 < x_2 < · · · < x_n and (n+1) corresponding values y_0, y_1, · · · , y_n, n ≥ 0. The unique interpolation polynomial corresponding to these data is given by

$$p_n(x) = \sum_{j=0}^{n} L_j(x)\, y_j. \qquad (2.3)$$

Example: Consider the following data points.
x   -1   0   1
y    4   1   0

Let us construct the interpolation polynomial using Lagrange's method. Since we have exactly three data points, n = 2. The three Lagrange polynomials are

$$L_0(x) = \frac{x(x-1)}{(-1)\times(-2)} = \frac{1}{2}x(x-1),$$


$$L_1(x) = \frac{(x+1)(x-1)}{(1)\times(-1)} = -(x+1)(x-1),$$

$$L_2(x) = \frac{(x+1)x}{(2)\times(1)} = \frac{1}{2}x(x+1).$$

Thus, the interpolation polynomial is given by

$$p_2(x) = 4L_0(x) + L_1(x) + 0\times L_2(x) = 2(x-1)x - (x+1)(x-1) = x^2 - 2x + 1.$$
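Since the interpolation polynomial is unique, the result can be checked quickly with Matlab's polyfit (a minimal sketch):

>> xi=[-1 0 1]; yi=[4 1 0];
>> c=polyfit(xi,yi,2)      % coefficients of x^2, x, 1: returns 1 -2 1
>> polyval(c,xi)           % reproduces the data 4 1 0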

2.3 Interpolation error

Now, suppose that we actually know the true function f that takes the values y_i at the corresponding points x_i, but for some (practical) reason we want to approximate it by a polynomial. This reason can be the fact that f(x) is hard to evaluate or is only known in theory. Suppose we need to integrate or differentiate this function. In this case we want to control the approximation error between f and p_n. This question can also be raised from a different point of view: one can wonder how far p_n would be from an arbitrary but smooth function f that passes through the points (x_i, y_i), i.e., what kind of functions would be close to p_n.

Let's define the error as the max norm of the difference between f and p_n:

$$||f - p_n|| = \max_{a\le x\le b} |f(x) - p_n(x)|.$$

We have the following theorem.

Theorem 3 Let f be a class C^{n+1} function on [a, b], i.e., f is (n+1)-times continuously differentiable on [a, b]. Let a = x_0 < x_1 < x_2 < · · · < x_n = b, n ≥ 0, be (n+1) distinct points and p_n(x) the corresponding interpolation polynomial:

$$p_n(x_i) = f(x_i), \qquad i = 0, 1, \cdots, n.$$

Then, the interpolation error or residual R(x) ≡ f(x)− pn(x) satisfies

$$R(x) = \left[ \prod_{i=0}^{n} (x - x_i) \right] \frac{f^{(n+1)}(\xi)}{(n+1)!}, \qquad \text{for some } \xi \in [a, b]. \qquad (2.4)$$

Proof:

The result is trivial if x = x_i for some i = 0, 1, · · · , n; both sides of the equality are zero. Let x ≠ x_i, i = 0, 1, · · · , n, be fixed in [a, b]. Consider the polynomial q_{n+1} that interpolates the residual function R(z) at the (n+2) points x_0, x_1, · · · , x_n and x_{n+1} = x (note that the variable is now z, not x, because x is fixed). Since R(x_i) = 0, i = 0, 1, · · · , n, only one term from the Lagrange formula in (2.3) is non-zero. We have

$$q_{n+1}(z) = R(x) L_{n+1}(z) = (f(x) - p_n(x)) \prod_{i=0}^{n} \frac{z - x_i}{x - x_i}.$$


Because of the “interpolation property”, the function R_q(z) = R(z) − q_{n+1}(z) has at least n+2 roots in the interval [a, b], namely, the n+2 interpolation points: z = x_0, x_1, · · · , x_n and z = x. Rolle's theorem applied repeatedly implies that the (n+1)th derivative of R_q(z) = R(z) − q_{n+1}(z) must have at least one root in [a, b], i.e.,

$$\exists\, \xi \in [a, b] \ \text{such that} \ R_q^{(n+1)}(\xi) \equiv R^{(n+1)}(\xi) - \frac{d^{n+1}}{dz^{n+1}} q_{n+1}(\xi) = 0. \qquad (2.5)$$

Then, on the one hand, we have R^{(n+1)}(ξ) = f^{(n+1)}(ξ), because the (n+1)th derivative of p_n(z) is zero. On the other hand, we have

$$\frac{d^{n+1}}{dz^{n+1}} q_{n+1}(\xi) = (f(x) - p_n(x)) \frac{(n+1)!}{\prod_{i=0}^{n} (x - x_i)}.$$

Thus, the equation in (2.5) is equivalent to

$$f^{(n+1)}(\xi) - (f(x) - p_n(x)) \frac{(n+1)!}{\prod_{i=0}^{n} (x - x_i)} = 0.$$

This is equivalent to

$$f(x) - p_n(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!} \prod_{i=0}^{n} (x - x_i),$$

which is the result of the theorem.

2.3.1 Error bound and convergence

Intuitively, we expect that, at least when a ≤ x ≤ b, if the number of interpolation points (n+1) is increased, the interpolation error in (2.4) will decrease and eventually converge to zero, based on the factorial in the denominator, which grows much faster than n. However, if the differences x − x_i remain large, the numerator will also become large when n is increased. This can happen if the points x_i are clustered in some region and x is chosen far away from this region. More importantly, this also depends on the factor f^{(n+1)}(ξ), which can also increase significantly with n.

For simplicity, let us assume that the (n+1) interpolation points are equally spaced: x_i = a + ih, i = 0, · · · , n, where h = (b−a)/n. This clearly avoids the clustering problem. Then, under the conditions of Theorem 3 we have

$$||f - p_n|| \le \frac{h^{n+1}}{n+1} M_{n+1}, \qquad \text{where } M_{n+1} = \max_{a\le x\le b} |f^{(n+1)}(x)|. \qquad (2.6)$$

To see this, consider the change of variables x ↔ t: x = th + a. Then the error term in (2.4) becomes

$$f(x) - p_n(x) = \frac{h^{n+1}}{(n+1)!} f^{(n+1)}(\xi) \prod_{i=0}^{n} (t - i).$$

(Because x − x_i = h(t − i).) Further, it is easy to show that

$$\pi_n(t) = \Big| \prod_{i=0}^{n} (t - i) \Big| \le n!, \qquad 0 \le t \le n.$$

In fact, we can show that π_n(t) ≤ n!/4. See Problem 5 below.
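As a sanity check of the bound (2.6), here is a minimal sketch assuming f(x) = sin(x) on [0, 1], for which M_{n+1} ≤ 1, and using polyfit to construct p_n (the measured error stays below the bound):

f = @(x) sin(x);
a = 0; b = 1;
for n = [2 4 6]
    h = (b-a)/n;
    xi = a:h:b;                            % n+1 equally spaced nodes
    c = polyfit(xi, f(xi), n);             % interpolation polynomial p_n
    xx = a:1e-3:b;
    err = max(abs(f(xx) - polyval(c,xx))); % measured max error
    bound = h^(n+1)/(n+1);                 % bound (2.6) with M_{n+1} <= 1
    fprintf('n=%d  error=%.2e  bound=%.2e\n', n, err, bound);
end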


Figure 2.1: Interpolation of Runge's function f(x) = 1/(1 + 25x^2) showing the Runge phenomenon.

2.4 The Runge phenomenon and piece-wise polynomial interpolation

2.4.1 The Runge phenomenon

Consider the function f(x) = 1/(1 + 25x^2) on [−1, 1]. In Figure 2.1, we plot the graph of f(x) and the associated interpolation polynomial p_n, for different values of n, n = 2, 7, 10, corresponding to the (n+1) equally spaced points x_i = −1 + 2i/n, i = 0, 1, · · · , n. This figure was generated by executing the Matlab program below, which calls the two Matlab functions lagrange.m and interpolate.m. Note that as n is changed from 2 to 7, we see some kind of improvement of the approximation, although p_7 shows some oscillations. But when n = 10 the oscillations become worse and the interpolation error becomes larger than that at n = 2. This is not typical for nice functions with reasonably bounded high order derivatives, but it gives us a warning, especially if we don't know much about f, that high order interpolation polynomials may display large, unwanted oscillations. (Try to compute max f^{(11)}(x) for Runge's function!)

Matlab program:

>> n=2;

>> xi=-1:2/n:1;

>> yi=1./(1.+25*xi.^2);

>> x=-1:1/100:1;

>> P2=interpolate(x,n,xi,yi);


>> n=7;

>> xi=-1:2/n:1;

>> yi=1./(1.+25*xi.^2);

>> P7=interpolate(x,n,xi,yi);


>> n=10;

>> xi=-1:2/n:1;

>> yi=1./(1.+25*xi.^2);

>> P10=interpolate(x,n,xi,yi);

>> figure(1)

>> fplot('1./(1+25*x^2)',[-1,1])
>> hold on
>> plot(x,P2,'o-')
>> plot(x,P7,'r-.')
>> plot(x,P10,'--')
>> axis([-1.2,1.2,-0.5,2])
>> grid on
>> legend('f','p_2','p_7','p_10',0)

>> figure(1)

>> print -depsc rungephen.eps

%***********M-file: lagrange.m

%************************************

function lg=lagrange(x,N,j,xj)

lg=1;

for l=1:j-1

lg=lg.*(x-xj(l))/(xj(j)-xj(l));

end

for l=j+1:N+1

lg=lg.*(x-xj(l))/(xj(j)-xj(l));

end

%************************************

%***********Mfile: interpolate.m

%************************************

function pn=interpolate(x,N,xj,yj)

pn=0;

for j=1:N+1

pn=pn+yj(j)*lagrange(x,N,j,xj);

end

%************************************

2.4.2 Piecewise polynomial interpolation

To avoid the possibility of catastrophic oscillations of a high order interpolation polynomial, as depicted by the Runge phenomenon above, many other “approximation” techniques have been invented and used in practice. Some of those techniques consist in finding the “best approximation” among a certain class of functions, which minimizes a certain norm or semi-norm between the given function f and the approximating function. This can be a polynomial of a certain degree, or a linear combination of a certain class of “basis functions”, e.g. sines and cosines, which yield an approximation known as the discrete Fourier transform.

Among those approximation techniques there is what is known as the least squares method, which consists in minimizing the L^2 norm or its discrete version. We will come back to this subject later.

However, one of the simplest ways to construct a fairly accurate interpolant, without the risk of undesirable oscillations, in the case of a large number of interpolation points^1, is to find instead a piecewise polynomial which connects the interpolation points (x_i, y_i). This often results in a broken curve such as the ones displayed in Figure 2.2.

There are many ways to do this. One of them is simply connecting two successive points (x_i, y_i) and (x_{i+1}, y_{i+1}) by a straight line. This is called piece-wise linear interpolation. If instead we select three successive points, (x_i, y_i), (x_{i+1}, y_{i+1}) and (x_{i+2}, y_{i+2}), at a time and connect them by parabolas, we get a piece-wise 2nd order polynomial interpolation. Although such higher order piece-wise polynomial approximations have some merits, especially when applied to finite-difference schemes for differential equations or to numerical integration as we will see later in the course, they have the disadvantage of displaying sharp cusps at the points where the different pieces are glued together, which can in some situations be sharper than those of the piece-wise linear interpolation.

One remedy for the sharp cusps resulting from the piece-wise linear and piece-wise parabolic interpolation is to match not only the polynomials at the extremities of the sub-intervals but also their derivatives, which results in a smoother approximation function that is in fact not only continuous but differentiable on the whole interval of interpolation. Such smoothed piece-wise interpolation polynomials are achieved in Matlab by choosing the method 'spline' or 'cubic' in the interp1 function of Matlab. This is what is done in Figure 2.3 for Runge's function.

The piece-wise parabolic polynomial of Figure 2.2 is obtained by executing the following Matlab program. It calls the lagrange.m and interpolate.m routines from the previous sub-section:

>>xi=-1:2/10:1;

for I=0:2:8

xij=[xi(I+1),xi(I+2),xi(I+3)];

yij=1./(1.+25*xij.^2);

xj=xij(1):(xij(3)-xij(1))/10:xij(3);

P2j=interpolate(xj,2,xij,yij);

figure(2),hold on

plot(xj,P2j,'--');

end

^1 A large number of points is sometimes required in practice, for example, when the interpolation domain is large or when the data (the y_i values) exhibit a lot of variability, i.e. the differences between successive y-values are somewhat large and changing.


Figure 2.2: Piece-wise interpolation of Runge's function f(x) = 1/(1 + 25x^2) using piece-wise linear and piece-wise quadratic polynomials, with 11 interpolation points.

while for the piece-wise linear curve we used the interp1 function of Matlab:

>> x=-1:1/100:1;

>> xi=-1:2/n:1;

>> yi=1./(1.+25*xi.^2);

>> plot(xi,interp1(xi,yi,xi,'linear'),':');

2.5 Other methods for constructing the interpolation polynomial

2.5.1 Vandermonde’s Matrix

Lagrange's method presented above is not the only way to construct the interpolation polynomial. One obvious way is to solve the following linear system of (n+1) equations

$$a_0 + a_1 x_i + a_2 x_i^2 + \cdots + a_n x_i^n = f(x_i), \qquad i = 0, 1, \cdots, n,$$

for the (n+1) unknowns ai, i = 0, 1, · · · , n. This can be written in matrix form as follows.

$$\begin{pmatrix} 1 & x_0 & x_0^2 & \cdots & x_0^n \\ 1 & x_1 & x_1^2 & \cdots & x_1^n \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^n \end{pmatrix} \begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_n \end{pmatrix} = \begin{pmatrix} f(x_0) \\ f(x_1) \\ \vdots \\ f(x_n) \end{pmatrix}.$$


Figure 2.3: Piece-wise spline and piece-wise cubic interpolation of Runge's function f(x) = 1/(1 + 25x^2) using the interp1 function of Matlab, with 11 interpolation points.

The matrix in front is known as the Vandermonde matrix, named after its discoverer. At first sight this appears like an attractive way to solve the interpolation problem. However, it turns out that for large n and for some particular point sets, the solution is very sensitive to the input data: very small changes in the values of f(x_i) or the interpolation points x_i lead to large deviations in the coefficients a_i, i = 0, 1, · · · , n, i.e., the matrix itself is “ill-conditioned”. We will see what exactly this means later on in this course. Moreover, the numerical resolution of the linear system becomes very costly when n gets large. In other words, it is in general a very bad idea in practice to try to solve the interpolation problem using Vandermonde's method. Nevertheless, it remains a nice theoretical and pedagogical result.
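A minimal sketch of this ill-conditioning, using Matlab's vander and cond on equally spaced points in [-1, 1] (the choice of nodes is just an example; the condition number grows rapidly with n):

for n = [5 10 15 20]
    xi = linspace(-1,1,n+1);
    V = vander(xi);                  % (n+1)x(n+1) Vandermonde matrix
    fprintf('n = %2d   cond(V) = %.2e\n', n, cond(V));
end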

2.5.2 Newton’s divided differences

In practical numerical approximation procedures, it is often desirable to be able to re-use any previous calculation made at a given step and carry the results over to a higher step. In this respect, the Lagrange interpolation technique is deficient: suppose we computed the interpolation polynomial using Lagrange's method, based on n + 1 given points, but we are unsatisfied. Imagine that we decided to add a few more interpolation points to gain more accuracy or that more data has just arrived. In this case, we have to restart from the beginning, computing again all the Lagrange polynomials, Li, for the new configuration.

As an alternative, Newton's divided differences construct the interpolation polynomial step by step, starting from the lowest degree and adding one point at a time when going from one degree to the next.


Divided differences: For the general case of non-equally spaced and arbitrarily ordered (n+1) distinct points x0, x1, x2, · · · , xn, we introduce the divided differences as follows.

δ^0 f[x0] = f(x0),   δ^1 f[x0, x1] = (f(x1) − f(x0)) / (x1 − x0),

δ^k f[x0, x1, · · · , xk] = ( δ^{k−1} f[x1, x2, · · · , xk] − δ^{k−1} f[x0, x1, · · · , xk−1] ) / (xk − x0),   2 ≤ k ≤ n.

Then, the interpolation polynomial of f is given by

pn(x) = δ^0 f[x0] + δ^1 f[x0, x1](x − x0) + δ^2 f[x0, x1, x2](x − x0)(x − x1)                (2.7)
        + · · · + δ^n f[x0, x1, · · · , xn](x − x0)(x − x1) · · · (x − xn−1)                  (2.8)
      = ∑_{k=0}^{n} δ^k f[x0, x1, · · · , xk] ∏_{j=0}^{k−1} (x − xj).

If a new point xn+1, with xn+1 ≠ xj, j = 0, 1, 2, · · · , n, is added anywhere within the domain of interest, then the new interpolation polynomial of degree n + 1 is given by

pn+1(x) = pn(x) + δ^{n+1} f[x0, x1, · · · , xn+1] ∏_{j=0}^{n} (x − xj),

meaning that all the work done previously is recycled. This is not the case when using Lagrange's method!
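To make this concrete, here is a minimal Matlab sketch (the function name newtondd and its organization are our own, not part of the course files) that builds the divided-difference coefficients and evaluates the Newton form:

function p = newtondd(x,xi,yi)
% newtondd: evaluate the Newton form of the interpolation polynomial
% xi, yi: interpolation points and values; x: evaluation points
n = length(xi)-1;
d = yi(:);                         % d will hold the divided differences
for k = 1:n
    for j = n+1:-1:k+1             % update the table in place, bottom up
        d(j) = (d(j)-d(j-1))/(xi(j)-xi(j-k));
    end
end
p = d(n+1)*ones(size(x));          % nested (Horner-like) evaluation of the Newton form
for k = n:-1:1
    p = d(k) + (x - xi(k)).*p;
end
end

For example, >> xi=-1:2/10:1; yi=1./(1.+25*xi.^2); p=newtondd(-1:0.01:1,xi,yi); evaluates the degree 10 interpolation polynomial of Runge's function. Adding one more interpolation point only appends one coefficient to d and one factor to the product, which is precisely the recycling property described above.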

Undivided differences: Now assume that x0, x1, x2, · · · , xn are n+1 distinct and equally spaced points, indexed in increasing order:

h = xi+1 − xi,   xi = x0 + ih,   i = 0, 1, 2, · · · , n.

Let fi = f(xi). We introduce the successive undivided differences:

∆^0 fi = f(xi);   ∆fi = fi+1 − fi;
∆^2 fi = ∆fi+1 − ∆fi = fi+2 − 2fi+1 + fi,
∆^n fi = ∆(∆^{n−1} fi) = ∆^{n−1}(∆fi),   n ≥ 1.

Claim: In this case we can show that the interpolation polynomial of f(x) associated with x0, x1, · · · , xn is given by

pn(x) = f0 + ((x − x0)/(x1 − x0)) ∆f0 + ((x − x0)(x − x1)/((x2 − x0)(x2 − x1))) ∆^2 f0 + · · ·
        + ( ∏_{i=0}^{n−1} (x − xi)/(xn − xi) ) ∆^n f0,                                      (2.9)

or

pn(x) = f0 + ∑_{k=1}^{n} (1/k!) ∆^k f0 ∏_{j=0}^{k−1} (t − j),


where t is such that x = x0 + th.

Proof of the Claim:

We have (for equally spaced points)

δ^2 f[x0, x1, x2] = (δf[x1, x2] − δf[x0, x1]) / (x2 − x0)
                  = ( (f(x2) − f(x1))/(x2 − x1) − (f(x1) − f(x0))/(x1 − x0) ) / (x2 − x0).

This is equal to ∆^2 f(x0) / ((x2 − x0)(x1 − x0)) if x2 − x1 = x1 − x0, and similarly for the higher order terms:

δ^k f[x0, x1, · · · , xk] = ( ∏_{i=0}^{k−1} 1/(xk − xi) ) ∆^k f(x0).

Therefore, in the case of equally spaced interpolation points, the expressions in (2.9) and (2.7) are indeed equivalent.

For the case of uniformly spaced points, the Newton form of the interpolation polynomial reminds us of Taylor's formula, where the products ∏_{i=0}^{k−1}(x − xi) play the role of the powers (x − x0)^k and ∆^k f0 / ∏_{i=0}^{k−1}(xk − xi) assume the role of f^(k)(x0)/k!. In fact, it is easy to show that

∆^k f0 / ∏_{i=0}^{k−1}(xk − xi) ≈ f^(k)(x0) / k!.

2.6 Problems

1. Consider the polynomial in (2.3). Prove that pn(xi) = yi and deduce that pn(x) is indeed the interpolation polynomial of Theorem 2.

2. Consider the data points

   x : −1   0   1
   y :  4   1   0

   Use Newton's divided differences to compute the interpolation polynomial associated with the given data.

3. Show that the error bound for the interpolation polynomial

   ||f − pn|| ≤ (h^{n+1} / (n + 1)) M,   where M = max_{a≤x≤b} |f^{(n+1)}(x)|,

   can be generalized to the case of non-uniformly spaced interpolation points, a = x0 < x1 < x2 < · · · < xn = b, by setting

   h = max_{0≤i≤n−1} (xi+1 − xi).


4. Prove that the Vandermonde determinant is given by

   det [ 1  x0  x0^2  · · ·  x0^n ]
       [ 1  x1  x1^2  · · ·  x1^n ]   =   ∏_{i>j} (xi − xj).
       [ :   :    :           :   ]
       [ 1  xn  xn^2  · · ·  xn^n ]

Deduce that the interpolation polynomial exists and is unique if and only if the interpolation points x0, x1, · · · , xn are distinct.

5. Consider the polynomial

πn(t) = ∏_{i=0}^{n} (t − i),   0 ≤ t ≤ n.

It is trivial to see that |πn(t)| ≤ n!.

Show that

|πn(t)| ≤ n!/4,   0 ≤ t ≤ n.

Hint: Show that for all 0 ≤ i0 ≤ n−1 we have (t− i0)(i0+1− t) ≤ 1/4, for all i0 ≤ t ≤ i0+1.

6. Use the expressions in (2.4) and (2.6) to prove the following statements.

i) If the interpolation points are uniformly spaced with a step size xi+1 − xi = h, 0 ≤ i ≤ n − 1, then the residual satisfies

   R(x) = O(h^{n+1})

   and the interpolation polynomial pn(x) converges uniformly² to f(x) as n −→ +∞, if Mn+1 doesn't increase too fast. How fast is too fast?

ii) The error bound in (2.6) remains valid even if the interpolation points are not uniformly spaced. Instead of the uniform step size, we can set

    h = max_{0≤i≤n−1} |xi+1 − xi|,

    where x0 < x1 < · · · < xn.

iii) The interpolation polynomial is unique, i.e., given n+1 distinct (interpolation) points, x0 < x1 < · · · < xn, and (n+1) values, y0, y1, · · · , yn, the associated interpolation polynomial, such that pn(xi) = yi, i = 0, · · · , n, is unique. (Don't redo the proof of Theorem 2; find another method.)

7. Show that the polynomial defined in (2.7) satisfies pn(xi) = f(xi), i = 0, · · · , n, and deduce that this is indeed the interpolation polynomial.
   Hint: Show that

   f(xj) = f(x0) + (xj − x0)δf[x0, x1] + (xj − x1)(xj − x0)δ^2 f[x0, x1, x2] + · · ·
           + (xj − xj−1)(xj − xj−2) · · · (xj − x0)δ^j f[x0, x1, · · · , xj].

2 Do you know what this means?


Chapter 3

Initial value problems

3.1 Introduction

Here, we discuss some basic numerical methods, including their accuracy and convergence, for initial value problems (IVPs; systems of differential equations) of the form

dX/dt = F(X, t),   t ∈ [0, T],   T > 0,
X(0) = X0.                                                         (3.1)

Here, X = (x1, x2, · · · , xd)^T is a vector-valued function of t in R^d, d ≥ 1, and F = (f1, f2, · · · , fd) is a smooth (vector) function from R^d × [0, T] to R^d. Usually t represents time and T is a fixed time period up to which we desire to solve or integrate the IVP.

We note that, without loss of generality, we can assume that the system above is autonomous, i.e., F is independent of t, and that any high order ordinary differential equation can be converted to a first order system of the form (3.1).
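For instance (a standard example, added here for illustration), the second order equation ẍ + ω^2 x = 0 becomes, after setting X = (x1, x2)^T = (x, ẋ)^T, the first order system ẋ1 = x2, ẋ2 = −ω^2 x1, which is of the form (3.1) with F(X) = (x2, −ω^2 x1)^T.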

As an interesting example, we consider the predator-prey equations that are used to model population dynamics in biology and environmental sciences. Details on these equations can be found in almost any good textbook on differential equations.

ẋ = αx − βxy
ẏ = −γy + δxy.                                                     (3.2)

Here, x, y are two scalars (real variables) representing the population densities of prey and predators, respectively, while α, β, γ, δ are positive parameters whose fixed values depend on the particular species involved. The notation ẋ ≡ dx/dt is adopted here.

The system in (3.2) admits two equilibrium points: (0, 0) and (xs, ys) = (γ/δ, α/β); (0, 0) is a saddle point (unstable) while (xs, ys) is a centre (stable). The trajectories are closed curves around the centre (xs, ys) whose shape is close to an ellipse, especially near the centre. This can


be seen easily if we consider the linearized system about (xs, ys). Let x′ = x − xs, y′ = y − ys be the deviations from the equilibrium point (xs, ys). Then, (x′, y′) solve the system

ẋ′ = αx′ − βys x′ − βxs y′ − βx′y′
ẏ′ = −γy′ + δys x′ + δxs y′ + δx′y′.

For small enough perturbations, the nonlinear terms involving the product x′y′ can be ignored and the above system becomes linear. If in addition we plug in the values of xs, ys and multiply the top and bottom equations by δys x′ and βxs y′, respectively, we obtain

d/dt ( δys x′^2 + βxs y′^2 ) = 0,

i.e., δys x′^2 + βxs y′^2 = Const., which is equivalent to saying that the trajectories of (x′, y′) are ellipses.

Also, by formally dividing the second equation in (3.2) by the first one, we obtain

dy/dx = y(δx − γ) / ( x(α − βy) ).

This first order ode can be solved by separation of variables to yield the implicit solution

H(x, y) := α ln(y) − βy − δx + γ ln(x) = C (constant).

This implies that the trajectories of the predator-prey system are the level curves of the function H(x, y), which can be shown to be indeed closed around the point (xs, ys) (because H(x, y) is a strictly concave function there). The complete phase portrait of the predator-prey system is given in Figure 3.1.
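Since the trajectories are level curves of H, they can be visualized directly. A minimal Matlab sketch (using the parameter values of Figure 3.3; the grid and the number of levels are our own choices) is:

>> alpha=0.25; beta=0.01; gamma=1; delta=0.01;   % parameter values of Figure 3.3
>> [x,y]=meshgrid(1:2:300,1:1:100);              % grid of (prey,predator) values
>> H=alpha*log(y)-beta*y-delta*x+gamma*log(x);   % the conserved function H(x,y)
>> contour(x,y,H,30)                             % level curves = trajectories
>> xlabel('x (prey)'), ylabel('y (predator)')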

As we will see later, this system provides an interesting test for numerical methods, separating those that yield closed approximate trajectories from those that don't!

We end this introductory section by discussing the well posedness of the IVP (3.1). While our goal here is to use numerical methods to find an approximate solution for the IVP (3.1), before doing so it is important to realize whether the problem has a solution and whether the actual solution is "computable". An IVP which has a unique solution that is not very sensitive to small perturbations in the initial data is said to be well posed. Formally, an IVP of the type (3.1) is said to be well posed if for any given initial condition, in some neighborhood of X0, it has a unique solution (in a certain sense) X(t, X0) which is continuous with respect to X0. We note that the continuity with respect to X0 guarantees that a small error (typically due to round off or truncation) will yield a solution which is close (in some sense) to the solution associated with the unperturbed data, X0, at least for sufficiently small time t. According to the theory of differential equations, we have the following existence and uniqueness theorem.

Theorem 4 (Well posedness) The initial value problem (3.1) has a unique solution if

1. F(X, t) is continuous with respect to (X, t) on R^d × [0, T];



Figure 3.1: A sketch of the phase portrait of the predator-prey system

2. F(X, t) satisfies the Lipschitz condition with respect to X:

∃L > 0 such that ||F (X1, t)− F (X2, t)|| ≤ L||X1 −X2||, ∀t ∈ [0, T ].

Moreover, under the conditions (1) and (2), the unique solution is continuous with respect to the initial condition, X0.

The conditions (1) and (2) of the above theorem can be relaxed to a neighbourhood of (X0, 0), but in this general case it is not guaranteed that the solution extends to the whole interval [0, T].

Next, we provide a few examples that illustrate the meaning of Theorem 4. In essence it provides sufficient conditions for the existence, uniqueness, and stability of solutions (well conditioning).

Examples

1. Smooth systems are always well posed. For a smooth system such as the predator-prey problem in (3.2), it is easy to establish well posedness using Theorem 4. We have

F(x, y) = ( αx − βxy ,  −γy + δxy )^T,

which is a continuously differentiable function from R^2 to R^2; thus the IVP satisfies the conditions


of Theorem 4; we can show that

||F (x1, y1)− F (x2, y2)|| ≤ L||(x1, y1)− (x2, y2)||

on any fixed rectangle (with L depending on the rectangle). Here, ||.|| is the Euclidean norm. Thus the IVP has a unique solution for any given initial data, which extends to t −→ ∞. The extension to infinity results from the fact that the solution exists on an initial rectangle, where the Lipschitz condition is readily established; once the solution reaches the boundary of this rectangle, at time t = T∗, we re-consider the IVP with initial condition the exit point (T∗, x∗, y∗), where (x∗, y∗) is precisely the point where the solution intersects the boundary.

2. Multiple solutions when Lipschitz’s condition is not satisfied. Consider

ẋ = x^{1/3},   x(0) = 0.

This IVP doesn't satisfy the conditions of the well-posedness theorem; namely, the function F(x) = x^{1/3} is not Lipschitz near zero:

∀L > 0,   |x^{1/3} − 0| > L|x − 0|   for x ≠ 0 sufficiently close to zero.

It is easy to see that it has infinitely many solutions. One of the solutions is simply x0(t) = 0 for all t > 0. Moreover, by integrating the ODE we arrive at the general solution

x(t) = [ (2/3)(t − c) ]^{3/2}.

Thus, for all c ≥ 0, the function

x(t) = { 0,                        if 0 ≤ t ≤ c,
       { [ (2/3)(t − c) ]^{3/2},   if t > c,

is also a continuously differentiable solution to the IVP¹. This is a counter example that demonstrates that uniqueness can be lost if the Lipschitz condition is not satisfied.

3. The Lipschitz condition is not necessary for well posedness. Consider the ODE

ẋ = { x ln|x|,   if x > 0,
    { 0,         if x = 0.

Since ln x is unbounded when x −→ 0, we can easily see that the associated function is not Lipschitz near zero. Nonetheless, we can show that the associated IVP has a unique solution for any initial condition x(0) = x0, e.g., x0 ≥ 0. To see this, notice that if x0 > 0, then the theorem applies and the unique solution, given by x(t) = exp(C e^t) where C = ln x0², cannot cross the line x = 0 for all t ≥ 0, and when x0 = 0, x(t) = 0 is the only solution.

4. IVP's for which Lipschitz's condition is not satisfied are generally hard to handle both analytically and numerically. The function f(x) = x sin(1/x) is clearly not differentiable at x = 0, but we can also show that it is not Lipschitz. To see this, consider points x and y of the form x = 1/(2nπ + π/2) and y = 1/(2nπ + 3π/2). We have x sin(1/x) − y sin(1/y) = (2nπ + π/2)^{−1} + (2nπ + 3π/2)^{−1} = O(n^{−1}) and x − y = π ((2nπ + π/2)(2nπ + 3π/2))^{−1} = O(n^{−2}) when n −→ ∞. So clearly, we cannot have |f(x) − f(y)| ≤ L|x − y| in the neighbourhood of zero since f(x) − f(y) = O(n^{−1}) for this particular choice of x and y. So the theorem cannot be applied to the IVP ẋ = f(x), x(0) = x0, and since we cannot construct solutions in closed form either, the question remains as to whether this IVP has a unique solution or not.

1 It is left to the student to prove that x(t) is continuous, differentiable, and that its derivative is continuous.
2 For x0 < 0 the solution is x(t) = −exp(C e^t) with C = ln(−x0).

5. There are functions that are Lipschitz but not differentiable. In many (low-level) textbooks, the existence and uniqueness theorem for ODE's is stated with the requirement of differentiability with respect to the variable X, in place of the Lipschitz continuity requirement. Note that the function f(x) = |x| is (globally) Lipschitz, because ||x| − |y|| ≤ |x − y|, but it is not differentiable at x = 0. So for the associated IVP, we can only apply the existence and uniqueness theorem as presented in Theorem 4, which is more general than the one requiring differentiability, with which some students are perhaps more familiar.

3.2 Euler’s method

Consider the IVP in (3.1). For simplicity in exposition, we assume that d = 1. In fact, all the numerical methods developed in this section will be presented for the 1d case; their generalization to higher dimension is straightforward, when it is not trivial.

Let t0 = 0 < t1 < t2 < · · · < tn = T be a subdivision of the interval [0, T]. Let hi = ti+1 − ti. The subdivision or discretization is said to be uniform when hi is independent of i, i.e., hi = h = T/n, i = 0, · · · , n − 1. We wish to find "approximate" values xi ≈ x(ti) for the solution x(t) of the IVP at the discrete points t1, · · · , tn, given the initial value x0 = x(t0). Euler's method starts by approximating the derivative (or the slope) ẋ(t) on the interval [ti, ti+1] by

ẋ(t) ≈ (x(ti+1) − x(ti)) / hi.

Then, the differential equation becomes

x(ti+1)− x(ti) ≈ hif(x(t), t).

If, in addition, we assume that over the interval [ti, ti+1], f(x(t), t) ≈ f(x(ti), ti), then we get the formula x(ti+1) ≈ x(ti) + hi f(x(ti), ti), which allows one to compute an approximate value for x(ti+1) given the value of x(ti). However, only the exact value of x(t0) = x0 is known in practice. Nevertheless, a sequence of approximate values xi, i = 1, 2, · · · , n, can be obtained iteratively by simply replacing x(ti) by xi in the above equation. We have the Euler scheme:

x0 given,                                                          (3.3)
xi+1 = xi + hi f(xi, ti),   i = 0, 1, · · · , n − 1.

An alternate derivation of Euler's method consists of simply assuming that the slope of the solution, ẋ(t) = f(x(t), t), is constant over the interval [ti, ti+1] and equal to f(x(ti), ti). Then a simple integration of the new ODE ẋ = f(xi, ti) on [ti, ti+1] leads to (3.3) as its solution.


Example: For illustration, we consider the example

ẋ = 2x + t,   0 ≤ t ≤ 2,   x(0) = 0.

We note that for this simple case the exact, analytical solution is known in closed form. It is given by

x(t) = (e^{2t} − 1)/4 − t/2.

If we apply Euler's method to this problem with a uniform step size h = 2/n, we obtain, for n = 20 or h = 0.1, for example,

x0 = 0, x1 = x0 + h(2x0 + 0) = 0, x2 = x1 + h(2x1 + h) = 0.01, etc.

In general, we have

x0 = 0;   xi+1 = xi + h(2xi + ti) = xi + h(2xi + ih) = (1 + 2h)xi + ih^2.

The automatic processing of such an iterative process is easily carried out on a computer using an appropriate programming language. A simple Matlab code is given in Table 3.1. Note that the two

>>fplot('(exp(2*x)-1)/4 - x/2',[0,2]), hold on   % plot the exact solution

>>n=20; h=2/n; x0=0; x=x0;                       % step size and initial value

>>for I=1:n

      x1=(1+2*h)*x0+(I-1)*h^2;                   % one Euler step: (1+2h)x_i + i h^2

      x=[x,x1];                                  % store the new value

      x0=x1;

  end

  plot(0:h:2,x,'r--')

  legend('Exact','h=0.1')

  title('Exact and numerical solution')

Table 3.1: A Matlab code for Euler’s method.

plot commands make it possible to display both the exact and the numerical solution on the same graph. The result is displayed in Figure 3.2 (a) for the two cases n = 20 and n = 40. Two important points are worth noting here.

1. The exact and numerical solutions are indistinguishable for small t, but as t grows they diverge significantly from each other.

2. When the step size is reduced, the approximation error between the exact and the numerical solutions decreases.

The error ei = x(ti) − xi is plotted on the lower panel of Figure 3.2 (b). Curiously, by doubling the number of points, the error is reduced by roughly a half throughout the interval [0, 2], and in both


cases the error seems to increase exponentially with time. As we will see, this is typical for first order one-step methods. Euler's scheme is a first order, one-step method.

To further illustrate the performance of Euler's method on practical problems, we apply it to the predator-prey equations (3.2). In Figure 3.3, we plot the time evolution of the prey and predator populations and the corresponding trajectories in phase space, obtained by Euler's method with four different step sizes. We note that for too large a step size, the trajectory spirals out and the solution diverges along the x-axis very quickly, without displaying the expected closed trajectory and the expected periodic behaviour. It is only with h = 0.01 and smaller that we start to see periodic orbits. Using Euler's method with a large step size can indeed lead to a misleading numerical solution.
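A minimal sketch of how Euler's method can be applied to (3.2), using the parameter values of Figure 3.3 (the variable names and plotting commands are our own), is:

>> alpha=0.25; beta=0.01; gamma=1; delta=0.01;        % parameters of Figure 3.3
>> h=0.01; T=30; n=round(T/h);
>> x=zeros(1,n+1); y=zeros(1,n+1); x(1)=80; y(1)=30;  % initial populations
>> for i=1:n
       x(i+1)=x(i)+h*(alpha*x(i)-beta*x(i)*y(i));     % Euler step for the prey
       y(i+1)=y(i)+h*(-gamma*y(i)+delta*x(i)*y(i));   % Euler step for the predator
   end
>> plot(x,y)                                          % phase-space trajectory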

3.2.1 Error analysis

To better understand the numerical results in Figure 3.2 and the behaviour of Euler's method in general, we proceed here to the analysis of the actual approximation error between the exact solution of the IVP and the numerical solution obtained by Euler's method. A central question of practical interest is whether this error vanishes when the step size h −→ 0, i.e., whether the numerical solution converges to the actual solution of the IVP.

We begin by introducing the local truncation error. Assume that the solution at time ti, x(ti), is known and let x̄i+1 = x(ti) + hi f(x(ti), ti), i.e., x̄i+1 is the approximate solution at ti+1 obtained by Euler's method when the solution at time ti is known exactly. The error committed in one time step is given by

ēh,i = x(ti+1) − x̄i+1.                                             (3.4)

This should be distinguished from xi+1 given in (3.3), which accumulates the errors from all preceding time steps.

Assume that x(t) is sufficiently smooth and consider the Taylor expansion of x(ti+1):

x(ti + hi) = x(ti) + hi ẋ(ti) + (hi^2/2) ẍ(ξ),   ti ≤ ξ ≤ ti+1.

Plugging the expression of x̄i+1 and the Taylor expansion of x(ti+1) into (3.4), we get

ēh,i = x(ti) + hi ẋ(ti) + (1/2) hi^2 ẍ(ξ) − [x(ti) + hi f(x(ti), ti)] = (1/2) hi^2 ẍ(ξ),   ti ≤ ξ ≤ ti+1.

In the last step we used the fact that x(ti) satisfies the differential equation ẋ(ti) = f(x(ti), ti). Thus, the local error is proportional to hi^2 times the second derivative of x(t). We say that ēh,i = O(hi^2) when h −→ 0.³ In particular, we have the following upper bound:

|ēh,i| ≤ h^2 M2 / 2,   i = 0, 1, · · · , n − 1,

3 A quantity f is said to be a big O of g, and we write f(x) = O(g(x)) when x −→ a, if lim_{x−→a} f(x)/g(x) = C, where C is a constant.


where, throughout this chapter, we set Mk = max_{[0,T]} |d^k x(t)/dt^k| = max_{[0,T]} |d^{k−1}/dt^{k−1} f(x(t), t)|, k = 1, 2, · · ·⁴

Given a solution x(t), the truncation error is defined as

τh,i = (x(ti+1) − x(ti))/hi − f(x(ti), ti) ≡ (x(ti+1) − x(ti))/hi − ẋ(ti) = (x(ti+1) − x̄i+1)/hi = ēh,i/hi = O(hi),
i = 1, 2, · · · , n.                                                (3.5)

To be more precise, the truncation error is the difference between the numerical scheme and the underlying differential equation.

The actual error, also called the global error, is given by

eh,i = x(ti+1)− xi+1, i = 1, 2, · · · , n, (3.6)

where xi+1 is given by (3.3). It is in some sense the accumulation of the local errors of all the preceding time steps. In particular, the numerical method is said to converge if the global error goes to zero when the (maximum) time step goes to zero. For Euler's method, we have the following theorem. For simplicity in exposition, we assume a uniform time step h.

Theorem 5 (First order convergence and error growth) Assume that Euler's method (3.3) is applied to an IVP which satisfies the conditions of Theorem 4, with a uniform time step h = T/n. Let L > 0 be the associated Lipschitz constant and M2 = max_{[0,T]} |d^2 x(t)/dt^2|. Then, we have the following upper bound for the global error:

|eh,i| ≤ (1/2)(M2 h / L)(e^{L ti} − 1),                             (3.7)

for all discrete times ti = ih, i = 1, 2, · · · , n. Consequently, we have max_{i=1,2,··· ,n} |eh,i| −→ 0 when h −→ 0.

Before proceeding to the proof, we note that this theorem mainly states that i) for t = ih fixed, the global error e(t) decreases linearly with h (convergence with respect to h, uniformly on [0, T]) and ii) for fixed h, the error increases exponentially with t (error growth). Both constants L and M2 in (3.7) play a role in dictating how fast or how slow the convergence and the error growth are.

Proof: Using the definitions of local and global errors given above, we have

ei+1 = x(ti+1) − xi+1 = x(ti+1) − x̄i+1 + x̄i+1 − xi+1 = ēi+1 + x̄i+1 − xi+1
     = ēi+1 + x(ti) + h f(x(ti), ti) − [xi + h f(xi, ti)] = ēi+1 + ei + h[f(x(ti), ti) − f(xi, ti)].

By the triangle inequality and the fact that f is Lipschitz, we arrive at

|ei+1| ≤ |ēi+1| + |ei| + hL|x(ti) − xi| = |ēi+1| + |ei|(1 + Lh).

4 Mk can be written explicitly in terms of partial derivatives of f(x, t), and in principle an upper bound can be found in terms of the variations of f without invoking the solution x(t).


Thus by iterating, we get

|ei| ≤ |ēi| + |ei−1|(1 + Lh) ≤ |ēi| + [|ēi−1| + |ei−2|(1 + Lh)](1 + Lh)
     ≤ |ēi| + |ēi−1|(1 + Lh) + · · · + |ē1|(1 + hL)^{i−1} + |e0|(1 + hL)^i.

We now use the fact that e0 = 0 and that |ēi| ≤ h^2 M2/2. This leads to the geometric sum

|ei| ≤ (1/2) h^2 M2 [ 1 + (1 + hL) + (1 + hL)^2 + · · · + (1 + hL)^{i−1} ] = (1/2)(h M2 / L)[ (1 + hL)^i − 1 ].

The theorem then follows by remarking that (1 + hL)^i = exp[i ln(1 + hL)] ≤ e^{ihL}.

3.2.2 Round off errors

In practice, when a differential equation is "solved" on a computer, there are two types of errors involved: the discretization error, due to the approximation of the differential equation by the numerical scheme, and the round off error, due to the use of floating-point arithmetic. Here we consider the combined effect of these errors on the IVP solution for the case of Euler's method.

Let x̃i, i = 0, 1, 2, · · · , n, be the actual computed solution, containing both round off and discretization errors. We have

x̃i+1 = x̃i + h[f(x̃i, ti) + εi] + ηi,   i = 0, 1, · · · , n − 1,       (3.8)

where εi and ηi are, respectively, the errors accumulated during the evaluation of f and during the subsequent operations (multiplication by h and summation with x̃i), at step i. Let

Ei+1 = x(ti+1) − x̃i+1

be the associated actual global error. We have

Ei+1 = ei+1 + xi+1 − x̃i+1.

Ei is essentially the global error ei plus an error

ẽi = xi − x̃i,

which is due solely to round off errors.

Set ε = maxi |εi| and η = maxi |ηi|. We have

ẽi+1 = xi − x̃i + h[f(xi, ti) − f(x̃i, ti)] − hεi − ηi.

Thus,

|ẽi| ≤ |ẽi−1|(1 + hL) + (hε + η) ≤ · · · ≤ (hε + η)[1 + (1 + hL) + · · · + (1 + hL)^{i−1}],

where we again assumed that ẽ0 = 0. Consequently, we obtain

|ẽi| ≤ ((hε + η)/(hL)) (e^{L ti} − 1).


Adding the global error from Theorem 5 yields

|Ei| ≤ (e^{L ti} − 1) [ ε/L + η/(hL) + M2 h/(2L) ].

The right hand side of this inequality has three terms. The first two terms, involving ε and η, are due to round off errors, while the last term (involving M2) is due to the accumulated truncation errors. This suggests that when the time step is relatively large compared to the max round off error (max(ε, η)), the discretization error dominates, but when h is too small (on the order of η/L), the round off error takes over. While the discretization error decreases linearly with h, the round off error increases as 1/h. This, in fact, can lead to catastrophic results if h is chosen to be too small for the given computer hardware or for the given problem. An optimal step size h0, which should not be exceeded, can be obtained by minimizing the upper bound of Ei. We have

h0 = sqrt( 2η/(M2 L) ).

Notice that h0 depends on both the maximum round off error (of the hardware) and the constants M2 and L of the problem at hand. A sketch of the combined global error as a function of h is given in Figure 3.4.
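As a purely illustrative example (the numbers below are hypothetical and not taken from the text), with a round off level η ≈ 10^{-16} typical of double precision and with M2 = 1 and L = 1, the formula gives h0 ≈ sqrt(2 × 10^{-16}) ≈ 1.4 × 10^{-8}: decreasing h below a threshold of this order would increase, rather than decrease, the combined error bound.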

3.3 Higher order methods

As we saw in the previous section, while Euler's method is conceptually simple and easy to code, it is only linearly convergent; to decrease the error by one order of magnitude, we need to decrease h by the same amount. So if we want a very small error, a similarly small h is required. An excessively small h, however, is both prohibitive in terms of computational efficiency and, as we have just seen, subject to a catastrophic proliferation of round off errors, which will prevent convergence. To overcome this problem, we need to use higher order methods, i.e., methods whose global error goes to zero as h^p, p ≥ 2; ei = O(h^p). Euler's method is only first order, i.e., p = 1. With an order p method, reducing the time step by a factor of 2 reduces the error by a factor of 2^p.

In this section, we will discuss two families of high order methods: one family of one-step methods and one family of multi-step methods. One-step methods use only the solution at time t to advance to the next time step t + h, while multi-step methods use the solution at previous steps, t, t − h, t − 2h, · · · , t − kh, for some fixed k = 1, 2, 3, · · · , to advance to the next step t + h. Namely, we will consider the family of explicit Runge-Kutta methods as an example of one-step methods and the family of Adams methods as an example of multi-step methods.

3.3.1 One step methods: Runge-Kutta methods

A one-step method for IVPs takes the form

xi+1 = xi + hΦh(xi, xi+1, ti).

A one-step method is said to be explicit if Φh is independent of xi+1; otherwise it is said to be implicit. The main difficulty with implicit methods resides in the fact that they may lead to solving a


nonlinear equation (or system) or a large linear system at each time step. However, as we will see in the next chapter, they have one major advantage, especially when dealing with stiff equations, because of their intrinsic stability features.

A celebrated family of one-step methods, known as Runge-Kutta (RK) methods, relies on a certain number of intermediate steps or stages to achieve high order accuracy; they are often called multi-stage methods. They are often written in the form

xi,l = xi + h ∑_{j=1}^{l} αlj kj,   ti,l = ti + γl h,   kl = f(xi,l, ti,l),   l = 1, 2, · · · , r,

xi+1 = xi + h ∑_{l=1}^{r} βl kl,   i = 0, 1, 2, · · · , n − 1.          (3.9)

Here, the coefficients αlj, βl, γl are the main parameters which determine the accuracy of the method, while the integer r ≥ 1 determines the number of stages. First, the coefficients must satisfy the following consistency conditions.

γl = ∑_{j=1}^{l} αlj,   l = 1, 2, · · · , r,   and   ∑_{l=1}^{r} βl = 1.     (3.10)

In a nutshell, these conditions guarantee that the cumulative steps made in the t and x directions are consistent with each other, and that the βl sum to one so that when f = c (constant), the numerical scheme yields the exact linear solution x(t) = x0 + ct. They are often arranged in a matrix-vector form

A = (αlj)1≤j≤l≤r, b = (βl)1≤l≤r, c = (γl)1≤l≤r

and stored in a (Butcher) tableau:

   c | A
     | b^T .

Implicit versus explicit RK methods: If the diagonal elements of A are all zero, αll = 0, l = 1, · · · , r, then the one-step method is explicit, otherwise it is implicit. For the moment, we consider explicit methods only. Thus the first expression in (3.9) reduces to

xi,l = xi + h ∑_{j=1}^{l−1} αlj kj.

An important sub-family of (explicit) two-stage RK methods takes the simple form

k1 = fi ≡ f(xi, ti),

k2 = f(xi + αhfi, ti + γh)

xi+1 = xi + h [β1k1 + β2k2] , i = 0, 1, 2, · · · , n − 1. (3.11)

When β1 = 1, β2 = 0, this reduces to Euler's method. Below, we derive necessary and sufficient conditions on the coefficients (including the compatibility conditions in (3.10)) in order for the two-stage


methods (3.11) to be second order. Next, we give three well known examples of second order RK methods. We will see later that these three choices satisfy the second order accuracy requirements (necessary and sufficient conditions).

1. Mid-point or improved Euler’s method:

xi+1 = xi + hf(xi + 0.5hfi, ti + 0.5h). (3.12)

(β1 = 0, β2 = 1, α = γ = 0.5.)

2. Modified Euler:

   xi+1 = xi + (h/2)[fi + f(xi + hfi, ti + h)].                    (3.13)

   (β1 = 0.5, β2 = 0.5, α = γ = 1.)

3. Heun's method:

   xi+1 = xi + (h/4)[fi + 3 f(xi + (2/3)hfi, ti + (2/3)h)].        (3.14)

   (β1 = 1/4, β2 = 3/4, α = γ = 2/3.)

Note that the two constraints in (3.10) are readily satisfied. Another famous Runge-Kutta method is the following 4-stage (RK4) method:

k1 = f(xi, ti),   k2 = f(xi + 0.5hk1, ti + 0.5h),
k3 = f(xi + 0.5hk2, ti + 0.5h),   k4 = f(xi + hk3, ti + h),

xi+1 = xi + (h/6)(k1 + 2k2 + 2k3 + k4).                            (3.15)

The Butcher tableaux for these four methods are as follows.

Midpoint               Modified Euler          Heun
  0  | 0    0            0  | 0    0             0   | 0     0
 0.5 | 0.5  0            1  | 1    0            2/3  | 2/3   0
     | 0    1               | 0.5  0.5               | 1/4   3/4

RK4
  0  | 0    0    0    0
 0.5 | 0.5  0    0    0
 0.5 | 0    0.5  0    0
  1  | 0    0    1    0
     | 1/6  1/3  1/3  1/6

The implementation of the four examples of Runge-Kutta methods listed above is left as an exercise for the student. This can be easily achieved by using the Matlab code in Table 3.1 as a template. In Figure 3.7, we show the numerical results when both the midpoint (3.12) and the 4-stage (3.15) methods are applied to the example of Figure 3.2. Notice that while with Euler's method the error is essentially divided by a factor of 2 when the time step is halved, for the mid-point and the


4th order RK methods, the error is reduced by roughly a factor of 4 and 16, respectively. This factor (which we may call the effective order of convergence) can be estimated from the data as follows. Let Eh(t) be the numerical error associated with a given method. Assume that Eh(t) = O(h^p). Then for two time steps h1, h2, we have

Eh1/Eh2 ≈ h1^p/h2^p   =⇒   p = ln(Eh1/Eh2) / ln(h1/h2).

With h1 = h, h2 = h/2, this simplifies to p = log2(Eh1/Eh2). For the three examples discussed above, we have at time t = 2:

Eh(t) at t = 2

  h       Euler     Midpoint   RK4
  0.1     4.0651    0.3101     0.6165e-03
  0.05    2.3347    0.0842     0.4186e-04
  p       0.8001    1.8813     3.8802

We note that the estimated p takes roughly the values 0.8, 1.88, 3.88. Recall that Euler's method has an established linear convergence. Thus, the expected p is unity. However, due to the nature of Taylor approximation, the value p = 1 is guaranteed only in the asymptotic limit when h −→ 0. In fact, if we keep halving the time step, we get the sequence 0.8001, 0.8982, 0.9483, · · · as the corresponding estimations of p for Euler's method. This suggests a rather slow but convincing convergence behaviour to the value p = 1. It is left as an exercise for the reader to establish that the corresponding sequences for the mid-point and RK4 methods converge to 2 and 4, respectively, suggesting a second order and a fourth order convergence, respectively. In fact, we have the following result.

Theorem 6 The two-stage methods in (3.12), (3.13), (3.14) are second order accurate while the four stage method in (3.15) is fourth order accurate, i.e., they exhibit a global error, ei = x(ti) − xi, which converges to zero as O(h^2) and O(h^4), respectively.

This theorem is a consequence of a more general convergence result for one-step methods which is presented below. But before doing so, let us first return to the predator-prey system (3.2) to illustrate the performance of these high order methods, as we did for Euler's method in Figure 3.3. The results of the second and fourth order RK methods in (3.12) and (3.15) are displayed in Figures 3.5 and 3.6, respectively. Note that the solutions do not diverge even for h = 1, as all the solutions display the expected oscillatory behaviour. A nicely closed trajectory is found for h = 0.5 and smaller when the RK2 method is used and for h = 1 and smaller when the RK4 method is used.

With the RK2 method, there is a visible increase in the magnitude of the oscillations when h = 1, suggesting instability of the equilibrium point. This is the same qualitative behaviour displayed by Euler's method in Figure 3.3 when h = 0.5.

We now return to the formal issue of convergence of one-step methods. We begin by introducing the notion of local truncation error and the concept of consistency. For convenience, we rewrite the


explicit one-step method as

xi+1 = xi + hΨh(xi, ti).

Then, as for Euler’s method, the one-step (local) error is given by

ēi = x(ti+1) − [x(ti) + hΨh(x(ti), ti)]

and the truncation error is

τ_h^i = ēi/h = (x(ti+1) − x(ti))/h − Ψh(x(ti), ti) = [ (x(ti+1) − x(ti))/h − ẋ(ti) ] − [ Ψh(x(ti), ti) − f(x(ti), ti) ].

Here we used the fact that ẋ(ti) − f(x(ti), ti) = 0. In other words, the truncation error is the error between the differential equation and its approximate version, represented by the numerical scheme.

Definition 7 (Consistency) A one-step scheme is said to be consistent of order p if its truncation error satisfies

τ_h^i = O(h^p).

It is said to be (simply) consistent if lim_{h−→0} τ_h^i = 0.

Note that if a numerical scheme is consistent of order p, then the local error satisfies ēi = O(h^{p+1}), and if it is (simply) consistent then ēi = o(h).⁵

Theorem 7 (Convergence of one-step methods) If a one-step method is consistent of order p and the functional Ψh is Lipschitz with respect to xi, then its global error ei = x(ti) − xi converges to zero when h −→ 0 as O(h^p).

The proof is a simple generalization of Theorem 5 for the convergence of Euler's method. It is thus left as an exercise for the reader.

Remark: We note that to establish the consistency requirement with p ≥ 2, we often need to use the fact that f is differentiable (to a certain order) with respect to x, which renders the Lipschitz condition in the above theorem redundant.

According to this theorem, to show that a given multi-stage Runge-Kutta method converges and has an order of accuracy p, it suffices to show that it is consistent of order p. We note that the Lipschitz condition for Ψh is guaranteed in this case provided the IVP itself satisfies the Lipschitz condition (see Theorem 4).

We now apply this strategy to establish convergence for the family of two-stage methods in (3.11). The same procedure can be applied for the 4-stage method above. We start by Taylor expanding the two-variable function f(xi + αhfi, ti + γh) in (3.11). We have, with xi = x(ti) and fi = f(xi, ti),

xi+1 = xi + β1 h fi + β2 h [ fi + γh ∂fi/∂t + αh fi ∂fi/∂x ] + O(h^3)
     = xi + (β1 + β2) h fi + γβ2 h^2 ∂fi/∂t + β2 α h^2 fi ∂fi/∂x + O(h^3).

5 f(x) = o(g(x)) when x −→ a if lim_{x−→a} f(x)/g(x) = 0; f is said to be a little o of g.


Similarly, when we Taylor expand the solution x(t), we get

x(ti + h) = xi + h (dxi/dt) + (h^2/2)(d^2 xi/dt^2) + O(h^3).

Recall that dx(ti)/dt = f(x(ti), ti). Then,

x(ti + h) = xi + h fi + (h^2/2) d/dt[f(x(ti), ti)] + O(h^3).

We apply the chain rule to the remaining derivative to obtain (using again the relation ẋ(ti) = f(x(ti), ti))

x(ti + h) = xi + h fi + (h^2/2)[ ∂fi/∂t + fi ∂fi/∂x ] + O(h^3).

By combining the two results, we have

ei+1 := x(ti + h) − xi+1 = h fi (1 − β1 − β2) + h^2 (1/2 − γβ2) ∂fi/∂t + h^2 (1/2 − αβ2) fi ∂fi/∂x + O(h^3) = O(h^3),

if the following conditions are satisfied

β1 + β2 = 1,   and   β2 α = β2 γ = 1/2.

The first condition guarantees that the scheme is first order, while the additional two ensure second order accuracy by eliminating the O(h^2) terms from the expansion of ei+1. Note that the anticipated conditions (3.10) are indeed satisfied here and, more importantly, the three methods in (3.12) to (3.14) are obtained after making three particular choices of the coefficients β1, β2, α and γ.

Advantages and disadvantages of RK methods

We close the section by enumerating some apparent advantages and disadvantages of Runge-Kutta methods. On the advantage side, we have

• high order accuracy,

• explicit,

• easy to code, and

• low storage.

However, the main disadvantage of RK methods resides in the fact that they are multi-stage, meaning that

• multiple evaluations

of the function f are required at each time step. This can be highly prohibitive computationally, especially if we are dealing with a high dimensional system or, even worse, a partial differential equation.


3.3.2 Multistep methods

One good way to construct inexpensive high order methods that avoid multiple evaluations of the function f at each time step is through multi-stepping, i.e., using the solution at several previous steps to advance to the next time step. Multi-step methods take the form

xi+1 = xi + hΦh(xi−k, xi−k+1, · · · , xi, xi+1, ti), (3.16)

where k ≥ 0 defines the level of the multi-stepping. This formula reduces to a one-step method when k = 0. Again, if Φh is independent of xi+1, then the method is explicit, otherwise it is implicit.

An important class of multi-step methods is represented by the family of Adams methods. They are sub-divided into the sub-class of Adams-Moulton methods, which are implicit, and that of Adams-Bashforth methods, which are explicit. Adams methods are easily derived using polynomial interpolation for the derivative ẋ(t), based on previous steps as interpolation points. Adams methods assume a uniform time step.

Consider the differential equation ẋ = f(x, t) and let h = T/n and ti = ih be a uniform discretization of [0, T]. Assume that a sequence of approximate values x0, x1, · · · , xi, i ≥ k, is known. Then

xi+1 = xi + ∫_{ti}^{ti+1} ẋ(t) dt := xi + ∫_{ti}^{ti+1} f(x(t), t) dt.

Let Pk be the interpolation polynomial of ẋ(t) using the data points

(ti−k, ẋi−k), (ti−k+1, ẋi−k+1), · · · , (ti, ẋi), (ti+1, ẋi+1),

where ẋj ≈ f(xj, tj) := fj. Note that, because of the approximate nature of the xj, the values ẋj are known only approximately. Adams methods are then obtained by further approximating the integral above by the corresponding integral of the interpolation polynomial. In other words, for Adams methods, the quantity hΦh in (3.16) is simply the corresponding Newton-Cotes quadrature formula of the integral above, i.e.,

xi+1 = xi + ∫_{ti}^{ti+1} Pk(t) dt.                                 (3.17)

If ti+1 is included in the set of interpolation points, the method is implicit; otherwise it is explicit. The latter case relies on the extrapolation of the interpolation polynomial from [ti−k, ti] to [ti, ti+1].

Adams-Bashforth methods

Here we derive a few of the well known Adams-Bashforth examples, i.e., the case when ti+1 is not an interpolation point. We note that in this case, when k = 0 (with only one interpolation point), the corresponding interpolation polynomial is the constant P0(t) = fi and the numerical scheme reduces to Euler's method: xi+1 = xi + ∫_{ti}^{ti+1} fi dt = xi + hfi.

With k = 1, we have two interpolation points ti−1, ti, leading to the linear polynomial

P1(t) = fi−1 + ((fi − fi−1)/h)(t − ti−1).


Thus, the numerical scheme is given by

xi+1 = xi + ∫_{ti}^{ti+1} P1(t) dt = xi + (h/2)(3fi − fi−1).         (3.18)

This is the second order Adams-Bashforth method. Accordingly, for k = 2 and k = 3, we get, respectively, the 3rd and 4th order Adams-Bashforth methods:

xi+1 = xi + (h/12)(23fi − 16fi−1 + 5fi−2),                           (3.19)
xi+1 = xi + (h/24)(55fi − 59fi−1 + 37fi−2 − 9fi−3).                  (3.20)

The details of the derivation are left as an exercise for the reader.
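For illustration, a minimal Matlab sketch of the second order Adams-Bashforth scheme (3.18), applied to the test problem ẋ = 2x + t and started here, for simplicity, with a single Euler step (a more accurate starting step can also be used), is:

>> f=@(x,t) 2*x+t;
>> h=0.1; n=round(2/h); x=zeros(1,n+1); t=0:h:2;
>> x(1)=0;
>> x(2)=x(1)+h*f(x(1),t(1));                                 % starting value from one Euler step
>> for i=2:n
       x(i+1)=x(i)+h/2*(3*f(x(i),t(i))-f(x(i-1),t(i-1)));    % AB2 step (3.18)
   end
>> plot(t,x)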

Adams-Moulton methods

As mentioned above, Adams-Moulton methods are obtained from (3.17) when ti+1 is an interpolation point. Thus, for k = 0, we get the second order Adams-Moulton method,

xi+1 = xi + (h/2)(fi + fi+1),                                        (3.21)

which is also known as the trapezoidal method, for the obvious reason. When k = 2, we obtain the fourth order Adams-Moulton method

xi+1 = xi + (h/24)(9fi+1 + 19fi − 5fi−1 + fi−2).                     (3.22)

The reader may want to attempt the derivation of the 3rd order Adams-Moulton method corresponding to k = 1.

Truncation error and convergence of Adams methods

As is usually the case for quadrature formulae, the truncation error for Adams methods can be obtained in principle by integrating the interpolation error:

τh = (1/h)[x(ti+1) − xi+1] = (1/h) ∫_{ti}^{ti+1} [ẋ(t) − Pk(t)] dt
   = (1/h) ∫_{ti}^{ti+1} ∏_{j=i−k}^{i+l} (t − tj) (1/(k + l + 1)!) (d^{l+k+2} x(ξ(t)) / dt^{k+l+2}) dt,

where l = 0 if ti+1 is not an interpolation point and l = 1 if it is. Because the product ∏_{j=i−k}^{i+l} (t − tj) doesn't change sign in the interval [ti, ti+1] (it is ≥ 0 if l = 0 and ≤ 0 if l = 1), we have, by the mean value theorem for integrals⁶,

τh = (1/h) (1/(k + l + 1)!) (d^{l+k+2} x(ξ) / dt^{k+l+2}) ∫_{ti}^{ti+1} ∏_{j=i−k}^{i+l} (t − tj) dt.

6 Mean Value Theorem for Integrals: Let f, g be two integrable functions on [a, b] such that g(x) ≥ 0 for all x ∈ [a, b] (or g(x) ≤ 0 for all x ∈ [a, b]). Then there exists a point c ∈ [a, b] such that ∫_a^b f(x)g(x) dx = f(c) ∫_a^b g(x) dx.


For the second order Adams-Bashforth method (3.18) (k = 1, l = 0), we get

τh = (1/(2h)) (d^3 x(ξ)/dt^3) ∫_{ti}^{ti+1} (t − ti−1)(t − ti) dt = (1/(2h)) (d^3 x(ξ)/dt^3) ∫_0^h s(s + h) ds = (5h^2/12) (d^3 x(ξ)/dt^3) = O(h^2),

which shows that this method is indeed second order.

Provided that f is Lipschitz, it is an easy exercise to adapt the convergence theorem for one-step methods to the family of Adams methods. At least in the case of explicit methods, this is achieved by establishing the fact that, under these conditions, the functional Φh is Lipschitz with respect to the arguments xi, xi−1, · · · , and going from there. The case of implicit methods is a bit tricky because it requires some careful algebraic manipulations. Nonetheless, in theory we have that all Adams methods converge with an order of convergence given by the number of interpolation points, provided the function f(x, t) is smooth enough with respect to x and t. We have for Adams methods

|x(ti) − xi| = O(h^{k+l+1}).

The leap-frog method

The leap-frog method

xi+1 = xi−1 + 2hfi,                                                  (3.23)

is an example of a multi-step method which is NOT part of the family of Adams methods. However, it can be derived by a similar procedure. In fact, this scheme is obtained by assuming the approximation f(x(t), t) ≈ fi and integrating over the interval [ti−1, ti+1]. Despite some pathologies of the leap-frog method, it is widely used, especially in climate modelling, because it is a multi-step method (efficient) which is both second order accurate and has a low-storage requirement (fi−k, k ≥ 1, are not needed at time step i + 1).

Following the error analysis done above for Adams methods, the truncation error for the leap-frog method can be written as

τh = (1/h) ∫_{ti−1}^{ti+1} (t − ti) (d^2 x(ξ(t))/dt^2) dt.

However, because the polynomial part of the interpolation error, namely (t − ti), changes sign over the integration interval, the mean value theorem cannot be used to simplify the integral. Nonetheless, the truncation error can be estimated by means of Taylor expansion as was done for Euler and RK methods. We have

τh = (1/h)[x(ti+1) − xi+1] = (1/h)[x(ti + h) − x(ti − h) − 2hẋ(ti)],

x(ti + h) = x(ti) + hẋ(ti) + (1/2)h^2 (d^2 x(ti)/dt^2) + (1/6)h^3 (d^3 x(ξ1)/dt^3),
x(ti − h) = x(ti) − hẋ(ti) + (1/2)h^2 (d^2 x(ti)/dt^2) − (1/6)h^3 (d^3 x(ξ2)/dt^3).

Combining the three expressions yields

τh = (h^2/3) (x′′′(ξ1) + x′′′(ξ2))/2 = O(h^2),

which shows that the leap-frog method is second order accurate.


3.3.3 Dealing with implicit methods

As will be discussed in the next chapter, the use of implicit methods is almost a requirement when dealing with stiff differential equations, because implicit methods are the only methods which can exhibit the nice property of A-stability. However, their implementation in practice is problematic because they may involve solving non-linear equations or large linear or nonlinear systems. While the particular treatment of this issue is problem- and method-dependent, here we illustrate how such methods could be constructed on an ad hoc basis.

The simplest example of an implicit method is the backward-Euler:

xi+1 = xi + hf(xi+1, ti+1).

It can be thought of as a first order Adams-Moulton method. If the function g(x) = x − hf(x) has a known inverse function g^{−1}(x), then we can set xi+1 = g^{−1}(xi) and the problem is solved. However, such an inverse is not easy to find, especially for systems, and the inverse may not be unique. In such a case, choosing the right solution xi+1 can be tricky. One may need to resort to the physics of the problem to choose the right solution.

Often, a solution xi+1 can be found by means of successive iterations. If, for example, we were given the implicit method

xi+1 = Φh(xi−k, · · · , xi, xi+1, ti),

then, starting with an initial guess, x^0_{i+1} = xi for example, we can carry out the iterations

x^{m+1}_{i+1} = Φh(xi−k, · · · , xi, x^m_{i+1}, ti),   m = 0, 1, 2, · · · ,

until convergence. However, such fixed-point iterations may or may not converge depending on the differential equation (or system) and the method at hand. In many cases a more sophisticated root-finding method such as Newton-Raphson may be needed.
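As an illustration, a minimal Matlab sketch of such fixed-point iterations for the backward Euler scheme applied to the autonomous test equation ẋ = −2x + 1 (our own choice of test equation, tolerance and iteration cap) is:

>> f=@(x) -2*x+1;                      % test equation x' = -2x + 1
>> h=0.1; n=20; x=zeros(1,n+1); x(1)=1;
>> for i=1:n
       z=x(i);                         % initial guess: the previous value
       for m=1:50                      % fixed-point iterations for x_{i+1} = x_i + h f(x_{i+1})
           znew=x(i)+h*f(z);
           if abs(znew-z)<1e-12, break, end
           z=znew;
       end
       x(i+1)=znew;
   end

Here the iteration converges because |h f'(x)| = 0.2 < 1; for stiffer problems or larger h a Newton-type solver would be preferable.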

Another attractive approach consists of using an explicit method to first "predict" an a priori estimate for the solution xi+1. Such methods, often called "predictor-corrector" methods, are themselves explicit methods and therefore may lose some of the good stability features of implicit methods. If, for example, we use Euler's method as a predictor for the 2nd order Adams-Moulton, we get

x∗ = xi + hfi,   xi+1 = xi + (h/2)[fi + f(x∗, ti+1)].

This is indeed the 2nd order Runge-Kutta method in (3.13). Combining the 2nd order Adams-Bashforth and the 2nd order Adams-Moulton methods yields

x∗ = xi + (h/2)(3fi − fi−1),   xi+1 = xi + (h/2)[fi + f(x∗, ti+1)].


Figure 3.2: Performance of Euler's method. (a) Exact and numerical solutions for h = 0.1 and h = 0.05; (b) the corresponding global errors, illustrating the linear convergence of Euler's method.


Figure 3.3: Performance of Euler's method on the practical example of the predator-prey equations: α = 0.25, β = 0.01, γ = 1, δ = 0.01, x0 = 80, y0 = 30. Each row shows the prey and predator populations versus time and the corresponding phase-space trajectory, for h = 1, 0.5, 0.01 and 0.001.

Figure 3.4: Behaviour of the combined round off and discretization error Eh for Euler's method as a function of h: the discretization error dominates for h > h0, while the round off error dominates for h < h0.


Figure 3.5: Performance of the RK2 (mid-point) method on the practical example of the predator-prey equations: α = 0.25, β = 0.01, γ = 1, δ = 0.01, x0 = 80, y0 = 30. Each row shows the populations versus time and the phase-space trajectory, for h = 1, 0.5, 0.01 and 0.001.


Figure 3.6: Performance of the RK4 method on the practical example of the predator-prey equations: α = 0.25, β = 0.01, γ = 1, δ = 0.01, x0 = 80, y0 = 30. Each row shows the populations versus time and the phase-space trajectory, for h = 1, 0.5, 0.01 and 0.001.


Figure 3.7: Performance of the midpoint and the 4-stage Runge-Kutta methods. The Euler solution is also shown for reference. Top: exact and numerical solutions (Euler, mid-point, RK4). Middle: quadratic convergence of the mid-point method (h = 0.1 and h = 0.05). Bottom: 4th order convergence of the RK4 method (h = 0.1 and h = 0.05).


3.4 Problems

1. Solve the differential equation

ẋ = 2x + t,   x(0) = 0,   0 ≤ t ≤ 2,

using the second order mid-point (3.12) and the 4th order (3.15) RK methods with step sizes h = 0.1, h = 0.05, h = 0.025, h = 0.0125. Write down the exact solution and compute the errors corresponding to each method and each time step size (value of h) at time t = 2, and group them in a table (as done in the text). Find the actual or effective order of convergence associated with each halving of the time step and report your results in the error table. Compare your results to those reported in the text and conclude.

2. Consider the IVP ẋ = xt, 0 ≤ t ≤ T, x(0) = 2. Use the first order Euler method to solve this IVP with the following sequence of step sizes: hk = 2^{−k} T, k = 2, 3, 4, · · · , 30 (all integers between 2 and 30, inclusively). For each step size, find the error ek between the exact solution and the numerical solution at the final time T = 0.0001. Note that for the smallest step size h = 2^{−30} T, the number of steps is n = 2^{30} and consequently the simulation may take several minutes on your computer. Plot the error ek as a function of hk. What do you observe? Do you see convergence as h −→ 0? Explain.

3. Use both the leap-frog and the second order Adams-Bashforth methods to solve the predator-prey system. Repeat the experiments in Figures 3.3, 3.5, 3.6. What do you see? Use the second order RK method to start the multistep methods.

4. Derive the 3rd and 4th order Adams-Bashforth methods.

5. Derive the 3rd order Adams-Moulton method.

6. Consider the system of equations

d/dt [ x ]   [ −1000     1    ] [ x ]
     [ y ] = [    0    −1/10  ] [ y ] ,        x(0) = 1,   y(0) = 2.

Part I: Numerics

a) Try using the fourth order Runge-Kutta method (see problem 1) to solve this system of equations, integrating out to t = 1. What size time step is necessary to achieve a reasonably accurate approximate solution? (The true solution is x(t) = e^{−1000t}(9979/9999) + e^{−t/10}(20/9999), y(t) = 2e^{−t/10}.) Turn in a plot of x(t) and y(t) that shows what happens if you choose the time step too large. Also turn in a plot of the computed x(t) and y(t) once you have found a good size time step. For both cases, plot also the exact solution on the same graph for comparison.

b) Try solving this system of ODE's with the 2nd order Adams-Moulton, also known as the trapezoidal method. Note that the matrix is triangular, so the associated linear system is easy to solve analytically. Now what size time step do you need to obtain a reasonably accurate approximate solution? Can you explain why the second-order trapezoidal method is better than the fourth-order Runge-Kutta method for this problem?


c) Matlab built in ODE solvers: Now create an M-file to store the ODE and use the Matlab routines ode45 and ode23s to solve the given system. Your M-file can look something like this:

%%%%%M-file: myodefunction.m

function yp=myodefunction(t,y)

%yp(1)= -1000*y(1) +y(2);

%yp(2)= -y(2)/10;

yp=[-1000*y(1) +y(2);-y(2)/10];

then execute

>>[t,y]=ode45(@myodefunction, [0 1], [1 2])

and

>>[t,y]=ode23s(@myodefunction, [0 1], [1 2])

respectively.

Plot the solution x(t), y(t) obtained at each time and observe that it is well resolved by Matlab (use plot(t,y(:,1)), plot(t,y(:,2))). Count the number of steps used in each case by measuring the size of the returned vector t (use the command size(t)). Compare the two numbers. What do you see? Can you say why? What can you conclude about the system?

Part II: Analysis.

For simplicity, now imagine we are using Euler's method to solve this system. This leads us to the following difference equations:

[ xn+1 ]   [ xn ]       [ −1000     1    ] [ xn ]
[ yn+1 ] = [ yn ] + h  [    0    −1/10  ] [ yn ]

or, in matrix notation,

Xn+1 = Xn + hAXn,

where

X = [ x ]          A = [ −1000     1    ]
    [ y ]   and        [    0    −1/10  ]

and h is the step size. Show that the solution to this difference system is given by

Xn = (hA + I)^n X0,

where X0 = X(0) is given. Deduce the value of h for which the above solution will give meaningful results (e.g. a solution that doesn't blow up for large n). Compare with your prediction in part I.

7. Consider the ODE

ẋ = x sin(1/x) if x ≠ 0 and ẋ = 0 if x = 0,   x(0) = 0.001.

a. Solve this equation using the Matlab functions ode45, ode23, ode23s, and ode15s. Use a t-span array t = 0.0 : 0.001 : 1, e.g. >>[t,x45]=ode45(@odefct,0:.001:1,0.001);. Plot the solutions on top of each other and then conclude.
b. Now use ode45 alone to solve the same problem when x0 = 0.0015 and when x0 = 0.00155. Plot the solutions on the same graph. What do you see?


8. Prove Theorem 7.

9. Show that the RK4 method in (3.15) is indeed 4th order accurate.


Chapter 4

Stability, convergence of multistep methods, and stiff equations

4.1 Introduction

Here we discuss the convergence and stability properties for a family of multistep and one-step methods and their performance in practice, especially when solving the so-called stiff equations. Recall that for an explicit one-step method,

xn+1 = xn + hΦh(tn, xn),

we have convergence, i.e., the global error x(tn) − xn goes to zero when the time step h goes to zero, whenever the method is consistent of order p, i.e., the truncation error goes to zero when h goes to zero at a rate

τh = O(h^p),

provided the functional Φh is Lipschitz with respect to xn. The order of convergence of the global error is the same as that of the truncation error, i.e., O(h^p). The same result can be obtained for implicit one-step methods as well. Here we will investigate the case of linear multistep methods such as Adams methods.

4.2 Linear multistep-methods

A linear multistep-method (LMM, for short) of order r is of the form

xn+1 = ∑_{l=0}^{r} αl xn−l + h ∑_{l=−1}^{r} βl fn−l,

where the αl's and βl's are some real valued coefficients. Note that if β−1 = 0, then the method is explicit, otherwise it is implicit. Also, under the requirement that the method be exact for constants when f ≡ 0 (zero-order accuracy), we need ∑_{l=0}^{r} αl = 1. Below we assume that this condition is always satisfied.

Examples

1. Forward Euler: xk+1 = xk + hf(xk, tk), explicit, one-step.

2. Backward Euler: xk+1 = xk + hf(xk+1, tk+1), implicit, one-step.

3. Adams methods:

\[ x_{k+1} = x_k + h \sum_{l=-1}^{r} \beta_l f_{k-l}, \]
where r determines the number of steps. Their explicit versions (β_{-1} = 0) are known as Adams-Bashforth methods and the implicit versions are known as Adams-Moulton methods. Recall that Adams methods are based on the integration of the interpolation polynomial of the function f(x(t), t), corresponding to the interpolation points t_{k-r}, ..., t_k, t_{k+1}, over the interval [t_k, t_{k+1}]. Set f_k = f(x(t_k), t_k). For example, a linear interpolation using the points t_k, t_{k+1},
\[ p_1(t) = \frac{f_{k+1} - f_k}{h}(t - t_k) + f_k, \]
yields
\[ \int_{t_k}^{t_{k+1}} x'(t)\,dt \approx \int_{t_k}^{t_{k+1}} \left[ \frac{f_{k+1} - f_k}{h}(t - t_k) + f_k \right] dt = \frac{1}{2}(f_{k+1} - f_k)h + f_k h = \frac{h}{2}(f_{k+1} + f_k), \]
or
\[ x(t_{k+1}) - x(t_k) \approx \frac{h}{2}(f_{k+1} + f_k), \]

which leads to the second order Adams-Moulton scheme:
\[ x_{k+1} = x_k + \frac{h}{2}(f_{k+1} + f_k), \]
where we used the "approximation" f_k = f(x_k, t_k). The second order Adams-Moulton method is also known as the trapezoidal method.

On the other hand, a linear interpolation using the points t_{k-1}, t_k yields the second order Adams-Bashforth method
\[ x_{k+1} = x_k + \frac{h}{2}\left[ 3 f_k - f_{k-1} \right]. \]

4. Leap-frog method: the leap-frog method is obtained by integrating the zeroth order interpolation polynomial f(x(t), t) ≈ f(x(t_k), t_k) over the interval [t_{k-1}, t_{k+1}]. This yields
\[ x_{k+1} = x_{k-1} + 2h f_k. \]
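For concreteness, here is a minimal Matlab sketch of the second order Adams-Bashforth scheme above; the test problem x' = -x, x(0) = 1 and the forward Euler starting step are assumptions made only for illustration.

% Second order Adams-Bashforth on the test problem x' = -x, x(0) = 1.
f = @(t,x) -x;                          % assumed right-hand side
h = 0.1; T = 5;
t = 0:h:T; N = numel(t);
x = zeros(1,N); x(1) = 1;
x(2) = x(1) + h*f(t(1),x(1));           % starting value from one forward Euler step
for n = 2:N-1
    x(n+1) = x(n) + h/2*( 3*f(t(n),x(n)) - f(t(n-1),x(n-1)) );
end
plot(t, x, 'o', t, exp(-t), '-')        % compare with the exact solution
legend('AB2','exact')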

Exercise: Derive the second order Adams-Bashforth and the leap-frog schemes.

Exercise: Derive the third order Adams-Bashforth and Adams-Moulton methods.

Exercise: View the forward and the backward Euler methods as Adams methods. Work out the formal derivation.

4.2.1 Truncation error

Assuming that the solution x_{n-l}, l = 0, 1, ..., r, is known exactly at all the previous steps prior to x_{n+1}, the truncation error is given by
\[ \tau_h^n \equiv \frac{1}{h}\left[ x(t_n + h) - x_{n+1} \right] = \frac{1}{h}\left[ x(t_n + h) - \left( \sum_{l=0}^{r} \alpha_l x_{n-l} + h \sum_{l=-1}^{r} \beta_l f_{n-l} \right) \right]. \]

Examples: It is easy to show, for example, that for the (so-called) 2nd order Adams-Bashforth method
\[ x_{n+1} = x_n + \frac{h}{2}\left[ 3 f_n - f_{n-1} \right], \]
and for the mid-point (leap-frog) method,
\[ x_{n+1} = x_{n-1} + 2h f_n, \]
we have
\[ \tau_h = O(h^2). \]

The details are left as an exercise for the student (see previous chapter).

4.2.2 Difference equation

Consider the following initial value problem

x′ = −2x+ 1; x(0) = 1,

for which the exact solution is given by
\[ x(t) = \frac{1}{2} e^{-2t} + \frac{1}{2}. \]

Applying the mid-point (leap-frog) method to this problem leads to the difference equation
\[ x_{n+1} = x_{n-1} - 4h x_n + 2h, \tag{4.1} \]
assuming that two starting values, x_0, x_1, are provided. Typically, we have x_0 = x(0), given by the initial condition, and x_1 ≈ x(h) computed by a one-step, preferably highly accurate, method. But we require only that x_1 converges to x(0) when h → 0.

A solution to the difference equation above can be obtained in a similar fashion as for linear differential equations, i.e. the general solution is given as a linear combination of a linearly independent set of solutions to the homogeneous equation plus a particular solution:
\[ x_n = x_n^H + x_n^p, \]
where x_n^H solves the homogeneous part of the difference equation
\[ x_{n+1}^H = x_{n-1}^H - 4h x_n^H. \]
Assume x_n^H = ρ^n. This yields a quadratic equation for ρ (see Problem #8 of Chapter 1),
\[ \rho^2 + 4h\rho - 1 = 0, \]
whose solutions are
\[ \rho_+ = -2h + \sqrt{1 + 4h^2}, \qquad \rho_- = -2h - \sqrt{1 + 4h^2}, \]
and the general solution of the homogeneous equation is
\[ x_n^H = c_1 \rho_+^n + c_2 \rho_-^n. \]
Note that, for h > 0, |ρ_-| > 1 and |ρ_+| < 1, similarly to Problem #8 of Chapter 1. We can easily check that a particular solution is given by x_n^p = 1/2, and thus the solution to the difference equation is given by
\[ x_n = c_1 \rho_+^n + c_2 \rho_-^n + \frac{1}{2}, \]

where the constants c_1, c_2 are obtained from the starting values x_0 = 1, x_1 by solving the linear system
\[ x_1 - \tfrac{1}{2} = c_1 \rho_+ + c_2 \rho_-, \qquad x_0 - \tfrac{1}{2} = c_1 + c_2. \]
We obtain
\[ c_1 = \frac{1}{4} + \frac{x_1 - \tfrac{1}{2} + h}{2\sqrt{1 + 4h^2}}, \qquad c_2 = \frac{1}{4} - \frac{x_1 - \tfrac{1}{2} + h}{2\sqrt{1 + 4h^2}}. \]

Note that
\[ \rho_+^n = \left( -2h + \sqrt{1 + 4h^2} \right)^n = e^{\,n \ln\left(1 - 2h + 2h^2 + O(h^4)\right)} \approx e^{-2nh} = e^{-2 t_n}, \]
and similarly
\[ \rho_-^n = (-1)^n \left( 2h + \sqrt{1 + 4h^2} \right)^n \approx (-1)^n e^{2nh} = (-1)^n e^{2 t_n}. \]
Moreover, we have c_1 → 1/2 and c_2 → 0, provided x_1 → x(0) = 1 when h → 0, while (ρ_±)^n remain bounded on a fixed integration time interval t_n ∈ [0, T].

This, in theory, implies that
\[ x_n \approx \frac{1}{2} e^{-2hn} + \frac{1}{2} = \frac{1}{2} e^{-2 t_n} + \frac{1}{2} = x(t_n), \]

i.e. the method converges as h → 0 if lim_{h→0} |c_2 ρ_-^n| = 0. This follows directly from the assumption that x_1 → x(0) = 1 when h → 0 (which forces c_2 → 0), since |ρ_-^n| ≈ e^{2 t_n} ≤ e^{2T} remains bounded. But this is not guaranteed in practice. For non-zero h the term c_2 ρ_-^n introduces an undesirable error in the solution, which oscillates and grows exponentially with n, i.e. the computed solution is actually approximated by
\[ x_n \approx x(t_n) + c_2 \rho_-^n, \]
where the part c_2 ρ_-^n can be relatively large and oscillatory.

For finite but large n, the computed solution rapidly diverges from the targeted exact solution. The algorithm thus becomes "unstable" (in some sense to be clarified) if integrated over a very long time period, and this may prevent convergence in practice. The part of the solution associated with ρ_+ is called the physical mode, while the one associated with ρ_- is called the computational or spurious mode. However, because the leap-frog method is cheap (very efficient) to run, second order accurate, and has low storage requirements, it is widely used in practical applications (e.g. numerical weather prediction and climate modelling).

In practice, the spurious mode is easily filtered out. Because of its oscillatory nature, the spurious mode is easily eliminated by a filtering strategy (see the book of Dale Durran: Numerical methods for geophysical wave equations). This is how the leap-frog method is implemented in practice, i.e. by filtering the spurious mode. For all the reasons just mentioned, the midpoint method (which is known as leap-frog when applied to PDEs) is widely used in climate modeling, where the systems of differential equations are very large and are typically run for very long periods of time; both accuracy and efficiency are highly desirable.

Regardless of this numerical instability (when n is increased and h > 0 is fixed), the example above shows that the mid-point method does converge when h → 0, on all fixed time intervals [0, T], due to a "minimal stability" property, as we will see next.
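The following minimal Matlab sketch (an illustration, not part of the derivation above) reproduces this behaviour for the model problem x' = -2x + 1, x(0) = 1, using one forward Euler step to generate x_1; for a fixed h the oscillatory spurious mode eventually contaminates the computed solution.

% Leap-frog (mid-point) method for x' = -2x + 1, x(0) = 1.
f = @(x) -2*x + 1;
h = 0.05; T = 10;                  % fixed step, long integration interval
t = 0:h:T; N = numel(t);
x = zeros(1,N); x(1) = 1;
x(2) = x(1) + h*f(x(1));           % starting value from one forward Euler step
for n = 2:N-1
    x(n+1) = x(n-1) + 2*h*f(x(n)); % leap-frog update
end
exact = 0.5*exp(-2*t) + 0.5;
plot(t, x, t, exact, '--')
legend('leap-frog','exact')        % growing oscillations appear for large t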

4.3 Zero-stability and convergence of LMM’s

Again consider the LMM scheme
\[ x_{n+1} = \sum_{l=0}^{r} \alpha_l x_{n-l} + h \sum_{l=-1}^{r} \beta_l f_{n-l} \]
and isolate the homogeneous part
\[ x_{n+1} = \sum_{l=0}^{r} \alpha_l x_{n-l}. \]
Considering solutions of the form
\[ x_n = \rho^n \]
leads to the characteristic equation
\[ \rho^{r+1} - \alpha_0 \rho^{r} - \alpha_1 \rho^{r-1} - \cdots - \alpha_r = 0. \]
Note that because α_0 + α_1 + ... + α_r = 1, ρ_0 = 1 is always a solution of the characteristic equation associated with the homogeneous difference equation. Note also that ρ_0 = 1 is the only root associated with a one-step method, whose characteristic equation is simply ρ = 1.

Let ρ_0 = 1, ρ_1, ρ_2, ..., ρ_r be the r + 1, possibly complex, roots of the characteristic polynomial above.

Definition: (Zero-stability)

The LMM is said to be zero-stable if one of the following statements is true:

• all the roots of the characteristic polynomial satisfy |ρ_l| < 1 except for ρ_0 = 1; in this case the method is said to be strongly stable;

• all the roots satisfy |ρ_l| ≤ 1 and if |ρ_{l_0}| = 1, then ρ_{l_0} is not a repeated root; in this case, the method is said to be weakly stable.

Theorem 8 If an LMM is consistent and zero-stable, then it converges, when the time step h −→ 0.

Remark: Note that the condition |ρ| ≤ 1 in the definition of zero-stability is somewhat natural, since otherwise powers of ρ would grow exponentially and diverge, but the condition that no root of magnitude 1 be repeated is not obvious. In fact, if ρ_{l_0} is a repeated root, then we can easily see that x_n^H = n ρ_{l_0}^n is also a solution of the homogeneous difference equation. This solution will grow linearly with n if |ρ_{l_0}| = 1.

Some examples of zero-stable methods

The 2nd order Adams-Bashforth method
\[ x_{n+1} = x_n + \frac{h}{2}\left[ 3 f_n - f_{n-1} \right] \]
is strongly stable. Its characteristic equation,
\[ \rho^2 = \rho, \]
has two roots: ρ_0 = 1, ρ_1 = 0. In fact, we can show that all Adams methods are strongly zero-stable, and so are one-step methods. The characteristic equations for Adams methods are of the form ρ^{r+1} = ρ^r (whose solutions are again 0, repeated r times, and unity), and for one-step methods it is simply ρ = 1.

The characteristic equation of the midpoint (leap-frog) method
\[ x_{n+1} = x_{n-1} + 2h f_n \]
is given by
\[ \rho^2 = 1. \]
The roots are ρ_± = ±1 and the method is therefore only weakly zero-stable. This explains both why the method converges, when h → 0, for our simple example above, according to Theorem 8, and why it is numerically unstable for fixed h > 0 and n increasing, because of the root ρ_- = -1 which obviously induces the computational mode seen in the example above. Notice on the other hand that all Adams and all one-step methods do not have such a computational mode, since the roots of their respective characteristic polynomials, other than ρ_0 = 1, are either zero or do not exist.

4.4 Stiff equations and the notion of absolute stability

We saw in the previous section that zero-stability PLUS consistency guarantees, in theory, convergence of a numerical method when h → 0. However, as shown by the example with the mid-point method above, this is not always enough in practice, and this is viewed as a manifestation of the fact that this method is only weakly zero-stable. While we expect such behavior for all methods that are only weakly zero-stable, it turns out that even strongly stable methods, such as one-step explicit or Adams-Bashforth methods, can lead to catastrophic numerical instabilities for some particular "family" of equations, unless a very small time step is used. This is demonstrated by the following example.

Example: Consider the first order Euler’s method applied to the IVP

y′ = −100y + 100, y(0) = y0. (4.2)

We obtain
\[ y_{n+1} = y_n - 100 h\, y_n + 100 h, \]
and the solution to the difference equation is
\[ y_n = (1 - 100h)^n (y_0 - 1) + 1. \]
Note that the exact solution is y(t) = (y_0 - 1)e^{-100t} + 1 and we clearly have convergence of y_n to y(t) when h → 0, without any spurious oscillations, since (1 - 100h)^n ≈ e^{-100hn} for small enough h, which is consistent with the fact that the method is strongly zero-stable. However, for |1 - 100h| > 1, i.e. h > 2/100 = 0.02, y_n will grow and oscillate between negative and positive values as n grows, and will become very far from the exact solution after just a few time steps; i.e. we have another kind of numerical instability. This instability is apparently due mostly to the given equation itself and not to the method used. In fact any explicit method will exhibit such behavior for values of h which are not very small, for this one particular equation or for similar ones.
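A minimal Matlab sketch of both regimes for this example (the initial value y_0 = 2 and the two step sizes are arbitrary choices made for illustration):

% Forward Euler on the stiff example y' = -100 y + 100, y(0) = 2.
y0 = 2; T = 1;
for h = [0.001 0.03]                 % below and above the threshold h = 0.02
    t = 0:h:T; y = zeros(size(t)); y(1) = y0;
    for n = 1:numel(t)-1
        y(n+1) = y(n) - 100*h*y(n) + 100*h;
    end
    fprintf('h = %5.3f, max |y_n| = %g\n', h, max(abs(y)))
end
% For h = 0.001 the iterates decay monotonically to 1; for h = 0.03 they
% oscillate and grow, since |1 - 100 h| = 2 > 1.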

The behaviour seen in the example above is typical of the so-called stiff equations. There is no clear definition of what a stiff equation is exactly, but as a rule of thumb we can say that an equation is stiff if, when solved with an explicit method, it requires very small time steps to maintain a "reasonably looking numerical solution":

For a stiff equation, a small time step is needed to avoid numerical instability and not for accuracy purposes.

For stiff equations we need a notion of stability beyond the zero-stability introduced above. We need methods with good absolute-stability properties, i.e. methods that allow the use of large time steps without exhibiting undesirable numerical instabilities.

4.4.1 Notion of Absolute Stability

For a detailed discussion see the book by LeVeque: Finite difference methods for differential equations.

Consider the model problem
\[ x' = \lambda x, \qquad x(0) = x_0, \]
where λ is an arbitrary constant which is possibly complex. Think of λ as being an eigenvalue of a matrix associated with a system of differential equations. Given a numerical method, the goal here is to observe the actual convergence of the method (i.e. the accuracy of the numerical solution) when applied to this particular model equation.

Let's start with Euler's method:
\[ x_{n+1} = x_n + \lambda h x_n. \]
Again, we consider solutions of the form x_n = ρ^n for the difference scheme above. This yields ρ = 1 + λh. Let's call ρ the amplification (or rather damping) factor. Let z = λh, a complex number. Clearly ρ = ρ(z) is a function of z, and we expect Euler's method to behave well when |ρ| < 1.

Definition: (Region of Absolute Stability) The region of absolute stability for a given numerical method is the region in the complex plane where the amplification factor is less than one in magnitude, |ρ(z)| < 1, when the method is applied to the model problem above.

For Euler's method this region reduces to {z : |1 + z| < 1}, i.e. the interior of the unit circle centred at z_0 = -1.

Implicit methods and stiff equations

Let's now consider the backward Euler method:
\[ x_{n+1} = x_n + h\lambda x_{n+1}. \]
We obtain ρ = 1 + zρ, or ρ = \(\frac{1}{1-z}\), and |ρ| < 1 ⟺ |1 - z| > 1. The region of absolute stability is now the entire region of the complex plane outside the unit circle centred at z_0 = 1. This region is much bigger than, and includes, the absolute stability region of the forward Euler method.

It is clear from this calculation that if instead we had used the backward Euler method for the example in (4.2), then no numerical instability would be exhibited for any value h > 0. In practice, implicit methods behave better than explicit methods for such stiff equations. The larger the absolute stability region, the better the method is suited for stiff equations. The regions of absolute stability of the forward and backward Euler methods are shown in Figure 4.1 (a) and (b), respectively.
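As a quick numerical illustration (a sketch only; the initial value y_0 = 2 and the step h = 0.1 are arbitrary), the backward Euler update for the example (4.2) can be solved exactly for y_{n+1} and stays well behaved far above the explicit threshold h = 0.02:

% Backward Euler on y' = -100 y + 100, y(0) = 2.
y0 = 2; h = 0.1; T = 1;              % h is five times the explicit stability limit
t = 0:h:T; y = zeros(size(t)); y(1) = y0;
for n = 1:numel(t)-1
    y(n+1) = (y(n) + 100*h) / (1 + 100*h);   % solve (1+100h) y_{n+1} = y_n + 100h
end
disp([t.' y.'])                      % decays monotonically towards the steady state 1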

Exercise: Show that the region of absolute stability

• for all second order Runge-Kutta methods,
\[ x_{n+1} = x_n + h\left[ \beta_1 f_n + \beta_2 f(x_n + \alpha h f_n, t_n + \gamma h) \right], \]
is given by |ρ| = |1 + z + z²/2| < 1, which reduces to the ellipse-like region, centred at z_0 = -1, shown on Figure 4.1 (c). Here α, β_1, β_2, γ satisfy the second order consistency conditions of the previous chapter.

• for the midpoint (leap-frog) method,
\[ x_{n+1} = x_{n-1} + 2h\lambda x_n, \]
it reduces to the line segment connecting z = i and z = -i on the imaginary axis, as shown on Figure 4.1 (d). Note that two roots
\[ \rho_\pm = z \pm \sqrt{z^2 + 1} \]
are associated with this method when applied to the model equation.

• for the trapezoidal or 2nd order Adams-Moulton method,
\[ x_{n+1} = x_n + \frac{h}{2}(f_n + f_{n+1}), \]
it is simply the whole left half of the complex plane, as indicated on Figure 4.1 (e). Note that here there is only one root of the characteristic equation. It is given by
\[ \rho = \frac{2+z}{2-z}. \]

• For the 2nd order Adams-Bashforth method we have
\[ \rho^2 - \left(1 + \frac{3}{2}z\right)\rho + \frac{z}{2} = 0, \qquad \rho_\pm = \frac{1}{2} + \frac{3}{4}z \pm \frac{1}{2}\sqrt{1 + \frac{9}{4}z^2 + z}. \]
The region of A-stability is the intersection of the regions |ρ_+| < 1 and |ρ_-| < 1. It can be illustrated in matlab using the following code:

%%% 2nd order Adams-Bashforth A-stability region
x=-2:.05:2; y=x; [X,Y]=meshgrid(x,y); Z=X+i*Y;
rhom = 0.5 + 3*Z/4 - 0.5*sqrt(1+9/4*Z.^2+Z);
rhop = 0.5 + 3*Z/4 + 0.5*sqrt(1+9/4*Z.^2+Z);
figure(111),clf
contour(X,Y,abs(rhom),0:.1:1,'r'), hold on
contour(X,Y,abs(rhop),0:.1:1,'b')
title('region of A-stability of Adams-Bashforth method')
text(-1.5,1.75,'blue: |\rho_+|<1; red: |\rho_-|<1')

Execute this code and discuss the region of A-stability of the 2nd order Adams-Bashforth method.

[Figure 4.1: Regions of absolute stability for some numerical methods: (A) Forward Euler, (B) Backward Euler, (C) Heun's method, (D) Midpoint (leap-frog), (E) Trapezoidal. White = stable region, except for the midpoint method which is stable only on the segment [-i, i]. Which of these methods are A-stable and which ones are L-stable?]

4.5 A-stability, L-stability, and the BDF methods

An LMM is said to be A-stable if its region of absolute stability contains the whole left half of the complex plane. It turns out that only implicit methods, such as the backward Euler and the trapezoidal methods, have this property (LeVeque). This is obviously the reason why they perform better on stiff equations.

However, for very stiff equations, practical experiments have shown that the region of absolute stability needs to be larger than just the left half of the complex plane for the method to perform nicely. We need methods for which the region of absolute stability includes a big portion of the right half of the plane; such methods are said to be L-stable. Of the methods we have seen so far, only the backward Euler method is L-stable. Other methods with this L-stability property exist in the literature and are coded in the available software. One whole family of such methods are the so-called implicit BDF (backward differentiation formula) methods. The main idea of BDF methods consists of interpolating the solution x(t) using the interpolation points x_{n-k}, x_{n-k+1}, ..., x_n, x_{n+1}, then differentiating the resulting polynomial before inserting it into the equation, as opposed to Adams methods where we interpolate f(t, x(t)) and then integrate. There is a big body of literature on BDF methods, and one of the Matlab routines devoted to stiff equations, namely ode15s, is based on implicit BDF formulas.
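For concreteness, a standard instance of this construction (stated here only for illustration; it is not derived in these notes) is the two-step BDF method, obtained by differentiating the quadratic polynomial interpolating x(t) at t_{n-1}, t_n, t_{n+1} and collocating the ODE at t_{n+1}:
\[ x_{n+1} - \frac{4}{3} x_n + \frac{1}{3} x_{n-1} = \frac{2h}{3} f(x_{n+1}, t_{n+1}), \]
which, like the backward Euler method, is implicit and second order accurate.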

4.6 Matlab ODE suite

For more information see the paper by Ashino et al.1. In a nutshell we have the following:

The Matlab ODE suite contains three explicit methods for nonstiff problems:

• The explicit Runge-Kutta pair ode23 of orders 3 and 2,

• The explicit Runge-Kutta pair ode45 of orders 5 and 4, of Dormand-Prince,

• The Adams-Bashforth/Moulton predictor-corrector pairs ode113 of orders 1 to 13,

and two implicit methods for stiff systems:

• The implicit Runge-Kutta pair ode23s of orders 2 and 3,

• The implicit numerical differentiation formulas ode15s of orders 1 to 5.

Matlab’s rule of thumb for stiff and non-stiff equations:

¹Ryuichi Ashino, Michihiro Nagase, and Rémi Vaillancourt: Behind and beyond the Matlab ODE suite, CRM Technical Report CRM-2651, January 2000.

Typically, if an explicit Matlab routine such as ode45 takes longer to solve a given equation on a given time interval than an implicit routine such as ode23s or ode15s, with the same error tolerance, then the equation is stiff. Otherwise, i.e. if ode45 is faster, then the equation is non-stiff.
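A minimal sketch of this rule of thumb (illustrative only; the test equation is the stiff example y' = -100y + 100 used earlier, with an arbitrary initial value):

% Compare an explicit and an implicit Matlab solver on a mildly stiff equation.
fstiff = @(t,y) -100*y + 100;
tic, [t1,y1] = ode45(fstiff,  [0 10], 2); t_ode45  = toc;
tic, [t2,y2] = ode15s(fstiff, [0 10], 2); t_ode15s = toc;
fprintf('ode45 : %d steps, %.3f s\n', numel(t1)-1, t_ode45)
fprintf('ode15s: %d steps, %.3f s\n', numel(t2)-1, t_ode15s)
% ode45 is forced to take many small steps to remain stable, a sign of stiffness.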


Chapter 5

Miscellaneous Linear Algebra

Here we discuss a few topics of linear algebra that are relevant for scientific computing problems.

Perhaps the most popular problem in linear algebra is that of solving the linear equation
\[ AX = b. \]
We know that this problem has a unique solution for every right hand side b if and only if the matrix A is non-singular. The latter is equivalent to saying that the determinant of A is non-zero or that all its eigenvalues are non-zero. This is also equivalent to
\[ AX = 0 \iff X = 0. \]
Thus, an important question before trying to solve the problem AX = b is to find out whether A is non-singular or not. Computing its determinant or finding its eigenvalues is not always the most efficient method, especially when dealing with a problem of very large dimension, which arises from the discretization of a differential equation for example.

Thus, we begin by exploring the properties of a few special matrices that often result from such applications.

5.1 Diagonally Dominant and Positive Definite Matrices

Definition 8 A matrix A ∈ ℝ^{n×n} is said to be diagonally dominant if
\[ |a_{ii}| \ge \sum_{j \ne i} |a_{ij}|, \qquad \forall i = 1, 2, \cdots, n. \]
A is said to be strictly diagonally dominant if
\[ |a_{ii}| > \sum_{j \ne i} |a_{ij}|, \qquad \forall i = 1, 2, \cdots, n. \]

Example. Consider the matrix
\[
A_h = \begin{pmatrix}
2 + h_1 & -1 & 0 & \cdots & 0 \\
-1 & 2 + h_2 & -1 & \ddots & \vdots \\
0 & \ddots & \ddots & \ddots & 0 \\
\vdots & \ddots & -1 & 2 + h_{n-1} & -1 \\
0 & \cdots & 0 & -1 & 2 + h_n
\end{pmatrix}, \tag{5.1}
\]
where h_1, h_2, ..., h_n are real parameters. The matrix A_h is diagonally dominant for all h_j ≥ 0, j = 1, 2, ..., n, and strictly diagonally dominant when h_j > 0, j = 1, 2, ..., n.

Theorem 9 If A is a strictly diagonally dominant matrix, then A is non-singular.

proof: Assume the theorem is false and let X be a non-zero vector in ℝ^n such that AX = 0. Let i_0 be the index such that
\[ |x_{i_0}| = \max_{1 \le i \le n} |x_i|. \]
We have
\[ 0 = (AX)_{i_0} = a_{i_0 i_0} x_{i_0} + \sum_{j \ne i_0} a_{i_0 j} x_j \iff a_{i_0 i_0} = -\sum_{j \ne i_0} a_{i_0 j} \frac{x_j}{x_{i_0}}. \]
Note that the fact that X ≠ 0 guarantees that x_{i_0} ≠ 0. This implies
\[ |a_{i_0 i_0}| \le \sum_{j \ne i_0} |a_{i_0 j}| \frac{|x_j|}{|x_{i_0}|} \le \sum_{j \ne i_0} |a_{i_0 j}|, \]
which contradicts the fact that A is strictly diagonally dominant.

Definition 9 A matrix A ∈ ℝ^{n×n} is said to be symmetric positive definite if it is symmetric and satisfies
\[ X^T A X \equiv \sum_{i,j=1}^{n} a_{ij} x_i x_j > 0, \qquad \forall X \ne 0. \]
If A satisfies only X^T A X ≥ 0 for all X ∈ ℝ^n, then A is said to be positive semi-definite.

Theorem 10 If A is symmetric positive definite, then A is non-singular.

Proof: If the theorem were false, then there would be X ≠ 0 such that AX = 0, which would imply that X^T A X = 0. This contradicts the fact that A is positive definite.

Theorem 11 If A is symmetric, strictly diagonally dominant, and a_{ii} > 0 for all i = 1, ..., n, then A is positive definite.

Proof
\[
X^T A X = \sum_{i=1}^{n} a_{ii} x_i^2 + \sum_{i=1}^{n} \sum_{j \ne i} a_{ij} x_i x_j
\ge \sum_{i=1}^{n} a_{ii} x_i^2 - \sum_{i=1}^{n} \sum_{j \ne i} |a_{ij} x_i x_j|
\ge \sum_{i=1}^{n} a_{ii} x_i^2 - \frac{1}{2} \sum_{i=1}^{n} \sum_{j \ne i} |a_{ij}| (x_i^2 + x_j^2).
\]
The first inequality results from the fact that a + b ≥ a - |b|, and the second from 2|ab| ≤ a² + b².

Now,
\[ \sum_{i=1}^{n} \sum_{j \ne i} |a_{ij}| (x_i^2 + x_j^2) = 2 \sum_{i=1}^{n} \sum_{j \ne i} |a_{ij}| x_i^2 \]
because A is symmetric; thus,
\[ X^T A X \ge \sum_{i=1}^{n} \left( a_{ii} - \sum_{j \ne i} |a_{ij}| \right) x_i^2 > 0, \]
provided there is at least one i_0 such that x_{i_0} ≠ 0, which is guaranteed if X ≠ 0.

Example. The matrix A_h in (5.1) is symmetric positive definite if h_j ≥ 0, j = 1, 2, ..., n. The case h_j > 0 follows directly from the fact that A_h is then strictly diagonally dominant. To see that the result remains true for h_j ≥ 0 in general, we proceed as follows.

Let X be a vector in ℝ^n. We have
\[ X^T A_h X = 2x_1^2 - x_1 x_2 - x_2 x_1 + 2x_2^2 - x_2 x_3 - x_3 x_2 + \cdots - x_n x_{n-1} + 2x_n^2 + \sum_{i=1}^{n} h_i x_i^2. \]
We rearrange the terms not involving the h_j to form a succession of squares by grouping like terms together as follows:
\[ 2x_1^2 - x_1 x_2 - x_2 x_1 + 2x_2^2 = x_1^2 + (x_1 - x_2)^2 + x_2^2. \]
The remaining x_2^2 will be added to subsequent terms to form a term (x_2 - x_3)^2, and so on. This yields
\[ X^T A_h X = x_1^2 + (x_1 - x_2)^2 + (x_2 - x_3)^2 + \cdots + (x_{n-1} - x_n)^2 + x_n^2 + \sum_{i=1}^{n} h_i x_i^2 \ge 0, \]
as a sum of non-negative numbers. Further, if X^T A_h X = 0 then
\[ x_1^2 = (x_1 - x_2)^2 = (x_2 - x_3)^2 = \cdots = (x_{n-1} - x_n)^2 = x_n^2 = \sum_{i=1}^{n} h_i x_i^2 = 0. \]
This implies x_1 = x_2 = \cdots = x_n = 0 and A_h is thus positive definite.
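This can also be verified numerically; a quick Matlab sketch (with n = 5 and h_j = 0 chosen only for illustration):

% Numerical check that A_h with h_j = 0 is symmetric positive definite.
n = 5;
A = 2*eye(n) - diag(ones(n-1,1),1) - diag(ones(n-1,1),-1);
eig(A)                  % all eigenvalues are strictly positive
[~, p] = chol(A);
fprintf('chol flag p = %d (p = 0 means positive definite)\n', p)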

Theorem 12 Let A = (a_{i,j})_{1 \le i,j \le n} be a symmetric matrix. Then the following statements are equivalent.

i. A is positive definite.

ii. The principal minor matrices of A, A_k = (a_{i,j})_{1 \le i,j \le k}, k = 1, 2, ..., n, are all positive definite.

iii. The eigenvalues of A are all positive.

iv. The minor determinants of A, det(A_k), k = 1, 2, ..., n, are all positive.

Proof: Going from i. to ii. is straightforward. It results from the fact that for each vector X_k = (x_1, x_2, ..., x_k)^T ∈ ℝ^k, k = 1, 2, ..., n, the vector X = (x_1, x_2, ..., x_k, 0, ..., 0)^T obtained by extending X_k with n - k zeros is a vector in ℝ^n such that X_k^T A_k X_k = X^T A X. The latter is positive iff X_k ≠ 0.

To show that i. implies iii., consider an eigenvalue-eigenvector pair (λ, V) of A:
\[ 0 < V^T A V = \lambda V^T V = \lambda \|V\|_2^2. \]
Since V ≠ 0 by definition of an eigenvector, we conclude that λ > 0.

That iii. implies i. results from the fact that, as a symmetric matrix, A has a complete set of orthogonal eigenvectors. Let V_1, V_2, ..., V_n be such a set. Then for all X = \(\sum_{j=1}^{n} \alpha_j V_j\) in ℝ^n, we have
\[ X^T A X = \Big( \sum_{j=1}^{n} \alpha_j V_j \Big)^T A \Big( \sum_{j=1}^{n} \alpha_j V_j \Big) = \sum_{j=1}^{n} \lambda_j \alpha_j^2 \|V_j\|_2^2. \]
The latter expression is strictly positive if and only if X ≠ 0, since all the λ_j are positive. Finally, it is straightforward to establish that iii. and iv. are equivalent by using the simple fact that the determinant of a matrix is given by the product of its eigenvalues.

5.2 Least Square Approximation

Let (xi, yi), i = 1, 2, · · · ,m be m data points. We wish to find a polynomial

Pn(x) = a0 + a1x+ · · ·+ anxn

of degree at most n that is as close as possible to the given data points.

We note that when m = n+1 and the n+1 points (x_i)_{1 \le i \le n+1} are distinct, such a P_n(x) exists uniquely and satisfies P_n(x_i) = y_i, i = 1, ..., n+1. It is called the interpolation polynomial. However, in general we are interested in the practical case when m >> n+1. Trying to find the interpolation polynomial when m is large is not a good idea for many reasons, two of which are that higher order interpolation polynomials can result in large oscillations (the Runge phenomenon) and that trying to find a large number of polynomial coefficients, which is what the issue really is, is simply impractical.

In this case we would like to find P_n(x) that minimizes the "distance" to the given data points, in some sense to be specified.

In least squares approximation, we seek to minimize
\[ \Phi(a_0, a_1, \cdots, a_n) = \sum_{i=1}^{m} w_i \left[ y_i - (a_0 + a_1 x_i + \cdots + a_n x_i^n) \right]^2, \]
where w_i > 0, i = 1, 2, ..., m, are weights that can be chosen to favour data points that are more important than others. Maybe some measurements are more accurate, or maybe we want to get closer to some particular points because of a particular interest, etc.

The expression above defines a quadratic function of the coefficient vector \(\vec a \equiv (a_0, a_1, \cdots, a_n)\), which is differentiable and coercive (\(\lim_{\|\vec a\| \to \infty} \Phi(\vec a) = +\infty\)) on ℝ^{n+1}. Therefore, it has a global minimum, which is also a local minimum reached at a critical point of Φ:
\[ \nabla_{\vec a} \Phi = 0. \]
We have
\[ \frac{\partial \Phi}{\partial a_j} = -2 \sum_{i=1}^{m} w_i x_i^j \left[ y_i - (a_0 + a_1 x_i + \cdots + a_n x_i^n) \right] = 0, \quad j = 0, 1, \cdots, n. \]

This is equivalent to
\[ \sum_{k=0}^{n} \left( \sum_{i=1}^{m} w_i x_i^j x_i^k \right) a_k = \sum_{i=1}^{m} w_i x_i^j y_i, \quad j = 0, 1, \cdots, n. \]
This is a linear system for the coefficients a_j, j = 0, 1, ..., n, that can be written in matrix form as
\[ S \vec a = \vec b, \]
where
\[ S_{kj} = \sum_{i=1}^{m} w_i x_i^j x_i^k, \qquad b_j = \sum_{i=1}^{m} w_i x_i^j y_i. \]

Further, if we introduce the inner product
\[ \langle f, g \rangle = \sum_{i=1}^{m} w_i f(x_i) g(x_i), \]
then S_{kj} = ⟨x^j, x^k⟩ and b_j = ⟨x^j, y⟩.

With the given m data points, let E be the m × (n+1) Vandermonde matrix given by
\[
E = \begin{pmatrix}
1 & x_1 & x_1^2 & \cdots & x_1^n \\
1 & x_2 & x_2^2 & \cdots & x_2^n \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_m & x_m^2 & \cdots & x_m^n
\end{pmatrix}.
\]

We have the following result.

Lemma 1 We have
\[ S = E^T W E \quad \text{and} \quad \vec b = E^T W Y, \]
where W is the diagonal matrix given by W = diag[w_1, w_2, ..., w_m] and Y = (y_1, y_2, ..., y_m)^T.

The proof is left as an exercise for the student. By this lemma we can prove the following theorem.

Theorem 13 If at least (n+1) of the m data points x_1, x_2, ..., x_m are distinct, then the matrix S is symmetric positive definite.

Thus, the linear system has a unique solution, which is the unique global minimum and unique critical point of Φ; thus Φ is convex everywhere.

proof: The fact that S is symmetric follows directly from its structure, and from Lemma 1 in particular. Also from Lemma 1, we have
\[ X^T S X = (EX)^T W (EX) \ge 0, \]
since W is positive definite as a diagonal matrix whose diagonal entries are all positive. Moreover, this quantity is zero only if EX = 0. But the rank of E is exactly n+1 because the Vandermonde matrix associated with (n+1) distinct points (chosen among the m rows of E) is non-singular, as a consequence of the existence and uniqueness of the interpolation polynomial. Hence EX = 0 implies X = 0, so that X^T S X > 0 whenever X ≠ 0.
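A minimal Matlab sketch of the weighted least squares fit described above (the data, the weights and the degree are made up only for illustration):

% Weighted polynomial least squares via the normal equations S a = b.
m = 50; n = 3;                       % 50 data points, cubic fit
x = linspace(0,1,m).'; y = cos(3*x) + 0.05*randn(m,1);
w = ones(m,1);                       % weights (all equal here)
E = ones(m,n+1);                     % Vandermonde matrix
for k = 1:n, E(:,k+1) = x.^k; end
W = diag(w);
S = E.'*W*E;  b = E.'*W*y;           % normal equations
a = S\b;                             % polynomial coefficients a_0, ..., a_n
plot(x, y, 'o', x, E*a, '-')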

5.2.1 General least squares and the Gram-Schmidt orthogonalization procedure

Let φ_0(x), φ_1(x), ..., φ_n(x) be n+1 functions of the variable x, defined on some interval [a, b] (possibly the whole real line), that are linearly independent. Imagine we wish to find a function ψ(x) that "approximates" the set of data points (x_i, y_i), 1 ≤ i ≤ m, with m >> n+1, of the form
\[ \psi(x) = \sum_{j=0}^{n} a_j \phi_j(x). \]
Note that in the case of least squares using polynomial approximation, the functions φ_j are simply the monomials φ_0(x) = 1, φ_1(x) = x, ..., φ_n(x) = x^n. But this general framework allows the consideration of arbitrary functions which are not only polynomials. If we know for example that the data comes from the values of a periodic function f(x), then it makes sense to use a combination of sines and cosines in place of a polynomial by setting, for instance,

φ_0(x) = 1, φ_1(x) = cos(2πx/(b-a)), φ_2(x) = cos(4πx/(b-a)), ..., φ_n(x) = cos(2nπx/(b-a)),

or

φ_1(x) = sin(2πx/(b-a)), φ_2(x) = sin(4πx/(b-a)), ..., φ_n(x) = sin(2nπx/(b-a)),

or a combination of sines and cosines mixed together (with n = 2p):

φ_0 = 1, φ_1(x) = cos(2πx/(b-a)), φ_2(x) = sin(2πx/(b-a)), φ_3(x) = cos(4πx/(b-a)), φ_4(x) = sin(4πx/(b-a)), ..., φ_{2p-1}(x) = cos(2pπx/(b-a)), φ_{2p}(x) = sin(2pπx/(b-a)).

We thus seek to minimize
\[ \Phi(a_0, a_1, \cdots, a_n) = \sum_{i=1}^{m} w_i \left[ y_i - \sum_{j=0}^{n} a_j \phi_j(x_i) \right]^2, \]
which leads to the linear algebra problem
\[ S \vec a = \vec b, \]
where S_{jk} = ⟨φ_j, φ_k⟩ and b_j = ⟨y, φ_j⟩.

We note that if the φ_j were orthogonal, then this system would be easy to solve, because the matrix S would simply be diagonal. Recall that two vectors are said to be orthogonal if their inner product is zero.

Thus, in order to have an efficient least squares algorithm it is desirable to have a set of basis functions φ_j that are orthogonal. However, this is not trivial in practice when the data points depend on other criteria. Nonetheless, given a set of functions φ_j that are linearly independent, it is always possible to construct a set of orthogonal replicas φ̃_j. One such technique, known as the Gram-Schmidt orthogonalization procedure, is given next.

Gram-Schmidt orthogonalization procedure

1. Set \(\tilde\phi_0 = \phi_0\).

2. Set \(\tilde\phi_1 = \phi_1 - \frac{\langle \phi_1, \tilde\phi_0 \rangle}{\langle \tilde\phi_0, \tilde\phi_0 \rangle} \tilde\phi_0\).

3. Set \(\tilde\phi_{k+1} = \phi_{k+1} - \sum_{l=0}^{k} \frac{\langle \phi_{k+1}, \tilde\phi_l \rangle}{\langle \tilde\phi_l, \tilde\phi_l \rangle} \tilde\phi_l\), for k = 1, ..., n-1.

Theorem 14 Given (n+1) linearly independent functions φ_j(x), j = 0, 1, ..., n, the functions φ̃_j constructed by the Gram-Schmidt orthogonalization algorithm are indeed orthogonal.

Proof: We show that for all k = 0, 1, ..., n-1, φ̃_{k+1} is orthogonal to all φ̃_j, j = 0, 1, ..., k. We proceed by induction.

First, for k = 0 we have
\[ \langle \tilde\phi_1, \tilde\phi_0 \rangle = \langle \phi_1, \tilde\phi_0 \rangle - \frac{\langle \phi_1, \tilde\phi_0 \rangle}{\langle \tilde\phi_0, \tilde\phi_0 \rangle} \langle \tilde\phi_0, \tilde\phi_0 \rangle = 0. \]
Here we used two of the main properties of the inner product, namely
\[ \langle X + Z, Y \rangle = \langle X, Y \rangle + \langle Z, Y \rangle \quad \text{and} \quad \langle cX, Y \rangle = c \langle X, Y \rangle, \]
for any arbitrary set of vectors X, Y, Z and scalar c: set X = φ_1, Y = φ̃_0, and Z = -(⟨φ_1, φ̃_0⟩/⟨φ̃_0, φ̃_0⟩) φ̃_0 to obtain the first expression, noting that -⟨φ_1, φ̃_0⟩/⟨φ̃_0, φ̃_0⟩ is indeed a scalar, and then set X = Y = φ̃_0 and c = -⟨φ_1, φ̃_0⟩/⟨φ̃_0, φ̃_0⟩ to get to the final result.

Second, we assume that ⟨φ̃_j, φ̃_l⟩ = 0, l = 0, 1, ..., j-1, for j = 1, 2, ..., k, and show that this remains true for j = k+1. We have, for j = 0, 1, ..., k,
\[ \langle \tilde\phi_{k+1}, \tilde\phi_j \rangle = \langle \phi_{k+1}, \tilde\phi_j \rangle - \sum_{l=0}^{k} \frac{\langle \phi_{k+1}, \tilde\phi_l \rangle}{\langle \tilde\phi_l, \tilde\phi_l \rangle} \langle \tilde\phi_l, \tilde\phi_j \rangle = \langle \phi_{k+1}, \tilde\phi_j \rangle - \frac{\langle \phi_{k+1}, \tilde\phi_j \rangle}{\langle \tilde\phi_j, \tilde\phi_j \rangle} \langle \tilde\phi_j, \tilde\phi_j \rangle = 0, \]
based on the fact that ⟨φ̃_l, φ̃_j⟩ = 0 for l ≠ j. This concludes the proof.

5.3 Matrix factorization

The most popular method for solving a linear system of equations AX = b is perhaps the Gauss elimination method. It mainly consists of reducing the problem to an equivalent one with an upper triangular matrix U, by successive simple linear operations performed on the rows of the matrix A in such a way as to eliminate the entries below the main diagonal. With A_1 = A ≡ (a_{ij})_{1 \le i,j \le n}, at each stage of the elimination one gets a matrix of the form
\[
A_k = \begin{pmatrix}
a_{11}^{k} & a_{12}^{k} & \cdots & \cdots & \cdots & a_{1n}^{k} \\
0 & a_{22}^{k} & \cdots & \cdots & \cdots & a_{2n}^{k} \\
\vdots & \ddots & \ddots & & & \vdots \\
\vdots & & 0 & a_{kk}^{k} & \cdots & a_{kn}^{k} \\
\vdots & & \vdots & a_{k+1,k}^{k} & \cdots & a_{k+1,n}^{k} \\
0 & \cdots & 0 & a_{nk}^{k} & \cdots & a_{nn}^{k}
\end{pmatrix},
\]
where
\[ a_{i,j}^{k+1} = a_{i,j}^{k} - m_{ik}\, a_{k,j}^{k}, \qquad m_{ik} = \frac{a_{ik}^{k}}{a_{kk}^{k}}, \qquad i, j = k+1, \cdots, n, \quad k = 1, \cdots, n-1. \]
Note that by construction
\[ a_{i,k}^{k+1} = 0, \qquad i = k+1, \cdots, n, \]
and that rows are interchanged, a deeper ith row, i ≥ k+1, being brought up, whenever the diagonal element a_{kk}^{k} is zero. This amounts to multiplying the matrix A_k by a permutation matrix P_{ik} obtained by swapping the ith and kth rows of the identity matrix.

At the nth stage we get A_n = U, an upper triangular matrix. The lower triangular matrix
\[
L = \begin{pmatrix}
1 & 0 & \cdots & \cdots & 0 \\
m_{21} & 1 & 0 & \cdots & 0 \\
\vdots & \ddots & \ddots & \ddots & \vdots \\
\vdots & & \ddots & \ddots & 0 \\
m_{n1} & \cdots & \cdots & m_{n,n-1} & 1
\end{pmatrix},
\]
formed by the successive multipliers m_{ij} and ones on its diagonal, is the inverse operator that undoes all the successive operations leading to U = A_n, in the sense that we have the following result.

Theorem 15 (LU Factorization) For any given n × n matrix A, there exist three matricesP,L,U such that

PA = LU,

where P is a permutation matrix, L is a lower triangular matrix and U an upper triangular matrix. The matrix L has ones on its main diagonal. The diagonal elements of U are all non-zero if and only if A is non-singular.

Note that det(A) = ±\(\prod_{i=1}^{n} u_{ii}\), the sign being determined by the permutation P. To solve the system AX = b, one would perform the same operations above on the vector b, considered as an (n+1)th column of A, to obtain what is often called an augmented matrix.

In Matlab, the backward division operation, >>X=A\b, does in fact invoke the Gauss elimination algorithm to solve the given system. Also, the command-function lu applied to a matrix A returns the three matrices P, L, U of the theorem. Type >>help lu in the Matlab command window to learn more. Once an LU factorization is established, the solution of the system AX = b can be obtained in two steps, each involving the solution of a triangular system: first solve LY = Pb and then solve UX = Y, by forward and backward substitution, respectively.
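A minimal Matlab sketch of this two-step solve (the random data are only for illustration):

% Solve A X = b via LU factorization and two triangular solves.
n = 6; A = rand(n); b = rand(n,1);
[L,U,P] = lu(A);        % P*A = L*U
Y = L \ (P*b);          % forward substitution
X = U \ Y;              % backward substitution
norm(A*X - b)           % residual, should be at round-off level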

For symmetric positive definite matrices, we have what is known as the Cholesky factorization, leading to one single triangular matrix, which is therefore cheaper to store and obtain.

Theorem 16 (Cholesky Factorization) A matrix A is symmetric positive definite if and only if there exists a lower triangular matrix L (whose diagonal elements are not necessarily all ones) such that
\[ A = L L^T. \]
The Cholesky factorization is also available in Matlab. Type >>help chol.

Another well known factorization, which applies to all matrices, even non-square m × n matrices (usually m >> n), is the QR factorization, which involves an orthogonal matrix Q and an upper triangular matrix R (for some reason it is traditionally denoted by R and not U). The QR factorization is particularly useful for the problem of least squares (see Question #2 in the list of problems below).

Theorem 17 (QR Factorization) Let A be an m × n matrix. Then there exist an orthogonalm× n matrix, Q, and an upper triangular n× n matrix R such that

A = QR.

Recall that an orthogonal matrix is a matrix whose columns are orthonormal, i.e. if \(\vec q_1, \vec q_2, \cdots, \vec q_n\) are the n columns of Q, then
\[ \vec q_j \cdot \vec q_k = \begin{cases} 0, & \text{if } k \ne j, \\ 1, & \text{if } j = k. \end{cases} \]
Consequently, if Q ∈ ℝ^{m×n} is orthogonal, then
\[ Q^T Q = I_{n \times n}. \]
For square matrices, i.e. if n = m, an orthogonal matrix is invertible and its inverse is its transpose, Q^{-1} = Q^T; consequently its transpose is also orthogonal, i.e. its rows are also orthonormal.

In Matlab, the QR factorization is obtained by using the command qr. There are many algorithms known to produce the QR factorization of a matrix A; there are for example the Givens and the Householder transformations, which apply specifically to square matrices. But an easy and straightforward way to obtain a QR factorization is to apply the Gram-Schmidt orthogonalization to the columns of the matrix A. This works as follows.

Let A = [\(\vec a_1, \vec a_2, \cdots, \vec a_n\)], where the \(\vec a_j\) refer to the n columns of A, each of which is a vector in ℝ^m. The QR factorization is obtained by a slight modification of the Gram-Schmidt algorithm provided above, consisting mainly of normalizing the orthogonalized vectors \(\vec u_j\). We have

1. \(\vec e_1 = \frac{1}{\|\vec a_1\|} \vec a_1\),

2. \(\vec u_{k+1} = \vec a_{k+1} - \sum_{l=1}^{k} (\vec a_{k+1} \cdot \vec e_l)\, \vec e_l\),

3. \(\vec e_{k+1} = \frac{\vec u_{k+1}}{\|\vec u_{k+1}\|}\), for k = 1, 2, ..., n-1.

Consequently, we have A = QR with
\[ Q = [\vec e_1, \vec e_2, \cdots, \vec e_n] \]
and
\[ R_{ij} = \begin{cases} \vec e_i \cdot \vec a_j, & \text{if } j \ge i, \\ 0, & \text{otherwise}, \end{cases} \]
namely,
\[
R = \begin{pmatrix}
\vec a_1 \cdot \vec e_1 & \vec a_2 \cdot \vec e_1 & \cdots & \vec a_n \cdot \vec e_1 \\
0 & \vec a_2 \cdot \vec e_2 & \cdots & \vec a_n \cdot \vec e_2 \\
\vdots & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & \vec a_n \cdot \vec e_n
\end{pmatrix}.
\]

To see this, let R_j be the jth column of R. Then
\[ Q R_j = \sum_{i=1}^{j} (\vec a_j \cdot \vec e_i)\, \vec e_i. \]
Note that from steps 2 and 3 of the Gram-Schmidt algorithm, we have \(\vec a_j \cdot \vec e_j = \|\vec u_j\|\). Thus,
\[ Q R_j = \sum_{i=1}^{j-1} (\vec a_j \cdot \vec e_i)\, \vec e_i + (\vec a_j \cdot \vec e_j)\, \vec e_j = \sum_{i=1}^{j-1} (\vec a_j \cdot \vec e_i)\, \vec e_i + \vec u_j = \vec a_j, \]
i.e. A = QR.
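A minimal Matlab sketch of this Gram-Schmidt construction of Q and R (the random rectangular matrix is only for illustration):

% QR factorization of an m-by-n matrix by Gram-Schmidt orthogonalization.
m = 8; n = 4; A = rand(m,n);
Q = zeros(m,n); R = zeros(n,n);
for k = 1:n
    u = A(:,k);
    for i = 1:k-1
        R(i,k) = Q(:,i).'*A(:,k);    % R_{ik} = e_i . a_k
        u = u - R(i,k)*Q(:,i);
    end
    R(k,k) = norm(u);
    Q(:,k) = u / R(k,k);
end
norm(A - Q*R)                        % should be at round-off level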

5.4 Condition number

The condition number is a measure of how sensitive the solution of the system AX = b is to small perturbations, and in particular to round-off errors. For a quick illustration, consider the perturbed system
\[ A \bar X = \bar b, \]
where \(\bar X = X + \delta_X\) and \(\bar b = b + \delta_b\), with δ_X and δ_b small perturbations of the vectors X and b. Combining the original and perturbed systems, AX = b and A\(\bar X\) = \(\bar b\), yields
\[ \delta_X = A^{-1} \delta_b. \]
If we denote by ||.||_v a vector norm and by ||.||_m its subordinate matrix norm, then
\[ \|\delta_X\|_v \le \|A^{-1}\|_m \|\delta_b\|_v \quad \text{and} \quad \|b\|_v = \|AX\|_v \le \|A\|_m \|X\|_v. \]

Thus,
\[ \frac{\|\delta_X\|_v}{\|A\|_m \|X\|_v} \le \|A^{-1}\|_m \frac{\|\delta_b\|_v}{\|b\|_v}, \]
or
\[ \frac{\|\delta_X\|_v}{\|X\|_v} \le \operatorname{cond}(A)\, \frac{\|\delta_b\|_v}{\|b\|_v}, \tag{5.2} \]
where cond(A) ≡ ||A||_m ||A^{-1}||_m is the condition number of A associated with the matrix norm ||.||_m.

The inequality in (5.2) states that the relative error ||δ_X||_v/||X||_v committed on the solution X is bounded by the relative error committed on the data vector b times the condition number of A. Thus the larger the condition number, the larger the relative error on X can be. If for example the error on b is solely due to round-off errors, which is typically the case when using a direct method such as Gauss elimination (or one of the factorization methods listed above), then the error on the solution X cannot be very significant unless the condition number is large enough to counterbalance the round-off errors. Given that the latter are typically on the order of the machine epsilon, which is roughly 10^{-16} in a double precision environment (such as Matlab), a condition number of 10^{15} or larger could result in errors of 10% or larger. Thus, the matrix A is typically said to be ill conditioned if cond(A) ≳ 10^{15}.

Note that the condition number is always defined with reference to a compatible matrix norm. In Matlab, the command >>cond(A,p), with p = 1, 2 or Inf, returns the condition number associated with the corresponding matrix norm; the default value of p is 2.

Further Remarks

• The condition number is bounded from below by the ratio between the largest and the smallest eigenvalues of the matrix. Indeed, recall that for any given compatible matrix norm, we have
\[ \|A\| \ge \rho(A), \]
where ρ(A) = max{|λ| : λ is an eigenvalue of A} is the spectral radius of A. Thus,
\[ \operatorname{cond}(A) = \|A\|\,\|A^{-1}\| \ge \rho(A)\,\rho(A^{-1}). \]
But the eigenvalues of A^{-1} are the inverses of the eigenvalues of A. If the eigenvalues of A are such that
\[ |\lambda_1| \le |\lambda_2| \le \cdots \le |\lambda_n|, \]
then the eigenvalues of A^{-1} satisfy
\[ \frac{1}{|\lambda_1|} \ge \frac{1}{|\lambda_2|} \ge \cdots \ge \frac{1}{|\lambda_n|}. \]
Thus,
\[ \operatorname{cond}(A) \ge \rho(A)\,\rho(A^{-1}) = \left| \frac{\lambda_n}{\lambda_1} \right|. \]
In some sense the condition number of A provides a measure of how far apart its largest and smallest eigenvalues are. ODEs based on such matrices are notoriously prone to being stiff. Re-examine problem 7 of Chapter 3.

• For the 2-norm the condition number is given by
\[ \operatorname{cond}(A) = \sqrt{\rho(A^T A)\,\rho\big((A^{-1})^T A^{-1}\big)} = \frac{\sigma_n}{\sigma_1}, \]
where σ_1, σ_2, ..., σ_n are known as the singular values of A. In particular, they satisfy
\[ \sigma_j^2 = \mu_j, \qquad j = 1, 2, \cdots, n, \]
where μ_1 ≤ μ_2 ≤ ... ≤ μ_n are the eigenvalues of A^T A. Singular values will be discussed in the next section.

Example. Consider the tridiagonal matrix in (5.1) corresponding to h_j = 0, i.e. the tridiagonal matrix with 2's on the main diagonal, denoted by A for simplicity. The eigenvalues of A are given by
\[ \lambda_k = 2 - 2\cos\left(\frac{k\pi}{n+1}\right), \qquad k = 1, 2, \cdots, n. \]
Indeed,
\[ AX = \lambda X \iff x_{j-1} - (2 - \lambda) x_j + x_{j+1} = 0, \quad j = 1, 2, \cdots, n, \quad x_0 = x_{n+1} = 0. \]
We seek a solution of the form x_j = ρ^j for this difference equation. This leads to the characteristic equation
\[ \rho^2 - (2 - \lambda)\rho + 1 = 0. \]

This quadratic equation has two roots
\[ \rho_\pm = \frac{2 - \lambda}{2} \pm \frac{1}{2}\sqrt{(2 - \lambda)^2 - 4}. \]
The roots ρ_± are distinct except when λ = 0 and λ = 4. The general solution of the difference equation is given by
\[ x_j = c_1 \rho_+^j + c_2 \rho_-^j. \]
The boundary conditions x_0 = x_{n+1} = 0 imply that c_2 = -c_1 and c_1(\rho_+^{n+1} - \rho_-^{n+1}) = 0. For a non-trivial solution to exist, we need
\[ \rho_+^{n+1} - \rho_-^{n+1} = 0 \iff \left(\frac{\rho_+}{\rho_-}\right)^{n+1} = 1. \]
Now, back to the quadratic equation, we have
\[ \rho_+ \rho_- = 1 \quad \text{and} \quad \rho_+ + \rho_- = 2 - \lambda. \]
This implies, in particular, that
\[ \rho_+^{2(n+1)} = 1, \]

i.e. ρ_+ and ρ_- are 2(n+1)-th roots of unity. We first note that the two real roots ρ_+ = 1 and ρ_+ = -1 are eliminated because they lead to λ = 0 and λ = 4, respectively; they both yield a matrix A - λI which is non-singular, i.e. 0 and 4 are not eigenvalues. Thus, the only valid solutions are
\[ \rho_+ = \exp\left(i\frac{k\pi}{n+1}\right), \quad \rho_- = \exp\left(-i\frac{k\pi}{n+1}\right), \quad k = 1, 2, \cdots, n \quad (i^2 = -1), \]
which lead to
\[ \lambda_k = 2 - (\rho_+ + \rho_-) = 2 - 2\cos\left(\frac{k\pi}{n+1}\right), \qquad k = 1, 2, \cdots, n, \]
and
\[ x_j^{(k)} = c_1\left(\rho_+^j - \rho_-^j\right) = 2 i c_1 \sin\left(\frac{jk\pi}{n+1}\right), \qquad j, k = 1, 2, \cdots, n, \]
which yields the anticipated answer when 2ic_1 = 1.

Note that
\[ 0 < \lambda_1 < \lambda_2 < \cdots < \lambda_n < 4. \]

Now, the condition number of A in the 2-norm satisfies
\[ \operatorname{cond}(A) = \|A\|_2 \|A^{-1}\|_2 = \rho(A)\,\rho(A^{-1}) = \frac{2 - 2\cos(n\pi/(n+1))}{2 - 2\cos(\pi/(n+1))}, \]
since λ_n and 1/λ_1 are the largest eigenvalues of A and A^{-1}, respectively. We have
\[ 2 - 2\cos\left(\frac{n\pi}{n+1}\right) = 2\left(1 - \cos\left(\pi - \frac{\pi}{n+1}\right)\right) = 2 + 2\cos\left(\frac{\pi}{n+1}\right) \]
and
\[ \cos\left(\frac{\pi}{n+1}\right) = 1 - \frac{1}{2}\frac{\pi^2}{(n+1)^2} + o\left(\frac{1}{n^2}\right). \]
Thus,
\[ \operatorname{cond}(A) = \frac{4 - \frac{\pi^2}{(n+1)^2} + o\left(\frac{1}{n^2}\right)}{\frac{\pi^2}{(n+1)^2} + o\left(\frac{1}{n^2}\right)} = \frac{4(n+1)^2}{\pi^2} + o(1). \]
Recall that this matrix results from the discretization of the boundary value problem
\[ y'' = f(x), \qquad y(a) = \alpha, \quad y(b) = \beta. \]
As we can see, the matrix becomes ill conditioned when n is very large:
\[ \operatorname{cond}(A) \approx 10^{15} \quad \text{when} \quad n \approx 10^{7}. \]
However, for reasonable values of n below 10^3 the condition number of A remains low and the numerical solution of the linear system AX = b is expected to behave relatively well.
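A quick Matlab check of this estimate (a sketch; n = 100 is an arbitrary choice):

% Compare cond(A) for the tridiagonal matrix with the estimate 4(n+1)^2/pi^2.
n = 100;
A = 2*eye(n) - diag(ones(n-1,1),1) - diag(ones(n-1,1),-1);
[cond(A)  4*(n+1)^2/pi^2]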

5.5 Singular Value Decomposition and Empirical Orthogonal Functions

Another use of matrices is data storage and data management. When a matrix is not used as an operator from ℝ^M to ℝ^N, a linear transformation to be precise, it is often used to arrange data in a nice and transparent way. Consider for example the daily records of temperature at several locations on a map over a long period of time, ranging from a few years to a century, for example. This record is most naturally arranged into a matrix whose entries are the actual temperature recordings, such that its columns are the recording-station locations, running from 1 to N, and its rows are the successive days along the whole record duration, running say from 1 to M. Such a dataset is often called a time series. When M (or N for that matter) is very large, the matrix can be very large and extracting any useful information, or even just storing the data, can be challenging. It is thus meaningful to look at efficient ways to extract useful information and/or compress the data into a smaller matrix, whenever possible.

Consider the data
\[ z_{i,j} = \psi(x_i, t_j), \qquad 1 \le i \le M, \; 1 \le j \le N, \]
which represents some measurement or numerical simulation values of a quantity ψ(x, t) (such as temperature or population density) at the locations x_i, 1 ≤ i ≤ M, and time instances t_j, 1 ≤ j ≤ N, corresponding for example to measurement stations, space and time simulation grid points, and observation times, respectively. Typically M << N, i.e. there are many fewer observation locations than measurement times.

We wish to find a set of functions a_k(t) and φ_k(x), k = 1, ..., M, such that
\[ \psi(x_i, t_j) = \sum_{k=1}^{M} a_k(t_j)\, \phi_k(x_i), \qquad 1 \le i \le M, \; 1 \le j \le N, \tag{5.3} \]
and that are orthogonal to each other in the sense of some inner product to be defined. This would provide easy access to the data structure and may also be useful for data compression. In the context of empirical orthogonal functions, the a_k(t) are known as the amplitude time series functions or the principal components, and the φ_k(x) are the spatial mode or spatial pattern functions, also known as the empirical orthogonal functions or EOFs. We introduce the data matrix

\[
D = \begin{pmatrix}
z_{11} & z_{12} & \cdots & z_{1N} \\
z_{21} & z_{22} & \cdots & z_{2N} \\
\vdots & & \ddots & \vdots \\
z_{M1} & z_{M2} & \cdots & z_{MN}
\end{pmatrix},
\]
the discrete inner product
\[ \langle f, g \rangle = \sum_{i=1}^{M} f(x_i) g(x_i), \]
and the time average
\[ \overline{a(t)} = \frac{1}{N} \sum_{j=1}^{N} a(t_j). \]
We assume that the time average has been removed from the data, i.e.
\[ \frac{1}{N} \sum_{j=1}^{N} z_{ij} = 0, \qquad \forall i = 1, 2, \cdots, M. \]
The time correlation of two functions of time a(t) and b(t) is given by
\[ \overline{a(t)\, b(t)} = \frac{1}{N} \sum_{j=1}^{N} a(t_j)\, b(t_j). \]

We wish to construct spatial pattern and time amplitude functions, φ_k(x) and a_k(t), such that
\[ \langle \phi_k(x), \phi_l(x) \rangle = \delta_{kl} \equiv \begin{cases} 1, & \text{if } k = l, \\ 0, & \text{otherwise}, \end{cases} \]
and
\[ N\, \overline{a_k(t)\, a_l(t)} = \lambda_k \delta_{kl}, \qquad 1 \le l, k \le M. \]
The quantities
\[ \lambda_k = \sum_{j=1}^{N} a_k(t_j)^2 = \vec a_k^{\,T} \vec a_k, \qquad k = 1, \cdots, M, \]
are known as the amplitude mode variances. Here, \(\vec a_k = (a_k(t_1), a_k(t_2), \cdots, a_k(t_N))^T\). The matrix
\[ C = \frac{1}{N} D D^T \]
is formed by the covariances
\[ C_{kl} = \overline{\psi(x_k, t)\, \psi(x_l, t)} = \frac{1}{N}\, d_k\, d_l^T, \]
where d_k = [z_{k,1}, z_{k,2}, ..., z_{k,N}] is the k-th row of the matrix D.

The trace of the matrix C,
\[ \operatorname{trace}(C) = \frac{1}{N} \sum_{i=1}^{M} \sum_{j=1}^{N} z_{ij}^2, \]
is the total data variance.

The constraint in (5.3) can be written in matrix notation as
\[ D = EA, \]
where
\[
E = \begin{pmatrix}
\phi_1(x_1) & \phi_2(x_1) & \cdots & \phi_M(x_1) \\
\phi_1(x_2) & \phi_2(x_2) & \cdots & \phi_M(x_2) \\
\vdots & & \ddots & \vdots \\
\phi_1(x_M) & \phi_2(x_M) & \cdots & \phi_M(x_M)
\end{pmatrix}
\quad \text{and} \quad
A = \begin{pmatrix}
a_1(t_1) & a_1(t_2) & \cdots & a_1(t_N) \\
a_2(t_1) & a_2(t_2) & \cdots & a_2(t_N) \\
\vdots & & \ddots & \vdots \\
a_M(t_1) & a_M(t_2) & \cdots & a_M(t_N)
\end{pmatrix}.
\]

The pattern functions, or EOFs

Now,
\[ D D^T = EA(EA)^T = E A A^T E^T. \]
Since by construction the rows of A are orthogonal to each other, we have that
\[ A A^T = L = \operatorname{diag}[\lambda_1, \lambda_2, \cdots, \lambda_M] \]
is a diagonal matrix. Thus
\[ D D^T = E L E^T, \]
where E is an orthogonal matrix, by construction. Therefore, this is simply a diagonalization of the matrix D D^T. Thus, to find E and the mode variances λ_j, j = 1, ..., M, all that is needed is computing

the eigenvalues and eigenvectors of the matrix D D^T. Since D D^T is an M × M symmetric matrix, the existence of M real eigenvalues and a complete set of orthogonal eigenvectors is guaranteed by fundamental linear algebra. This achieves the construction of the matrix E, i.e. of the pattern functions φ_j(x).

The time series mode functions, or Principal Components

Since E is orthogonal, the constraint D = EA implies
\[ A = E^T D, \]
which yields the principal components \(\vec a_1, \vec a_2, \cdots, \vec a_M\) as the rows of A. Note that the condition that the principal components are orthogonal to each other is recovered from the fact that L is the diagonalization of D D^T:
\[ A A^T = E^T D (E^T D)^T = E^T D D^T E = L. \]

Link to Singular Value Decomposition

Let D be an M × N matrix, N ≥ M.

Theorem 18 There exist two orthogonal matrices U and V in ℝ^{M×M} and ℝ^{N×N}, respectively, and a diagonal matrix Σ in ℝ^{M×N} such that
\[ D = U \Sigma V^T, \qquad
\Sigma = \begin{pmatrix}
\sigma_1 & 0 & \cdots & 0 & 0 & \cdots & 0 \\
0 & \sigma_2 & \cdots & 0 & 0 & \cdots & 0 \\
& & \ddots & & & & \\
0 & 0 & \cdots & \sigma_M & 0 & \cdots & 0
\end{pmatrix},
\]
where the diagonal elements σ_1, ..., σ_M are called the singular values of D.

Proof: Consider the augmented matrix
\[ B = \begin{pmatrix} O_M & D \\ D^T & O_N \end{pmatrix}, \]
where O_M, O_N are respectively the M × M and N × N matrices whose entries are all zeros. Since the matrix B is symmetric, there exist a set of real eigenvalues σ_1, σ_2, ..., σ_{M+N} and a complete set of eigenvectors \([\vec u_1^T, \vec v_1^T]^T, [\vec u_2^T, \vec v_2^T]^T, \cdots, [\vec u_{M+N}^T, \vec v_{M+N}^T]^T\) such that
\[ \begin{pmatrix} O_M & D \\ D^T & O_N \end{pmatrix} \begin{pmatrix} \vec u \\ \vec v \end{pmatrix} = \sigma \begin{pmatrix} \vec u \\ \vec v \end{pmatrix}, \]
or
\[ D \vec v = \sigma \vec u, \qquad D^T \vec u = \sigma \vec v. \]
Thus,
\[ D D^T \vec u = \sigma D \vec v = \sigma^2 \vec u, \]
i.e. σ² is an eigenvalue of the symmetric positive semi-definite matrix D D^T and \(\vec u\) is an associated eigenvector. Similarly, D^T D \(\vec v\) = σ² \(\vec v\), i.e. σ² is also an eigenvalue of D^T D with \(\vec v\) the associated eigenvector. Given that the rank of D cannot be larger than M (assumed ≤ N), only M of the σ values can be non-zero.

Let V = [\(\vec v_1, \vec v_2, \cdots, \vec v_N\)] be the matrix whose columns are the N orthogonal eigenvectors of D^T D, let U = [\(\vec u_1, \vec u_2, \cdots, \vec u_M\)] be formed by the M orthogonal eigenvectors of D D^T, and let Σ be the M × N diagonal matrix of singular values as above. We have
\[ DV = U\Sigma \implies D = D V V^T = U \Sigma V^T. \]
Also,
\[ U^T D V = U^T U \Sigma = \Sigma. \]
Thus, once the matrices U and V are formed through the eigenvectors of D D^T and D^T D, respectively, the matrix of singular values follows from this equation. Note that the sole knowledge of the eigenvalues of D D^T and D^T D does not uniquely determine the singular values; however, the fact that these two matrices are symmetric positive semi-definite guarantees that the singular values are real.

In the context of the EOF or principal component analysis discussed above, it is easy to see that the matrix U provides the spatial pattern functions, U = E, and the matrix of amplitude functions is given by A = ΣV^T. Thus, the EOF or principal component analysis is nothing but a simplified singular value decomposition. Note that in this context only the squared values λ_k = σ_k^2 have a physical meaning, i.e. they provide the variance associated with each time series mode or principal component.

The total variance is
\[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{M} \sum_{j=1}^{N} z_{ij}^2 = \frac{1}{N} \operatorname{trace}(D D^T) = \frac{1}{N} \sum_{i=1}^{M} \sigma_i^2, \]
i.e. the average of the principal component variances.
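A minimal Matlab sketch of this EOF computation via the SVD (the synthetic data matrix is generated only for illustration; in practice D holds the de-meaned observations):

% EOF / principal component analysis of an M-by-N data matrix via the SVD.
M = 10; N = 200;
D = randn(M,N);
D = D - repmat(mean(D,2), 1, N);   % remove the time average of each row
[U,S,V] = svd(D, 'econ');          % D = U*S*V'
E = U;                             % spatial patterns (EOFs)
A = S*V';                          % principal components (amplitude time series)
lambda = diag(S).^2;               % mode variances lambda_k = sigma_k^2
plot(cumsum(lambda)/sum(lambda),'o-')   % fraction of variance explained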

5.6 Problems

1. Recall that the condition number in the 2-norm is the ratio of largest to smallest singular value. The singular value decomposition of an n by n matrix takes the form
\[ A = U \Sigma V^T, \]
where U, V are n × n orthonormal matrices and Σ is an n × n diagonal matrix with the singular values σ_i on its diagonal.

a) Write a matlab routine to generate an n × n matrix with an arbitrary (known) condition number. You can proceed as follows. For a given condition number, condno,

• start by generating two random matrices U1, V1 (using the matlab function rand: >>U1 = rand(n,n));

• make the two matrices U1, V1 orthogonal by using QR factorization. This is done as follows. Execute [Q,R] = qr(U1) and set U = Q (repeat the same process to construct V); say why the constructed matrices U, V are both orthogonal.

• generate a diagonal matrix Σ whose diagonal elements are given by σ_ii = condno^{-(i-1)/(n-1)}, 1 ≤ i ≤ n. This can be accomplished in matlab with the command: >>SIGMA = diag(condno.^(-(0:1:n-1)/(n-1)));

• set A = UΣV^T. Compute the L2 condition number of the matrix A you generated by executing the command cond(A) to check whether it is equal to the one prescribed at the beginning.

Your matlab M-file can look like this:

function A = matgen(n, condno)

%Input size n and condition number condno

%output: an nxn random matrix A whose condition number is condno.

U = rand(n,n);

[Q,R] = qr(U);

U = Q;

... (fill in the blanks)

b) With n = 16 and for each condno = 1, 10^4, 10^8, 10^12, 10^16, use your matlab routine in (a) to generate an n × n random matrix A with condition number condno. Also generate a random vector X_true of length n, and compute the product >>b = A*X_true.

Solve AX = b by Gauss elimination, X=A\b in matlab. Determine the relative error ||X − X_true||/||X_true||. Explain how this is related to the condition number of A. Compute the residual ||b − AX||/||b||. Plot both the relative error and the residual as functions of the condition number, for condno = 1, 10^4, 10^8, 10^12, 10^16. Does the algorithm for solving AX = b appear to be backward stable, in other words, is the computed solution the exact solution to a nearby problem?

Note: Forward stability means that a small error in the solution ||X − Xtrue|| implies theresidual ||b − AX|| is small while backward stability means a small residual implies a smallerror in the solution.

c) Now solve Ax = b by inverting A: >>Ainv = inv(A), and then multiply b by A−1:>>X=Ainv*b. Again compare the relative errors and the residuals. Is the new algorithmbackward stable?

d) Redo b) and c) with n = 32 and with n = 64. Does the value of n matter at all?

Matlab Hint: to compute the p-norm of a vector or matrix X in matlab, for a given p, simplyexecute the built in matlab function norm: >>norm(X,p)

2. Consider the QR factorization
\[ A = Q \begin{pmatrix} R \\ O \end{pmatrix}, \]
where Q is an m × m orthogonal matrix, R is an n × n upper triangular matrix and O is an (m−n) × n matrix with zero entries (n ≤ m).

Let
\[ Q = \begin{pmatrix} Q_1 & Q_2 \end{pmatrix}, \]
where Q_1 is m × n and Q_2 is m × (m−n); both have orthonormal columns (that is, Q_i^T Q_i = I, i = 1, 2). The couple (Q_1, R) is often called the economy size QR decomposition. Type >>help qr in matlab.

a) Consider the problem of least squares approximation Ax ≈ b, where A is an m × n matrix and b ∈ ℝ^m (n ≤ m). By noting that ||Ax − b||_2^2 = ||Rx − Q_1^T b||_2^2 + ||Q_2^T b||_2^2 (why?), show that the problem of least squares approximation reduces to the solution of the n × n system
\[ Rx = Q_1^T b, \]
which constitutes an attractive procedure for solving the least squares problem. Show that R is non-singular if A has full rank.

Note: The "matrix division" X=A\b of matlab uses the QR decomposition, as outlined above, and returns the least squares solution when A is not a square matrix. Moreover, the condition number, defined as the ratio between the largest and the smallest singular values, can be extended to non-square matrices.

b) Show that cond(A^T A) = (cond(A))^2 and then argue why, when cond(A) is large, the QR factorization technique is more advantageous for least squares problems than the normal method, which solves instead A^T A X = A^T b; but for moderate cond(A) values the normal method is generally preferable. Hint: the normal method takes advantage of the Cholesky factorization.

3. Let A be a square matrix and consider its singular value decomposition A = UΣV, where U, V are orthogonal matrices and Σ is a diagonal matrix with diagonal entries s_1, s_2, ..., s_n.

(a) By noting that AV^T = UΣ, show that
\[ |s_i|\, |U_i|_2 \le \|A\|_2\, |V_i|_2, \qquad \text{for all } i = 1, 2, \cdots, n, \]
where U_i, V_i are the ith columns of the matrices U, V, respectively, |.|_2 is the Euclidean vector norm, and ||.||_2 is the associated subordinate matrix norm.

(b) Deduce that |s_i| ≤ ||A||_2 for all i = 1, 2, ..., n.

(c) Use the definition of the matrix subordinate norm
\[ \|A\|_2 = \max_{|X|_2 = 1} |AX|_2 \]
to show that
\[ \|\Sigma\|_2 = \max_{1 \le i \le n} |s_i| \]
for any given diagonal matrix Σ with diagonal entries s_1, s_2, ..., s_n.

(d) Show that ||A||_2 ≤ ||Σ||_2, where ||.||_2 is the Euclidean subordinate matrix norm and Σ is the diagonal matrix formed by the singular values of A.

(e) Deduce that ||A||_2 = max_{1≤i≤n} |s_i|.

4. Execute the following Matlab code, then report and comment on the results. Also put some comments in the provided blank spaces (fill in the blanks) to help document the code for users. For example, on the first line (Synopsis) you say what the code precisely does, e.g., what is the input and what is the output?

Hint: Use the matlab help command to learn about the commands that are not familiar to you.


%Synopsis:_This matlab code________________________________________________________________

%__________________________________________________________________________________________

D=[319.32 320.36 320.82 322.06 322.17 321.95 321.20 318.81 317.82 317.37 318.93 319.09

319.94 320.98 321.81 323.03 323.36 323.11 321.65 319.64 317.86 317.25 319.06 320.26

321.65 321.81 322.36 323.67 324.17 323.39 321.93 320.29 318.58 318.60 319.98 321.25

321.88 322.47 323.17 324.23 324.88 324.75 323.47 321.34 319.56 319.45 320.45 321.92

323.40 324.21 325.33 326.31 327.01 326.24 325.37 323.12 321.85 321.31 322.31 323.72

324.60 325.57 326.55 327.80 327.80 327.54 326.28 324.63 323.12 323.11 323.99 325.09

326.12 326.61 327.16 327.92 329.14 328.80 327.52 325.62 323.61 323.80 325.10 326.25

326.93 327.83 327.95 329.91 330.22 329.25 328.11 326.39 324.97 325.32 326.54 327.71

328.73 329.69 330.47 331.69 332.65 332.24 331.03 329.36 327.60 327.29 328.28 328.79

329.45 330.89 331.63 332.85 333.28 332.47 331.34 329.53 327.57 327.57 328.53 329.69

330.45 330.97 331.64 332.87 333.61 333.55 331.90 330.05 328.58 328.31 329.41 330.63

331.63 332.46 333.36 334.45 334.82 334.32 333.05 330.87 329.24 328.87 330.18 331.50

332.81 333.23 334.55 335.82 336.44 335.99 334.65 332.41 331.32 330.73 332.05 333.53

334.66 335.07 336.33 337.39 337.65 337.57 336.25 334.39 332.44 332.25 333.59 334.76

335.89 336.44 337.63 338.54 339.06 338.95 337.41 335.71 333.68 333.69 335.05 336.53

337.81 338.16 339.88 340.57 341.19 340.87 339.25 337.19 335.49 336.63 337.74 338.36];

%comment:_D is_a_________________________________________________

[M,N]=size(D) %This returns ____________________________________

figure

subplot(2,1,1)

contour(D) %comment:___________________________________________________

colorbar

C = D*D’; %comment:___________________________________________________

[V,L]=eig(C);

lambda = diag(L) % This displays_______________________in___________________order

A=V’*D; % This computes ________________________________________


Vn = V(:,M); %To select the ________________ associated with the __________________________

An = A(M,:); %This is the _______________________ associated with the _________________________

Df= Vn*An;

subplot(2,1,2)

contour(Df) %comment:___________________________________________________


Chapter 6

Finite difference and finite volume methods for transport and conservation laws

Foreword

The celebrated Chapman-Kolmogorov equation for a diffusion Markovian process reduces to the well-known Fokker-Planck equation [Gardiner, 2004]

∂p(z, t/y, t′)/∂t + ∇z · (A(z, t) p(z, t/y, t′)) = (1/2) Σ_{i,j} ∂²/(∂zi ∂zj) (B_{i,j}(z, t) p(z, t/y, t′)).  (6.1)

Here p(z, t/y, t′) is the probability density distribution of the random variable z at time t, given y at time t′, of the underlying Markovian process. ∇z is the gradient differential operator with respect to the variable z, A(z, t) is a vector function known as the drift, representing the deterministic dynamics of the process, and B = [B_{i,j}] is the diffusion matrix describing the Gaussianity or randomness of the process. When B = 0 the Fokker-Planck equation is also known as the Liouville equation, describing the evolution of the probability distribution of a random process undertaking deterministic dynamics. Taking the derivative with respect to the known variable y at time t′, instead, yields the famous backward equation [Gardiner, 2004]

∂p(x, t/y, t′)/∂t′ + A(y, t′) · ∇y p(x, t/y, t′) = −(1/2) Σ_{i,j} B_{i,j}(y, t′) ∂²/(∂yi ∂yj) p(x, t/y, t′).  (6.2)

Note that while the two partial differential equations above are given in terms of the variables z and t, or y and t′, respectively, the remaining variables ((y, t′) for the first equation and (x, t) for the second) can be treated as parameters and therefore ignored when we are only concerned with numerical or analytic solution methodology for these PDEs. This kind of equation is widespread in the applied physical sciences. The term involving the vector A is also known as a


transport process. In fluid mechanics, for example, it models the action of the flow field on the dynamical quantity under consideration, such as temperature, density, or momentum. It appears either in a conservative form

∂tq + ∇ · (Aq) = 0  (6.3)

as in the forward Fokker-Planck equation (6.1), or in advective form

∂tq + A · ∇q = 0  (6.4)

as in the backward equation (6.2). Also, under the obvious condition of ellipticity, the matrix B can be diagonalized and the associated diffusion operator is reduced to the more standard Laplace operator. Therefore, we are interested here in the numerical solution of the advection-diffusion equation

∂tq + A · ∇q = D∆q  (6.5)

or

∂tq + ∇ · (Aq) = D∆q,  (6.6)

which is a superposition of the transport equation, of conservative or advective type, and the diffusion equation

ut = D∆u,

also known as the heat equation. In this series of lectures we will discuss some standard numerical techniques for these types of equations. Special emphasis will be given to finite difference and finite volume methods for the advection and conservation equations in (6.4) and (6.3), respectively. We will treat in some detail the case when the advection field A depends on the solution q, which leads to shock formation and other types of singularities that are important in gas dynamics and other fields of practical importance.

Unless otherwise stated, from now on we consider only partial differential equations in the 2 variables (x, t), where −∞ < x < +∞ represents the space variable and t ≥ 0 is time. The solution is denoted by u(x, t).

6.1 Introduction to finite differences: The heat equation

We introduce some basics of the finite difference methodology for partial differential equations through the simple case of the heat or diffusion equation in 1 dimension,

ut = Duxx,

where D > 0 is a constant heat conduction or diffusion coefficient.

The finite difference method applied to the heat equation above starts with the approximation of the partial derivatives ut and uxx by their corresponding finite difference quotients. For a smooth function f(x) of the variable x, we have, according to the Taylor expansion,

f(x0 + h) = f(x0) + f′(x0)h + (1/2)f′′(x0)h² + ··· + (1/n!)f^(n)(x0)h^n + (1/(n+1)!)f^(n+1)(ξ)h^(n+1),


where h is a non-zero increment or displacement along the real line, starting from a fixed point x0, and ξ is between x0 and x0 + h. Recall that a function g is said to be a big O of h^p, and we write g = O(h^p), if

lim_{h→0} g(h)/h^p = Constant.

Assuming h > 0 and using the Taylor approximation for f(x0 + h) and f(x0 − h), the forward and backward difference formulas follow immediately.

Forward Formula:

f′(x0) = (f(x0 + h) − f(x0))/h − (1/2)f′′(ξ)h = (f(x0 + h) − f(x0))/h + O(h) ≈ (f(x0 + h) − f(x0))/h,  (6.7)

Backward Formula:

f′(x0) = (f(x0) − f(x0 − h))/h + (1/2)f′′(ξ)h = (f(x0) − f(x0 − h))/h + O(h) ≈ (f(x0) − f(x0 − h))/h,  (6.8)

Furthermore, the 3rd order Taylor approximation of the difference f(x0 + h) − f(x0 − h) yields the

Centered Formula:

f′(x0) = (f(x0 + h) − f(x0 − h))/(2h) − (1/6)·((f′′′(ξ1) + f′′′(ξ2))/2)·h²
       = (f(x0 + h) − f(x0 − h))/(2h) + O(h²) ≈ (f(x0 + h) − f(x0 − h))/(2h),  (6.9)

where x0 − h ≤ ξ1 ≤ x0 ≤ ξ2 ≤ x0 + h, whereas the 4th order Taylor approximation of the sum f(x0 + h) + f(x0 − h) leads to an approximation of the second order derivative f′′(x0).

Centered Formula for the second order derivative:

f′′(x0) = (f(x0 + h) − 2f(x0) + f(x0 − h))/h² − (1/24)(f^(4)(ξ1) + f^(4)(ξ2))h²
        = (f(x0 + h) − 2f(x0) + f(x0 − h))/h² + O(h²) ≈ (f(x0 + h) − 2f(x0) + f(x0 − h))/h².  (6.10)

The formulas on the far right-hand sides of (6.7) to (6.10) are only some of the most basic examples of finite difference approximations of the first and second order derivatives, to first and second order accuracy, respectively. (The forward and backward finite difference approximations in (6.7) and (6.8) are first order accurate and are therefore called first order approximations, while those in (6.9) and (6.10) are second order accurate and are called second order approximations.) Finite difference formulas of higher order and for higher order derivatives can be derived by similar manipulations of Taylor approximations or polynomial approximations (e.g. interpolation). Also, different combinations of points to the left or to the right of the point x0 can be considered separately.
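A quick, hedged way to see these orders in practice is to tabulate the errors for a decreasing sequence of h. The following minimal Matlab sketch (an assumed example: f(x) = exp(x) at x0 = 1, so that f′(x0) = f′′(x0) = e) prints successive error ratios, which should be close to 10 for the first order formula and close to 100 for the second order ones:

% Minimal sketch (assumed example) checking the orders of (6.7), (6.9), (6.10)
f = @(x) exp(x); x0 = 1;
h = [1e-1; 1e-2; 1e-3];
fwd = (f(x0+h) - f(x0))./h;                   % forward difference, O(h)
ctr = (f(x0+h) - f(x0-h))./(2*h);             % centered difference, O(h^2)
d2  = (f(x0+h) - 2*f(x0) + f(x0-h))./h.^2;    % centered 2nd derivative, O(h^2)
err = abs([fwd ctr d2] - exp(1));
disp(err(1:end-1,:)./err(2:end,:))            % ratios ~10, ~100, ~100
% for much smaller h, round-off error (Section 1.3) eventually takes over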

A finite difference method for a given partial differential equation (PDE) consists of the approximation of the partial derivatives of its (unknown) solution u by corresponding finite difference formulas of a certain order.


6.1.1 Explicit scheme for the heat equation

Consider the heat equation

ut = Duxx

on a finite rod x ∈ (0, L), with the initial condition u(x, 0) = u0(x) and boundary conditions u(0, t) = α(t) and u(L, t) = β(t), t ∈ [0, T]. Consider a discretization of the rectangle [0, L] × [0, T] into a finite number of nodes (xj, t^n), j = 0, 1, ..., M + 1, n = 0, 1, ..., N, such that xj = jh and t^n = n∆t, where h ≡ ∆x = L/(M + 1) and ∆t = T/N. The set of all points (xj, t^n) is called a grid or a mesh, while h and ∆t are respectively called the spatial grid size and the time step. Let u^n_j = u(xj, t^n). Using a forward finite difference approximation for the time derivative, combined with a centered formula for the second order spatial derivative, applied at each node (j, n), the heat equation can be rewritten as

(u^{n+1}_j − u^n_j)/∆t + O(∆t) = D (u^n_{j+1} − 2u^n_j + u^n_{j−1})/h² + O(h²).  (6.11)

Let w^n_j be the finite sequence of real numbers satisfying the following difference scheme,

w^{n+1}_j = w^n_j + (D∆t/h²)(w^n_{j+1} − 2w^n_j + w^n_{j−1}),  j = 1, ..., M, n = 0, 1, ..., N − 1,  (6.12)

obtained from (6.11) by dropping the small error terms O(∆t) and O(h²), known as the truncation error, and using the initial and boundary conditions

w^0_j = u0(xj),  w^n_0 = α(t^n),  w^n_{M+1} = β(t^n).

This is the main philosophy behind finite differences: to obtain an approximate solution of the given PDE at the interior grid points,

u(xj, t^n) ≈ w^n_j,  j = 1, ..., M, n = 1, 2, ..., N.

As we shall see below, for the scheme (6.12) for the heat equation we have

u(xj, t^n) = w^n_j + O(∆t) + O(h²),

i.e., the approximation is first order in time and second order in space, which is a statement of convergence as well, as ∆t, h −→ 0; the convergence is proved below, after the introduction of the notions of consistency and accuracy. For a given finite discretization, an approximation of the solution u(x, t) at an interior point (x, t) ∈ (0, L) × (0, T) that is not part of the grid can be obtained by 2D interpolation of the discrete solution w^n_j. Ideally, the order of interpolation would match that of the numerical approximation, to obtain an optimal approximation in terms of efficiency and accuracy.

Note that the initial condition u0(x) and the boundary conditions α(t), β(t) provide the starting points w^0_j, j = 0, 1, ..., M + 1, and the lateral grid point values w^n_0 and w^n_{M+1}, n = 1, 2, ..., N, respectively, while the numerical scheme is evolved from time step to time step using the formula (6.12) to provide the solution at time t^n + ∆t in terms of the solution at time t^n. Such a scheme is called explicit, as opposed to an implicit scheme where obtaining w^{n+1}_j from w^n_j involves the inversion of a linear or non-linear system of algebraic equations (see next subsection).


Definition 10 (Consistency): A numerical scheme, L(w^n_j) = 0, for a given PDE, P(u(x, t)) = 0, is said to be consistent if the truncation error

τ_h(h, ∆t) ≡ P(u(x, t)) − L(u^n_j) −→ 0,  as h, ∆t −→ 0.

The scheme is said to be consistent of order (p, q) if

τ_h(h, ∆t) = O(h^p) + O(∆t^q).

For the forward explicit scheme (6.12) for the heat equation we have

ut − Duxx − [ (u^{n+1}_j − u^n_j)/∆t − D (u^n_{j+1} − 2u^n_j + u^n_{j−1})/h² ] = O(h²) + O(∆t),

i.e., the forward scheme for the heat equation is consistent of order (2, 1).

Definition 11 (Stability): A numerical scheme, L(w^n_j) = 0, for an evolution partial differential equation on [0, T] is said to be stable if the discrete solution satisfies

max_{j=1,...,M} |w^n_j| ≤ C,  for all n = 1, ..., N,

where C > 0 is a constant independent of the grid sizes h, ∆t.

Theorem 19 (Convergence, Lax equivalence theorem) If the original PDE problem is well posed, then the discrete solution of the numerical scheme converges to the solution of the PDE as h, ∆t −→ 0 if and only if the scheme is consistent and stable.

Proof: Here we assume that both the PDE and the numerical scheme are linear. The extension to nonlinear PDEs and non-linear schemes is more complex and will be addressed via finite volumes. Let u(x, t) be the solution of the corresponding PDE and w^n_j the discrete solution of the numerical scheme. Let u^n_j = u(xj, t^n). The linear numerical scheme can be written as

w^{n+1} = L_{h,∆t} w^n,

where w^n = (w^n_j) is the vector representing the discrete solution and L_{h,∆t} is the linear operator (a matrix) associated with the numerical scheme, which depends on the grid size parameters h and ∆t. For the forward scheme (6.12) for the heat equation, we have

(L_{h,∆t} w^n)_j = w^n_j + (D∆t/h²)(w^n_{j+1} − 2w^n_j + w^n_{j−1}).

Note that the stability requirement implies that the operator L and all its powers stay bounded as N, M −→ +∞ (or equivalently as h, ∆t −→ 0), i.e.,


Stability =⇒ ||L_{h,∆t}||^n ≤ C,  for all N, M and all 0 ≤ n ≤ N.

To satisfy such a condition it suffices to have

||L_{h,∆t}|| ≤ 1.

However, this latter condition can be relaxed to

||L_{h,∆t}|| ≤ 1 + O(∆t).

By the consistency requirement, we have

u^{n+1} = L_{h,∆t} u^n + O(∆t²) + O(h²)∆t.

Thus,

u^{n+1} − w^{n+1} = L_{h,∆t}(u^n − w^n) + O(∆t²) + O(h²)∆t.

Using basic linear algebra we have

||u^{n+1} − w^{n+1}|| ≤ ||L_{h,∆t}|| ||u^n − w^n|| + O(∆t²) + O(h²)∆t,

and by induction on n we arrive at

||u^{n+1} − w^{n+1}|| ≤ ||L_{h,∆t}||^{n+1} ||u^0 − w^0|| + (||L_{h,∆t}||^n + ||L_{h,∆t}||^{n−1} + ··· + ||L_{h,∆t}|| + 1)(O(∆t²) + O(h²)∆t).

From the initial condition, we have u^0 − w^0 = 0. Thus,

||u^{n+1} − w^{n+1}|| ≤ ((1 + O(∆t))^n + (1 + O(∆t))^{n−1} + ··· + (1 + O(∆t)) + 1)(O(∆t²) + O(h²)∆t)
 ≤ [((1 + O(∆t))^{n+1} − 1)/O(∆t)] (O(∆t²) + O(h²)∆t) ≤ (e^T − 1)(O(∆t) + O(h²)) −→ 0,  as ∆t, h −→ 0,

and the rate of convergence is the same as the order of consistency, i.e., linear in time and quadratic in space.

Numerical tests and stability of the forward scheme

Consider the PDE

ut = (1/16) uxx,  x ∈ (0, 1), t ∈ (0, T),
u(x, 0) = sin(2πx),  (6.13)
u(0, t) = u(1, t) = 0.

The exact analytical solution of this PDE is given by

u(x, t) = e^{−π²t/4} sin(2πx).

The Matlab code for solving this problem using the forward/explicit scheme (6.12) is given below in (6.14), and the results obtained with two different time-step sizes are plotted in Figure 6.1. The spatial discretization consists of 11 grid points and the time step is ∆t = 0.02 for the top panel and ∆t = 0.2 for the bottom panel. With ∆t = 0.02, the numerical scheme (6.12) provides an accurate


solution for this problem, while the larger time step value ∆t = 0.2 leads to a numerical solution that grows without bounds. Such behavior is known as a numerical instability. In fact, we will show below that the forward scheme (6.12) is conditionally stable; it is stable only for relatively small ∆t values: the largest eigenvalue of the linear operator associated with the forward scheme (6.12) is smaller than or equal to one, up to O(∆t), provided D∆t/h² ≤ 1/2. Instead of going through the tedious task of computing the eigenvalues of the matrix, we use an alternative methodology for the stability of difference schemes known as the von Neumann analysis.

%%% Forward scheme for the heat equation:
%%% INPUT
%%% Diffusion coefficient:
mu = 1/16;
%%% Grid and time step; the boundary values u(0,t)=u(1,t)=0 are kept fixed
X=1; M=10; Tend=4;
h=1/(M+1);
Dt=0.02;
x= 0:h:X;      %%% x(1) = 0, x(2) = h, ..., x(M+1) = X - h, x(M+2) = X
wn=sin(2*pi*x);
time=0;
mu = mu*Dt/h^2;
while(time<Tend)
    wn(2:M+1) = wn(2:M+1) + mu*(wn(3:M+2)-2*wn(2:M+1)+wn(1:M));
    time=time+Dt;
end
figure(2)
xx=0:1/1000:1;
plot(xx, exp(-time/4*pi^2)*sin(2*pi*xx))
hold on
plot(x,wn,'x','linewidth',2)

Matlab code for the explicit scheme for the heat equation. (6.14)

6.1.2 Stability of the forward scheme: von Neumann analysis

Consider the forward in time, centered in space numerical scheme for the heat equation,

w^{n+1}_j = w^n_j + (∆tD/∆x²)(w^n_{j+1} − 2w^n_j + w^n_{j−1}).

Consider simple solutions of this difference equation of the form

w^n_j = ρ^n e^{2πijlh},  (6.15)

where i = √−1 and l is an integer. This can be thought of as a discrete version of the Fourier harmonics, and the amplitude ρ is known as the amplification factor.

Figure 6.1: Explicit scheme for the heat equation. Numerical solution (crosses) compared to the exact solution (solid) for (6.13) at time t = 4 with 11 spatial grid points. Top: ∆t = 0.02. Bottom: ∆t = 0.2.


Theorem 20 (von Neumann) A numerical scheme for an evolution equation is stable if and only if the associated largest amplification factor satisfies

|ρ| ≤ 1 + O(∆t).

We skip the details of the proof here, but in a nutshell it relies on the fact that, in the linear case, the largest eigenvalue of the difference scheme matches the von Neumann amplification factor.

Inserting the expression for w^n_j in (6.15) into the forward scheme (6.12) yields

ρ = 1 + (D∆t/h²)(e^{2πilh} − 2 + e^{−2πilh}) = 1 − (2D∆t/h²)(1 − cos(2πlh)),

i.e.,

|ρ| ≤ 1 ⇐⇒ (2D∆t/h²) sin²(πlh) ≤ 1 ⇐⇒ ∆t ≤ h²/(2D).

Thus, the difference scheme (6.12) is conditionally stable. For the example (6.13), with D = 1/16 and h = 1/11, we have stability if and only if

∆t ≤ 16/(2 × 11²) ≈ 0.06,

which explains the results in Figure 6.1.

The requirement that the time step be as small as h² is very bad news for this method, especially if we want to integrate over a long period of time. Below we introduce the implicit scheme, which deals with this difficulty.

6.1.3 Implicit scheme for the heat equation

Instead of approximating the derivatives of the heat equation at time t^n using a forward finite difference in time, let us instead consider an approximation at time t^{n+1} and use a backward difference in time. We arrive at the implicit/backward scheme for the heat equation,

w^{n+1}_j = w^n_j + (D∆t/h²)(w^{n+1}_{j+1} − 2w^{n+1}_j + w^{n+1}_{j−1}).  (6.16)

Note that because w^{n+1}_j is not given "explicitly" in terms of w^n_j, the time evolution of this scheme necessitates the inversion of a linear system of equations:

(I − µA)X^{n+1} = X^n + F^{n+1},

where µ = D∆t/h²,

X^n = (w^n_1, w^n_2, ..., w^n_M)^T,  F^{n+1} = µ (α(t^{n+1}), 0, ..., 0, β(t^{n+1}))^T,

and A is the M × M tridiagonal matrix with −2 on the diagonal and 1 on the sub- and super-diagonals,

A = tridiag(1, −2, 1).

It is clear from its derivation that this new scheme is consistent and is first order in time and second order in space.
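A minimal Matlab sketch of the implicit scheme (6.16), written for the test problem (6.13) where α = β = 0 so that the boundary term F^{n+1} vanishes, is given below (a sketch only, using a sparse tridiagonal solve; it is not the code used for the figures):

% Minimal sketch: backward/implicit scheme (6.16) for problem (6.13)
D = 1/16; M = 10; h = 1/(M+1); Dt = 0.2; Tend = 4;
x  = (1:M)'*h;                          % interior grid points
w  = sin(2*pi*x);                       % initial condition
mu = D*Dt/h^2;
e  = ones(M,1);
A  = spdiags([e -2*e e], -1:1, M, M);   % tridiagonal second-difference matrix
B  = speye(M) - mu*A;                   % (I - mu*A)
for n = 1:round(Tend/Dt)
    w = B \ w;                          % one implicit step (F^{n+1} = 0 here)
end
max(abs(w))                             % stays bounded even though Dt = 0.2
exp(-pi^2*Tend/4)                       % exact amplitude at t = Tend, for comparison

Note that the run uses ∆t = 0.2, the value for which the explicit scheme blew up, and the solution remains bounded, consistent with the unconditional stability shown next.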

Let us look at the stability properties of (6.16) using von Neumann's method. The amplification factor in this case satisfies

ρ = 1 − (4D∆t/h²) ρ sin²(πlh),

i.e.,

0 ≤ ρ = 1/(1 + (4D∆t/h²) sin²(πlh)) ≤ 1,  for all D ≥ 0,

independently of the values of ∆t and h. The implicit scheme is unconditionally stable.

Because of this unconditional stability property, the implicit scheme appears to be much superior to its explicit counterpart: in principle it can be run with an arbitrarily large time step and still provide sensible results, but it is much more expensive in terms of numerical operations per time step. Moreover, because it is only first order accurate in time, the time step required to achieve an accuracy on the order of h² for a given spatial grid size h is ∆t ≈ h², i.e., as small as the time step required to achieve numerical stability with the explicit scheme. Ideally, we want a method which is both accurate and stable for values of ∆t at least as large as h. The Crank-Nicholson scheme described below is one such method.

6.1.4 The Crank-Nicholson scheme

Crank-Nicholson's scheme combines the forward/explicit and backward/implicit schemes in (6.12) and (6.16) to provide a method which is both second order in time and space and unconditionally stable. It is obtained via a straight average of the two schemes (6.12) and (6.16):

w^{n+1}_j = w^n_j + (µ/2)(w^{n+1}_{j+1} − 2w^{n+1}_j + w^{n+1}_{j−1} + w^n_{j+1} − 2w^n_j + w^n_{j−1}).  (6.17)

Notice that the Crank-Nicholson scheme is implicit and, as with the backward method, it involves the solution of a linear system at each iteration:

(I − (1/2)µA)X^{n+1} = (I + (1/2)µA)X^n + (1/2)(F^{n+1} + F^n).
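A minimal Matlab sketch of one Crank-Nicholson run, again for the test problem (6.13) with α = β = 0 so that the F terms vanish, could look as follows (a hedged sketch, not the notes' own code):

% Minimal sketch: Crank-Nicholson scheme (6.17) for problem (6.13)
D = 1/16; M = 10; h = 1/(M+1); Dt = 0.2; Tend = 4;
x  = (1:M)'*h;  w = sin(2*pi*x);
mu = D*Dt/h^2;  e = ones(M,1);
A  = spdiags([e -2*e e], -1:1, M, M);
Bl = speye(M) - 0.5*mu*A;               % (I - mu*A/2)
Br = speye(M) + 0.5*mu*A;               % (I + mu*A/2)
for n = 1:round(Tend/Dt)
    w = Bl \ (Br*w);
end
max(abs(w - exp(-pi^2*Tend/4)*sin(2*pi*x)))   % error at t = Tend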

Pb. 1 Show that the Crank-Nicholson scheme is consistent to second order in both time and space and that it is unconditionally stable. Hint: To show that it is second order accurate in both time and space, consider a Taylor expansion in both time and space of the solution u(x, t) about the cell center (xj, t^n + ∆t/2), highlighting the fact that the Crank-Nicholson method is centered in both time and space.


6.2 Time splitting methods

Time splitting is a useful technique which consists of breaking down a complex PDE into a few simple parts for which numerical schemes are easily constructed and analyzed. For illustration we consider an advection-diffusion equation in one space dimension, a reduced model for the Fokker-Planck equation,

ut + a(x, t)ux = Duxx.

After discretization of the spatial derivatives, using the appropriate difference schemes or some other technique, we arrive at a linear system of differential equations with respect to time,

(d/dt) w = Aw + Bw,  (6.18)

where −A is the discrete advection operator, B is the discrete diffusion operator, and w(t) = (wj(t))_{1≤j≤M} ≈ (u(xj, t))_{1≤j≤M}. Time splitting consists of dividing this linear system into two natural systems that are integrated separately and successively during each time step, each corresponding to the operators A and B, respectively. To integrate the discrete system (6.18) from t to t + ∆t, we proceed as follows.

The time splitting algorithm:

1. Let w^n_j = wj(t^n) be given at time t^n.

2. Solve (d/dt) w¹ = Aw¹ on [t, t + ∆t], with w¹(t) = w(t).

3. Solve (d/dt) w² = Bw² on [t, t + ∆t], with w²(t) = w¹(t + ∆t).

4. Set w(t + ∆t) = w²(t + ∆t), t = t + ∆t, and proceed to step 2.

The main advantage of the time splitting method is that it permits the use of numerical schemes that are known to converge, and are perhaps readily implemented, for each one of the differential operators separately, e.g. the advection operator, for which various numerical schemes will be designed below, and the diffusion operator introduced above. However, the splitting methods introduce splitting errors which limit the overall order of accuracy to first order, as revealed by the consistency analysis performed next.

Let us analyze the consistency of the splitting methodology to see whether this is a sensible method to use in practice. Assume that A and B are two linear, time-independent operators, as in the case of the advection-diffusion problem. According to the theory of linear systems of differential equations, the solution to the full linear system (6.18) is given by

w(t + s) = e^{s(A+B)} w(t) = (I + s(A + B) + (s²/2)(A + B)² + O(s³)) w(t)
 = (I + s(A + B) + (s²/2)(A² + B² + AB + BA) + O(s³)) w(t),  (6.19)

for s small (thinking of s = ∆t). One step of the splitting scheme yields

w̃(t + s) = e^{sB} w²(t) = e^{sB} e^{sA} w(t) = (I + sB + (s²/2)B² + O(s³))(I + sA + (s²/2)A² + O(s³)) w(t)
 = (I + s(A + B) + (s²/2)(A² + B² + 2BA) + O(s³)) w(t).  (6.20)

Thus w̃(t + ∆t) − w(t + ∆t) = O(∆t³) if AB = BA, i.e., if A and B commute with each other, and w̃(t + ∆t) − w(t + ∆t) = O(∆t²) if A and B do not commute, which is generally the case in practice. Therefore the time splitting method is only first order accurate in time, but surprisingly it often yields second order accurate results in practice. Nevertheless, a more elaborate version of the time splitting method, which is formally second order accurate regardless of the commutativity of the operators A and B, was introduced by Strang and is therefore known as the Strang splitting method. The Strang splitting method consists of symmetrizing the splitting operation by introducing an extra step into the time splitting algorithm, where one of the two operators is solved twice with one half time step. The Strang splitting method is given next.

Strang-Splitting

1. Let w^n_j = wj(t^n) be given at time t^n.

2. Solve (d/dt) w¹ = Aw¹ on [t, t + ∆t/2], with w¹(t) = w(t).

3. Solve (d/dt) w² = Bw² on [t, t + ∆t], with w²(t) = w¹(t + ∆t/2).

4. Solve (d/dt) w³ = Aw³ on [t + ∆t/2, t + ∆t], with w³(t + ∆t/2) = w²(t + ∆t).

5. Set w(t + ∆t) = w³(t + ∆t), t = t + ∆t, and proceed to step 2.

Notice that, in practice, when this algorithm is called successively in time, only the first and last time steps need to involve half time steps: performing step 4 followed by step 2 of the next time step is equivalent to performing one full time step with the operator A. This may explain why the standard time splitting is practically second order.
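The splitting error can be observed directly on small matrices. The following minimal Matlab sketch (random non-commuting matrices are used as an assumed stand-in for the discrete advection and diffusion operators) compares the one-step error of plain splitting, e^{sB}e^{sA}, with that of Strang splitting; the errors should scale roughly like s² and s³, respectively:

% Minimal sketch (assumed example): one-step splitting errors
rng(1); n = 6;
A = randn(n); B = randn(n); w = randn(n,1);
for s = [1e-1 1e-2 1e-3]
    exact  = expm(s*(A+B))*w;
    lie    = expm(s*B)*expm(s*A)*w;                    % standard splitting step
    strang = expm(0.5*s*A)*expm(s*B)*expm(0.5*s*A)*w;  % Strang step
    fprintf('s = %.0e: splitting error = %.1e, Strang error = %.1e\n', ...
            s, norm(lie-exact), norm(strang-exact))
end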

Pb. 2 Show that the Strang-splitting method is second order accurate.

6.3 Introduction to quasi-linear equations and scalar conservation laws

A quasi-linear equation is a partial differential equation of the form

ut + a(x, t, u)ux + b(x, t, u) = 0. (6.21)


If, in addition, the coefficient a is independent of u and b is linear in u, then the equation is said to be linear. For practical purposes, more often than not, our 'computational domain' will be the finite interval 0 ≤ x ≤ 1, for which a boundary condition is needed at least at one end of the domain. Also, we assume that the equation in (6.21) is supplemented by an initial condition at t = 0:

u(x, 0) = u0(x).  (6.22)

The couple PDE + initial condition, (6.21) and (6.22), is called the Cauchy problem.

6.3.1 Prototype examples

Our main focus here is on the following two prototypes of equations: advection equations,

ut + a(x, t)ux = 0,  (6.23)

and scalar conservation laws,

ut + (f(x, t, u))x = 0,  (6.24)

both because they are very important in many applications and because they are simple prototypes for more general and more complex models used in applications. For instance, in the general area of fluid mechanics, the advection equation is used as a model for the transport of tracers, such as temperature, density, or the concentration of a certain chemical, by the fluid flow, when the latter does not change with changes in the tracer u. It is widely used in biology and atmosphere-ocean sciences to model the concentration of pollutants and other substances in the presence of wind and/or ocean currents. The term a(x, t) represents the flow velocity, i.e., the wind or the stream of water, called the advection velocity or speed of propagation, and u(x, t) is some measure of the tracer, called the advected variable. More often, advection models involve two to three space dimensions:

ut + a1(x, y, z, t)ux + a2(x, y, z, t)uy + a3(x, y, z, t)uz = 0

or

ut + a(x, y, z, t) · ∇u = 0,

where a = (a1, a2, a3) represents the three dimensional velocity field and ∇ is the gradient operator. Here we consider only the one space dimension case; the extension to higher dimensions is mostly technical. In some applications the advection coefficient is stochastic, i.e., depends on a random variable, or the PDE/conservation law is forced by a stochastic forcing. Solving such a stochastic model numerically can be easily handled by the techniques developed here.

The conservation law equation, on the other hand, models the evolution of a conservative quantity, that is, a quantity whose variation inside a closed domain is equal to its flux across the boundaries, i.e., the amount which flows in minus the amount which flows out of the closed domain. In one space dimension this can be stated as follows. Let u(x, t) be the concentration density of such a conserved quantity within an interval [x, x + dx]. Let f(x, t, u) be the flux of u at the extremity x and f(x + dx, t, u) the flux at x + dx. This is illustrated in Figure 6.2 for both the 1D and 2D cases. By assuming that the rate of change of the total quantity ∫_x^{x+dx} u(ξ, t) dξ is equal to the total flux through x and x + dx, we can write

∂t ∫_x^{x+dx} u(ξ, t) dξ = −f(x + dx, t, u(x + dx, t)) + f(x, t, u(x, t)).  (6.25)

Figure 6.2: The integral form of the conservation law states that the total rate of change of u is compensated by the flux in minus the flux out.

Notice that the minus sign in front of f(x + dx, t, u) guarantees that the flux at the right end contributes as an inflow when f is negative (flow directed into the interval) and as an outflow when f is positive; the opposite is true for the flux at the left end. The equation (6.25) is known as the integral form of the conservation law. Notice that no regularity with respect to x of u or f is required when we derive this equation. As we will see later, this has some important consequences in designing numerical schemes for solving such equations.

Dividing both sides of (6.25) by dx and letting dx go to zero yields the differential form of the conservation law in (6.24), provided both u and f are smooth. However, in many textbooks and research papers, only the differential notation is used even when u or f are not smooth; in this case u is no longer a solution in the classical sense but only in a weak sense, which will be clarified below.

Another way to derive (6.24) from (6.25) is by noting that

−f(x + dx, t, u(x + dx, t)) + f(x, t, u(x, t)) = −∫_x^{x+dx} (f(y, t, u))_x dy,

which leads to the equality of the integrands because the length dx is arbitrary. This is actually the way the conservation law generalizes to 2 space dimensions and higher. In fact, consider a closed domain Ω of R² and let ∂Ω be its boundary. Then equating the total rate of change of u in Ω to the total flux of u across the boundary of Ω yields

∂t ∫_Ω u(x, y, t) dxdy = ∫_{∂Ω} F(x, y, t, u) · n dΓ,

where F is the flux vector and n is the unit normal vector to ∂Ω, directed to the inside of Ω. Invoking the divergence theorem yields

∂t ∫_Ω u(x, y, t) dxdy = −∫_Ω ∇ · F(x, y, t, u) dxdy

for all bounded domains Ω. This yields the conservation law in differential form,

ut + ∇ · F(x, y, t, u) = 0.

Links between the advection and the conservation law equations

The advection and the conservation law equations are intimately interconnected. In the case when f = f(u), the conservation equation (6.24) can be rewritten in the advective form as

ut + (df(u)/du) ux = 0,

and when a is constant with respect to x the advection equation can also be viewed as a conservation law. Moreover, note that in general every quasi-linear equation can be written as a conserved part plus a forcing,

ut + (f(t, x, u))x + c(x, t, u) = 0.

Such equations are sometimes called balance laws and they are widely used in practice. The search for adequate, numerically well-balanced schemes for this kind of equation is a very active research area.

Finally, we note that the advection equation in (6.23) is linear, while the conservation equation in (6.24) can be non-linear when ∂uf depends on u. A very commonly used prototype example of a nonlinear conservation law is the celebrated Burger's equation,

ut + (1/2)(u²)x = 0,  (6.26)

which becomes

ut + uux = 0

when written in advective form.

6.3.2 Solutions by the method of characteristics

Consider the quasi-linear equation in (6.21). Let x = x(t) be a parametric curve in the (x, t) plane such that ẋ = a(x, t, u(x(t), t)), where u(x(t), t) ≡ z(t) is the solution to (6.21) along this curve. Using the chain rule and plugging into the equation in (6.21) yields

ż = ∂u/∂t + (∂u/∂x) ẋ = ∂u/∂t + a(x, t, u) ∂u/∂x = −b(x, t, u) = −b(x, t, z),


i.e., finding a solution of the quasi-linear equation reduces to solving the following system of first order ordinary differential equations:

ẋ = a(x, t, z),
ż = −b(x, t, z),  (6.27)
x(0) = x0,  z(0) = u(x0, 0) = u0(x0).

Equations (6.27) are known as the characteristic equations, and the resulting solution curves x = x(t), x(0) = x0, are called characteristic curves.
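When a and b are not as simple as in the examples below, the characteristic system (6.27) can simply be integrated numerically. A minimal Matlab sketch (an assumed example: a(x, t, z) = 1 + 0.5 sin(x), b = 0, u0(x) = exp(−x²)) is:

% Minimal sketch (assumed example): integrating the characteristic ODEs (6.27)
a  = @(x,t,z) 1 + 0.5*sin(x);
b  = @(x,t,z) 0;
u0 = @(x) exp(-x.^2);
figure; hold on
for x0 = -3:0.5:3
    rhs = @(t,y) [a(y(1),t,y(2)); -b(y(1),t,y(2))];   % y = [x; z]
    [t,y] = ode45(rhs, [0 2], [x0; u0(x0)]);
    plot(y(:,1), t)    % each curve carries the constant value u = u0(x0) since b = 0
end
xlabel('x'), ylabel('t'), title('characteristic curves dx/dt = a')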

Example 1: Solution of the advection equation
For simplicity we assume that the advection speed a is constant, in which case the advection equation reduces to

ut + aux = 0.

The characteristic equations for this simple example are

ẋ = a,
ż = 0,

whose solution is x(t) = x0 + at, z(t) ≡ u(x(t), t) = u0(x0). Two key points should be noted here: i) the characteristic curves are straight lines, and ii) the solution u is constant along the characteristic lines. The characteristic curves for the advection equation are sketched in Figure 6.3 for both a > 0 and a < 0. Note that when a > 0 the characteristics are directed to the right and when a < 0 they are directed to the left. In some sense the sign of a indicates the direction of propagation of information. In fact, the advection equation is also called the one-way wave equation, where a is the speed of propagation of the wave.

To find the solution u(x, t) at an arbitrary point (x, t) in the x-t plane, one needs to follow the characteristic line passing through (x, t) back to its original point at t = 0, namely (x0, 0) with x = x0 + at. This leads to

u(x, t) = u0(x0) = u0(x − at).  (6.28)

Example 2: Burger's equation
The Burger equation constitutes a somewhat more complex example of a quasi-linear PDE. However, we can still, in principle, construct exact solutions using the method of characteristics.

The system of characteristic equations for Burger's equation is given by

ẋ = z,
ż = 0.

The characteristic solution is thus given by

u(x(t), t) = u0(x0),  where x(t) = x0 + u0(x0)t.  (6.29)

Again, note that the characteristics are straight lines and the solution is constant along the characteristic lines, with one important difference, however; the characteristic curves are no longer

Figure 6.3: Characteristic lines for the advection equation. When a > 0 the characteristics are directed to the right and when a < 0 they are directed to the left.

parallel to each other. As we will see below, this has rather "unpleasant" consequences. Provided the characteristic lines do not cross each other, which is guaranteed for at least a short period of time if the initial data u0 is continuous, the solution to Burger's equation is given by the following implicit formula:

u(x, t) = u0(x − u0(x0)t),  x = x0 + u0(x0)t.

The characteristic lines associated with Burger’s equation are illustrated in Figure 6.4.

Pb. 3 Use the method of characteristics to solve the following quasi-linear equations.

ut + xux = 0

and

ut + ux + x = 0.

Write down the solution u(x, t) and draw the characteristic curves.

6.3.3 Notion of shocks and weak solutions

Note that because the slope of the characteristic curves x = x0 + u0(x0)t for Burger's equation (6.29) increases when u0(x0) increases and decreases when u0(x0) decreases, the characteristic curves will accordingly diverge or converge toward each other (see Figure 6.4). Two convergent characteristic lines will ultimately cross each other at some point in the x-t plane. Beyond such an intersection point the characteristic solution is no longer valid, because the value of u(x, t) at such a point is not single-valued: one can follow back either one of the two intersecting characteristic lines.

One way to correct for this flaw is by stopping the characteristic lines as soon as they cross each other. Let Σ be the set of such crossing points in the x-t plane. The solution can then be defined on both sides of Σ by following the corresponding characteristic line back to its origin. Below we

Figure 6.4: Characteristic lines for Burger's equation. When u0′(x) > 0 the characteristics are divergent and when u0′(x) < 0 they converge toward each other.

will see that Σ is a parametric curve of the form x = s(t), as shown in Figure 6.5. It constitutes a curve of discontinuity for u(x, t). Such a curve is called a shock curve, by analogy with gas dynamics. One of the main difficulties in practice is to find the shock curve x = s(t). For any given (x1, t1) one has to determine whether two characteristic lines cross each other prior to time t1 along the curve x = x1.

Pb. 4 Show that a shock forms in the solution of Burger's equation if and only if the initial condition satisfies

u0′(x) < 0 for some x,

and that the first time a shock occurs is given by

T∗ = −1 / min_x u0′(x).
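This is not a substitute for the proof asked for in Pb. 4, but the formula is easy to check numerically: the following minimal Matlab sketch (assumed initial data u0(x) = sin(2πx)) detects the first time the characteristic map x0 → x0 + u0(x0)t stops being monotone and compares it with T∗:

% Minimal sketch (assumed example): shock formation time for Burger's equation
u0  = @(x) sin(2*pi*x);
du0 = @(x) 2*pi*cos(2*pi*x);
x0  = linspace(0, 1, 2001);
Tformula = -1/min(du0(x0));                  % T* = 1/(2*pi) for this u0
Tnum = NaN;
for t = linspace(0, 0.5, 5001)
    if any(diff(x0 + u0(x0)*t) <= 0)         % two characteristics have crossed
        Tnum = t; break
    end
end
fprintf('T* from the formula = %.4f, first crossing detected near t = %.4f\n', ...
        Tformula, Tnum)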

After a shock is formed, the solution u(x, t) is no longer valid in the classical sense, except for its restrictions to the sub-domains located on either side of the shock. Nevertheless, such a solution can be defined in the weak sense on the whole x-t plane.

As a motivation, let us first assume that u is a classical (i.e. smooth) solution of our conservation law

ut + f(u)x = 0,  u(x, 0) = u0(x).

Let φ(x, t) be an arbitrary smooth function which is compactly supported inside (−∞, +∞) × [0, +∞), i.e., φ(x, t) = 0 outside some bounded region of (−∞, +∞) × [0, +∞). Multiplying the conservation equation for u by φ and integrating with respect to (x, t) over the whole domain yields

∫_0^{+∞} ∫_{−∞}^{+∞} ut φ(x, t) dx dt + ∫_0^{+∞} ∫_{−∞}^{+∞} f(u)x φ(x, t) dx dt = 0,

Figure 6.5: The shock curve Σ separates two regions of the x-t plane where the solution is smooth and is uniquely determined by the characteristics. The solution is discontinuous across the shock curve.

and integration by parts leads to

∫_{−∞}^{+∞} [u(x, t)φ(x, t)]_{t=0}^{t=+∞} dx − ∫_0^{+∞} ∫_{−∞}^{+∞} u φt dx dt + ∫_0^{+∞} [f(u(x, t))φ(x, t)]_{x=−∞}^{x=+∞} dt − ∫_0^{+∞} ∫_{−∞}^{+∞} f(u) φx dx dt = 0,

which implies

∫_{−∞}^{+∞} u0(x)φ(x, 0) dx + ∫_0^{+∞} ∫_{−∞}^{+∞} u φt dx dt + ∫_0^{+∞} ∫_{−∞}^{+∞} f(u) φx dx dt = 0.

Definition 12 A function u(x, t) is said to be a weak solution of the conservation law

ut + (f(x, t, u))x = 0,  u(x, 0) = u0(x),

if for any test function φ(x, t) sufficiently smooth (e.g. C¹) with a compact support¹ in (−∞, +∞) × [0, +∞), the solution u(x, t) satisfies

∫_0^{+∞} ∫_{−∞}^{+∞} u(x, t)φt(x, t) dx dt + ∫_0^{+∞} ∫_{−∞}^{+∞} f(x, t, u)φx(x, t) dx dt = −∫_{−∞}^{+∞} u0(x)φ(x, 0) dx.  (6.30)

Remark: Note that, according to the definition of weak solutions given above, a C¹ function u(x, t) is a

¹i.e., there exists a bounded rectangle [a, b] × [t1, t2] ⊂ (−∞, +∞) × [0, +∞) such that φ(x, t) = 0 outside this rectangle.


solution to the conservation law in the classical sense if and only if it is a solution in the weak sense. Therefore the notion of weak solutions is more general, and the set of weak solutions contains discontinuous solutions as well as the classical C¹ solutions as a special subset. However, in some situations weak solutions are not unique, in the sense that one initial value problem can have more than one weak solution. Selecting the physically relevant solution can be tricky. For Burger's equation, for example, the physical solution coincides with the vanishing viscosity solution,

u(x, t) = lim_{ε→0} uε(x, t),

where

∂uε/∂t + uε ∂uε/∂x = ε ∂²uε/∂x²,

on the grounds that the inviscid Burger equation is a mathematical idealization of the viscous Burger equation when the viscosity is very small. However, given two weak solutions it is not easy to identify which one is the limiting viscosity solution and which one is not. The answer is provided by an extra condition, known as the entropy condition, satisfied by the physical solution. We will see this in detail below.

Note that the notion of weak solutions is very abstract, and it is not obvious how one would handle it in practice. Nevertheless, the theorem below provides the necessary ingredients both for constructing weak solutions and for gaining physical insight.

Theorem 21 (Rankine-Hugoniot Condition) Let Σ be a curve in (−∞, +∞) × (0, +∞) parametrized by x = s(t). Let u(x, t) be a C¹ function on both sides of Σ, but possibly not defined on and discontinuous across the curve Σ. Assume that u is a solution of the conservation law

ut + (f(u))x = 0

at all points (x, t) not on Σ. For each point (x1, t1) ∈ Σ we set

u±(x1, t1) = lim_{(x,t)→(x1,t1)±} u(x, t),

i.e. the limits from the right and from the left of Σ. Then u(x, t) is a weak solution of the conservation law if and only if the shock speed ṡ satisfies

ṡ ≡ ds/dt = (f(u+) − f(u−))/(u+ − u−).  (6.31)

The proof of this theorem is not terribly hard, but it is quite technical and therefore left as an exercise for the interested student.

6.3.4 Discontinuous initial data and the Riemann problem

As pointed out above, the notion of weak solutions permits the definition of discontinuous solutions of conservation laws. Here we propose to construct such weak solutions with discontinuous initial


data. For simplicity, we consider the Burger equation with discontinuous initial data consisting of two constant states, a left and a right state:

ut + uux = 0,

u0(x) = uL if x < 0,  uR if x > 0.  (6.32)

The problem in (6.32) is known as the Riemann problem.

We propose to construct simple weak solutions of the Riemann problem associated with Burger's equation, using the Rankine-Hugoniot condition (6.31).

Shock waves:
Consider a discontinuous function which consists of the two left and right constant states on both sides of a shock curve Σ : x = s(t), s(0) = 0:

u(x, t) = uR if x > s(t),  (6.33)
u(x, t) = uL if x < s(t).  (6.34)

According to the Rankine-Hugoniot condition we have

ṡ = (1/2)(uR² − uL²)/(uR − uL) = (uR + uL)/2.

Note that the shock speed in this case is constant and the curve Σ is a straight line. Also recall that the speed of the characteristic lines on each side of the shock is simply uL and uR, respectively. Therefore the speed of the shock is exactly halfway between the left and right characteristic speeds. This makes physical sense, and the associated weak solution is called a shock wave. A rough sketch of the shock wave solution is given in Figure 6.6 for both cases, uR < uL and uL < uR. Note that in the first case the characteristics run into the shock line and stop, and therefore the solution on both sides of the shock is consistent with the characteristic solution, while in the second case the characteristics diverge away and the region surrounding the shock is not reached by any of the characteristics.

Rarefaction waves:
Now assume that uL < uR, so the characteristics emanating from both sides of the discontinuity diverge from each other. In this case we can actually construct another weak solution to the Riemann problem associated with Burger's equation. First note that for t > 0 the function u(x, t) = x/t satisfies Burger's equation

ut + uux = 0.

For t > 0 consider

u(x, t) = uL if x/t < uL,  x/t if uL < x/t < uR,  uR if uR < x/t.  (6.35)

First note that u(x, t) is a solution of Burger's equation on each one of the designated parts of the domain and is continuous in the whole x-t plane (for t > 0). We can therefore show by simple integration by

Figure 6.6: Shock wave solutions for Burger's equation. The two cases, uL > uR (top) and uL < uR (bottom), are shown.

Figure 6.7: Rarefaction wave solution for Burger’s equation. The rarefaction fan is shown.

parts that indeed u(x, t) is a weak solution (see Pb. 5 below). This type of solution is called a rarefaction wave, by analogy with compressible gas dynamics, and the solution u = x/t in the middle is referred to as a rarefaction fan; see the illustration in Figure 6.7.

In summary, this shows that the Riemann problem for Burger's equation has at least two weak solutions when uL < uR: one is a shock wave and the other is a rarefaction wave. Therefore, weak solutions of conservation laws are in general non-unique.

Pb. 5 Let Ω be a bounded open set in the x-t plane. Let Σ be a curve passing through Ω, dividing it into two disjoint open subsets Ω1,2 such that Ω = Ω1 ∪ Σ ∪ Ω2. Let φ(x, t) be a smooth function (e.g. C¹) supported in Ω, that is, φ vanishes outside a compact set K ⊂ Ω. Let

u(x, t) = u1(x, t) if (x, t) ∈ Ω1,  u2(x, t) if (x, t) ∈ Ω2,

where u1, u2 are two C¹ functions satisfying the Burger equation in Ω1, Ω2, respectively. Use integration by parts to show that if, in addition, u is continuous across Σ, then

∫_Ω ( u φt + (1/2)u² φx ) dx dt = 0.

Deduce that (6.35) is a weak solution of Burger's equation.

6.3.5 Non-uniqueness of weak solutions and the entropy condition

As illustrated above with the example of a Riemann problem for Burger's equation, weak solutions are non-unique. However, common sense suggests that for any given Cauchy problem only one solution is physically relevant. We need an additional constraint to choose this physically relevant solution among all the weak solutions. In fact, in reality some viscosity is always associated with a given conservation law, so instead we have

∂uε/∂t + (f(uε))x = ε ∂²uε/∂x²,  (6.36)


and the zero viscosity limit, ε −→ 0, is just a convenient mathematical idealization. This is true in many physical applications! Therefore, one universally accepted criterion states that the physically relevant weak solution of the conservation law ut + (f(u))x = 0 is the limit of the solution to the viscous equation (6.36) as ε −→ 0. On the other hand, it is easy to show that, when combined with appropriate initial conditions, the latter has a unique solution. In practice, however, it is not clear how to establish whether a given weak solution is actually the vanishing viscosity limit or not. The answer to this question is provided by the concept of entropic solutions. In a nutshell, the entropy condition states that the physical solution satisfies an analogue of the second law of thermodynamics: a suitably defined (mathematical) entropy never increases. It remains to find which among the weak solutions of a given conservation law satisfies the so-called entropy condition. There are many versions of the entropy condition for a given conservation law

ut + (f(u))x = 0.

i) A somewhat abstract version of the entropy condition, but with a clear physical significance, goes as follows. Given a conservation law

ut + f(u)x = 0,

a convex function Φ(u) and a flux function Ψ(u) are called an entropy/entropy flux pair if

Ψ′(u) = Φ′(u)f′(u).

Given an entropy/entropy flux pair (Φ, Ψ), a solution u of the conservation law is said to be an entropic solution if it satisfies

∂Φ(u)/∂t + ∂Ψ(u)/∂x ≤ 0  (6.37)

in the weak sense. In many physical problems the entropy function Φ(u) is some measure of energy. It can be shown that for Burger's equation an entropy/entropy flux pair is given by

Φ(u) = u²,  Ψ(u) = (2/3)u³.

Here Φ(u) = u² can be thought of as an energy density and Ψ(u) is the energy flux. The most apparent merit of this formulation is that it generalizes 'easily' to systems of conservation laws.

ii) A more practical version of the entropy condition is the following:

u(x + z, t) − u(x, t) ≤ C (1 + 1/t) z,  for z, t > 0.  (6.38)

iii) Perhaps the most abstract version is due to Kruzkov. A (weak) solution of the conservation law is said to be an entropy solution in the sense of Kruzkov if, in addition, it satisfies

∫_0^{∞} ∫_{−∞}^{+∞} sign(u − k) [(u − k)φt + (f(u) − f(k))φx] dx dt ≥ 0  (6.39)

for all real constants k and all test functions φ ≥ 0. This version of the entropic solution is useful for theory.


iv) Finally, we give the Lax entropy condition for shocks: a shock solution of the Riemann problem with left and right states uL, uR is said to be an entropic shock if the shock speed ṡ satisfies

f′(uR) < ṡ < f′(uL).  (6.40)

It is easy to see that only one of the two shock solutions of the Riemann problem for Burger's equation is an entropic solution, namely the one associated with the case uR < uL (by the Lax entropy condition iv) as well as condition ii)). Moreover, it can be shown that the unique entropic solution in the case uL < uR is the rarefaction wave in (6.35).

6.4 Finite difference schemes for the advection equation

We start by discussing some basic simple finite difference schemes for the advection equation

ut + aux = 0,

where a is a positive constant. In principle, these schemes are easily generalized to non-constant advection speeds with an arbitrary sign.

6.4.1 Some simple basic schemes

Throughout this chapter we will assume the following discretization of the space-time domain:

xj = j∆x,  t^n = n∆t,

where ∆x, ∆t > 0 are respectively the spatial and time step sizes. We denote by u^n_j the approximate/numerical solution approximating u(xj, t^n).

Perhaps the most obvious scheme to attempt for the advection equation, which unfortunately turns out to be unstable, is obtained by taking a first order forward finite difference in time and a centred difference in space:

u^{n+1}_j = u^n_j − (a∆t/(2∆x))(u^n_{j+1} − u^n_{j−1}).  (6.41)

This scheme is referred to below simply as the centred scheme. Other simple possibilities are to take a first order derivative in space either to the left or to the right, combined with the forward differencing in time, yielding the so-called first order upwind and downwind schemes, respectively:

u^{n+1}_j = u^n_j − (a∆t/∆x)(u^n_j − u^n_{j−1})  (6.42)

and

u^{n+1}_j = u^n_j − (a∆t/∆x)(u^n_{j+1} − u^n_j).  (6.43)

Note that these two schemes are also known as the upstream and downstream schemes, depending on the application: hydrodynamics or gas dynamics (ocean vs. atmosphere). The word upwind

Figure 6.8: Simple finite difference stencils for the advection equation: (a) forward in time, centred in space, (b) upwind, (c) downwind, (d) leap-frog.

refers to the fact that the finite differencing is performed in the direction opposite to the wind, and downwind when the difference scheme follows the wind direction. Accordingly, when a < 0 the scheme in (6.43) becomes upwind and the scheme (6.42) becomes downwind.

Warning: As we will see below, both the centred and the downwind schemes, (6.41) and (6.43), are not recommended in practice because they are both unstable.

A slightly more sophisticated scheme is the leap-frog scheme, which uses centred differences in both space and time:

u^{n+1}_j = u^{n−1}_j − (a∆t/∆x)(u^n_{j+1} − u^n_{j−1}).  (6.44)

The stencils for the four schemes listed above are given in Figure 6.8.
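A minimal Matlab sketch of the upwind scheme (6.42) (an assumed example with a > 0, periodic boundary conditions, and a smooth bump as initial data; not the notes' own code) is:

% Minimal sketch (assumed example): upwind scheme (6.42) for u_t + a u_x = 0
a = 1; L = 1; M = 200; dx = L/M;
x  = dx*(0:M-1)';                        % periodic grid
dt = 0.8*dx/a;                           % respects the stability limit dt <= dx/|a|
u  = exp(-100*(x-0.5).^2);               % initial bump
tend = 0.5;
for n = 1:round(tend/dt)
    u = u - a*dt/dx*(u - circshift(u,1));     % u_j - u_{j-1}: upwind for a > 0
end
uex = exp(-100*(mod(x - a*tend, L) - 0.5).^2);   % exact solution u0(x - a*t), wrapped
plot(x, u, 'x', x, uex, '-'), legend('upwind','exact')
% replacing circshift(u,1) by circshift(u,-1) gives the downwind scheme, which blows up

The run respects the stability/CFL condition discussed in the next subsections; repeating it with dt > dx/|a| produces the expected instability.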

6.4.2 Accuracy and consistency

Definition 13 Let Lh(uh) = 0 denote the numerical discretization of a given partial differential equation, denoted by L(u(x, t)) = 0, with a time step ∆t and grid spacing ∆x. The numerical scheme is said to be consistent if the truncation error

τh = L(u) − Lh(u)  (6.45)

satisfies

lim_{∆t,∆x→0} τh = 0.

The scheme is said to be consistent of order (p, q), or simply of order (p, q), if

τh = O((∆x)^p + (∆t)^q).

Examples:

Consider the advection equation

L(u) ≡ ut + aux = 0.

123

Page 125: Mathematics 449/549: Scientific computing

Using simple Taylor expansions, it is easy to see that the "centred scheme" (6.41) is consistent of order (2, 1), namely

ut + aux − [u(x, t + ∆t) − u(x, t)]/∆t − a[u(x + ∆x, t) − u(x − ∆x, t)]/(2∆x) = −(∆t/2) utt(x, η) − a((∆x)²/6) uxxx(ξ, t) = O(∆t + (∆x)²),

i.e., this scheme is first order accurate in time and second order in space, while the upwind and downwind schemes (6.42) and (6.43) are only first order in both space and time,

τh(upwind/downwind) = O(∆t + ∆x).

The leap-frog scheme (6.44), on the other hand, is second order in both space and time:

τh(leap-frog) = O((∆t)² + (∆x)²).

Pb. 6 Show that the leap-frog scheme (6.44) is second order accurate in both time and space.

6.4.3 Stability and convergence: the CFL condition and the Lax equivalence theorem

Definition 14 A numerical scheme for an evolution equation on a finite interval [0, T], of the form

u^{n+1}_j = Sh(u^n_j),

is said to be stable if there exists a constant C > 0 such that

||u^n|| ≤ C||u^0||

for a certain norm ||.|| in R^N, where N is the number of spatial grid points.

Convergence of the numerical scheme

Theorem 22 (Lax equivalence theorem) The numerical solution of a well posed linear problem converges to the solution of the continuous equation if and only if the numerical scheme is consistent and stable. The rate of convergence, or order of accuracy, of the numerical solution is equal to the order of the truncation error of the numerical scheme.

This elegant and powerful theorem is often summarized as follows:

consistency + stability = convergence.


von Neumann stability analysis

The study of the stability of a numerical scheme can be very tedious, but the use of Fourier analysis, when appropriate, simplifies it a great deal. This idea was first used by von Neumann. For simplicity, we assume that any discrete function fj, i.e., defined on the grid points xj = j∆x by its values fj = f(xj), can be expanded in discrete Fourier modes

fj = Σ_{l=0}^{N/2} ρl e^{ij∆x 2πl} + complex conjugate terms,

where i = √−1 and the ρl are complex Fourier coefficients. N here is the number of spatial grid points and N/2 is known as the Nyquist number. It represents the largest wavenumber representable on an N-point grid.

To simplify the notation we set

φl = 2πl∆x.

For the numerical solution u^n_j evolving in the discrete time t^n, we have

u^n_j = Σ_{l=0}^{N/2} ρ^n_l e^{ijφl}.

Note that the superscript n on ρ^n_l is an index, not a power.

Theorem 23 (von Neumann Stability) A numerical scheme for an evolution PDE is stable (in the sense of von Neumann) if and only if the ratio

ρl = ρ^{n+1}_l / ρ^n_l,

known as the amplification factor, satisfies

|ρl| ≤ 1 + O(∆t),  l = 0, ..., N/2.

Although the proof of this theorem is almost trivial, by Parseval's equality, it has a huge significance and a big impact on our way of studying numerical methods, because it is very easy to use, especially for linear problems. In fact, when the numerical scheme is linear, it is enough to consider solutions in the form of a single Fourier mode,

u^n_j = ρ^n_l e^{ijφl}.  (6.46)

Plugging (6.46) into

• the centred scheme (6.41) yields

ρ^{n+1} e^{ijφl} = ρ^n e^{ijφl} − ρ^n (µ/2) e^{ijφl} (e^{iφl} − e^{−iφl}),


where µ = a∆t/∆x. Thus the amplification factor ρ ≡ ρ^{n+1}/ρ^n satisfies

ρ = 1 − iµ sin(φl),

so that |ρ|² = 1 + µ² sin²(φl). For µ held fixed and ∆t > 0 sufficiently small, there exists φl such that |ρ|² > (1 + ∆t)². Therefore the centred scheme (6.41) is unstable, as suggested above.

• the backward (in space) first-order scheme (6.42) yields

ρ = 1 − µ(1 − e^{−iφl}) = 1 − µ(1 − cos(φl)) − iµ sin(φl),

|ρ|² = (1 − µ(1 − cos φl))² + µ² sin²(φl) = 1 − 2µ(1 − cos φl) + 2µ²(1 − cos φl) = 1 − 2µ(1 − µ)(1 − cos φl).

Clearly, if µ < 0, i.e. a < 0, then |ρ| > 1 + ∆t for some values of φl, for all ∆t sufficiently small, and when µ > 0 (a > 0), |ρ| ≤ 1 provided 0 ≤ µ ≤ 1. In other words, the upwind scheme is conditionally stable, ∆t ≤ ∆x/|a| or |a|∆t ≤ ∆x, while the downwind scheme is always unstable. Similarly, we can show that the forward scheme (6.43) is stable if −1 ≤ µ ≤ 0 and unstable otherwise. Notice that the upwind scheme amounts to taking the derivative in the direction opposite to the advection speed, i.e., the direction from which the information arrives. (A short plotting sketch of these amplification factors is given after this list.)

• the leap-frog scheme (6.44) yields

ρ² = 1 − 2ρµi sin(φl),

which has the 2 roots

ρ± = −iµ sin(φl) ± √(1 − µ² sin²(φl))  if |µ| ≤ 1,  (6.47)

=⇒ |ρ±|² = 1 if |µ| ≤ 1.

When |µ| > 1, one can always find a value of φl for which the two roots ρ± are both purely imaginary, and then necessarily one of them has to be strictly larger than one in modulus, for all values of ∆t. Thus, like the upwind scheme, the leap-frog scheme is conditionally stable: ∆t ≤ ∆x/|a|. Note, however, that unlike the previous schemes, von Neumann analysis for the leap-frog method leads to two amplitude modes, ρ±, for a given spatial mode exp(ijφl). Nevertheless, as we will see below, only one of them is physical, namely ρ+; that is, it represents an approximation of (converges to) the exact solution, while the other one is an artifact of the numerical discretization. The latter is often called a computational or a parasitic mode.
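The amplification factors derived in this list are easy to visualize. A minimal Matlab sketch (an assumed illustration, not part of the notes) plotting |ρ(φl)| for the upwind and centred schemes at a few Courant numbers µ is:

% Minimal sketch: amplification factor magnitudes for the upwind and centred schemes
phi = linspace(0, pi, 400);
for mu = [0.5 1.0 1.2]
    rho_up = 1 - mu*(1 - exp(-1i*phi));   % upwind/backward scheme (6.42)
    rho_ct = 1 - 1i*mu*sin(phi);          % forward-in-time centred scheme (6.41)
    subplot(1,2,1), plot(phi, abs(rho_up)), hold on
    subplot(1,2,2), plot(phi, abs(rho_ct)), hold on
end
subplot(1,2,1), title('|\rho|, upwind'),  xlabel('\phi_l'), ylim([0 1.6])
subplot(1,2,2), title('|\rho|, centred'), xlabel('\phi_l'), ylim([0 1.6])
legend('\mu = 0.5', '\mu = 1.0', '\mu = 1.2')

The upwind curve stays at or below 1 only for µ ≤ 1, while the centred curve exceeds 1 for every µ > 0, consistent with the analysis above.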

CFL condition:
The CFL condition, named after its discoverers, Courant, Friedrichs, and Lewy, states that if a difference scheme for an evolution equation is stable, then the domain of dependence of the numerical scheme, corresponding to one time step, contains the domain of dependence of the original continuous equation, as both the time step and the spatial grid size ∆t, ∆x −→ 0.

Let’s first clarify the notion of domain of dependence. Given the partial differential equation,

ut = F (u, ux)


Table 6.1: Some simple schemes for the advection equation and their properties.

Scheme    | Order of accuracy   | Stability              | CFL condition
Centred   | O(∆t + (∆x)²)       | Unstable               | Satisfied if ∆t ≤ ∆x/|a|
Upwind    | O(∆t + ∆x)          | Stable if ∆t ≤ ∆x/|a|  | Satisfied if ∆t ≤ ∆x/|a|
Downwind  | O(∆t + ∆x)          | Unstable               | Not satisfied
Leap-frog | O((∆t)² + (∆x)²)    | Stable if ∆t ≤ ∆x/|a|  | Satisfied if ∆t ≤ ∆x/|a|

with an initial condition u(x, 0) = u_0(x), the domain of dependence of this equation at some point (x, t) is the set of all points x_0 such that u(x, t) depends on the value of u_0 at x_0. For instance, according to the solution by the method of characteristics, the domain of dependence of the advection equation (6.28) reduces to a single point

D(x, t) = {x_0} = {x − at}.

The domain of dependence corresponding to one time step is

D(x, t+∆t) = {(x − a∆t, t)}.

This is illustrated in Figure 6.9 for a > 0. The numerical domains of dependence of the schemes considered so far are as follows:

D_centred(x, t+∆t) = {(x−∆x, t), (x+∆x, t)}
D_backward(x, t+∆t) = {(x−∆x, t), (x, t)}
D_forward(x, t+∆t) = {(x, t), (x+∆x, t)}
D_leapfrog(x, t+∆t) = {(x−∆x, t), (x, t−∆t), (x+∆x, t)}

Note that, as shown in Figure 6.9, the point (x − a∆t, t) is within the range of the domain of dependence of all the schemes above, provided the stability condition ∆t ≤ ∆x/|a| is satisfied, so that the CFL condition holds, except for the forward scheme, which corresponds to downwind differencing in this case, which explains its instability. Notice that this makes physical sense. As suggested by the method of characteristics, when a > 0 the solution of the advection equation uses information on the left to advance forward in time, while the forward scheme uses information on the right. Note that the upwind and the leap-frog schemes do not have this problem. Also, the condition |a|∆t ≤ ∆x can be interpreted as a requirement that the numerical speed of propagation of information, ∆x/∆t, be at least as large as the advection speed of the continuous problem.

Interestingly, the centred scheme (6.41) satisfies the CFL condition if ∆t ≤ ∆x/|a|, but it is not stable. This is a good example illustrating the important fact that the CFL condition is a necessary condition for stability but it is not sufficient.

Table 6.1 summarizes the properties of the four simple schemes that we have covered so far.


Figure 6.9: Domain of dependence of the advection equation for a > 0 and the CFL condition.

6.4.4 More on the leap-frog scheme: the parasitic mode and the Robert-Asselin filter

According to Table 6.1, the best we have, among the simple schemes seen so far, is the leap-frog scheme; it is stable and second order accurate in both space and time. Notice also that as such it is relatively cheap and easy to implement. This may explain in part why this scheme is so popular in the engineering and atmosphere/ocean communities. However, it has at least two drawbacks: 1) it is a multistep (2 steps or 3 levels) scheme, which means it needs the knowledge of the solution at two successive time steps in order to advance to the next one, and 2) it carries a parasitic mode which may ruin the numerical solution when used for long time integrations, if it is not filtered carefully. More details on the leap-frog scheme and its parasitic mode can be found in the literature (e.g. Durran). Here we briefly illustrate and demonstrate the behaviour of this parasitic mode and give a strategy for controlling it in practice.

Recall the von Neumann amplification factors associated with the leap-frog scheme,

ρ_± = −iσ ± √(1 − σ²),

where we set σ = µ sin(φ_l). Since |ρ_±| = 1, we can write

ρ_± = e^{iψ_±},   ψ_± = arctan( −σ / (±√(1 − σ²)) ).

With ψ_0 = arcsin(σ), we have

ψ_+ = −ψ_0,   ψ_− = −π + ψ_0,

ρ_+ = e^{−iψ_0},   ρ_− = e^{i(−π+ψ_0)},


and the Fourier mode solution for the leap-frog scheme is given by

u_j^n = (c_1 ρ_+^n + c_2 ρ_−^n) e^{ijφ_l} = c_1 e^{i(jφ_l − nψ_0)} + c_2 (−1)^n e^{i(jφ_l + nψ_0)}
      ≈ c_1 e^{2πli(x_j − a t_n)} + c_2 (−1)^n e^{2πli(x_j + a t_n)},   (6.48)

where we used the facts that µ = a∆t/∆x, φ_l = 2πl∆x, t_n = n∆t, x_j = j∆x, and the approximations

arcsin(σ) ≈ σ,   sin(φ_l)/φ_l ≈ 1.

The expression in (6.48) clarifies that the term representing ρ_+ has the form f(x − at) and therefore provides an approximation for the solution to the advection equation, while the remaining term represents a wave moving in the opposite direction (to the left), whose amplitude oscillates between positive and negative values. Clearly, the latter is an artifact of the numerical discretization, called the computational or parasitic mode, and may lead to serious damage to the solution if it is not controlled in some way.

There are many ways to control the computational mode. One of them is to make sure that the 'extra initial condition', i.e. the solution at the first time step, needed to advance the leap-frog method, is chosen so that the coefficient c_2 of the parasitic mode is zero in the decomposition of u^0 into Fourier modes. One easy way to guarantee that initially the parasitic mode is zero is to set the second initial data at t = ∆t, required to start the multi-step leap-frog method, to be

u_j^1 = Σ_l ρ_{+,l} ρ_l^0 e^{ijφ_l},

given that at t = 0 we have

u_j^0 = Σ_l ρ_l^0 e^{ijφ_l}.

However, for long integration periods the parasitic mode can be excited and grow just from round-off errors.

A safe and commonly used strategy for controlling the computational mode of the leap-frog scheme, known as the Robert-Asselin filter, is given next. The Robert-Asselin filter consists in averaging the solution at every time step using the previous and the future solutions, at t − ∆t and t + ∆t, respectively, by introducing an extra filtering step. Let ū_j^n denote the solution filtered in such a way. The two-step leap-frog plus Robert-Asselin filtering scheme is given by

u_j^{n+1} = ū_j^{n−1} − µ(u_{j+1}^n − u_{j−1}^n)

ū_j^n = u_j^n + γ(u_j^{n+1} − 2u_j^n + ū_j^{n−1}),   (6.49)

where γ is a small filtering parameter, usually taken to be γ = 0.06. Some atmospheric models used for cloud physics use values as large as γ = 0.3 (Durran). It is important to note that the filtering step destroys the second order accuracy in time, and the resulting scheme is only first order.
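As an illustration, here is a minimal Python sketch (not part of the original notes; array names and parameter values are only indicative) of the leap-frog scheme with the Robert-Asselin filter (6.49) for the periodic advection equation, assuming a > 0 and using a first-order upwind start-up step.

```python
import numpy as np

def leapfrog_ra(u0, a=1.0, dx=0.01, dt=0.008, T=1.0, gamma=0.06):
    """Leap-frog + Robert-Asselin filter for u_t + a u_x = 0 on a periodic grid.

    gamma = 0 gives the plain (unfiltered) leap-frog scheme.
    """
    mu = a * dt / dx
    nsteps = int(round(T / dt))
    uold = u0.copy()                                  # (filtered) solution at level n-1
    u = u0 - mu * (u0 - np.roll(u0, 1))               # start-up: first-order upwind step
    for _ in range(nsteps - 1):
        unew = uold - mu * (np.roll(u, -1) - np.roll(u, 1))   # leap-frog step
        ufilt = u + gamma * (unew - 2.0 * u + uold)           # Robert-Asselin filter
        uold, u = ufilt, unew
    return u

x = np.arange(0.0, 1.0, 0.01)
u0 = np.exp(-100.0 * (x - 0.5) ** 2)
u_final = leapfrog_ra(u0)
print(float(np.abs(u_final - u0).max()))   # error after one full period (t = 1)
```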

6.4.5 The Lax-Friedrichs scheme

The Lax-Friedrichs scheme is a clever modification of the unstable centred scheme (6.41), which makes it stable. It consists of replacing the term u_j^n by the average of the neighbouring cells to


obtain

u_j^{n+1} = (u_{j+1}^n + u_{j−1}^n)/2 − (a∆t/(2∆x)) (u_{j+1}^n − u_{j−1}^n).   (6.50)

Like the centred scheme, the Lax-Friedrichs scheme (6.50) is second order accurate in space and only first order in time. However, it is stable under the CFL condition |µ| ≡ |a|∆t/∆x ≤ 1. In fact, the associated von Neumann amplification factor is given by

ρ = cos(φ_l) − iµ sin(φ_l)

=⇒ |ρ|² = cos²(φ_l) + µ² sin²(φ_l) ≤ 1 if |µ| ≤ 1.

Pb. 7 Show that the Lax-Friedrichs scheme is first order in time and second order in space and that its amplification factor is given by

ρ = cos(φ_l) − iµ sin(φ_l).

The main advantages of the Lax-Friedrichs scheme, when compared to the other stable schemes listed above, reside in the facts that it is second order in space and that it is a one-step method, therefore easier to implement in practice than the leap-frog method. However, it is only first order accurate in time, which limits its use in practical applications, and it is extremely dissipative, as we will see below.

6.4.6 Second order schemes: the Lax-Wendroff scheme

Perhaps the simplest way to achieve a one-step (2-level) scheme which is second order in both time and space for the advection equation is to resort to a Taylor expansion in the time variable,

u(x, t+∆t) = u(x, t) + ∆t u_t(x, t) + ((∆t)²/2) u_tt(x, t) + · · ·

Using the advection equation, u_t = −a u_x, yields

u(x, t+∆t) = u(x, t) − a∆t u_x(x, t) + (a²(∆t)²/2) u_xx(x, t) + · · ·

Now, we use second order centred differencing to approximate the spatial derivatives u_x and u_xx to obtain the Lax-Wendroff scheme

u_j^{n+1} = u_j^n − (µ/2)(u_{j+1}^n − u_{j−1}^n) + (µ²/2)(u_{j+1}^n − 2u_j^n + u_{j−1}^n).   (6.51)

It is easy to show that the Lax-Wendroff scheme (6.51) is second order accurate in both space and time and that it is stable under the CFL condition |a|∆t/∆x ≤ 1. The amplification factor for this scheme is

ρ = 1 − iµ sin(φ_l) − µ²(1 − cos(φ_l)).


Pb. 8 Show that the Lax-Wendroff scheme is second order in both time and space and that it is stable under the CFL condition |a|∆t/∆x ≤ 1.

If instead we use a second order approximation of the first space derivative in the Taylor expansion, using only points that are in the upwind direction, namely x_{j−2}, x_{j−1}, x_j when a > 0, we obtain the Beam-Warming scheme:

u_j^{n+1} = u_j^n − (µ/2)(3u_j^n − 4u_{j−1}^n + u_{j−2}^n) + (µ²/2)(u_j^n − 2u_{j−1}^n + u_{j−2}^n).   (6.52)

This scheme can also be derived by using polynomial interpolation. In fact, let

P_2(x) = u_{j−2} + ((u_{j−1} − u_{j−2})/∆x)(x − x_{j−2}) + ((u_j − 2u_{j−1} + u_{j−2})/(2(∆x)²))(x − x_{j−1})(x − x_{j−2})

be the 2nd degree polynomial interpolating u at the grid points x_j, x_{j−1}, x_{j−2}. Using this polynomial to approximate the derivatives u_x, u_xx at x = x_j yields the Beam-Warming scheme. The details are left as an exercise for the student.

Note that because the stencil of the Beam-Warming scheme uses two grid cells on the left of x_j, the associated CFL condition becomes 0 ≤ a∆t/∆x ≤ 2. In fact, we can show that this scheme is stable under this CFL condition. Compared to the Lax-Wendroff scheme, this new scheme allows larger time steps. However, if the advection speed is not too large, the time step might still be restricted to that of the Lax-Wendroff scheme for accuracy reasons.

Pb. 9 Show that the Beam-Warming scheme is second order in both time and space and that it is stable under the CFL condition 0 ≤ a∆t/∆x ≤ 2.

Pb. 10 Derive the version of the Beam-Warming scheme when a < 0.

6.4.7 Some numerical experiments

Here we assess the performance of each of the (stable) methods listed above for the advection equation on the interval [0, 1] with periodic boundary conditions and two different initial data. We consider a smooth initial condition consisting of a hump-like profile,

u_0(x) = exp(−100(x − 0.5)²),

and a non-smooth, piecewise constant initial condition, called here a square wave,

u_0(x) = 0 if |x − 0.5| > 0.25,   u_0(x) = 1 if |x − 0.5| < 0.25.

The advection velocity is assumed constant and normalized to a = 1, and we integrate to time t = 1 so that the initial profile is moved forward by a distance equal to the length of the interval [0, 1]; by periodicity we have u(x, t = 1) = u_0(x). We use a mesh size ∆x = 0.01 and a time step ∆t = 0.008, corresponding to a Courant number µ = |a|∆t/∆x = 0.8.
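A minimal Python sketch of this experiment, assuming a > 0 and periodic boundary conditions implemented with np.roll, could look as follows; the function names are illustrative, and the code simply transcribes the schemes (6.42), (6.50), (6.51), and (6.52) as written above.

```python
import numpy as np

def advect(u0, scheme, a=1.0, dx=0.01, dt=0.008, T=1.0):
    """Advance u_t + a u_x = 0 on a periodic grid with one of the simple schemes (a > 0 assumed)."""
    mu = a * dt / dx
    u = u0.copy()
    for _ in range(int(round(T / dt))):
        um, up = np.roll(u, 1), np.roll(u, -1)       # u_{j-1}, u_{j+1}
        if scheme == "upwind":
            u = u - mu * (u - um)
        elif scheme == "lax-friedrichs":
            u = 0.5 * (up + um) - 0.5 * mu * (up - um)
        elif scheme == "lax-wendroff":
            u = u - 0.5 * mu * (up - um) + 0.5 * mu**2 * (up - 2 * u + um)
        elif scheme == "beam-warming":
            umm = np.roll(u, 2)                      # u_{j-2}
            u = u - 0.5 * mu * (3 * u - 4 * um + umm) + 0.5 * mu**2 * (u - 2 * um + umm)
        else:
            raise ValueError(scheme)
    return u

x = np.arange(0.0, 1.0, 0.01)
u0 = np.exp(-100.0 * (x - 0.5) ** 2)                 # smooth hump
for s in ("upwind", "lax-friedrichs", "lax-wendroff", "beam-warming"):
    print(s, float(np.abs(advect(u0, s) - u0).max()))  # error after one period
```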


(Four panels: the exact solution compared with the upwind and Lax-Friedrichs schemes, the leap-frog scheme, the Lax-Wendroff scheme, and the Beam-Warming scheme.)

Figure 6.10: Solution to the advection equation using different finite difference schemes. Case of a smooth hump advected periodically through the interval [0, 1].


(Four panels: the exact solution compared with the upwind and Lax-Friedrichs schemes, the leap-frog and Robert-Asselin-filtered leap-frog schemes, the Lax-Wendroff scheme, and the Beam-Warming scheme.)

Figure 6.11: Same as Fig. 6.10, except for the non-smooth, piecewise constant square wave. Note that for this case results with both the plain leap-frog and the leap-frog combined with the Robert-Asselin filter are shown.


In Figure 6.10, we plot the numerical solutions, obtained with each of the 5 methods, against the exact solution for the advection equation. In Figure 6.11 we report similar plots corresponding to the non-smooth square wave. Here we note a few important points.

• First, for the smooth solution in Figure 6.10, as expected, the second order methods (leap-frog, Lax-Wendroff, and Beam-Warming) are highly accurate, whereas the first order upwind and Lax-Friedrichs methods yield very unsatisfactory results. They both suffer from excessive dissipation; that is, the wave amplitude decays in time. Surprisingly, the upwind scheme seems to perform better than Lax-Friedrichs, although the latter is second order accurate in space.

• Second, for the non-smooth case in Figure 6.11, on the other hand, the two first-order methods seem to perform better than the three second order methods, although they tend to smooth out the discontinuity. The second order schemes exhibit strong oscillations near the discontinuities. This is typical of high order methods: they exhibit an oscillatory behaviour near shocks. Note also that Lax-Wendroff is somewhat better than Beam-Warming, and that Lax-Wendroff produces oscillations behind the shock while the oscillations in the Beam-Warming scheme are located in front of the discontinuity.

Below, we will see how to design high resolution or non-oscillatory schemes which in principle are second order accurate in regions where the solution is smooth and only first order, but non-oscillatory, near the shocks.

• Third, note that the leap-frog scheme seems to exhibit the worst oscillatory behaviour for the non-smooth solution. In fact, most of this is due to the presence of the computational mode, which tends to amplify the oscillations. Things look a lot better when the computational mode is controlled by the Robert-Asselin filter. Notice that the filtered leap-frog is somewhere between the Lax-Wendroff and a first order method, for it exhibits some oscillations behind the shock and smooths out the discontinuity at the same time.

Below, we analyze the different schemes in detail to understand better their performances.

6.4.8 Numerical diffusion, dispersion, and the modified equation

Consider the upwind scheme (6.42) for the advection equation. With some simple manipulations this scheme can be rewritten as

(1/(2∆t))(u_j^{n+1} − u_j^{n−1}) + (∆t/2) (u_j^{n+1} − 2u_j^n + u_j^{n−1})/(∆t)² = −a (u_{j+1}^n − u_{j−1}^n)/(2∆x) + (a∆x/2) (u_{j+1}^n − 2u_j^n + u_{j−1}^n)/(∆x)²

or

(u_j^{n+1} − u_j^{n−1})/(2∆t) + a (u_{j+1}^n − u_{j−1}^n)/(2∆x) = (a∆x/2) (u_{j+1}^n − 2u_j^n + u_{j−1}^n)/(∆x)² + (∆t/2) (u_j^{n+1} − 2u_j^n + u_j^{n−1})/(∆t)².   (6.53)

For all practical purposes this scheme can be thought of as an approximation of the diffusive advection equation

u_t + a u_x = (a∆x/2) u_xx + (∆t/2) u_tt,


with a second order truncation error, O((∆t)² + (∆x)²). Differentiating the advection equation once with respect to t yields u_tt = −a u_xt = −a(u_t)_x = a² u_xx, i.e. the diffusive equation above becomes

u_t + a u_x = (a∆x/2)(1 − a∆t/∆x) u_xx.

This equation is sometimes called the modified equation, approximated by the upwind scheme to second order. In essence, the upwind scheme approximates the viscosity solution introduced in (6.36), corresponding to the advection equation with the viscosity ǫ = (a∆x/2)(1 − µ), where µ = a∆t/∆x is the Courant number. On the one hand this suggests that the upwind scheme computes the right physical solution, by approximating the vanishing viscosity solution, and on the other hand it provides an explanation for the poor performance of this scheme, especially in the case of the smooth solution in Figure 6.10, namely its under-prediction of the solution.

Note that the viscosity coefficient introduced by the upwind scheme is zero if the Courant number is chosen to be one. Such a choice is avoided in practice because it may lead to numerical instabilities due to round-off errors. Further, as noted above, a little viscosity is necessary to guarantee convergence to the physical solution.

This phenomenon is known as numerical dissipation or diffusion. It is due to the term

(a∆x/2) (u_{j+1}^n − 2u_j^n + u_{j−1}^n)/(∆x)² + (∆t/2) (u_j^{n+1} − 2u_j^n + u_j^{n−1})/(∆t)²

on the right hand side of (6.53). In fact, multiplying the viscous advection equation by u and integrating on [0, 1] yields

d/dt ∫_0^1 (u²/2) dx + (a/2) ∫_0^1 (u²)_x dx = ǫ ∫_0^1 u u_xx dx.   (6.54)

where we used u u_t = (u²)_t/2 and u u_x = (u²)_x/2. By integration by parts we have

∫_0^1 u u_xx dx = u u_x|_0^1 − ∫_0^1 (u_x)² dx = −∫_0^1 (u_x)² dx,

provided the boundary terms vanish, which happens if we use periodic or homogeneous Dirichlet or Neumann boundary conditions. The second term on the left-hand side of the integral equation (6.54) vanishes for the same reason, yielding the following energy dissipation principle:

d/dt ∫_0^1 (u²/2) dx = −ǫ ∫_0^1 (u_x)² dx < 0 if u_x ≢ 0.

Thus, if integrated for a very long time, the upwind scheme will smooth out all the gradients in the solution and will ultimately converge to a constant (flat) solution.

As we can surmise from Figures 6.10 and 6.11, the Lax-Friedrichs scheme is more dissipative than the upwind scheme. Indeed, the Lax-Friedrichs scheme is equivalent to

(u_j^{n+1} − u_j^n)/∆t + a (u_{j+1}^n − u_{j−1}^n)/(2∆x) = ((∆x)²/(2∆t)) (u_{j+1}^n − 2u_j^n + u_{j−1}^n)/(∆x)² − (∆t/2) (u_j^{n+1} − 2u_j^n + u_j^{n−1})/(∆t)²,   (6.55)

which yields a dissipative modified equation, approximated to second order by the Lax-Friedrichs scheme, with a viscosity ǫ = (∆x)²/(2∆t) − a²∆t/2 = (∆x)²(1 − µ²)/(2∆t), which is typically larger than that of the upwind scheme.


This dissipation or diffusivity problem is typical of first order schemes for hyperbolic systems. Let us now consider the second order schemes and derive the modified equations associated with them. Since they are already second order accurate, we should consider approximations with an order of accuracy higher than two. Below, we take a slightly different route than the one above to derive the modified equation. Namely, we use Taylor expansion, as we did for the truncation error, but we keep the first neglected term and add it to the advection equation to form the modified equation.

For the leap-frog method we have

(u_j^{n+1} − u_j^{n−1})/(2∆t) + a (u_{j+1}^n − u_{j−1}^n)/(2∆x) = u_t + ((∆t)²/6) u_ttt(x, t) + O((∆t)⁴) + a u_x + a ((∆x)²/6) u_xxx(x, t) + O((∆x)⁴)

(because all even terms in the Taylor series cancel out). Again using the advection equation, we have u_ttt = −a³ u_xxx, and therefore the modified equation, which is approximated to fourth order by the leap-frog scheme, is

u_t + a u_x = −(a(∆x)²/6)(1 − µ²) u_xxx.   (6.56)

This equation is known as the dispersive wave equation. First, it is easy to show that this equation conserves energy (see exercise 11 below). This is consistent with the fact that the von Neumann amplification factors for the leap-frog method satisfy |ρ_±| = 1 when |µ| ≤ 1.

Second, dispersion refers to a physical system where waves of different wavelengths propagate at different speeds. Let us look at a wave-like solution of the form

u = e^{i(kx − ωt)}

for the dispersive wave equation. Here k = 2πl is the wavenumber, ω is the frequency (or phase), and ω/k defines the phase speed, the speed at which the wave propagates. Plugging this ansatz into (6.56) yields the dispersion relation

ω = ak − (a(∆x)²/6)(1 − µ²) k³,   (6.57)

i.e. the phase speed, c_k = a − (a(∆x)²/6)(1 − µ²)k², depends strongly on the wavenumber: the higher the wavenumber, the larger the deviation of the wave speed from the advection speed, so that short waves lag farther behind the advective motion. This is very unlike the original advection equation, where all waves move at the same speed, the advection speed a.

Now, we reconsider the von Neumann amplification factors for the leap-frog method in (6.47). Recall that we have two solutions (one physical and the other a numerical artifact), which we denote here by

u_+ = e^{i(kj∆x − nψ_0)},   u_− = (−1)^n e^{i(kj∆x + nψ_0)}.

Let us consider only the physical mode, u_+. From the discussion above, using n = t_n/∆t, the associated phase speed is given by

c_+ = ψ_0/(k∆t) = arcsin(µ sin(φ_l))/(k∆t).


Therefore, one way to derive the dispersion relation in (6.57) is to operate directly on the expression of ψ_0. Using Taylor expansions of both the sine and the arcsine functions, for small φ_l values (resolved modes), yields

ψ_0 ≈ µφ_l − (µ/6)(1 − µ²)φ_l³ = a∆t k − ((∆x)²/6) a∆t (1 − µ²) k³,

which implies that the phase speed of the physical mode for the leap-frog scheme satisfies

c_+ ≈ a − (a(∆x)²/6)(1 − µ²) k².

This matches exactly the expression for ck found above, using the dispersive wave equation.

Now we can explain why, in Figure 6.11, the leap-frog scheme exhibits oscillations. They result from the fact that the small truncation errors located near the discontinuity, which also accumulate with time, are viewed as wave disturbances of much shorter wavelengths, which then propagate at their own speeds, different from that of the actual solution.

Dispersion relation for the upwind, LF, and LW schemes

Recall that the amplification factors of the two first order schemes, namely the upwind and the Lax-Friedrichs schemes, are respectively given by

ρ_uw = 1 − µ(1 − cos(φ_l)) − iµ sin(φ_l)   and   ρ_LF = cos(φ_l) − iµ sin(φ_l).

In light of the analysis done in the previous paragraphs for the leap-frog scheme, we set ρ^n ≡ e^{−in∆tω}, where ω is a generalized phase (frequency), so that ℜ(ω)/k yields the phase speed of the numerical wave solution and ℑ(ω) yields the exponential growth or damping rate of the wave amplitude. We have, for the upwind and Lax-Friedrichs schemes respectively,

e^{−iω_uw∆t} = 1 − µ(1 − cos(φ_l)) − iµ sin(φ_l),   e^{−iω_LF∆t} = cos(φ_l) − iµ sin(φ_l).

For small ω∆t, we have e^{−iω∆t} ≈ 1 − iω∆t. Hence

ω_uw∆t ≈ µ sin(φ_l) − iµ(1 − cos(φ_l))   and   ω_LF∆t ≈ µ sin(φ_l) − i(1 − cos(φ_l)).

First note that both schemes have quite strong damping rates,

ℑ(ω_uw) = −(µ/∆t)(1 − cos(k∆x)) ≈ −(a/2) k²∆x,   ℑ(ω_LF) = −(1 − cos(φ_l))/∆t ≈ −(a/(2µ)) k²∆x,

respectively, and, consistently with the previous results, the LF scheme is more damped, by a factor of 1/µ.

Their phase speeds are equal and are given by

ℜ(ω_uw)/k = ℜ(ω_LF)/k = (a/(k∆x)) sin(k∆x) ≈ a (1 − k²(∆x)²/6).


This clearly shows that both schemes are dispersive, similarly to the leap-frog scheme. Nevertheless, small wave disturbances, typically occurring at the grid scale, decrease in magnitude at a faster rate than the domain-scale physical wave, i.e. before they disperse and damage the large scale solution.

Now we turn to the Lax-Wendroff and Beam-Warming schemes. We have

ρ_LW = 1 − iµ sin(φ_l) − µ²(1 − cos(φ_l)),

yielding

ω_LW = (µ/∆t) sin(φ_l) − i (µ²/∆t)(1 − cos(φ_l)),   (6.58)

i.e. for the phase speed we have the same dispersive behaviour as for the previous schemes, but a much smaller damping rate,

ℑ(ω_LW) = −(µ²/∆t)(1 − cos(φ_l)) ≈ −(µ²/(2∆t)) k²(∆x)²,

of the same order as the phase speed deviation. For grid scale disturbances the wavenumber k scales as 1/∆x, therefore the damping rate of those small scale waves is of order O(µ²/(2∆t)), which seems to be small, explaining the propagation of the small oscillations in Figure 6.11 away from the discontinuity before they get damped.

More importantly, note that the propagation of the oscillations of the Lax-Wendroff scheme, in Figure 6.11, to the left of the discontinuities is associated with the fact that, according to the dispersion relation (6.58), smaller wavelength disturbances move slower than those with larger wavelengths. We have a similar behaviour for the leap-frog scheme. The Beam-Warming scheme, however, exhibits somewhat larger amplitude oscillations moving to the right, which suggests that the Beam-Warming scheme has both a weaker damping rate and dispersive phase speeds characterized by smaller wavelengths propagating faster than larger ones. The details are left as an exercise.
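The damping rates and phase speeds discussed above can also be extracted numerically from the amplification factor, by writing ρ = e^{−iω∆t} so that ω = i ln(ρ)/∆t. The following Python sketch (not from the notes; names and parameters are illustrative) does this for the upwind, Lax-Friedrichs, and Lax-Wendroff schemes.

```python
import numpy as np

def phase_speed_and_damping(scheme, mu=0.8, a=1.0, dx=0.01, nphi=400):
    """Numerical phase speed Re(w)/k and damping rate Im(w) from the amplification factor,
    using rho = exp(-i*w*dt), i.e. w = 1j*log(rho)/dt."""
    dt = mu * dx / a
    phi = np.linspace(1e-3, np.pi, nphi)      # phi = k*dx
    k = phi / dx
    if scheme == "upwind":
        rho = 1 - mu * (1 - np.cos(phi)) - 1j * mu * np.sin(phi)
    elif scheme == "lax-friedrichs":
        rho = np.cos(phi) - 1j * mu * np.sin(phi)
    elif scheme == "lax-wendroff":
        rho = 1 - 1j * mu * np.sin(phi) - mu**2 * (1 - np.cos(phi))
    else:
        raise ValueError(scheme)
    w = 1j * np.log(rho) / dt
    return k, w.real / k, w.imag              # wavenumber, phase speed, damping rate

k, c, damp = phase_speed_and_damping("lax-wendroff")
print(c[:3], damp[:3])   # well-resolved modes: phase speed close to a, weak damping
```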

Pb. 11 Show that the dispersive wave equation

u_t + a u_x = γ u_xxx

conserves energy, that is,

d/dt ∫_0^1 (u²/2) dx = 0.

Pb. 12 Determine the exponential damping rate and the phase speed for the Beam-Warming scheme. Deduce that this scheme is dispersive and that wave disturbances with larger wavenumbers (smaller wavelengths) propagate faster than those with smaller wavenumbers.


Figure 6.12: Riemann problem for Burgers' equation solved with both the upwind and Lax-Friedrichs schemes, showing that the upwind scheme predicts a wrong shock speed (a stationary wave).

6.5 Finite volume methods for scalar conservation laws

6.5.1 Wrong shock speed and importance of conservative form

Consider Burger’s equation with the discontinuous initial condition

ut +1

2(u2)x = 0

u0(x) =

1 if x < 0.250 if x > 0.25.

(6.59)

We view Burger’s equation as an advection equation with a non-linear advection speed a(u) = uand attempt to solve this problem with both the upwind and Lax-Friedrichs methods. The upwindscheme is generalized to Burger’s equation as follows.

u_j^{n+1} = u_j^n − (∆t/∆x) ( max(u_j^n, 0)(u_j^n − u_{j−1}^n) + min(u_j^n, 0)(u_{j+1}^n − u_j^n) ).   (6.60)

The Lax-Friedrichs scheme, on the other hand, amounts to approximating the flux derivative using centred differences:

u_j^{n+1} = (1/2)(u_{j+1}^n + u_{j−1}^n) − (∆t/(2∆x)) ( (u_{j+1}^n)²/2 − (u_{j−1}^n)²/2 ).   (6.61)

The results are shown in Figure 6.12, where the two numerical solutions are compared to the exact solution of the Riemann problem. Note that the Lax-Friedrichs method predicts well the propagation of the shock wave, with a significant smearing of the discontinuity, as expected, due to its high viscosity, while the upwind scheme predicts a steady solution which doesn't change with time. In fact, this latter result can be easily recovered analytically from the upwind scheme itself. The numerical solution in this case remains zero on the right of the shock because the advection velocity is zero, and remains one on the left because the backward finite difference derivative is zero.
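A minimal Python sketch of this comparison is given below; the grid, final time, and CFL number are illustrative (the exact settings of the figure are not specified here), and the code implements the non-conservative upwind scheme (6.60) and the conservative Lax-Friedrichs scheme (6.61) side by side.

```python
import numpy as np

def burgers_riemann(nx=200, T=0.5, cfl=0.8):
    """Upwind (6.60) versus conservative Lax-Friedrichs (6.61) for the Riemann data (6.59)."""
    x = np.linspace(-0.2, 1.2, nx)
    dx = x[1] - x[0]
    u_uw = np.where(x < 0.25, 1.0, 0.0)
    u_lf = u_uw.copy()
    t = 0.0
    while t < T:
        dt = min(cfl * dx / max(np.abs(u_uw).max(), np.abs(u_lf).max(), 1e-12), T - t)
        # upwind, advection form with speed a(u) = u (non-conservative)
        um, up = np.roll(u_uw, 1), np.roll(u_uw, -1)
        u_uw = u_uw - dt / dx * (np.maximum(u_uw, 0) * (u_uw - um)
                                 + np.minimum(u_uw, 0) * (up - u_uw))
        # Lax-Friedrichs in conservative form with flux f(u) = u^2/2
        um, up = np.roll(u_lf, 1), np.roll(u_lf, -1)
        u_lf = 0.5 * (up + um) - dt / (2 * dx) * (up**2 / 2 - um**2 / 2)
        # reset inflow/outflow boundary cells (np.roll wraps around)
        u_uw[0], u_uw[-1] = 1.0, 0.0
        u_lf[0], u_lf[-1] = 1.0, 0.0
        t += dt
    return x, u_uw, u_lf

x, u_uw, u_lf = burgers_riemann()
# exact shock position at time T is x = 0.25 + T/2 (Rankine-Hugoniot speed s = 1/2)
print("upwind: jump near x =", float(x[np.argmax(u_uw < 0.5)]))
print("Lax-Friedrichs: jump near x =", float(x[np.argmax(u_lf < 0.5)]))
```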


The main reason why the Lax-Friedrichs scheme outperforms the upwind scheme, in this case, is that the former has a conservative form. A numerical scheme for a conservation law

u_t + (f(u))_x = 0

is said to be conservative if it can be written in the form

u_j^{n+1} = u_j^n − (∆t/∆x) [ F_{j+1/2} − F_{j−1/2} ],   (6.62)

where F_{j+1/2}, called a numerical flux, is in some sense an approximation of the flux f(u) at the cell interface x_{j+1/2} = x_j + ∆x/2.

Here we show that indeed the Lax-Friedrichs scheme has a conservative form. Let

F^{LF}_{j+1/2} = (1/2)(f(u_j^n) + f(u_{j+1}^n)) − (∆x/(2∆t))(u_{j+1}^n − u_j^n).

Then the Lax-Friedrichs scheme can be written as

u_j^{n+1} = u_j^n − (∆t/∆x)(F^{LF}_{j+1/2} − F^{LF}_{j−1/2}).

Thus it is conservative. We will see below that the Lax-Wendroff and Beam-Warming methods are also conservative.

Remark: It is worthwhile noting that the Lax-Friedrichs flux F^{LF} at the interface j + 1/2 is the average of the left and right fluxes, f(u_j) and f(u_{j+1}), minus (∆x/(2∆t)) times the centred difference u_{j+1} − u_j, i.e. a diffusive term. Thus, the Lax-Friedrichs scheme can be viewed as a discrete version of the diffusive equation

u_t + (f(u) − ǫ u_x)_x = 0,

where ǫ = (∆x)²/(2∆t) is the viscosity coefficient.

Convergence of conservative schemes: It is shown in the literature that if a numerical scheme for a conservation law u_t + (f(u))_x = 0 is consistent, stable, and has a conservative form, then the resulting numerical solution converges to a weak solution of the conservation law.

This statement can be viewed as an extension of the Lax equivalence theorem to non-linear conservation laws. Notice, however, that since weak solutions are not unique, it is not guaranteed that the numerical solution obtained by a consistent, stable, and conservative scheme converges to the physical entropic solution. For that, some extra properties of the numerical scheme are needed to enforce the entropy condition.

6.5.2 Godonuv’s first order scheme

Let xj = j∆x, j = · · · ,−2,−1, 0, 1, 2, · · · and tn = n∆t, n = 0, 1, 2, · · · be a discretization ofthe space-time domain (x, t). We divide the real line into sub-intervals [xj−1/2, xj+1/2], called grid


cells, where x_{j+1/2} = x_j + ∆x/2. We define the cell average of the solution u(x, t) at each grid cell as

u_j^n = (1/∆x) ∫_{x_{j−1/2}}^{x_{j+1/2}} u(x, t_n) dx.   (6.63)

Consider the conservation law u_t + (f(u))_x = 0. Next, we integrate this equation over the rectangle [t_n, t_{n+1}] × [x_{j−1/2}, x_{j+1/2}], often called the control volume or finite volume, hence the name finite volume methods. We have

∫_{t_n}^{t_{n+1}} ∫_{x_{j−1/2}}^{x_{j+1/2}} u_t(x, t) dx dt + ∫_{t_n}^{t_{n+1}} ∫_{x_{j−1/2}}^{x_{j+1/2}} (f(u(x, t)))_x dx dt = 0

or

∫_{x_{j−1/2}}^{x_{j+1/2}} u(x, t_{n+1}) dx − ∫_{x_{j−1/2}}^{x_{j+1/2}} u(x, t_n) dx + ∫_{t_n}^{t_{n+1}} f(u(x_{j+1/2}, t)) dt − ∫_{t_n}^{t_{n+1}} f(u(x_{j−1/2}, t)) dt = 0.

Dividing by the mesh size ∆x and introducing the averages u_j^n above yields

u_j^{n+1} = u_j^n − (∆t/∆x) [ F^n_{j+1/2} − F^n_{j−1/2} ],   (6.64)

where

F^n_{j+1/2} = (1/∆t) ∫_{t_n}^{t_{n+1}} f(u(x_{j+1/2}, t)) dt

is the average flux of u through the interface x = x_{j+1/2}, t_n ≤ t ≤ t_{n+1}. Provided we find an adequate numerical approximation of the time integral, the formula (6.64) provides a numerical scheme for the conservation law, in conservative form.

Piecewise constant approximations and the Riemann problem at the cell interfaces: Assume that at time t = t_n the cell averages u_j^n are known. Consider the approximation

u(x, t_n) ≈ u_j^n, if x_{j−1/2} ≤ x ≤ x_{j+1/2}.

Then finding an approximation for the flux F_{j+1/2} amounts to solving the Riemann problem

u_t + (f(u))_x = 0,   t ∈ [t_n, t_{n+1}],   (6.65)
u(x, t_n) = u_L if x < x_{j+1/2},   u_R if x > x_{j+1/2},

where the left and right states are given, respectively, by u_L = u_j^n and u_R = u_{j+1}^n.

To compute the flux F_{j+1/2} we need to know the solution u along the interface x = x_{j+1/2}, t_n ≤ t ≤ t_n + ∆t. We introduce the shock speed at the cell interface,

s = (f(u_R) − f(u_L))/(u_R − u_L),


according to the Rankine-Hugoniot jump condition. Under the condition that f(u) is convex, using the method of characteristics introduced above, we have, for t_n ≤ t ≤ t_n + ∆t,

u(x_{j+1/2}, t) = u_L if f′(u_L) > 0 and f′(u_R) > 0,
                  u_R if f′(u_L) < 0 and f′(u_R) < 0,
                  u_L if f′(u_L) ≥ 0 and f′(u_R) ≤ 0 and s > 0,
                  u_R if f′(u_L) ≥ 0 and f′(u_R) ≤ 0 and s < 0,
                  u_∗ if f′(u_L) ≤ 0 and f′(u_R) ≥ 0,   (6.66)

where in (6.66) u_∗, called a sonic point, is defined such that f′(u_∗) = 0 and corresponds to a rarefaction wave solution. Notice that the convexity of f guarantees that u_∗ exists and is unique. Note also that the first four cases correspond either to an entropic shock solution where, according to Lax's criterion,

f′(u_L) ≥ s ≥ f′(u_R),

so that u_{j+1/2} = u_L if s > 0 and u_{j+1/2} = u_R if s < 0, or to a rarefaction wave where the rarefaction fan is either completely to the left or completely to the right of the interface, according to whether f′(u_{R,L}) < 0 or f′(u_{R,L}) > 0, respectively. The last case, however, corresponds to a rarefaction wave where the rarefaction fan contains characteristics going both to the left and to the right. It is called a transonic rarefaction, by analogy with gas dynamics, where such a rarefaction happens when the fluid on one side of the wave moves at a speed smaller than the speed of sound (subsonic) while on the other side it moves at a speed larger than the speed of sound (supersonic).

Here we show that the sonic point u_∗ is in fact the root of the equation f′(u_∗) = 0. Assume f′(u_L) < 0 < f′(u_R); then in this case there is a rarefaction wave connecting the left and right states. Inspired by the solution to Burgers' equation, a rarefaction wave solution is sought of the form u(x, t) = v(x/t). Plugging this ansatz into the conservation law u_t + (f(u))_x = 0 yields

−(x/t²) v′(x/t) + f′(v(x/t)) v′(x/t) (1/t) = 0.

Hence

f′(v(x/t)) = x/t.

Along the interface x = 0 we have u_∗ = u(0, t) = v(0) and f′(v(0)) = 0.

The celebrated Godunov’s method is now obtained by simply using the solution to the Riemannproblem in (6.66) at each interface xj+1/2 to computes the numerical fluxes Fj−1/2, Fj+1/2 in (6.64),yielding

Fj+1/2 =

f(uL) if f ′(uL) > 0 & f ′(uR) > 0f(uR) if f ′(uL) < 0 & f ′(uR) < 0f(uL) if f ′(uL) ≥ 0 & f ′(uR) ≤ 0 & s > 0f(uR) if f ′(uL) ≥ 0 & f ′(uR) ≤ 0 & s < 0f(u∗) if f ′(uL) ≤ 0 & f ′(uR) ≥ 0.

(6.67)

Notice that the solution u∗ in (6.65) can be replaced by the shock solution, i.e, uR if s < 0 anduL if s > 0, even when f ′(uL) < 0 < f ′(uR), and will still provide a weak solution satisfying the


Rankine-Hugoniot jump condition, but as we already know this will lead to an unphysical weak solution which violates the entropy condition. The resulting numerical solution is referred to as an all-shock solution and provides an approximation to a weak solution of the conservation law, which is not necessarily the physically relevant, entropic solution.
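For a convex flux, the Godunov flux (6.67) can be coded compactly as in the following sketch (not part of the notes; names are illustrative), with Burgers' equation used as an example (f(u) = u²/2, sonic point u_∗ = 0). The case checks are rearranged slightly, which is equivalent for a convex flux.

```python
def godunov_flux(uL, uR, f, fprime, ustar):
    """Godunov numerical flux (6.67) at one interface, for a convex flux f.

    uL, uR : left/right states; f, fprime : flux and its derivative; ustar : sonic point.
    """
    if uL == uR:
        return f(uL)
    s = (f(uR) - f(uL)) / (uR - uL)            # Rankine-Hugoniot shock speed
    if fprime(uL) >= 0 and fprime(uR) >= 0:    # wave moves entirely to the right
        return f(uL)
    if fprime(uL) <= 0 and fprime(uR) <= 0:    # wave moves entirely to the left
        return f(uR)
    if fprime(uL) >= 0 >= fprime(uR):          # entropic shock: pick side by shock speed
        return f(uL) if s > 0 else f(uR)
    return f(ustar)                            # transonic rarefaction

# Burgers' equation: f(u) = u^2/2, f'(u) = u, sonic point u* = 0
f = lambda u: 0.5 * u * u
fp = lambda u: u
print(godunov_flux(1.0, 0.0, f, fp, 0.0))   # right-moving shock: flux f(uL) = 0.5
print(godunov_flux(-1.0, 1.0, f, fp, 0.0))  # transonic rarefaction: flux f(0) = 0.0
```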

Case of the advection equation and stability of Godunov's method: If we replace the conservation law by the simple advection equation with a positive advection speed, a > 0, then the solution to the Riemann problem at the cell interface x_{j+1/2} reduces to

u(x_{j+1/2}, t) = u_j^n,

and Godunov's method reduces to the upwind scheme

u_j^{n+1} = u_j^n − (a∆t/∆x)(u_j^n − u_{j−1}^n).   (6.68)

Recall that the upwind method is stable under the CFL condition. Therefore, provided the CFL condition

max(|f′(u_L)|, |f′(u_R)|) ∆t ≤ ∆x   (6.69)

is satisfied, Godunov's method is stable.

Consistency and convergence: Moreover, given the way the fluxes were computed, i.e. using the exact solution of the Riemann problem with a first order approximation of the initial data, namely piecewise constant cell averages, we can show that Godunov's method is consistent of order 1 in both space and time, just like the upwind scheme. Moreover, since it is also conservative by construction, Godunov's method is guaranteed to converge to the entropic weak solution, provided we always choose the entropic solution of the Riemann problem. The numerical test of Figure 6.12 is repeated with the Godunov method and the results are shown in Figure 6.13, where Godunov's method is compared to the Lax-Friedrichs method. As we expect, Godunov's method predicts the right shock speed and has much less dissipation than the Lax-Friedrichs method. However, one big disadvantage of Godunov's method is that it requires the solution of a Riemann problem at each cell interface and every time step. This can be very costly, especially for non-linear systems of conservation laws.

Another serious shortcoming of Godunov's method is that it is only first order in both time and space. This limits its use in practice, but it constitutes an important cornerstone for the more sophisticated, so-called high resolution methods developed below.

6.5.3 High resolution, TVD, and MUSCL schemes

As noted above, one of the major shortcomings of Godunov's method is that it is only first order accurate in both space and time. A crucial step in deriving this method is the piecewise constant approximation step, when the solution u(x, t) at time t = t_n is approximated by the cell average

u(x, t_n) ≈ u_j^n,   x_{j−1/2} ≤ x ≤ x_{j+1/2},

to advance to the next time step, t_{n+1}, which is only a first order approximation/interpolation. Since Godunov's method computes the cell averages at each new time step, this piecewise approximation


Figure 6.13: Same as Figure 6.12 but with Godunov’s method instead of the upwind scheme.

appears to be very convenient, indeed. Given the cell averages u_j^n, we can use piecewise linear interpolation to obtain a second order reconstruction,

u(x, t_n) ≈ ũ(x, t_n) ≡ u_j^n + σ_j(x − x_j),   x_{j−1/2} < x < x_{j+1/2},

which can be used instead to advance the solution to the next step (see Figure 6.14). The slopes σ_j can be computed in many different ways, using different finite differencing formulas, as we will see below.

To illustrate, let us consider the advection equation

u_t + a u_x = 0,   a > 0.

Given the piecewise linear reconstruction ũ(x, t_n) at time t = t_n, the "exact" solution of this equation at t = t_n + ∆t is given by

ũ(x, t_n + ∆t) = ũ(x − a∆t, t_n).

Provided a∆t/∆x < 1, we have

ũ(x, t_n + ∆t) = u_{j−1}^n + σ_{j−1}(x − a∆t − x_{j−1})   if x_{j−1/2} < x < x_{j−1/2} + a∆t,
                 u_j^n + σ_j(x − a∆t − x_j)   if x_{j−1/2} + a∆t < x < x_{j+1/2}.   (6.70)

Averaging ũ(x, t_{n+1}) on the intervals (x_{j−1/2}, x_{j+1/2}),

u_j^{n+1} = (a∆t/∆x) u_{j−1}^n + (σ_{j−1}/∆x) ∫_{x_{j−1/2}}^{x_{j−1/2}+a∆t} (x − a∆t − x_{j−1}) dx + ((∆x − a∆t)/∆x) u_j^n + (σ_j/∆x) ∫_{x_{j−1/2}+a∆t}^{x_{j+1/2}} (x − a∆t − x_j) dx,

yields the cell averages at t_n + ∆t, given by

u_j^{n+1} = u_j^n − (a∆t/∆x)(u_j^n − u_{j−1}^n) − (a∆t/(2∆x))(∆x − a∆t)(σ_j − σ_{j−1}).   (6.71)


If we set the slopes to zero, σ_j = 0, then we recover the first order Godunov (upwind) method in (6.68). A second order scheme is obtained if the slope is chosen to be a first order finite differencing formula which approximates the derivative u_x(x_j, t_n) in the j-th grid cell, using neighbouring cell average values. The following three choices of slopes yield three popular second order schemes:

Centred:  σ_j = (u_{j+1} − u_{j−1})/(2∆x)   (Fromm)
Upwind:   σ_j = (u_j − u_{j−1})/∆x   (Beam-Warming)
Downwind: σ_j = (u_{j+1} − u_j)/∆x   (Lax-Wendroff).

Pb. 13 Show that when a < 0 the second order scheme for the advection equation using piecewise linear reconstruction is given by

u_j^{n+1} = u_j^n − (a∆t/∆x)(u_{j+1}^n − u_j^n) + (a∆t/(2∆x))(∆x + a∆t)(σ_{j+1} − σ_j).   (6.72)

Reconstruct, Solve, Average: The algorithm followed above to derive Godunov's method, as well as its second order version for the advection equation, involves three main steps.

1) Reconstruct: Given the cell averages u_j^n, we (re)construct piecewise constant or piecewise linear approximations

ũ(x, t_n) = u_j^n   or   ũ(x, t_n) = u_j^n + σ_j(x − x_j),   x_{j−1/2} < x < x_{j+1/2}.

2) Solve: Solve the Riemann problem

u_t + (f(u))_x = 0,   t_n ≤ t ≤ t_{n+1},   u(x, t_n) = ũ(x, t_n).

3) Average: Average the solution at t_{n+1},

u_j^{n+1} = (1/∆x) ∫_{x_{j−1/2}}^{x_{j+1/2}} u(x, t_{n+1}) dx.

A few remarks on the Reconstruct-Solve-Average algorithm: Note that the first step is crucial: it sets the order of accuracy of the method. Piecewise constants yield a first order method, while a piecewise linear approximation yields a second order scheme. In fact, methods using parabolic approximations exist and are often used in practice; they are known as PPM (piecewise parabolic methods). Piecewise parabolic reconstruction yields a third order method, and so on. For the second step, it is desirable to be able to solve the conservation law exactly. This is possible in the case of the advection equation, and for the conservation law in general when piecewise constants are used in step one, yielding Godunov's method. But in general we need to resort to quadrature formulas to approximate the flux integrals over the time interval [t_n, t_n + ∆t].

Finally, when a finite volume method, such as Godunov's, is applied to a conservation law, it yields the average values u_j^{n+1} directly, which takes care of step three.


High order finite volume methods for conservation laws: Recall the finite volume scheme derived in (6.64) for Godunov's method,

u_j^{n+1} = u_j^n − (∆t/∆x) [ F^n_{j+1/2} − F^n_{j−1/2} ],

where

F^n_{j+1/2} = (1/∆t) ∫_{t_n}^{t_{n+1}} f(u(x_{j+1/2}, t)) dt.

To reconcile with the case when u(x, t) is not piecewise constant at time t = t_n, we approximate the time integral by a rectangle rule, yielding an Euler step

u_j^{n+1} = u_j^n − (∆t/∆x) ( F^n_{j+1/2} − F^n_{j−1/2} ),   (6.73)

where

F^n_{j+1/2} = f(u^∗(x_{j+1/2}, t_n^+)).

Here

u^∗(x_{j+1/2}, t_n^+) = lim_{t→t_n^+} u(x_{j+1/2}, t)

is obtained by solving the Riemann problem with left and right states

u_L = u(x_{j+1/2}^−, t_n),   u_R = u(x_{j+1/2}^+, t_n),

where again

u(x_{j+1/2}^−, t_n) = lim_{x→x_{j+1/2}, x<x_{j+1/2}} u(x, t_n)   and   u(x_{j+1/2}^+, t_n) = lim_{x→x_{j+1/2}, x>x_{j+1/2}} u(x, t_n).

Note that in the case of a piecewise constant approximation these limits are simply u_L = u_j^n and u_R = u_{j+1}^n, and the solution to the Riemann problem is constant along the interface on the time interval [t_n, t_n + ∆t], provided the CFL condition ∆t max(|f′(u_R)|, |f′(u_L)|) ≤ ∆x is satisfied. Note that this is not true for higher order reconstructions.

To obtain higher order accuracy in time, we often use a predictor-corrector scheme, such as the mid-point Runge-Kutta method, in place of the first-order Euler step (6.73):

u_j^{n+1/2} = u_j^n − (∆t/(2∆x)) ( F^n_{j+1/2} − F^n_{j−1/2} ),   (6.74)
u_j^{n+1} = u_j^n − (∆t/∆x) ( F^{n+1/2}_{j+1/2} − F^{n+1/2}_{j−1/2} ),

where

F^{n+1/2}_{j+1/2} = f(u^∗(x_{j+1/2}, t^+_{n+1/2})).

Oscillations and TVD methods: Recall from Figure 6.11 that one major problem with second order methods is that they tend to generate unphysical oscillations near discontinuities. One way to control these oscillations is to use a hybrid method where the second order reconstruction is limited to regions where the solution u is smooth, and a first order method is used in regions where u presents discontinuities of some sort. To find an intelligent way to do so, we first introduce a mathematical tool to control those oscillations. This is achieved by the total variation, which is introduced next.


Definition 15 The total variation (TV) of a real valued function f is given by

TV(f) = sup Σ_{j=−∞}^{+∞} |f(ξ_{j+1}) − f(ξ_j)|,

where the supremum is taken over all subdivisions · · · < ξ_{−1} < ξ_0 < ξ_1 < · · · of the real line.

To help develop some intuition, we list a few properties of TV.

• If f(x) is monotonic (i.e. non-increasing or non-decreasing) on some interval [a, b], then the total variation of f on [a, b] is given by

TV_{[a,b]}(f) = |f(b) − f(a)|.

• If f(x) is piecewise constant on the line segments (x_{j−1/2}, x_{j+1/2}), then

TV(f) = Σ_{j=−∞}^{+∞} |f(x_{j+1}) − f(x_j)|,

i.e. the sum of all the jumps of f.

• If f(x) is piecewise linear on the line segments (x_{j−1/2}, x_{j+1/2}), then

TV(f) = Σ_{j=−∞}^{+∞} |f(x_{j+1/2}^−) − f(x_{j−1/2}^+)| + Σ_{j=−∞}^{+∞} |f(x_{j+1/2}^+) − f(x_{j+1/2}^−)|,

i.e. the sum of the variations within each one of the line segments plus all the jumps across the interfaces.

• If f is differentiable, then

TV(f) = ∫_{−∞}^{+∞} |f′(x)| dx.

Since the exact solution to the advection equation simply propagates at a constant speed, its TV doesn't change with time. Moreover, it can be shown that the total variation of an entropic solution of a conservation law, in general, doesn't increase with time. It can decrease after a shock but never increases. However, this is not always the case for the numerical solution. Clearly the oscillations generated by the second order schemes in Figure 6.11 do increase the TV of the numerical solution. A reasonable way to attempt to avoid those unphysical oscillations is thus to require that the total variation of the numerical solution does not increase.
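For a grid function (piecewise constant cell values), the total variation reduces to the sum of the jumps between neighbouring cells, which is straightforward to compute. The following small helper (illustrative, not from the notes) can be used to monitor the TV of a numerical solution from one time step to the next.

```python
import numpy as np

def total_variation(u, periodic=True):
    """Discrete total variation of a grid function: sum of the jumps between neighbouring cells."""
    if periodic:
        return float(np.abs(np.diff(np.append(u, u[0]))).sum())
    return float(np.abs(np.diff(u)).sum())

# usage: a monotone profile has TV = |f(b) - f(a)|; adding oscillations increases the TV
x = np.linspace(0.0, 1.0, 101)
print(total_variation(np.tanh(10 * (x - 0.5)), periodic=False))                               # ~ 2*tanh(5)
print(total_variation(np.tanh(10 * (x - 0.5)) + 0.1 * np.sin(40 * np.pi * x), periodic=False))  # larger
```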

Definition 16 The numerical scheme

u_j^{n+1} = u_j^n − (∆t/∆x) ( F^n_{j+1/2} − F^n_{j−1/2} )

is said to be total variation diminishing (TVD) if

TV(u^{n+1}) ≤ TV(u^n).   (6.75)


For instance, we can show that the averaging step in the Reconstruct-Solve-Average algorithm does actually diminish the total variation. Therefore, when the slopes are set to zero, σ_j = 0, the upwind scheme for the advection equation is TVD.

Definition 17 A numerical scheme

u_j^{n+1} = H(u_{j−k}^n, u_{j−k+1}^n, · · · , u_{j+l}^n)

is said to be monotone if

∂H/∂u_η ≥ 0,   η = j − k, · · · , j + l.

One important property of monotone schemes is that they do not create new local maxima or minima:

max_j u_j^{n+1} ≤ max_j u_j^n   and   min_j u_j^{n+1} ≥ min_j u_j^n.

As such, monotone schemes are TVD. On the other hand, TVD schemes are monotonicity preserving. A numerical scheme is said to be monotonicity preserving if monotone data remains monotone:

· · · ≤ u_{j−1}^n ≤ u_j^n ≤ u_{j+1}^n ≤ · · ·   =⇒   · · · ≤ u_{j−1}^{n+1} ≤ u_j^{n+1} ≤ u_{j+1}^{n+1} ≤ · · ·

We have the following general statement:

monotone schemes ⊂ TVD schemes ⊂ monotonicity preserving schemes.

It is easy to show that, under the CFL condition a∆t ≤ ∆x, Godunov's method for the advection equation is monotone. In fact, for a > 0, we have

u_j^{n+1} = H(u_j^n, u_{j−1}^n) ≡ (1 − µ)u_j^n + µ u_{j−1}^n.

Therefore it is TVD, as noted above. Unfortunately, it is not possible to construct linear second order schemes which are monotone, according to the following theorem due to Godunov, in agreement with the fact that second order schemes exhibit oscillations.

Theorem 24 (Godunov) A linear monotonic scheme is at most first order accurate.

In order to achieve high order schemes without unphysical oscillatory behaviour, we need to design schemes which are intrinsically nonlinear. A family of such schemes are the hybrid schemes mentioned above, which are second order in smooth regions and only first order near shocks, to avoid oscillations.

To guarantee that our hybrid numerical scheme remains TVD and second order accurate in smooth regions, we must choose the slopes σ_j so that the TV of the reconstructed function is not larger than that of the discrete cell averages u_j^n. This is achieved by defining the slope σ_j as a non-linear function of the data u_j^n in such a way as to control the total variation. The associated non-linear function is called a limiter, and methods based on this idea are known as slope-limiter methods,


first introduced by van Leer when he derived his famous MUSCL schemes (Monotonic Upstream-centered Schemes for Conservation Laws).

Perhaps the simplest choice of a limiter, which guarantees second order accuracy in regions where u is smooth and satisfies the TVD property at the same time, is the minmod limiter:

σ_j = minmod( (u_{j+1}^n − u_j^n)/∆x, (u_j^n − u_{j−1}^n)/∆x ),   (6.76)

where

minmod(a, b) = a if |a| < |b| and ab > 0,
               b if |b| < |a| and ab > 0,
               0 if ab ≤ 0.

Note that when a, b have the same sign, the minmod function chooses the one which is smaller in magnitude, and when ab ≤ 0 it returns zero. Instead of using the downwind slope, yielding the Lax-Wendroff scheme, or the upwind slope, leading to the Beam-Warming scheme, the minmod limiter chooses the one which is smaller in magnitude. When the upwind and downwind slopes have opposite signs, the reconstructed cell value is kept constant. The latter case must correspond to a local minimum or a local maximum, and the minmod limiter tends to preserve the local extrema, hence avoiding the creation of overshoots and undershoots near those extrema. Thus it does not increase the TV and does not generate oscillations.

A more popular choice of limiter, due to van Leer, is the MC limiter (monotonized centred limiter):

σ_j = minmod( (u_{j+1}^n − u_{j−1}^n)/(2∆x), 2(u_{j+1}^n − u_j^n)/∆x, 2(u_j^n − u_{j−1}^n)/∆x ).

Note that the MC limiter chooses the smallest in magnitude among the centred difference, corresponding to Fromm's method, and twice the upwind or downwind formulas, and returns zero when there is a change in sign. This limiter yields highly accurate centred slopes in smooth regions and sharper resolution near discontinuities.
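A minimal sketch of the minmod (6.76) and MC slope computations, to be used in place of the unlimited slopes in the update (6.71), could look as follows; the code is illustrative, not from the notes, and assumes a periodic grid.

```python
import numpy as np

def minmod(a, b):
    """Elementwise minmod: the argument of smaller magnitude if a and b have the same sign, else 0."""
    return np.where(a * b > 0, np.where(np.abs(a) < np.abs(b), a, b), 0.0)

def limited_slopes(u, dx, limiter="minmod"):
    """Limited slopes sigma_j for a periodic grid function u (minmod (6.76) or the MC limiter)."""
    du_up = (u - np.roll(u, 1)) / dx          # upwind slope   (u_j - u_{j-1})/dx
    du_dn = (np.roll(u, -1) - u) / dx         # downwind slope (u_{j+1} - u_j)/dx
    if limiter == "minmod":
        return minmod(du_dn, du_up)
    if limiter == "mc":
        centred = 0.5 * (du_up + du_dn)       # Fromm (centred) slope
        return minmod(centred, minmod(2.0 * du_dn, 2.0 * du_up))
    raise ValueError(limiter)
```

The nested minmod in the MC branch is equivalent to the three-argument minmod: if any two of the arguments have opposite signs it returns zero, otherwise it returns the smallest in magnitude.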

Flux limiters: Using piecewise linear reconstruction for the advection equation yields (see (6.71) and (6.72))

u_j^{n+1} = u_j^n − (∆t/∆x)(F^n_{j+1/2} − F^n_{j−1/2}),

where

F^n_{j+1/2} = a u_j^n + (a/2)(∆x − a∆t) σ_j   if a > 0,
F^n_{j+1/2} = a u_{j+1}^n − (a/2)(∆x + a∆t) σ_{j+1}   if a < 0.

Introducing the negative and positive parts of a,

a⁺ = max(a, 0),   a⁻ = min(a, 0),

the flux F_{j+1/2} can be rewritten in the compact form

F_{j+1/2} = a⁻ u_{j+1} + a⁺ u_j + (1/2)|a| (1 − ∆t|a|/∆x) δ_j,   (6.77)


Figure 6.14: Piecewise constant (dashed) and piecewise linear reconstructions.

where δ_j ≡ ∆x σ_j for a > 0 (and δ_j ≡ ∆x σ_{j+1} for a < 0) plays the role of a jump in u at the interface; for the upwind slope choice, for instance, δ_j = u_j − u_{j−1} if a > 0 and δ_j = u_{j+1} − u_j if a < 0.

We introduce the upwind side of j as

J = j − 1 if a > 0,   J = j + 1 if a < 0,

and the jump

∆u_j = u_{j+1} − u_j.

We define

θ = ∆u_J / ∆u_j,

and let

δ_j = φ(θ) ∆u_j

in (6.77), where φ is a non-linear function of θ, called a flux limiter, which is defined so that the resulting scheme is second order accurate in smooth regions and TVD at the same time.

Some linear choices of φ yield the popular second order (linear) schemes:

φ(θ) = 0 : upwind
φ(θ) = 1 : Lax-Wendroff
φ(θ) = θ : Beam-Warming
φ(θ) = (1 + θ)/2 : Fromm,


the last three of which are of course not TVD, according to Godunov's theorem, while some clever non-linear choices yield high resolution, TVD methods. The following are the most popular ones:

φ(θ) = max(0, min(1, θ)) : minmod
φ(θ) = max(0, min((1 + θ)/2, 2, 2θ)) : MC limiter
φ(θ) = max(0, min(1, 2θ), min(2, θ)) : superbee
φ(θ) = (θ + |θ|)/(1 + |θ|) : van Leer.

They all have their strengths and weaknesses, depending on the application. Those are well documented in the literature (see LeVeque's book, Finite Volume Methods for Hyperbolic Problems, for examples). The performance of each of the choices of limiters listed above is shown in the different panels of Figure 6.15, for both the smooth and the non-smooth (square wave) data. We see that in general the TVD limiters capture very well both the smooth solution (with an accuracy comparable to the second order schemes) and the non-smooth square wave. The second order methods have an oscillatory behaviour near the discontinuity, and the upwind method is too diffusive.
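For completeness, here is an illustrative Python sketch (not from the notes; names are arbitrary) of the limiter functions φ(θ) listed above, together with the corresponding flux-limited update (6.77) for a > 0 on a periodic grid.

```python
import numpy as np

def phi(theta, limiter):
    """Flux-limiter functions listed above (elementwise in theta)."""
    if limiter == "minmod":
        return np.maximum(0.0, np.minimum(1.0, theta))
    if limiter == "mc":
        return np.maximum(0.0, np.minimum(np.minimum(0.5 * (1 + theta), 2.0), 2.0 * theta))
    if limiter == "superbee":
        return np.maximum(0.0, np.maximum(np.minimum(1.0, 2.0 * theta), np.minimum(2.0, theta)))
    if limiter == "vanleer":
        return (theta + np.abs(theta)) / (1.0 + np.abs(theta))
    raise ValueError(limiter)

def flux_limited_step(u, mu, limiter="vanleer"):
    """One flux-limited step (6.77) for u_t + a u_x = 0 with a > 0 (mu = a*dt/dx), periodic grid."""
    du = np.roll(u, -1) - u                     # Delta u_j = u_{j+1} - u_j (jump at interface j+1/2)
    duJ = u - np.roll(u, 1)                     # upwind jump Delta u_J (a > 0)
    eps = 1e-14
    theta = duJ / np.where(np.abs(du) > eps, du, eps)
    delta = phi(theta, limiter) * du
    F = u + 0.5 * (1.0 - mu) * delta            # numerical flux divided by a (a+ = a, a- = 0)
    return u - mu * (F - np.roll(F, 1))
```

Repeatedly applying flux_limited_step with µ = 0.8 to the smooth hump or the square wave of Section 6.4.7 reproduces the qualitative behaviour shown in Figure 6.15.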


(Eight panels: the exact solution compared with the minmod, MC, superbee, and van Leer limiter schemes, and with the upwind, Lax-Wendroff, Beam-Warming, and Fromm schemes.)

Figure 6.15: Performance of the different limiter functions listed in the text for both the smooth and non-smooth data.


Bibliography

[1] Crispin W. Gardiner, 2004, Handbook of Stochastic Methods: for Physics, Chemistry and the Natural Sciences, Springer.

[2] Randall J. LeVeque, 2002, Finite Volume Methods for Hyperbolic Problems (Cambridge Texts in Applied Mathematics), Cambridge University Press.

[3] Dale Durran, 1998, Numerical Methods for Wave Equations in Geophysical Fluid Dynamics, Springer.
