
Notes on Computational Mathematics

James Brunner

August 19, 2013

This document is a summary of the notes from UW-Madison's Math 714, fall 2012 (taught by Prof. Shi Jin) and Math 715, spring 2013 (taught by Prof. Saverio Spagnoli). This summary was written by and for me, so a lot may be unclear to others. If you are reading this and are not me, email me with questions: [email protected]. I probably can't answer them, but then again you probably aren't going to ask any, because the chances anyone but me reads this are slim. If you are reading this, however, I welcome both questions and comments (like if you find an error!) because they are sure to improve this document, and more importantly my own understanding of the material. This document is a work in progress and won't be fully edited for a while, so there may be typos, errors, etc. If you find any, feel free to bring them to my attention.


Contents

1 Numerical Differentiation

2 Solving ODEs

3 Finite Difference Methods for PDEs
  3.1 Parabolic Equations
  3.2 Stability, Consistency, Convergence
  3.3 Methods for Nonlinear Diffusion Equations
  3.4 The Heat Equation in Higher Dimensions
  3.5 Finite Difference Methods for Hyperbolic Equations
  3.6 Nonlinear Hyperbolic Equations
  3.7 Numerical Solutions to Nonlinear Hyperbolic Equations - Conservative Schemes
    3.7.1 Lax-Friedrichs Scheme
    3.7.2 Lax-Wendroff Scheme
    3.7.3 The Godunov Scheme
  3.8 Nonlinear Stability
  3.9 High Resolution Shock Capturing Schemes
    3.9.1 Flux Limiter Methods
    3.9.2 Slope Limiter Methods
  3.10 Central Schemes
  3.11 Relaxation Scheme

4 Front Propagation

5 Finite Volume Methods for PDEs

6 Spectral Methods for PDEs
  6.1 Trigonometric Interpolation
    6.1.1 The Fast Fourier Transform
  6.2 Basics about the Fourier Transform
  6.3 Spectral Differentiation
  6.4 Smoothness & Spectral Accuracy
    6.4.1 Convolutions
    6.4.2 Spectral Approximation
  6.5 Non-Periodic Problems - Chebyshev Points
    6.5.1 FFT for Chebyshev Spectral Methods

7 Monte Carlo Methods
  7.1 Probability Basics
    7.1.1 Distributions
  7.2 Convergence of Random Variables
  7.3 Random Sampling
    7.3.1 Discrete Sampling
  7.4 Multivariate Random Variables
  7.5 Variance Reduction Methods

8 Numerical Linear Algebra
  8.1 Norms
    8.1.1 Induced Matrix Norms
    8.1.2 General Matrix Norms
  8.2 Singular Value Decomposition
  8.3 Projection Operators
  8.4 Gram-Schmidt Orthogonalization
  8.5 Householder Triangularization
    8.5.1 Least Squares
  8.6 Conditioning and Stability
    8.6.1 Floating Point Arithmetic
    8.6.2 Accuracy and Stability
    8.6.3 Backwards Stability of QR Algorithms
    8.6.4 Stability of Back Substitution
  8.7 Gaussian Elimination
  8.8 Cholesky Decomposition
  8.9 Eigenvalues and Eigenvectors
    8.9.1 Finding Eigenvalues
    8.9.2 Two Great Eigenvalue Theorems
  8.10 Numerically Finding the SVD
  8.11 Iterative Methods
    8.11.1 Classical Iterative Methods
    8.11.2 Arnoldi Iteration
    8.11.3 The Conjugate Gradient Method

9 Finite Elements
  9.1 Introduction and Review of PDEs
  9.2 A First Look at Finite Elements
  9.3 Some Analysis - Hilbert & Sobolev Spaces

Chapter 1

Numerical Differentiation

We can approximate a derivative with a number of simple finite-difference ideas:

centred difference: $f'(x_0) \approx \dfrac{f(x_0+h) - f(x_0-h)}{2h}$

one-sided difference: $f'(x_0) \approx \dfrac{f(x_0+h) - f(x_0)}{h}$

Estimates for the error of these approximations can be made using Taylor expansions. Taylor expanding f(x) to get expressions for f(x_0 + h) and f(x_0 - h) and plugging these into the approximations gives an exact expression for the approximation which can be compared to f'(x_0). For example, this procedure shows:

$$\frac{f(x_0+h) - f(x_0-h)}{2h} = f'(x_0) + \frac{f'''(x_0)}{6}h^2 + o(h^2)$$

This shows that the centred difference approximation is second order with respect to h, and also depends on the regularity of f(x), because any blow up of $f^{(n)}(x)$ will cause the error term $\frac{f'''(x_0)}{6}h^2 + o(h^2)$ to blow up (note that in this error, the term $o(h^2)$ is only valid if $f \in C^\infty_{loc}$).
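As a quick numerical check of these error estimates, here is a minimal Python sketch (the test function, point, and step sizes are my own illustrative choices, not from the notes) that tabulates the error of the centred and one-sided differences as h shrinks; the centred error should fall like h^2 and the one-sided error like h.

import numpy as np

f, df = np.sin, np.cos          # test function and its exact derivative (arbitrary choice)
x0 = 1.0
for h in [0.1, 0.05, 0.025, 0.0125]:
    centred = (f(x0 + h) - f(x0 - h)) / (2 * h)
    one_sided = (f(x0 + h) - f(x0)) / h
    print(f"h={h:.4f}  centred error={abs(centred - df(x0)):.2e}  "
          f"one-sided error={abs(one_sided - df(x0)):.2e}")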

Using this same strategy, we can determine the error of any scheme of the form

$$f'(x_0) \approx \alpha f(x_0-h) + \beta f(x_0) + \gamma f(x_0+h)$$

and also use Taylor polynomials to choose $\alpha$, $\beta$, and $\gamma$. For a higher order method, we could add more terms, using $f(x_0 + 2h)$, etc.

Alternatively, we may use an interpolating polynomial of degree n, $P_n(x)$, and take $f'(x) \approx P_n'(x)$. Note, we can construct an interpolating polynomial using Lagrange interpolation as follows:

$$P_n(x) = \sum_{i=0}^{n} f(x_i)\,l_i(x), \qquad l_i(x) = \prod_{j\neq i}\frac{x - x_j}{x_i - x_j}$$

The accuracy of the interpolation depends on the regularity of f. We can also use piecewise polynomial interpolation for better accuracy. If we have a discontinuity in f, higher order interpolation will generate oscillations.


The best interpolation is the one built from the smoothest data. The divided differences between points are defined:

1st: $f[x_i, x_{i+1}] = \dfrac{f(x_{i+1}) - f(x_i)}{x_{i+1} - x_i}$

2nd: $f[x_{i-1}, x_i, x_{i+1}] = \dfrac{f[x_i, x_{i+1}] - f[x_{i-1}, x_i]}{x_{i+1} - x_{i-1}}$

nth: $f[x_0, \ldots, x_n] = \dfrac{f[x_1, \ldots, x_n] - f[x_0, \ldots, x_{n-1}]}{x_n - x_0}$

We want to choose the interpolation with the smaller (in absolute value) divided differences. We can also produce the interpolating polynomial using Newton's interpolation:

$$P_n(x) = f(x_0) + f[x_0, x_1](x-x_0) + f[x_0, x_1, x_2](x-x_0)(x-x_1) + \cdots + f[x_0, \ldots, x_n](x-x_0)(x-x_1)\cdots(x-x_{n-1})$$
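A minimal Python sketch of Newton's interpolation via a divided-difference table (the nodes and data below are hypothetical placeholders, not from the notes):

import numpy as np

def newton_coefficients(x, y):
    # returns f[x0], f[x0,x1], ..., f[x0,...,xn]
    c = np.array(y, dtype=float)
    n = len(x)
    for k in range(1, n):
        # c[j] becomes f[x_{j-k}, ..., x_j]
        c[k:] = (c[k:] - c[k-1:-1]) / (np.array(x[k:]) - np.array(x[:-k]))
    return c

def newton_eval(x, c, t):
    # evaluate P_n(t) = c0 + c1 (t-x0) + c2 (t-x0)(t-x1) + ... in Horner form
    p = c[-1]
    for k in range(len(c) - 2, -1, -1):
        p = c[k] + (t - x[k]) * p
    return p

x = [0.0, 0.5, 1.0, 1.5]            # hypothetical nodes
y = [np.sin(v) for v in x]          # hypothetical data
c = newton_coefficients(x, y)
print(newton_eval(x, c, 0.75), np.sin(0.75))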


Chapter 2

Solving ODEs

In solving ODEs (and PDEs, for that matter) we will need to ask whether our method of approximating a solution is stable, i.e. whether small errors will stay small rather than growing into large ones. However, for this question to be answerable, the equation we are solving should also be stable, i.e. small perturbations in the initial data should not grow exponentially. So we have a theorem:

Theorem 1. If f(y,t) is Lipschitz continuous, i.e. $\exists\, L > 0$ s.t. $|f(y,t) - f(y^*,t)| \le L|y - y^*|$ $\forall\, y, y^*, t$, then there exists a unique solution to the IVP $\frac{dy}{dt} = f(y,t)$, $y(0) = y_0$, where y(t) is continuously differentiable.

For stability of the solution, the Lipschitz condition is sufficient. To solve the IVP

$$\frac{dy}{dt} = f(y,t), \qquad y(0) = y_0$$

there are a number of numerical methods, generally based on replacing the derivative with a numerical approximation to a derivative. The simplest is the forward Euler method:

$$\frac{y^{n+1} - y^n}{\Delta t} = f(y^n, t^n)$$

Here the derivative has been replaced by the forward difference approximation, and this is an explicit method. The backward Euler method is similar, but requires evaluation of $f(y^{n+1}, t^{n+1})$, and so is implicit and requires solving a (generally nonlinear) equation at each step (analytically or, for example, by Newton's method). More generally, the two can be combined in a weighted average (the $\theta$-scheme):

$$\frac{y^{n+1} - y^n}{\Delta t} = \theta f(y^n, t^n) + (1-\theta) f(y^{n+1}, t^{n+1})$$

For $\theta = \tfrac{1}{2}$ this is the trapezoidal rule (the Crank-Nicolson scheme), which is second order. Finally, we may use a centred difference approximation for the derivative to get the second order "leapfrog" scheme:

$$\frac{y^{n+1} - y^{n-1}}{2\Delta t} = f(y^n, t^n)$$


It is necessary to determine the stability as well as the accuracy of these schemes. To consider stability, we can take $f(y,t) = \lambda y$, so that we know $y = y_0 e^{\lambda t}$. Then, if $\lambda < 0$, we have $|y(t)| \le |y_0|$ $\forall\, t > 0$. The forward Euler method can in this case be written as $y^n = (1 + \lambda\Delta t)^n y_0$. Then $|y^n| \le |y_0|$ holds only if $|1 + \lambda\Delta t| \le 1$. This gives a region of values of $\lambda\Delta t$ for which the method is stable, called the region of absolute stability. For the backward Euler method, however, we see that

$$y^n = \left(\frac{1}{1 - \lambda\Delta t}\right)^n y_0$$

This is stable for $|1 - \lambda\Delta t| > 1$, which is true whenever $\Re(\lambda) < 0$; and of course if $\Re(\lambda) > 0$ the analytical solution is not stable, so the numerical one shouldn't be either. The same is true for the trapezoidal ($\theta = 1/2$) method. We call methods like this "A-stable". Interestingly, there is a theorem (the second Dahlquist barrier) stating that no A-stable linear multistep method is of third or higher order. There are higher order methods, and they have regions of stability, and so are useful. For example, we have multi-step methods. The general form of an r-step method to solve $u_t = f(u)$ is:

$$\sum_{j=0}^{r}\alpha_j u^{n+j} = \Delta t\sum_{j=0}^{r}\beta_j f(u^{n+j})$$

An Adams method is one of the type:

$$u^{n+r} = u^{n+r-1} + \Delta t\sum_{j=0}^{r}\beta_j f(u^{n+j})$$

If $\beta_r$ is chosen to be 0, this is an explicit method, called Adams-Bashforth, and the coefficients can be chosen so that it is $O(\Delta t^r)$. If not, it is an implicit method, called Adams-Moulton, and can be $O(\Delta t^{r+1})$. We can choose the coefficients by considering the truncation error

$$\tau(t^{n+r}) = \frac{1}{\Delta t}\left[\sum_{j=0}^{r}\alpha_j u^{n+j} - \Delta t\sum_{j=0}^{r}\beta_j f(u^{n+j})\right]$$

Then we can Taylor expand u and f, plug in, and choose coefficients to fit the properties we want.

There are also multi-stage methods, such as the Runge-Kutta methods. These require us to compute intermediate values. For example, the 2nd order Runge-Kutta method is (letting $\Delta t = k$)

$$u^* = u^n + \tfrac{k}{2}f(u^n), \qquad u^{n+1} = u^n + k f(u^*)$$

and the 4th order Runge-Kutta method is:

$$F_0 = f(u^n, t^n)$$
$$F_1 = f(u^n + \tfrac{1}{2}kF_0,\; t^n + \tfrac{1}{2}k)$$
$$F_2 = f(u^n + \tfrac{1}{2}kF_1,\; t^n + \tfrac{1}{2}k)$$
$$F_3 = f(u^n + kF_2,\; t^{n+1})$$
$$u^{n+1} = u^n + \tfrac{k}{6}(F_0 + 2F_1 + 2F_2 + F_3)$$


These also require relatively small k for stability.
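Here is a minimal Python sketch comparing forward Euler and the 4th order Runge-Kutta method above on the test problem y' = lambda*y (the value of lambda, the step size, and the final time are my own test choices, not from the notes):

import numpy as np

lam = -2.0
f = lambda y, t: lam * y

def euler_step(y, t, k):
    return y + k * f(y, t)

def rk4_step(y, t, k):
    F0 = f(y, t)
    F1 = f(y + 0.5 * k * F0, t + 0.5 * k)
    F2 = f(y + 0.5 * k * F1, t + 0.5 * k)
    F3 = f(y + k * F2, t + k)
    return y + (k / 6.0) * (F0 + 2 * F1 + 2 * F2 + F3)

k, T = 0.1, 2.0
y_e, y_r, t = 1.0, 1.0, 0.0
for _ in range(int(T / k)):
    y_e, y_r = euler_step(y_e, t, k), rk4_step(y_r, t, k)
    t += k
exact = np.exp(lam * T)
print("Euler error:", abs(y_e - exact), " RK4 error:", abs(y_r - exact))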


Chapter 3

Finite Difference Methods for PDEs

3.1 Parabolic Equations

Our goal is to numerically approximate solutions to equations such as $u_t = \Delta u$ with given initial and boundary conditions. The idea of a finite difference method is to discretize time and space and use numerical approximations to derivatives. This gives a system of equations (which must be solved simultaneously for implicit methods). If we have a tridiagonal system, the following algorithm can be used to solve it:

Say $-A_j N_{j+1} + B_j N_j - C_j N_{j-1} = M_j$ with $N_0 = N_m = 0$, $A, B, C > 0$, and $B > A + C$. Let $N_j = E_j N_{j+1} + F_j$ with $E_0 = F_0 = 0$. Then

$$-A_j N_{j+1} + B_j N_j - C_j(E_{j-1}N_j + F_{j-1}) = M_j$$

$$N_j = \frac{A_j}{B_j - C_j E_{j-1}}N_{j+1} + \frac{C_j F_{j-1} + M_j}{B_j - C_j E_{j-1}}$$

so the coefficients satisfy the recursions

$$E_j = \frac{A_j}{B_j - C_j E_{j-1}}, \qquad F_j = \frac{C_j F_{j-1} + M_j}{B_j - C_j E_{j-1}}$$

So the system can be solved with two loops: a forward sweep computing $E_j$ and $F_j$, then a backward sweep computing $N_j$.
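A minimal Python sketch of the two-sweep solve, written directly in terms of the A_j, B_j, C_j, M_j notation above (the arrays are assumed to already satisfy the dominance condition B > A + C; index 0 is only used to hold the boundary values E_0 = F_0 = 0):

import numpy as np

def tridiag_solve(A, B, C, M):
    # solves -A[j] N[j+1] + B[j] N[j] - C[j] N[j-1] = M[j] for j = 1..m-1, N[0] = N[m] = 0
    m = len(B)
    E = np.zeros(m)
    F = np.zeros(m)
    for j in range(1, m):                 # forward sweep for E_j, F_j
        denom = B[j] - C[j] * E[j - 1]
        E[j] = A[j] / denom
        F[j] = (C[j] * F[j - 1] + M[j]) / denom
    N = np.zeros(m + 1)                   # N[0] = N[m] = 0
    for j in range(m - 1, 0, -1):         # backward sweep: N_j = E_j N_{j+1} + F_j
        N[j] = E[j] * N[j + 1] + F[j]
    return N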

A simple finite difference method (the $\theta$-scheme) to solve $u_t = u_{xx}$ looks like:

$$\frac{u_j^{n+1} - u_j^n}{k} = \theta\,\frac{u_{j+1}^n - 2u_j^n + u_{j-1}^n}{h^2} + (1-\theta)\,\frac{u_{j+1}^{n+1} - 2u_j^{n+1} + u_{j-1}^{n+1}}{h^2}$$

Here h is the space grid size (assumed to be constant, i.e. we have a uniform grid), j the index of the individual space grid point, k the time step size, and n the time point index.


3.2 Stability, Consistency, Convergence

There are a few ways to think about the stability of these methods. For example, the diffusion equation (above) has the property that the $L^2$ norm of the density function (u(x,t) above) should decrease. It is then reasonable to expect the same of our numerical solution (using the $l^2$ norm for our discrete data). Remembering some analysis, Plancherel's theorem says $\|f\|_{L^2} = \|\hat f\|_{L^2}$, so we can work with the Fourier transform if we want. We can approximate the Fourier transform of our exact solution with the discrete Fourier transform of our numerical solution:

$$\hat u(\xi, t) = h\sum_{j=0}^{J} u_j(t)\,e^{-i\xi x_j}$$

and so

$$u_j(t) = \frac{1}{h}\sum_{\xi}\hat u(\xi, t)\,e^{i\xi x_j}$$

I should note here that I am using $x_j = jh$ as the grid points (this is all 1D in space for ease of typing; the principle extends to higher dimensions). We can plug this second relation into our scheme and we get:

$$\sum_{\xi}\frac{\hat u^{n+1} - \hat u^n}{k}\,e^{i\xi jh} = \theta\sum_{\xi}\hat u^n\,\frac{e^{i\xi h} - 2 + e^{-i\xi h}}{h^2}\,e^{i\xi jh} + (1-\theta)\sum_{\xi}\hat u^{n+1}\,\frac{e^{i\xi h} - 2 + e^{-i\xi h}}{h^2}\,e^{i\xi jh}$$

so $\forall$ integer $\xi$:

$$\frac{\hat u^{n+1} - \hat u^n}{k} = \theta\,\frac{e^{i\xi h} - 2 + e^{-i\xi h}}{h^2}\,\hat u^n + (1-\theta)\,\frac{e^{i\xi h} - 2 + e^{-i\xi h}}{h^2}\,\hat u^{n+1}$$

Since $e^{i\xi h} - 2 + e^{-i\xi h} = -4\sin^2(\xi h/2)$, this simplifies to:

$$\hat u^{n+1} - \hat u^n = k\,\theta\,\frac{-4\sin^2(\xi h/2)}{h^2}\,\hat u^n + k\,(1-\theta)\,\frac{-4\sin^2(\xi h/2)}{h^2}\,\hat u^{n+1}$$

and if we let $\lambda = k/h^2$, we at long last see:

$$\hat u^{n+1} = \frac{1 - 4\lambda\theta\sin^2(\xi h/2)}{1 + 4\lambda(1-\theta)\sin^2(\xi h/2)}\,\hat u^n$$

and so this method is stable (in the sense that it agrees that $\|u\|_{L^2}$ is decreasing, as it should) if

$$\left|\frac{1 - 4\lambda\theta\sin^2(\xi h/2)}{1 + 4\lambda(1-\theta)\sin^2(\xi h/2)}\right| \le 1$$

Some straightforward computation reveals that this is true if:

$$\lambda(2\theta - 1) \le \frac{1}{2}$$
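A minimal Python sketch that evaluates the amplification factor above over a range of wave numbers and compares the result with the condition lambda*(2*theta - 1) <= 1/2 (the particular lambda, theta pairs are arbitrary test choices):

import numpy as np

def max_amplification(lam, theta, n=1000):
    s2 = np.sin(np.linspace(0.0, np.pi, n) / 2.0) ** 2      # sin^2(xi h / 2) over [0, 1]
    g = (1 - 4 * lam * theta * s2) / (1 + 4 * lam * (1 - theta) * s2)
    return np.max(np.abs(g))

for lam, theta in [(0.4, 1.0), (0.6, 1.0), (10.0, 0.5), (10.0, 0.0)]:
    predicted_stable = lam * (2 * theta - 1) <= 0.5
    observed_stable = max_amplification(lam, theta) <= 1 + 1e-12
    print(lam, theta, predicted_stable, observed_stable)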


This method of determining stability is called Von Neumann analysis after John von Neumann, who, along with being one of the best mathematicians and scientists of the past 100 years, was a member of the Manhattan Project. In general, Von Neumann analysis involves assuming

$$u_j^n = g^n e^{-ij\xi h}, \qquad g = \frac{\hat u^{n+1}(\xi)}{\hat u^n(\xi)}$$

and we have stability if $|g| \le 1$. This method for determining stability works for finite difference methods applied to linear PDEs, which is something I guess. We can check consistency (i.e. that the truncation error approaches 0 with the grid size) using Taylor expansions (and a lot of scratch paper). It would be great, then, if we could say something about convergence of the method. Luckily, we have:

Theorem 2 (The Lax Equivalence Theorem). A consistent scheme for a well-posed linear problem converges if and only if it is stable.

By convergent, we mean that if $E_k(\cdot,t)$ is the difference between the analytic and numerical solutions to a problem, then $\|E_k\| \to 0$ as $k \to 0$ in some norm $\|\cdot\|$. Formally, call the numerical solution $U_k$ and the analytic solution u; then

$$E_k(\cdot, t) = U_k(\cdot, t) - u(\cdot, t)$$

and the local truncation error is:

$$L_k(x,t) = \frac{1}{k}\left[u(x, t+k) - H_k(u(\cdot,t), x)\right]$$

where $H_k$ is the operator associated with the method (i.e. $H_k(u(\cdot,t),x)$ is the result of using the method to find $U^{n+1}(\cdot)$ when $U^n(\cdot) = u(\cdot,t^n)$ exactly). Then consistency is the condition that $\|L_k\| \to 0$ as $k \to 0$.

Continuing with some more formal definitions: a method is of order p if for all sufficiently smooth initial data with compact support, $\exists$ a constant $C_p$ such that $\|L_k\| \le C_p k^p$ $\forall\, k < k_0$, $t < T$. Formally, a method is stable if for each time T, $\exists\, C_s > 0$ such that $\|(H_k)^n\| \le C_s$ $\forall\, nk \le T$, $k < k_0$, noting that $nk = t^n$. So we see now that in Von Neumann analysis, we can have stability if $|g| \le 1 + ck$.

We can also think about energy in the case of the diffusion equation to determine stability. The problem is:

$$u_t = u_{xx}, \qquad u(x,0) = u_0(x), \qquad u(0,t) = u(1,t) = 0$$

and so multiplying by u, integrating, and integrating by parts gives:

$$\frac{1}{2}\partial_t\int_0^1 u^2\,dx = u u_x\Big|_0^1 - \int_0^1 (u_x)^2\,dx \le 0$$


and so,

$$\|u(\cdot,t)\|_{L^2} \le \|u(\cdot,0)\|_{L^2}$$

This shows that energy in the system should decay (it is the diffusion equation, after all). Because of the 0 Dirichlet boundary condition, this leads to the Maximum Principle: the maximum of u can only be attained along x = 0, x = 1, or t = 0. We can look at $L^\infty$ stability for a finite difference method for a linear PDE rather simply by solving for $u_j^{n+1}$ and using some basic inequalities. $L^2$ stability, on the other hand, is a very long computation. One helpful identity (a summation by parts) in this computation is:

$$\sum_{j=1}^{J-1}u_j(v_{j+1} - 2v_j + v_{j-1}) = u_J(v_J - v_{J-1}) - u_1(v_1 - v_0) - \sum_{j=1}^{J-1}(u_{j+1} - u_j)(v_{j+1} - v_j)$$

The computation is left as an exercise (ha!).

3.3 Methods for Nonlinear Diffusion Equations

Now we deal with a more difficult class of equations, nonlinear diffusion equations, which have the form:

$$u_t = \partial_x\big(D(u,x)\,\partial_x u\big)$$

We will usually just deal with 0 Dirichlet boundary conditions on the interval [0,1], but of course that is not necessary (just simpler). We note that this equation gives a conservative system, because if $\partial_x u|_{x=0,1} = 0$, then

$$\partial_t\int_0^1 u\,dx = 0$$

so that with no flux, total mass is conserved. A scheme that preserves this property is called a conservative scheme (shockingly). A scheme that is not conservative is probably bad. For example, the obvious finite difference scheme for this equation:

$$\partial_t u_j = \frac{D_{j+1} - D_{j-1}}{2h}\,\frac{u_{j+1} - u_{j-1}}{2h} + D_j\,\frac{u_{j+1} - 2u_j + u_{j-1}}{h^2}$$

is not conservative. This isn't good. I can point out here that a good rule of thumb for ANY numerical approximation is to try to keep the analytic properties of the system. A conservative scheme for this system is:

$$\partial_t u_j = \frac{D_j + D_{j+1}}{2}\,\frac{u_{j+1} - u_j}{h^2} - \frac{D_j + D_{j-1}}{2}\,\frac{u_j - u_{j-1}}{h^2}$$

which can be more compactly written as:

$$\partial_t u_j = \frac{D_{j+1/2}\,\partial_x u_{j+1/2} - D_{j-1/2}\,\partial_x u_{j-1/2}}{h}$$

where $D_{j\pm 1/2} = \tfrac{1}{2}(D_j + D_{j\pm 1})$ and $\partial_x u_{j+1/2} = (u_{j+1} - u_j)/h$.
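A minimal Python sketch of one explicit time step of the conservative scheme above with 0 Dirichlet boundaries (the diffusion coefficient D(u,x), the initial data, and the step sizes are placeholder choices, not from the notes):

import numpy as np

def conservative_step(u, x, dt, h, D):
    # one forward Euler step of du_j/dt = (D_{j+1/2}(u_{j+1}-u_j) - D_{j-1/2}(u_j-u_{j-1}))/h^2
    Dval = D(u, x)
    Dhalf = 0.5 * (Dval[:-1] + Dval[1:])      # D_{j+1/2}
    flux = Dhalf * (u[1:] - u[:-1]) / h       # D_{j+1/2} * du/dx at the half points
    unew = u.copy()
    unew[1:-1] = u[1:-1] + dt * (flux[1:] - flux[:-1]) / h
    unew[0] = unew[-1] = 0.0                  # 0 Dirichlet boundary conditions
    return unew

x = np.linspace(0.0, 1.0, 101)
h = x[1] - x[0]
u = np.sin(np.pi * x)                         # placeholder initial data
D = lambda u, x: 1.0 + u**2                   # placeholder nonlinear diffusion coefficient
u = conservative_step(u, x, 0.25 * h**2, h, D)   # small dt for explicit stability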


For stability, we again notice that

$$\int u\,u_t\,dx = \int u\,\partial_x\big(D(u,x)\partial_x u\big)\,dx$$

$$\frac{1}{2}\partial_t\int_0^1 u^2\,dx = -\int_0^1 D\,(\partial_x u)^2\,dx < 0$$

$$\|u\|_{L^2[0,1]} < \|u_0\|_{L^2[0,1]}$$

And so we will call a method stable if this property is kept. To determine numerical stability, Von Neumann analysis no longer works (the problem is not linear). We need to use the energy method. This is quite a lot of computation; I will not include it here.

3.4 The Heat Equation in Higher Dimensions

In two dimensions, the equation becomes

$$u_t = \Delta u$$

For readability, we will denote $\delta_x^2 u = u_{j+1,k} - 2u_{j,k} + u_{j-1,k}$ and $\delta_y^2 u$ similarly. If we use a square, uniform grid, i.e. $\Delta x = \Delta y$, I will use $\lambda = \frac{\Delta t}{(\Delta x)^2} = \frac{\Delta t}{(\Delta y)^2}$. It turns out that for an explicit scheme, the stability condition is:

$$\frac{\Delta t}{(\Delta x)^2} + \frac{\Delta t}{(\Delta y)^2} \le \frac{1}{2}$$

We can use Von Neumann stability analysis; here the Fourier expansion gives:

$$u_{j,k}^n = g^n e^{-i(\xi j\Delta x + \eta k\Delta y)}$$

An implicit scheme in 2D leads to a complicated linear algebra problem. The Crank-Nicolson scheme is:

$$\frac{u_{j,k}^{n+1} - u_{j,k}^n}{\Delta t} = \frac{1}{2(\Delta x)^2}\left(\delta_x^2 u_{j,k}^{n+1} + \delta_y^2 u_{j,k}^{n+1} + \delta_x^2 u_{j,k}^n + \delta_y^2 u_{j,k}^n\right)$$

Alternatively, we have the Alternating Direction Implicit (ADI) method (see what I did there?):

$$u^{n+1/2} - u^n = \tfrac{1}{2}\lambda\left(\delta_x^2 u^{n+1/2} + \delta_y^2 u^n\right)$$

$$u^{n+1} - u^{n+1/2} = \tfrac{1}{2}\lambda\left(\delta_x^2 u^{n+1/2} + \delta_y^2 u^{n+1}\right)$$

The advantage here is that in each half-step, only one derivative is implicit. This gives two tridiagonal systems, which are easy to solve. This is a 2nd order, unconditionally stable method (shown using Von Neumann analysis). It is second order because it can be shown to be a fourth order correction to the Crank-Nicolson scheme. The ADI method can be extended to 3D and is still unconditionally stable:

$$u^{*} - u^n = \frac{\lambda}{2}\left[\delta_x^2(u^* + u^n) + \delta_y^2(2u^n) + \delta_z^2(2u^n)\right]$$

$$u^{**} - u^n = \frac{\lambda}{2}\left[\delta_x^2(u^* + u^n) + \delta_y^2(u^{**} + u^n) + \delta_z^2(2u^n)\right]$$

$$u^{n+1} - u^n = \frac{\lambda}{2}\left[\delta_x^2(u^* + u^n) + \delta_y^2(u^{**} + u^n) + \delta_z^2(u^{n+1} + u^n)\right]$$

Computations which eliminate $u^*$ and $u^{**}$ show that this is second order as well. In general, we can deal with an arbitrarily high number of dimensions with the Fractional Step Method. The problem is then:

$$u_t = Au, \qquad A = A_1 + A_2 + \cdots + A_q, \qquad A_i = \partial_{x_i x_i}$$

We can attack this in q steps, dealing with only one dimension at a time, with a centred difference operator $B_i$ approximating $A_i$:

$$B_i = \frac{1}{(\Delta x)^2}\,\delta_{x_i}^2$$

so that we have:

$$\frac{u^{n+i/q} - u^{n+(i-1)/q}}{\Delta t} = B_i\,\frac{u^{n+i/q} + u^{n+(i-1)/q}}{2}$$

This is a Crank-Nicolson scheme in each direction, and so is unconditionally stable, and it is also 2nd order. A more general problem is

$$u_t = u_{xx} + f(u)$$

Here we can use a fractional step method:

$$u_t = u_{xx} \quad (t^n \to t^{n+1/2}), \qquad u_t = f(u) \quad (t^{n+1/2} \to t^{n+1})$$

This is a consistent splitting, and the splitting error is first order, so whatever method we choose to solve each half-step, overall we will have a first order method. To see that it is consistent with first order error, we do the following computation with the operators $Au = \partial_{xx}u$ and $Bu = f(u)$:

$$u_t = Au + Bu = (A+B)u$$
$$u(t) = e^{(A+B)t}u_0, \qquad u^{n+1} = e^{(A+B)\Delta t}u^n$$
$$U^{n+1/2} = e^{A\Delta t}u^n, \qquad U^{n+1} = e^{B\Delta t}U^{n+1/2} = e^{B\Delta t}e^{A\Delta t}u^n$$
$$U^{n+1} - u^{n+1} = \left(e^{B\Delta t}e^{A\Delta t} - e^{(A+B)\Delta t}\right)u^n$$

Now we Taylor expand the exponentials and simplify to see that:

$$U^{n+1} - u^{n+1} = \frac{(\Delta t)^2}{2}(BA - AB)\,u^n + O((\Delta t)^3)$$


And so the overall error is first order. We can improve this to 2nd order splitting error using Strang splitting, which uses one half step of A, one step of B, and one half step of A. This corresponds to the operator

$$e^{\frac{\Delta t}{2}A}\,e^{\Delta t B}\,e^{\frac{\Delta t}{2}A}$$

and

$$e^{\frac{\Delta t}{2}A}\,e^{\Delta t B}\,e^{\frac{\Delta t}{2}A} - e^{\Delta t(A+B)} = O((\Delta t)^3)$$

so this is second order splitting. Also, in implementation this amounts to n-1 steps of simple splitting, begun and ended with a half step of A.
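A minimal Python sketch of one Strang-split step for u_t = u_xx + f(u), with a forward Euler diffusion sub-step and an exact reaction sub-step for the placeholder choice f(u) = -u (the inner solvers, initial data, and step sizes are illustrative assumptions, and the low-order diffusion sub-step is only there to show the A/2, B, A/2 structure):

import numpy as np

def diffusion_step(u, dt, h):
    # forward Euler for u_t = u_xx with 0 Dirichlet boundaries (illustrative only)
    unew = u.copy()
    unew[1:-1] = u[1:-1] + dt * (u[2:] - 2 * u[1:-1] + u[:-2]) / h**2
    unew[0] = unew[-1] = 0.0
    return unew

def reaction_step(u, dt):
    # exact solve of u_t = f(u) for the placeholder f(u) = -u
    return u * np.exp(-dt)

def strang_step(u, dt, h):
    u = diffusion_step(u, 0.5 * dt, h)   # half step of A
    u = reaction_step(u, dt)             # full step of B
    u = diffusion_step(u, 0.5 * dt, h)   # half step of A
    return u

x = np.linspace(0.0, 1.0, 101)
h = x[1] - x[0]
u = np.sin(np.pi * x)
for _ in range(100):
    u = strang_step(u, 0.25 * h**2, h)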

3.5 Finite Difference Methods for Hyperbolic Equations

The simplest hyperbolic equation is:

$$u_t + au_x = 0, \qquad u(x,0) = u_0(x)$$

which can be solved analytically using the method of characteristics, and so will be nice for setting up some theory of these methods. The solution is $u(x,t) = u_0(x - at)$, so the initial data propagates at speed a. This advection equation can also be a system:

$$U_t + AU_x = 0$$

with $U \in \mathbb{R}^d$ and $A \in \mathbb{R}^{d\times d}$. The equation is called hyperbolic if A has d distinct, real eigenvalues. This means that $A = T\Lambda T^{-1}$ is diagonalizable (so $\Lambda$ is diagonal), so we can solve this system. In fact, if $V = T^{-1}U$, we have:

$$V_t + \Lambda V_x = 0, \qquad v_i(x,t) = v_i(x - \lambda_i t, 0), \qquad U = TV$$

A first guess at a finite difference approach to these problems would be

$$\frac{u_j^{n+1} - u_j^n}{k} + a\,\frac{u_{j+1}^n - u_{j-1}^n}{2h} = 0$$

and indeed this is first order in time and second order in space. However, Von Neumann analysis shows:

$$\frac{g - 1}{k} + a\,\frac{e^{-i\xi h} - e^{i\xi h}}{2h} = 0$$

$$g = 1 + \frac{ak}{h}\,i\sin(\xi h)$$

$$|g|^2 = 1 + \left(\frac{ak}{h}\right)^2\sin^2(\xi h) > 1$$

so this is unconditionally unstable. However, we can use

$$\frac{u_j^{n+1} - \frac{u_{j+1}^n + u_{j-1}^n}{2}}{k} + a\,\frac{u_{j+1}^n - u_{j-1}^n}{2h} = 0$$

Here (I will skip the calculation, but it is good practice of Von Neumann stability analysis if you've forgotten how to do it) we have

$$|g|^2 = 1 - \left(1 - a^2\left(\frac{k}{h}\right)^2\right)\sin^2(\xi h)$$

and so $\forall\xi$, $|g|^2 \le 1 \Leftrightarrow |a|\frac{k}{h} \le 1$. We call $\lambda = |a|\frac{k}{h}$ the Courant number or CFL number, and this particular scheme is the Lax-Friedrichs scheme, which we will have to return to in more complicated cases (after all, right now we are talking about a problem that is easy to solve analytically!). The stability condition is $0 \le \lambda \le 1$. We can understand this condition in a physical sense if we remember that hyperbolic equations (our example is the advection equation and is simple, but a more important example is the wave equation) allow information to propagate at finite speeds (in contrast to parabolic equations). In this case that speed is a. This means that our solution should not "see" information from too far away. The area of influence on the solution is called the domain of dependence, and the statement that $0 \le \lambda \le 1$ is the statement that the domain of dependence is inside the numerical domain of dependence, i.e. the set of information used in the numerical method. Another approach we could take is a one-sided difference approximation, known as an upwind or downwind method:

downwind: $\dfrac{u_j^{n+1} - u_j^n}{k} + a\,\dfrac{u_{j+1}^n - u_j^n}{h} = 0$

upwind: $\dfrac{u_j^{n+1} - u_j^n}{k} + a\,\dfrac{u_j^n - u_{j-1}^n}{h} = 0$

Von Neumann analysis reveals that the stability condition is the CFL condition, i.e. $\lambda \le 1$, depending on the propagation direction: if a > 0, the downwind scheme is unconditionally unstable, and if a < 0 the upwind scheme is unconditionally unstable. This makes sense physically: the downwind scheme should only work if information is propagating to the left (backwards or "down wind"), and the upwind scheme should only work if information is propagating to the right (forwards). This is because each scheme only uses information from the left or right, but not both. So far all of these methods have been first order. For a second order method we can use the Lax-Wendroff scheme, which comes from the Taylor series (as all of these kind of do) and the equation itself:

$$u_j^{n+1} = u_j^n + k\,\partial_t u_j^n + \frac{k^2}{2}\,\partial_{tt}u_j^n + O(k^3)$$

Using $u_t = -au_x$ and $u_{tt} = a^2 u_{xx}$,

$$u_j^{n+1} = u_j^n - ak\,\partial_x u_j^n + \frac{a^2k^2}{2}\,\partial_{xx}u_j^n + O(k^3)$$

$$u_j^{n+1} = u_j^n - ak\,\frac{u_{j+1}^n - u_{j-1}^n}{2h} + \frac{a^2k^2}{2}\,\frac{u_{j+1}^n - 2u_j^n + u_{j-1}^n}{h^2}$$

This is a second order method because the local truncation error is $O(k^3 + kh^2)$, and Von Neumann analysis (all these chances to practice that!) shows that this is stable under the CFL condition. (I would also like to point out that this derivation is a good clue as to how we are going to develop numerical methods for more interesting hyperbolic equations.) One thing that comes up with a second order scheme is a problem with discontinuities (here they can only come from the initial data; later they will develop on their own). We see oscillations at discontinuities in second order schemes. Discontinuities cause some problems in first order methods as well. Considering the upwind scheme, using Taylor series to find the local truncation error shows it to be a first order approximation to $u_t + au_x = 0$, but a second order approximation to

$$\partial_t u + a\,\partial_x u = \frac{ah}{2}\left(1 - \frac{ak}{h}\right)u_{xx}$$

and the CFL condition implies a positive diffusion coefficient. So this method will tend to smooth out the solution, which will of course cause discontinuities to be destroyed. This is true of other methods; for example the Lax-Friedrichs scheme has a slightly larger diffusion coefficient. This phenomenon is known as numerical diffusion or numerical dissipation. Similar truncation error analysis of the Lax-Wendroff scheme reveals it to be a third order approximation to

$$u_t + au_x = -\frac{1}{6}ah^2\left(1 - \left(\frac{ak}{h}\right)^2\right)u_{xxx}$$

We say that this is a dispersive scheme. There is a general dispersion relation for a differential equation (more on this in my applied math notes), obtained using the ansatz

$$u \sim e^{i(\omega t - \xi x)}$$

where $\omega$ is the frequency and $\xi$ is the wave number. Plugging this into the equation from the Lax-Wendroff truncation error analysis leads to

$$i\omega - ai\xi = -\frac{1}{6}ah^2\left(1 - \left(\frac{ak}{h}\right)^2\right)(i\xi^3)$$

Solving for $\omega$ gives the dispersion relation

$$\omega(\xi) = a\xi - \frac{1}{6}ah^2\left(1 - \left(\frac{ak}{h}\right)^2\right)\xi^3$$

For the upwind scheme, the dispersion relation is

$$\omega(\xi) = a\xi + i\,\frac{ah}{2}\left(1 - \frac{ak}{h}\right)\xi^2$$

and in general we have (in 1D)

$$\frac{\omega(\xi)}{\xi} = \text{phase velocity}, \qquad \omega'(\xi) = \text{group velocity}$$


It is important to realize that the dispersion relation is a property of the PDE, not the scheme. We associate a dispersion relation with the scheme when the scheme will act more like some PDE than the one it was designed for (e.g. the upwind scheme being a second order approximation to the diffusive PDE). The dispersion relation reveals damping and oscillations.
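To see the numerical diffusion of the first order schemes and the dispersive oscillations of Lax-Wendroff side by side, here is a minimal Python sketch advecting a step profile with a > 0 and Courant number below 1 on a periodic grid (the grid, step count, and initial data are my own choices):

import numpy as np

a, h, nu = 1.0, 0.01, 0.8            # speed, grid size, Courant number
x = np.arange(0.0, 1.0, h)
u0 = np.where(x < 0.3, 1.0, 0.0)     # step initial data; np.roll gives periodic boundaries

def step_upwind(u):
    return u - nu * (u - np.roll(u, 1))

def step_lax_friedrichs(u):
    return 0.5 * (np.roll(u, -1) + np.roll(u, 1)) - 0.5 * nu * (np.roll(u, -1) - np.roll(u, 1))

def step_lax_wendroff(u):
    return (u - 0.5 * nu * (np.roll(u, -1) - np.roll(u, 1))
              + 0.5 * nu**2 * (np.roll(u, -1) - 2 * u + np.roll(u, 1)))

uu, uf, uw = u0.copy(), u0.copy(), u0.copy()
for _ in range(40):
    uu, uf, uw = step_upwind(uu), step_lax_friedrichs(uf), step_lax_wendroff(uw)
# uu and uf smear the jump (diffusion); uw stays sharper but oscillates behind it (dispersion)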

3.6 Nonlinear Hyperbolic Equations

We will rarely care about the advection equation, as it is simple and can generally be dealt with analytically. A more interesting problem is a nonlinear advection-type equation:

$$u_t + f(u)_x = 0$$

where $f : \mathbb{R}^n \to \mathbb{R}^n$ is the flux. If the Jacobian matrix $f'(u)$ has n real eigenvalues and a complete set of n linearly independent eigenvectors, this equation is hyperbolic. This form of the equation is called a conservation law, with u being the conserved quantity, because (assuming the flux vanishes at the boundaries)

$$\partial_t\int u\,dx = 0$$

An example of a hyperbolic system that is nonlinear is the Euler equations for gas dynamics:

$$\partial_t\rho + \partial_x(\rho u) = 0 \quad \text{(conservation of mass)}$$
$$\partial_t(\rho u) + \partial_x(\rho u^2 + p) = 0 \quad \text{(conservation of momentum)}$$
$$\partial_t\left(\tfrac{1}{2}\rho u^2 + \rho e\right) + \partial_x\left(\left(\tfrac{1}{2}\rho u^2 + \rho e + p\right)u\right) = 0 \quad \text{(conservation of energy)}$$

so that

$$U = \begin{pmatrix}\rho \\ \rho u \\ \tfrac{1}{2}\rho u^2 + \rho e\end{pmatrix}, \qquad f(U) = \begin{pmatrix}\rho u \\ \rho u^2 + p \\ \left(\tfrac{1}{2}\rho u^2 + \rho e + p\right)u\end{pmatrix}$$

The Jacobian of the system has 3 eigenvalues, u and $u \pm c$, where c is the speed of sound, which depends on $\rho$, p, and e. Another example (which is less unwieldy and so easier to use for demonstrative purposes) is the inviscid Burgers equation

$$u_t + \left(\frac{u^2}{2}\right)_x = 0$$


where u is the wave speed. We can use the method of characteristics on these equations as well, and we see that the characteristics have equations

$$x(t) = x_0 + t\,u_0(x_0)$$

and $u(x,t) = u_0(x_0)$ for some $x_0$. These characteristics run in straight lines. However, each characteristic's slope depends on $u_0(x_0)$, and so they may cross. When characteristics cross, we lose uniqueness of the solution. This is called a shock. Also possible are rarefactions, in which characteristics separate, leaving "gaps" in the solution. When characteristics meet, $\partial u/\partial x \to \infty$, and because

$$\frac{\partial u}{\partial x} = \frac{du_0}{dx}\frac{\partial x_0}{\partial x} = \frac{u_0'(x_0)}{1 + t\,u_0'(x_0)}$$

this implies that

$$1 + t_b\,u_0'(x_0) = 0$$

where $t_b$ is the time the discontinuity occurs. If this time is positive, i.e.

$$t_b = \frac{-1}{u_0'(x_0)} > 0$$

then a shock will occur. So, if there is some $x_0$ with $u_0'(x_0) < 0$, there will be a shock. We can also calculate the shock time by

$$t_b = \min_{x_0}\left(\frac{-1}{u_0'(x_0)}\right)$$

At a shock (or a rarefaction) the strong solution fails to exist. However, we can still discuss a weak solution to the equation. Consider

$$\phi(x,t)\left(u_t + f(u)_x\right) = 0$$

for a test function $\phi(x,t) \in C_0^1$, and the integral

$$\int_{\mathbb{R}}\int_0^\infty \phi\left(u_t + f(u)_x\right)dt\,dx = 0$$

We then integrate by parts, and say that u is a weak solution to $u_t + f(u)_x = 0$ if for any $\phi \in C_0^\infty$,

$$-\int_{\mathbb{R}}\phi(x,0)\,u_0(x)\,dx - \int_{\mathbb{R}}\int_0^\infty\left(u\,\phi_t + f(u)\,\phi_x\right)dt\,dx = 0$$

holds. If u is a strong solution, it must also be a weak solution, but the converse is not true. The following theorem is important in determining what happens at a shock:

Theorem 3. If u is a weak solution to $u_t + f(u)_x = 0$, then at a discontinuity, $[f(u)] = s[u]$.

Here, $[\alpha] = \lim_{x\to x_s^+}\alpha(x) - \lim_{x\to x_s^-}\alpha(x)$. This theorem, the Rankine-Hugoniot jump condition, tells us the speed of the shock, s:

$$s = \frac{[f(u)]}{[u]}$$


This all follows from Green's theorem. So, considering the inviscid Burgers equation with Riemann initial data

$$u_t + \left(\frac{u^2}{2}\right)_x = 0, \qquad u(x,0) = \begin{cases}u_L & x < 0 \\ u_R & x > 0\end{cases}$$

we can look for a self-similar solution, which has the form

$$u(x,t) = u\!\left(\frac{x}{t}\right) = u(\xi)$$

Then the equation is reduced to

$$(u - \xi)\,u_\xi = 0$$

So the solution is either a constant or $u = \xi$. If $u_L > u_R$, we see that a shock forms with speed

$$s = \frac{u_R^2/2 - u_L^2/2}{u_R - u_L} = \frac{1}{2}(u_R + u_L)$$

and so the solution is simply

$$u(x,t) = \begin{cases}u_L & x < st \\ u_R & x > st\end{cases}$$

However, if $u_L < u_R$, the solution is

$$u(x,t) = \begin{cases}u_L & x < u_L t \\ \frac{x}{t} & u_L t < x < u_R t \\ u_R & x > u_R t\end{cases}$$

This is one of infinitely many weak solutions, of course, but this one is the solution derived from the self-similar equations. We can select a particular weak solution by considering the Burgers equation with viscosity:

$$u_t + \left(\frac{u^2}{2}\right)_x = \varepsilon\,u_{xx}$$

This parabolic equation has a unique solution $u^\varepsilon$ which depends on $\varepsilon$. We can then take $\lim_{\varepsilon\to 0}u^\varepsilon$ to arrive at the viscosity solution to the inviscid Burgers equation. Yet another way to deal with the non-uniqueness is through an entropy condition. If $\lambda(u)$ is an eigenvalue of $f'(u)$, the Lax entropy condition is that, at a shock,

$$\lambda(u_L) > s > \lambda(u_R)$$

This condition can indicate whether there is a shock or a rarefaction, and under this condition the inviscid Burgers equation can only have a shock if $u_L > u_R$; otherwise it has a rarefaction. There are other entropy conditions. For example, if there exists a convex $\Phi(u)$ and a $\Psi(u)$ such that

$$\Psi'(u) = \Phi'(u)\,f'(u)$$

they are called a convex entropy $\Phi$ and entropy flux $\Psi$, and they must satisfy

$$\partial_t\Phi(u) + \partial_x\Psi(u) \le 0$$

For the inviscid Burgers equation, the viscosity solution can be shown to satisfy this condition.
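A minimal Python sketch of the entropy solution of the Burgers Riemann problem above (a shock when uL > uR, a rarefaction fan when uL < uR), evaluated at the self-similar variable xi = x/t:

def burgers_riemann(xi, uL, uR):
    # entropy solution u(x,t) = u(x/t) of u_t + (u^2/2)_x = 0 with Riemann data uL, uR
    if uL > uR:                      # shock moving at the Rankine-Hugoniot speed
        s = 0.5 * (uL + uR)
        return uL if xi < s else uR
    if xi < uL:                      # rarefaction fan between uL*t and uR*t
        return uL
    if xi > uR:
        return uR
    return xi

# example: uL = 1, uR = 0 gives a shock of speed 1/2
print([burgers_riemann(xi, 1.0, 0.0) for xi in (-1.0, 0.25, 0.75)])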

In order to solve nonlinear hyperbolic equations numerically it is necessary to consider the concepts of bounded variation and total variation. We will not delve deeply into these concepts, but rather confine ourselves to their relevance to numerical methods. Total variation can be defined

$$TV(u(\cdot,t)) = \sum_{i=1}^{d}\limsup_{\varepsilon\to 0}\frac{1}{\varepsilon}\int_{\mathbb{R}^d}\left|u(x + \varepsilon e_i, t) - u(x, t)\right|dx$$

The total variation can be thought of as a measurement of the oscillations of a function u. Bounded variation means

$$BV(\mathbb{R}^d) = \{f : TV(f) \le C\}$$

If we have the problem

$$u_t + \sum_{i=1}^{d}\partial_{x_i}f_i(u) = 0, \qquad u(x,0) = u_0(x) \in L^\infty(\mathbb{R}^d)$$

then given $\Phi$ and $\Psi$, there is a unique u such that the entropy condition is satisfied and $\|u(\cdot,t)\|_\infty \le \|u_0\|_\infty$, and if

$$u_0 \in L^\infty(\mathbb{R}^d)\cap BV(\mathbb{R}^d)$$

then

$$u(\cdot,t)\in BV(\mathbb{R}^d), \qquad TV(u(\cdot,t)) \le TV(u_0)$$

3.7 Numerical Solutions to Nonlinear Hyperbolic Equations - Conservative Schemes

Again, we are trying to solve

$$u_t + f(u)_x = 0$$

The general form of a conservative scheme for hyperbolic equations is

$$\partial_t u_j + \frac{F_{j+1/2} - F_{j-1/2}}{\Delta x} = 0$$

These are the schemes we want. A theorem from Lax & Wendroff states:


Theorem 4. A consistent, conservative scheme, if convergent, converges to a weak solution of the nonlinear conservation law.

Clearly, it is only necessary to know the numerical flux $F_{j\pm 1/2}$ to know a conservative scheme. However, that is not always a great way to implement the scheme! In general, we will use the notation $f_j = f(u_j)$ and

$$u_j = \frac{1}{\Delta x}\int_{x_{j-1/2}}^{x_{j+1/2}}u(x)\,dx$$

or, for simplicity, $u_j = u(x_j)$.

3.7.1 Lax-Friedrichs Scheme

The Lax-Friedrichs scheme is

$$F_{j+1/2} = \frac{f_j + f_{j+1}}{2} - \frac{\Delta x}{2\Delta t}\,(u_{j+1} - u_j)$$

The local Lax-Friedrichs scheme is

$$F_{j+1/2} = \frac{f_j + f_{j+1}}{2} - \frac{\sqrt{a_{j+1/2}}}{2}\,(u_{j+1} - u_j)$$

where $\sqrt{a_{j+1/2}} = \max_{u\in(u_j,u_{j+1})}|f'(u)|$, or the largest eigenvalue of $f'(u)$ (in absolute value) if we are dealing with a system of equations.
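A minimal Python sketch of a conservative update using the local Lax-Friedrichs flux above, written for Burgers' flux on a periodic grid (the flux, grid, and CFL factor are illustrative choices; for this convex flux the maximum of |f'| over the interval is attained at an endpoint):

import numpy as np

f = lambda u: 0.5 * u**2             # illustrative flux (Burgers)
df = lambda u: u                     # f'(u)

def local_lax_friedrichs_step(u, dt, dx):
    ul, ur = u, np.roll(u, -1)                               # u_j and u_{j+1} (periodic)
    alpha = np.maximum(np.abs(df(ul)), np.abs(df(ur)))       # local wave speed sqrt(a)_{j+1/2}
    F = 0.5 * (f(ul) + f(ur)) - 0.5 * alpha * (ur - ul)      # F_{j+1/2}
    return u - dt / dx * (F - np.roll(F, 1))                 # (F_{j+1/2} - F_{j-1/2}) / dx

x = np.linspace(0.0, 1.0, 200, endpoint=False)
dx = x[1] - x[0]
u = np.sin(2 * np.pi * x)
for _ in range(100):
    dt = 0.4 * dx / max(np.abs(u).max(), 1e-12)              # CFL-limited time step
    u = local_lax_friedrichs_step(u, dt, dx)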

3.7.2 Lax-Wendroff Scheme

The Lax-Wendroff scheme is

$$F_{j+1/2} = \frac{f(u_{j+1}) + f(u_j)}{2} - \frac{\Delta t}{2\Delta x}\,f'(u_{j+1/2})\left(f(u_{j+1}) - f(u_j)\right)$$

where $u_{j+1/2}$ is the average of $u_j$ and $u_{j+1}$. It is not obvious why this scheme is consistent, but the derivation of this scheme will make it clear. We can start, as usual, with a Taylor expansion:

$$u_j^{n+1} = u_j^n + \partial_t u_j^n\,\Delta t + \partial_{tt}u_j^n\,\frac{\Delta t^2}{2} + O(\Delta t^3)$$

We then use the PDE:

$$u_t = -f'(u)u_x = -f(u)_x$$
$$u_{tt} = -(f(u)_x)_t = -(f(u)_t)_x = -(f'(u)u_t)_x = \left(f'(u)\,\partial_x f(u)\right)_x$$

making the substitution $u_t = -f(u)_x$ from the PDE in the last equality. This gives us an expression for $u_{tt}$ that we can substitute into the Taylor expansion to get rid of the time derivatives in favour of spatial derivatives:

$$u_j^{n+1} = u_j^n - \Delta t\,\partial_x f(u_j^n) + \frac{\Delta t^2}{2}\left(f'(u_j^n)\,\partial_x f(u_j^n)\right)_x + O(\Delta t^3)$$

Then we can use standard finite difference approximations for $\partial_x f(u_j^n)$ and so on. To keep the scheme second order (as it can be seen to be so far from the Taylor expansion) we need centred difference approximations. Making these substitutions and rearranging terms gives:

$$\frac{u_j^{n+1} - u_j^n}{\Delta t} + \frac{f(u_{j+1}^n) - f(u_{j-1}^n)}{2\Delta x} = \frac{\Delta t}{2}\,\frac{f'(u_{j+1/2}^n)\,\frac{f(u_{j+1}^n) - f(u_j^n)}{\Delta x} - f'(u_{j-1/2}^n)\,\frac{f(u_j^n) - f(u_{j-1}^n)}{\Delta x}}{\Delta x}$$

3.7.3 The Godunov Scheme

The nonlinear upwind scheme is called the Godunov scheme. The basic idea of this scheme is to divide the problem into small spatial regions on which the problem is a Riemann problem. Then the solution to the Riemann problem (either analytic, or making use of a Riemann solver) is used. We will denote the solution of the local Riemann problems by $u^n(x, *)$. To create a set of local Riemann problems, we need piecewise constant data, so we approximate u(x):

$$u_j^n = \frac{1}{\Delta x}\int_{x_{j-1/2}}^{x_{j+1/2}}u(x)\,dx$$

where u(x) is the exact solution, and we set $u^n(x, t^n) := u_j^n$ for $x_{j-1/2} < x < x_{j+1/2}$. This solution is then evolved for $t^n < t < t^{n+1}$. The solution from our method at each time step $t^n$ is

$$U_j^n = \frac{1}{\Delta x}\int_{x_{j-1/2}}^{x_{j+1/2}}u^n(x, t^n)\,dx$$

and likewise,

$$U_j^{n+1} = \frac{1}{\Delta x}\int_{x_{j-1/2}}^{x_{j+1/2}}u^n(x, t^{n+1})\,dx$$

meaning we take our solution and approximate it as piecewise constant again so that we have another set of local Riemann problems. Step by step, this means we

1. Approximate the data as piecewise constant ($U_j^n$)

2. Evolve the data as a set of local Riemann problems ($u^n(x, *)$)

3. Average this solution to get piecewise constant data again ($U_j^{n+1}$)


For the problem of nonlinear advection, we have

$$\partial_t u^n + \partial_x f(u^n) = 0$$

for $t^n < t < t^{n+1}$. Integrating,

$$\frac{1}{\Delta t\,\Delta x}\int_{t^n}^{t^{n+1}}\int_{x_{j-1/2}}^{x_{j+1/2}}\left(\partial_t u^n + \partial_x f(u^n)\right)dx\,dt = 0$$

and so

$$\frac{\frac{1}{\Delta x}\int_{x_{j-1/2}}^{x_{j+1/2}}u^n(x, t^{n+1})\,dx - \frac{1}{\Delta x}\int_{x_{j-1/2}}^{x_{j+1/2}}u^n(x, t^n)\,dx}{\Delta t} + \frac{\frac{1}{\Delta t}\int_{t^n}^{t^{n+1}}f(u^n(x_{j+1/2}, t))\,dt - \frac{1}{\Delta t}\int_{t^n}^{t^{n+1}}f(u^n(x_{j-1/2}, t))\,dt}{\Delta x} = 0$$

or, more simply,

$$\frac{u_j^{n+1} - u_j^n}{\Delta t} + \frac{F_{j+1/2}^n - F_{j-1/2}^n}{\Delta x} = 0$$

where

$$F_{j+1/2}^n = \frac{1}{\Delta t}\int_{t^n}^{t^{n+1}}f(u^n(x_{j+1/2}, t))\,dt$$

This is an upwind scheme, and its construction makes it clear that it will only be valid as long as the local Riemann problems do not interact, that is, the local pieces of the piecewise data do not collide. Because the data moves at a speed of at most $\max|f'(u)|$, this method will be valid as long as

$$\Delta t < \frac{\Delta x}{\max|f'(u)|}$$

i.e., before neighbouring waves arrive. This is the CFL condition again. To see this, we can examine the solution of the Riemann problems. The solution is self-similar, so, locally,

$$u^n(x, t) = u^n\!\left(\frac{x - x_{j+1/2}}{t}\right)$$

so at $x = x_{j+1/2}$ the solution does not depend on time t. We therefore have

$$F_{j+1/2}^n = f(u^n(x_{j+1/2})) = f\big(R(0; u_j^n; u_{j+1}^n)\big)$$

where $R\!\left(\frac{x}{t}; u_j^n; u_{j+1}^n\right)$ solves

$$u_t + f(u)_x = 0, \qquad u(x,0) = \begin{cases}u_j^n & x < 0 \\ u_{j+1}^n & x > 0\end{cases}$$


This solution is then only valid before waves begin to intersect. We may also use this notation to say

$$F_{j+1/2} = f\big(R(0; u_j; u_{j+1})\big)$$

We can either use the analytic solution to the Riemann problem, or use a Riemann solver. This is a first order accurate scheme due to the approximation at each step of the data as piecewise constant (i.e. averaging over small intervals). The Godunov scheme gives better resolution than the Lax-Friedrichs scheme due to less numerical dissipation. The viscosity-modified equation is again

$$\partial_t u + \partial_x f(u) = C\,\Delta x\,u_{xx}$$

The difference is in the constant C, and $C_{upwind} < C_{Lax-Friedrichs}$.

The Godunov scheme satisfies the entropy condition as long as f(u) is convex. If f(u) is convex,

$$F_{j+1/2} = f\big(R(0; u_l; u_r)\big) = \begin{cases}f(u_l) & u_l > u_0 \text{ and } s > 0 \\ f(u_r) & u_r < u_0 \text{ and } s < 0 \\ f(u_0) & u_l < u_0 < u_r\end{cases}$$

where $u_0$ is the sonic point, i.e. $f'(u_0) = 0$, and $s = \frac{[f]}{[u]}$ is the shock speed.
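For a convex flux this case analysis can also be written as a min/max formula (a standard equivalent form, not spelled out in the notes); here is a minimal Python sketch of the resulting Godunov flux, specialized to Burgers' equation as an illustrative choice:

def godunov_flux_burgers(ul, ur):
    # Godunov flux for f(u) = u^2/2: min of f over [ul, ur] if ul <= ur, max over [ur, ul] otherwise
    f = lambda u: 0.5 * u**2
    if ul <= ur:
        # rarefaction case: the sonic point u0 = 0 may lie inside the interval
        return 0.0 if ul <= 0.0 <= ur else min(f(ul), f(ur))
    # shock case
    return max(f(ul), f(ur))

print(godunov_flux_burgers(1.0, -1.0))   # stationary shock: flux 0.5
print(godunov_flux_burgers(-1.0, 1.0))   # transonic rarefaction: flux f(u0) = 0.0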

Taking a convex $\Phi(u)$, we need $\Psi(u)$ so that $\Psi'(u) = \Phi'(u)f'(u)$ and $\partial_t\Phi(u) + \partial_x\Psi(u) \le 0$. If

$$\partial_t\Phi(u^n) + \partial_x\Psi(u^n) \le 0$$

for $t^n \le t \le t^{n+1}$, then

$$\frac{1}{\Delta x\,\Delta t}\int_{t^n}^{t^{n+1}}\int_{x_{j-1/2}}^{x_{j+1/2}}\left(\partial_t\Phi(u^n) + \partial_x\Psi(u^n)\right)dx\,dt =$$
$$\frac{\frac{1}{\Delta x}\int_{x_{j-1/2}}^{x_{j+1/2}}\Phi(u^n(x,t^{n+1}))\,dx - \frac{1}{\Delta x}\int_{x_{j-1/2}}^{x_{j+1/2}}\Phi(u^n(x,t^n))\,dx}{\Delta t} + \frac{\frac{1}{\Delta t}\int_{t^n}^{t^{n+1}}\Psi(u^n(x_{j+1/2},t))\,dt - \frac{1}{\Delta t}\int_{t^n}^{t^{n+1}}\Psi(u^n(x_{j-1/2},t))\,dt}{\Delta x} \le 0$$

and so

$$\frac{\Phi(u_j^{n+1}) - \Phi(u_j^n)}{\Delta t} + \frac{\Psi\big(R(0; u_j^n; u_{j+1}^n)\big) - \Psi\big(R(0; u_{j-1}^n; u_j^n)\big)}{\Delta x} \le 0$$

because, by Jensen's inequality ($\Phi$ is convex),

$$\frac{1}{\Delta x}\int_{x_{j-1/2}}^{x_{j+1/2}}\Phi(u^n(x, t^{n+1}))\,dx \ge \Phi\!\left(\frac{1}{\Delta x}\int_{x_{j-1/2}}^{x_{j+1/2}}u^n(x, t^{n+1})\,dx\right)$$

and by definition

$$\Phi\!\left(\frac{1}{\Delta x}\int_{x_{j-1/2}}^{x_{j+1/2}}u^n(x, t^{n+1})\,dx\right) = \Phi(u_j^{n+1})$$

This scheme thus satisfies the discrete entropy inequality.

It may be necessary to use an approximate Riemann solver if the Riemann problem cannot be solved exactly. Roe's approximate Riemann solver linearizes the flux of the system so that, for a constant matrix A,

$$f = Au$$

Local linearization gives

$$u_t + A(u_l, u_r)\,u_x = 0$$

A must have n real eigenvalues and a complete set of eigenvectors $(\lambda_i, \gamma_i)$, and should approximate the Jacobian. The requirements are, in general,

• $A(u_l, u_r)(u_r - u_l) = f(u_r) - f(u_l)$

• A is diagonalizable with real eigenvalues

• $A(u_l, u_r) \to f'(u)$ as $u_l, u_r \to u$

At a shock, $f(u_r) - f(u_l) = s(u_r - u_l)$, so $\exists\,\lambda_i$ such that $A(u_r - u_l) = \lambda_i(u_r - u_l)$ with $s = \lambda_i$ (the shock speed). For a scalar equation,

$$a = \frac{f(u_r) - f(u_l)}{u_r - u_l}$$

To analyse the scheme, we note that a full set of eigenvectors gives a basis for the space we are working in, so

$$u_r - u_l = \sum_{p=1}^{n}\alpha_p\gamma_p$$

and so

$$R(\xi; u_l; u_r) = u(\xi) = u_l + \sum_{\lambda_p < \xi}\alpha_p\gamma_p = u_r - \sum_{\lambda_p > \xi}\alpha_p\gamma_p$$

This can of course be recognized as a characteristic decomposition of the linearized problem. As a result, the Roe solver gives

$$F_{Roe} = f(u_l) + \sum_{\lambda_p < 0}\alpha_p\lambda_p\gamma_p = f(u_r) - \sum_{\lambda_p > 0}\alpha_p\lambda_p\gamma_p$$

so

$$F_{j+1/2} = F_{Roe}(u_j, u_{j+1}), \qquad A(u_j, u_{j+1})\,\gamma_{p_{j+1/2}} = \lambda_{p_{j+1/2}}\,\gamma_{p_{j+1/2}}$$

(You have to love those subscripts on subscripts.) We now have to determine, once and for all, how to choose A. Consider $\eta\in[0,1]$ and

$$q(\eta) = u_{i-1} + (u_i - u_{i-1})\eta$$

Then

$$f(u_i) - f(u_{i-1}) = \int_0^1\frac{df(q(\eta))}{d\eta}\,d\eta = \int_0^1 f'(q(\eta))\,q'(\eta)\,d\eta = \int_0^1 f'(q(\eta))(u_i - u_{i-1})\,d\eta = \left(\int_0^1 f'(q(\eta))\,d\eta\right)(u_i - u_{i-1})$$

so if we choose $A = \int_0^1 f'(q(\eta))\,d\eta$, we will satisfy the first condition on A. However, this may not be an easy integral to evaluate. Roe's solution to this problem was to use a parametrization vector z(q). This must be invertible, so $q = q(z)$. Thus $f(q) = f(q(z))$, and we can integrate. Let

$$z(\eta) = z_{i-1} + (z_i - z_{i-1})\eta, \qquad z_i = z(u_i), \qquad z'(\eta) = z_i - z_{i-1}$$

Then

$$f(u_i) - f(u_{i-1}) = \int_0^1\frac{df(z(\eta))}{d\eta}\,d\eta = \left(\int_0^1 f'(z(\eta))\,d\eta\right)(z_i - z_{i-1}) = C(z_i - z_{i-1})$$

and

$$u_i - u_{i-1} = \int_0^1\frac{dq(z(\eta))}{d\eta}\,d\eta = \left(\int_0^1 q'(z(\eta))\,d\eta\right)(z_i - z_{i-1}) = B(z_i - z_{i-1})$$

So if we can find z such that B and C are invertible and easy to evaluate, then $A = CB^{-1}$.

It is nice to illustrate the scheme with the problem of isothermal flow:

$$\partial_t\rho + \partial_x(\rho u) = 0$$
$$\partial_t(\rho u) + \partial_x(\rho u^2 + a^2\rho) = 0$$

and, with $m = \rho u$,

$$U = \begin{pmatrix}\rho \\ m\end{pmatrix}, \qquad f(U) = \begin{pmatrix}\rho u \\ \rho u^2 + a^2\rho\end{pmatrix} = \begin{pmatrix}m \\ \frac{m^2}{\rho} + a^2\rho\end{pmatrix}$$

so

$$f'(U) = \begin{pmatrix}0 & 1 \\ a^2 - u^2 & 2u\end{pmatrix}$$

We need to choose A, so we need z. Let

$$z = \rho^{-1/2}\begin{pmatrix}\rho \\ \rho u\end{pmatrix} = \begin{pmatrix}z_1 \\ z_2\end{pmatrix} = \begin{pmatrix}\rho^{1/2} \\ \rho^{1/2}u\end{pmatrix}$$

Then

$$q(z) = U = z_1 z, \qquad f(U) = \begin{pmatrix}z_1 z_2 \\ a^2 z_1^2 + z_2^2\end{pmatrix}$$

and we can write U and f(U) in quadratic form in z. With

$$\bar z = \frac{1}{2}(z_l + z_r) = \begin{pmatrix}\bar z_1 \\ \bar z_2\end{pmatrix} = \frac{1}{2}\begin{pmatrix}\rho_l^{1/2} + \rho_r^{1/2} \\ \frac{m_l}{\rho_l^{1/2}} + \frac{m_r}{\rho_r^{1/2}}\end{pmatrix}$$

we have

$$U_l - U_r = \begin{pmatrix}2\bar z_1 & 0 \\ \bar z_2 & \bar z_1\end{pmatrix}(z_l - z_r)$$

and

$$f(U_l) - f(U_r) = \begin{pmatrix}\bar z_2 & \bar z_1 \\ 2a^2\bar z_1 & 2\bar z_2\end{pmatrix}(z_l - z_r)$$

so

$$B = \begin{pmatrix}2\bar z_1 & 0 \\ \bar z_2 & \bar z_1\end{pmatrix}, \qquad C = \begin{pmatrix}\bar z_2 & \bar z_1 \\ 2a^2\bar z_1 & 2\bar z_2\end{pmatrix}$$

So, finally, we have

$$A = CB^{-1} = \begin{pmatrix}0 & 1 \\ a^2 - \bar u^2 & 2\bar u\end{pmatrix}$$

where

$$\bar u = \frac{\bar z_2}{\bar z_1} = \frac{\sqrt{\rho_l}\,u_l + \sqrt{\rho_r}\,u_r}{\sqrt{\rho_l} + \sqrt{\rho_r}}$$

is called the "Roe average".
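In the scalar case, the Roe speed a = (f(u_r) - f(u_l))/(u_r - u_l) given earlier reduces the Roe solver to simple upwinding on the sign of a; here is a minimal Python sketch (Burgers' flux is my illustrative choice, and note that, as written, no entropy fix is included, so transonic rarefactions are not handled correctly):

def roe_flux_scalar(ul, ur, f):
    # scalar Roe flux: upwind according to the sign of the Roe-averaged speed
    if ul == ur:
        return f(ul)
    a = (f(ur) - f(ul)) / (ur - ul)
    return f(ul) if a >= 0.0 else f(ur)

f = lambda u: 0.5 * u**2                 # illustrative convex flux
print(roe_flux_scalar(1.0, 0.0, f))      # a = 0.5 > 0, so the flux is f(ul) = 0.5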

3.8 Nonlinear Stability

In order to determine the stability of methods used to solve nonlinear equations, some tougher analysis is required. The first concept needed I have already mentioned: total variation. Total variation may be thought of as a way to "measure" the oscillatory nature of a function:

$$TV(v) = \limsup_{\varepsilon\to 0}\frac{1}{\varepsilon}\int_{\mathbb{R}}|v(x) - v(x-\varepsilon)|\,dx$$

Also, if v is differentiable, then

$$TV(v) = \int_{\mathbb{R}}|v'(x)|\,dx$$

We can also consider total variation in space and time:

$$TV_T(u(x,t)) = \limsup_{\varepsilon\to 0}\frac{1}{\varepsilon}\left(\int_0^T\!\!\int_{\mathbb{R}}|u(x+\varepsilon, t) - u(x,t)|\,dx\,dt + \int_0^T\!\!\int_{\mathbb{R}}|u(x, t+\varepsilon) - u(x,t)|\,dx\,dt\right)$$


We often consider the "1" norm of our functions in space and time:

$$\|v\|_{1,T} = \int_0^T\|v(\cdot,t)\|_1\,dt = \int_0^T\!\!\int_{\mathbb{R}}|v(x,t)|\,dx\,dt$$

and the space of functions that are finite under that norm:

$$L_{1,T} = \{v : \|v\|_{1,T} < \infty\}$$

Consider

$$K = \{u\in L_{1,T} : TV_T(u) \le R,\ \mathrm{supp}(u(\cdot,t)) \subset [-M,M]\ \forall t\in[0,T]\}$$

K is a compact set. We can also define the total variation of a discrete solution $u_j^n$:

$$TV_T(u) = \sum_{j=-\infty}^{\infty}\sum_{n=0}^{N}\left(\Delta t\,|u_{j+1}^n - u_j^n| + \Delta x\,|u_j^{n+1} - u_j^n|\right) = \sum_{n=0}^{N}\Delta t\,\|u^n\|_{TV} + \sum_{n=0}^{N}\|u^{n+1} - u^n\|_1$$

We will denote a numerical solution with $\Delta t = k$ by $u_k$.
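A minimal Python sketch of the discrete total variation in space, which is the quantity that needs to stay bounded (or non-increasing, for a TVD method) as a scheme is run; the step function below is a hypothetical placeholder for whatever update is being tested:

import numpy as np

def total_variation(u):
    # discrete TV(u^n) = sum_j |u_{j+1} - u_j|
    return np.sum(np.abs(np.diff(u)))

def looks_tvd(u0, step, n_steps=50, tol=1e-12):
    # empirical check that TV does not grow over n_steps applications of 'step'
    u, tv = u0.copy(), total_variation(u0)
    for _ in range(n_steps):
        u = step(u)
        tv_new = total_variation(u)
        if tv_new > tv + tol:
            return False
        tv = tv_new
    return True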

A numerical method is total variation stable (TV stable) if all approximate solutions $u_k$ for $k < k_0$ lie in some fixed set K (where R and M depend on the initial data $u_0$ and the flux function f(u), but not on the time step k).

Theorem 5. Consider a conservative method with a Lipschitz continuous numerical flux F(u; j). Suppose that for each initial data $u_0$, $\exists$ some $k_0 > 0$ such that $TV(u^n) \le R$ $\forall\, n, k$ with $k < k_0$, $nk \le T$. Then the method is TV stable.

This theorem means that it is sufficient to check TV stability in space under the right conditions. We also have the lemma: if the conditions of the theorem are met, $\exists\,\alpha$ such that $\|u^{n+1} - u^n\|_1 \le \alpha k$ $\forall\, n, k$ with $k < k_0$, $nk < T$.

Proof (of the theorem).

$$TV_T(u) \le \sum_{n=0}^{N}kR + \sum_{n=0}^{N}\alpha k = (\alpha + R)\sum_n k \le (\alpha + R)Nk \le (\alpha + R)T$$

Proof (of the lemma). The scheme is conservative, so

$$u_j^{n+1} - u_j^n = -\frac{\Delta t}{\Delta x}\left[F(u^n; j) - F(u^n; j-1)\right]$$

$$\|u^{n+1} - u^n\|_1 = \Delta t\sum_{j=-\infty}^{\infty}\left|F(u^n; j) - F(u^n; j-1)\right|$$

F(u^n; j) depends on a finite number of spatial points, $u_{j-p}^n$ to $u_{j+q}^n$. F is Lipschitz continuous in u, so

$$|F(u^n; j) - F(u^n; j-1)| \le C\max_{-p\le i\le q}|u_{j+i}^n - u_{j+i-1}^n| \le C\sum_{i=-p}^{q}|u_{j+i}^n - u_{j+i-1}^n|$$

so

$$\|u^{n+1} - u^n\|_1 \le C\Delta t\sum_{j=-\infty}^{\infty}\sum_{i=-p}^{q}|u_{j+i}^n - u_{j+i-1}^n| = C\Delta t\sum_{i=-p}^{q}TV(u^n) \le C\Delta t\,R\,(p + q + 1)$$

and so $\alpha = CR(p + q + 1)$.

So, under the conditions of the theorem, it is sufficient that

$$TV(u^n) = \sum_{j=-\infty}^{\infty}|u_{j+1}^n - u_j^n|$$

is bounded for the method to be TV stable. Furthermore,

Theorem 6. Suppose $u_k$ is generated by a numerical method in conservation form with a Lipschitz continuous numerical flux, consistent with some scalar conservation law. If the method is TV stable, then the method is convergent (it converges to a weak solution).

A stronger condition for a method is to be Total Variation Diminishing (TVD), meaning

$$TV(u^{n+1}) \le TV(u^n)$$

Unsurprisingly, TVD $\Rightarrow$ TV stable (as a trivial consequence of Theorem 5), and so a TVD method that is consistent and conservative converges to a weak solution. Another desirable property in a numerical method is the absence of numerical oscillations. A method is monotonicity preserving if

$$u_j^0 \ge u_{j+1}^0\ \forall j \quad\Rightarrow\quad u_j^n \ge u_{j+1}^n\ \forall j, n$$

Theorem 7. If a method for a hyperbolic problem is TVD, it is monotonicity preserving.

Proof. $TV(u^n) = \sum_{j=-\infty}^{\infty}|u_{j+1}^n - u_j^n| \le TV(u^0) = \sum_{j=-\infty}^{\infty}|u_{j+1}^0 - u_j^0|$ because the scheme is TVD. Monotonicity of the initial data implies

$$TV(u^0) = \sum_{j=-\infty}^{\infty}|u_{j+1}^0 - u_j^0| = |u_{-\infty}^0 - u_{\infty}^0|$$

and because of the finite wave speed of a hyperbolic problem, $u_{-\infty}^n = u_{-\infty}^0$ and $u_{\infty}^n = u_{\infty}^0$. This means that

$$TV(u^0) = u_{-\infty}^0 - u_{\infty}^0 = \sum_{j=-\infty}^{\infty}\left(u_j^n - u_{j+1}^n\right) \le TV(u^n)$$

by the triangle inequality. The scheme is TVD, so equality must hold. This implies that

$$TV(u^n) = \sum_{j=-\infty}^{\infty}\left(u_j^n - u_{j+1}^n\right) = u_{-\infty}^n - u_{\infty}^n$$

which can only be true if $u_j^n \ge u_{j+1}^n$, so the scheme is monotonicity preserving.


For analytic weak solutions of hyperbolic problems,

$$\|u(\cdot,t_2)\|_1 \le \|u(\cdot,t_1)\|_1 \quad \forall\, t_2 \ge t_1$$

and if u and v are both entropy solutions with different initial data,

$$\|u(\cdot,t_2) - v(\cdot,t_2)\|_1 \le \|u(\cdot,t_1) - v(\cdot,t_1)\|_1 \quad \forall\, t_2 \ge t_1$$

We would like our numerical methods to in some sense preserve this, using the norm

$$\|u^n\|_1 = \sum_j|u_j^n|$$

A numerical method is $l^1$ contracting if

$$\|u^{n+1} - v^{n+1}\|_1 \le \|u^n - v^n\|_1$$

$\forall\, u^n, v^n$ numerical solutions with different initial data.

Theorem 8. $l^1$ contracting $\Rightarrow$ TVD.

Proof. Let u, v be solutions with different initial data, so that

$$\|u^{n+1} - v^{n+1}\|_1 \le \|u^n - v^n\|_1, \qquad TV(u^{n+1}) = \sum_j|u_{j+1}^{n+1} - u_j^{n+1}|$$

Let $v_j^n = u_{j+1}^n$; then $v_j^n$ is also a numerical solution (the same data shifted by one grid point). Therefore

$$TV(u^{n+1}) = \sum_j|v_j^{n+1} - u_j^{n+1}| \le \sum_j|v_j^n - u_j^n| = TV(u^n)$$

Consider, for a nonlinear advection equation $u_t + f(u)_x = 0$, the scheme

$$\frac{u_j^{n+1} - u_j^n}{\Delta t} + \frac{f(u_j^n) - f(u_{j-1}^n)}{\Delta x} = 0$$

so that

$$u_j^{n+1} = u_j^n - \frac{\Delta t}{\Delta x}\left(f(u_j^n) - f(u_{j-1}^n)\right), \qquad v_j^{n+1} = v_j^n - \frac{\Delta t}{\Delta x}\left(f(v_j^n) - f(v_{j-1}^n)\right)$$

Then let $w = u - v$. We can show that this method is $l^1$ contracting by showing that $\|w^{n+1}\|_1 \le \|w^n\|_1$. We have

$$w_j^{n+1} = w_j^n - \frac{\Delta t}{\Delta x}\left[f(u_j^n) - f(v_j^n) - f(u_{j-1}^n) + f(v_{j-1}^n)\right]$$

and by the mean value theorem, $\exists\,\theta$ so that

$$w_j^{n+1} = w_j^n - \frac{\Delta t}{\Delta x}\left[f'(\theta_j^n)w_j^n - f'(\theta_{j-1}^n)w_{j-1}^n\right] = \left[1 - \frac{\Delta t}{\Delta x}f'(\theta_j^n)\right]w_j^n + \frac{\Delta t}{\Delta x}f'(\theta_{j-1}^n)w_{j-1}^n$$

$$\Rightarrow\quad |w_j^{n+1}| \le \left|1 - \frac{\Delta t}{\Delta x}f'(\theta_j^n)\right||w_j^n| + \left|\frac{\Delta t}{\Delta x}f'(\theta_{j-1}^n)\right||w_{j-1}^n|$$

$$\Delta x\sum_j|w_j^{n+1}| \le \Delta x\sum_j\left|1 - \frac{\Delta t}{\Delta x}f'(\theta_j^n)\right||w_j^n| + \Delta x\sum_j\left|\frac{\Delta t}{\Delta x}f'(\theta_{j-1}^n)\right||w_{j-1}^n|$$

If the CFL condition is met, then $1 - \frac{\Delta t}{\Delta x}f'(u) \ge 0$ $\forall u$ and $\frac{\Delta t}{\Delta x}|f'(u)| \le 1$ $\forall u$, which means that

$$\|w^{n+1}\|_1 \le \Delta x\sum_j\left(1 - \frac{\Delta t}{\Delta x}f'(\theta_j^n)\right)|w_j^n| + \Delta x\sum_j\left(\frac{\Delta t}{\Delta x}f'(\theta_{j-1}^n)\right)|w_{j-1}^n| = \Delta x\sum_j|w_j^n| = \|w^n\|_1$$

so this scheme is $l^1$ contracting under the CFL condition.

If u(x,t) and v(x,t) are analytical solutions which satisfy the entropy conditions, and $u_0(x) \ge v_0(x)$ $\forall x$, then $u(x,t) \ge v(x,t)$ $\forall x,t$. Again, we want to preserve this property in our numerical method. A method $U^{n+1} = H(U^n)$ is called a monotone method if

$$\frac{\partial H}{\partial u_j^n}(u^n) \ge 0 \quad \forall j$$

Then, if $u_j^0 \ge v_j^0$ $\forall j$,

$$u_j^{n+1} - v_j^{n+1} = H(u^n)_j - H(v^n)_j = \sum_k\frac{\partial H}{\partial u_k}\Big|_{w^n}(u_k^n - v_k^n) \ge 0$$

(for some intermediate state $w^n$, by the mean value theorem) and so (inductively) we see that the property is preserved.

Theorem 9. Any monotone method is $l^1$ contracting.

Proof. We need to show that $\|u^{n+1} - v^{n+1}\|_1 \le \|u^n - v^n\|_1$. Suppose

$$u_j^{n+1} = H(u_{j-k}^n, u_{j-k+1}^n, \ldots, u_{j+k}^n)$$

and let

$$\mathbf{u}_j^n = (u_{j-k}^n, \ldots, u_{j+k}^n), \qquad s_j^n = \mathrm{sgn}(u_j^n - v_j^n)$$

Then

$$\|u^{n+1} - v^{n+1}\|_1 = \Delta x\sum_j|u_j^{n+1} - v_j^{n+1}| = \Delta x\sum_j|H(\mathbf{u}_j^n) - H(\mathbf{v}_j^n)| = \Delta x\sum_j s_j^{n+1}\left(H(\mathbf{u}_j^n) - H(\mathbf{v}_j^n)\right)$$

and

$$H(\mathbf{u}_j^n) - H(\mathbf{v}_j^n) = \int_0^1\frac{dH}{d\theta}\left(\theta\mathbf{u}_j^n + (1-\theta)\mathbf{v}_j^n\right)d\theta = \int_0^1\sum_{l=1}^{2k+1}\frac{\partial H}{\partial u_l}(\xi_j^n(\theta))\left(u_{j-k+l-1}^n - v_{j-k+l-1}^n\right)d\theta$$

where $\xi_j^n(\theta) = \theta\mathbf{u}_j^n + (1-\theta)\mathbf{v}_j^n$. Let $m = j - k + l - 1$. Then

$$\Delta x\sum_j s_j^{n+1}\left(H(\mathbf{u}_j^n) - H(\mathbf{v}_j^n)\right) = \Delta x\sum_j s_j^{n+1}\int_0^1\sum_{l=1}^{2k+1}\frac{\partial H}{\partial u_l}(\xi_j^n(\theta))(u_m^n - v_m^n)\,d\theta$$
$$= \Delta x\sum_m(u_m^n - v_m^n)\int_0^1\sum_{l=1}^{2k+1}s_{m+k-l+1}^{n+1}\,\frac{\partial H}{\partial u_l}(\xi_{m+k-l+1}^n(\theta))\,d\theta$$
$$\le \Delta x\sum_m|u_m^n - v_m^n|\int_0^1\sum_{l=1}^{2k+1}\frac{\partial H}{\partial u_l}(\xi_{m+k-l+1}^n(\theta))\,d\theta$$

using the monotonicity of the method ($\partial H/\partial u_l \ge 0$). Now

$$\int_0^1\sum_{l=1}^{2k+1}\frac{\partial H}{\partial u_l}\,d\theta = 1$$

because the scheme is conservative:

$$u_j^{n+1} = u_j^n - \frac{\Delta t}{\Delta x}\left[F(u_{j-k+1}^n, \ldots, u_{j+k}^n) - F(u_{j-k}^n, \ldots, u_{j+k-1}^n)\right]$$

so, relabelling the stencil as $(u_1, \ldots, u_{2k+1})$,

$$H(u_1, \ldots, u_{2k+1}) = u_{k+1} - \frac{\Delta t}{\Delta x}\left[F(u_2, \ldots, u_{2k+1}) - F(u_1, \ldots, u_{2k})\right]$$

Differentiating,

$$\frac{\partial H}{\partial u_{k+1}} = 1 - \frac{\Delta t}{\Delta x}\left(\frac{\partial F}{\partial u_{k+1}}(u_2, \ldots, u_{2k+1}) - \frac{\partial F}{\partial u_{k+1}}(u_1, \ldots, u_{2k})\right)$$

$$\frac{\partial H}{\partial u_1} = \frac{\Delta t}{\Delta x}\,\frac{\partial F}{\partial u_1}(u_1, \ldots, u_{2k}), \qquad \frac{\partial H}{\partial u_{2k+1}} = -\frac{\Delta t}{\Delta x}\,\frac{\partial F}{\partial u_{2k+1}}(u_2, \ldots, u_{2k+1})$$

and for all other i,

$$\frac{\partial H}{\partial u_i} = -\frac{\Delta t}{\Delta x}\left[\frac{\partial F}{\partial u_i}(u_2, \ldots, u_{2k+1}) - \frac{\partial F}{\partial u_i}(u_1, \ldots, u_{2k})\right]$$

When these are summed (with the shifted evaluation points $\xi_{m+k-l+1}^n$ above), all of the flux-derivative terms cancel except for the leading 1 in $\partial H/\partial u_{k+1}$, and so the sum is 1. Therefore

$$\|u^{n+1} - v^{n+1}\|_1 \le \Delta x\sum_m|u_m^n - v_m^n| = \|u^n - v^n\|_1$$

Finally, we present a couple of theorems without proof:

Theorem 10 (Due to Crandall & Majda). The numerical solution of a monotone method converges to the entropy solution of the scalar conservation law.

Theorem 11 (Due to Harten, Lax, & Hyman). A monotone method is at best first order.

We have now developed the following hierarchy:

Monotone Method $\Rightarrow$ $l^1$ Contracting $\Rightarrow$ TVD $\Rightarrow$ Monotonicity Preserving


3.9 High Resolution Shock Capturing Schemes

Monotone methods are at most first order, so if higher accuracy is desired, numerical oscillations will occur. To deal with this, artificial viscosity can be introduced to dampen the spurious oscillations. For example, suppose we want to improve the (second order) Lax-Wendroff scheme for a linear problem:

$$u_t + au_x = 0$$

$$u_j^{n+1} = u_j^n - \frac{\nu}{2}(u_{j+1}^n - u_{j-1}^n) + \frac{\nu^2}{2}(u_{j+1}^n - 2u_j^n + u_{j-1}^n), \qquad \nu = \frac{a\Delta t}{\Delta x}\ \text{(the Courant number)}$$

We need to add an artificial viscosity term:

$$kQ\,(u_{j+1}^n - 2u_j^n + u_{j-1}^n)$$

so the modified scheme is

$$u_j^{n+1} = u_j^n - \frac{\nu}{2}(u_{j+1}^n - u_{j-1}^n) + \frac{\nu^2}{2}(u_{j+1}^n - 2u_j^n + u_{j-1}^n) + kQ\,(u_{j+1}^n - 2u_j^n + u_{j-1}^n)$$

Inspecting the truncation error of this modified method:

$$L(x,t) = L_{LW} - Q\,(u_{j+1}^n - 2u_j^n + u_{j-1}^n) = O(h^2)$$

so the scheme is still second order if Q is chosen wisely. The challenge is then to choose the parameter Q small enough that we do not lose second order convergence, but large enough to dampen the numerical oscillations. The way to do this is to make Q depend on u, making it larger near shocks, where the oscillations occur. If Q is constant, it cannot be chosen to dampen all the oscillations, which is consistent with the theorem by Godunov:

Theorem 12. A linear, monotonicity preserving method is at most first order.

So, if we want to preserve monotonicity with a constant Q, we need $Q \sim 1/h$, and thus have a first order scheme. Clearly, a higher order, monotonicity preserving scheme must be nonlinear, even for this linear problem. So we will need to choose a function Q(u) which is large, on the order of 1/h, near the shock and very small away from the shock.

3.9.1 Flux Limiter Methods

We can manipulate the flux of a method, splitting it into a low order flux plus a high order correction:

$$F_H(u; j) = F_L(u; j) + \left(F_H(u; j) - F_L(u; j)\right)$$

and add a "limiter":

$$F = F_L + \phi(u; j)\left(F_H - F_L\right)$$

where

$$\phi(u; j) \approx \begin{cases}1 & \text{in smooth regions} \\ 0 & \text{near a discontinuity}\end{cases}$$

Equivalently,

$$F = F_H + (1 - \phi)(F_L - F_H)$$

This is the Flux Corrected Transport (FCT) method. Applying it to our example, the Lax-Wendroff method (written for a > 0 in terms of its numerical flux):

$$u_j^{n+1} = u_j^n - \frac{\Delta t}{\Delta x}\left(F_H(u; j) - F_H(u; j-1)\right), \qquad F_H(u; j) = au_j + \frac{1}{2}a(1-\nu)(u_{j+1} - u_j)$$

We let

$$F(u; j) = au_j + \frac{1}{2}a(1-\nu)(u_{j+1} - u_j)\,\phi_j$$

and use the "smoothness indicator"

$$\theta_j = \frac{u_j - u_{j-1}}{u_{j+1} - u_j}$$

which has the property that, in smooth regions, $\theta_j \sim 1$, and at discontinuities $\theta_j$ is far from 1. So we can easily create $\phi_j = \phi(\theta_j)$ with the properties we require of it.

Theorem 13. The flux limiter method F(u; j) given above is consistent with the equation $u_t + au_x = 0$ provided $\phi(\theta)$ is a bounded function. It is second order accurate (on smooth solutions with $u_x$ bounded away from 0) provided $\phi(1) = 1$, with $\phi$ Lipschitz continuous at $\theta = 1$.

so this scheme becomes:

$$u_j^{n+1} = u_j^n - \frac{\Delta t}{\Delta x}\left(a(u_j^n - u_{j-1}^n) + \frac{1}{2}a(1-\nu)\left[(u_{j+1}^n - u_j^n)\phi_j - (u_j^n - u_{j-1}^n)\phi_{j-1}\right]\right)$$
$$= u_j - \left(\nu - \frac{1}{2}\nu(1-\nu)\phi_{j-1}\right)(u_j - u_{j-1}) - \frac{1}{2}\nu(1-\nu)\phi_j(u_{j+1} - u_j)$$

In general, suppose the scheme is

$$u_j^{n+1} = u_j - C_{j-1}(u_j - u_{j-1}) + D_j(u_{j+1} - u_j)$$

Theorem 14. In order for the above scheme to be TVD, the following conditions on the coefficients are sufficient:

$$C_{j-1} \ge 0, \qquad D_j \ge 0, \qquad 0 \le C_j + D_j \le 1 \quad \forall j$$

Proof.

$$u_{j+1}^{n+1} - u_j^{n+1} = u_{j+1}^n - u_j^n - C_j(u_{j+1}^n - u_j^n) + C_{j-1}(u_j^n - u_{j-1}^n) + D_{j+1}(u_{j+2}^n - u_{j+1}^n) - D_j(u_{j+1}^n - u_j^n)$$
$$u_{j+1}^{n+1} - u_j^{n+1} = C_{j-1}(u_j^n - u_{j-1}^n) + (1 - C_j - D_j)(u_{j+1}^n - u_j^n) + D_{j+1}(u_{j+2}^n - u_{j+1}^n)$$


then

$$\sum_j|u_{j+1}^{n+1} - u_j^{n+1}| \le \sum_j C_{j-1}|u_j - u_{j-1}| + \sum_j(1 - C_j - D_j)|u_{j+1} - u_j| + \sum_j D_{j+1}|u_{j+2} - u_{j+1}|$$
$$= \sum_j C_j|u_{j+1} - u_j| + \sum_j(1 - C_j - D_j)|u_{j+1} - u_j| + \sum_j D_j|u_{j+1} - u_j| = \sum_j|u_{j+1} - u_j| = TV(u^n)$$

$$\Rightarrow\quad TV(u^{n+1}) \le TV(u^n)$$

In our scheme, we have

$$C_{j-1} = \nu - \frac{1}{2}\nu(1-\nu)\phi_{j-1}, \qquad D_j = -\frac{1}{2}\nu(1-\nu)\phi_j$$

Because $\nu\in(0,1)$ and $\phi > 0$, we cannot satisfy the conditions of the theorem ($D_j < 0$). We can instead rewrite the scheme as

$$u_j^{n+1} = u_j^n - \left[\nu - \frac{1}{2}\nu(1-\nu)\phi_{j-1} + \frac{1}{2}\nu(1-\nu)\phi_j\,\frac{u_{j+1} - u_j}{u_j - u_{j-1}}\right](u_j - u_{j-1})$$

so that

$$C_{j-1} = \nu - \frac{1}{2}\nu(1-\nu)\phi_{j-1} + \frac{1}{2}\nu(1-\nu)\phi_j\,\frac{u_{j+1} - u_j}{u_j - u_{j-1}}, \qquad D_j = 0$$

and

$$C_{j-1} = \nu - \frac{1}{2}\nu(1-\nu)\phi_{j-1} + \frac{1}{2}\nu(1-\nu)\frac{\phi_j}{\theta_j} = \nu\left(1 + \frac{1}{2}(1-\nu)\left(\frac{\phi_j}{\theta_j} - \phi_{j-1}\right)\right)$$

and $0 \le \nu \le 1$, so $0 \le C_{j-1} \le 1$ holds whenever

$$\left|\frac{\phi(\theta_j)}{\theta_j} - \phi(\theta_{j-1})\right| \le 2$$

so we can pick a function $\phi$, which need not be unique, and have a method which is TVD. Some possibilities for $\phi$ are:

van Leer: $\phi(\theta) = \dfrac{|\theta| + \theta}{1 + |\theta|}$

superbee: $\phi(\theta) = \max\{0,\ \min(1, 2\theta),\ \min(\theta, 2)\}$
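A minimal Python sketch of a few standard limiter functions phi(theta), including the two listed above (minmod is an extra common choice added for comparison, not from the notes):

import numpy as np

def phi_van_leer(theta):
    return (np.abs(theta) + theta) / (1.0 + np.abs(theta))

def phi_superbee(theta):
    return np.maximum(0.0, np.maximum(np.minimum(1.0, 2.0 * theta),
                                      np.minimum(theta, 2.0)))

def phi_minmod(theta):
    return np.maximum(0.0, np.minimum(1.0, theta))

# phi(1) = 1 for all of these, which is the second order accuracy requirement above
print(phi_van_leer(1.0), phi_superbee(1.0), phi_minmod(1.0))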


3.9.2 Slope Limiter Methods

A more geometrical approach uses a reconstruction that looks like

$$u_{j+1/2} = u_j + \frac{\Delta x}{2}\,\partial_x u_j$$

where a piecewise linear approximation to $\partial_x u$ is used:

$$\partial_x u_j \approx \sigma_j^- = \frac{u_j - u_{j-1}}{\Delta x} \quad\text{or}\quad \sigma_j^+ = \frac{u_{j+1} - u_j}{\Delta x}$$

When the solution is smooth, it doesn't matter much which you choose. However, at a discontinuity a choice must be made. Choose

$$\partial_x u_j = \begin{cases}0 & \text{if } \sigma_j^+\sigma_j^- \le 0 \\ \mathrm{sgn}(\sigma_j^+)\min\{|\sigma_j^-|, |\sigma_j^+|\} & \text{otherwise}\end{cases}$$

or, using van Leer's criteria, choose

$$\partial_x u_j = \begin{cases}0 & \text{if } \sigma_j^+\sigma_j^- \le 0 \\ \sigma_j^0 & \text{if } |\sigma_j^0| < 2\min\{|\sigma_j^-|, |\sigma_j^+|\} \\ \mathrm{sgn}(\sigma_j^+)\min\{|\sigma_j^-|, |\sigma_j^+|\} & \text{otherwise}\end{cases}$$

where $\sigma_j^0 = \frac{u_{j+1} - u_{j-1}}{2\Delta x}$ is the centred slope. This can be written

$$\partial_x u_j \cong \phi(\theta_j)\,\frac{u_{j+1} - u_j}{\Delta x}$$

where $\phi(\theta)$ is the van Leer choice:

$$\phi(\theta) = \frac{|\theta| + \theta}{1 + |\theta|}$$

3.10 Central Schemes

Recalling the Godunov scheme, we took the average of the solution over each space cell at time $t^n$. Now, we will instead average over the staggered cells:

$$u_{j+1/2}^{n+1} = \frac{1}{\Delta x}\int_{x_j}^{x_{j+1}}u(x, t^{n+1})\,dx$$

so that

$$\frac{1}{\Delta t\,\Delta x}\int_{x_j}^{x_{j+1}}\int_{t^n}^{t^{n+1}}\left(u_t^n + f(u^n)_x\right)dt\,dx = 0$$

$$\frac{\frac{1}{\Delta x}\int_{x_j}^{x_{j+1}}u^n(x, t^{n+1})\,dx - \frac{1}{\Delta x}\int_{x_j}^{x_{j+1}}u^n(x, t^n)\,dx}{\Delta t} + \frac{\frac{1}{\Delta t}\int_{t^n}^{t^{n+1}}f(u^n(x_{j+1}, t))\,dt - \frac{1}{\Delta t}\int_{t^n}^{t^{n+1}}f(u^n(x_j, t))\,dt}{\Delta x} = 0$$

So, if we assume that each time step is small enough that $u^n$ remains constant along $x = x_j$ and $x = x_{j+1}$ (no waves from the cell interfaces reach them),

$$\frac{u_{j+1/2}^{n+1} - \frac{u_j^n + u_{j+1}^n}{2}}{\Delta t} + \frac{f(u_{j+1}^n) - f(u_j^n)}{\Delta x} = 0$$

This method is much like the Lax-Friedrichs method, but on a staggered grid. This also puts a different constraint on the time step size: the Courant number must now be at most one half,

$$\frac{|f'(u)|\,\Delta t}{\Delta x} \le \frac{1}{2}$$

This is a first order method. However, if we construct the method with piecewise linear spatial approximations (rather than piecewise constant), we can come up with a second order scheme. We need to approximate

$$\frac{u_{j+1/2}^{n+1} - \left[\frac{1}{2}(u_j^n + u_{j+1}^n) + \frac{\Delta x}{8}(u_j' - u_{j+1}')\right]}{\Delta t} + \frac{\frac{1}{\Delta t}\int_{t^n}^{t^{n+1}}f(u^n(x_{j+1}, t))\,dt - \frac{1}{\Delta t}\int_{t^n}^{t^{n+1}}f(u^n(x_j, t))\,dt}{\Delta x}$$

We approximate the time integrals with the midpoint rule (to get a second order scheme):

$$\int_{t^n}^{t^{n+1}}f(u^n(x_j, t))\,dt \approx \Delta t\,f\big(u^n(x_j, t^{n+1/2})\big)$$

$$u(x_j, t^{n+1/2}) \approx u(x_j, t^n) + \frac{\Delta t}{2}\partial_t u(x_j, t^n) = u_j^n - \frac{\Delta t}{2}\partial_x f(u_j^n)$$

so the 2nd order scheme is

$$u_j^{n+1/2} = u_j^n - \frac{\Delta t}{2}\,\partial_x f(u_j^n)$$

$$u_{j+1/2}^{n+1} = \frac{1}{2}(u_j^n + u_{j+1}^n) + \frac{\Delta x}{8}(u_j' - u_{j+1}') - \frac{\Delta t}{\Delta x}\left[f(u_{j+1}^{n+1/2}) - f(u_j^{n+1/2})\right]$$

where $u_j'$ and $\partial_x f(u_j^n)$ can be defined using slope limiters.
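A minimal Python sketch of the first order staggered central scheme above on a periodic grid (Burgers' flux, the grid, and the CFL factor are my own illustrative choices; each step moves the values half a cell, so after two steps they sit on cell centres again, shifted by one index):

import numpy as np

f = lambda u: 0.5 * u**2                      # illustrative flux (Burgers)

def staggered_central_step(u, lam):
    # u^{n+1}_{j+1/2} = (u_j + u_{j+1})/2 - lam (f(u_{j+1}) - f(u_j)), periodic grid
    up = np.roll(u, -1)                       # u_{j+1}
    return 0.5 * (u + up) - lam * (f(up) - f(u))

x = np.linspace(0.0, 1.0, 200, endpoint=False)
dx = x[1] - x[0]
u = np.sin(2 * np.pi * x)
for _ in range(100):
    dt = 0.4 * dx / max(np.abs(u).max(), 1e-12)   # respects |f'(u)| dt/dx <= 1/2
    u = staggered_central_step(u, dt / dx)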

3.11 Relaxation Scheme

For the problem

$$u_t + f(u)_x = 0$$

we can approximate the equation with the relaxation system

$$u_t + v_x = 0, \qquad v_t + a\,u_x = \frac{1}{\varepsilon}\left(f(u) - v\right)$$

which can be written

$$\begin{pmatrix}u \\ v\end{pmatrix}_t + \begin{pmatrix}0 & 1 \\ a & 0\end{pmatrix}\begin{pmatrix}u \\ v\end{pmatrix}_x = \begin{pmatrix}0 \\ \frac{1}{\varepsilon}(f(u) - v)\end{pmatrix}$$

we can show that the matrix

A =

0 1

a 0

has eigenvalues λ±(A) = ±

√a, so this is a hyperbolic system. Note:

v = f(u)− ε(vt + aux)

so as epsilon→ 0, v → f(u). This gives

v = f(u)− ε(f(u)t + aux) +O(ε2)

andf(u)t = f ′(u)ut = −f ′(u)vx = −f ′(u)(f ′(u)ux +O(ε))

sov = f(u)− ε(−(f ′(u))2 + aux) +O(ε2)

thenut + f(u)x = ε((a− (f ′(u))2)ux)x

so we need (f ′(u))2 ≤ a, or rather |f ′(u)| ≤√a. Note that

√a is the characteristic speed of

the system we are using. We next diagonalize the system to get(u± 1√

av

)t

±√a

(u± 1√

av

)x

= RHS

We then do an operator splitting. We will use an upwind scheme for the convection term√a(u± 1√

av)x: (

u+ 1√av)j+1/2

=(u+ 1√

av)j(

u− 1√av)j+1/2

=(u− 1√

av)j

which can be solved to find

vj+1/2 =vj+vj+1

2−√a

2(uj+1 − uj)

∂tuj +vj+1/2−vj−1/2

∆x= 0

We then relax using a local Lax Friedrich scheme:

ut = 0

vt = 1ε(f(u)− v)

⇒vj+1/2 =

f(uj) + f(uj+1)

2−√a

2(uj+1 − uj)

We can make this second order by using, instead of simple upwind,(u+ 1√

av)j+1/2

=(u+ 1√

av)j

+ ∆x2∂x

(u+ 1√

av)j(

u− 1√av)j+1/2

=(u− 1√

av)j− ∆x

2∂x

(u− 1√

av)j

using slope limiters.
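A minimal sketch of the relaxed (ε → 0) first order version of this scheme in Python, assuming periodic boundaries; the function names are my own.

import numpy as np

def local_lxf_flux(u, f, a):
    # v_{j+1/2} = (f(u_j)+f(u_{j+1}))/2 - (sqrt(a)/2)*(u_{j+1}-u_j), with |f'(u)|^2 <= a
    up = np.roll(u, -1)
    return 0.5 * (f(u) + f(up)) - 0.5 * np.sqrt(a) * (up - u)

def relaxation_step(u, f, a, dt, dx):
    # conservative update u_j^{n+1} = u_j - (dt/dx)*(v_{j+1/2} - v_{j-1/2})
    v = local_lxf_flux(u, f, a)
    return u - dt / dx * (v - np.roll(v, 1))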


Chapter 4

Front Propagation

Often of interest is the propagation of "fronts", i.e. the movement of curves across, for example, space. For instance, we may want to simulate the surface of a fireball exploding out of a car after John McClane shoots it with a grenade launcher. To do that, we don't need to keep track of what might be happening behind the surface of the fireball (which would be invisible to our audience anyway), so we only need to simulate the front surface. The propagation speed F of the front Γ may depend on a variety of factors, including

• local geometry (normal, curvature, etc)

• physical properties (temperature, pressure, etc)

We will assume that F is known. The Lagrangian method is to parametrize Γ:

Γ(t): ~x(s, t),   0 ≤ s ≤ S

d~x/dt = F~n

x_t = F n_x,   y_t = F n_y

~n = ( y_s/√(x_s² + y_s²), −x_s/√(x_s² + y_s²) )

Perhaps F = F(κ), where κ is the curvature:

κ = ( y_{ss} x_s − x_{ss} y_s )/( x_s² + y_s² )^{3/2}

Simply keeping track of a section Δs of the curve corresponding to s_i ≤ s ≤ s_{i+1} presents problems. One is that, if the front expands, Δs may stretch. Another is the possibility of topological changes such as the merging of fronts.

The Eulerian method, in contrast, does not rely on a parametrization of Γ, but instead on the use of level sets. This requires that we define a multivariable function Φ(~x, t) such that

Γ(t) = { ~x | Φ(~x, t) = 0 }

and so, along the front, Φ(~x(t), t) = 0, which gives

Φ_t + ~x′(t) · ∇_x Φ = 0

Φ_t + F~n · ∇_x Φ = 0

and note

~n = ∇_x Φ / |∇_x Φ|

so the level set equation is

Φ_t + F|∇Φ| = 0

We can then just solve this on a fixed grid. This method can handle topological changes (these are simply reflected in Φ(·, t)) and it does not lose accuracy as a front expands. To choose an initial condition Φ(~x, 0), we can use the signed distance to Γ, which has linear sloping sides:

Φ(~x, 0) = dist(~x, Γ) if ~x ∉ interior of Γ,   Φ(~x, 0) = −dist(~x, Γ) if ~x ∈ interior of Γ

The level set equation

Φ_t + F|∇Φ| = 0

is an example of a Hamilton-Jacobi equation:

s_t + H(∇s) = 0

where H is called the "Hamiltonian". In one dimension this is

s_t + H(s_x) = 0

and if we let u = s_x, differentiating the equation in x gives

u_t + ∂_x H(u) = 0

so this is a non linear advection type equation with flux H. We can solve it:

∂_t u_j + ( H_{j+1/2} − H_{j−1/2} )/Δx = 0

where we can use any method that is appropriate for the approximation of the flux H_{j+1/2}. We also need to approximate u = s_x:

u_j = ( s_{j+1/2} − s_{j−1/2} )/Δx

so we are dealing with

∂_t ( ( s_{j+1/2} − s_{j−1/2} )/Δx ) + ( H_{j+1/2} − H_{j−1/2} )/Δx = 0

∂_t s_{j+1/2} + H_{j+1/2} = 0

or, shifting the index by a half,

∂_t s_j + H_j = 0

with H_j = H( u_{j−1/2}, u_{j+1/2} ), or rather H_j = H( (s_j − s_{j−1})/Δx, (s_{j+1} − s_j)/Δx ), so we have

∂_t s_j + H( (s_j − s_{j−1})/Δx, (s_{j+1} − s_j)/Δx ) = 0
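As an illustration of the final scheme, here is a sketch of one forward Euler step for s_t + H(s_x) = 0 in Python. The notes leave the numerical Hamiltonian unspecified; this sketch uses a Lax-Friedrichs type choice Ĥ(u⁻, u⁺) = H((u⁻+u⁺)/2) − (α/2)(u⁺ − u⁻) with α ≥ max|H′|, which is one standard monotone option, not the only one. Periodic boundaries are assumed.

import numpy as np

def hj_step(s, H, alpha, dt, dx):
    um = (s - np.roll(s, 1)) / dx        # (s_j - s_{j-1})/dx
    up = (np.roll(s, -1) - s) / dx       # (s_{j+1} - s_j)/dx
    Hhat = H(0.5 * (um + up)) - 0.5 * alpha * (up - um)
    return s - dt * Hhat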


Chapter 5

Finite Volume Methods For PDEs

Consider the equation in two dimensions:

∂_t u + ∇·f = 0

and consider a non uniform grid, so

Δx_i = x_{i+1/2} − x_{i−1/2},   Δy_j = y_{j+1/2} − y_{j−1/2}

and denote

ū_{ij} = ( 1/(Δx_i Δy_j) ) ∫_{x_{i−1/2}}^{x_{i+1/2}} ∫_{y_{j−1/2}}^{y_{j+1/2}} u(x, y) dy dx = (1/A_{ij}) ∫∫_{Ω_{ij}} u(x, y) dx dy

where A_{ij} = Δx_i Δy_j is the area of the rectangle Ω_{ij}. Then our problem is:

∂_t ū_{ij} + (1/A_{ij}) ∫∫_{Ω_{ij}} ∇·f dx dy = 0

∂_t ū_{ij} + (1/A_{ij}) ∫_{∂Ω_{ij}} f·~n dl = 0

(using the divergence theorem), and because we are using rectangular regions Ω_{ij},

∂_t ū_{ij} + (1/A_{ij}) ( ∫_{y_{j−1/2}}^{y_{j+1/2}} f^x(x_{i+1/2}, y) dy − ∫_{y_{j−1/2}}^{y_{j+1/2}} f^x(x_{i−1/2}, y) dy + ∫_{x_{i−1/2}}^{x_{i+1/2}} f^y(x, y_{j+1/2}) dx − ∫_{x_{i−1/2}}^{x_{i+1/2}} f^y(x, y_{j−1/2}) dx ) = 0

We can denote

f̄^x_{i+1/2} = (1/Δy_j) ∫_{y_{j−1/2}}^{y_{j+1/2}} f^x(x_{i+1/2}, y) dy

and approximate this with the trapezoidal rule:

f̄^x_{i+1/2} ≈ ½ [ f^x(x_{i+1/2}, y_{j+1/2}) + f^x(x_{i+1/2}, y_{j−1/2}) ]

This requires values of u at the cell corners. So we begin with ū and approximate in x to get

u_{i+1/2, j} = ( Δx_{i+1} ū_{ij} + Δx_i ū_{i+1, j} ) / ( Δx_i + Δx_{i+1} )

and then approximate in y to get

u_{i+1/2, j+1/2} = ( Δy_{j+1} u_{i+1/2, j} + Δy_j u_{i+1/2, j+1} ) / ( Δy_j + Δy_{j+1} )

So at each corner point we are using a weighted average of the 4 immediately neighbouring cell averages. We can approximate the other edge integrals in the same way and ultimately arrive at

∂_t ū_{ij} + [ Δy_j ( f̄^x_{i+1/2} − f̄^x_{i−1/2} ) + Δx_i ( f̄^y_{j+1/2} − f̄^y_{j−1/2} ) ] / ( Δx_i Δy_j ) = 0


Chapter 6

Spectral Methods For PDEs

Spectral methods use trigonometric interpolation to get exponential accuracy at a higher computational cost. These methods are global methods, which leads to this high cost. The advantage is that, for a smooth function, the accuracy is O(Δx^m) ∀m ≥ 1. Finite difference and finite volume methods are ultimately based on polynomial interpolation. Spectral methods take advantage of the improved accuracy of trigonometric interpolation, especially for periodic functions.

6.1 Trigonometric Interpolation

Suppose f is periodic with period τ > 0, that is to say τ is the smallest positive real number such that f(x + τ) = f(x) ∀x. Let τ = 2π. If

P_n(t) = a_0 + ∑_{j=1}^n ( a_j cos(jt) + b_j sin(jt) )

and |a_n| + |b_n| ≠ 0, we say that P_n(t) is a trigonometric polynomial of degree n. The interpolating points are

0 ≤ t_0 < t_1 < ... < t_N < 2π

and we require f(t_j) = P_n(t_j) for 0 ≤ j ≤ N, where N = 2n (there are 2n + 1 coefficients and 2n + 1 interpolation points). We can represent P_n in a number of alternative ways:

P_n(t) = a_0 + ∑_{j=1}^n [ a_j (e^{ijt} + e^{−ijt})/2 + b_j (e^{ijt} − e^{−ijt})/(2i) ]

= a_0 + ∑_{j=1}^n ½(a_j − i b_j) e^{ijt} + ∑_{j=1}^n ½(a_j + i b_j) e^{−ijt}

= ∑_{j=−n}^n c_j e^{ijt}

where c_0 = a_0, c_j = ½(a_j − i b_j), c_{−j} = ½(a_j + i b_j), and i = √(−1) is the imaginary unit (and not an index!). So if we let z = e^{it}, then

P_n = ∑_{j=−n}^n c_j z^j

and so

z^n P_n(t) = ∑_{j=−n}^n c_j z^{j+n} = ∑_{j=0}^{2n} c_{j−n} z^j

and we need P_n to satisfy f(t_j) = P_n(t_j), 0 ≤ j ≤ 2n, where t_j = j·2π/(2n + 1). This is a question, then, of solving for the coefficients c_j of the series. This is recognizable by this point as a discrete Fourier series, and the process for finding the coefficients is the familiar one, i.e. make use of the orthogonality of the exponentials. (Relabelling, write the sum over j = 0, ..., N − 1 with the N equally spaced points t_k = 2πk/N.)

f(t_k) = ∑_{j=0}^{N−1} c_j e^{ijt_k},   0 ≤ k ≤ N − 1

∑_{k=0}^{N−1} e^{−ilt_k} f(t_k) = ∑_{k=0}^{N−1} ∑_{j=0}^{N−1} c_j e^{i(j−l)t_k} = ∑_{j=0}^{N−1} c_j ∑_{k=0}^{N−1} e^{i(j−l)k·2π/N},   0 ≤ l ≤ N − 1

To use the orthogonality of the exponentials, let

z = e^{i(2π/N)(j−l)}

so

∑_{k=0}^{N−1} e^{−ilt_k} f(t_k) = ∑_{j=0}^{N−1} c_j ∑_{k=0}^{N−1} z^k

and

∑_{k=0}^{N−1} z^k = N if j = l,   0 if j ≠ l

so we have

∑_{k=0}^{N−1} e^{−ilt_k} f(t_k) = N c_l

and finally we have found the l-th coefficient:

c_l = (1/N) ∑_{k=0}^{N−1} f(t_k) e^{−ilt_k},   0 ≤ l ≤ N − 1

For the sake of completeness (and because it's pretty easy) let's prove the orthogonality relation we used. If j = l,

z = e^{i(2π/N)(j−l)} = e^0 = 1

and so clearly

∑_{k=0}^{N−1} z^k = ∑_{k=0}^{N−1} 1 = N

If j ≠ l, we need to first remember that, for any integer m, e^{i2πm} = 1, that e^{i2πx} ≠ 1 if x is NOT an integer, and the geometric series relation found in most calc 2 textbooks:

∑_{k=0}^{N−1} a^k = (1 − a^N)/(1 − a)

and so, because z^N = e^{i2π(j−l)} = 1 while z ≠ 1,

∑_{k=0}^{N−1} z^k = (1 − z^N)/(1 − z) = 0

As was mentioned, this is basically a discrete version of a Fourier series, and so, unsurprisingly, the coefficients c_j are called Discrete Fourier Coefficients and the trigonometric polynomial

P_n(t) = ∑_{j=0}^{N−1} c_j e^{ijt}

is called the Discrete Fourier Transform. Computing the coefficients c_j directly is not cheap, as it requires a large sum for each of many coefficients. Luckily, the Fast Fourier Transform exists to compute these quickly.

It is interesting to compare the continuous Fourier transform and its discrete counterpart for a function f(t) that is periodic with period 2π:

Continuous:  f̂(ξ) = (1/2π) ∫_0^{2π} f(t) e^{−iξt} dt

Discrete:  c_l = (1/N) ∑_{k=0}^{N−1} f(t_k) e^{−ilt_k}

and notice that the discrete transform is the trapezoidal rule approximation to the continuous transform. The same is true of the inverse transforms:

Continuous:  ∫_0^{2π} f̂(ξ) e^{iξt} dξ

Discrete:  ∑_{j=0}^{N−1} c_j e^{ijt_k}

For periodic functions, the trapezoidal rule has exponential accuracy, so spectral methods (which are built, after all, on this transform) have exponential accuracy for periodic functions.
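A short Python sketch of the coefficient formula c_l = (1/N) Σ_k f(t_k) e^{−i l t_k} on the grid t_k = 2πk/N, checked against numpy's FFT (which computes the same sums, unnormalized):

import numpy as np

def dft_coefficients(fvals):
    N = len(fvals)
    t = 2.0 * np.pi * np.arange(N) / N
    l = np.arange(N)
    return (fvals[None, :] * np.exp(-1j * np.outer(l, t))).sum(axis=1) / N

f = np.cos(3 * 2.0 * np.pi * np.arange(16) / 16)
assert np.allclose(dft_coefficients(f), np.fft.fft(f) / 16)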


6.1.1 The Fast Fourier Transform

As I mentioned, the computational cost of the discrete Fourier transform is large; the operation count is O(N²). That is very slow, so the Fast Fourier Transform (FFT) was developed to improve this. I will discuss the theory of the FFT, not implementation. For the FFT we need N = 2^m for some integer m. For each c_l, if l is even, l = 2l′, and we can compute

c_l = c_{2l′} = (1/N) ∑_{k=0}^{N/2−1} f(t_k) e^{−i2l′k·2π/N} + (1/N) ∑_{k=N/2}^{N−1} f(t_k) e^{−i2l′k·2π/N}

Now let f(t_k) = a_k and write the second term in the transform as

(1/N) ∑_{k=N/2}^{N−1} f(t_k) e^{−i2l′k·2π/N} = (1/N) ∑_{k=0}^{N/2−1} a_{k+N/2} e^{−i2l′(k+N/2)·2π/N}

so that

c_l = c_{2l′} = (1/N) ∑_{k=0}^{N/2−1} ( a_k + a_{k+N/2} ) e^{−il′k·2π/(N/2)}

Now we have a discrete Fourier transform on the N/2 values a_0 + a_{N/2}, ..., a_{N/2−1} + a_{N−1}.

If l is odd, then l = 2l′ + 1:

c_l = c_{2l′+1} = (1/N) ∑_{k=0}^{N−1} f(t_k) e^{−i(2l′+1)k·2π/N} = (1/N) ∑_{k=0}^{N−1} ( a_k e^{−ik·2π/N} ) e^{−i2l′k·2π/N}

so if we set b_k = a_k e^{−ik·2π/N}, we have the same problem as before, and can do the same thing to get a discrete Fourier transform on b_0 + b_{N/2}, ..., b_{N/2−1} + b_{N−1}:

c_{2l′+1} = (1/N) ∑_{k=0}^{N/2−1} ( b_k + b_{k+N/2} ) e^{−il′k·2π/(N/2)}

with

b_k = a_k e^{−ik·2π/N}

b_{k+N/2} = a_{k+N/2} e^{−i(k+N/2)·2π/N} = −a_{k+N/2} e^{−ik·2π/N}

so

b_k + b_{k+N/2} = ( a_k − a_{k+N/2} ) e^{−ik·2π/N}

We now have 2 Fourier transforms on N/2 data, rather than 1 on N data. We can continue this until we have N transforms on 1 piece of data each, which is trivial. The operation count of the FFT is O(N log₂ N).
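For concreteness, here is a sketch of a recursive radix-2 FFT in Python. It uses the standard decimation-in-time variant, which splits over even/odd sample indices rather than even/odd l as in the derivation above; the operation count is the same. It returns the unnormalized sums, so divide by N to get the c_l. This is illustrative only; in practice use numpy.fft.

import numpy as np

def fft_recursive(a):
    N = len(a)
    if N == 1:
        return a.astype(complex)
    even = fft_recursive(a[0::2])          # transform of even-indexed samples
    odd = fft_recursive(a[1::2])           # transform of odd-indexed samples
    w = np.exp(-2j * np.pi * np.arange(N // 2) / N)   # twiddle factors
    return np.concatenate([even + w * odd, even - w * odd])

x = np.random.rand(64)
assert np.allclose(fft_recursive(x), np.fft.fft(x))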


6.2 Basics about the Fourier Transform

Consider u(x) for x ∈ R. The Fourier transform is

û(k) = ∫_{−∞}^{∞} e^{−ikx} u(x) dx

and the inverse Fourier transform is

u(x) = (1/2π) ∫_{−∞}^{∞} e^{ikx} û(k) dk

where k ∈ R is called the "wave number". Now consider x restricted to the grid hZ = { x_j = jh | −∞ < j < ∞ }. Now k can be confined to an interval of length 2π/h because of aliasing: for x ∈ R,

e^{ik_1 x} ≢ e^{ik_2 x} when k_1 ≠ k_2

but if x is only defined on hZ then e^{ik_1 x} ≡ e^{ik_2 x} whenever k_1 − k_2 is an integer multiple of 2π/h, so we take

k ∈ [−π/h, π/h]

Let v_j = v(x_j). The relevant transforms are

FT:  v̂(k) = h ∑_{j=−∞}^{∞} e^{−ikx_j} v_j

IFT:  v_j = (1/2π) ∫_{−π/h}^{π/h} e^{ikx_j} v̂(k) dk

Now, if we want an interpolant P(x) such that P(x_j) = v_j, we can use

P(x) = (1/2π) ∫_{−π/h}^{π/h} e^{ikx} v̂(k) dk

Then

P̂(k) = v̂(k) for k ∈ [−π/h, π/h],   P̂(k) = 0 otherwise

because

P̂(k) = ∫_{−∞}^{∞} e^{−ikx} (1/2π) ∫_{−π/h}^{π/h} v̂(l) e^{ilx} dl dx = (1/2π) ∫_{−π/h}^{π/h} dl v̂(l) ∫_{−∞}^{∞} e^{−i(k−l)x} dx = ∫_{−π/h}^{π/h} v̂(l) δ(k − l) dl

One of the most useful facts about Fourier transforms is that, for a function w(x) and with this sign convention for the transform,

ŵ′(k) = ik ŵ(k)

and in fact the transform of w^{(n)} is (ik)^n ŵ(k). This is not too hard to show using integration by parts. More on it can be found in any analysis text.


6.3 Spectral Differentiation

Having found an interpolant for our data, we need to use it to approximate the derivative of our data. The obvious thing to do is to set v′_j = P′(x_j). To see what this means, write

v_j = ∑_{m=−∞}^{∞} v_m δ_{j−m}

where δ_j is the Kronecker delta at 0 (note that its semidiscrete transform is δ̂(k) = h). The band-limited interpolant of δ is

P_δ(x) = (h/2π) ∫_{−π/h}^{π/h} e^{ikx} dk = sin(πx/h)/(πx/h) for x ≠ 0,   1 for x = 0

Let

S_h(x) = sin(πx/h)/(πx/h) for x ≠ 0,   S_h(0) = 1

This is called the "sinc" function. So,

Pv(x) = ∑_{m=−∞}^{∞} v_m S_h(x − x_m)

and

v′_j := P′v(x_j) = ∑_{m=−∞}^{∞} v_m S′_h(x_j − x_m)

with

S′_h(x_j) = 0 for j = 0,   S′_h(x_j) = (−1)^j/(jh) for j ≠ 0

and, for a second derivative,

S″_h(x_j) = −π²/(3h²) for j = 0,   S″_h(x_j) = 2(−1)^{j+1}/(j²h²) for j ≠ 0

This is not a very easy thing to deal with, and so we will find a better way to come up with an approximation to the derivative.

We want the derivative of P(x) = (1/2π) ∫_{−π/h}^{π/h} v̂(k) e^{ikx} dk. However, we know something about P̂(k); in fact we know P̂(k) = v̂(k) on [−π/h, π/h]. For data v_j, −∞ < j < ∞, the method would be:

1. Take the FT to find v̂(k)

2. Let ŵ(k) = (ik)^n v̂(k)

3. Take the IFT of ŵ to find w(x)

4. Set v^{(n)}_j := w(x_j)

But we have a finite amount of discrete data. We can use the discrete transforms

Discrete FT:  v̂_k = h ∑_{j=1}^{N} e^{−ikx_j} v_j,   k = −N/2 + 1, ..., N/2

Discrete IFT:  v_j = (1/2π) ∑_{k=−N/2+1}^{N/2} e^{ikx_j} v̂_k,   j = 1, ..., N

Our interpolating polynomial is

P(x) = (1/2π) ∑_{k=−N/2+1}^{N/2} e^{ikx} v̂_k

We notice that

e^{i(N/2)x} = cos(Nx/2) + i sin(Nx/2)

and because x_j = jh = j·2π/N,

e^{i(N/2)x_j} = cos(jπ)

This sawtooth mode is real on the grid and its derivative on the grid should be treated as 0, but

( e^{i(N/2)x} )′ = (iN/2) e^{i(N/2)x}

is purely imaginary at the grid points. To fix this, we define v̂_{−N/2} = v̂_{N/2} and use the symmetric sum

P(x) = (1/2π) ∑_{k=−N/2}^{N/2} e^{ikx} v̂_k

with v′_j = P′(x_j). So, with the discrete data, the process is

1. Take the DFT to find v̂_k, k = −N/2 + 1, ..., N/2

2. Let ŵ_k = (ik)^n v̂_k, except set ŵ_{N/2} = 0

3. Take the inverse DFT of ŵ_k to find w_j

4. Set v^{(n)}_j := w_j

using the FFT for the transforms.
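A compact Python sketch of this FFT differentiation procedure on a periodic interval of length L (the default L = 2π matches the grid above); the wavenumber bookkeeping is the part that usually goes wrong, so it is spelled out.

import numpy as np

def spectral_derivative(v, L=2.0 * np.pi, order=1):
    N = len(v)
    k = np.fft.fftfreq(N, d=L / (2.0 * np.pi * N))   # wavenumbers 0, 1, ..., -N/2, ..., -1
    w_hat = (1j * k) ** order * np.fft.fft(v)        # multiply by (ik)^order
    if order % 2 == 1:
        w_hat[N // 2] = 0.0                          # zero the unmatched N/2 mode
    return np.real(np.fft.ifft(w_hat))

x = 2.0 * np.pi * np.arange(32) / 32
assert np.allclose(spectral_derivative(np.sin(x)), np.cos(x))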


6.4 Smoothness & Spectral Accuracy

Smooth functions have fast decay in their Fourier transforms, which makes the discrete Fourier transform more accurate. Furthermore, better accuracy will be found for functions that decay at infinity. Assume u(x) ∈ C^m, with

û(k) = ∫_{−∞}^{∞} e^{ikx} u(x) dx

and u^{(n)}(x) → 0 as x → ±∞ for n = 0, 1, ..., m. Integrating by parts, we see

û(k) = −(1/(ik)) ∫_{−∞}^{∞} u′(x) e^{ikx} dx

Repeating m times:

û(k) = ( −1/(ik) )^m ∫_{−∞}^{∞} u^{(m)}(x) e^{ikx} dx ∼ (1/k)^m

For large wave number, û decays. For large m, û decays very fast in Fourier space. Recall that the error in the discrete transform comes from the fact that it only keeps bounded k: in the discrete transform k ∈ [−π/h, π/h], and the error is due to the modes with k ∉ [−π/h, π/h]. For smooth functions these terms are tiny, so the error is not very big.

Theorem 15. Let u(x) ∈ L²(R), with Fourier transform û(k).

1. If u has p − 1 continuous derivatives in L²(R) for p ≥ 0 and the p-th derivative is of bounded variation, then û(k) = O( 1/|k|^{p+1} ) as k → ∞.

2. If u has infinitely many derivatives in L²(R), then û(k) = O( 1/|k|^n ) ∀n ≥ 0 as k → ∞.

The converse also holds. For example, consider

s(x) = 1/2 for |x| < 1,   0 for |x| ≥ 1

The Fourier transform is

ŝ(k) = ½ ∫_{−∞}^{∞} e^{ikx} s(x) dx = ½ ∫_{−1}^{1} e^{ikx} dx = (1/(2ik)) e^{ikx} |_{−1}^{1} = sin(k)/k

and so this decays as 1/k, consistent with p = 0 in the theorem (s is of bounded variation but not continuous).


6.4.1 Convolutions

The operation

(u ∗ v)(x) = ∫_{−∞}^{∞} u(y) v(x − y) dy = ∫_{−∞}^{∞} u(x − y) v(y) dy

is called a convolution. Convolutions have the property that

(u ∗ v)^ = û v̂

and so convolving a function (with something smooth) can increase smoothness.

6.4.2 Spectral Approximation

Let u(x) ∈ L²(R) and suppose u′(x) is of bounded variation. Let v be the grid function defined on hZ by v_j = u(x_j) (so v is the discretized data from u). Then

v̂(k) = ∑_{j=−∞}^{∞} û( k + 2πj/h ) = û(k) + ∑_{j≠0} û( k + 2πj/h )

The error term

∑_{j≠0} û( k + 2πj/h )

is called the aliasing error. This error is small if û decays fast, which we have seen is the case if u(x) is smooth. Furthermore:

Theorem 16. The following estimates hold uniformly for all k ∈ [−π/h, π/h].

1. If u has p − 1 continuous derivatives in L²(R) for some p ≥ 1 and a p-th derivative of bounded variation, then |v̂(k) − û(k)| = O(h^{p+1}) as h → 0.

2. If u has infinitely many continuous derivatives in L²(R), then |v̂(k) − û(k)| = O(h^m) as h → 0, ∀m ≥ 0 (exponential or spectral accuracy).

We also have Parseval's identity: √(2π)‖u‖_{L²} = ‖û‖_{L²}, so if the transforms are close, the corresponding functions are close in L² as well. Let the IFT of v̂ (restricted to [−π/h, π/h]) be P(x), our interpolating function.

Theorem 17 (Accuracy of Fourier Spectral Differentiation). Let u ∈ L²(R) have a ν-th derivative (ν ≥ 1) of bounded variation and let w be the ν-th spectral derivative of u on hZ. Then for all x_j ∈ hZ:

1. If u has p − 1 continuous derivatives in L²(R) for some p ≥ ν + 1 and a p-th derivative of bounded variation, then |w_j − u^{(ν)}(x_j)| = O(h^{p+1−ν}) as h → 0.

2. If u has infinitely many continuous derivatives in L²(R), then |w_j − u^{(ν)}(x_j)| = O(h^m) as h → 0, ∀m.

It is important to notice that this accuracy is only achieved for periodic problems. For non periodic functions on bounded domains (which are generally necessary for a computer to deal with the problem), Gibbs' phenomenon occurs. This means that trigonometric interpolation introduces oscillations of O(1) amplitude at discontinuities (which exist at the boundaries of the problem). These oscillations do not decay, even as h → 0, N → ∞.


6.5 Non Periodic Problems - Chebyshev Points

In polynomial interpolation on a uniform grid, oscillations occur, especially near the boundary of the interpolated region, and the error near the boundaries grows rapidly with N, which is, of course, extremely bad. To improve the interpolation, we need to choose a grid with more points clustered near the boundary. Assume our domain is [−1, 1] (a simple transformation can get us to any domain we need, of course). Choose

x_j = cos( jπ/N ),   j = 0, ..., N

This puts more points near the boundary of the domain because the curve on which the points lie is flatter near x = −1 and x = 1. This is also suitable for problems with boundary layers because more points are near the boundaries. Polynomial interpolation on this grid followed by differentiation gives accuracy for f^{(ν)}(x) of O(h^{N+1−ν}). However, the differentiation matrix in

(v′_0, v′_1, ..., v′_N)^T = D (v_0, v_1, ..., v_N)^T

is a full matrix, so applying it costs O(N²) operations, which is slow. There is a faster way, taking advantage of the structure of this matrix.

Theorem 18. The Chebyshev differentiation matrix D = (D_{ij}), of size (N+1) × (N+1), takes the form

D_{00} = (2N² + 1)/6

D_{NN} = −(2N² + 1)/6

D_{jj} = −x_j / ( 2(1 − x_j²) ),   j = 1, ..., N − 1

D_{ij} = (C_i/C_j) (−1)^{i+j}/(x_i − x_j),   i ≠ j

where

C_i = 2 if i = 0 or N,   C_i = 1 otherwise

There is an FFT for Chebyshev spectral methods, but first we need to decide how to implement a boundary condition. Consider the boundary value problem

u_{xx} = e^{4x},   −1 < x < 1,   u(±1) = 0

Using Chebyshev interpolation, u_{xx} → D^{(2)}v. Let P(x) be the unique polynomial of degree ≤ N such that P(±1) = 0 and P(x_j) = v_j, 1 ≤ j ≤ N − 1. Set w_j = P″(x_j), 1 ≤ j ≤ N − 1. Then we have

(w_0, w_1, ..., w_{N−1}, w_N)^T = D^{(2)}_N (v_0, v_1, ..., v_{N−1}, v_N)^T

We will remove the first and last rows and columns to impose the boundary conditions.
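Here is a sketch in Python of building the Chebyshev points and differentiation matrix. It follows the off-diagonal entries of Theorem 18 directly, but fills the diagonal by the standard "negative row sum" trick (rows of D sum to zero because the derivative of a constant is zero) instead of the explicit formulas; the two agree.

import numpy as np

def cheb(N):
    if N == 0:
        return np.zeros((1, 1)), np.array([1.0])
    x = np.cos(np.pi * np.arange(N + 1) / N)
    c = np.ones(N + 1); c[0] = 2.0; c[N] = 2.0
    c = c * (-1.0) ** np.arange(N + 1)                 # fold (-1)^{i+j} into c_i/c_j
    X = np.tile(x, (N + 1, 1)).T
    dX = X - X.T                                        # x_i - x_j
    D = np.outer(c, 1.0 / c) / (dX + np.eye(N + 1))     # off-diagonal entries
    D = D - np.diag(D.sum(axis=1))                      # diagonal via row sums
    return D, x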

6.5.1 FFT for Chebyshev Spectral Methods

Define z ∈ C, z = e^{iθ}. Then x = cos(θ) = Re(z) = ½(z + z^{−1}) ∈ [−1, 1]. The Chebyshev polynomial of degree n is

T_n(x) = Re(z^n) = ½(z^n + z^{−n}) = cos(nθ)

and so

T_n(x) = cos( n arccos(x) )

The roots of this polynomial (it doesn't look like one, but it is!) are at x = cos( (j + 1/2)π/n ), points which cluster near ±1 just like the Chebyshev grid above. For those who don't believe that crazy looking function is a polynomial of degree n, consider:

T_0 = 1

T_1 = x

T_2 = ½(z² + z^{−2}) = ½[ (z + z^{−1})² − 2 ] = 2x² − 1

and in general

T_n(x) = 2x T_{n−1}(x) − T_{n−2}(x)

The proof of this is by induction, and is now left as a (fairly easy, I promise) exercise. This relation makes it clear that T_n(x) is a polynomial of degree n, and also shows that the leading coefficient is 2^{n−1}. Furthermore, the Chebyshev polynomials of degree ≤ n form a basis for the polynomials of degree ≤ n. This means that for any polynomial p(x) of degree n,

p(x) = ∑_{j=0}^n a_j T_j(x)

The Chebyshev points are x_j = cos(jπ/N) = cos(θ_j), where the θ_j = jπ/N make a uniform grid! We can then write p(x) as a function of θ on a uniform grid, and we have ourselves a periodic function. This means we can use spectral differentiation. The procedure for the Chebyshev spectral method is

1. Given v_0, ..., v_N at the Chebyshev points x_0 = 1, ..., x_N = −1, extend the data to a vector v of length 2N by assigning v_{2N−j} = v_j, j = 1, ..., N − 1. Now you have an even set of data, and this can be extended to ±∞ in a periodic way.

2. Use the FFT to get the discrete Fourier transform of the data:

v̂_k = (π/N) ∑_{j=1}^{2N} e^{−ikθ_j} v_j,   k = −N + 1, ..., N

3. Set ŵ_k = ik v̂_k, except ŵ_N = 0.

4. Use the inverse FFT:

W_j = (1/2π) ∑_{k=−N+1}^{N} e^{ikθ_j} ŵ_k

5. The derivative with respect to x at x_j is

w_j = −W_j / √(1 − x_j²),   j = 1, ..., N − 1

from the chain rule, because W_j is a discretization of the θ-derivative and dθ/dx = −1/√(1 − x²).

Of course, if the ν-th derivative is required, the third step becomes ŵ_k = (ik)^ν v̂_k, and we need to set ŵ_N = 0 if ν is odd. The final chain rule relation is also different, and can be worked out in the same way.


Chapter 7

Monte Carlo Methods

Monte Carlo methods are probabilistic methods, named for the casino in Monaco. Consider

∫_0^1 f(x) dx,   0 ≤ f(x) ≤ 1

We can randomly select a point (x_1, y_1) in [0, 1] × [0, 1], count it as 1 if f(x_1) ≥ y_1 and 0 otherwise, and then repeat. So for random (x_i, y_i) ∈ [0, 1] × [0, 1],

z_i = 0 if y_i > f(x_i),   z_i = 1 if y_i ≤ f(x_i)

and then

( z_1 + z_2 + ··· + z_n )/n → ∫_0^1 f(x) dx

as n → ∞. This works by the Law of Large Numbers, which states that if x_1, x_2, ..., x_n are independent identically distributed random variables with E(x_i) = µ (where E(y) denotes the expected value of a random variable y), then

x̄ = ( x_1 + x_2 + ··· + x_n )/n → µ

as n → ∞. Here

E(z_i) = P( y_i ≤ f(x_i) ) = ∫_0^1 f(x) dx

because z_i is 1 exactly when the point falls below the curve, so the average of the z_i converges to the integral.

An important theorem in probabilistic methods is the Central Limit Theorem:

Theorem 19. If var(x_i) = σ² < ∞, then

( x̄ − µ )/( σ/√n ) → N(0, 1)

where N(0, 1) is the standard normal distribution.

This means that x̄ − µ ∼ O( σ/√n ), and so the convergence rate of the integral approximation above is O(1/√n). This rate does not degrade as the dimension increases, nor does the computational cost. In R^d the approximation is identical:

∫_{[0,1]^d} f(x) dx = (1/n) ∑_{i=1}^n f(x_i) + O(1/√n)

In contrast, quadrature rules in R^d become much more computationally costly:

∫_{[0,1]^d} f(~x) dx = c ∑_{i_1=1}^N ∑_{i_2=1}^N ··· ∑_{i_d=1}^N f( x_{i_1}, x_{i_2}, ..., x_{i_d} ) + O(1/N²)

So the computational cost is N^d. If n = N^d, so that the computational cost is the same for each, the accuracy of quadrature is 1/N² while for the Monte Carlo method it is 1/N^{d/2}. For d ≫ 1, the Monte Carlo method is more accurate at the same computational cost. For very high dimensions it is not even feasible to use quadrature, and the Monte Carlo method is the only choice.
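A minimal hit-or-miss Monte Carlo sketch in Python for the one dimensional example above; the seed argument is my own convenience.

import numpy as np

def hit_or_miss(f, n, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.random(n)
    y = rng.random(n)
    return np.mean(y <= f(x))     # fraction of points below the curve

# integral of x^2 on [0,1] is 1/3; the error decays like 1/sqrt(n)
print(hit_or_miss(lambda x: x**2, 100000))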

7.1 Probability Basics

Let X be a continuous random variable. Then F(x) = P(X ≤ x) is its distribution function. F(x) is monotone, F(−∞) = 0, F(∞) = 1, and 0 ≤ F(x) ≤ 1. Furthermore, f(x) = F′(x) is called the probability density function. As a result, P(a ≤ X ≤ b) = F(b) − F(a) = ∫_a^b f(x) dx, and

E(X) = ∫_{−∞}^{∞} x f(x) dx

so

E(h(X)) = ∫_{−∞}^{∞} h(x) f(x) dx

The variance of X is

var(X) = E( (X − µ)² ) = E(X²) − (E(X))² ≥ 0

and the standard deviation is σ = √var(X). For a discrete random variable, F(x) is discontinuous, with jumps of p_i at x_i. The expected value for a discrete random variable is

E(h(X)) = ∑_{i=1}^N h(x_i) p_i

7.1.1 Distributions

There are some important probability distributions of continuous random variables. The Normal (Gaussian) distribution is N(µ, σ²), where

f(x) = ( 1/(√(2π)σ) ) e^{−(x−µ)²/(2σ²)}

The Exponential distribution is Exp(λ), where

f(x) = λe^{−λx} for x > 0,   0 for x < 0

and µ = σ = 1/λ.

The Uniform distribution is Unif[a, b], where

f(x) = 1/(b − a) for x ∈ [a, b],   0 otherwise

Every computer has a (pseudo-)random number generator that will produce X ∼ Unif[0, 1], and when using a higher level language like python or matlab, random number generators are available with many other distributions.

Important distributions of discrete random variables include the Binomial distribution Binom(n, θ), which is the "coin flip" distribution (or a biased coin if θ ≠ 1/2):

P(heads) = θ,   P(tails) = 1 − θ

For n flips, if X is the number of "heads",

P(X = x) = C(n, x) θ^x (1 − θ)^{n−x},   E(X) = nθ,   σ² = nθ(1 − θ)

where C(n, x) is the binomial coefficient.

The Geometric distribution gives the distribution of X, the number of tosses up to and including the first "heads":

P(X = x) = (1 − θ)^{x−1} θ,   E(X) = 1/θ,   σ²(X) = (1 − θ)/θ²

The Multinomial distribution is the outcome of tossing a k-sided die n times. In this distribution, the variable is X = (x_1, ..., x_k) where x_1 + ··· + x_k = n and

P( X = (x_1, ..., x_k) ) = ( n!/(x_1! ··· x_k!) ) θ_1^{x_1} ··· θ_k^{x_k}

where θ_j is the probability of seeing side j in a single toss.

The Poisson distribution is

P(X = x) = ( λ^x/x! ) e^{−λ},   x = 0, 1, ...

and

E(X) = var(X) = λ

This is the limit of Binom(n, p) with np = λ as n → ∞.


7.2 Convergence of Random Variables

If we have random variables x_1, x_2, ..., x_n with distribution functions F_1(x), F_2(x), ..., F_n(x), we can define a few different kinds of convergence:

• If lim_{n→∞} F_n(x) = F(x) (at the continuity points of F), we say that x_n → x in distribution.

• If ∀ε > 0, lim_{n→∞} P(|x_n − x| > ε) = 0, we say that x_n → x in probability.

• If P( lim_{n→∞} |x_n − x| = 0 ) = 1, we say x_n → x almost surely.

Convergence almost surely ⇒ convergence in probability ⇒ convergence in distribution. The weak law of large numbers is

Theorem 20. If x_1, ..., x_n are independent identically distributed (iid) with finite mean µ, then

( x_1 + x_2 + ··· + x_n )/n → µ

in probability.

The strong law of large numbers is

Theorem 21. If, in addition, the variance of the x_i is finite, then

( x_1 + x_2 + ··· + x_n )/n → µ

almost surely.

It is also worth restating the central limit theorem:

Theorem 22. If x_1, ..., x_n are iid with mean µ and variance σ², then

√n ( x̄ − µ )/σ → N(0, 1)

in distribution as n → ∞.

7.3 Random Sampling

That a computer can generate a uniform distribution on [0, 1] is taken as a given, but we probably want more distributions than just that. Suppose ξ ∼ Unif[0, 1]. Then f(x) dx = dξ, so F(x) = ξ, and, if F is invertible, x = F^{−1}(ξ). For example,

f(x) = e^{−x}

F(x) = ∫_0^x f(t) dt = 1 − e^{−x}

1 − e^{−x} = ξ  ⇒  e^{−x} = 1 − ξ  ⇒  x = −log(1 − ξ)

and because 1 − ξ is also uniform on [0, 1], we may just as well use x = −log(ξ). We can use this technique to generate any distribution as long as F is invertible.

If F is not (easily) invertible, we need some other ideas. One idea, due to Von Neumann, is the acceptance-rejection method. In this method, we find w(x) ≥ 0 such that Mw(x) > f(x) ∀x for some M > 1, and such that W(x) = (1/A) ∫_0^x w(t) dt is easily invertible (A normalizes w to a density). The algorithm for this method is

1. Choose ξ_1 ∼ Unif[0, 1], find x = W^{−1}(ξ_1)

2. Choose ξ_2 ∼ Unif[0, 1]. If f(x) ≥ Mw(x)ξ_2, accept x. If not, reject x and return to step 1.

The efficiency of this method depends on how close w(x) is to f(x):

P(accept x) = P( f(x)/(Mw(x)) > ξ_2 ) = E( f(x)/(Mw(x)) ) = ∫_{−∞}^{∞} ( f(x)/(Mw(x)) ) ( w(x)/A ) dx = 1/(MA)
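A short Python sketch of both ideas: inverse transform sampling for the exponential example, and a generic acceptance-rejection loop. The function names and arguments (w_sampler for draws from the density w, w_pdf for its values) are my own; they are placeholders, not a standard API.

import numpy as np

rng = np.random.default_rng(0)

def sample_exponential(n, lam=1.0):
    # inverse transform: F(x) = 1 - exp(-lam*x)  =>  x = -log(xi)/lam
    return -np.log(rng.random(n)) / lam

def accept_reject(f, w_sampler, w_pdf, M, n):
    out = []
    while len(out) < n:
        x = w_sampler()
        if f(x) >= M * w_pdf(x) * rng.random():   # accept with prob f(x)/(M w(x))
            out.append(x)
    return np.array(out)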

7.3.1 Discrete Sampling

Let k ∈ {1, ..., M} be a random integer so that k = i with probability w_i. The sampling algorithm is

1. Set W_k = ∑_{i=1}^k w_i, with W_0 = 0

2. Take ξ ∼ Unif[0, 1]

3. Find k such that W_{k−1} ≤ ξ ≤ W_k

If the cumulative sums W_k are not available, find w ≥ w_i for all i and use the acceptance-rejection algorithm, like so:

1. Select an integer uniformly from {1, ..., M} by setting k = [Mξ_1] + 1, where ξ_1 ∼ Unif[0, 1]

2. If w_k > wξ_2 (with ξ_2 ∼ Unif[0, 1]), accept k; else reject and repeat

7.4 Multivariate Random Variables

We can have multivariate random variables ~X = (X_1, ..., X_d). For these,

F(~x) = P( ⋂_{i=1}^d { X_i ≤ x_i } )

f(~x) = ∂^d F(~x) / ( ∂x_1 ∂x_2 ··· ∂x_d )

and

E(h(~X)) = ∫_{R^d} h(~x) f(~x) d~x

The multivariate normal distribution f = f(~x; ~µ, Σ) is

f = ( 1/( (2π)^{d/2} √det(Σ) ) ) exp( −½ (~x − ~µ)^T Σ^{−1} (~x − ~µ) )

where Σ is a positive definite (covariance) matrix. If x_1, ..., x_d are independent with joint density function f(x_1, ..., x_d), and x_i has density function f_i(x_i), then

f(x_1, ..., x_d) = f_1(x_1) f_2(x_2) ··· f_d(x_d)

Covariance is defined as

cov(X, Y) = E(XY) − E(X)E(Y)

and if X and Y are independent, cov(X, Y) = 0. Covariance is needed because

var( x_1 + ··· + x_d ) = ∑_{i=1}^d var(x_i) + 2 ∑_{i<j} cov(x_i, x_j)

And, if x_1, ..., x_n are independent identically distributed random variables, then

var( ( x_1 + ··· + x_n )/n ) = var(x_i)/n

In general, x_1, x_2, ..., x_d are not mutually independent. We need a transformation T: x → η = T(x) such that

P(x_1, x_2, ..., x_d) dx_1 ··· dx_d = P_1(η_1) P_2(η_2) ··· P_d(η_d) dη_1 dη_2 ··· dη_d

One such transformation is built from conditional marginals:

T_1(x_1) = ∫_{−∞}^{x_1} dη ∫_{R^{d−1}} dx_2 ··· dx_d P(η, x_2, ..., x_d)

T_2(x_1, x_2) = [ ∫_{−∞}^{x_2} dη ∫_{R^{d−2}} dx_3 ··· dx_d P(x_1, η, x_3, ..., x_d) ] / [ ∫_{−∞}^{∞} dη ∫_{R^{d−2}} dx_3 ··· dx_d P(x_1, η, x_3, ..., x_d) ]

...

T_d(x_1, ..., x_d) = [ ∫_{−∞}^{x_d} dη P(x_1, ..., x_{d−1}, η) ] / [ ∫_{−∞}^{∞} dη P(x_1, ..., x_{d−1}, η) ]

We can find x_1 from T_1(x_1), then x_2 from T_2(x_1, x_2), and so on. This isn't very practical, but it is perfectly general.

7.5 Variance Reduction Methods

High variance leads to a large error constant in a Monte Carlo method, so we want to sample in ways that lower variance. Let

I = ∫_D f(x) dx

and assume |D| = 1. If the x_i are uniformly distributed in D, then we can approximate I by

Î_m = (1/m) [ f(x_1) + ··· + f(x_m) ]

If the variance of f(x_i) is σ², then the variance of Î_m is σ²/m, and this governs the error in the approximation. If f ∼ constant in D, we have low variance. So, we can partition our domain into pieces D_1, ..., D_N on each of which f is nearly constant, then split the integral up (stratified sampling):

I = ∑_{i=1}^N ∫_{D_i} f(x) dx

so

Î_m = ∑_{i=1}^N ( vol(D_i)/m_i ) ∑_{l=1}^{m_i} f(x_{li}),   with the x_{li} uniform in D_i

so we apply the Monte Carlo method on each D_i separately. Then the variance of Î_m is

var(Î_m) = ∑_{i=1}^N vol(D_i)² σ_i²/m_i

where σ_i² is the variance of f over D_i. Because f is nearly constant on each piece, the σ_i are small, and this can be made smaller than σ²/m.

There are a few more variance reduction methods which I will omit (for now) because I see them as unlikely to come up on the qual.


Chapter 8

Numerical Linear Algebra

8.1 Norms

In order to do any kind of analysis of a numerical method, it is important to have a way to judge "size" in whatever vector space is relevant. To do this, we need an object called a norm. A vector norm ‖·‖ is a map with the properties:

• for x ∈ R^n, ‖x‖: R^n → R

• ‖x‖ ≥ 0

• ‖x‖ = 0 ⇔ x = ~0

• ‖αx‖ = |α|‖x‖ for scalar α

• ‖x + y‖ ≤ ‖x‖ + ‖y‖

The common examples in R^n are the l₁ norm

‖x‖₁ = ∑_{i=1}^n |x_i|

the l₂ norm

‖x‖₂ = ( ∑_{i=1}^n |x_i|² )^{1/2}

the l_∞ norm

‖x‖_∞ = max_i |x_i|

and the general l_p norm

‖x‖_p = ( ∑_{i=1}^n |x_i|^p )^{1/p}


8.1.1 Induced Matrix Norms

We also want to measure operators (in our case that means matrices, because we are only dealing with operators of finite rank). For this, we can use norms induced by vector norms, or general matrix norms. An induced norm for an m × n matrix, given vector norms ‖·‖_{R^n}, ‖·‖_{R^m} on R^n and R^m respectively, is

‖A‖ = sup_{x ∈ R^n, x ≠ 0} ‖Ax‖_{R^m}/‖x‖_{R^n} = sup_{x ∈ R^n, ‖x‖_{R^n} = 1} ‖Ax‖_{R^m}

(I should note that I am not using standard notation for these vector norms, which would be, for an l_p norm, ‖·‖_{l^p(R^n)} or simply ‖x‖_p.) The matrix norm induced by the vector l_p norm is usually denoted ‖·‖_p. Intuitively, an induced norm measures the maximum stretching that a matrix may do. If D is a diagonal matrix with diagonal entries d_i, then ‖D‖_p = max_i |d_i|. If we write a matrix out labelling its column vectors, A = [a_1 a_2 ··· a_n], then we see

‖Ax‖₁ = ‖ ∑_j x_j a_j ‖₁ ≤ ∑_j |x_j| ‖a_j‖₁ ≤ ( max_j ‖a_j‖₁ ) ‖x‖₁

If x = e_{j*}, where j* is the index of the maximizing column, the bound is attained, so ‖A‖₁ = max_j ‖a_j‖₁ (the maximum column sum). Similarly, ‖A‖_∞ = max_i ‖a_i^*‖₁ (the maximum row sum), where the a_i^* are the row vectors of A.

8.1.2 General Matrix Norms

We can also treat the collection of m × n matrices as an mn dimensional vector space, and use the normal rules for a norm; this makes the space of matrices a Banach space. For example, the Frobenius norm is

‖A‖_F = ( ∑_i ∑_j a_{ij}² )^{1/2} = ( ∑_j ‖a_j‖₂² )^{1/2} = trace(A^TA)^{1/2} = trace(AA^T)^{1/2}

This is analogous to the l₂ norm. It satisfies a Cauchy-Schwarz type (submultiplicative) inequality:

‖AB‖_F ≤ ‖A‖_F ‖B‖_F

8.2 Singular Value Decomposition

The image of the unit sphere under any finite rank transformation A (m × n) is a hyperellipse. Thus, knowledge of the directions and lengths of the axes of the hyperellipse tells us a lot about the action of A. Let v_1, ..., v_n, with ‖v_i‖₂ = 1 and v_i ⊥ v_j, be the pre-images of the semi-axes of the hyperellipse, σ_1 ≥ σ_2 ≥ ... ≥ σ_n ≥ 0 the lengths of the semi-axes, and u_1, ..., u_n the directions, so u_i ⊥ u_j and the principal semi-axes of the hyperellipse are σ_i u_i. The lengths σ_1, ..., σ_n are called the singular values of A. If rank(A) = r, then there are exactly r non zero singular values (and if m ≤ n, at most m can be non zero). The vectors u_i are called the left singular vectors of A, and the vectors v_i are the right singular vectors. Clearly Av_j = σ_j u_j, so, if A is of full rank, we get

AV̂ = ÛΣ̂

where V̂ is the n × n matrix whose columns are the v_i, Û is the m × n matrix whose columns are the u_i, and Σ̂ is the n × n diagonal matrix whose entries are the singular values. By construction, Û and V̂ have orthonormal columns, so

A = ÛΣ̂V̂^T

This decomposition is called the reduced singular value decomposition. We get the full singular value decomposition (SVD) by completing the orthonormal basis of R^m, adding additional columns to Û. This produces an m × m orthogonal matrix U whose first n (or, if r = rank(A) < n, r) columns are those of Û and whose remaining columns complete the orthonormal basis. Adding rows of zeros to Σ̂ gives the m × n matrix Σ, and we have the full SVD:

A = UΣV^T

The SVD is very useful because it reveals the properties of Ax and A^{−1}y, makes use of orthonormal bases, and exists for any matrix. Orthonormal bases are good for the stability of many methods.

Theorem 23. The SVD exists for any m × n matrix A.

Proof. First note that multiplication by orthogonal matrices preserves the 2-norm, so once we have A = UΣV^T we will have ‖A‖₂ = ‖Σ‖₂, and because Σ is diagonal, ‖Σ‖₂ = σ_1. So set σ_1 = ‖A‖₂ and take u_1, v_1 with ‖u_1‖₂ = 1, ‖v_1‖₂ = 1 such that Av_1 = σ_1 u_1. Complete these to orthonormal bases {v_j} ⊂ R^n and {u_j} ⊂ R^m, and from these form matrices V_1 and U_1. Then

U_1^T A V_1 = S = [ σ_1 w^T ; 0 B ]

where w ∈ R^{n−1} and B is (m−1) × (n−1). Then

‖ S (σ_1 ; w) ‖₂ = ‖ ( σ_1² + w^Tw ; Bw ) ‖₂ ≥ σ_1² + w^Tw = ( σ_1² + w^Tw )^{1/2} ‖ (σ_1 ; w) ‖₂

However, ‖S‖₂ = ‖A‖₂ = σ_1, so we must have w = 0. Thus

S = [ σ_1 0 ; 0 B ]

B is the action of A on the subspace orthogonal to v_1. We can repeat the argument on B, and by induction the SVD exists.

The properties of the SVD include

• r = rank(A) = the number of non zero σ_i

• the range of A: R(A) = span{u_1, u_2, ..., u_r}

• the null space of A: N(A) = span{v_{r+1}, v_{r+2}, ..., v_n}

• ‖A‖₂ = ‖Σ‖₂ = σ_1, and ‖A‖_F = ‖Σ‖_F = ( ∑_{i=1}^r σ_i² )^{1/2}

• the non zero singular values are σ_i = √λ_i(A^TA)   (?)

• if A^T = A, then σ_i = |λ_i(A)|

A way to find the SVD (or at least the singular values) is suggested by (?), and it is worth inspecting why (?) is true:

A^TA = (UΣV^T)^T (UΣV^T) = V Σ^TΣ V^T

so A^TA is similar to Σ^TΣ, so they share eigenvalues. Because Σ^TΣ is diagonal with entries σ_i², (?) is true.

One use for the SVD is low rank approximation. Let Σ = Σ_1 + Σ_2 + ..., where Σ_i is the diagonal matrix whose only non zero entry is σ_i in position i, so that A can be written as a sum of rank one matrices:

A = ∑_{i=1}^r σ_i u_i v_i^T

Then let

A_k = ∑_{i=1}^k σ_i u_i v_i^T,   0 ≤ k ≤ r

Then A_k is the best possible approximation of A of rank k in ‖·‖₂ (or ‖·‖_F), with ‖A − A_k‖₂ = σ_{k+1}. This fact is used in fields such as image compression.

Proof. Suppose ∃B of rank ≤ k such that ‖A − B‖₂ < ‖A − A_k‖₂ = σ_{k+1}. rank(B) ≤ k ⇒ ∃ an (n − k)-dimensional subspace W ⊂ R^n with Bw = 0 ∀w ∈ W. For such a w, ‖Aw‖₂ = ‖(A − B)w‖₂ ≤ ‖A − B‖₂‖w‖₂ < σ_{k+1}‖w‖₂. But there is a (k + 1)-dimensional subspace (the span of the first k + 1 right singular vectors) on which ‖Aw‖₂ ≥ σ_{k+1}‖w‖₂. The sum of the dimensions of these two subspaces is greater than n, so they must have a non trivial intersection, which is a contradiction.

Proof. Suppose ∃B of rank ≤ k such that ‖A− B‖2 < ‖A− Ak‖2 = σk+1. rank(B) ≤ k ⇒∃(n − k) dimensional subspace W ⊂ Rn ⇒ Bw = 0∀w ∈ W . For such a W , ‖Aw‖2 =‖(A − B)w‖2 ≤ ‖A − B‖2‖w‖2 ≤ σk+1‖w‖2. But, ∃ a k + 1 dimensional subspace where‖Aw‖2 ≥ σk+1‖w‖2 (the span of the first k + 1 right singular vectors). This means the sumof the spaces must have dimension > n, so there must be a non empty intersection, which isa contradiction.

67

8.3 Projection Operators

A projector is an operator that takes a vector in a vector space (we will be dealing just withRn here) to a vector in a subspace (Rm, m ≤ n). This action is called projection, and so,unsurprisingly, a projection operator is an operator that projects a vector onto a subspace. Ifa projection operator P projects onto a subspace W , and then for a vector v ∈ W , Pv = v.Furthermore, for any vector u in the whole space, Pu ∈ W (because that is precisely thepoint of P ). So, P 2(u) = P (Pu) = Pu, and so we actually define projectors by the propertythat

P 2 = P

Because P 2 exists, if P is a matrix (which it is if we are dealing with Rn) P must be square.If P is a projector, then the operator I − P is also a projector:

(I − P )2 = I2 − 2P + P 2 = I − 2P + P = I − P

(I − P ) projects onto the null space N(P ).An orthogonal projector is a projector that projects a vector v onto a subspace S1 along aspace S2, with S1 ⊥ S2. This is the projector that takes v to the vector that is closest to itin S1.

Theorem 24. An operator P is an orthogonal projector if and only if P 2 = P and P = P ∗

(or P = P T if P is real).

Proof. “If”:Consider x, y such that Px ∈ S1, (I − P )y ∈ S2. Then

(Px)T ((I − P )y) = xTP T (I − P )y = xT (P Ty − P TPy)

assume that P T = P and P is a projector, then

xT (P Ty − P TPy) = xT (Py − P 2y) = xT (Py − Py) = 0

then S1 ⊥ S2, so P is an orthogonal projector.“Only If”:We will need the SVD. Let S1 have dimension n, and let q1, q2, ..., qn span S1, with Pqi = qiand qTi qj = 0, i 6= j and qn+1, ..., qm span S2 with Pqk = 0. Then

PQ = P

| | | | |

q1 q2 · · · qn qn+1 · · · qm

| | | | |

=

| | | | |

q1 q2 · · · qn 0 · · · 0

| | | | |

this means that

QTPQ = Σ =

1

. . .

1

0. . .

so P = QΣQT = P T

68

An example of a rank 1 projector onto a the direction of a vector ~q is familiar from calc2:

P =qqT

qT q

In calc 2, this usually taught asqTu

qT qq = Pu

If P1 and P2 are both orthogonal projectors onto S1, then P1 = P2.

Proof. Take z ∈ Rn. Then

|(P1 − P2)z‖2 = zT (P1 − P2)T (P1 − P2)z = zT (P1 − P2)2z

= zTP1(P1 − P2)z − zTP2(P1 − P2)z = zTP1(I − P2)z − zTP2(P1 − I)z = 0

because (I − P2)z ∈ S2 and (P1 − I)z ∈ S2.

So, given an orthonormal basis q1, q2, ..., qn of S1 then P = QQT is a unique orthonormalprojector that projects onto S1, where the columns of Q are the qi. Also, (I − QQT ) is anorthogonal projector.

8.4 Gram-Schmidt Orthogonalization

Consider a matrix A ∈ R^{m×n} with columns a_j, and consider the successive subspaces ⟨a_1⟩ ⊂ ⟨a_1, a_2⟩ ⊂ ⟨a_1, a_2, a_3⟩ ⊂ .... The idea of the Gram-Schmidt algorithm is to find an orthonormal basis for each subspace successively. We want to do this because if an orthonormal basis is known for one of these subspaces, it is easy to find an orthonormal basis for the next: all that is needed is to add one vector, orthogonal to all the others, that supplies the dimension added by the next a_j. We will denote the vectors in these bases q_j, so that ⟨q_1⟩ ⊂ ⟨q_1, q_2⟩ ⊂ ⟨q_1, q_2, q_3⟩ ⊂ ..., with the q_i orthonormal. The process is:

q_1 = a_1/r_{11}

q_2 = ( a_2 − r_{12}q_1 )/r_{22}

q_3 = ( a_3 − r_{13}q_1 − r_{23}q_2 )/r_{33}

and so on, with

r_{ij} = q_i^T a_j for i < j

r_{jj} = ‖ a_j − ∑_{i=1}^{j−1} r_{ij}q_i ‖₂

For each successive basis we add one vector q_k. We get this vector by subtracting from a_k its projections onto each of the basis vectors of the previous subspace. This leaves us with a vector q_k that is orthogonal to all the previous basis vectors and completes a basis of ⟨a_1, ..., a_k⟩. We can continue the process until k = n, giving us an orthonormal basis for the column space of A.

This process also gives us a decomposition of A. If m ≥ n,

[ a_1 a_2 ··· a_n ] = [ q_1 q_2 ··· q_n ] [ r_{11} r_{12} r_{13} ··· ; r_{22} r_{23} ··· ; r_{33} ··· ; ··· ]

This is A = Q̂R̂, the reduced QR factorization. The columns of Q̂ span the range R(A). We can add columns (an orthonormal basis of the orthogonal complement of R(A)) and add rows of zeros to R̂ to get the full QR factorization A = QR, where the columns of Q are an orthonormal basis of R^m. By construction, we see that every matrix has a QR factorization.

The Classical Gram-Schmidt Algorithm is

for j = 1:n do
  v_j = a_j;
  for i = 1:(j−1) do
    r_{ij} = q_i^T a_j;
    v_j = v_j − r_{ij} q_i;
  end
  r_{jj} = ‖v_j‖₂;
  q_j = v_j/r_{jj};
end

Unfortunately, this happens to be an unstable algorithm. There is a modified Gram-Schmidt algorithm that is stable. The difference between them is that the modified algorithm subtracts the projection onto each new direction from all the remaining vectors as soon as that direction is found, so every remaining vector is updated at each step.

The Modified Gram-Schmidt Algorithm is

for i = 1:n do
  v_i = a_i;
end
for i = 1:n do
  r_{ii} = ‖v_i‖₂;
  q_i = v_i/r_{ii};
  for j = i+1:n do
    r_{ij} = q_i^T v_j;
    v_j = v_j − r_{ij} q_i;
  end
end

This algorithm is stable. We would also like to know how fast the algorithm is, not only in the sense of convergence (this is a finite step algorithm, so convergence isn't really a question) but also in the sense of how many operations a computer must carry out. We count the number of floating point operations, or "flops", i.e. multiplications, additions, subtractions, divisions, and square roots. For the modified Gram-Schmidt algorithm, if m and n are large, the cost is dominated by the inner loop:

r_{ij} = q_i^T v_j : m multiplications and m − 1 additions

v_j = v_j − r_{ij} q_i : m multiplications and m subtractions

So each iteration of the inner loop costs about 4m flops, and the total number of flops is

∑_{i=1}^n ∑_{j=i+1}^n 4m = 4m ∑_{i=1}^n (n − i) = 4m n(n − 1)/2 ≈ 2mn²
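A direct numpy translation of the modified Gram-Schmidt pseudocode above, producing the reduced factorization:

import numpy as np

def mgs(A):
    A = np.array(A, dtype=float)
    m, n = A.shape
    Q = np.zeros((m, n)); R = np.zeros((n, n))
    V = A.copy()
    for i in range(n):
        R[i, i] = np.linalg.norm(V[:, i])
        Q[:, i] = V[:, i] / R[i, i]
        for j in range(i + 1, n):
            R[i, j] = Q[:, i] @ V[:, j]
            V[:, j] -= R[i, j] * Q[:, i]
    return Q, R

A = np.random.rand(6, 4)
Q, R = mgs(A)
assert np.allclose(Q @ R, A) and np.allclose(Q.T @ Q, np.eye(4))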

8.5 Householder Triangularization

Gram-Schmidt orthogonalization can be accomplished by applying triangular matrices to A until the orthogonal matrix Q is left, and so can be considered "triangular orthogonalization". An alternative way to arrive at the QR factorization of A is called Householder triangularization, which may be considered "orthogonal triangularization" because it works by applying orthogonal operators to A until what is left is the upper triangular matrix R. That is, the idea of Householder triangularization is

Q_n ··· Q_2 Q_1 A = R

where Q_n ··· Q_2 Q_1 = Q^T. The standard approach is to use

Q_k = [ I 0 ; 0 F ]

where I is the (k−1) × (k−1) identity and F is an (m−k+1) × (m−k+1) operator that puts zeros in the k-th column beneath the diagonal. We choose F to be a Householder reflector, an operator such that

Fx = ( ‖x‖₂, 0, 0, ..., 0 )^T = ‖x‖₂ e_1

F reflects the vector x across the hyperplane H which is orthogonal to v = ‖x‖₂e_1 − x. This is a linear operator that looks similar to a projector:

Fy = ( I − 2 vv^T/(v^Tv) ) y

For better stability, we want v to be longer. Thus, in practice we choose whichever of ‖x‖₂e_1 or −‖x‖₂e_1 is farthest from x, so

v = sign(x_1)‖x‖₂ e_1 + x
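A minimal Python sketch of this process. It returns R and the list of reflector vectors v_k (Q is not formed explicitly; Q^T b can be computed later by applying the same reflectors to b). The zero-column guard is my own addition.

import numpy as np

def householder_qr(A):
    R = np.array(A, dtype=float)
    m, n = R.shape
    V = []
    for k in range(n):
        x = R[k:, k].copy()
        sign = 1.0 if x[0] >= 0 else -1.0
        v = x.copy()
        v[0] += sign * np.linalg.norm(x)       # v = sign(x_1)||x|| e_1 + x
        nv = np.linalg.norm(v)
        if nv > 0.0:
            v /= nv
            R[k:, k:] -= 2.0 * np.outer(v, v @ R[k:, k:])   # apply F = I - 2 v v^T
        V.append(v)
    return R, V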

8.5.1 Least Squares

Having a QR factorization for A is very useful for a number of applications. One is least squares problems, which involve the equation Ax = b when b ∉ R(A). This means, of course, that Ax = b has no solution. However, we may ask for the "best possible" vector x. This generally means the vector x such that the residual r = b − Ax is as small as possible; the least squares approach is to minimize ‖r‖₂.

The classical approach to finding the least squares solution of Ax = b is to solve the "normal equations", noting that ‖r‖₂ is minimized if r ⊥ R(A). Thus we want A^Tr = 0. This holds if A^T(b − Ax) = 0, i.e. A^TAx = A^Tb, so the solution is given by the pseudoinverse:

x = ( A^TA )^{−1} A^T b

Now, with the reduced QR factorization A = Q̂R̂, we can project onto R(A) using the orthogonal projector Q̂Q̂^T. Thus Q̂R̂x = Q̂Q̂^Tb, and we can find the least squares solution by solving

R̂x = Q̂^Tb

We can similarly use the SVD: A = ÛΣ̂V^T, and ÛÛ^T is an orthogonal projector onto R(A). This gives

ÛΣ̂V^Tx = ÛÛ^Tb  ⇒  Σ̂w = Û^Tb,   x = Vw

8.6 Conditioning and Stability

To say more about the usefulness of our algorithms, we need to determine a few important characteristics that we want in a "good" algorithm. Ultimately, we care about the "correctness" of answers, as well as the speed at which our algorithm produces them. Consider F: X → Y, where X and Y are normed vector spaces (Banach spaces). We call F ill-conditioned if F(x) is highly sensitive to changes in x. This is an important consideration because, with rounding error inherent in computing, x must be an approximation with some error due to rounding.

Let δ~F = ~F(~x + δ~x) − ~F(~x). Then the absolute condition number of F is

κ̂ = sup_{δ~x} ‖δ~F‖/‖δ~x‖

If δF_i = J_{ij}(~x) δx_j, where J_{ij} = ∂F_i/∂x_j, then

κ̂ = ‖J‖

A note on notation: the norm ‖·‖ is not specified because, in a practical sense, it does not matter. This is due to the equivalence of norms in finite dimensions, a simple result from analysis: ∀p, q, ∃c, C such that c‖x‖_p ≤ ‖x‖_q ≤ C‖x‖_p. Also, in the definition of δ~F for differentiable F (the second definition), Einstein's summation notation is used: repeated indices are summed, so above there is a sum over j. I will try to indicate when this notation is being used.

We can also define the relative condition number κ, which gives a better idea of conditioning within the scale of the problem:

κ = sup_{δ~x} ( ‖δF‖/‖F‖ ) / ( ‖δx‖/‖x‖ )

and if F is differentiable,

κ = ‖J‖‖x‖/‖F‖

An ill conditioned problem has κ ∼ 10⁶ or greater. A classical example of an ill conditioned problem is determining the roots of a polynomial p(a_0, a_1, ..., a_n, x) = a_0 + a_1x + ... + a_nx^n = 0: the problem takes the coefficients and returns the roots, and the dependence of the roots on the coefficients can be very strong. To see this, consider the j-th root x_j of p and perturb the coefficient a_i:

p(a_0, a_1, ..., a_i + δa_i, ..., a_n, x_j + δx_j) = 0

a_0 + a_1(x_j + δx_j) + a_2(x_j + δx_j)² + ... + (a_i + δa_i)(x_j + δx_j)^i + ... + a_n(x_j + δx_j)^n = 0

To first order we see that

δx_j = −δa_i x_j^i / ( ∂p/∂x |_{x_j} )

and so the relative condition number is

κ = ( |δx_j|/|x_j| ) / ( |δa_i|/|a_i| ) = |a_i x_j^{i−1}| / | ∂p/∂x |_{x_j} |

So if n = 20, κ can be ≈ 10^{13}.

One important operation is the action of a matrix A on a vector x. We do not want a small change in x to lead to a large change in Ax.

κ = sup_{δx} ( ‖A(x + δx) − Ax‖/‖Ax‖ ) / ( ‖δx‖/‖x‖ ) = sup_{δx} ( ‖Aδx‖/‖δx‖ ) ( ‖x‖/‖Ax‖ )

so

κ = ‖A‖ ‖x‖/‖Ax‖

We would like to judge the operator A without any reliance on x, so we want some kind of condition number that is independent of x. Since ‖x‖ = ‖A^{−1}Ax‖ ≤ ‖A^{−1}‖‖Ax‖, we have ‖x‖/‖Ax‖ ≤ ‖A^{−1}‖. This means κ ≤ ‖A‖‖A^{−1}‖. We define

κ(A) = ‖A‖‖A^{−1}‖

as the condition number of A.

Another problem of general interest is finding the solution x of Ax = b, given b. Assuming A is invertible, x = A^{−1}b, so

κ = ‖A^{−1}‖ ‖b‖/‖x‖ ≤ ‖A^{−1}‖‖A‖

so κ ≤ κ(A). We see the same if A is perturbed in this problem. κ(A) depends on the choice of norm. If A is close to singular, κ(A) → ∞, because perturbations of x may move Ax (nearly) out of the range of A. If ‖·‖ = ‖·‖₂, then ‖A‖ = σ_1 and ‖A^{−1}‖ = 1/σ_m, so κ(A) = σ_1/σ_m. This is the eccentricity of the relevant hyperellipsoid (thinking, as always, about the transformation of the unit sphere under A).

8.6.1 Floating Point Arithmetic

I have already mentioned the rounding error inherent in any method implemented on a computer. This error is due to the way that computers store numbers in memory. The modern standard is the double precision floating point representation: a real number is stored in 64 bits (8 bytes) of the computer's memory. The first of these bits stores the sign of the number, the next 11 bits encode the exponent (bits e_i), and the last 52 bits the fractional part. A number is represented as

(−1)^{sign bit} 2^{e − 1023} ( 1 + ∑_{i=1}^{52} b_i 2^{−i} )

where

e = ∑_{i=1}^{11} e_i 2^{i−1}

The exponent pattern with all its bits set to 1 is reserved for the special values "infinity" and NaN, and the representation with every bit set to 0 is reserved for 0. The largest finite number allowed is therefore obtained with the largest non-reserved exponent and all fraction bits set to 1, namely N = 2^{1023}(2 − 2^{−52}) ≈ 1.8 × 10^{308}.

Numbers that are too large are usually not a problem, but the gaps between representable numbers are. Between 2³ and 2⁴ there are the same number of floating point numbers as there are between 2¹⁰ and 2¹¹. The distance from the number 1 to the nearest larger floating point number, 1 + 2^{−52}, is called "machine epsilon", ε_machine = 2^{−52} ≈ 2.2 × 10^{−16}. This is ultimately the relative precision possible using double precision, and for any real x (within range) there is an ε with |ε| ≤ ε_machine such that the floating point representation of x is Fl(x) = x(1 + ε). To emphasize that floating point operations have rounding error built in, we can denote them with a circle around the normal symbol: ⊕, ⊖, ⊗, and ⊘. Every flop has the same relative precision: there is some ε with |ε| ≤ ε_machine such that, for example, x ⊗ y = (x × y)(1 + ε).

8.6.2 Accuracy and Stability

If we have a problem F: X → Y, we can denote the algorithm we use to approximate a solution by F̃: X → Y. In our algorithm (which we are assuming is implemented on a real computer, not a magic one), F̃(x) = F̃(Fl(x)) = ỹ. An accurate algorithm is one in which

‖F̃(x) − F(x)‖ / ‖F(x)‖ = O(ε_machine)

If the original problem F is ill conditioned, we aren't going to get an accurate algorithm. Instead, we might make stability our goal, so that F̃ is only as good as F. A stable algorithm is one in which, ∀x ∈ X,

‖F̃(x) − F(x̃)‖ / ‖F(x̃)‖ = O(ε_machine)

for some x̃ with ‖x̃ − x‖/‖x‖ = O(ε_machine). We might say F̃ is stable if it gives nearly the right answer to nearly the right question.

A stronger and simpler requirement is backwards stability. An algorithm is backwards stable if F̃(x) = F(x̃) for some x̃ with ‖x̃ − x‖/‖x‖ = O(ε_machine). A backwards stable algorithm could be said to give exactly the right answer to nearly the right question.

For example, we might ask if floating point subtraction is backwards stable. Here F(x, y) = x − y, and

F̃(x, y) = Fl(x) ⊖ Fl(y) = ( x(1 + ε_1) − y(1 + ε_2) )(1 + ε_3) = x(1 + ε_1)(1 + ε_3) − y(1 + ε_2)(1 + ε_3) = x̃ − ỹ

where

‖x̃ − x‖/‖x‖ = O(ε_machine),   ‖ỹ − y‖/‖y‖ = O(ε_machine)

so this is backwards stable.

8.6.3 Backwards Stability of QR Algorithms

Some experimentation with the algorithm (usually not a bad idea before diving into rigorous analysis) reveals that the Householder algorithm (or the slight variation of it built into matlab or python), when applied to a randomly generated matrix A (constructed by choosing a random orthogonal Q and upper triangular R and multiplying them together), returns matrices Q̃ and R̃ such that

‖A − Q̃R̃‖/‖A‖ = O(ε_machine)

‖Q̃ − Q‖/‖Q‖ ≈ 0.01

‖R̃ − R‖/‖R‖ ≈ 0.001

So, while the algorithm is very good at producing a factorization of A, the computed factors are not close to the true R and Q. This at least illustrates that the QR algorithm is backward stable: Q̃R̃ = A + δA, with ‖δA‖/‖A‖ = O(ε_machine).

The question is then whether this is good enough to be useful. To solve Ax = b using the QR factorization, we do the following:

1. QR = A

2. y = Q^Tb

3. x = R^{−1}y

Taking each of these steps individually,

1. Q̃R̃ = A + δA with ‖δA‖/‖A‖ = O(ε_machine)

2. (Q̃ + δQ)ỹ = b with ‖δQ‖ = O(ε_machine)

3. (R̃ + δR)x̃ = ỹ with ‖δR‖/‖R̃‖ = O(ε_machine)

so each step is backwards stable, and we claim (A + ΔA)x̃ = b for some ‖ΔA‖/‖A‖ = O(ε_machine).

Proof. Given that each step is individually backwards stable,

b = (Q̃ + δQ)(R̃ + δR)x̃ = ( Q̃R̃ + δQR̃ + Q̃δR + δQδR )x̃ = (A + ΔA)x̃

So we must show that for ΔA = δA + δQR̃ + Q̃δR + δQδR, ‖ΔA‖/‖A‖ = O(ε_machine). We can take it in parts:

1. We know that ‖δA‖/‖A‖ = O(ε_machine).

2. ‖R̃‖/‖A‖ = ‖Q̃^T(A + δA)‖/‖A‖ = O(1), so ‖δQR̃‖/‖A‖ ≤ ‖δQ‖ ‖R̃‖/‖A‖ = O(ε_machine).

3. ‖Q̃‖ = O(1), so ‖Q̃δR‖/‖A‖ ≤ ‖Q̃‖ ( ‖δR‖/‖R̃‖ ) ( ‖R̃‖/‖A‖ ) = O(ε_machine).

4. ‖δQδR‖/‖A‖ ≤ ‖δQ‖ ‖δR‖/‖A‖ = O(ε²_machine).

Thus, we have

‖ΔA‖/‖A‖ = ‖δA + δQR̃ + Q̃δR + δQδR‖/‖A‖ = O(ε_machine)

So this algorithm for solving Ax = b is backwards stable, and it follows that ‖x̃ − x‖/‖x‖ = O( κ(A) ε_machine ).


8.6.4 Stability of Back Substitution

Back substitution is the process for solving an upper triangular system. It is pretty straightforward, but used all the time: in general we apply some method to get a triangular system and then use back substitution to solve it. The algorithm solves Rx = b where R is m × m upper triangular. We want to show that it is backwards stable, i.e. (R + δR)x̃ = b with ‖δR‖/‖R‖ = O(ε_machine). The problem is

[ r_{1,1} r_{1,2} r_{1,3} ··· ; r_{2,2} r_{2,3} ··· ; r_{3,3} ··· ; ··· ; r_{m,m} ] (x_1, x_2, ..., x_m)^T = (b_1, b_2, ..., b_m)^T

The algorithm is

x_m = b_m / r_{m,m};
x_{m−1} = ( b_{m−1} − x_m r_{m−1,m} ) / r_{m−1,m−1};
...
x_j = ( b_j − ∑_{k=j+1}^m x_k r_{j,k} ) / r_{j,j};

We actually need to show backwards stability individually for each value of m, but here I will just do m = 1 and m = 2. For m > 2, the process is the same.

For m = 1, we simply compute

x̃_1 = b_1 ⊘ r_{1,1} = ( b_1/r_{1,1} )(1 + ε_1),   |ε_1| ≤ ε_machine

and we know that 1/(1 − ε) = ∑ ε^n = 1 + ε + O(ε²), so

x̃_1 = b_1 / ( r_{1,1}(1 + ε′_1) ),   |ε′_1| ≤ ε_machine + O(ε²_machine)

so ( r_{1,1} + ε′_1 r_{1,1} )x̃_1 = b_1, and the algorithm is backwards stable for m = 1.

For m = 2, we have (using the same trick) x̃_2 = ( b_2/r_{2,2} )(1 + ε_1) with |ε_1| ≤ ε_machine, and then

x̃_1 = ( b_1 ⊖ (x̃_2 ⊗ r_{1,2}) ) ⊘ r_{1,1}

= ( b_1 ⊖ x̃_2 r_{1,2}(1 + ε_2) ) ⊘ r_{1,1}

= ( b_1 − x̃_2 r_{1,2}(1 + ε_2) )(1 + ε_3) ⊘ r_{1,1}

= ( b_1 − x̃_2 r_{1,2}(1 + ε_2) )(1 + ε_3)(1 + ε_4) / r_{1,1}

= ( b_1 − x̃_2 r_{1,2}(1 + ε_2) ) / ( r_{1,1}(1 + ε′_3)(1 + ε′_4) )

where |ε_2|, |ε_3|, |ε_4| ≤ ε_machine, |ε′_3|, |ε′_4| ≤ ε_machine + O(ε²_machine), and (1 + ε′_3)(1 + ε′_4) = 1 + 2ε′_5 with |ε′_5| ≤ ε_machine + O(ε²_machine). So, if we set

r̃_{2,2} = (1 + ε_1) r_{2,2}

r̃_{1,2} = (1 + ε_2) r_{1,2}

r̃_{1,1} = (1 + 2ε′_5) r_{1,1}

then x̃ is the exact solution of (R + δR)x̃ = b, with the entrywise bounds

|δr_{1,1}|/|r_{1,1}| ≤ 2ε_machine + O(ε²_machine),   |δr_{1,2}|/|r_{1,2}| ≤ ε_machine + O(ε²_machine),   |δr_{2,2}|/|r_{2,2}| ≤ ε_machine + O(ε²_machine)

so ‖δR‖/‖R‖ = O(ε_machine).
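For reference, a direct Python version of the back substitution loop analyzed above:

import numpy as np

def back_substitute(R, b):
    m = len(b)
    x = np.zeros(m)
    for j in range(m - 1, -1, -1):
        # x_j = (b_j - sum_{k>j} r_{j,k} x_k) / r_{j,j}
        x[j] = (b[j] - R[j, j + 1:] @ x[j + 1:]) / R[j, j]
    return x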

8.7 Gaussian Elimination

Gaussian elimination is the classic algorithm solve a linear system of equations, and isthe same basic algorithm as that which is taught in high schools before matrices are evenintroduced to students. It involves row multiplications and subtractions which reduce thesystem to one that is upper triangular. In fact, this algorithm can also give us a factorizationof the matrix A = LU where U is upper triangular and L is lower triangular. The processmay be termed “triangular triangularization”. While this is not how it is generally carriedout, it may be seen as the application of series of lower triangular matrices Li to A:

Lm−1 · · ·L3L2L1A = U

where Lm−1 · · ·L3L2L1 = L−1. This looks like

x x x x · · ·

x x x x · · ·

x x x x · · ·

x x x x · · ·...

......

......

L1A−−→

x x x x · · ·

0 x x x · · ·

0 x x x · · ·

0 x x x · · ·...

......

......

L2L1A−−−−→

x x x x · · ·

0 x x x · · ·

0 0 x x · · ·

0 0 x x · · ·...

......

......

· · ·

78

Let ~xk he the kth column of Lk−1 · · ·L1A. Then for the kth step in the process, we need Lkso that

Lk

x1,k

x2,k

...

...

...

...

=

x1,k

...

xk,k

0

0...

To achieve this, we subtract lj,k × row k from row j, where lj,k =

xj,kxk,k

, k < j ≤ m. So,

Lk =

1

1. . .

1

−lk+1,k 1

−lk+2,k 1...

. . .

−lm,k 1

= I − lkeTk

where

lkeTk =

0...

lk+1,k

...

lm,k

(

0 · · · 0 1 0 · · · 0)

So, we also get (I − lkeTk )(I + lKeTk ) = I − (lke

Tk )(lke

Tk ) = I, so L−1

k = I + lkeTk . Furthermore,

since eTk lj = 0 for j 6= k, we have

L−1k L−1

k+1 = (I + lkeTk )(I + lk+1e

Tk+1) = I + lke

Tk + lk+1e

Tk+1

and so

L = I +m−1∑k=1

lkeTk

79

and so L looks like

L =

1

l2,1 1

l3,1 l3,2 1...

... l4,3 1...

......

. . . . . ....

......

. . . . . . 1

The algorithm based on row multiplications and subtractions, and saves the multipliers usedso that the factorization can be created. The algorithm is:

U = A;L = I;for k = 1 : m− 1 do

for j = k + 1 : m do

Lj,k =Uj,kUk,k

;

uj,k:m = uj,k:m − Lj,kuk,k:m;

end

end

The flop count is

m−1∑k=1

m∑j=k+1

(1 + 2(m− k + 1)) =m−1∑k=1

(m− k)(1 + 2(m− k + 1)) ≈ 2

3m3

This is pretty high, but the A = LU decomposition is very useful. To solve Ax = b usingthis decomposition, the procedure is

1. A = LU

2. Ly = b, a forward substitution

3. Ux = y, a backwards substitution

This is about half the work required of the QR algorithm method. Unfortunately, Gaussianelimination is not backwards stable. In fact, the algorithm clearly fails for a matrix such as(

0 1

1 1

)

because it will attempt to divide by 0. If that 0 had been 10−20, the algorithm would work buta very large error would come from division by such a small number. For general A ∈ Rm×m,the A = LU decomposition is not stable.However, all is not lost! The algorithm can be improved. At the kth step of the algorithm,xk,k is called the “pivot”. However, we can apply a permutation matrix in order to changethe pivot to a different entry. Complete pivoting is when the largest entry in the remainingsubmatrix is used. More easily done, and still very effective, is partial pivoting, in which the

80

largest entry in the column below xk,k (the largest in absolute value of xk,k, xk+1,k, ..., xm,k)isused.A permutation is applied at each step, so we end up with

Lm−1Pm−1 · · ·L3P3L2P2L1P1A = U

with Pk a permutation matrix, which swaps 2 rows of a matrix. A permutation matrix hasthe form:

P =

1 0 0 0

0 0 0 1

0 0 1 0

0 1 0 0

If the desired effect of the permutation matrix P is to swap rows j and k, then P is theidentity with rows j and k swapped. So the the matrix given here will swap rows 2 and 4.In general, any matrix with exactly one 1 in each row and column is a permutation matrix.Luckily, we can rearrange this process so to get a nice factorization of the original matrix A.

\[
L_{m-1}P_{m-1}\cdots L_3P_3L_2P_2L_1P_1 = L'_{m-1}\cdots L'_2L'_1P_{m-1}\cdots P_2P_1
\]
where
\begin{align*}
L'_{m-1} &= L_{m-1} \\
L'_{m-2} &= P_{m-1}L_{m-2}P_{m-1}^{-1} \\
L'_{m-3} &= P_{m-1}P_{m-2}L_{m-3}P_{m-2}^{-1}P_{m-1}^{-1} \\
&\ \ \vdots \\
L'_1 &= P_{m-1}P_{m-2}\cdots P_3P_2L_1P_2^{-1}P_3^{-1}\cdots P_{m-2}^{-1}P_{m-1}^{-1}
\end{align*}
These are easy to construct because the action of a permutation matrix is simple. Then, letting $L^{-1} = L'_{m-1}\cdots L'_2L'_1$ and $P = P_{m-1}\cdots P_2P_1$, we have the factorization $PA = LU$. If $A$ is strictly diagonally dominant (i.e. $|a_{k,k}| > \sum_{j\neq k}|a_{j,k}|$), no pivoting is required. In practice, the change in the algorithm is that at each step, the largest (in magnitude) possible pivot in the column must be found, and rows must be swapped. Modifying the algorithm to include partial pivoting is a good exercise for the reader. With partial pivoting,
\[
PA + \delta A = LU
\]
for
\[
\frac{\|\delta A\|}{\|A\|} = O(\rho\,\varepsilon_{\text{machine}})
\]
where $\rho$ is the "growth factor", $\rho = \frac{\max_{i,j}|U_{i,j}|}{\max_{i,j}|A_{i,j}|}$. So, $PA = LU$ is backward stable for $\rho = O(1)$,


which is often true. It is not always true! For example, an $m\times m$ matrix with the form
\[
A = \begin{pmatrix}
1 & 0 & 0 & 0 & 1 \\
-1 & 1 & 0 & 0 & 1 \\
-1 & -1 & 1 & 0 & 1 \\
-1 & -1 & -1 & 1 & 1 \\
-1 & -1 & -1 & -1 & 1
\end{pmatrix}
\]
has $\rho = 2^{m-1}$, and so $PA = LU$ is explosively unstable.
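To make all of this concrete, here is a minimal Python/numpy sketch of Gaussian elimination with partial pivoting followed by the two triangular solves. It is written for clarity rather than efficiency, and the function names (lu_partial_pivoting, solve_lu) are just labels for this sketch, not anything standard.

import numpy as np

def lu_partial_pivoting(A):
    """PA = LU via Gaussian elimination with partial pivoting (dense, for illustration)."""
    U = np.array(A, dtype=float)
    m = U.shape[0]
    L = np.eye(m)
    P = np.eye(m)
    for k in range(m - 1):
        # choose the largest-magnitude pivot in column k, rows k..m-1
        p = k + np.argmax(np.abs(U[k:, k]))
        # swap rows in U, P, and the already-computed part of L
        U[[k, p], k:] = U[[p, k], k:]
        P[[k, p], :] = P[[p, k], :]
        L[[k, p], :k] = L[[p, k], :k]
        for j in range(k + 1, m):
            L[j, k] = U[j, k] / U[k, k]
            U[j, k:] -= L[j, k] * U[k, k:]
    return P, L, U

def solve_lu(P, L, U, b):
    """Solve Ax = b given PA = LU: forward substitution, then back substitution."""
    m = len(b)
    pb = P @ b
    y = np.zeros(m)
    for i in range(m):                      # Ly = Pb (L has unit diagonal)
        y[i] = pb[i] - L[i, :i] @ y[:i]
    x = np.zeros(m)
    for i in range(m - 1, -1, -1):          # Ux = y
        x[i] = (y[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
    return x

A quick sanity check is to compare solve_lu against numpy.linalg.solve on a small random matrix; they should agree to roundoff.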

8.8 Cholesky Decomposition

A hermitian, positive definite matrix $A$ can be decomposed as the product of an upper triangular matrix and its conjugate transpose:
\[
A = R^*R
\]
This is called the Cholesky decomposition. As a reminder, a hermitian matrix is a matrix with the property $A = A^*$; if $A$ is real, then this is a symmetric matrix. Some properties of a hermitian matrix $A$ are

1. $x^*Ax$ is real

2. All the eigenvalues of $A$ are real

3. Eigenvectors with different corresponding eigenvalues are orthogonal

Proof. Of these properties:

1. $(x^*Ax)^* = x^*A^*x = x^*Ax$

2. $Av_i = \lambda_iv_i$, so $v_i^*Av_i = \lambda_i\|v_i\|^2$, and $\lambda_i \in \mathbb{R}$ because $v_i^*Av_i \in \mathbb{R}$ by 1

3. Let $Av_1 = \lambda_1v_1$, $Av_2 = \lambda_2v_2$, $\lambda_1 \neq \lambda_2$. Then
\[
(\lambda_1v_1)^*v_2 = (Av_1)^*v_2 = v_1^*Av_2 = v_1^*\lambda_2v_2 = \lambda_2v_1^*v_2
\]
$\Rightarrow \lambda_1v_1^*v_2 = \lambda_2v_1^*v_2$, but $\lambda_1 \neq \lambda_2$, so $v_1^*v_2 = 0$.

A positive definite matrix $A$ is one where $x^*Ax > 0$ for all $x \neq 0$.

Theorem 25. If $A \in \mathbb{C}^{m\times m}$ is hermitian positive definite, and $X \in \mathbb{C}^{m\times n}$, $n \leq m$, is of full rank, then $X^*AX$ is also hermitian positive definite.


Proof. First,
\[
(X^*AX)^* = X^*A^*X = X^*AX
\]
so it is hermitian. It is positive definite because
\[
v^*(X^*AX)v = (Xv)^*A(Xv) > 0
\]
for $v \neq 0$, since $Xv \neq 0$ because $X$ is of full rank.

A couple of corollaries to this are

• All principal submatrices of $A$ are hermitian positive definite

• The diagonal entries of $A$ are all positive

We can arrive at the Cholesky decomposition by "symmetric Gaussian elimination". It is easiest to explain with an example. Let
\[
A = \begin{pmatrix} 1 & w^* \\ w & K \end{pmatrix}
= \begin{pmatrix} 1 & 0 \\ w & I \end{pmatrix}
\begin{pmatrix} 1 & w^* \\ 0 & K - ww^* \end{pmatrix}
\]
and consider that this means
\[
A = \begin{pmatrix} 1 & 0 \\ w & I \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ 0 & K - ww^* \end{pmatrix}
\begin{pmatrix} 1 & w^* \\ 0 & I \end{pmatrix}
\]
Generally, for $A = A^*$,
\[
A = \begin{pmatrix} a_{1,1} & w^* \\ w & K \end{pmatrix}
\]
and if $\alpha = \sqrt{a_{1,1}} \in \mathbb{R}$, we have
\[
A = \begin{pmatrix} \alpha & 0 \\ w/\alpha & I \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ 0 & K - ww^*/\alpha^2 \end{pmatrix}
\begin{pmatrix} \alpha & w^*/\alpha \\ 0 & I \end{pmatrix}
= R_1^*A_1R_1
\]
Now we can just apply the same process to the submatrix $K - ww^*/\alpha^2$. The algorithm is

R = A;
for k = 1 : m do
    for j = k+1 : m do
        R_{j,j:m} = R_{j,j:m} - R_{k,j:m} \bar{R}_{k,j}/R_{k,k};
    end
    R_{k,k:m} = R_{k,k:m}/\sqrt{R_{k,k}};
end

The flop count for this is
\[
\approx \sum_{k=1}^{m}\sum_{j=k+1}^{m}2(m-j) \approx \sum_{k=1}^{m}(m-k)^2 \approx \frac{m^3}{3}
\]
This is also a backwards stable algorithm, and it is the standard way to solve $Ax = b$ for hermitian positive definite $A$.
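Here is a direct numpy transcription of the algorithm above, as an illustrative sketch rather than a production routine (a library call such as numpy.linalg.cholesky, which returns the lower triangular factor, is what you would actually use).

import numpy as np

def cholesky_upper(A):
    """Return upper triangular R with A = R* R, for hermitian positive definite A.
    A direct transcription of the outer-product algorithm above (not optimized)."""
    R = np.array(A, dtype=complex)
    m = R.shape[0]
    for k in range(m):
        for j in range(k + 1, m):
            R[j, j:] -= R[k, j:] * np.conj(R[k, j]) / R[k, k]
        R[k, k:] /= np.sqrt(R[k, k])
    # entries below the diagonal were never touched meaningfully; drop them
    return np.triu(R)

A quick check: for a random hermitian positive definite A (e.g. A = B @ B.conj().T + I), R.conj().T @ R should reproduce A to roundoff.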


8.9 Eigenvalues and Eigenvectors

For a matrix $A \in \mathbb{C}^{m\times m}$, there are scalars $\lambda_i$ and vectors $x_i$ such that $Ax_i = \lambda_ix_i$. The scalars $\lambda_i$ are called the eigenvalues of $A$, and the vectors $x_i$ are the corresponding eigenvectors (I bet you knew that). The eigenvalues and eigenvectors are useful to know for a number of reasons. They are great, for example, for repeated operations such as computing matrix exponentials $e^{tA}$. They also lead to a decomposition of $A$. If we multiply $A$ by a matrix whose columns are the eigenvectors $x_i$ of $A$, we see
\[
A\begin{pmatrix} \vdots & \vdots & \\ x_1 & x_2 & \cdots \\ \vdots & \vdots & \end{pmatrix}
= \begin{pmatrix} \vdots & \vdots & \\ x_1 & x_2 & \cdots \\ \vdots & \vdots & \end{pmatrix}
\begin{pmatrix} \lambda_1 & & \\ & \lambda_2 & \\ & & \ddots \end{pmatrix}
\]
So we have $AX = X\Lambda$, where $\Lambda$ is the diagonal matrix of the eigenvalues of $A$, and $X$ is the matrix whose columns are the eigenvectors of $A$. If $X$ is invertible (think about when this is true), this gives us the eigenvalue decomposition of $A$:
\[
A = X\Lambda X^{-1}
\]

This is, in essence, a change of basis to the eigenbasis of $A$, and in this basis, $A$ stretches a vector by $\Lambda$. If $Ax = b$, then we have $\Lambda(X^{-1}x) = X^{-1}b$. We need to call on some operator theory (which I won't go into any detail on here) and consider $(A - \lambda I)$. Clearly, by the definition of eigenvectors and eigenvalues, $(A - \lambda I)x = 0$ for eigenvector $x$ and eigenvalue $\lambda$. The matrix $(A - \lambda I)$ is singular, so we know that $\det(A - \lambda I) = 0$. This suggests an (analytical) way to find the eigenvalues of a matrix. Let
\[
P_A(z) = \det(A - zI) \Rightarrow P_A(\lambda) = 0
\]
$P_A(z)$ is a polynomial of degree $m$. The roots of $P_A(z)$ are the eigenvalues of $A$, and the fundamental theorem of algebra tells us that we can factor $P_A(z)$ in $\mathbb{C}$:
\[
P_A(z) = (z - \lambda_1)(z - \lambda_2)\cdots(z - \lambda_m)
\]
The (geometric) multiplicity of an eigenvalue $\lambda$ is the dimension of the corresponding eigenspace $E_\lambda$. If $X$ is non singular, then we call $A$ and $B$ similar if $A = X^{-1}BX$. This is basically a change of basis, so $A$ and $B$ are similar if they are really the same operator written in different bases. If $A$ and $B$ are similar, they share eigenvalues, so $\lambda_i^A = \lambda_i^B$. If the eigenvectors of a matrix $A$ are orthogonal (for example, if $A$ is hermitian), then $A = Q\Lambda Q^*$. If this is the case, we say $A$ is "unitarily diagonalizable". A matrix $A$ is unitarily diagonalizable if and only if $A$ is normal ($A^*A = AA^*$). The eigenvalue decomposition is not always possible. One example of a defective matrix is
\[
\begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}
\]


which has only one distinct eigenvalue ($\lambda_{1,2} = 0$). There is, however, only one linearly independent eigenvector corresponding to the eigenvalue, so the dimension of the eigenspace is less than the multiplicity of the eigenvalue. Not all matrices have an eigenvalue decomposition, but all square matrices have a Schur factorization. That is, all square matrices can be factored
\[
A = QTQ^*
\]
where $T$ is upper triangular and $Q$ is unitary. This means that $A$ and $T$ are similar, and so share eigenvalues. Because the eigenvalues of a triangular matrix are simply the diagonal entries of the matrix, this may be a way to find the eigenvalues of $A$.

Proof. (That every square matrix has a Schur factorization.) Assume $m \geq 2$, and let $Ax = \lambda x$; for every $A \in \mathbb{C}^{m\times m}$, at least one eigenvector exists. Scale $x$ so that $\|x\|_2 = 1$. Let $U$ be unitary with its first column $x$. Then
\[
U^*AU = \begin{pmatrix} \lambda & B \\ 0 & C \end{pmatrix}
\]
so if $C$ has a Schur factorization $C = VTV^*$, then let
\[
Q = U\begin{pmatrix} 1 & 0 \\ 0 & V \end{pmatrix}
\]
so
\[
Q^*AQ = \begin{pmatrix} \lambda & BV \\ 0 & T \end{pmatrix}
\]
Clearly, if $m = 2$, $A$ has a Schur factorization (because in that case $V$ is a $1\times 1$ matrix), so by induction, $A$ has a Schur factorization for all $m \geq 2$ (and I don't think we need to bother with $m = 1$).

8.9.1 Finding Eigenvalues

Numerically finding eigenvalues is an important thing to be able to do. Analytically, we could find eigenvalues by factoring the characteristic polynomial to find the roots. However, that is not a very easy thing to do, and root finding algorithms are not especially fast or accurate in general. In fact, finding the roots of $P_A(z)$ is an ill conditioned problem. The next classical idea is power iteration. Power iteration takes advantage of the fact that repeated application of $A$ reveals the eigenvector corresponding to the largest eigenvalue of $A$ (more on how that works later). Finally, we could make use of a factorization of $A$ which reveals the eigenvalues. No matter how we go about doing it, finding the eigenvalues of a matrix $A \in \mathbb{C}^{m\times m}$ MUST be iterative for $m \geq 5$. This is because the problem is equivalent to factoring a polynomial, and there is no finite sequence of operations that can be applied to factor a general polynomial of degree $\geq 5$. Fortunately, algorithms exist with fast convergence to a given tolerance.


We would like to apply unitary matrices to $A$ until we have a triangular matrix $T$ which is similar to $A$. The way we will do this is first to bring our matrix as close as we can with direct methods to a triangular matrix, and then use an iterative method. We start by using a direct method to convert $A$ to a similar upper Hessenberg matrix. An upper Hessenberg matrix is one with the form
\[
H = \begin{pmatrix}
\times & \times & \times & \times & \times & \cdots \\
\times & \times & \times & \times & \times & \cdots \\
0 & \times & \times & \times & \times & \cdots \\
0 & 0 & \times & \times & \times & \cdots \\
0 & 0 & 0 & \times & \times & \cdots \\
\vdots & \vdots & \vdots & \ddots & \ddots &
\end{pmatrix}
\]

This will make any iteration to upper triangular form much faster because a lot of the entries are already 0. We reduce to upper Hessenberg form using a variation of Householder reflectors. If we were to just use Householder reflectors $Q$, we would see
\[
\begin{pmatrix}
\times & \times & \times & \times \\ \times & \times & \times & \times \\ \times & \times & \times & \times \\ \times & \times & \times & \times
\end{pmatrix}
\xrightarrow{QA}
\begin{pmatrix}
\times & \times & \times & \times \\ 0 & \times & \times & \times \\ 0 & \times & \times & \times \\ 0 & \times & \times & \times
\end{pmatrix}
\xrightarrow{QAQ^*}
\begin{pmatrix}
\times & \times & \times & \times \\ \times & \times & \times & \times \\ \times & \times & \times & \times \\ \times & \times & \times & \times
\end{pmatrix}
\]
The application of $Q^*$ undoes the work we did by applying $Q$. However, we can apply a matrix that acts as a Householder reflector on the submatrix that does not include the first row, and leaves the first row alone. This would then mean that $Q^*$ would leave the first column (where we just put some nice 0s) alone. We use
\[
Q = \begin{pmatrix} 1 & \vec{0}^T \\ \vec{0} & \tilde{Q} \end{pmatrix}
\]
where $\tilde{Q}$ is the Householder reflector for the submatrix below the first row. Then we get
\[
\begin{pmatrix}
\times & \times & \times & \times \\ \times & \times & \times & \times \\ \times & \times & \times & \times \\ \times & \times & \times & \times
\end{pmatrix}
\xrightarrow{QA}
\begin{pmatrix}
\times & \times & \times & \times \\ \times & \times & \times & \times \\ 0 & \times & \times & \times \\ 0 & \times & \times & \times
\end{pmatrix}
\xrightarrow{QAQ^*}
\begin{pmatrix}
\times & \times & \times & \times \\ \times & \times & \times & \times \\ 0 & \times & \times & \times \\ 0 & \times & \times & \times
\end{pmatrix}
\]

This algorithm is backwards stable.

Assuming we have reduced our matrix to upper Hessenberg form, we now need an iterative method to get a diagonal matrix. For now, let's assume we are in the real and symmetric case, so $A \in \mathbb{R}^{m\times m}$ and $A = A^T$. This is convenient for developing methods because it means all the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_m$ are real, and the eigenvectors $q_1, q_2, \ldots, q_m$ are orthogonal for distinct eigenvalues. We will need to know the Rayleigh quotient:
\[
r(x) = \frac{x^TAx}{x^Tx}
\]


which is a scalar such that if $Ax = \lambda x$, then $r(x) = \lambda$. This is also the least squares approximation to the eigenvalue corresponding to $x$.

Proof. Let
\begin{align*}
F &= \min_\alpha\|Ax - \alpha x\|_2^2 \\
&= \min_\alpha(Ax - \alpha x)^T(Ax - \alpha x) \\
&= \min_\alpha\|Ax\|_2^2 + \alpha^2\|x\|_2^2 - \alpha x^TAx - \alpha(Ax)^Tx \\
&= \min_\alpha\|Ax\|_2^2 + \alpha^2\|x\|_2^2 - 2\alpha x^TAx
\end{align*}
so
\[
\frac{\partial F}{\partial\alpha} = 0 = 2\alpha\|x\|_2^2 - 2x^TAx \Rightarrow \alpha = r(x)
\]

Near each eigenvector $x$, we notice that
\[
\frac{\partial r}{\partial x_j} = \frac{2}{x^Tx}(Ax - r(x)x)_j
\]
so
\[
\nabla r(x) = \frac{2}{x^Tx}(Ax - r(x)x) = 0
\]
if $x$ is an eigenvector of $A$. This is great, because it means that if $v$ is an eigenvector of $A$, and $x$ is not, then $r(x) - r(v) = O(\|x - v\|_2^2)$ as $x \to v$. We have a number of options for our iteration (although some are decidedly better than others).

Power Iteration

Power iteration takes advantage of the fact that repeated application of $A$ to a vector will reveal the largest eigenvalue. The process is:

Choose $v^{(0)}$ such that $\|v^{(0)}\| = 1$;
while within some tolerance do
    $w = Av^{(k-1)}$;
    $v^{(k)} = w/\|w\|$;
    $\lambda^{(k)} = (v^{(k)})^TAv^{(k)}$;
end

This is not a very good algorithm for a number of reasons (the first being that it only finds the largest eigenvalue!). It is worth discussing how it works and why it is bad, however, because other, better methods are similar. Expand $v^{(0)}$ in the basis of eigenvectors of $A$:
\[
v^{(0)} = a_1q_1 + a_2q_2 + \cdots + a_mq_m
\]
then, assuming the eigenvalues are ordered so that $|\lambda_1| \geq |\lambda_2| \geq \cdots \geq |\lambda_m|$,
\[
Av^{(0)} = a_1\lambda_1q_1 + a_2\lambda_2q_2 + \cdots + a_m\lambda_mq_m
\]
\[
\Rightarrow v^{(k)} = c_k\lambda_1^k\left(a_1q_1 + a_2\left(\frac{\lambda_2}{\lambda_1}\right)^kq_2 + \cdots + a_m\left(\frac{\lambda_m}{\lambda_1}\right)^kq_m\right)
\]
with $c_k$ a normalization constant, so
\[
\|v^{(k)} - (\pm q_1)\|_2 = O\left(\left|\frac{\lambda_2}{\lambda_1}\right|^k\right); \qquad |\lambda^{(k)} - \lambda_1| = O\left(\left|\frac{\lambda_2}{\lambda_1}\right|^{2k}\right)
\]
The convergence is thus linear, with rate $\left|\frac{\lambda_2}{\lambda_1}\right|$, so if this ratio is near 1, this method is awfully slow.
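As a concrete reference, here is a minimal numpy version of power iteration; the fixed iteration count is just a stand-in for the tolerance test in the pseudocode above.

import numpy as np

def power_iteration(A, iters=200):
    """Power iteration: estimate the largest-magnitude eigenvalue and its eigenvector."""
    m = A.shape[0]
    v = np.random.default_rng(0).standard_normal(m)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = A @ v
        v = w / np.linalg.norm(w)
    lam = v @ A @ v          # Rayleigh quotient estimate of the eigenvalue
    return lam, v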

Inverse Iteration

Power iteration is bad, but it leads to better ideas. The first is inverse iteration. Consider $\mu \in \mathbb{R}$ which is not an eigenvalue of $A$. Then if $v_j$ is an eigenvector of $A$, $(A - \mu I)v_j = (\lambda_j - \mu)v_j$, and
\[
(A - \mu I)^{-1}v_j = \frac{1}{\lambda_j - \mu}v_j
\]
so $A$ and $(A - \mu I)^{-1}$ have the same eigenvectors. Furthermore, the eigenvalues of $(A - \mu I)^{-1}$ are
\[
\frac{1}{\lambda_j - \mu}
\]
So, if $\mu \approx \lambda_j$, then $\frac{1}{|\lambda_j - \mu|} \gg 1$. Then we can proceed with power iteration on $(A - \mu I)^{-1}$ and it will converge rapidly. The process is:

Choose $v^{(0)}$ such that $\|v^{(0)}\| = 1$;
while within some tolerance do
    Solve $(A - \mu I)w = v^{(k-1)}$;
    $v^{(k)} = w/\|w\|$;
    $\lambda^{(k)} = (v^{(k)})^TAv^{(k)}$;
end

If we choose $\mu \approx \lambda_j$, we can pick out $\lambda_j$ and $q_j$. This still has linear convergence, but a good choice of $\mu$ means fast convergence, because each step reduces the error by roughly
\[
\left|\frac{\mu - \lambda_j}{\mu - \lambda_i}\right|
\]
where $\lambda_j$ is the eigenvalue closest to $\mu$ and $\lambda_i$ is the second closest, and we can choose $\mu$ so that this ratio is small.

Rayleigh Quotient Iteration

The very obvious way to improve on inverse iteration is to choose the best possible $\mu$ at each step of the iteration. As has been shown, the least squares choice is the Rayleigh quotient, which we are of course already computing anyway. Thus, at each step of the iteration, choose $\mu = \lambda^{(k-1)}$. Then
\[
\|v^{(k+1)} - q_j\|_2 = O(\|v^{(k)} - q_j\|_2^3)
\]
\[
|\lambda^{(k+1)} - \lambda_j| = O(|\lambda^{(k)} - \lambda_j|^3)
\]
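A sketch of Rayleigh quotient iteration in numpy, assuming a symmetric A and some starting vector v0. Near convergence the shifted matrix becomes nearly singular, so a practical code would stop on a residual tolerance rather than running a fixed count as this sketch does.

import numpy as np

def rayleigh_quotient_iteration(A, v0, iters=10):
    """Rayleigh quotient iteration: the shift is the current Rayleigh quotient."""
    v = v0 / np.linalg.norm(v0)
    lam = v @ A @ v
    m = A.shape[0]
    for _ in range(iters):
        w = np.linalg.solve(A - lam * np.eye(m), v)   # one step of inverse iteration
        v = w / np.linalg.norm(w)
        lam = v @ A @ v                               # new shift = Rayleigh quotient
    return lam, v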


The QR Algorithm & Simultaneous Iteration

The QR algorithm is an amazingly simple algorithm that provides a Schur factorization for a symmetric, real matrix. The algorithm is:

$A^{(0)} = A$;
while within some tolerance do
    $Q^{(k)}R^{(k)} = A^{(k-1)}$;
    $A^{(k)} = R^{(k)}Q^{(k)}$;
end
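The pure QR algorithm fits in a few lines of numpy. This sketch uses a fixed iteration count and none of the refinements discussed below (Hessenberg reduction, shifts, deflation), so it is slow, but it shows the factor-then-reverse-multiply loop.

import numpy as np

def qr_algorithm(A, iters=500):
    """The "pure" QR algorithm: factor, multiply in reverse order, repeat.
    For symmetric A the iterates (generically) approach a diagonal matrix."""
    Ak = A.copy()
    for _ in range(iters):
        Q, R = np.linalg.qr(Ak)
        Ak = R @ Q
    return Ak   # eigenvalue estimates sit on the diagonal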

Surprisingly, if $A = QTQ^*$ is a Schur factorization of $A$, then in this algorithm, $A^{(k)} \to T$! To understand why this is true, it is easiest to consider a seemingly (there's some foreshadowing) different algorithm, called simultaneous iteration. This algorithm applies power iteration to $n$ vectors simultaneously. We take $v_1^{(0)}, v_2^{(0)}, \ldots, v_n^{(0)}$ linearly independent vectors. Repeated application of $A$ to these will draw out the $q_i$, the eigenvectors corresponding to the largest eigenvalues $\lambda_i$:
\[
\langle A^kv_1^{(0)}, A^kv_2^{(0)}, \ldots, A^kv_n^{(0)}\rangle \to \langle q_1, q_2, \ldots, q_n\rangle
\]
Let $V^{(0)}$ be the $m\times n$ matrix
\[
V^{(0)} = \begin{pmatrix} v_1^{(0)} & v_2^{(0)} & \cdots & v_n^{(0)} \end{pmatrix}
\]
Then
\[
V^{(k)} = A^kV^{(0)} = \begin{pmatrix} v_1^{(k)} & v_2^{(k)} & \cdots & v_n^{(k)} \end{pmatrix}
\]
Then we extract an orthogonal basis using a reduced QR factorization, $Q^{(k)}R^{(k)} = V^{(k)}$. Generally, the column space of $Q^{(k)}$ is $\operatorname{col}(Q^{(k)}) = \langle\pm q_1, \pm q_2, \ldots, \pm q_n\rangle$, because
\[
v_j^{(0)} = a_{1,j}q_1 + a_{2,j}q_2 + \cdots + a_{m,j}q_m
\]
\[
v_j^{(k)} = a_{1,j}\lambda_1^kq_1 + \cdots + a_{m,j}\lambda_m^kq_m
\]
We now need to assume that the eigenvalues are ordered so that $|\lambda_1| > |\lambda_2| > \cdots > |\lambda_n| \geq |\lambda_{n+1}| \geq \cdots \geq |\lambda_m|$, and that all the leading principal submatrices of $\hat{Q}^TV^{(0)}$ (where $\hat{Q} = [q_1, \ldots, q_n]$) are non singular. Then using the factorization $Q^{(k)}R^{(k)} = V^{(k)}$, we have
\[
\|q_j^{(k)} - (\pm q_j)\| = O(c^k)
\]
where
\[
c = \max_{1\leq k\leq n}\left|\frac{\lambda_{k+1}}{\lambda_k}\right|
\]
So applying $A$ and then extracting the orthogonal basis from the column space of the result will lead to the eigenvectors.


Proof. Take $Q_{m\times m} = [q_1, q_2, \ldots, q_m]$ and write $A = Q\Lambda Q^T$. Let $\hat{Q} = [q_1, \ldots, q_n]$ and $\hat{\Lambda}$ be the leading $n\times n$ block of $\Lambda$. Then
\[
V^{(k)} = A^kV^{(0)} = Q\Lambda^kQ^TV^{(0)} = \hat{Q}\hat{\Lambda}^k\hat{Q}^TV^{(0)} + O(|\lambda_{n+1}|^k)
\]
Using our assumptions about the submatrices of $\hat{Q}^TV^{(0)}$, we have
\[
V^{(k)} = \left(\hat{Q}\hat{\Lambda}^k + O(|\lambda_{n+1}|^k)\right)\hat{Q}^TV^{(0)}
\]
and so
\[
\operatorname{col}(V^{(k)}) = \operatorname{col}\left(\hat{Q}\hat{\Lambda}^k + O(|\lambda_{n+1}|^k)\right) \to \operatorname{col}(\hat{Q})
\]
as $k \to \infty$.

So, the process of simultaneous iteration is to extract an orthonormal basis after each step:

Pick $\underline{Q}^{(0)} \in \mathbb{R}^{m\times n}$ with orthonormal columns;
while within some tolerance do
    $Z = A\underline{Q}^{(k-1)}$;
    $\underline{Q}^{(k)}R^{(k)} = Z$;
end
$A^{(k)} = \left(\underline{Q}^{(k)}\right)^TA\underline{Q}^{(k)}$;

At each step, $\operatorname{col}(Z) = \operatorname{col}(\underline{Q}^{(k)}) = \operatorname{col}(A^k\underline{Q}^{(0)})$. The QR algorithm is actually equivalent to simultaneous iteration applied to a full set of vectors with $\underline{Q}^{(0)} = I_{m\times m}$, the identity matrix. To keep everything nice and organized, it's worth repeating both algorithms:

(I) Simultaneous Iteration:

$\underline{Q}^{(0)} = I_{m\times m}$;
while within some tolerance do
    $Z = A\underline{Q}^{(k-1)}$;
    $\underline{Q}^{(k)}R^{(k)} = Z$;
end

(II) QR Algorithm ("pure"):

$A^{(0)} = A$;
while within some tolerance do
    $Q^{(k)}R^{(k)} = A^{(k-1)}$;
    $A^{(k)} = R^{(k)}Q^{(k)}$;
end

and we define

• $A^{(k)} = \left(\underline{Q}^{(k)}\right)^TA\underline{Q}^{(k)}$

• $\underline{Q}^{(k)} = I\,Q^{(1)}Q^{(2)}\cdots Q^{(k)}$

• $\underline{R}^{(k)} = R^{(k)}R^{(k-1)}\cdots R^{(1)}I$

Then (I) and (II) generate identical $R^{(k)}$, $\underline{Q}^{(k)}$, and $A^{(k)}$. Namely,

• $A^k = \underline{Q}^{(k)}\underline{R}^{(k)}$

• $A^{(k)} = \left(\underline{Q}^{(k)}\right)^TA\underline{Q}^{(k)}$

Proof. We prove by induction on $k$. The case $k = 0$ is trivial: here $\underline{Q}^{(0)} = I$, $A^{(0)} = A$, and $\underline{R}^{(0)} = I$. Then

(I) $A^0 = I = \underline{Q}^{(0)}\underline{R}^{(0)}$

(II) $A^{(0)} = A$, $\underline{Q}^{(0)} = I$ and $\underline{R}^{(0)} = I$

For $k \geq 1$, we assume the hypothesis for $k - 1$ is true. Then

(I) $A^k = AA^{k-1}$, so
\begin{align*}
A^k &= A\underline{Q}^{(k-1)}\underline{R}^{(k-1)} \\
&= \underline{Q}^{(k)}R^{(k)}\underline{R}^{(k-1)} \\
&= \underline{Q}^{(k)}\underline{R}^{(k)}
\end{align*}

(II) $A^k = AA^{k-1} = A\underline{Q}^{(k-1)}\underline{R}^{(k-1)}$, and note that
\[
\underline{Q}^{(k-1)}A^{(k-1)} = A\underline{Q}^{(k-1)}
\]
by the induction hypothesis for $k - 1$. Then
\begin{align*}
A^k &= \underline{Q}^{(k-1)}A^{(k-1)}\underline{R}^{(k-1)} \\
&= \underline{Q}^{(k-1)}Q^{(k)}R^{(k)}\underline{R}^{(k-1)} \\
&= \underline{Q}^{(k)}\underline{R}^{(k)}
\end{align*}
and
\[
A^{(k)} = \left(Q^{(k)}\right)^TA^{(k-1)}Q^{(k)} = \left(Q^{(k)}\right)^T\left(\underline{Q}^{(k-1)}\right)^TA\underline{Q}^{(k-1)}Q^{(k)} = \left(\underline{Q}^{(k)}\right)^TA\underline{Q}^{(k)}
\]

The Practical QR Algorithm

The QR algorithm for finding eigenvectors and eigenvalues is based on power iteration, which we already established is not a great way to do things. So, unsurprisingly, we want to develop something like a "simultaneous Rayleigh quotient iteration". The process for a matrix $A$ is:

1. Use modified Householder reflectors to take $A \to H$, an upper Hessenberg matrix
2. Choose $\mu^{(k)}$, a shift, for example $\mu^{(k)} = A_{m,m}^{(k-1)} = r(q_m^{(k-1)})$
3. Factor $Q^{(k)}R^{(k)} = A^{(k-1)} - \mu^{(k)}I$ and set $A^{(k)} = R^{(k)}Q^{(k)} + \mu^{(k)}I$
4. If an off diagonal component $A_{j,j+1} \approx 0$, set $A_{j,j+1} = A_{j+1,j} = 0$ and break the problem into two submatrices

This process has cubic convergence and is backwards stable.


8.9.2 Two Great Eigenvalue Theorems

There are a couple of really good eigenvalue theorems that are at times useful. The first is

Theorem 26 (Bauer-Fike Theorem). Let $A \in \mathbb{C}^{m\times m}$ be diagonalizable, so $A = V\Lambda V^{-1}$, and let $\delta A$ be arbitrary. Then every eigenvalue of $A + \delta A$ lies in one of the $m$ circular disks in the complex plane centred at the eigenvalues of $A$ with radius $\kappa(V)\|\delta A\|_2$ (recall that $\kappa(V) = \|V\|_2\|V^{-1}\|_2$).

Proof. Let $Ax = \lambda x$, and let $(A + \delta A)\xi = \mu\xi$. If $\mu \notin \operatorname{eig}(A)$, then $\det(\Lambda - \mu I) \neq 0$ but $\det((A + \delta A) - \mu I) = 0$. Thus we have
\begin{align*}
\det(V^{-1})\det((A + \delta A) - \mu I)\det(V) &= 0 \\
\det(\Lambda + V^{-1}\delta AV - \mu I) &= 0 \\
\det(\Lambda - \mu I)\det\left((\Lambda - \mu I)^{-1}(\Lambda + V^{-1}\delta AV - \mu I)\right) &= 0 \\
\det\left((\Lambda - \mu I)^{-1}V^{-1}\delta AV + I\right) &= 0
\end{align*}
This means that $-1 \in \operatorname{eig}\left((\Lambda - \mu I)^{-1}V^{-1}\delta AV\right)$. Generally, we have $|\lambda| \leq \|A\|$, so we have
\[
1 \leq \|(\Lambda - \mu I)^{-1}V^{-1}\delta AV\| \leq \|(\Lambda - \mu I)^{-1}\|\|\delta A\|\kappa(V)
\]
$(\Lambda - \mu I)$ is diagonal, so
\[
\|(\Lambda - \mu I)^{-1}\| = \max_{\lambda\in\operatorname{eig}(A)}\frac{1}{|\lambda - \mu|}
\]
so
\[
1 \leq \max_{\lambda\in\operatorname{eig}(A)}\frac{1}{|\lambda - \mu|}\|\delta A\|_2\kappa(V)
\]
so
\[
\min_{\lambda\in\operatorname{eig}(A)}|\lambda - \mu| \leq \kappa(V)\|\delta A\|_2
\]
Furthermore, if $A$ is normal, so $A^*A = AA^*$, then $A = V\Lambda V^*$ with $V$ unitary, so $\kappa(V) = \|V\|\|V^*\| = 1$, and so $|\lambda - \mu| \leq \|\delta A\|_2$. Next we have

Theorem 27 (The Gershgorin Circle Theorem). Let $A \in \mathbb{C}^{m\times m}$, and let $D_i$ be the closed disk in the complex plane centred at $A_{i,i}$ with radius
\[
R_i = \sum_{j\neq i}|A_{i,j}|
\]
Then every eigenvalue of $A$ lies within at least one of the disks $D_i$. Furthermore, if there is a union of $k$ of the disks $D_i$ which is disjoint from the other $m - k$ disks, this union contains exactly $k$ eigenvalues of $A$.

I will omit the proof (it's not very hard; I've had it as a homework problem both as an undergrad and as a grad student). The Gershgorin theorem gives a very easy way to estimate eigenvalues, and gives an upper bound on the magnitude of the eigenvalues.
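Since the Gershgorin disks are so cheap to compute, here is a small numpy sketch that returns them and compares against the actual eigenvalues of a toy matrix.

import numpy as np

def gershgorin_disks(A):
    """Return a list of (center, radius) pairs, one Gershgorin disk per row of A."""
    A = np.asarray(A)
    centers = np.diag(A)
    radii = np.sum(np.abs(A), axis=1) - np.abs(centers)
    return list(zip(centers, radii))

# every eigenvalue of A lies in at least one of these disks
A = np.array([[4.0, 1.0, 0.0],
              [0.5, 3.0, 0.2],
              [0.0, 0.1, -1.0]])
print(gershgorin_disks(A))
print(np.linalg.eigvals(A))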


8.10 Numerically Finding the SVD

In an earlier section, we discussed the singular value decomposition and its usefulness. We did not, however, discuss how it could be found. To do so, we note that if $A \in \mathbb{C}^{m\times m}$, we can write $A = U\Sigma V^*$, the SVD, and so $A^*A = V\Sigma^*\Sigma V^*$, and this last looks like an eigenvalue decomposition. One process we could use to find the SVD would then be

1. Form $A^*A$
2. Compute an eigenvalue decomposition of the hermitian matrix $A^*A = V\Lambda V^*$
3. Take $\Sigma$ to be the non negative square root of $\Lambda$
4. Solve $U\Sigma = AV$

Each of these steps can be carried out in a backwards stable way, but the overall procedure is unstable. To see why, note that by the Bauer-Fike theorem,
\[
|\lambda_k(A^*A + \delta B) - \lambda_k(A^*A)| \leq \|\delta B\|_2
\]
because $A^*A$ is normal. There is a similar result for $\sigma_k$, by the Wielandt-Hoffman theorem:
\[
|\sigma_k(A + \delta A) - \sigma_k(A)| \leq \|\delta A\|_2
\]
Then $\tilde{\sigma}_k = \sigma_k(A + \delta A)$ where $\frac{\|\delta A\|}{\|A\|} = O(\varepsilon_{\text{machine}})$ gives $|\tilde{\sigma}_k - \sigma_k| = O(\varepsilon_{\text{machine}}\|A\|)$, which is what we want. If instead we carry out the above procedure and seek $\lambda_k(A^*A)$ in a backwards stable way, then we have
\[
|\tilde{\lambda}_k - \lambda_k| = O(\varepsilon_{\text{machine}}\|A^*A\|) = O(\varepsilon_{\text{machine}}\|A\|^2)
\]
and with $\tilde{\sigma}_k = \sqrt{\tilde{\lambda}_k}$,
\[
|\tilde{\sigma}_k - \sigma_k| = |\sqrt{\tilde{\lambda}_k} - \sqrt{\lambda_k}| = \left|\frac{\tilde{\lambda}_k - \lambda_k}{\sqrt{\tilde{\lambda}_k} + \sqrt{\lambda_k}}\right| \approx \frac{|\tilde{\lambda}_k - \lambda_k|}{\sqrt{\lambda_k}} = O\left(\varepsilon_{\text{machine}}\frac{\|A\|^2}{\sigma_k}\right)
\]
so if there is a small singular value, the error will be large. A different approach is called for. Consider
\[
H = \begin{pmatrix} 0 & A^* \\ A & 0 \end{pmatrix} \in \mathbb{C}^{2m\times 2m}
\]

Since $A = U\Sigma V^*$, $AV = U\Sigma$, and $A^* = V\Sigma^*U^* = V\Sigma U^*$, so $A^*U = V\Sigma$. Then we notice
\[
\begin{pmatrix} 0 & A^* \\ A & 0 \end{pmatrix}
\begin{pmatrix} V & V \\ U & -U \end{pmatrix}
= \begin{pmatrix} V & V \\ U & -U \end{pmatrix}
\begin{pmatrix} \Sigma & 0 \\ 0 & -\Sigma \end{pmatrix}
\]
And we have an eigenvalue decomposition $H = X\Lambda X^{-1}$, where
\[
\sigma_k = |\lambda_k(H)|
\]
and this approach is stable.
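A quick numerical check of the relation σ_k = |λ_k(H)| for a random real square matrix; this only demonstrates the identity above, it is not how library SVD routines actually work.

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))

# eigenvalues of H = [[0, A^T], [A, 0]] come in +/- sigma pairs
H = np.block([[np.zeros((5, 5)), A.T],
              [A, np.zeros((5, 5))]])
eigs = np.sort(np.abs(np.linalg.eigvalsh(H)))[::-1]

# every other entry of the sorted |eigenvalues| gives the singular values of A
print(eigs[::2])
print(np.linalg.svd(A, compute_uv=False))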


8.11 Iterative Methods

Finding eigenvalues requires iterative techniques: the only way to find the solution is by performing a repeated operation which converges to the solution as the number of iterations goes to infinity. There are also iterative techniques for solving systems of equations, which may be a better choice than a direct method. A direct method (such as Gaussian elimination) will find a solution in some number of steps $m$. However, stopping at step $m - 1$ will not give anything close to a solution. In contrast, iterative methods may never reach an exact answer in finitely many steps, but stopping after a smaller number of steps $n < m$ will give a result that is close to the solution of the problem. For this section, we assume $A \in \mathbb{R}^{m\times m}$ and we want to solve $Ax = b$ unless otherwise stated.

8.11.1 Classical Iterative Methods

There are a few classical iterative methods based on breaking apart $A$. They are the Jacobi method and the Gauss-Seidel method. These are both fixed point methods. A fixed point method is one that puts the problem into a form which can be written $x = f(x)$ and has the property that if $x^*$ is the solution to the problem, then $f(x^*) = x^*$. These methods work by iterating $x^{(k+1)} = f(x^{(k)})$. Clearly, we must check these to ensure we have convergence. To borrow a term from differential equations, we must make sure the fixed point is stable.

Jacobi Method: We can break up $A$ into lower and upper triangular matrices and a diagonal matrix, or a diagonal matrix and an off diagonal matrix, through addition: $A = L + D + U = D + R$. Here $D$ is diagonal and $R = U + L$. Then we see
\[
Ax = b \rightarrow (D + R)x = b
\]
Then $x = D^{-1}(b - Rx)$, and so the method is
\[
x^{(k+1)} = D^{-1}(b - Rx^{(k)})
\]
or, element wise,
\[
x_i^{(k+1)} = \frac{1}{A_{i,i}}\left(b_i - \sum_{j\neq i}A_{i,j}x_j^{(k)}\right)
\]

Gauss-Seidel Method: Similar to the Jacobi method, we break $A = U + D + L$. The method is almost the same as the Jacobi method, in fact, but here we use the $x_i^{(k+1)}$ as soon as we have them. That is,
\[
x^{(k+1)} = (L + D)^{-1}(b - Ux^{(k)})
\]
We don't actually want to invert $L + D$, so we generally implement this element wise:
\[
x_i^{(k+1)} = \frac{1}{A_{i,i}}\left(b_i - \sum_{j<i}A_{i,j}x_j^{(k+1)} - \sum_{j>i}A_{i,j}x_j^{(k)}\right)
\]
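Here are minimal numpy versions of both methods, with fixed iteration counts standing in for a convergence test; they are only expected to converge under the conditions discussed below (e.g. diagonal dominance).

import numpy as np

def jacobi(A, b, iters=100):
    """Jacobi iteration: x_{k+1} = D^{-1}(b - R x_k), written with whole vectors."""
    D = np.diag(A)
    R = A - np.diag(D)
    x = np.zeros_like(b, dtype=float)
    for _ in range(iters):
        x = (b - R @ x) / D
    return x

def gauss_seidel(A, b, iters=100):
    """Gauss-Seidel: like Jacobi, but each updated component is used immediately."""
    m = len(b)
    x = np.zeros_like(b, dtype=float)
    for _ in range(iters):
        for i in range(m):
            x[i] = (b[i] - A[i, :i] @ x[:i] - A[i, i+1:] @ x[i+1:]) / A[i, i]
    return x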


Generally, the idea of these methods is to write $Mx^{(k+1)} = b + Nx^{(k)}$, where $A = M - N$ and $M$ is easy to invert. We still have to show that these do actually converge. We have
\[
x^{(k+1)} = M^{-1}Nx^{(k)} + g
\]
such that the exact solution satisfies
\[
x^* = M^{-1}Nx^* + g
\]
so with the error $e^{(k+1)} = x^{(k+1)} - x^*$ we notice $e^{(k+1)} = M^{-1}Ne^{(k)}$, and so $e^{(k)} = (M^{-1}N)^ke^{(0)}$. We might suppose then that for convergence we need $\|M^{-1}N\| < 1$. However, that is more strict than necessary. In fact, we have

Theorem 28. $(M^{-1}N)^k \to 0$ as $k \to \infty$ if and only if $\rho(M^{-1}N) < 1$.

(Remember that $\rho(A) = \max_{\lambda_i\in\operatorname{eig}(A)}|\lambda_i|$ is the spectral radius of $A$.)

Proof. We will need to use the Jordan form of a matrix. That is, $B = XJX^{-1}$ where
\[
J = \begin{pmatrix} J_1 & & & \\ & J_2 & & \\ & & \ddots & \\ & & & J_r \end{pmatrix}
\]
and each $J_i$ is an $l\times l$ block of the form
\[
J_i = \begin{pmatrix} \lambda_i & 1 & & \\ & \lambda_i & 1 & \\ & & \ddots & \ddots \\ & & & \lambda_i \end{pmatrix}
= \lambda_iI_{l\times l} + \begin{pmatrix} 0 & 1 & & \\ & 0 & 1 & \\ & & \ddots & \ddots \\ & & & 0 \end{pmatrix}
\]
So $B^k = XJ^kX^{-1}$, and
\[
J_i^2 = \lambda_i^2I + 2\lambda_i\begin{pmatrix} 0 & 1 & & \\ & 0 & 1 & \\ & & \ddots & \ddots \\ & & & 0 \end{pmatrix}
+ \begin{pmatrix} 0 & 0 & 1 & \\ & 0 & 0 & \ddots \\ & & \ddots & \ddots \\ & & & 0 \end{pmatrix}
\]
and, continuing the powers, the nilpotent part disappears after $l$ steps as the 1s shift up and to the right. Thus, $J_i^k \to 0$ if $|\lambda_i| < 1$, so $B^k \to 0$ if $\rho(B) < 1$.

Using this theorem, we see that there are a number of sufficient conditions on $A$ for convergence of Jacobi or Gauss-Seidel iteration. For instance, if $A$ is strictly diagonally dominant, then both converge. Indeed, for Jacobi iteration,
\[
\rho(D^{-1}R) \leq \|D^{-1}R\|_\infty = \max_{1\leq i\leq m}\sum_{j\neq i}\frac{|A_{i,j}|}{|A_{i,i}|}
\]
which is less than 1 when $A$ is strictly diagonally dominant by rows.


For Gauss-Seidel iteration, it is enough that $A$ is symmetric and positive definite.

Another classical iterative method is Successive Over Relaxation (SOR). This is an accelerated version of Gauss-Seidel iteration. Again, write $A = L + D + U$, and then
\[
x^{(k+1)} = \left(L + \frac{1}{\omega}D\right)^{-1}\left(b - \left(U + \frac{\omega - 1}{\omega}D\right)x^{(k)}\right)
\]
so $M_\omega x^{(k+1)} = N_\omega x^{(k)} + \omega b$, where $M_\omega = D + \omega L$ and $N_\omega = (1 - \omega)D - \omega U$. The idea is to tune $\omega$ to minimize $\rho(M_\omega^{-1}N_\omega)$.

8.11.2 Arnoldi Iteration

There are some more modern ways to solve $Ax = b$ using iterative techniques. The two big ideas that come into play are

• Projection onto Krylov subspaces

• Computing $Av$ without forming $A$

A Krylov subspace $\mathcal{K}_n$ is
\begin{align*}
\mathcal{K}_1 &= \langle b\rangle \\
\mathcal{K}_2 &= \langle b, Ab\rangle \\
&\ \ \vdots \\
\mathcal{K}_n &= \langle b, Ab, A^2b, \ldots, A^{n-1}b\rangle
\end{align*}
We can construct an orthonormal basis for successive Krylov subspaces through a process called Arnoldi iteration. We previously used Householder reflectors on submatrices to reduce $A$ to an upper Hessenberg matrix, $A = QHQ^*$, $AQ = QH$:
\[
A\begin{pmatrix} q_1 & q_2 & \cdots & q_m \end{pmatrix} = \begin{pmatrix} q_1 & q_2 & \cdots & q_m \end{pmatrix}H
\]
The full reduction to upper Hessenberg form is expensive. However, if we consider
\[
Q_n = \begin{pmatrix} q_1 & q_2 & \cdots & q_n \end{pmatrix}
\]
a matrix made up of the first $n$ columns of $Q$, we have $AQ_n = Q_{n+1}\tilde{H}_n$, where $\tilde{H}_n$ is the first $n$ columns of $H$ truncated so that it is $(n+1)\times n$. This is clear by considering that $AQ_n$ will yield the first $n$ columns of $QH$. We see then that
\[
Aq_n = h_{1,n}q_1 + h_{2,n}q_2 + \cdots + h_{n,n}q_n + h_{n+1,n}q_{n+1}
\]


so
\[
q_{n+1} = \frac{1}{h_{n+1,n}}\left(Aq_n - \sum_{i=1}^{n}h_{i,n}q_i\right)
\]
We can construct the set $\{q_n\}$ by iteration, starting with $q_1$, $\|q_1\| = 1$, and then
\[
q_2 = \frac{Aq_1 - h_{1,1}q_1}{h_{2,1}}
\]
where $q_2\cdot q_1 = 0$ demands that $h_{1,1} = q_1^TAq_1$, and $\|q_2\| = 1$ gives $h_{2,1}$. Then, if we let $q_1 = \frac{b}{\|b\|}$, we have
\[
\langle b, Ab, A^2b, \ldots, A^{n-1}b\rangle = \langle q_1, q_2, \ldots, q_n\rangle
\]
This is clear when we think about the process we are going through: each time we find a new $q_k$, we multiply $q_{k-1}$ by $A$ and then subtract all the components in the directions of the previous $q_i$. So, we have constructed an orthonormal basis for the Krylov subspace $\mathcal{K}_n$. We can also interpret Arnoldi iteration as projections onto successive $\mathcal{K}_n$. We have
\[
Q_n^*Q_{n+1} = \begin{pmatrix} I_{n\times n} & 0 \end{pmatrix}
\]
the $n\times(n+1)$ identity with an extra zero column, so
\[
Q_n^*Q_{n+1}\tilde{H}_n = \begin{pmatrix}
h_{1,1} & h_{1,2} & \cdots & h_{1,n} \\
h_{2,1} & h_{2,2} & \ddots & \vdots \\
& \ddots & \ddots & \vdots \\
& & h_{n,n-1} & h_{n,n}
\end{pmatrix} = H_n
\]
Then $H_n = Q_n^*AQ_n$, because $AQ_n = Q_{n+1}\tilde{H}_n$ (when $n = m$ this is the full similarity transform $A = QHQ^*$), so $H_n$ may be considered a projection of $A$ onto $\mathcal{K}_n$. Furthermore, the diagonal of $H_n$ is $q_k^TAq_k$, the Rayleigh quotient $r(q_k)$. This is thus another way to reveal eigenvalue estimates, and these estimates are called the "Ritz values".
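A sketch of Arnoldi iteration in numpy, following the recurrence above. It returns the basis matrix with n+1 columns and the (n+1) x n Hessenberg matrix, and it ignores the possibility of breakdown (h_{n+1,n} = 0) for simplicity.

import numpy as np

def arnoldi(A, b, n):
    """n steps of Arnoldi iteration: Q has orthonormal columns spanning the Krylov
    subspaces, and H is (n+1) x n with A @ Q[:, :n] = Q @ H."""
    m = len(b)
    Q = np.zeros((m, n + 1))
    H = np.zeros((n + 1, n))
    Q[:, 0] = b / np.linalg.norm(b)
    for k in range(n):
        v = A @ Q[:, k]
        for i in range(k + 1):            # orthogonalize against q_1, ..., q_{k+1}
            H[i, k] = Q[:, i] @ v
            v -= H[i, k] * Q[:, i]
        H[k + 1, k] = np.linalg.norm(v)
        Q[:, k + 1] = v / H[k + 1, k]
    return Q, H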

GMRes

The process of using Arnoldi iteration to solve $Ax = b$ for large $A$ is called the Generalized Minimal Residual (GMRes) method. If we assume that $A$ is non singular, then $x^* = A^{-1}b$. At step $n$ of our iteration, we approximate $x^*$ by $x_n \in \mathcal{K}_n$ such that we minimize the norm of the residual $r = b - Ax_n$. The straightforward but unstable idea is to consider
\[
AK_n = \begin{pmatrix} Ab & A^2b & \cdots & A^nb \end{pmatrix}
\]
and try to minimize
\[
\|AK_nc - b\|_2
\]
and say $x_n = K_nc$, where
\[
K_n = \begin{pmatrix} b & Ab & \cdots & A^{n-1}b \end{pmatrix}
\]
Instead of this, we use Arnoldi iteration to build $Q_n$ with orthonormal columns spanning $\mathcal{K}_n$. The best approximation is then $x_n = Q_ny$, and we minimize
\[
\|AQ_ny - b\|_2 = \|Q_{n+1}\tilde{H}_ny - b\|_2
\]
Since $b = \|b\|q_1$ lies in the column space of $Q_{n+1}$, the norm is unchanged if we multiply by $Q_{n+1}^*$:
\[
Q_{n+1}^*(Q_{n+1}\tilde{H}_ny - b) = \tilde{H}_ny - Q_{n+1}^*b
\]
and $Q_{n+1}^*b = \|b\|e_1$, so we minimize
\[
\|\tilde{H}_ny - \|b\|e_1\|_2
\]
and then $x_n = Q_ny$. The algorithm is:

$q_1 = \frac{b}{\|b\|}$;
while above some tolerance do
    do step $n$ of Arnoldi iteration;
    minimize $\|\tilde{H}_ny - \|b\|e_1\|_2$;
    $x_n = Q_ny$;
end

This is solving the least squares problem for $\|\tilde{H}_ny - \|b\|e_1\|_2$ with a QR factorization, and is $O(n^2)$ flops.
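Putting the pieces together, here is a bare-bones GMRes built on the arnoldi() sketch above: run n Arnoldi steps, solve the small (n+1) x n least squares problem, and map back with Q_n. This is only a demonstration; in practice one would use a library routine such as scipy.sparse.linalg.gmres, which adds restarting and proper stopping tests.

import numpy as np

def gmres_sketch(A, b, n):
    """Minimal GMRes using the arnoldi() sketch defined above."""
    Q, H = arnoldi(A, b, n)
    e1 = np.zeros(n + 1)
    e1[0] = np.linalg.norm(b)
    y, *_ = np.linalg.lstsq(H, e1, rcond=None)   # minimize ||H y - ||b|| e1||
    return Q[:, :n] @ y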

To show that this converges to the answer, we want $\frac{\|r_n\|}{\|b\|} \ll 1$. We notice that $\|r_{n+1}\| \leq \|r_n\|$ because $\mathcal{K}_n \subseteq \mathcal{K}_{n+1}$, and $\|r_m\| = 0$ if $A$ is invertible because $\mathcal{K}_m = \mathbb{C}^m$ (assuming exact arithmetic). We note that
\[
x_n = c_0b + c_1Ab + \cdots + c_{n-1}A^{n-1}b = h_n(A)b
\]
where $h_n$ is a polynomial of degree $n - 1$. Then
\[
r_n = b - Ax_n = (I - Ah_n(A))b = p_n(A)b
\]
with $p_n(z) = 1 - zh_n(z) \in P_n = \{\text{polynomials of degree } n \text{ with } p(0) = 1\}$. GMRes is then choosing the $c_k$ to minimize $\|p_n(A)b\|$ for $n = 1, 2, \ldots$. We see that
\[
\|r_n\| = \|p_n(A)b\| \leq \|p_n(A)\|\|b\|
\]
and so
\[
\frac{\|r_n\|}{\|b\|} \leq \inf_{p_n\in P_n}\|p_n(A)\|
\]
and so if $\|p_n(A)\|$ can be controlled, we are done. To estimate $\|p_n(A)\|$, assume $A$ is diagonalizable and let $A = V\Lambda V^{-1}$. Then
\[
\|p_n(A)\| = \|Vp_n(\Lambda)V^{-1}\| \leq \|V\|\|p_n(\Lambda)\|\|V^{-1}\| = \kappa(V)\|p_n(\Lambda)\| = \kappa(V)\sup_{\lambda\in\Lambda(A)}|p_n(\lambda)|
\]
so
\[
\frac{\|r_n\|}{\|b\|} \leq \inf_{p_n\in P_n}\left(\kappa(V)\sup_{\lambda\in\Lambda(A)}|p_n(\lambda)|\right)
\]

8.11.3 The Conjugate Gradient Method

The last iterative method to cover is the Conjugate Gradient Method, which is used to solve $Ax = b$ when $A$ is symmetric and positive definite. First, remember that "positive definite" means that $x^TAx > 0$ for all $x \neq 0$, and that all the eigenvalues of $A$ are positive. Assume $A \in \mathbb{R}^{m\times m}$ and $b \in \mathbb{R}^m$. For $u, v \in \mathbb{R}^m$, $u$ and $v$ are called $A$ conjugate if $u^TAv = 0$. Any quadratic polynomial can be written as
\[
F(x) = x^TMx + x^Ta + c
\]
It turns out that $M$ is symmetric and positive definite if and only if $F(x)$ is convex (i.e. $F(\alpha x + (1 - \alpha)y) \leq \alpha F(x) + (1 - \alpha)F(y)$ for $0 \leq \alpha \leq 1$ and all $x, y$). So if $M$ is symmetric and positive definite, $F(x)$ has a unique minimum. So, for our symmetric, positive definite $A$, if
\[
F(x) = \frac{1}{2}x^TAx - x^Tb
\]
then $F(x)$ has a unique minimum at $x = A^{-1}b$. This clearly suggests a connection between solving $Ax = b$ and minimizing $F(x)$. We will choose $x_n \in \mathcal{K}_n$ to minimize $\|e_n\|_A$, where $e_n = x - x_n$ and
\[
\|y\|_A = \sqrt{y^TAy}
\]
so
\[
\|e_n\|_A^2 = e_n^TAe_n = (x - x_n)^TA(x - x_n) = x_n^TAx_n - x_n^TAx - x^TAx_n + x^TAx
\]
and symmetry of $A$ along with the fact that $Ax = b$ gives
\[
\|e_n\|_A^2 = x_n^TAx_n - 2x_n^Tb + x^Tb = 2F(x_n) + x^Tb
\]

so clearly our intuition was correct: minimizing $\|e_n\|_A$ is equivalent to minimizing the quadratic $F(x_n)$. The first idea that is natural to have for minimizing a quadratic is to make use of the gradient, which points in the direction of steepest ascent. The method of steepest descent is just that: pick $x_0$, search for a minimum in the direction of the negative gradient, and then repeat. It is important to note that taking the gradient of $F(x)$ gives
\[
[\nabla F]_m = \frac{\partial}{\partial x_m}\left(\frac{1}{2}\sum_{i,j}x_iA_{i,j}x_j - \sum_ix_ib_i\right)
\]
and this works out to be
\[
\nabla F = Ax - b
\]
and, more important to us, this means the residual is the negative gradient:
\[
r_n = b - Ax_n = -\nabla F(x_n) = A(x - x_n) = Ae_n
\]

In general, this sort of method follows the iteration
\[
x_{n+1} = x_n + \alpha_nd_n
\]
where $\alpha_n$ is the search distance and $d_n$ the direction. The steepest descent method is then simply $d_n = -\nabla F(x_n) = r_n$. The conjugate gradient method will pick a more effective search direction, but it is worth understanding the process of picking $\alpha_n$ in the context of steepest descent, because steepest descent is easier to visualize and involves simpler calculations. Given our search direction $d_n$ (for now $r_n$), we need to do a line search to minimize $F(x_{n+1})$ in that direction. This means
\[
0 = \frac{\partial F(x_{n+1})}{\partial\alpha_n} = (\nabla F(x_{n+1}))^Td_n = -r_{n+1}^Td_n
\]
and note that $r_{n+1} = b - Ax_{n+1} = b - A(x_n + \alpha_nd_n) = r_n - \alpha_nAd_n$. So,
\[
0 = -(r_n - \alpha_nAd_n)^Td_n
\]
so
\[
\alpha_n = \frac{d_n^Tr_n}{d_n^TAd_n} = \frac{r_n^Tr_n}{r_n^TAr_n}
\]
with the last equality for steepest descent only, because $d_n = r_n$ for that method. So we know how to pick the search distance, and for the method of steepest descent, the direction is simply the residual. However, the method of steepest descent is not very good. We want to choose a search direction that will get us to our minimum more quickly, and that is what we get from the conjugate gradient method. We want to decompose our search so that we don't have to "backtrack" at all, so $d_n^Td_k = 0$ whenever $n \neq k$. This corresponds to requesting
\[
d_n^Te_{n+1} = 0
\]
We notice $e_{n+1} = e_n - \alpha_nd_n$, so what we want is
\[
d_n^T(e_n - \alpha_nd_n) = 0 \Rightarrow \alpha_n = \frac{d_n^Te_n}{d_n^Td_n}
\]
and we know that $Ae_n = r_n$, but we don't know $e_n$. The first big idea is to choose our search directions to be $A$ conjugate, so that for $n \neq k$, $d_n^TAd_k = 0$, and to ask for $d_n^TAe_{n+1} = 0$ instead. The line search then becomes
\[
0 = -d_n^T(r_n - \alpha_nAd_n) \Rightarrow \alpha_n = \frac{d_n^Tr_n}{d_n^TAd_n}
\]


For any set of search directions such that $d_n^TAd_k = 0$, $n \neq k$, we should get convergence to $x$ in at most $m$ steps, because we will then have searched the whole space. To see this, we write
\[
x - x_0 = e_0 = \sum_{j=0}^{m-1}c_jd_j
\]
Then we see
\[
d_k^TAe_0 = \sum_{j=0}^{m-1}c_jd_k^TAd_j = c_kd_k^TAd_k
\]
and so
\[
c_k = \frac{d_k^TAe_0}{d_k^TAd_k}
\]
and because
\[
e_k = x - x_k = x - \left(x_0 + \sum_{j=0}^{k-1}\alpha_jd_j\right) = e_0 - \sum_{j=0}^{k-1}\alpha_jd_j
\]
we have
\[
c_k = \frac{1}{d_k^TAd_k}d_k^TA\left(e_k + \sum_{j=0}^{k-1}\alpha_jd_j\right) = \frac{d_k^TAe_k}{d_k^TAd_k} = \alpha_k
\]
This is useful because it means that
\[
e_k = e_0 - \sum_{j=0}^{k-1}\alpha_jd_j = \sum_{j=0}^{m-1}\alpha_jd_j - \sum_{j=0}^{k-1}\alpha_jd_j = \sum_{j=k}^{m-1}\alpha_jd_j
\]
so $\|e_{k+1}\|_A \leq \|e_k\|_A$ and $e_k \to 0$ as $k \to m$. The second big idea of the conjugate gradient method is to take
\[
e_k \in e_0 \oplus \mathcal{D}_k = e_0 \oplus \langle d_0, d_1, \ldots, d_{k-1}\rangle
\]
to minimize $\|e_k\|_A$. With $d_n^TAd_k = 0$ for all $n \neq k$, and $\alpha_n = \frac{d_n^Tr_n}{d_n^TAd_n}$, we get
\[
\|e_k\|_A^2 = \left(\sum_{j=k}^{m-1}\alpha_jd_j\right)^TA\left(\sum_{i=k}^{m-1}\alpha_id_i\right) = \sum_{j=k}^{m-1}\alpha_j^2d_j^TAd_j
\]

and we have minimized $\|e_k\|_A$ for $e_k \in e_0 \oplus \mathcal{D}_k$ by taking $A$ conjugate directions (that choice eliminated all the cross terms in the above multiplication of sums). So taking $A$ conjugate directions will give us an exact answer in $m$ steps, give convergence toward that answer in fewer steps than that, and minimize the error at each step. We now only need to figure out how to take conjugate directions. One possibility is to first construct an $A$ conjugate basis from a set of $m$ linearly independent vectors using something like a Gram-Schmidt process. Take $u_0, u_1, \ldots, u_{m-1}$ with $\|u_0\|_2 = 1$ and follow Gram-Schmidt:

• $d_0 = u_0$

• $d_i = u_i + \sum_{k=0}^{i-1}\beta_{i,k}d_k$

• $\beta_{i,j} = \dfrac{-u_i^TAd_j}{d_j^TAd_j}$, so that $d_i^TAd_j = 0$ for $j < i$

This strategy isn't great, however, because it is very costly; the cost goes up rapidly with $i$. We would rather take the $u_i$ so that
\[
\beta_{i,j} = \begin{cases} c \neq 0 & j = i - 1 \\ 0 & j \neq i - 1 \end{cases}
\]
For $i < k$, we note
\[
d_i^Tr_k = d_i^TAe_k = d_i^TA\left(\sum_{j=k}^{m-1}\alpha_jd_j\right) = 0
\]
so all of our previous search directions should be orthogonal to our current residual if we are choosing $A$ conjugate directions. This leads us to the third big idea of the method: construct our directions $d_k$ from the residuals $r_k$. The consequence of this is that
\[
\langle r_0, r_1, \ldots, r_{i-1}\rangle = \langle d_0, d_1, \ldots, d_{i-1}\rangle = \mathcal{D}_i
\]
and since $r_k \perp \mathcal{D}_k$, $r_k^Tr_i = 0$ for $i \neq k$. Furthermore, $r_i = r_{i-1} - \alpha_{i-1}Ad_{i-1}$, with $r_{i-1}, d_{i-1} \in \mathcal{D}_i$, so $\mathcal{D}_{i+1} = \mathcal{D}_i \oplus \langle Ad_{i-1}\rangle$. This means that
\[
\mathcal{D}_i = \langle d_0, Ad_0, A^2d_0, \ldots, A^{i-1}d_0\rangle = \langle r_0, Ar_0, A^2r_0, \ldots, A^{i-1}r_0\rangle
\]
so if $x_0 = 0$, so that $r_0 = b$, we are using the Krylov subspaces $\mathcal{K}_i$! The Gram-Schmidt procedure then uses
\[
\beta_{i,j} = \frac{-r_i^TAd_j}{d_j^TAd_j}
\]
but $r_{j+1} = r_j - \alpha_jAd_j$, so $\alpha_jr_i^TAd_j = r_i^Tr_j - r_i^Tr_{j+1}$. Since the residuals are mutually orthogonal, this vanishes unless $i = j$ or $i = j + 1$, and we get
\[
\beta_{i,j} = \begin{cases}
\dfrac{1}{\alpha_{i-1}}\dfrac{r_i^Tr_i}{d_{i-1}^TAd_{i-1}} = \dfrac{r_i^Tr_i}{r_{i-1}^Tr_{i-1}} = \beta_i & i = j + 1 \\
0 & i \neq j + 1
\end{cases}
\]

In summary, the algorithm for the conjugate gradient method is:

$x_0 = 0$; $r_0 = b$; $d_0 = r_0$;
while above a tolerance and $i < m$ do
    $\alpha_i = \frac{r_i^Tr_i}{d_i^TAd_i}$;
    $x_{i+1} = x_i + \alpha_id_i$;
    $r_{i+1} = r_i - \alpha_iAd_i$;
    $\beta_{i+1} = \frac{r_{i+1}^Tr_{i+1}}{r_i^Tr_i}$;
    $d_{i+1} = r_{i+1} + \beta_{i+1}d_i$;
end
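The algorithm translates almost line for line into numpy; this sketch stops on a residual tolerance or after m steps.

import numpy as np

def conjugate_gradient(A, b, tol=1e-10):
    """Conjugate gradient for symmetric positive definite A, following the algorithm above."""
    m = len(b)
    x = np.zeros(m)
    r = np.array(b, dtype=float)
    d = r.copy()
    rr = r @ r
    for _ in range(m):
        if np.sqrt(rr) < tol:
            break
        Ad = A @ d
        alpha = rr / (d @ Ad)
        x = x + alpha * d
        r = r - alpha * Ad
        rr_new = r @ r
        beta = rr_new / rr
        d = r + beta * d
        rr = rr_new
    return x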


Speed of Convergence

Recall that $x_n \in \mathcal{K}_n$ is chosen such that $\|e_n\|_A$ is minimized, where $e_n = x - x_n$, and that $\langle r_0, r_1, \ldots, r_{n-1}\rangle = \mathcal{D}_n = \langle d_0, d_1, \ldots, d_{n-1}\rangle = \mathcal{K}_n$. Then
\[
e_n = x - x_n = x - \sum_{k=0}^{n-1}c_kA^kr_0 = x - \sum_{k=0}^{n-1}c_kA^{k+1}e_0
\]
The conjugate gradient method chooses the coefficients to minimize $\|e_n\|_A$, so we are minimizing $\|p_n(A)e_0\|_A$, where $p_n \in P_n = \{\text{polynomials of degree at most } n \text{ with } p(0) = 1\}$. We already know that $\|e_k\|_A \leq \|e_{k-1}\|_A$. Expand $e_0$ in $v_j$, the eigenvectors of $A$:
\[
e_0 = \sum_{j=1}^{m}c_jv_j
\]
then
\[
\frac{\|e_k\|_A}{\|e_0\|_A} = \inf_{p\in P_k}\frac{\|p(A)e_0\|_A}{\|e_0\|_A} \leq \inf_{p\in P_k}\left(\max_{\lambda\in\Lambda(A)}|p(\lambda)|\right)
\]
and so the convergence properties of the method are determined by the spectrum of $A$ (this shouldn't be too surprising given the connection between Krylov subspaces and the spectrum of $A$). If the eigenvalues are clustered about a point $z_0 \in \mathbb{C}$, then
\[
p(z) \approx \left(1 - \frac{z}{z_0}\right)^n
\]
is small on the spectrum, and so $\frac{\|e_k\|_A}{\|e_0\|_A}$ diminishes rapidly. In general, we have $\kappa(A) = \|A\|_2\|A^{-1}\|_2 = \frac{|\lambda_{\max}|}{|\lambda_{\min}|}$. Using Chebyshev polynomials, we can see
\[
\frac{\|e_n\|_A}{\|e_0\|_A} \leq 2\left(\frac{\sqrt{\kappa(A)} - 1}{\sqrt{\kappa(A)} + 1}\right)^n
\]
Thus, for poorly conditioned matrices, $\kappa(A) \gg 1$, we have
\[
\frac{\|e_n\|_A}{\|e_0\|_A} \lesssim 2\left(1 - \frac{2}{\sqrt{\kappa(A)}}\right)^n
\]
and setting this bound to roughly machine precision,
\[
\frac{\varepsilon_{\text{machine}}}{2} \approx \left(1 - \frac{2}{\sqrt{\kappa(A)}}\right)^n
\quad\Rightarrow\quad
\log\left(\frac{\varepsilon_{\text{machine}}}{2}\right) \approx n\log\left(1 - \frac{2}{\sqrt{\kappa(A)}}\right) \approx -\frac{2n}{\sqrt{\kappa(A)}}
\]
\[
\Rightarrow n \approx 20\sqrt{\kappa(A)} \text{ for large } \kappa(A).
\]


Chapter 9

Finite Elements

We return to partial differential equations with the Finite Element Method. This method is built on ideas of analysis, and so familiarity with Hilbert spaces, Sobolev spaces, and associated regularity theory is very helpful. Finite elements are an exciting branch of computational math, built on interesting analysis, and they take advantage of elegant mathematical structure.

9.1 Introduction and Review of PDEs

Consider a general linear PDE in $n$ variables, letting $x = (x_1, x_2, \ldots, x_n)$. A second order PDE can be written
\[
-\sum_{i,k=1}^{n}a_{i,k}(x)u_{x_ix_k} + \sum_{k=1}^{n}b_k(x)u_{x_k} + c(x)u = f(x)
\]
Consider the $n\times n$ matrix $A(x)$ with entries $a_{i,k}(x)$ from the PDE. As long as we have some regularity (we assume here that we have enough), $A$ can be taken symmetric because $u_{x_ix_k} = u_{x_kx_i}$, meaning we can assume $a_{i,k}(x) = a_{k,i}(x)$. The PDE is of elliptic type if $A(x)$ is symmetric positive definite. Then we can write $L(u) = f$. If the PDE is translation and rotation invariant (i.e. invariant under isometric mappings), then
\[
L(u) = -a_0\Delta u + c_0u
\]
The canonical example of an elliptic PDE is the Poisson equation:
\[
-\Delta u = f
\]
in some domain $\Omega \subset \mathbb{R}^n$ (note that $\Delta u = \nabla^2u$). For this chapter, we will use the following space:
\[
V = \{v(x) : v \in C[0,1],\ v' \text{ piecewise continuous on } [0,1],\ v(0) = v(1) = 0\}
\]
We will also use the functional $\mathcal{F} : V \to \mathbb{R}$,
\[
\mathcal{F}(v) = \frac{1}{2}(v', v') - (f, v)
\]


This is the total potential energy, and here the inner product is the $L^2[0,1]$ inner product:
\[
(v, w) = \int_0^1 v(x)w(x)\,dx
\]
There are actually three equivalent problems we can consider. Using the one dimensional case as an example (so $\Delta u = u_{xx}$), these are

(D) The differential form: $-u_{xx} = f$, $u(0) = u(1) = 0$

(M) The minimization form: find $u \in V$ such that $\mathcal{F}(u) \leq \mathcal{F}(v)$ for all $v \in V$. This is called the "principle of minimum energy".

(V) The variational form: find $u \in V$ such that $(u', v') = (f, v)$ for all $v \in V$. This is called the "principle of virtual work".

And we claim that D ⇔ M ⇔ V.

Proof. First, D $\Rightarrow$ V: Take some $v \in V$ (this is called a "test function"). We need some assumptions about the regularity of $f$ and $u$; let's assume we have those (specifically, we are invoking the elliptic regularity theorem, but let's not worry about that). Then we have
\begin{align*}
-u_{xx} &= f \\
-vu_{xx} &= vf \\
\int_0^1 -vu_{xx}\,dx &= \int_0^1 vf\,dx
\end{align*}
and integrating by parts (the boundary terms vanish because $v(0) = v(1) = 0$),
\[
\int_0^1 u_xv_x\,dx = \int_0^1 vf\,dx
\]
which we can write $(u', v') = (f, v)$. Because our choice of test function was arbitrary, this is true for all $v \in V$.

Next, V $\Rightarrow$ D: Simply take the integration by parts backwards, so $(u', v') = -(u'', v)$, and then $(-u'' - f, v) = 0$ for all $v \in V$ forces $-u'' = f$.


Next, M $\Rightarrow$ V: Let $g(\varepsilon) = \mathcal{F}(u + \varepsilon v) \geq \mathcal{F}(u)$ for all $v \in V$, so $g$ has a minimum at $\varepsilon = 0$. Then $g'(0) = 0$, and
\begin{align*}
g'(0) &= \frac{\partial}{\partial\varepsilon}\left[\frac{1}{2}(u' + \varepsilon v', u' + \varepsilon v') - (f, u + \varepsilon v)\right]\Big|_{\varepsilon\to 0} \\
&= \frac{\partial}{\partial\varepsilon}\left[\frac{1}{2}(u', u') - (f, u) + \varepsilon(u', v') - \varepsilon(f, v) + \frac{\varepsilon^2}{2}(v', v')\right]\Big|_{\varepsilon\to 0} \\
&= (u', v') - (f, v)
\end{align*}
for all $v \in V$, so $(u', v') = (f, v)$.

Finally, V $\Rightarrow$ M: Let $w = v - u$, so $w \in V$. Then
\begin{align*}
\mathcal{F}(v) &= \mathcal{F}(u + w) \\
&= \frac{1}{2}(u' + w', u' + w') - (f, u + w) \\
&= \frac{1}{2}(u', u') - (f, u) + \frac{1}{2}(w', w') - (f, w) + (u', w') \\
&= \mathcal{F}(u) + \frac{1}{2}(w', w') \geq \mathcal{F}(u)
\end{align*}
using $(u', w') = (f, w)$ from V.

Now we can say a bit more about regularity. In one dimension, if $f \in C$ then $u \in C^2$. We also have $u = \Phi * f$, with $-\Delta\Phi = \delta(x)$ (the Dirac delta). Furthermore, if $f \in C^{k+\alpha}$, $0 < \alpha \leq 1$, $k$ an integer (for $k = 0$ this is the same as saying $f$ is Hölder continuous:
\[
\sup_{x\neq y}\frac{|f(x) - f(y)|}{|x - y|^\alpha} < \infty
\]
), then $u \in C^{k+\alpha+2}$. This can be shown by taking $u_\varepsilon = \Phi_\varepsilon * f$, with $-\Delta\Phi_\varepsilon = \delta_\varepsilon(x)$, showing $u_\varepsilon \in C^{k+\alpha+2}$, and then taking $u_\varepsilon \to u$.

Just for kicks, let's compare this to the elliptic regularity theorem, which is

Theorem. In a domain $U$, if $u \in \mathcal{D}'(U)$ and $p(\partial)$ is elliptic of degree $m$, and $p(\partial)u = f \in H_s^{loc}(U)$, then $u \in H_{s+m}^{loc}$.

We need to give a little bit of background in order to make sense of that, but for the most part those unfamiliar with functional analysis can safely ignore this part (this is one of those irritating parts of the book where I am not going to explain that much). Let's start with the definition
\[
H_s^{loc} = \{f \in \mathcal{D}'(U) : \text{for any compact set } V \subset U,\ f = \phi \text{ on } V \text{ for some } \phi \in H_s(\mathbb{R}^d)\}
\]
I won't include the definition of the Sobolev space $H_s(\mathbb{R}^d)$, because that is pretty standard. I am using $\mathcal{D}'(U)$ to denote the set of distributions (NOT in the sense of probability) on $U$. Specifically, this is the dual space to the space of test functions, meaning functions with support in $U$ which are infinitely smooth (if you are unfamiliar with any of that, skip this part). I have skipped quite a lot of the surrounding theory of the elliptic regularity theorem, but all I wanted to do was point out that the theory exists, and what we are using is a slightly different version. The upshot is that what we have is a case of the more encompassing theory, and there is a lot of theory developed around the regularity of solutions to elliptic PDEs.

On the other hand, in higher than one dimension, $f \in C$ is not enough to guarantee $u \in C^2$. A counterexample here is
\[
f(x) = \frac{y^2 - x^2}{2|x|^2}\left(\frac{n + 2}{(-\log|x|)^{1/2}} + \frac{1}{2(-\log|x|)^{3/2}}\right), \qquad x \in B_r \subset \mathbb{R}^n
\]
with $f(0) = 0$ (here $x$ and $y$ denote the first two coordinates of the point and $|x|$ its norm). Then, if $-\Delta u = f$,
\[
u = (y^2 - x^2)(-\log|x|)^{1/2}
\]
but $\frac{\partial^2u}{\partial x^2} \to \infty$ as $x \to 0$, so we don't have the regularity.

9.2 A First Look at Finite Elements

So why talk about analysis-type stuff like regularity and function spaces (besides the elegance of the theory itself, of course)? It turns out that a whole class of methods for solving PDEs numerically, which includes the finite element methods, is built on ideas from functional analysis. Here we reach the intersection between "pure" math and "applied" math (let's not get started on the connotations involved with calling one of those "pure") and get to why someone like an engineer needs to bother with something like a Sobolev space! Consider a finite dimensional subspace $V_h$ of our function space $V$,
\[
V_h = \{v(x) \in C[0,1],\ v \text{ piecewise linear},\ v(0) = v(1) = 0\}
\]
where we generally want the linearity to be on grid intervals. In one dimension, we are talking about continuous functions that are linear on each subinterval of a grid $0 = x_0 < x_1 < \cdots < x_{m+1} = 1$,

with $h_j = x_j - x_{j-1}$ and $h = \max_jh_j$. There are many ways to represent $v \in V_h$. At the end of the day, the class of methods we are talking about is all about representing $v \in V_h$ with some finite set of basis functions for $V_h$. Consider $v_i = v(x_i)$ and define basis functions
\[
\phi_j(x_i) = \begin{cases} 1 & i = j \\ 0 & \text{otherwise} \end{cases}, \qquad \phi_j \in V_h
\]
In one dimension these basis functions are the "hat" functions, equal to 1 at $x_j$ and falling linearly to 0 at the neighbouring nodes;

then any v ∈ Vh may be written as

\[
v(x) = \sum_{j=1}^{m}v_j\phi_j(x)
\]
so for some unknown $u$, we can write $u$ as
\[
u = \sum_{j=1}^{m}\xi_j\phi_j(x)
\]
and to find $u$ we need only find the coefficients $\xi_j$. This gives us $m$ unknowns, and this insight reduces (as we will see) finding a numerical solution to a PDE to a computational linear algebra problem! In general, we can use any basis for $V_h$ we want, but some have advantages over others in accuracy and speed. The numerical variational ($V_n$) problem can be stated:

Find $u \in V_h$ such that $(u', v') = (f, v)$ for all $v \in V_h$.

A method which makes use of this idea is called a Galerkin method. Alternatively, we could use $V_h$ to attack the minimization ($M_n$) problem:

Find $u \in V_h$ such that $\mathcal{F}(u) \leq \mathcal{F}(v)$ for all $v \in V_h$.

This is called a Ritz method. If, more generally,
\[
u = \sum_i\xi_i\phi_i(x), \qquad v = \sum_iv_i\psi_i(x)
\]


where $\phi_i$ and $\psi_i$ are (possibly different) basis functions for $V_h$ (usually $\phi_i = \psi_i$), we are using a Petrov-Galerkin method. Inspecting $V_n$: we want $u \in V_h$ with $(u', v') = (f, v)$ for all $v \in V_h$, and because any $v \in V_h$ can be written as a linear combination of basis functions $\phi_i$, it is sufficient to require
\[
u \in V_h, \quad (u', \phi_j') = (f, \phi_j) \quad \forall\phi_j
\]
Then we write $u = \sum_i\xi_i\phi_i(x)$ as an expansion in the basis, so
\[
\left(\sum_{i=1}^{m}\xi_i\phi_i', \phi_j'\right) = (f, \phi_j)
\]
and by linearity of the inner product,
\[
\sum_{i=1}^{m}\xi_i(\phi_i', \phi_j') = (f, \phi_j)
\]
And we have this for each $\phi_j$, so this gives an $m\times m$ linear system (where $m$ is the dimension of $V_h$). We want to know the coefficients $\xi_i$; then we can construct $u$. We define the stiffness matrix $A$ by
\[
A_{i,j} = (\phi_i', \phi_j')
\]
and the load vector $b$ by
\[
b_i = (f, \phi_i)
\]
and then we need only solve $A\xi = b$. It is a little amazing to think that we have reduced our PDE to a simple linear algebra problem! (Of course, the sacrifice is that our answer will be an approximation which lives in a finite dimensional space, rather than the true answer, which lives in an infinite dimensional function space.) We can use other finite dimensional spaces and other basis functions than the ones described, but these choices will determine speed and accuracy. With the space $V_h$ and the basis functions $\phi_i$ that we have described, we can note some good things about $A$. First, the inner product is commutative, so $A$ is symmetric (because $(\phi_i', \phi_j') = (\phi_j', \phi_i')$). The choice of basis functions has a big advantage as well. The support of each $\phi_i$ is small, and more to the point, $\operatorname{supp}(\phi_i)\cap\operatorname{supp}(\phi_j) = \emptyset$ (and so $(\phi_i', \phi_j') = 0$) unless $i = j$ or $i = j \pm 1$. This means we have a tridiagonal matrix. In higher dimensions, this will be a banded matrix of some kind (depending on what the grid looks like). Furthermore, here $A$ is positive definite.

Proof. For all $v \in V_h$, $v = \sum_{i=1}^{m}v_i\phi_i$. Let
\[
\eta = \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_m \end{pmatrix}
\]
then
\[
\eta^TA\eta = \sum_{i,j}v_i(\phi_i', \phi_j')v_j = \sum_{i,j}(v_i\phi_i', v_j\phi_j') = \left(\sum_iv_i\phi_i', \sum_jv_j\phi_j'\right) = (v', v') > 0
\]
for $v \neq 0$.

This is great because we have already shown that when $A$ is symmetric and positive definite, $A\xi = b$ has a unique solution. For our choice of $\phi_i(x)$, we have
\[
(\phi_j', \phi_j') = \frac{1}{h_j} + \frac{1}{h_{j+1}}
\]
and for $j = 2, \ldots, m$,
\[
(\phi_j', \phi_{j-1}') = (\phi_{j-1}', \phi_j') = \frac{-1}{h_j}
\]
and all other entries of $A$ are 0. In the special case of a uniform grid, we have
\[
h_j = h = \frac{1}{m + 1}
\]
and so
\[
A = \frac{1}{h}\begin{pmatrix}
2 & -1 & & & & \\
-1 & 2 & -1 & & & \\
& -1 & 2 & -1 & & \\
& & \ddots & \ddots & \ddots & \\
& & & -1 & 2 & -1 \\
& & & & -1 & 2
\end{pmatrix}
\]
which is similar to what we have for a centred difference approximation.
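Here is a minimal sketch of the whole one dimensional method on a uniform grid: assemble the tridiagonal stiffness matrix above, approximate the load integrals (f, phi_i) by the simple quadrature h*f(x_i) (one choice among many), and solve. The manufactured example f = pi^2 sin(pi x) has exact solution u = sin(pi x), so the printed error should be small.

import numpy as np

def fem_poisson_1d(f, m):
    """Piecewise linear finite elements for -u'' = f on [0,1], u(0)=u(1)=0,
    on a uniform grid with m interior nodes."""
    h = 1.0 / (m + 1)
    x = np.linspace(h, 1 - h, m)              # interior nodes
    # tridiagonal stiffness matrix (1/h) * tridiag(-1, 2, -1)
    A = (np.diag(2 * np.ones(m)) + np.diag(-np.ones(m - 1), 1)
         + np.diag(-np.ones(m - 1), -1)) / h
    b = h * f(x)                              # approximate load vector (f, phi_i)
    xi = np.linalg.solve(A, b)                # nodal values of the FEM solution
    return x, xi

x, u_h = fem_poisson_1d(lambda x: np.pi**2 * np.sin(np.pi * x), m=50)
print(np.max(np.abs(u_h - np.sin(np.pi * x))))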

In two dimensions, we start to get a more complicated problem. The problem is
\[
-\Delta u = f \text{ in } \Omega, \qquad u = 0 \text{ on } \partial\Omega
\]
and we use the space
\[
V = \{v(x, y) \in C(\Omega),\ v_x \text{ and } v_y \text{ piecewise continuous},\ v|_{\partial\Omega} = 0\}
\]
Then for all $v \in V$,
\begin{align*}
-\int_\Omega v\Delta u\,dS &= \int_\Omega fv\,dS \\
-\int_\Omega\bigl(\nabla\cdot(v\nabla u) - \nabla u\cdot\nabla v\bigr)\,dS &= \int_\Omega fv\,dS \\
-\int_{\partial\Omega}v\nabla u\cdot n\,d\Gamma + \int_\Omega\nabla u\cdot\nabla v\,dS &= \int_\Omega fv\,dS \\
\int_\Omega\nabla u\cdot\nabla v\,dS &= (f, v)
\end{align*}
that last being the variational form of the problem, that is:


Find $u \in V$ such that $a(u, v) = (f, v)$ for all $v \in V$, where
\[
a(u, v) = \int_\Omega\nabla u\cdot\nabla v\,dS
\]
We also have the minimization problem (which isn't too hard to show), which is:

Find $u \in V$ such that $\mathcal{F}(v) = \frac{1}{2}a(v, v) - (f, v) \geq \mathcal{F}(u)$ for all $v \in V$.

The idea of the method doesn't change, however. It is popular to use a triangular mesh as the discretization of $\Omega$. If we cover $\Omega$ with triangles $k_j$ (assuming $\partial\Omega$ is polygonal) which do not overlap and do not have any vertices meeting edges, we have
\[
\Omega = \bigcup_jk_j
\]
and define $h = \max_j(\operatorname{diam}(k_j))$, where $\operatorname{diam}(k_j) = |\text{longest side of } k_j|$. We use the finite dimensional space
\[
V_h = \{v \in C(\Omega),\ v|_{k_j} \text{ linear},\ v = 0 \text{ on } \partial\Omega\}
\]
Then if the nodes of the triangulation are $N_j$, we have the basis functions
\[
\phi_i(x) = \begin{cases} 1 & \text{at } N_i \\ 0 & \text{at any other } N_j \end{cases}
\]
then if $v_i = v(N_i)$, we have
\[
v = \sum_iv_i\phi_i(x)
\]

and we can get a linear system just like before. Here the basis functions look like tents above the triangulation. Development of the linear system proceeds exactly as before:
\begin{align*}
a(\phi_j, u) &= (\phi_j, f) \\
a\left(\phi_j, \sum_{i=1}^{m}\xi_i\phi_i\right) &= (\phi_j, f)
\end{align*}
and $a$ is bilinear, so
\[
\sum_{i=1}^{m}\xi_ia(\phi_j, \phi_i) = (\phi_j, f)
\]
and our stiffness matrix $A$ has entries
\[
A_{i,j} = a(\phi_i, \phi_j)
\]
and again is symmetric and positive definite. There is thus a unique solution to $A\xi = b$, with the load vector $b$ given by
\[
b_j = (f, \phi_j)
\]
We also get a banded matrix (depending on the triangulation), because $a(\phi_i, \phi_j) = 0$ if $N_j$ is more than one edge away from $N_i$.


9.3 Some Analysis - Hilbert & Sobolev Spaces

It is now time to suffer through some more rigorous analysis, the machinery that allows all of our work to make any sense. We need to think about linear vector spaces, which may be finite dimensional (like $\mathbb{R}^n$) or infinite dimensional (like $C[0,1]$), and we need to think about the operations that can be carried out on vectors in these spaces (functions and functionals). If $V$ is a linear space, then $L : V \to \mathbb{R}$ is a linear form if $L(au + bw) = aL(u) + bL(w)$, where $a, b \in \mathbb{R}$ and $u, w \in V$. Going further, $a(u, w)$ is a bilinear form on $V\times V$ if $a(\cdot,\cdot)$ is linear in each argument. Finally, $a(u, v)$ is symmetric if $a(u, v) = a(v, u)$ for all $u, v \in V$. A symmetric, bilinear form on $V\times V$ is a "scalar product" or "inner product" on $V$ if $a(v, v) > 0$ for all $v \in V$, $v \neq 0$, and we often write, for inner products, $a(u, v) = \langle u, v\rangle$ or simply $(u, v)$. We can also associate a norm to an inner product:
\[
\|v\|_a = \sqrt{a(v, v)}
\]
Because this comes from an inner product, we have the Cauchy-Schwarz inequality:
\[
|\langle u, v\rangle| \leq \|u\|_a\|v\|_a
\]
$V$ is called a Hilbert space if

• V is a linear space

• V is complete

• There is a scalar product (and thus a norm) on V

This is more than a complete, normed linear space (which is called a Banach space), because a norm does not imply a scalar product (although a scalar product, as we see, implies a norm). For example, the space $L^p([a,b])$ is a Banach space, but only a Hilbert space if $p = 2$. We also have the Sobolev spaces
\[
W^{\alpha,\beta} = \{v \mid v \in L^\beta, v' \in L^\beta, v'' \in L^\beta, \ldots, v^{(\alpha)} \in L^\beta\}
\]
These are also Hilbert spaces if $\beta = 2$, and then they are generally denoted $H^\alpha$. In words, for some function $v$ to be in $H^\alpha$, $v$ and its first $\alpha$ derivatives must be in $L^2$. An equivalent definition makes use of the operator $\Lambda_s$:
\[
\Lambda_s(f) = \left((1 + |\xi|^2)^{s/2}\hat{f}\right)^\vee
\]
and then
\[
H^s = \{f \mid \Lambda_sf \in L^2\}
\]
It isn't so important to understand this definition (and in fact I've left out the requirement that $f$ be something called a "tempered distribution"). We use the inner product for $H^\alpha$:
\[
\langle u, v\rangle = \int\sum_{i=0}^{\alpha}u^{(i)}v^{(i)}\,dx
\]
We can also restrict ourselves to some subset $\Omega$ of the space, so we have $H^s(\Omega)$, and also to functions with compact support, $H_0^s$, and even functions with compact support contained within $\Omega$, $H_0^s(\Omega)$.
