
MATH 337, by T. Lakoba, University of Vermont

0 Preliminaries

0.1 Motivation

The numerical methods of solving differential equations that we will study in this course are based on the following concept: Given a differential equation, e.g.,

y′(x) = f(x, y), (0.1)

replace the derivative by an appropriate finite difference, e.g.:

y′(x) ≈ (y(x + h) − y(x)) / h, when h is small (h ≪ 1). (0.2)

Then Eq. (0.1) becomes (in the approximate sense)

y(x + h)− y(x) = h f(x, y(x)), (0.3)

from which the ‘new’ value y(x + h) of the unknown function y can be found given the ‘old’ value y(x).

In this course, we will consider both equations that are more complicated than (0.1) and discretization schemes that are more sophisticated than (0.3).

0.2 Taylor series expansions

Taylor series expansion of functions will play a central role when we study the accuracy of discretization schemes. Below is a reminder from Calculus II, and its generalization.

If a function f(x) has infinitely many derivatives, then

f(x) = f(x0) + [(x − x0)/1!] f′(x0) + [(x − x0)²/2!] f′′(x0) + . . . = Σ_{n=0}^∞ [(x − x0)ⁿ/n!] f⁽ⁿ⁾(x0) . (0.4)

If f(x) has derivatives up to the (N + 1)st (i.e. f⁽ᴺ⁺¹⁾ exists), then

f(x) = Σ_{n=0}^N [(x − x0)ⁿ/n!] f⁽ⁿ⁾(x0) + [(x − x0)^{N+1}/(N + 1)!] f⁽ᴺ⁺¹⁾(x*), x* ∈ (x0, x) . (0.5)

For functions of two variables, Eq. (0.4) generalizes as follows (we denote ∆x = x − x0 and ∆y = y − y0):

f(x, y) = Σ_{n=0}^∞ [(∆x)ⁿ/n!] ∂ⁿf(x0, y)/∂xⁿ

= Σ_{n=0}^∞ [(∆x)ⁿ/n!] ( Σ_{m=0}^∞ [(∆y)ᵐ/m!] ∂ⁿ⁺ᵐf(x0, y0)/∂xⁿ∂yᵐ )

(explained below) = Σ_{k=0}^∞ (1/k!) ( ∆x ∂/∂x̄ + ∆y ∂/∂ȳ )ᵏ f(x̄, ȳ)|_{x̄=x0, ȳ=y0}

= f(x0, y0) + ( ∆x fx(x0, y0) + ∆y fy(x0, y0) )

+ (1/2!) ( (∆x)² fxx(x0, y0) + 2∆x∆y fxy(x0, y0) + (∆y)² fyy(x0, y0) ) + . . . (0.6)


The step of going from the second to the third line in the above calculations is based on the binomial expansion formula

(x + y)ᵏ = Σ_{n=0}^k [ k!/(n!(k − n)!) ] xⁿ yᵏ⁻ⁿ

and takes some effort to verify. (For example, one would write out all terms in line two with n + m = 2 and verify that they equal the term in line three with k = 2. Then one would repeat this for n + m = k = 3 and so on, until one sees the pattern.) For our purposes, it will be sufficient to just accept the end result, i.e. the last line of (0.6).
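As an illustration (added here for the reader; it is not part of the derivation in the notes), the k = 2 case of this verification can be written out explicitly:

```latex
% k = 2 term of the operator form in (0.6):
\frac{1}{2!}\left(\Delta x\,\frac{\partial}{\partial x} + \Delta y\,\frac{\partial}{\partial y}\right)^{\!2} f
  = \frac{1}{2}\left[(\Delta x)^2 f_{xx} + 2\,\Delta x\,\Delta y\,f_{xy} + (\Delta y)^2 f_{yy}\right].
% Terms of the double sum with n + m = 2:
%   (n,m) = (2,0):  \frac{(\Delta x)^2}{2!}\,f_{xx}
%   (n,m) = (1,1):  \Delta x\,\Delta y\,f_{xy}
%   (n,m) = (0,2):  \frac{(\Delta y)^2}{2!}\,f_{yy}
% Their sum is exactly the k = 2 term above.
```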

0.3 Existence and uniqueness theorem for ODEs

In the first two parts of this course, we will deal exclusively with ordinary differential equations (ODEs), i.e. equations that involve the derivative(s) with respect to only one independent variable (usually denoted as x).

To solve an ODE numerically, we first have to be sure that its solution exists and is unique; otherwise, we may be looking for something that simply is not there! The following theorem establishes this fundamental fact for ODEs.

Theorem Let y(x) satisfy the initial-value problem (IVP), i.e. an ODE plus the initial condition:

y′(x) = f(x, y), y(x0) = y0 . (0.7)

Let f(x, y) be defined and continuous in a closed region R that contains the point (x0, y0). Let, in addition, f(x, y) satisfy the Lipschitz condition with respect to y:

For any x, y1, y2 ∈ R, |f(x, y1)− f(x, y2)| ≤ L|y1 − y2| , (Lipschitz)

where the constant L depends on the region R and the function f , but not on y1 and y2. Then a solution of IVP (0.7) exists and is unique on some interval containing the point x0.

Remarks to the Theorem:

1. Any f(x, y) that is differentiable with respect to y and such that |fy| ≤ L in R satisfies the Lipschitz condition. In this case, the Lipschitz constant L = max_R |fy(x, y)|.

2. In addition, f(y) = |y| also satisfies the Lipschitz condition, even though this function does not have a derivative with respect to y at y = 0. In general, L = max |fy(x, y)|, where the maximum is taken over the part of R where fy exists. For example, for f(y) = |y|, one has L = 1.

3. f(y) = √y does not satisfy the Lipschitz condition on [0, 1]. Indeed, one cannot find a constant L that would be independent of y and such that

√y − √0 < L|y − 0|

for sufficiently small y.

Question: What happens to the solution of the ODE when the Lipschitz condition is violated?


Consider the IVP

y′(x) = √y, y(0) = 0 . (0.8)

As we have just said in Remark 3, the function f(y) = √y does not satisfy the Lipschitz condition. One can verify (by substitution) that IVP (0.8) has the following solutions:

1st solution: y = x²/4 .

2nd solution: y = 0 .

Infinitely many solutions:

y = 0 for 0 ≤ x ≤ a (∀a > 0), and y = (x − a)²/4 for x > a .

[Figure: plots of the 1st solution, the 2nd solution, and one of the other solutions of IVP (0.8).]

Thus, if f(x, y) does not satisfy the Lipschitz condition, the solution of IVP (0.7) may not be unique.

0.4 Solution of a linear inhomogeneous IVP

We recall here the procedure for solving the IVP

y′(x) = a y + g(x), a = const , y(x0) = y0. (0.9)

Step 1: Solve the homogeneous ODE y′ = a y:

y′_hom = a y_hom ⇒ y_hom(x) = e^{a(x−x0)}. (0.10)

Step 2: Look for the solution of the inhomogeneous problem in the form y(x) = y_hom(x) · c(x), where c(x) is determined by substituting the latter expression into Eq. (0.9):

c y′_hom + c′ y_hom = a c y_hom + g(x) ⇒ c′ = g(x)/y_hom ⇒

c(x) = ∫ g(x) e^{−a(x−x0)} dx ⇒

y(x) = [ y0 + ∫_{x0}^{x} g(x) e^{−a(x−x0)} dx ] e^{a(x−x0)} . (0.11)

In the first line of (0.11), the terms c y′_hom and a c y_hom cancel on the two sides of the equation due to (0.10).


0.5 A very useful limit from Calculus

In Calculus I, you learned that

lim_{h→0} (1 + h)^{1/h} = e, (0.12)

where e is the base of the natural logarithm. The following useful corollary is derived from (0.12):

lim_{h→0} (1 + ah)^{b/h} = e^{ab}, (0.13)

where a, b are any finite numbers. Indeed, if we denote ah = g, then g → 0 as h → 0, and then the l.h.s. (left-hand side) of (0.13) becomes:

lim_{g→0} (1 + g)^{b/(g/a)} = lim_{g→0} (1 + g)^{ab/g} = ( lim_{g→0} (1 + g)^{1/g} )^{ab} = e^{ab} .

Note also that

lim_{h→0} (1 + ah²)^{b/h} = e⁰ = 1 (0.14)

for any finite numbers a and b.
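The following small Python script (a sketch added for illustration; the values a = 2, b = 3 are arbitrary choices, not from the notes) checks (0.13) and (0.14) numerically:

```python
import math

# Numerical check of the limits (0.13) and (0.14).
# The values a = 2, b = 3 are arbitrary illustrative choices.
a, b = 2.0, 3.0
for h in [0.1, 0.01, 0.001]:
    lhs_013 = (1 + a * h) ** (b / h)      # should approach e^{ab}
    lhs_014 = (1 + a * h**2) ** (b / h)   # should approach e^0 = 1
    print(f"h={h:7.3f}  (1+ah)^(b/h)={lhs_013:12.4f}  "
          f"e^(ab)={math.exp(a*b):12.4f}  (1+ah^2)^(b/h)={lhs_014:8.5f}")
```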


1 Simple Euler method and its modifications

1.1 Simple Euler method for the 1st-order IVP

Consider the IVP

y′(x) = f(x, y), y(x0) = y0 . (1.1)

Let: xi = x0 + i h, i = 0, 1, . . . , n;
yi = y(xi) — the true solution evaluated at the points xi;
Yi — the solution to be calculated numerically.

Replace

y′(x) −→ (Yi+1 − Yi)/h .

Then Eq. (1.1) gets replaced with

Yi+1 = Yi + h f(xi, Yi), Y0 = y0 . (1.2)
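A minimal Python sketch of scheme (1.2) follows; the test problem y′ = −2y, y(0) = 1 and its exact solution e^{−2x} are illustrative choices, not taken from the notes:

```python
import numpy as np

def euler(f, x0, y0, h, n):
    """Simple Euler method (1.2): Y_{i+1} = Y_i + h f(x_i, Y_i)."""
    x = x0 + h * np.arange(n + 1)
    Y = np.empty(n + 1)
    Y[0] = y0
    for i in range(n):
        Y[i + 1] = Y[i] + h * f(x[i], Y[i])
    return x, Y

# Illustrative test problem (not from the notes): y' = -2y, y(0) = 1,
# whose exact solution is y = exp(-2x).
x, Y = euler(lambda x, y: -2.0 * y, 0.0, 1.0, h=0.1, n=10)
print("max error:", np.abs(Y - np.exp(-2.0 * x)).max())
```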

1.2 Local error of the simple Euler method

The calculated solution satisfies Eq. (1.2). Next, assuming that the true solution of IVP (1.1) has (at least) a second derivative y′′(x), one can use the Taylor expansion to write:

yi+1 = y(xi + h) = yi + y′i h + y′′(x*ᵢ) h²/2 = yi + h f(xi, yi) + O(h²) . (1.3)

Here x*ᵢ is some point between xi and xi+1 = xi + h, and we have used Eq. (0.5). The notation O(hᵏ) for any k means the following:

q = O(hᵏ) whenever lim_{h→0} q/hᵏ = const < ∞ , const ≠ 0 .

For example,

5h² + 1000h³ = O(h²); or h/(1 + h cos(3 + 2h)) = O(h) .

We now introduce a new notation. The local truncation error shows how well the solution Yi+1 of the finite-difference scheme approximates the exact solution yi+1 of the ODE at the point xi+1, assuming that at xi the two solutions were the same, i.e. Yi = yi. Comparing the last line of Eq. (1.3) with Eq. (1.2), we see that the local truncation error of the simple Euler method is O(h²). It tends to zero when h → 0.

Another useful notation is that of the discretization error. It shows how well the finite-difference scheme approximates the ODE. Let us now estimate this error. First, we note from (1.2) and (1.3) that the computed and exact solutions satisfy:

(Yi+1 − Yi)/h = f(xi, Yi) and (yi+1 − yi)/h = f(xi, yi) + O(h),

whence the discretization error of the simple Euler method is seen to be O(h).


1.3 Global error of the Euler method; Propagation of errors

As we have said above, the local truncation error shows how well the computed solution approximates the exact solution at one given point, assuming that these two solutions have been the same up to that point. However, as we compute the solution of the finite-difference scheme, the local truncation errors at each step accumulate. As a result, the difference between the computed solution Yi and the exact solution yi at some point xi down the line becomes much greater than the local truncation error.

Let εi = yi − Yi denote the error (the difference between the true and computed solutions) at x = xi. This error (or, sometimes, its absolute value) is called the global error of the finite-difference method.

Our goal in this subsection will be to find an upper bound for this error. Let us emphasize that finding an upper bound for the error rather than the error itself is the best one can do. (Indeed, if one could have found the actual error εi, one would have then simply added it to the numerical solution Yi and obtained the exact solution yi.) The main purpose of finding the upper bound for the error is to determine how it depends on the step size h. We will do this now for the simple Euler method (1.2).

To this end, we begin by considering Eq. (1.2) and the 1st line of Eq. (1.3):

Yi+1 = Yi + h f(xi, Yi),

yi+1 = yi + h f(xi, yi) + (h²/2) y′′(x*ᵢ) .

Subtract the 1st equation above from the 2nd to obtain the error at xi+1:

εi+1 = εi + h ( f(xi, yi) − f(xi, Yi) ) + (h²/2) y′′(x*ᵢ) . (1.4)

Now apply the “triangle inequality”, valid for any three numbers a, b, c:

a = b + c ⇒ |a| ≤ |b|+ |c|, (1.5)

to Eq. (1.4) and obtain:

|εi+1| ≤ |εi| + hL|εi| + (h²/2) |y′′(x*ᵢ)|

= (1 + hL)|εi| + (h²/2) |y′′(x*ᵢ)| . (1.6)

In writing the second term in the above formula, we used the fact that f(x, y) satisfies the Lipschitz condition with respect to y (see Lecture 0).

To complete finding the upper bound for the error |εi+1|, we need to estimate y′′(x*ᵢ). We use the Chain rule for a function of two variables (recall Calculus III) to obtain:

y′′(x) = d²y(x)/dx² |use the ODE| = df(x, y)/dx = fx (dx/dx) + fy (dy/dx) = fx + fy f . (1.7)

Considering the first term on the r.h.s. of (1.7), let us assume that

|fx| ≤ M1 for some M1 < ∞. (1.8)

In cases when this assumption does not hold (as, for example, for f(x, y) = x^{1/3} sin(1/x)), the estimate obtained below (see (1.16)) is not valid, but a modified estimate can usually be found on a case-by-case basis. So here we proceed with assumption (1.8).


Considering the second term on the r.h.s. of (1.7), we first recall that f satisfies the Lipschitz condition with respect to y, which means that

|fy| ≤ M2 for some M2 < ∞, (1.9)

except possibly at a finite number of y-values where fy does not exist (like at y = 0 for f(y) = |y|). Finally, the other factor of the second term on the r.h.s. of (1.7) is also bounded, because f is assumed to be continuous on the closed region R (see the Existence and Uniqueness Theorem in Lecture 0). Thus,

|f | ≤ M3 for some M3 < ∞. (1.10)

Combining Eqs. (1.7–1.10), we see that

|y′′(x*ᵢ)| ≤ M1 + M2 M3 ≡ M < ∞ . (1.11)

Now combining Eqs. (1.6) and (1.11), we obtain:

|εi+1| ≤ (1 + hL)|εi| + (h²/2) M . (1.12)

This last equation implies that |εi+1| ≤ zi+1 , where zi+1 satisfies the following recurrence equation:

zi+1 = (1 + hL) zi + (h²/2) M , z0 = 0 . (1.13)

(Condition z0 = 0 follows from the fact that ε0 = 0; see the initial conditions in Eqs. (1.1) and (1.2).)

Thus, the error |εi| is bounded by zi, and we need to solve Eq. (1.13) to find that bound. The way to do so is analogous to solving a linear inhomogeneous equation (see Section 0.4). However, before we obtain the solution, let us develop an intuitive understanding of what kind of answer we should expect. To this end, let us assume for the moment that L = 0 in Eq. (1.13). Then we have:

zi+1 = zi + (h²/2)M = ( zi−1 + (h²/2)M ) + (h²/2)M = . . .

= z0 + (h²/2)M · i = 0 + (h²/2)M · (xi − x0)/h = h · M(xi − x0)/2 = O(h) .

That is, the global error |εi| should have the size O(h). In other words,

Global error = Number of steps × Local error, or O(h) = O(1/h) × O(h²).

Now let us show that a similar estimate also holds for L ≠ 0. First, solve the homogeneous version of (1.13):

zi+1 = (1 + hL) zi ⇒ z_{i,hom} = (1 + hL)ⁱ . (1.14)

Note that this is an analogue of e^{a(xi−x0)} in Section 0.4, because

(1 + hL)ⁱ = (1 + hL)^{(xi−x0)/h} ≈ e^{L(xi−x0)} as h → 0,


where we have used the definition of xi, found after (1.1), and also the results of Section 0.5. In close analogy to the method used in Section 0.4, we seek the solution of (1.13) in the form zi = ci z_{i,hom} (with c0 = 0). Substituting this form into (1.13) and using Eq. (1.14), we obtain:

ci+1 (1 + hL)^{i+1} = (1 + hL) · ci (1 + hL)ⁱ + (h²/2) M ⇒

ci+1 = ci + h²M/(2(1 + hL)^{i+1}) = ci−1 + h²M/(2(1 + hL)ⁱ) + h²M/(2(1 + hL)^{i+1})

= . . . = c0 + Σ_{k=1}^{i+1} (h²M/2) · 1/(1 + hL)ᵏ

|geometric series| = [ h²M/(2(1 + hL)) ] · [ 1 − 1/(1 + hL)^{i+1} ] / [ 1 − 1/(1 + hL) ]

= ( hM/(2L) ) ( 1 − 1/(1 + hL)^{i+1} ) . (1.15)

Combining (1.14) and (1.15), and using (0.13), we finally obtain:

zi+1 = ( hM/(2L) ) ( (1 + hL)^{i+1} − 1 ) = ( hM/(2L) ) ( (1 + hL)^{(xi+1−x0)/h} − 1 ) ≈ ( hM/(2L) ) ( e^{L(xi+1−x0)} − 1 ) = O(h),

|εi+1| ≤ ( hM/(2L) ) ( e^{L(x−x0)} − 1 ) = O(h) . (1.16)

This is the upper bound for the global error of the simple Euler method (1.2).

Thus, in the last two subsections, we have shown that for the simple Euler method:

• Local truncation error = O(h2);

• Discretization error = O(h);

• Global error = O(h).

The exponent of h in the global error is often referred to as the order of the finite-difference method. Thus, the simple Euler method is a 1st-order method.

Question: How does the above bound for the error change when we include the machine round-off error (which occurs because numbers are computed with finite accuracy, usually ~10⁻¹⁶)?

Answer: In the above formulae, replace h²M/2 by h²M/2 + r, where r is the maximum value of the round-off error. Then Eq. (1.16) gets replaced with

|εi+1| ≤ ( h²M/2 + r ) (1/(hL)) ( e^{L(x−x0)} − 1 ) = ( hM/(2L) + r/(hL) ) ( e^{L(x−x0)} − 1 ) (1.17)

The r.h.s. of the above bound is schematically plotted in the figure below. We see that for very small h, the term r/h can be dominant.


Moral: Decreasing the step size of the difference equation does not always result in increased accuracy of the obtained solution.

[Figure: total error vs. step size h; the discretization error decreases with h, while the round-off error grows as h decreases.]

1.4 Modifications of the Euler method

In this subsection, our goal is to find finite-difference schemes which are more accurate than the simple Euler method (i.e., the global error of the sought methods should be O(h²) or better).

Again, we first want to develop an intuitive understanding of how this can be done, and then actually do it. So, to begin, we notice an obvious fact: the ODE y′ = f(x, y) is just a more general case of y′ = f(x). The solution of the latter equation is y = ∫ f(x)dx. Whenever we cannot evaluate the integral analytically in closed form, we resort to approximating the integral by Riemann sums.

A very crude approximation to ∫_a^b f(x)dx is provided by the left Riemann sums:

Yi+1 = Yi + h f(xi) .

This is the analogue of the simple Euler method (1.2):

Yi+1 = Yi + h f(xi, Yi) .

[Figure: left Riemann sums on the grid x0, x1, x2, x3.]

Approximations of the integral ∫_a^b f(x)dx that are known to be more accurate than the left Riemann sums are the Trapezoidal Rule and the Midpoint Rule:


Trapezoidal Rule:

Yi+1 = Yi + h ( f(xi) + f(xi+1) )/2 .

Its analogue for the ODE is to look like this:

Yi+1 = Yi + (h/2) ( f(xi, Yi) + f(xi+1, Yi + Ah) ) , (1.18)

where the coefficient A is to be determined. Method (1.18) is called the Modified Euler method.

[Figure: Trapezoidal Rule on the grid x0, x1, x2, x3.]

Midpoint Rule:

Yi+1 = Yi + h f(xi + h/2) .

Its analogue for the ODE is to look like this:

Yi+1 = Yi + h f( xi + h/2, Yi + Bh ) , (1.19)

where the coefficient B is to be determined. We will refer to method (1.19) as the Midpoint method.

[Figure: Midpoint Rule on the grid x0, x1, x2, x3.]

The coefficients A in (1.18) and B in (1.19) are determined from the requirement that the corresponding finite-difference scheme have the global error O(h²) (as opposed to the simple Euler's O(h)), or equivalently, the local truncation error O(h³). Below we will determine the value of A. You will be asked to compute B along similar lines in one of the homework problems.

To determine the coefficient A in the Modified Euler method (1.18), let us rewrite that equation while Taylor-expanding its r.h.s. using Eq. (0.6) with ∆x = h and ∆y = Ah:

Yi+1 = Yi + (h/2) f(xi, Yi) + (h/2) ( f(xi, Yi) + [ h fx(xi, Yi) + (Ah) fy(xi, Yi) ] + O(h²) )

= Yi + h f(xi, Yi) + (h²/2) ( fx(xi, Yi) + A fy(xi, Yi) ) + O(h³) . (1.20)

Equation (1.20) yields the Taylor expansion of the computed solution Yi+1. Let us compare it with the Taylor expansion of the exact solution y(xi+1). To simplify the notations, we will denote y′i = y′(xi), etc. Then, using Eq. (1.7):

yi+1 = yi + h y′i + (h²/2) y′′i + O(h³)

= yi + h f(xi, yi) + (h²/2) ( fx(xi, yi) + f(xi, yi) fy(xi, yi) ) + O(h³) . (1.21)

Upon comparing the last lines of Eqs. (1.20) and (1.21), we see that in order for method (1.18) to have the local truncation error of O(h³), one should take A = f(xi, Yi).


Thus, the Modified Euler method can be programmed into a computer code as follows:

Y0 = y0,
Ȳi+1 = Yi + h f(xi, Yi),
Yi+1 = Yi + (h/2) ( f(xi, Yi) + f(xi+1, Ȳi+1) ) .   (1.22)

Remark: An alternative way to code in the last line of the above equation is

Yi+1 = (1/2) ( Yi + Ȳi+1 + h f(xi+1, Ȳi+1) ) . (1.23)

This way is more efficient, because it requires only one evaluation of the function f , which is usually the most time-consuming operation, while the last line of (1.22) requires two function evaluations.

In a homework problem, you will show that in Eq. (1.19), B = (1/2) f(xi, Yi). Then the Midpoint method can be programmed as follows:

Y0 = y0,
Y_{i+1/2} = Yi + (h/2) f(xi, Yi),
Yi+1 = Yi + h f( xi + h/2, Y_{i+1/2} ) .   (1.24)

Both the Modified Euler and the Midpoint methods have the local truncation error of O(h³) and the discretization and global errors of O(h²). Thus, these are 2nd-order methods. The derivation of the local truncation error for the Modified Euler method is given in the Appendix to this section. This derivation will be needed for solving some of the homework problems.
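Below is an illustrative Python sketch of schemes (1.22) and (1.24), together with a check that halving h reduces the error roughly fourfold, as expected of 2nd-order methods (the test problem y′ = −2y is an arbitrary choice, not from the notes):

```python
import numpy as np

def modified_euler(f, x0, y0, h, n):
    """Modified Euler method (1.22)."""
    x = x0 + h * np.arange(n + 1)
    Y = np.empty(n + 1); Y[0] = y0
    for i in range(n):
        Ybar = Y[i] + h * f(x[i], Y[i])                       # predictor step
        Y[i + 1] = Y[i] + 0.5 * h * (f(x[i], Y[i]) + f(x[i + 1], Ybar))
    return x, Y

def midpoint(f, x0, y0, h, n):
    """Midpoint method (1.24)."""
    x = x0 + h * np.arange(n + 1)
    Y = np.empty(n + 1); Y[0] = y0
    for i in range(n):
        Yhalf = Y[i] + 0.5 * h * f(x[i], Y[i])                # half-step value
        Y[i + 1] = Y[i] + h * f(x[i] + 0.5 * h, Yhalf)
    return x, Y

# Illustrative test (not from the notes): y' = -2y, y(0) = 1 on [0, 1].
f, exact = (lambda x, y: -2.0 * y), (lambda x: np.exp(-2.0 * x))
for method in (modified_euler, midpoint):
    errs = []
    for n in (10, 20, 40):
        x, Y = method(f, 0.0, 1.0, 1.0 / n, n)
        errs.append(abs(Y[-1] - exact(x[-1])))
    # For a 2nd-order method, each halving of h cuts the error ~4 times:
    print(method.__name__, [e1 / e2 for e1, e2 in zip(errs, errs[1:])])
```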

Remark about notations: Different books use different names for the methods which we have called the Modified Euler and Midpoint methods.

1.5 An alternative way to improve the accuracy of a finite-difference method: Richardson method / Romberg extrapolation

We have shown that the global error of the simple Euler method is O(h), which means that

Y^h_i = yi + O(h) = yi + (a h + b h² + . . .) = yi + a h + O(h²), (1.25)

where a, b, etc. are some constant coefficients that depend on the function f and its derivatives (as well as on the values of x), but not on h. The superscript h in Y^h_i means that this particular numerical solution has been computed with the step size h. We can now halve the step size and re-compute Y^{h/2}_i, which will satisfy

Y^{h/2}_i = yi + ( a(h/2) + b(h/2)² + . . . ) = yi + ( a(h/2) + O(h²) ) . (1.26)

Let us clarify that Y^{h/2}_i is not the numerical solution at xi + (h/2) but rather the numerical solution computed from x0 up to xi with the step size (h/2).


Equations (1.25) and (1.26) form a system of linear equations for the unknowns a and yi. Solving this system, we find

yi = 2Y^{h/2}_i − Y^h_i + O(h²) . (1.27)

Thus, a better approximation to the exact solution than either Y^h_i or Y^{h/2}_i is Y^{improved}_i = 2Y^{h/2}_i − Y^h_i .

The above method of improving the accuracy of the computed solution is called either the Romberg extrapolation or the Richardson method. It works for any finite-difference scheme, not just for the simple Euler. However, it is not computationally efficient. For example, to compute Y^{improved} as per Eq. (1.27), one requires one function evaluation to compute Y^h_{i+1} from Y^h_i and two function evaluations to compute Y^{h/2}_{i+1} from Y^{h/2}_i (since we need to use two steps of size h/2 each). Thus, the total number of function evaluations to move from point xi to point xi+1 is three, compared with two required for either the Modified Euler or Midpoint methods.
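For illustration, here is a short Python sketch of Eq. (1.27) applied to the simple Euler method (the test problem y′ = y, y(0) = 1 is an arbitrary choice, not from the notes):

```python
import numpy as np

# A sketch of the Romberg/Richardson idea (1.27) with the simple Euler method.
def euler_final(f, x0, y0, h, x_end):
    x, Y = x0, y0
    while x < x_end - 1e-12:
        Y += h * f(x, Y)
        x += h
    return Y

f = lambda x, y: y                            # illustrative test: y' = y
h = 0.1
Yh  = euler_final(f, 0.0, 1.0, h,     1.0)    # computed with step h
Yh2 = euler_final(f, 0.0, 1.0, h / 2, 1.0)    # computed with step h/2
Y_improved = 2.0 * Yh2 - Yh                   # Eq. (1.27)
exact = np.e
print("error(Y^h):     ", abs(Yh - exact))
print("error(Y^{h/2}): ", abs(Yh2 - exact))
print("error(improved):", abs(Y_improved - exact))
```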

1.6 Appendix: Derivation of the local truncation error of the Modified Euler method

The idea of this derivation is the same as in Section 1.2, where we derived an estimate for the local truncation error of the simple Euler method. The details of the present derivation, however, are more involved. In particular, we will use the following formula, obtained similarly to (1.7):

y′′′(x) = d³y(x)/dx³ |use the ODE| = d²f(x, y)/dx² |use (1.7)|

= (fx + fy f)_x (dx/dx) + (fx + fy f)_y (dy/dx) |use the Product rule|

= fxx + fx fy + 2f fxy + f (fy)² + f² fyy . (1.28)

Let us recall that in deriving the local truncation error at the point xi+1, one always assumes that the exact solution yi and the computed solution Yi at the previous step (i.e. at the point xi) are equal: yi = Yi. Also, for brevity of notations, we will write f without arguments to mean either f(xi, yi) or f(xi, Yi):

f ≡ f(xi, yi) = f(xi, Yi).

By the definition given in Section 1.2, the local truncation error of the Modified Euler method is computed as follows:

ε^{ME}_{i+1} = yi+1 − Y^{ME}_{i+1} , (1.29)

where yi+1 and Yi+1 are the exact and computed solutions at the point xi+1, respectively (assuming that yi = Yi). We first find yi+1 using ODE (1.1):

yi+1 = y(xi + h)

= yi + h y′i + (h²/2) y′′i + (h³/6) y′′′i + O(h⁴) |use (1.7) and (1.28)|

= yi + h f + (h²/2)( fx + f fy ) + (h³/6)( fxx + fx fy + 2f fxy + f (fy)² + f² fyy ) + O(h⁴) . (1.30)


We now find Y^{ME}_{i+1} from Eq. (1.22):

Y^{ME}_{i+1} = Yi + (h/2)( f + f(xi + h, Yi + hf) ) |for the last term, use (0.6) with ∆x = h and ∆y = hf|

= Yi + (h/2) ( f + { f + [ h fx + h f fy ] + (1/2!)[ h² fxx + 2·h·hf·fxy + (hf)² fyy ] + O(h³) } )

= Yi + h f + (h²/2)( fx + f fy ) + (h³/4)( fxx + 2f fxy + f² fyy ) + O(h⁴) . (1.31)

Finally, subtracting (1.31) from (1.30), one obtains:

ε^{ME}_{i+1} = h³ [ (1/6 − 1/4)( fxx + 2f fxy + f² fyy ) + (1/6)( fx + f fy ) fy ] + O(h⁴)

= h³ [ −(1/12)( fxx + 2f fxy + f² fyy ) + (1/6)( fx + f fy ) fy ] + O(h⁴) . (1.32)

For example, let f(x, y) = ay, where a = const. Then

fx = fxx = fxy = 0, fy = a, and fyy = 0 ,

so that from (1.32) the local truncation error of the Modified Euler method, applied to the ODE y′ = ay, is found to be

ε^{ME}_{i+1} = (h³/6) a³ y + O(h⁴) .
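This prediction is easy to check numerically; the following sketch (with the arbitrary illustrative values a = 1, yi = 1) takes one Modified Euler step and compares the error with (h³/6)a³y:

```python
import math

# One-step check of the local truncation error estimate (h^3/6) a^3 y
# for y' = a y; the values a = 1, yi = 1 are arbitrary illustrative choices.
a, yi, xi = 1.0, 1.0, 0.0
f = lambda x, y: a * y
for h in [0.1, 0.05, 0.025]:
    Ybar = yi + h * f(xi, yi)                            # predictor
    Yme = yi + 0.5 * h * (f(xi, yi) + f(xi + h, Ybar))   # one ME step (1.22)
    eps = yi * math.exp(a * h) - Yme                     # exact minus computed
    print(f"h={h:6.3f}  eps={eps:.3e}  (h^3/6)a^3 y={h**3 / 6:.3e}")
```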

1.7 Questions for self-assessment

1. What does the notation O(hk) mean?

2. What are the meanings of the local truncation error, discretization error, and global error?

3. Give an example when the triangle inequality (1.5) holds with the ‘<’ sign.

4. Be able to explain all steps made in the derivations in Eqs. (1.15) and (1.16).

5. Why are the Modified Euler and Midpoint methods called 2nd-order methods?

6. Obtain (1.27) from (1.25) and (1.26).

7. Explain why the properly programmed Modified Euler method requires exactly two eval-uations of f per step.

8. Why may one prefer the Modified Euler method over the Romberg extrapolation based on the simple Euler method?


2 Runge-Kutta methods

2.1 The family of Runge-Kutta methods

In this section, we will introduce a family of increasingly accurate, and time-efficient, methods called Runge-Kutta methods after two German scientists: the mathematician and physicist Carl Runge (1856–1927) and the mathematician Martin Kutta (1867–1944).

The Modified Euler and Midpoint methods of the previous section can be written in a form common to both of these methods:

Yi+1 = Yi + (a k1 + b k2);
k1 = h f(xi, Yi),
k2 = h f(xi + αh, Yi + βk1);
a, b, α, β are some constants.   (2.1)

Specifically, for the Modified Euler,

a = b = 1/2, α = β = 1; (2.2)

and for the Midpoint method,

a = 0, b = 1, α = β = 1/2 . (2.3)

In general, if we require that method (2.1) have the global error O(h²), we can repeat the calculations we carried out in Section 1.4 for the Modified Euler method and obtain the following 3 equations for the 4 unknown coefficients a, b, α, β:

a + b = 1, αb = 1/2, βb = 1/2 . (2.4)

Observations:

• Since there are fewer equations than unknowns in (2.4), there are infinitely many finite-difference methods whose global error is O(h²).

• One can generalize form (2.1) and seek methods of higher order (i.e. with the global error of O(hᵏ) with k ≥ 3) as follows:

Yi+1 = Yi + (a k1 + b k2 + c k3 + . . . );
k1 = h f(xi, Yi),
k2 = h f(xi + α2 h, Yi + β21 k1),
k3 = h f(xi + α3 h, Yi + β31 k1 + β32 k2),
etc.   (2.5)

This family of methods is called the Runge-Kutta (RK) methods.

For example, if one looks for 4th-order methods, one obtains 11 equations for 13 coefficients. Again, this says that there are infinitely many 4th-order methods. Historically, the most popular


such method has been

Yi+1 = Yi + (1/6)( k1 + 2k2 + 2k3 + k4 );
k1 = h f(xi, Yi),
k2 = h f( xi + (1/2) h, Yi + (1/2) k1 ),
k3 = h f( xi + (1/2) h, Yi + (1/2) k2 ),
k4 = h f( xi + h, Yi + k3 ) .   (2.6)

We will refer to this as the classical Runge-Kutta (cRK) method. The table below compares the time-efficiency of the cRK and Modified Euler methods and shows that the former method is much more efficient.

Method         | Global error | # of function evaluations per step
cRK            | O(h⁴)        | 4
Modified Euler | O(h²)        | 2

One of the reasons why the cRK method is so popular is that the number of function evaluations per step in it equals the order of the method. It is known that RK methods of order n ≥ 5 require more than n function evaluations; i.e. they are less efficient than the cRK and other lower-order RK methods. For example, a 5th-order RK method would require a minimum of 6 function evaluations per step.
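For illustration, here is a Python sketch of one cRK step (2.6) and a check of the 4th-order convergence (the test problem is an arbitrary choice, not from the notes):

```python
import numpy as np

def crk_step(f, x, Y, h):
    """One step of the classical Runge-Kutta method (2.6)."""
    k1 = h * f(x, Y)
    k2 = h * f(x + 0.5 * h, Y + 0.5 * k1)
    k3 = h * f(x + 0.5 * h, Y + 0.5 * k2)
    k4 = h * f(x + h, Y + k3)
    return Y + (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0

# Illustrative convergence check: y' = -2y, y(0) = 1 on [0, 1].
f, exact = (lambda x, y: -2.0 * y), (lambda x: np.exp(-2.0 * x))
errs = []
for n in (10, 20, 40):
    h, Y = 1.0 / n, 1.0
    for i in range(n):
        Y = crk_step(f, i * h, Y, h)
    errs.append(abs(Y - exact(1.0)))
# For a 4th-order method, the error ratios should be close to 2^4 = 16:
print([e1 / e2 for e1, e2 in zip(errs, errs[1:])])
```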

2.2 Adaptive methods: Controlling step size for given accuracy

In this subsection, we discuss an important question of how the error of the numerical solution can be controlled and/or kept within a prescribed bound. A more complete and thorough discussion of this issue can be found in a paper by L.F. Shampine, “Error estimation and control for ODEs,” SIAM J. of Scientific Computing, 25, 3–16 (2005). A preprint of this paper is available on the course website.

To begin, we emphasize two important points about error control algorithms.

1. These algorithms control the local truncation error, and not the global error, of the solution. Indeed, the only way to control the global error is to run the simulations more than once. For example, one can run a simulation with the step h and then repeat it with the step h/2 to verify that the difference between the two solutions is within a prescribed accuracy. Although this can be done occasionally (for example, when confirming a key result of one's paper), it is too time-expensive to do so routinely. Therefore, the error control algorithms make sure that the local error at each step is less than a given tolerance (which is in some way related to the prescribed global accuracy), and then just let the user hope that the global accuracy is met. Fortunately, this hope comes true in most cases; but see the aforementioned paper for possible problematic cases.


2. The goal of the error control is not only to control the error but also to optimize the step size used to obtain different portions of the solution. For example, if it is found that the solution changes very smoothly on a subinterval I_smooth of the computational interval, then the step size on I_smooth can be taken sufficiently large. On the contrary, if one detects that the solution changes rapidly on another interval, I_rapid, then the step size there should be decreased.

Methods where both the solution and its error are evaluated at each step of the calculation are called adaptive methods. They are most useful in problems with abruptly (or rapidly) changing coefficients. One simple example of such a problem is the motion of a skydiver: the air resistance changes abruptly at the moment the parachute opens. This will be discussed in more detail in the homework.

To present the idea of the algorithm used by adaptive methods, assume for the moment that we know the exact solution yi. Let ε_glob be the maximum desired global error and n be the order of the method. Then the actual local truncation error must be O(hⁿ⁺¹), or chⁿ⁺¹ + O(hⁿ⁺²) with some constant c. Since the maximum allowed local truncation error, ε_loc , is not prescribed, it has to be postulated in some plausible manner. The common choice is to take ε_loc = h ε_glob.

Then, the steps of the algorithm of an adaptive method are as follows.

1. At each xi, compute the actual local truncation error εi = |yi − Yi| and compare it with ε_loc. (The practical implementation of this step is described later.)

2a. If εi < ε_loc, then accept the solution, multiply the step size by κ(ε_loc/εi)^{1/(n+1)} (where κ is some numerical coefficient less than 1), and proceed to the next step.

2b. If εi > ε_loc, then multiply the step size by κ(ε_loc/εi)^{1/(n+1)}, re-calculate the solution, and check the error. If the actual error is acceptable, proceed to the next step. If not, repeat this step again.

Note that with the above step size adjustment, the error at the next step is expected to be approximately

c ( h · κ (ε_loc/εi)^{1/(n+1)} )ⁿ⁺¹ = ε_loc κⁿ⁺¹ ( c hⁿ⁺¹/εi ) ≈ ε_loc κⁿ⁺¹ .

The coefficient κ < 1 (say, κ = 0.9) is included to avoid the situation where the computed error just slightly exceeds the allowed bound, which would be acceptable to a human, but the computer will have to recalculate the entire step, thereby wasting expensive function evaluations.
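A schematic Python sketch of this algorithm is given below. It assumes the stated choices ε_loc = h·ε_glob and κ = 0.9; the functions `step` and `estimate_error` are hypothetical placeholders for a concrete method (e.g., Fehlberg or Merson) and its error estimate:

```python
def adapt_step(h, eps_i, eps_loc, n_order, kappa=0.9):
    """Step-size update used in branches 2a and 2b of the algorithm:
    h_new = h * kappa * (eps_loc / eps_i)^(1/(n+1))."""
    eps_i = max(eps_i, 1e-16 * eps_loc)   # guard against division by zero
    return h * kappa * (eps_loc / eps_i) ** (1.0 / (n_order + 1))

def advance(step, estimate_error, x, Y, h, eps_glob, n_order):
    """Schematic driver: `step` advances the solution by one step of size h;
    `estimate_error` returns the local-error estimate for that step. Both are
    hypothetical placeholders for a concrete adaptive method."""
    while True:
        Y_new = step(x, Y, h)
        eps_i = estimate_error(x, Y, h)
        eps_loc = h * eps_glob            # the common choice eps_loc = h*eps_glob
        h_new = adapt_step(h, eps_i, eps_loc, n_order)
        if eps_i < eps_loc:               # branch 2a: accept and enlarge h
            return x + h, Y_new, h_new
        h = h_new                         # branch 2b: shrink h and retry
```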

Now, in reality, the exact solution of the ODE is not known. Then one can use the following trick. Suppose the numerical method we use is of sufficiently high order (e.g., the order 4 of the cRK method is sufficiently high for all practical purposes). Then we can compute the solution Y^h_i with the step size h and at each step compare it with the solution Y^{h/2}_i, obtained with the step size being halved. For example, the cRK method is of fourth order, and hence Y^{h/2}_i should be closer to the exact solution than Y^h_i is by about 2⁴ = 16 times. Then one can declare Y^{h/2}_i to be the exact solution, compute ε̄i = |Y^{h/2}_i − Y^h_i|, and use ε̄i in place of the εi above.

However, this way is very inefficient. For example, for the cRK method, it would require 7 additional function evaluations per step (needed to advance Y^{h/2} from xi to xi+1). Therefore, people have designed alternative approaches to control the error size. Below we briefly describe the ideas behind two such approaches.


Runge-Kutta-Fehlberg method¹

Idea: Design a 5th-order method that would share some of the function evaluations with a 4th-order method. The solution Y^[5]_i, obtained using the 5th-order method, is expected to be much more accurate than the solution Y^[4]_i, obtained using the 4th-order method. Then we declare ε̄i = |Y^[5]_i − Y^[4]_i| to be the numerical error and adjust the step size based on that error relative to the allowed tolerance.

Implementation:

Y^[4]_{i+1} = Yi + ( (25/216) k1 + (1408/2565) k3 + (2197/4104) k4 − (1/5) k5 ),
Y^[5]_{i+1} = Yi + ( (16/135) k1 + (6656/12825) k3 + (28561/56430) k4 − (9/50) k5 + (2/55) k6 );

k1 = h f(xi, Yi),
k2 = h f( xi + (1/4) h, Yi + (1/4) k1 ),
k3 = h f( xi + (3/8) h, Yi + (3/32) k1 + (9/32) k2 ),
k4 = h f( xi + (12/13) h, Yi + (1932/2197) k1 − (7200/2197) k2 + (7296/2197) k3 ),
k5 = h f( xi + h, Yi + (439/216) k1 − 8 k2 + (3680/513) k3 − (845/4104) k4 ),
k6 = h f( xi + (1/2) h, Yi − (8/27) k1 + 2 k2 − (3544/2565) k3 + (1859/4104) k4 − (11/40) k5 ),

where Yi = Y^[5]_i.   (2.7)

Altogether, there are only 6 function evaluations per step, because the 4th- and 5th-order methods share 4 function evaluations.

Runge-Kutta-Merson method

Idea: For certain choices of the auxiliary functions k1, k2, etc., the local truncation error of, say, a 4th-order RK method can be made equal to C5 h⁵ y⁽⁵⁾(xi) + O(h⁶) with some known coefficient C5. (Note that this local truncation error is proportional to the (n + 1)-st derivative of the solution, where n is the order of the method. We observed a similar situation earlier for the simple Euler method; see Eq. (1.3).) On the other hand, a certain linear combination of the k's can also be chosen to equal C5 h⁵ y⁽⁵⁾(xi) + O(h⁶) for a certain class of functions (namely, for linear functions: f(x, y) = a(x) + b · y, where b = const). Thus, we can obtain both an approximate solution and an estimate for its error. We can then use that estimate to adjust the step size so as to always make the (estimate for the) local truncation error below a prescribed maximum value.

For example, if one computes the solution Yi using the cRK method and then, in addition, evaluates

k5 = h f( xi + (3/4) h, Yi + (1/32)[ 5k1 + 7k2 + 13k3 − k4 ] ), (2.8)

¹It is interesting to note that while the cRK method was developed in the early 1900's, its extension by Fehlberg was proposed only in 1970.


then it can be shown (with a great deal of algebra) that

Local truncation error ∼ (2/3) h ( −k1 + 3k2 + 3k3 + 3k4 − 8k5 ) + O(h⁶) . (2.9)

Here the sign ‘∼’ is used instead of the ‘=’ because the equality holds only for f(x, y) = a(x) + b·y, where b = const. Thus, again, by evaluating the function f just one extra time compared to the cRK method, one obtains both the numerical solution and a crude estimate for its error. Then this error estimate can be used as the actual error εi in the algorithm of the corresponding adaptive method.

Implementation: More popular than the method described by (2.8) and (2.9), however, is another method based on the same idea and called the Runge-Kutta-Merson method:

Yi+1 = Yi + (1/6)( k1 + 4k4 + k5 );
k1 = h f(xi, Yi),
k2 = h f( xi + (1/3) h, Yi + (1/3) k1 ),
k3 = h f( xi + (1/3) h, Yi + (1/6)(k1 + k2) ),
k4 = h f( xi + (1/2) h, Yi + (1/8)(k1 + 3k3) ),
k5 = h f( xi + h, Yi + (1/2)(k1 − 3k3 + 4k4) ),

Local truncation error ∼ (1/30)( 2k1 − 9k3 + 8k4 − k5 ) .   (2.10)

Once again, one should note that the last line above is only a crude estimate for the truncation error (valid only when f(x, y) is a linear function of y). Indeed, if it had been valid for any f(x, y), then we would have a contradiction with a statement found at the end of Sec. 2.1. (Which statement is that?)

To conclude this presentation of the adaptive RK methods, we must specify what solution is taken at xi+1. For example, for the RK-Fehlberg method, we have the choice between setting Yi+1 to either Y^[4]_{i+1} or Y^[5]_{i+1}. Common sense suggests setting Yi+1 = Y^[5]_{i+1}, because, after all, it is Y^[5]_{i+1} that we have declared to be our “etalon” solution. This choice does work in most circumstances, although there are important exceptions (see the paper by L. Shampine). Thus, what the RK-Fehlberg method does is compute a 5th-order-accurate solution while controlling the error of a less accurate 4th-order solution related to it.

2.3 Questions for self-assessment

1. List the 13 coefficients mentioned in the paragraph after Eq. (2.5). Do not write the 11 equations.

2. If the step size is reduced by a factor of 2, how much will the error of the cRK and the Modified Euler methods be reduced? Which of these methods is more accurate?


3. Suppose f = f(x) (on the r.h.s. of the ODE); that is, f does not depend on y but only on x. What numerical integration method (studied in Calculus 2) does the cRK method reduce to? [Hint: Rewrite Eq. (2.6) for f = f(x).]

4. List the 7 function evaluations mentioned in the paragraph before the title ‘Runge-Kutta-Fehlberg method’.

5. Describe the idea behind the Runge-Kutta-Fehlberg method.

6. Describe the idea behind the Runge-Kutta-Merson method.

7. Which statement is meant in the paragraph following Eq. (2.10)?

8. One of the built-in ODE solvers in MATLAB is called ode45. What do you think the origin of this name is? Without reading the description of this solver under MATLAB's help browser, can you guess what order this method is?


3 Multistep, Predictor-Corrector, and Implicit methods

In this section, we will introduce methods that may be as accurate as high-order Runge-Kutta methods but will require fewer function evaluations.

We will also introduce implicit methods, whose significance will become clearer in a later section.

3.1 Idea behind multistep methods

The figure below illustrates the (familiar) fact that if you know y′(xi), i.e. the slope of y(x), then you can compute a first-order accurate approximation Y^{1st order}_{i+1} to the solution yi+1. Likewise, if you know the slope and the curvature of your solution at a given point, you can compute a second-order accurate approximation, Y^{2nd order}_{i+1}, to the solution at the next step.

[Figure: the exact solution yi+1 compared with Y^{1st order}_{i+1}, which matches the slope at Yi, and Y^{2nd order}_{i+1}, which matches the slope and the curvature.]

Now, recall that curvature is proportional to y′′. This motivates the following.

Question: How can we find an approximation to y′′i using the already computed values Yi−k, k = 0, 1, 2, . . . ?

Answer: Note that

y′′i ≈ (y′i − y′i−1)/h = (fi − fi−1)/h . (3.1)

Here and below we will use the notation fi in two slightly different ways:

fi ≡ f(xi, yi) or fi ≡ f(xi, Yi) (3.2)

whenever this does not cause any confusion. Continuing with Eq. (3.1), we can state it more specifically by writing

y′′i = (y′i − y′i−1)/h + O(h) = (fi − fi−1)/h + O(h) , (3.3)

where we will compute the O(h) term later. For now, we use (3.3) to approximate yi+1 as follows:

yi+1 = y(xi + h) = yi + h y′i + (h²/2) y′′i + O(h³)

= yi + h fi + (h²/2) ( (fi − fi−1)/h + O(h) ) + O(h³)

= yi + h ( (3/2) fi − (1/2) fi−1 ) + O(h³) . (3.4)

Remark 1: To start the corresponding finite-difference method, i.e.

Yi+1 = Yi + h ( (3/2) fi − (1/2) fi−1 ) (3.5)


(now we use fi as f(xi, Yi)), one needs two initial points of the solution, Y0 and Y1. These can be computed, e.g., by the simple Euler method; this is discussed in more detail in Section 3.4.

Remark 2: Equation (3.4) becomes exact rather than approximate if y(x) = p2(x) ≡ ax² + bx + c is a second-degree polynomial in x. Indeed, in such a case,

y′i = 2axi + b, and y′′i = 2a = (y′i − y′i−1)/h ; (3.6)

(note the exact equality in the last formula). We will use this remark later on.

Method (3.5) is of the second order. If we want to obtain a third-order method along the same lines, we need to use the third derivative of the solution:

y′′′i = (y′i − 2y′i−1 + y′i−2)/h² + O(h) (3.7)

(you will be asked to verify this equation in one of the homework problems). Then we proceed as in Eq. (3.4), namely:

yi+1 = yi + h y′i + (h²/2) y′′i + (h³/6) y′′′i + O(h⁴) . (3.8)

If you now try to substitute the expression on the r.h.s. of (3.3) for y′′i , you notice that you actually need an expression for the O(h)-term there that would have accuracy of O(h²). Here is the corresponding calculation:

(y′i − y′i−1)/h = ( y′(xi) − y′(xi−1) )/h

= ( y′i − [ y′i − h y′′i + (h²/2) y′′′i + O(h³) ] ) / h

= y′′i − (h/2) y′′′i + O(h²), (3.9)

whence

y′′i = (y′i − y′i−1)/h + (h/2) y′′′i + O(h²) . (3.10)

To complete the derivation of the third-order finite-difference method, we substitute Eqs. (3.10), (3.7), and y′i = fi etc. into Eq. (3.8). The result is:

Yi+1 = Yi + (h/12)[ 23fi − 16fi−1 + 5fi−2 ] ; (3.11)

the local truncation error of this method is O(h⁴). Method (3.11) is called the 3rd-order Adams–Bashforth method.

Similarly, one can derive higher-order Adams–Bashforth methods. For example, the 4th-order Adams–Bashforth method is

Yi+1 = Yi + (h/24)[ 55fi − 59fi−1 + 37fi−2 − 9fi−3 ] . (3.12)

Methods like (3.5), (3.11), and (3.12) are called multistep methods. To start a multistep method, one requires more than one initial point of the solution (in the examples considered above, the number of required initial points equals the order of the method).
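As an illustration, here is a Python sketch of the 3rd-order Adams–Bashforth method (3.11), started with a 2nd-order single-step method (cf. Section 3.4 below on why a starter of order m − 1 suffices); the test problem is an arbitrary choice, not from the notes:

```python
import numpy as np

def ab3(f, x0, y0, h, n, starter_step):
    """3rd-order Adams-Bashforth method (3.11); the two extra starting
    values Y_1, Y_2 are produced by the supplied single-step starter."""
    x = x0 + h * np.arange(n + 1)
    Y = np.empty(n + 1); Y[0] = y0
    for i in range(2):                       # starting values Y_1, Y_2
        Y[i + 1] = starter_step(f, x[i], Y[i], h)
    F = [f(x[i], Y[i]) for i in range(3)]    # [f_{i-2}, f_{i-1}, f_i]
    for i in range(2, n):
        Y[i + 1] = Y[i] + h / 12.0 * (23 * F[2] - 16 * F[1] + 5 * F[0])
        F = [F[1], F[2], f(x[i + 1], Y[i + 1])]
    return x, Y

def midpoint_step(f, x, y, h):               # a 2nd-order starter, cf. (1.24)
    return y + h * f(x + 0.5 * h, y + 0.5 * h * f(x, y))

# Illustrative run (test problem not from the notes): y' = -2y, y(0) = 1.
x, Y = ab3(lambda x, y: -2.0 * y, 0.0, 1.0, h=0.05, n=20,
           starter_step=midpoint_step)
print("error at x = 1:", abs(Y[-1] - np.exp(-2.0 * x[-1])))
```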


Comparison of multistep and Runge-Kutta methods

The advantage of multistep over single-step RK methods of the same accuracy is that the multistep methods require only one function evaluation per step, while, e.g., the cRK method requires 4, and the RK-Fehlberg method, 6, function evaluations.

The disadvantage of the multistep methods is that changing the step size for them is rather complicated (it requires interpolation of the numerical solution), while for the single-step RK methods this is a straightforward procedure.

3.2 An alternative way to derive formulae for multistep methods

Recall that the 2nd-order Adams–Bashforth method (3.5) was exact on solutions y(x) that are 2nd-degree polynomials: y(x) = p2(x) (see Remark 2 after Eq. (3.4)). Similarly, one expects that the 3rd-order Adams–Bashforth method should be exact for y(x) = p3(x). We will now use this observation to derive the formula for this method, Eq. (3.11), in a different manner than in Sec. 3.1.

To begin, we take, according to the above observation, f(x, y) = y′(x) = (p3(x))′ = p2(x), i.e. a 2nd-degree polynomial in x. We now integrate the differential equation y′ = f(x, y) from xi to xi+1 and obtain:

yi+1 = yi + ∫_{xi}^{xi+1} f(x, y(x)) dx . (3.13)

Let us approximate the integral by a quadrature formula, as follows:

∫_{xi}^{xi+1} f(x, y(x)) dx ≈ h ( b0 fi + b1 fi−1 + b2 fi−2 ) (3.14)

and require that the above equation hold exactly, rather than approximately, for any f(x, y(x)) = p2(x). This is equivalent to requiring that (3.14) hold exactly for f = 1, f = x, and f = x². Without loss of generality², one can set xi = 0 and then rewrite Eq. (3.14) for the above three forms of f :

for f = 1: ∫₀ʰ 1 dx = h = h ( b0 · 1 + b1 · 1 + b2 · 1 );

for f = x: ∫₀ʰ x dx = h²/2 = h ( b0 · 0 + b1 · (−h) + b2 · (−2h) );

for f = x²: ∫₀ʰ x² dx = h³/3 = h ( b0 · 0 + b1 · (−h)² + b2 · (−2h)² ) .   (3.15)

Equations (3.15) constitute a linear system of 3 equations for the 3 unknowns b0, b1, and b2. Solving it, we obtain

b0 = 23/12, b1 = −16/12, b2 = 5/12,

which in combination with Eq. (3.14) yields the same method as (3.11). Methods of higher order can be obtained similarly.

²In a homework problem, you will be asked to show this.
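As a quick check (an added illustration, not part of the notes), the system (3.15) can be solved numerically; after dividing out the powers of h, it reads b0 + b1 + b2 = 1, −b1 − 2b2 = 1/2, b1 + 4b2 = 1/3:

```python
import numpy as np

# Solving the 3x3 system (3.15) for the quadrature weights b0, b1, b2
# (the powers of h scale out of each equation, since the earlier nodes
# are at x = -h and x = -2h with x_i = 0).
A = np.array([[1.0,  1.0,  1.0],    # f = 1
              [0.0, -1.0, -2.0],    # f = x
              [0.0,  1.0,  4.0]])   # f = x^2
rhs = np.array([1.0, 0.5, 1.0 / 3.0])
b = np.linalg.solve(A, rhs)
print(b * 12)   # expect [23, -16, 5], i.e. b = (23/12, -16/12, 5/12)
```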


3.3 A more general form of multistep methods, with examples

The Adams–Bashforth methods above have the following common form:

Yi+1 − Yi = h Σ_{k=0}^N bk fi−k . (3.16)

As has been shown in Sec. 3.2, the sum on the r.h.s. approximates ∫_{xi}^{xi+1} f(x, y(x)) dx .

Let us now consider multistep methods of a more general form:

Yi+1 − Σ_{k=0}^M ak Yi−k = h Σ_{k=0}^N bk fi−k . (3.17)

Note that the sum on the r.h.s. of (3.17), unlike that in (3.16), does not have a straightforward interpretation. In the next Lecture, we will discover that many methods of the form (3.17) have a serious flaw in them, but for now let us consider two particular examples, focusing only on the accuracy of the following methods.

Simple center-difference (Leap-frog) method

Recall that

(yi − yi−1)/h = y′i + O(h) . (3.18)

However³,

(yi+1 − yi−1)/(2h) = y′i + O(h²) . (3.19)

Thus, the l.h.s. of (3.19) provides a more accurate approximation to y′i than does the l.h.s. of (3.18). So we use Eq. (3.19) to produce a 2nd-order method:

Yi+1 = Yi−1 + 2h fi, (3.20)

which is of the form (3.17). We need both Y0 and Y1 to start this method.

A divergent third-order method

(The term “divergent” will be explained in the next Lecture.)

Let us try to increase the order of method (3.20) from 2nd to 3rd by including extra terms into the scheme:

Yi+1 − ( a0 Yi + a1 Yi−1 + a2 Yi−2 ) = b0 h fi , (3.21)

where we now require that the local truncation error of (3.21) be O(h⁴). We can follow the derivation found either in Sec. 3.1 (Taylor-series expansion) or Sec. 3.2 (requiring that (3.21) hold true for y = p3(x)) to obtain the values of the coefficients a0 through a2, and b0. The result is:

Yi+1 + (3/2) Yi − 3 Yi−1 + (1/2) Yi−2 = 3h fi . (3.22)

Supposedly, method (3.22) is more accurate than the Leap-frog method (3.20). However, we will show in the next Lecture that method (3.22) is completely useless for numerical computations.

³Again, you will be asked to verify this.


3.4 Starting a multistep method

To start any of the single-step methods considered in Lectures 1 and 2, one only needs to know the initial condition, Y0 = y0, at x = x0. To start any multistep method, one needs to know the numerical solution at several points. For example, to start an Adams–Bashforth method of order m, one would need the values Y0, . . . , Ym−1 (see Eqs. (3.5), (3.11), and (3.12)). That is, to start an mth-order method, one needs to know the solution at the first m points. We will now address the following question:

Suppose that we want to start a multistep method of order m using the values Y1, . . . , Ym−1 that have been computed by a starting (single-step) method of order n. What should the order n of the starting method be so as not to compromise the order m of the multistep method?

First, it is clear that if n ≥ m, then the local error made in the computation of Y1, . . . , Ym−1 and of the terms on the r.h.s. of (3.16) and (3.17) will be at least as small (in the order-of-magnitude sense) as the local error of the multistep method. So, using a starting method whose order is no less than the order of the multistep method will not degrade the accuracy of the latter method. But is it possible to use a starting method with n < m with the same end result?

We will now show that for methods of the form (3.16) it is possible to take n = m − 1 (i.e., the starting method's order may be one less than the multistep method's order).⁴ With a little more work, one can show that the same answer holds also for the particular case of method (3.17) given by Eq. (3.24) below. (E.g., the Leap-frog method is a particular representative of the latter case.)

The local truncation errors of Y1 through Ym−1 are O(hⁿ⁺¹). Then the error contributed to Ym from the second term (i.e., from Yi with i = m − 1) on the l.h.s. of (3.16) is O(hⁿ⁺¹):

error of l.h.s. of (3.16) = O(hⁿ⁺¹).

Next, if fi through fi−N on the r.h.s. were calculated using the exact solution y(x), then the error of the r.h.s. would have been O(hᵐ⁺¹). Indeed, this error is just the local truncation error of method (3.16) that arises due to the approximation of ∫_{xi}^{xi+1} f(x, y(x)) dx by h Σ_{k=0}^N bk fi−k.

However, the fi−k's are calculated using the values Y1 through Ym−1, which themselves have been obtained with the error O(hⁿ⁺¹) of the starting method. Then the error of each fi−k is also O(hⁿ⁺¹).⁵ Therefore,

error of r.h.s. of (3.16) = O(hᵐ⁺¹) + h · O(hⁿ⁺¹) = max{ O(hⁿ⁺²), O(hᵐ⁺¹) } .

Thus, the total local truncation error in Ym that comes from the l.h.s. and r.h.s. of (3.16) is O(hⁿ⁺¹) (recall that we are only interested in the situation where n < m). In order not to decrease the accuracy of the multistep method, this error must satisfy two criteria:

(i) It must have the same order of magnitude as the global error at the end of the computation, i.e., O(hᵐ); and in addition,

(ii) It may propagate to the next computed solution, i.e., to Yi+2, but it must not accumulate at each step with other errors of the same magnitude.

⁴Unfortunately, I was unable to find any detailed published proof of this result, and so the derivation found below is my own. As such, it is subject to mistakes ☺. However, a set of Matlab codes accompanying this Lecture, where the 3rd-order Adams–Bashforth method (3.11) can be started using the Modified Euler, Midpoint, or simple Euler method, shows that if not this derivation itself, then at least its result is probably correct.

⁵If the last statement is not clear, do not worry and just read on. More details about it are presented in the derivation of Eq. (3.29) in the next Section, and also in the Appendix.


One can easily see that criterion (i) is indeed satisfied for n + 1 = m, i.e., when n = m − 1. As for criterion (ii), it is also satisfied. To see that this is the case, it suffices to repeat the above derivation for the error at the next step, i.e. at Yi+2 ≡ Ym+1. Then one can see that the only contribution of order O(hⁿ⁺¹) to the error in Ym+1 will come from Ym, and it will not combine with any other error of the same order. An analogous statement will hold for the errors in Ym+2, Ym+3, etc. In this fashion, the original error made in the computation of Ym−1 will simply propagate to the end of the computational interval. Thus, the global error will be:

(propagated error from Ym−1) + (accumulated local truncation error) = O(hⁿ⁺¹) + O(1/h) · O(hᵐ⁺¹) |for n = m−1| = O(hᵐ) .

Thus, the order n of the starting method should be no lower than (m − 1), where m is the order of the multistep method (3.16).

For methods of the more general form (3.17) that do not reduce to (3.24), an answer to the question stated at the beginning of this Section cannot be obtained as simply. Moreover, I have seen indirect indications in published literature that the above derivation may not be valid for those more general multistep methods. Therefore, it is a good idea to use a method of order m when starting a multistep method (3.17) of order m.

3.5 Predictor–corrector methods: General form

Let us recall the Modified Euler method introduced in Lecture 1 and write it here using slightly different notations:

Y^p_{i+1} = Yi + h fi,
Y^c_{i+1} = Yi + (1/2) h ( fi + f(xi+1, Y^p_{i+1}) ),
Yi+1 = Y^c_{i+1} .   (3.23)

We can interpret the above as follows: We first predict the new value of the solution Yi+1 by the first equation, and then correct it by the second equation. Methods of this kind are called predictor–corrector (P–C) methods.

Question: What is the optimal relation between the orders of the predictor and corrector equations?

Answer: The example of the Modified Euler method suggests that the order of the corrector should be one higher than that of the predictor. More precisely, the following theorem holds:

Theorem If the order of the corrector equation is n, then the order of the corresponding P–C method is also n, provided that the order of the predictor equation is no less than n − 1.

Proof We will assume that the global error of the corrector equation by itself is O(hⁿ) and the global error of the predictor equation by itself is O(hⁿ⁻¹). Then we will prove that the global error of the combined P–C method is O(hⁿ).

The general forms of the predictor and corrector equations are, respectively:

Predictor: Y^p_{i+1} = Yi−Q + h Σ_{k=0}^N pk fi−k, (3.24)

Corrector: Y^c_{i+1} = Yi−D + h Σ_{k=0}^M ck fi−k + h c₋₁ f(xi+1, Y^p_{i+1}) . (3.25)


In the above two equations, Q, D, N, M are some nonnegative integers. (One of the questions at the end of this Lecture asks you to represent Eq. (3.23) in the form (3.24), (3.25), i.e. to give values for Q, D, N, M and the coefficients pk's and ck's.)

As we have done in previous derivations, let us assume that all computed values Yi−k, k = 0, 1, 2, . . . coincide with the exact solution at the corresponding points: Yi−k = yi−k. Then we can use the identity

yi+1 = yi−Q + (yi+1 − yi−Q) |see (3.13)| = yi−Q + ∫_{xi−Q}^{xi+1} y′(x) dx = Yi−Q + ∫_{xi−Q}^{xi+1} f(x, y(x)) dx

and rewrite Eq. (3.24) as:

Y^p_{i+1} = Yi−Q + ∫_{xi−Q}^{xi+1} f(x, y(x)) dx + ( h Σ_{k=0}^N pk fi−k − ∫_{xi−Q}^{xi+1} f(x, y(x)) dx ) ⇒ Y^p_{i+1} = yi+1 + EP . (3.26)

Here EP is the error made by replacing the exact integral ∫_{xi−Q}^{xi+1} f(x, y(x)) dx by the linear combination of fi−k's found on the r.h.s. of (3.24). Since, by the condition of the Theorem, the global error of the predictor equation is O(hⁿ⁻¹), the local truncation error EP has the order of O(h⁽ⁿ⁻¹⁾⁺¹) = O(hⁿ).

Similarly, Eq. (3.25) is rewritten as

Y^c_{i+1} = yi+1 + EC + h c₋₁ ( f(xi+1, Y^p_{i+1}) − f(xi+1, yi+1) ) . (3.27)

Here EC is the error obtained by replacing the exact integral ∫_{xi−D}^{xi+1} f(x, y(x)) dx by the quadrature formula h Σ_{k=−1}^M ck fi−k (note that the lower limit of the summation is different from that in (3.25)!). The last term on the r.h.s. of (3.27) occurs because, unlike all previously computed Yi−k's, Y^p_{i+1} ≠ yi+1.

To complete the proof,⁶ we need to show that Y^c_{i+1} − yi+1 = O(hⁿ⁺¹) in (3.27). By the condition of the Theorem, the corrector equation has order n, and hence the local truncation error EC = O(hⁿ⁺¹). Then all that remains to be estimated is the last term on the r.h.s. of (3.27). To that end, we recall that f satisfies the Lipschitz condition with respect to y, whence

|f(xi+1, Y^p_{i+1}) − f(xi+1, yi+1)| ≤ L|Y^p_{i+1} − yi+1| = L |EP|, (3.28)

where L is the Lipschitz constant. Combining Eqs. (3.27) and (3.28) and using the triangle inequality (1.5), we finally obtain

|Y^c_{i+1} − yi+1| ≤ |EC| + hL|EP| = O(hⁿ⁺¹) + h · O(hⁿ) = O(hⁿ⁺¹) , (3.29)

which proves that the P–C method has the local truncation error of order n + 1, and hence is the nth-order method. q.e.d.

⁶At this point, you have probably forgotten what we are proving. Pause, re-read the Theorem's statement, and then come back to finish the reading.

⁷In the next Section we will explain why this is so.

We now present two P–C pairs that in applications are sometimes preferred⁷ over the Modified Euler method. The first pair is:

Predictor: Y^p_{i+1} = Yi + (1/2) h ( 3fi − fi−1 ),
Corrector: Y^c_{i+1} = Yi + (1/2) h ( fi + f^p_{i+1} ),   (3.30)

where f^p_{i+1} = f(xi+1, Y^p_{i+1}). The order of the P–C method (3.30) is two.

The other pair is:

Predictor (4th-order Adams–Bashforth): Y^p_{i+1} = Yi + (1/24) h ( 55fi − 59fi−1 + 37fi−2 − 9fi−3 ) .
Corrector (4th-order Adams–Moulton): Y^c_{i+1} = Yi + (1/24) h ( 9f^p_{i+1} + 19fi − 5fi−1 + fi−2 ) .   (3.31)

This P–C method as a whole has the same name as its corrector equation: the 4th-order Adams–Moulton.

3.6 Predictor–corrector methods: Error control

An observation one can make from Eqs. (3.30) is that both the predictor and corrector equations have order two (i.e. local truncation errors of O(h³)). In view of the Theorem of the previous subsection, this may seem to be unnecessary. Indeed, the contribution of the predictor's local truncation error is h · O(h³) = O(h⁴) (see Eq. (3.29)), while the local truncation error of the corrector equation (which determines that of the entire P–C method) is only O(h³). There is, however, an important consideration because of which method (3.30) may be preferred over the Modified Euler. Namely, one can monitor the error size in (3.30), whereas the Modified Euler does not give its user such a capability. Below we explain this statement in detail. A similar treatment can be applied to the Adams–Moulton method (3.31).

The key fact is that the local truncation errors of the predictor and corrector equations (3.30) are proportional to each other in the leading order:

y_{i+1} - Y^p_{i+1} =  (5/12) h^3 y'''_i + O(h^4) ,      (3.32)

y_{i+1} - Y^c_{i+1} = -(1/12) h^3 y'''_i + O(h^4) .      (3.33)

For the reader's information, the analogues of the above estimates for the Adams–Moulton method (3.31) are:

4th-order Adams–Moulton method:

y_{i+1} - Y^p_{i+1} =  (251/720) h^5 y^{(5)}_i + O(h^6) ,

y_{i+1} - Y^c_{i+1} = -(19/720) h^5 y^{(5)}_i + O(h^6) .

We derive (3.33) in the Appendix to this Lecture, while the derivation of (3.32) is left as an exercise. Here we only note that the derivation of (3.33) hinges upon the fact that y_i - Y^p_i = O(h^3), which is guaranteed by (3.32). Otherwise, i.e. if y_i - Y^p_i = O(h^2), as in the predictor for the Modified Euler method, the term on the r.h.s. of (3.33) would not have had such a simple form.


We will now explain how (3.32) and (3.33) can be used together to control the error of the P–C method (3.30). From (3.33) we obtain the error of the corrector equation:

|ε^c_{i+1}| ≈ (1/12) h^3 |y'''_i| .      (3.34)

On the other hand, from Eqs. (3.32) and (3.33) together, we have

|Y^p_{i+1} - Y^c_{i+1}| ≈ (5/12 + 1/12) h^3 |y'''_i| = (1/2) h^3 |y'''_i| .      (3.35)

Thus, from (3.34) and (3.35) one can estimate the error via the difference of the predicted and corrected values of the solution:

|ε^c_{i+1}| ≈ (1/6) |Y^p_{i+1} - Y^c_{i+1}| .      (3.36)

Moreover, Eqs. (3.32) and (3.33) can also be used to obtain a higher-order method than (3.30), because they imply that

y_{i+1} = (1/6) ( Y^p_{i+1} + 5Y^c_{i+1} ) + O(h^4)

(indeed, multiplying (3.33) by 5 and adding the result to (3.32) cancels the h^3-terms). Hence

Y_{i+1} = (1/6) ( Y^p_{i+1} + 5Y^c_{i+1} )      (3.37)

produces a more accurate approximation to the solution than either Y^p_{i+1} or Y^c_{i+1} alone. (Note a similarity with the Romberg extrapolation described in Lecture 1.)

Thus, Eqs. (3.30), (3.36), and (3.37) can be used to program a P–C method with an adaptive step size. Namely, suppose that we have a goal that the local truncation error of our numerical solution not exceed a specified number ε_loc. Then:

1. Compute Y^p_{i+1} and Y^c_{i+1} from (3.30).

2. Compare the error calculated from (3.36) with ε_loc and then adjust the step size as explained in Lecture 2. (As we pointed out at the end of Sec. 3.1, the adjustment of the step size in multistep methods is awkward; so it may be more practical to just keep a record of the error magnitude without changing the step size.)

3. Upon accepting the calculations for a particular step, calculate the solution at this step from (3.37).

The above procedure produces a 3rd-order-accurate solution (3.37) while controlling the error size of the associated 2nd-order method (3.30). This is exactly the same idea as was used by the adaptive RK methods described in Lecture 2.
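To make the procedure concrete, here is a minimal Python sketch of one monitored step. All names (f, x, y, fi, fim1, h) are placeholders introduced here, the solution is assumed scalar, and, as suggested above, the error estimate (3.36) is merely recorded rather than used to change h.

def pc_step(f, x, y, fi, fim1, h):
    """One step of the P-C pair (3.30), with the error estimate (3.36)
    and the extrapolated update (3.37), for a scalar ODE y' = f(x, y).
    fi = f(x_i, Y_i); fim1 = f(x_{i-1}, Y_{i-1})."""
    yp = y + 0.5 * h * (3.0 * fi - fim1)      # predictor of (3.30)
    yc = y + 0.5 * h * (fi + f(x + h, yp))    # corrector of (3.30), applied once
    err = abs(yp - yc) / 6.0                  # error estimate (3.36)
    y_new = (yp + 5.0 * yc) / 6.0             # extrapolated value (3.37)
    return y_new, f(x + h, y_new), err

# Usage sketch for y' = -y: generate the second starting value by the
# 2nd-order Modified Euler method (consistent with Remark 2 below), then march:
f = lambda x, y: -y
h, x0, y0 = 0.1, 0.0, 1.0
y1 = y0 + 0.5 * h * (f(x0, y0) + f(x0 + h, y0 + h * f(x0, y0)))
x, y, fim1, fi = x0 + h, y1, f(x0, y0), f(x0 + h, y1)
for _ in range(10):
    y_new, f_new, err = pc_step(f, x, y, fi, fim1, h)
    x, y, fim1, fi = x + h, y_new, fi, f_new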

Remark 1  As we have just discussed, using predictor and corrector equations of the same order has the advantage of allowing one to monitor or control the error. However, it may have a disadvantage of making such schemes less stable compared to schemes where the predictor's order is one less than that of the corrector. (We will study the concept of stability in the next Lecture.) Thus, choosing a particular P–C pair may depend on the application, and a significant body of research has been devoted to this issue.

Remark 2  Suppose we plan to use a P–C method with the predictor and corrector equations having the same order, say m, so as to monitor the error, as described above. If the predictor equation comes from a multistep method of the form (3.16) (as, e.g., in (3.30) or (3.31)), then we need to re-examine the question addressed in Section 3.4. Namely, what order starting method should we use for the predictor equation to be able to monitor the error? In Section 3.4 we showed that a starting method of order (m-1) would suffice to make the predictor method have order m. More precisely, the global error of the predictor method is O(h^m) in this case. However, the local truncation error of such a predictor method is also O(h^m) and not O(h^{m+1})! This is because the error made by the starting method would propagate (but not accumulate) to the last computed value of the numerical solution, as we showed in Section 3.4. For example, if we start the 2nd-order predictor method in (3.30) by the simple Euler method, then the local truncation error of O(h^2) that the simple Euler made in computing Y^p_1 will propagate up to Y_i for any i. Such an error would invalidate the derivation of the local truncation error of the corrector equation, found in the Appendix (Section 3.8). Thus, we conclude: If you want to be able to monitor the error in a P–C method where both the predictor and corrector equations are multistep methods of order m, you need to start the predictor equation by an mth-order starting method.

To conclude our consideration of the P–C methods, let us address the following important issue. Note that we can apply the corrector formula more than once. For example, for the method (3.30), we will then have:

Y^p_{i+1}     = Y_i + (h/2)(3f_i - f_{i-1}) ,
Y^{c,1}_{i+1} = Y_i + (h/2)( f_i + f(x_{i+1}, Y^p_{i+1}) ) ,
Y^{c,2}_{i+1} = Y_i + (h/2)( f_i + f(x_{i+1}, Y^{c,1}_{i+1}) ) ,
   etc.
Y_{i+1} = Y^{c,k}_{i+1} .      (3.38)

Question: How many times should we apply the corrector equation?

We need to strike a compromise here. If we apply the corrector too many times, then we will waste computer time if each iteration of the corrector changes the solution by less than the truncation error of the method. On the other hand, we may have to apply it more than once in order to make the difference |Y^{c,k}_{i+1} - Y^{c,k-1}_{i+1}| between the last two iterations much smaller than the truncation error of the corrector equation (since the latter error is basically the error of the method; see Eq. (3.33)).

Ideally, one would like to know the conditions under which it is sufficient to apply the corrector equation only once, so that no benefits would be gained by its successive applications. Below we derive such a sufficient condition for the method (3.30)/(3.38). For another P–C method, e.g., Adams–Moulton, an analogous condition can be derived along the same lines.

Suppose the maximum allowed global error of the solution is ε_glob. The allowed local truncation error is then about ε_loc = h ε_glob (see Sec. 2.2). We impose two requirements:

(i) The local truncation error of our solution should not exceed ε_loc;

(ii) The difference |Y^{c,2}_{i+1} - Y^{c,1}_{i+1}| should be much smaller than ε_loc.

Requirement (i) is necessary to satisfy in order to obtain the required accuracy of the numerical solution. Requirement (ii) is necessary to satisfy in order to use the corrector equation only once.

Requirement (i) along with Eq. (3.36) yields

(1/6) |Y^p_{i+1} - Y^{c,1}_{i+1}| < ε_loc   ⇒   |Y^p_{i+1} - Y^{c,1}_{i+1}| < 6 ε_loc .      (3.39)

If at some x_i condition (3.39) does not hold, the step size needs to be reduced in accordance with the error's order, |Y^p_{i+1} - Y^{c,1}_{i+1}| = O(h^3).


Requirement (ii) implies:

|Y^{c,2}_{i+1} - Y^{c,1}_{i+1}|
   = | [ Y_i + (h/2)( f_i + f(x_{i+1}, Y^{c,1}_{i+1}) ) ] - [ Y_i + (h/2)( f_i + f(x_{i+1}, Y^p_{i+1}) ) ] |
   = (h/2) | f(x_{i+1}, Y^{c,1}_{i+1}) - f(x_{i+1}, Y^p_{i+1}) |
   ≤ (h/2) L | Y^{c,1}_{i+1} - Y^p_{i+1} |        (by the Lipschitz condition)
   ≤ (h/2) L · 6 ε_loc                            (by (3.39)) .      (3.40)

Thus, a sufficient condition for |Y^{c,2}_{i+1} - Y^{c,1}_{i+1}| ≪ ε_loc to hold is

(1/2) hL · 6 ε_loc ≪ ε_loc ,   or   hL ≪ 1/3 .      (3.41)

If condition (3.41) is satisfied, then a single application of the corrector equation is adequate. If, however, the step size is not small enough, we may require two iterations of the corrector. Then a second application of (3.40) would produce the condition:

|Y^{c,3}_{i+1} - Y^{c,2}_{i+1}| ≪ ε_loc   ⇒   ((1/2) hL)^2 · 6 ε_loc ≪ ε_loc ,   or   hL ≪ √(2/3) ≈ 0.82 ,      (3.42)

which is less restrictive than (3.41).
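In code, the iterated corrector (3.38) with this kind of stopping test can be organized as a simple fixed-point loop. The sketch below is illustrative only: all names are placeholders, the solution is scalar, the test "much smaller than ε_loc" is implemented with an arbitrary factor of 0.01, and the very first comparison is made against the predicted value rather than against a previous corrector iterate.

def iterate_corrector(f, x_new, y, fi, yp, h, eps_loc, kmax=4):
    """Apply the corrector of (3.30)/(3.38) repeatedly, starting from the
    predicted value yp, until successive iterates differ by much less
    than the allowed local error eps_loc (or kmax is reached)."""
    yc_old = yp
    for _ in range(kmax):
        yc = y + 0.5 * h * (fi + f(x_new, yc_old))
        if abs(yc - yc_old) < 0.01 * eps_loc:
            break
        yc_old = yc
    return yc

If hL ≪ 1/3, condition (3.41) suggests that one or two passes of this loop should suffice.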

To summarize on the P–C methods:

1. The P–C methods may provide both high accuracy and the capability of error control, all at a potentially lower computational cost than the RK–Fehlberg or RK–Merson methods. For example, the Adams–Moulton method (3.31) has an error of the same (fourth) order as the aforementioned RK methods, while requiring k+1 function evaluations, where k is the number of times one has to iterate the corrector equation. If k < 4, then Adams–Moulton requires fewer function evaluations than either RK–Merson or RK–Fehlberg.

2. The adjustment of the step size in P–C methods is awkward (as it is in all multistep methods); it requires interpolation of the solution between the nodes of the computational grid.

3. One may ask: why not just halve the step size of the Adams–Bashforth method (which would reduce the global error by a factor of 2^4 = 16, i.e. a lot) and then use it alone, without the Adams–Moulton corrector formula? The answer is this. First, one will then lose control over the error. Second, the Adams–Bashforth method may sometimes produce a numerical solution which has nothing to do with the exact solution, while the P–C Adams–Moulton's solution will stay close to the exact one. This issue will be discussed in detail in the next Lecture.


3.7 Implicit methods

We noted in Lecture 1 that the simple Euler method is analogous to the left Riemann sums when integrating the differential equation y′ = f(x). The method analogous to the right Riemann sums is:

Y_{i+1} = Y_i + h f(x_{i+1}, Y_{i+1}) .      (3.43)

It is called the implicit Euler, or backward Euler, method. This is a first-order method: its global error is O(h) and its local truncation error is O(h^2).

[Figure: right Riemann sums on the grid x_0, x_1, x_2, x_3.]

We note that if f(x, y) = a(x)y + b(x), then the implicit equation (3.43) can be easily solved:

Y_{i+1} = (Y_i + h b_{i+1}) / (1 - h a_{i+1}) .      (3.44)

However, for a general nonlinear f(x, y), equation (3.43) cannot be solved exactly, and its solution then has to be found numerically, say, by the Newton–Raphson method.
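For instance, one can solve g(Y) ≡ Y - Y_i - h f(x_{i+1}, Y) = 0 at each step by Newton–Raphson iterations. Below is a minimal Python sketch under stated assumptions: the solution is scalar, fy is a user-supplied partial derivative ∂f/∂y, and all names are placeholders.

def implicit_euler_step(f, fy, x, y, h, tol=1e-12, kmax=20):
    """One step of the implicit (backward) Euler method (3.43):
    solve g(Y) = Y - y - h*f(x+h, Y) = 0 by Newton-Raphson,
    starting from the explicit-Euler guess."""
    Y = y + h * f(x, y)
    for _ in range(kmax):
        g = Y - y - h * f(x + h, Y)
        gp = 1.0 - h * fy(x + h, Y)     # g'(Y)
        Y_next = Y - g / gp
        if abs(Y_next - Y) < tol:
            break
        Y = Y_next
    return Y_next

# Example: y' = -y^3 with y(0) = 1 and h = 0.1:
# implicit_euler_step(lambda x, y: -y**3, lambda x, y: -3.0*y*y, 0.0, 1.0, 0.1)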

Question: Why does one want to use the implicit Euler method, which is so much harder to solve than the simple Euler method?

Answer: Implicit methods have stability properties that are much better than those of explicit methods (like the simple Euler). We will discuss this in the next Lecture.

Note that the last remark about Adams–Bashforth vs. Adams–Moulton, found at the end of the previous Section, is also related to the stability issue. Indeed, the Adams–Bashforth method (the first equation in (3.31)) is explicit, and thus, according to the above, it should be not as stable as the Adams–Moulton method (the second equation in (3.31)), which is implicit if one treats Y^p_{i+1} in it as being approximately equal to Y^c_{i+1}.

Finally, we present the equation for the Modified implicit Euler method:

Y_{i+1} = Y_i + (h/2) ( f(x_i, Y_i) + f(x_{i+1}, Y_{i+1}) ) .      (3.45)

This is a second-order method.

3.8 Appendix: Derivation of (3.33)

Here we derive the local truncation error of the corrector equation in the method (3.30). Assuming, as usual, that Y_i = y_i, and using Y^p_{i+1} = y_{i+1} + O(h^3) (since the order of the predictor equation is two), one obtains from the corrector equation of (3.30):

Y^c_{i+1} = y_i + (h/2) ( y'_i + f(x_{i+1}, y_{i+1} + O(h^3)) )
          = y_i + (h/2) ( y'_i + f(x_{i+1}, y_{i+1}) + O(h^3) )
          = y_i + (h/2) ( y'_i + y'_{i+1} + O(h^3) )
          = y_i + (h/2) ( y'_i + [ y'_i + h y''_i + (1/2) h^2 y'''_i + O(h^3) ] + O(h^3) )
          = y_i + h y'_i + (1/2) h^2 y''_i + (1/4) h^3 y'''_i + O(h^4) .      (3.46)


On the other hand, for the exact solution we have the usual Taylor series expansion:

y_{i+1} = y_i + h y'_i + (1/2) h^2 y''_i + (1/6) h^3 y'''_i + O(h^4) .      (3.47)

Subtracting (3.46) from (3.47), we obtain

y_{i+1} - Y^c_{i+1} = -(1/12) h^3 y'''_i + O(h^4) ,

which is (3.33).

3.9 Questions for self-assessment

1. Make sure you can reproduce the derivation of Eq. (3.4).

2. What is the idea behind the derivation of Eq. (3.5)?

3. Derive Eqs. (3.9) and (3.10).

4. Derive Eq. (3.11) as indicated in the text.

5. Describe two alternative ways to derive formulae for multistep methods.

6. Verify Eq. (3.19).

7. For a multistep method of order m, what should the order of the starting method be?

8. Convince yourself that method (3.23) is of the form (3.24) and (3.25).

9. What is the origin of the error EP in Eq. (3.26)?

10. What is the origin of the error EC in Eq. (3.27)?

11. How should the orders of the predictor and corrector equations be related? Why?

12. Is there a reason to use a predictor as accurate as the corrector?

13. What is the significance of Requirements (i) and (ii) found before Eq. (3.39)?

14. Make sure you can explain the derivations of (3.40) and (3.41).

15. What are the advantages and disadvantages of the P–C methods compared to the RK methods?

16. What is the reason one may want to use an implicit method?


4 Stability analysis of finite-difference methods for ODEs

4.1 Consistency, stability, and convergence of a numerical method; Main Theorem

Recall that we are solving the IVP

y′ = f(x, y) ,   y(x_0) = y_0 .      (4.1)

Suppose we are using the simple Euler method:

(Y_{n+1} - Y_n)/h - f(x_n, Y_n) = 0 .      (4.2)

Note that in this and subsequent Lectures, we abandon the subscript i in favor of n, because we want to reserve i for √(-1). If we denote the l.h.s. of (4.2) as F[Y_n, h], then the above equation can be rewritten as

F[Y_n, h] = 0 .      (4.3)

In general, any finite-difference method can be written in the form (4.3). Recall that if we substitute into (4.3) the exact solution y(x_n) of the ODE, then we obtain

F[y_n, h] = τ_n .      (4.4)

In Lecture 1, we called τ_n the discretization error and hτ_n the local truncation error. Now we need a notation for the norm. For any sequence of numbers {a_n}, let

||a||_∞ = max_n |a_n| .      (4.5)

This norm is called the "L_∞-norm" of the sequence {a_n}. There are many other kinds of norms, each of which is useful in its own range of circumstances. In this course, we will only deal with the L_∞-norm, and therefore we will simply denote it by ||a||, dropping the subscript "∞". The reason we are interested in this particular norm is that at the end of the day, we want to know that the maximum error of our solution is bounded by some tolerance ε_tol:

max_n |ε_n| ≤ ε_tol ,   or   ||ε|| ≤ ε_tol .      (4.6)

We will now give a series of definitions.

Definition 1: A numerical method F[Y_n, h] = 0 is called consistent if

lim_{h→0} ||τ|| = 0 ,      (4.7)

where τ_n is defined by Eq. (4.4). Note that if the order of a method is l > 0, then the method is consistent, since τ_n = O(h^l).

However, when we are solving an ODE numerically, our main concern is not so much that the discretization error be small but that the global error be small. Hence we have another definition.

Definition 2: A numerical method is called convergent if

lim_{h→0} ||ε|| ≡ lim_{h→0} ||y - Y|| = lim_{h→0} max_n |y_n - Y_n| = 0 .      (4.8)


Question: What do we need to require of a consistent method in order for it to be convergent?

Answer: That the accumulated local truncation error not grow "out of bound". This motivates yet another definition.

Definition 3: Consider a finite-difference scheme of the form (4.3). Let an approximate solution to that scheme be u_n. (Recall that its exact solution is Y_n.) The approximation may arise, for example, due to the finite precision of the computer or due to a slight error in the initial condition. If we substitute u_n into the equation of the numerical method, it satisfies:

F[u_n, h] = ξ_n ,      (4.9)

where ξ_n is some small number. The method is called stable if

||u - Y|| ≤ C ||ξ|| ,      (4.10)

where the constant C is required to be independent of h. The latter requirement means that for a stable method, one wants to ensure that small errors (e.g., due to rounding off) made at each step do not accumulate. That is, one wants to preclude the possibility that C is proportional to the number of steps (∼ O(1/h)) or, worse yet, that C = exp[O(1/h)].

Theorem 4.1 (P. Lax): If a method is both consistent and stable, then it converges. In short:

Consistency + Stability ⇒ Convergence

Remark: Note that all three properties of the method (consistency, stability, and convergence) must be defined with respect to the same norm (in this course we are using only one kind of norm, so that is not an issue anyway).

The idea of the Proof: Consistency of the method means that the local truncation error at each step, hτ_n, is sufficiently small, so that the accumulated (i.e., global) error, which is on the order of τ_n, tends to zero as h is decreased (see (4.7)). Thus:

Consistency ⇒ ||y - Y|| is small,      (4.11)

where, as above, Y is the ideal solution of the numerical scheme (4.3) obtained in the absence of machine round-off errors and any errors in initial conditions.

Stability of the method means that if at any given step the actual solution u_n slightly deviates from the ideal solution Y_n due to the round-off errors, then these small deviations remain small and do not grow as n increases. Thus:

Stability ⇒ ||Y - u|| is small.      (4.12)

But then the above two equations together imply that the maximum difference between the actual computed solution u_n and the exact solution y_n also remains small, because there are no other sources of errors and no reasons for the errors to grow (other than merely add up). Thus:

||y - u|| = ||(y - Y) + (Y - u)|| ≤ ||y - Y|| + ||Y - u|| ,      (4.13)

which must be small because each term on the r.h.s. is small. The fact that the l.h.s. of the above equation is small means, by Definition 2, that the method is convergent. q.e.d.


4.2 Some general comments about stability of solutions of ODEs

In the remainder of this Lecture, and also in some of the following Lectures, we will study stability of a given numerical method by applying it to the model problem

y′ = λy ,   y(0) = y_0 ,      (4.14)

with λ < 0 or, more generally, with Re λ < 0 (since, as we will see later, λ may need to be allowed to be a complex number).

We will first explain why we have excluded the case λ > 0 (or, more generally, Re λ > 0). The solution of (4.14) is y = y_0 e^{λx}. Suppose we make a small error in the initial condition:

y_0 → y_0 + δ   ⇒   y → (y_0 + δ) e^{λx} ≡ y_true + δ e^{λx} .      (4.15)

Now, if λ > 0, then a small change δ in the initial condition produces, over a sufficiently large x, a large change in the solution. In other words, the problem is unstable in the absolute sense. However, if |δ| ≪ |y_0|, then the error is still small compared to the exact solution, which means that the problem is stable in the relative sense. On the other hand, if λ < 0, the error is δ e^{-|λ|x} → 0 as x increases, and therefore problem (4.14) is stable in both the absolute and relative senses.

Thus, the instability of the solution for λ > 0 is intrinsic to the ODE, i.e. it does not depend on how we solve the equation numerically. Moreover, we have seen that such an instability does not present any "danger" to the numerical solution, because the error still remains much smaller than the solution (as long as |δ| ≪ |y_0|). On the contrary, the exact solution of (4.14) with λ < 0 is intrinsically stable (and, in particular, non-growing). Therefore, we want a numerical method to also produce a stable solution. If, however, it produces a growing solution, then we immediately know that the method is unstable.

Let us now explain why model problem (4.14) is relevant for studying stability of numerical methods. A very brief answer is: because it arises as the local linear approximation (also known as a linearization) near every point of a generic trajectory y = y(x). Indeed, consider a general ODE y′ = f(x, y) along with its exact solution y(x) and numerical solution Y_n. To illustrate our point, let us suppose that we have used the simple Euler method to obtain the numerical solution. Thus, the exact solution, the numerical solution, and the error ε_n = y_n - Y_n satisfy:

y_{n+1} = y_n + h f(x_n, y_n) + hτ_n ,
Y_{n+1} = Y_n + h f(x_n, Y_n) ,
ε_{n+1} = ε_n + h ( f(x_n, y_n) - f(x_n, Y_n) ) + hτ_n ,      (4.16)

where hτ_n is the local truncation error (see (4.4)). Now, since |ε_n| is supposed to be small, we can Taylor-expand the difference term inside the parentheses in the last equation of (4.16); then that equation becomes

ε_{n+1} = ε_n + h f_y(x_n, y_n) ε_n + { hτ_n + O(ε_n^2) } .      (4.17)

If we disregard the O(ε_n^2)-term above and, in addition, replace f_y(x_n, y_n) by a constant λ = max |f_y(x, y)|, then (4.17) becomes

ε_{n+1} = ε_n + hλ ε_n + hτ_n .      (4.18a)


Equation (4.18a) is nothing but the simple-Euler approximation of the linear inhomogeneous ODE

ε′(x) = λ ε(x) + τ(x) .      (4.18b)

The solution to this ODE was obtained in Lecture 0:

ε(x) = ε_0 e^{λ(x - x_0)} + \int_{x_0}^{x} τ(s) e^{λ(x - s)} ds .

One can see that unless τ(x) itself grows with x (which we do not expect of a local truncation error, since it is supposed to be always small), the presence of the τ(x)-term in Eq. (4.18b) does not cause the error ε(x) to grow. Rather, it is the λε term that may result in an exponential growth of the initially small error! On these grounds, we neglect the τ(x)-term in (4.18b) and hence obtain the model problem (4.14). Thus, considering how the numerical solution behaves for this model problem, we should be able to predict whether the numerical errors will grow or remain bounded in the original IVP (4.1).

The above consideration has missed one subtle possibility. Namely, what if λ < 0 while τ(x) → const as x → ∞? One can show that in this case, ε(x) does not decay but asymptotically tends to a constant. On the other hand, from the model problem (4.14) with λ < 0, we would mistakenly conclude that the error between the exact and numerical solutions should decay, whereas according to the primordial, and hence more correct, equations (4.18), this error would level off at a nonzero value! This important conclusion is worth repeating: In some cases, when the numerical scheme is predicted to be stable by model equation (4.14), the error between the exact and numerical solutions may tend to a nonzero constant. This means that model equation (4.14) does not always correctly predict the behavior of the error between the exact and numerical solutions. To be more specific, (4.14) would always correctly predict only an unstable numerical solution.

A natural question then is: Is there some other kind of error whose behavior would always be correctly predicted by Eq. (4.14)? The answer is: Yes, and this error is the deviation between two numerical solutions, say Y_n and U_n, whose initial values are close to each other. Indeed, if both Y and U satisfy (4.2), then a simple calculation along the lines of (4.16) would yield for (U_n - Y_n) an equation similar to (4.17), where now there would be no term hτ_n. This would then yield (4.18b) with τ(x) ≡ 0, which is (4.14). Thus, we conclude that model equation (4.14) always correctly predicts (at least in the short term) the behavior of the deviation between two initially close solutions of the numerical scheme.

Finally, we note that if in (4.1), f(x, y) = a(x)y, then the model equation (4.14) coincides with the original ODE, i.e. the aforementioned deviation (U_n - Y_n) and either of the numerical solutions Y_n and U_n satisfy the same equation. This simple observation is also worth repeating: For linear ODEs, the numerical solution and a deviation (U_n - Y_n) between any two numerical solutions U_n and Y_n evolve in the same way.

4.3 Stability analyses of some familiar numerical methods

Below we present stability criteria for the numerical methods that we have studied in the preceding Lectures, as applied to the model problem (4.14).


We begin with the simple Euler method. As we have shown above, for it the error satisfies

ε_{n+1} = ε_n + λh ε_n   ⇒   ε_n = ε_0 (1 + λh)^n ,      (4.19)

where ε_0 is the initial error.

Let λ > 0. Then both the solution of the ODE (4.14), y = y_0 e^{λx}, and the error (4.19) increase as the calculation proceeds. As we said above, one can do nothing about that.

Now let λ < 0. Then the true solution y_0 e^{λx} decreases, but the error will decrease only if

|1 + hλ| < 1   ⇒   -1 < 1 + hλ < 1   ⇒   h < 2/|λ| .      (4.20)

E.g., to solve y′ = -30y (with any initial condition), we must use h < 2/30 in order to guarantee that the round-off and truncation errors will decay and not inundate the solution.
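The threshold in (4.20) is easy to observe numerically. The following sketch (illustrative only; all names are placeholders) integrates y′ = -30y by the simple Euler method with step sizes just below and just above 2/30 ≈ 0.0667:

def euler_final_value(lam, h, x_end, y0=1.0):
    """Integrate y' = lam*y by the simple Euler method up to x_end."""
    y = y0
    for _ in range(int(round(x_end / h))):
        y += h * lam * y
    return y

for h in (0.060, 0.070):
    print(h, euler_final_value(-30.0, h, x_end=10.0))
# h = 0.060: |1 + h*lam| = 0.8 < 1, so the numerical solution decays;
# h = 0.070: |1 + h*lam| = 1.1 > 1, so it oscillates with growing amplitude.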

Thus, for the model problem (4.14) with λ < 0, the simple Euler method is stable only when the step size satisfies Eq. (4.20). This conditional stability is referred to as partial stability; thus, the simple Euler method is partially stable. For the general case where λ is a complex number, partial stability is defined as follows.

Definition 4: A method is called partially stable if, when applied to the model problem (4.14) with Re λ < 0, the corresponding numerical solution is stable only for some values of λh. The region in the λh-plane where the method is stable is called the region of stability of the method.

Let us find the region of stability of the simple Euler method. To this end, write λ as the sum of its real and imaginary parts: λ = λ_R + iλ_I (note that here and below i = √(-1)). Then the first of the inequalities in (4.20) becomes

|1 + hλ_R + ihλ_I| < 1   ⇒   √( (1 + hλ_R)^2 + (hλ_I)^2 ) < 1 .      (4.21)

Thus, the region of stability of the simple Euler method is the inside of the circle

(1 + hλ_R)^2 + (hλ_I)^2 = 1 ,

as shown in the figure.

[Figure: stability region for the simple Euler method: the disk of unit radius centered at hλ_R = -1, hλ_I = 0.]

We now present brief details about the region of stability for the Modified Euler method. In a homework problem, you will be asked to supply the missing details.

Substituting the ODE from (4.14) into Eqs. (1.22) (see Lecture 1), we find that

Y_{n+1} = ( 1 + hλ + (1/2)(hλ)^2 ) Y_n .      (4.22)

Remark 1: The evolution of any error in this method will also satisfy the same equation (4.22).

Remark 2: Note that the factor on the r.h.s. of (4.22) is quite expected: since the Modified Euler is a 2nd-order method, its solution of the model problem (4.14) should be the 2nd-degree polynomial that approximates the exponential in the exact solution y = y_0 e^{λx}.


The boundary of the stability region is obtained by setting the modulus of the factor on the r.h.s. of (4.22) to 1:

| 1 + hλ + (1/2)(hλ)^2 | = 1 .

Indeed, if this factor is less than 1 in modulus, all errors will decay, and if it is greater than 1, they will grow, even though the exact solution may decay. The above equation can be equivalently written as

( 1 + hλ_R + (1/2)((hλ_R)^2 - (hλ_I)^2) )^2 + ( hλ_I + h^2 λ_I λ_R )^2 = 1 .      (4.23)

[Figure: stability region for the Modified Euler method in the (hλ_R, hλ_I)-plane; it crosses the real axis at hλ_R = -2.]

The corresponding region is shown in the figure.

When the cRK method is applied to the model problem (4.14), the corresponding stability criterion becomes

| \sum_{k=0}^{4} (hλ)^k / k! | ≤ 1 .      (4.24)

The expression on the l.h.s. is the fourth-degree polynomial approximating e^{λh}; this is consistent with Remark 2 made after Eq. (4.22).

For real λ, criterion (4.24) reduces to

-2.79 ≤ hλ ≤ 0 .      (4.25)

Note that the cRK method is not only more accurate than the simple and Modified Euler methods, but also has a greater stability region for negative real values of λ.

4.4 Stability analysis of multistep methods

We begin with the 2nd-order Adams–Bashforth method (3.5):

Y_{n+1} = Y_n + h ( (3/2) f_n - (1/2) f_{n-1} ) .      (3.5)

Substituting the model ODE (4.14) into that equation, one obtains

Y_{n+1} - (1 + (3/2) λh) Y_n + (1/2) λh Y_{n-1} = 0 .      (4.26)

To solve this difference equation, we use the same procedure as we would use to solve a linear ODE. Namely, for the ODE

y′′ + a_1 y′ + a_0 y = 0

with constant coefficients a_1, a_0, we need to substitute the ansatz y = e^{rx}, which yields the following polynomial equation for r:

r^2 + a_1 r + a_0 = 0 .


Similarly, for the difference equation (4.26), we substitute Y_n = r^n and, upon cancelling the common factor r^{n-1}, obtain:

r^2 - (1 + (3/2) λh) r + (1/2) λh = 0 .      (4.27)

This quadratic equation has two roots:

r_1 = (1/2) [ (1 + (3/2) λh) + √( (1 + (3/2) λh)^2 - 2λh ) ] ,

r_2 = (1/2) [ (1 + (3/2) λh) - √( (1 + (3/2) λh)^2 - 2λh ) ] .      (4.28)

In the limit h → 0 (which is the limit where the difference method (4.26) reduces to the ODE (4.14)), one can use the Taylor expansion (and, in particular, the formula √(1 + α) = 1 + (1/2)α + O(α^2)) to obtain the asymptotic forms of r_1 and r_2:

r_1 ≈ 1 + λh ,   r_2 ≈ (1/2) λh .      (4.29)

The solution Y_n that corresponds to root r_1 turns, in the limit h → 0, into the true solution of the ODE y′ = λy, because

lim_{h→0} r_1^n = lim_{h→0} (1 + λh)^n = lim_{h→0} (1 + λh)^{x/h} = e^{λx} ;      (4.30)

see Sec. 0.5. However, the solution of the difference method (4.26) corresponding to root r_2 does not correspond to any actual solution of the ODE! For that reason, root r_2 and the corresponding difference solution r_2^n are called parasitic.

A good thing about the parasitic solution for the 2nd-order Adams–Bashforth method is that it does not grow for sufficiently small λh. In fact, since for sufficiently small h, |r_2| ≈ (1/2)|λh| < 1, that parasitic solution decays to zero rather rapidly and therefore does not contaminate the numerical solution.

To require that the 2nd-order Adams–Bashforth method be stable is equivalent to requiring that both r_1 and r_2 satisfy

|r_1| ≤ 1   and   |r_2| ≤ 1 .      (4.31)

The stability region is the inside of the oval-shaped region shown in the figure (the little "horns" are a plotting artifact). The figure is produced by Mathematica; in a homework problem, you will be asked to obtain it on your own. A curious point to note is that the two requirements, |r_1| ≤ 1 and |r_2| ≤ 1, produce two non-overlapping parts of the stability region boundary (its right-hand and left-hand parts, respectively).

[Figure: stability region of the 2nd-order Adams–Bashforth method in the (hλ_r, hλ_i)-plane; an oval extending along the real axis from about hλ_r = -1 to 0.]
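The requirement (4.31) can also be checked numerically for any given hλ: form the quadratic (4.27) and test whether both of its roots lie in the closed unit disk. A possible sketch (illustrative only), using numpy.roots:

import numpy as np

def ab2_stable(hl):
    """True if both roots of (4.27), r^2 - (1 + 1.5*hl)*r + 0.5*hl = 0,
    lie in the closed unit disk; hl = h*lambda may be complex."""
    r = np.roots([1.0, -(1.0 + 1.5 * hl), 0.5 * hl])
    return bool(np.all(np.abs(r) <= 1.0 + 1e-12))

print(ab2_stable(-0.5))   # inside the oval: True
print(ab2_stable(-1.5))   # to the left of the oval on the real axis: False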


A similar analysis for the 3rd-order Adams–Bashforth method (3.11) shows that the corresponding difference equation has three roots, of which one (say, r_1) corresponds to the true solution of the ODE and the other two (r_2 and r_3) are parasitic roots. Fortunately, these roots decay to zero as O(h) for h → 0, so they do not affect the numerical solution for sufficiently small h. For finite h, the requirement

|r_1| ≤ 1 ,   |r_2| ≤ 1 ,   |r_3| ≤ 1

results in the stability region whose shape is qualitatively shown in the figure.

[Figure: stability region of the 3rd-order Adams–Bashforth method in the (hλ_r, hλ_i)-plane; it crosses the real axis at hλ_r = -6/11.]

From the above consideration of the 2nd- and 3rd-order Adams–Bashforth methods there follows an observation that is shared by some other families of methods: the more accurate method has a smaller stability region.

Let us now analyze the stability of the two methods considered in Sec. 3.3.

Leap-frog method

Substituting the model ODE into Eq. (3.20), one obtains

Y_{n+1} - Y_{n-1} = 2hλ Y_n .      (4.32)

For Y_n = r^n we find:

r^2 - 2hλ r - 1 = 0   ⇒   r_{1,2} = hλ ± √( 1 + (hλ)^2 ) .      (4.33)

Considering the limit h → 0, as before, we find:

r_1 ≈ 1 + hλ ,   r_2 ≈ -1 + hλ .      (4.34)

Again, as before, the solution of the difference equation with r_1 corresponds to the solution of the ODE: r_1^n ≈ (1 + hλ)^{x/h} ≈ e^{λx}. The solution corresponding to root r_2 is parasitic. Its behavior needs to be analyzed separately for λ > 0 and λ < 0.

λ > 0:  Then r_2 = -(1 - hλ), so that |r_2| < 1, and the parasitic solution decays, whereas the true solution, (1 + hλ)^n ≈ e^{λx}, grows. Thus, the method is stable in this case.

λ < 0:  Then r_2 = -(1 + h|λ|), so that |r_2| > 1. Thus, the parasitic solution grows exponentially and makes the method unstable. Specifically, the difference solution will be the linear combination

Y_n = c_1 r_1^n + c_2 r_2^n ≈ c_1 e^{λx} + c_2 (-1)^n e^{-λx} .      (4.35)

We see that the parasitic part (the 2nd term in (4.35)) of this solution grows for λ < 0, whereas the true solution (the first term in (4.35)) decays; thus, for a sufficiently large x, the numerical solution will bear no resemblance to the true solution.


The stability region of the Leap-frog method, shown in the figure, is disappointingly small: the method is stable only for

λ_R = 0   and   -1 ≤ hλ_I ≤ 1 .      (4.36)

However, since the numerical solution will still stay close to the true solution for |λx| ≪ 1, the Leap-frog method is called weakly unstable.

[Figure: stability region of the Leap-frog method: the segment of the imaginary axis from -i to i in the hλ-plane.]

Divergent 3rd-order method (3.22)

For the model problem (4.14), that method becomes

Y_{n+1} + (3/2) Y_n - 3 Y_{n-1} + (1/2) Y_{n-2} = 3hλ Y_n .      (4.37)

Proceeding as before, we obtain the characteristic equation for the roots:

r^3 + ( (3/2) - 3hλ ) r^2 - 3r + (1/2) = 0 .      (4.38)

To consider the limit h → 0, we can simply set h = 0 as the lowest-order approximation. The cubic equation (4.38) then reduces to

r^3 + (3/2) r^2 - 3r + (1/2) = 0 ,      (4.39)

which has the roots

r_1 = 1 ,   r_2 ≈ -2.69 ,   r_3 ≈ 0.19 .      (4.40)

Then for small h, the numerical solution is

Y_n = c_1 (1 + hλ)^n + c_2 (-2.69 + O(h))^n + c_3 (0.19 + O(h))^n ,      (4.41)

where the first term approximates the true solution, the second is a parasitic solution that explodes, and the third is a parasitic solution that decays.

The second term, corresponding to a parasitic solution, grows (in magnitude) much faster than the term approximating the true solution, and therefore the numerical solution very quickly becomes complete garbage. This happens much faster than for the Leap-frog method with λ < 0. Therefore, method (3.22) is called strongly unstable; obviously, it is useless for any computations.

The above considerations of multistep methods can be summarized as follows. Consider a multistep method of the general form (3.17). For the model problem (4.14), it becomes

Y_{n+1} - \sum_{k=0}^{M} a_k Y_{n-k} = h \sum_{k=0}^{N} b_k λ Y_{n-k} .      (4.42)

The first step of its stability analysis is to set h = 0, which results in the following characteristic polynomial:

r^{M+1} - \sum_{k=0}^{M} a_k r^{M-k} = 0 .      (4.43)


This equation must always have a root r_1 = 1, which corresponds to the true solution of the ODE. If any of the other roots, i.e. {r_2, r_3, ..., r_{M+1}}, satisfies |r_k| > 1, then the method is strongly unstable. If any root with k ≥ 2 satisfies |r_k| = 1, then the method may be weakly unstable (like the Leap-frog method). Finally, if all |r_k| < 1 for k = 2, ..., M+1, then the method is stable for h → 0. It may be either partially stable, like the single-step and Adams–Bashforth methods, or absolutely stable, like the implicit methods that we will consider next.
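This root condition at h = 0 is easy to check in code for any method of the form (3.17). A possible sketch (illustrative only; the coefficient lists are those read off from the examples above):

import numpy as np

def h0_roots(a):
    """Roots of the characteristic polynomial (4.43),
    r^(M+1) - a_0 r^M - ... - a_M = 0, for coefficients a = [a_0, ..., a_M]."""
    return np.roots([1.0] + [-ak for ak in a])

print(h0_roots([1.0, 0.0]))          # 2nd-order Adams-Bashforth: r = 1, 0
print(h0_roots([0.0, 1.0]))          # Leap-frog: r = 1, -1 (may be weakly unstable)
print(h0_roots([-1.5, 3.0, -0.5]))   # method (3.22): a root near -2.69, strongly unstable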

4.5 Stability analysis of implicit methods

Consider the implicit Euler method (3.43). For the model problem (4.14) it yields

Y_{n+1} = Y_n + hλ Y_{n+1}   ⇒   Y_n = Y_0 ( 1/(1 - hλ) )^n .      (4.44)

It can be verified (do it, following the lines of Sec. 0.5) that the r.h.s. of the last equation reduces for h → 0 to the exact solution y_0 e^{λx}, as it should. The stability condition is

| 1/(1 - hλ) | ≤ 1   ⇒   |1 - hλ| ≥ 1 .      (4.45)

The boundary of the stability region is the circle

(1 - hλ_R)^2 + (hλ_I)^2 = 1 ;      (4.46)

the stability region is the outside of that circle (see the second of inequalities (4.45)).

[Figure: stability region of the implicit Euler method: the exterior of the circle of unit radius centered at hλ_R = 1, hλ_I = 0.]

Definition 5: If a numerical method, when applied to the model problem (4.14), is stable for all λ with λ_R < 0, such a method is called absolutely stable, or A-stable for short.

Thus, we have shown that the implicit Euler method is A-stable. Similarly, one can show that the Modified implicit Euler method is also A-stable (you will be asked to do so in a homework problem).

Theorem 4.2:
1) No explicit finite-difference method is A-stable.
2) No implicit method of order higher than 2 is A-stable.

Thus, according to Theorem 4.2, implicit methods of order 3 and higher are only partially stable; however, their regions of stability are usually larger than those of explicit methods of the same order.

For more information about stability of numerical methods, one may consult the book by P. Henrici, "Discrete variable methods in ordinary differential equations" (Wiley, 1968).


4.6 Questions for self-assessment

1. Explain the meanings of the concepts of consistency, stability, and convergence of a numerical method.

2. State the Lax Theorem.

3. Give the idea behind the proof of the Lax Theorem.

4. Why is the model problem (4.14) relevant to analyze stability of numerical methods?

5. The behavior of what kind of error does the model problem (4.14) always predict correctly?

6. Verify the statement about (Un − Yn) in the next to the last paragraph of Sec. 4.2.

7. What is the general procedure of analyzing stability of a numerical method?

8. Obtain Eq. (4.27).

9. Obtain Eq. (4.29).

10. Why does the characteristic polynomial for the 3rd-order Adams–Bashforth method (3.11) have exactly 3 roots?

11. Would you use the Leap-frog method to solve the ODE y′ = √(y + x^2) ?

12. Obtain (4.34) from (4.33).

13. Why is the Leap-frog method called weakly unstable?

14. Why is the method (3.22) called strongly unstable?

15. Are Adams–Bashforth and Runge–Kutta methods always partially stable, or can they be weakly unstable? (Hint: Look at Eqs. (4.42) and (4.43) and think of the roots r_1, r_2, ... that these methods can have when hλ → 0.)

16. Verify the statement made after Eq. (4.44).

17. Obtain (4.46) from (4.45).

18. Is the 3rd-order Adams–Moulton method that you obtained in Homework 3 (problem 3) A-stable?


5 Higher-order ODEs and systems of ODEs

5.1 General-purpose discretization schemes for systems of ODEs

The strategy of generalizing a discretization scheme from one to N > 1 ODEs is, for the most part, straightforward. Therefore, below we will consider only the case of a system of N = 2 ODEs. This case will also allow us to investigate a certain issue that is specific to systems of ODEs and does not occur for a single ODE. We will denote the exact solutions of the ODE system in question as y^{(1)}(x) and y^{(2)}(x), and the corresponding numerical solutions as Y^{(1)} and Y^{(2)}; the functions appearing on the r.h.s. of the system will be denoted as f^{(1)} and f^{(2)}. Thus, the IVP for the two unknowns, y^{(1)} and y^{(2)}, is:

y^{(1)}′ = f^{(1)}(x, y^{(1)}, y^{(2)}) ,   y^{(1)}(x_0) = y^{(1)}_0 ,
y^{(2)}′ = f^{(2)}(x, y^{(1)}, y^{(2)}) ,   y^{(2)}(x_0) = y^{(2)}_0 .      (5.1)

We now consider generalizations of some of the methods introduced in Lectures 1 and 2.

Simple Euler method

Probably the most intuitive form of this method for two ODEs is

Y^{(1)}_{n+1} = Y^{(1)}_n + h f^{(1)}(x_n, {Y^{(1)}_n, Y^{(2)}_n}) ,
Y^{(2)}_{n+1} = Y^{(2)}_n + h f^{(2)}(x_n, {Y^{(1)}_n, Y^{(2)}_n}) .      (5.2)

Already for this most basic example, we can identify the issue, mentioned above, that is specific to systems of ODEs and does not occur for a single first-order ODE. Namely, notice that once we have found the new value Y^{(1)}_{n+1} for the first component of the solution, we can substitute it into the second equation instead of substituting Y^{(1)}_n, as is done in (5.2). The result is:

Y^{(1)}_{n+1} = Y^{(1)}_n + h f^{(1)}(x_n, {Y^{(1)}_n, Y^{(2)}_n}) ,
Y^{(2)}_{n+1} = Y^{(2)}_n + h f^{(2)}(x_n, {Y^{(1)}_{n+1}, Y^{(2)}_n}) .      (5.3)

Since the components Y^{(1)} and Y^{(2)} enter Eqs. (5.2) on an equal footing, we can interchange their order in (5.3) and obtain:

Y^{(2)}_{n+1} = Y^{(2)}_n + h f^{(2)}(x_n, {Y^{(1)}_n, Y^{(2)}_n}) ,
Y^{(1)}_{n+1} = Y^{(1)}_n + h f^{(1)}(x_n, {Y^{(1)}_n, Y^{(2)}_{n+1}}) .      (5.4)

It is rather straightforward to see that all three implementations of the simple Euler method are first-order methods.

An obvious question that now comes to mind is this: Is there any aspect because of which methods (5.3) and (5.4) may be preferred over method (5.2)? The short answer is: 'yes, for a certain form of f^{(1)} and f^{(2)}, there is'. We will present more detail in Sec. 5.3 below. For now, we continue with presenting the discretization scheme of the Modified Euler method for two first-order ODEs.


Modified Euler method

\bar{Y}^{(k)} = Y^{(k)}_n + h f^{(k)}(x_n, {Y^{(1)}_n, Y^{(2)}_n}) ,

Y^{(k)}_{n+1} = Y^{(k)}_n + (h/2) [ f^{(k)}(x_n, {Y^{(1)}_n, Y^{(2)}_n}) + f^{(k)}(x_{n+1}, {\bar{Y}^{(1)}, \bar{Y}^{(2)}}) ] ,   k = 1, 2 .      (5.5)
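In code, the system version (5.5) looks exactly like the scalar Modified Euler once the solution is stored as a vector. A minimal numpy sketch (all names are placeholders introduced here):

import numpy as np

def modified_euler_step(f, x, Y, h):
    """One step of (5.5) for a system Y' = f(x, Y), with Y a numpy array."""
    F = f(x, Y)                              # f evaluated at (x_n, Y_n)
    Ybar = Y + h * F                         # predictor for all components at once
    return Y + 0.5 * h * (F + f(x + h, Ybar))

# Example: the harmonic oscillator y'' = -y written as the system
# y' = v, v' = -y, with Y = [y, v]:
f = lambda x, Y: np.array([Y[1], -Y[0]])
x, h, Y = 0.0, 0.1, np.array([0.0, 1.0])
for _ in range(5):
    Y = modified_euler_step(f, x, Y, h)
    x += h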

Let us verify that (5.5) is a second-order method, as it was for a single ODE. We proceed in exactly the same steps as in Lecture 1. We will also use the shorthand notations:

\vec{Y}_n = {Y^{(1)}_n, Y^{(2)}_n} ,   \vec{y}_n = {y^{(1)}_n, y^{(2)}_n} ,   \vec{f}_n = {f^{(1)}(x_n, \vec{Y}_n), f^{(2)}(x_n, \vec{Y}_n)} ≡ {f^{(1)}_n, f^{(2)}_n} .

Since in the derivation of the local truncation error we always assume that Y^{(k)}_n = y^{(k)}_n, we then also have

{f^{(1)}(x_n, \vec{y}_n), f^{(2)}(x_n, \vec{y}_n)} = {f^{(1)}_n, f^{(2)}_n} .

Expanding the r.h.s. of the second of Eqs. (5.5) about the "point" (x_n, \vec{Y}_n) in a Taylor series, we obtain:

Y^{(k)}_{n+1} = Y^{(k)}_n + (h/2) [ f^{(k)}(x_n, \vec{Y}_n) + f^{(k)}(x_{n+1}, \vec{Y}_n + h\vec{f}_n) ]

   (Taylor expansion)
 = Y^{(k)}_n + (h/2) [ f^{(k)}_n + { f^{(k)}_n + h ∂f^{(k)}_n/∂x + h f^{(1)}_n ∂f^{(k)}_n/∂y^{(1)} + h f^{(2)}_n ∂f^{(k)}_n/∂y^{(2)} } ] + O(h^3)

 = Y^{(k)}_n + h f^{(k)}_n + (h^2/2) [ ∂f^{(k)}_n/∂x + f^{(1)}_n ∂f^{(k)}_n/∂y^{(1)} + f^{(2)}_n ∂f^{(k)}_n/∂y^{(2)} ] + O(h^3) .      (5.6)

Now expanding the exact solution y^{(k)}_{n+1} = y^{(k)}(x_{n+1}) in a Taylor series, we obtain:

y^{(k)}_{n+1} = y^{(k)}_n + h (d/dx) y^{(k)}_n + (h^2/2) (d^2/dx^2) y^{(k)}_n + O(h^3)

   (definition of dy^{(k)}/dx)
 = y^{(k)}_n + h f^{(k)}_n + (h^2/2) (d/dx) f^{(k)}_n + O(h^3)

   (Chain rule)
 = y^{(k)}_n + h f^{(k)}_n + (h^2/2) [ ∂f^{(k)}_n/∂x + f^{(1)}_n ∂f^{(k)}_n/∂y^{(1)} + f^{(2)}_n ∂f^{(k)}_n/∂y^{(2)} ] + O(h^3) .      (5.7)

Here the coefficient of the h^2-term has been computed by using the fact that f^{(k)} = f^{(k)}(x, y^{(1)}(x), y^{(2)}(x)) and then using the Chain rule. Comparing the last lines in (5.6) and (5.7), we see that y^{(k)}_{n+1} = Y^{(k)}_{n+1} + O(h^3), which confirms that the order of the local truncation error of the Modified Euler method (5.5) is 3, and hence the method is second-order accurate.

In the homework problems, you will be asked to write out the forms of the discretization schemes for the Midpoint and cRK methods for a system of two ODEs.

To conclude this subsection, we note that any higher-order IVP, say,

y′′′ + f(x, y, y′, y′′) = 0 ,   y(x_0) = y_0 ,  y′(x_0) = z_0 ,  y′′(x_0) = w_0 ,      (5.8)

can be rewritten as a system of first-order ODEs with appropriate initial conditions:

y′ = z ,
z′ = w ,
w′ = -f(x, y, z, w) ,      (5.9)

y(x_0) = y_0 ,  z(x_0) = z_0 ,  w(x_0) = w_0 .

This is a system of three first-order ODEs, for which the forms of the discretization schemes have been considered above. Obviously, any higher-order ODE that can be explicitly solved for the highest derivative can be dealt with along the same lines.

5.2 Special methods for the second-order ODE y′′ = f(y). I: Central-difference methods

A second-order ODE, along with the appropriate initial conditions:

y′′ = f(y) ,   y(x_0) = y_0 ,   y′(x_0) = y′_0 ,      (5.10)

occurs in applications quite frequently, because it describes the motion of a Newtonian particle (i.e. a particle that obeys the laws of Newtonian mechanics) in the presence of a conservative force (i.e. a force that depends only on the position of the particle but not on its speed and/or the time). In the remainder of this subsection, it will be convenient to think of y as the position of the particle, of x as the time, and of y′ as the particle's velocity.

The first special method that we introduce for Eq. (5.10) (and for systems of such equations) uses a second-order accurate approximation for y′′:

y′′_n = ( y_{n+1} - 2y_n + y_{n-1} ) / h^2 + O(h^2) ;      (5.11)

you encountered a similar formula in Lecture 3 (see Sec. 3.1). Combining Eqs. (5.10) and (5.11), we arrive at the central-difference method for Eq. (5.10):

Y_{n+1} - 2Y_n + Y_{n-1} = h^2 f_n .      (5.12)

(Method (5.12) is sometimes referred to as the simple central-difference method, because the r.h.s. of the ODE (5.10) enters it in the simplest possible way.) Since this is a two-step method, one needs two initial points to start it. The first point, Y_0, is simply the initial condition for the particle's position: Y_0 = y_0. The second point, Y_1, has to be determined from the initial position y_0 and the initial velocity y′_0. The natural question then is: To what accuracy should we determine Y_1 so as to be consistent with the accuracy of the method (5.12)?

To answer this question, we first show that the global error in the simple central-difference method is O(h^2). Indeed, the local truncation error is O(h^4), as follows from (5.10)–(5.12). For the numerical methods for a first-order ODE considered earlier, this would imply that the global error must be O(1/h) · O(h^4) = O(h^3), since the local error of O(h^4) would accumulate over O(1/h) steps. However, (5.10) is a second-order ODE, and for it, the error accumulates differently than for a first-order one. To see this qualitatively, we consider the simplest case where the same error is made at every step, and all these errors simply add together. This can then be modeled by the following second-order "ODE" in the discrete variable n:

d^2(GlobalError)/dn^2 = LocalError ,   where LocalError = const ;      (5.13)

GlobalError(0) = 0 ,   GlobalError′(0) = StartupError .      (5.14)


The "StartupError" is actually the error one makes in computing Y_1. If we now treat the discrete variable n as continuous (which is acceptable if we want to obtain an estimate for the answer), then the solution of the above is, obviously,

GlobalError(n) = StartupError · n + LocalError · n^2/2 ,      (5.15)

which, on account of

n = (b - a)/h = O(1/h)   ([a, b] being the interval of integration),

becomes

GlobalError(n) = StartupError · O(1/h) + LocalError · O(1/h^2) .      (5.16)

In the Appendix, we derive an analog of (5.16) for the discrete equation (5.12) rather than for the continuous equation (5.13); that derivation confirms the validity of our replacing the discrete equation by its continuous equivalent for the purposes of estimating the error accumulation.

Equation (5.16), along with the aforementioned fact that the local truncation error is O(h^4), implies that the global error is indeed O(h^2), provided that the "startup error" (i.e., the error in Y_1) is appropriately small. Using the same equation (5.16), it is now easy to see that Y_1 needs to be determined with accuracy O(h^3). Therefore, we supplement Eq. (5.12) with the following

Initial conditions for method (5.12):

Y_0 = y_0 ,   Y_1 = y_0 + h y′_0 + (h^2/2) f(y_0) ,      (5.17)

where in the last equation we have used the ODE y′′ = f(y).
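A minimal Python sketch of the scheme (5.12) with the startup value (5.17) may look as follows (names are placeholders; the solution is scalar):

def central_difference(f, y0, v0, h, nsteps):
    """Integrate y'' = f(y) by the simple central-difference method (5.12),
    started with (5.17). Returns the list [Y_0, Y_1, ..., Y_nsteps]."""
    Y = [y0, y0 + h * v0 + 0.5 * h * h * f(y0)]     # Y_0 and Y_1 from (5.17)
    for n in range(1, nsteps):
        Y.append(2.0 * Y[n] - Y[n - 1] + h * h * f(Y[n]))   # (5.12)
    return Y

# Example: y'' = -y, y(0) = 0, y'(0) = 1 (exact solution sin x):
Y = central_difference(lambda y: -y, 0.0, 1.0, 0.1, 100)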

Another method that uses the central-difference approximation (5.11) for y′′ is:

Y_{n+1} - 2Y_n + Y_{n-1} = (h^2/12) ( f_{n+1} + 10f_n + f_{n-1} ) .      (5.18)

This is called Numerov's method, or the Royal Road formula. The local truncation error of this method is O(h^6). Therefore, the global error will be O(h^4) (i.e., 2 orders better than the global error in the simple central-difference method), provided we calculate Y_1 with accuracy O(h^5). In principle, this can be done using, for example, the Taylor expansion:

Y_1 = y_0 + h y′_0 + (h^2/2) y′′_0 + (h^3/6) y′′′_0 + (h^4/24) y^{(iv)}_0 ,      (5.19)

where y_0 and y′_0 are given as the initial conditions and the higher-order derivatives are computed successively as follows:

y′′_0 = f(y_0) ,

y′′′_0 = (d/dx) f(y) |_{y=y_0} = f_y(y_0) y′_0 ,

y^{(iv)}_0 = (d/dx) [ f_y(y) y′(x) ] |_{y=y_0} = f_{yy}(y_0) (y′_0)^2 + f_y(y_0) f(y_0) .      (5.20)

However, Numerov's method is implicit (why?), which makes it unpopular for numerical integration of the IVPs (5.10). The only exception would be the case when f(y) = ay + b, a linear function of y, when the equation for Y_{n+1} can be easily solved. We will encounter Numerov's method later in this course when we study boundary value problems; there, this method is the method of choice because of its high accuracy.


5.3 Special methods for the second-order ODE y′′ = f(y). II: Methods that approximately preserve energy

As we said at the beginning of the previous subsection, Eq. (5.10) describes the motion of a particle in the field of a conservative force. For example, the gravitational or electrostatic force is conservative, but any form of friction is not. We now rename the independent variable x as t (the time) and denote v(t) = y′(t) (the velocity). As before, y(t) denotes the particle's position. Equation (5.10) can be rewritten as

y′ = v ,      (5.21)
v′ = f(y) ,      (5.22)
y(t_0) = y_0 ,   v(t_0) = v_0 .

In Eq. (5.22), the r.h.s. can be thought of as a force acting on a particle of unit mass. Note that these equations admit a conserved quantity, called the Hamiltonian (which in Newtonian mechanics is just the total energy of the particle):

H(v, y) = (1/2) v^2 + U(y) ,   U(y) = -\int f(y) dy .      (5.23)

The first and second terms on the r.h.s. of (5.23) are the kinetic and potential energies of the particle. Using the equations of motion, (5.21) and (5.22), it is easy to see that the Hamiltonian (i.e., the total energy) is indeed conserved:

dH/dt = (∂H/∂v)(dv/dt) + (∂H/∂y)(dy/dt) = v · f(y) + (dU/dy) · v = 0   for all t .      (5.24)

It is now natural to ask: Do any of the methods considered so far conserve the Hamiltonian? That is, if {V_n, Y_n} is a numerical solution of (5.21) and (5.22), is H(V_n, Y_n) independent of n? The answer is 'no'. However, some of the methods do conserve the Hamiltonian approximately over very long time intervals. We now consider specific examples of such methods.

Consider the three implementations of the simple Euler method, given by Eqs. (5.2)–(5.4). We will refer to method (5.2) as the regular Euler method; the other two methods are conventionally referred to as symplectic^8 Euler methods. Let us apply these methods with h = 0.02 to integration of the equations of a simple harmonic oscillator

y′′ = -y ,   y(0) = 0 ,   y′(0) = 1 .      (5.25)

The results are presented below. We plot the numerical solutions along with the exact one in the phase plane for t ≤ 20, which corresponds to slightly more than 3 oscillation periods. The orbits of the solutions obtained by the symplectic methods lie very close to the orbit of the exact solution, while the orbit corresponding to the regular Euler method winds off to infinity (provided one waits infinitely long, of course).

^8 "Symplectic" is a term from Hamiltonian mechanics that means "preserving areas in the phase space". If this explanation does not make the matter clearer to you, simply ignore it and treat the word "symplectic" just as a new adjective in your vocabulary.


[Figure: phase plane (v vs. y) of the simple harmonic oscillator; the orbits of the symplectic Euler methods nearly coincide with that of the exact solution, while the regular Euler orbit spirals outward. Figure: error in the Hamiltonian, H_computed - H_exact, versus time for the regular Euler and the symplectic Euler methods.]
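The first of these experiments is easy to reproduce. A possible Python sketch for (5.25) with h = 0.02 (illustrative only; the exact Hamiltonian of (5.25) is H = 1/2):

def energy(y, v):
    return 0.5 * (v * v + y * y)     # Hamiltonian (5.23) with f(y) = -y

h, nsteps = 0.02, 1000               # integrate up to t = 20, as in the figure
y_r, v_r = 0.0, 1.0                  # regular Euler (5.2)
y_s, v_s = 0.0, 1.0                  # symplectic Euler (5.3)
for _ in range(nsteps):
    y_r, v_r = y_r + h * v_r, v_r - h * y_r    # both updates use the old values
    y_s = y_s + h * v_s                        # update y first ...
    v_s = v_s - h * y_s                        # ... then use the NEW y in v's update
print(energy(y_r, v_r) - 0.5)   # regular Euler: the error grows steadily with time
print(energy(y_s, v_s) - 0.5)   # symplectic Euler: the error stays small, O(h)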

It is known that the symplectic Euler (and higher-order symplectic) methods lose their remarkable property of near-preservation of the energy if the step size is varied. To illustrate this fact, we show the error in the Hamiltonian obtained for the same Eq. (5.25) when the step size is sinusoidally varied with a frequency incommensurable with that of the oscillator itself. Specifically, we took h = 0.02 + 0.01 sin(1.95t). We see, however, that the error in the symplectic methods is still much smaller than that obtained by the regular Euler method.

[Figure: error in the Hamiltonian for the variable step size; the regular Euler error grows to order one, while the symplectic Euler errors remain much smaller.]

At this point, we are ready to ask two more questions.

Question: What feature of the symplectic Euler methods allows them to maintain the Hamiltonian of the numerical solution near that of the exact solution?

Answer: Perhaps surprisingly, systematic studies of symplectic methods began relatively recently, in the late 1980s. The theory behind these methods goes far beyond the scope of this course (and the expertise of this instructor). A recent review of such methods is posted on the website of this course. We will only briefly touch upon the main reason for the superior performance of the symplectic methods over non-symplectic ones in Sec. 5.4.

Question: Among the methods we have considered in this Section, are there other methods that possess the property of near-conservation of the Hamiltonian?


Answer: The short answer is 'yes'. To present a more detailed answer, let us look back at the figure for the error in the Hamiltonian obtained with the two symplectic Euler methods. We see that these errors are nearly opposite to each other and hence, being added, will nearly cancel one another. Therefore, if we somehow manage to combine the symplectic methods (5.3) and (5.4) so that the Hamiltonian error of the new method is the sum of those two "old" errors, then that new error will be dramatically reduced in comparison with either of the "old" errors. (This is similar to how the second-order trapezoidal rule for integration of f(x) is obtained as the average of the first-order accurate left and right Riemann sums, whose errors are nearly opposite and thus, being added, nearly cancel each other.) Below we produce such a combination of methods (5.3) and (5.4).

Let us split the step from x_n to x_{n+1} into two substeps: from x_n to x_{n+1/2} and then from x_{n+1/2} to x_{n+1}. Let us now advance the solution in the first half-step using method (5.4) and then advance it in the second half-step using method (5.3). Here is this process in detail:

[Diagram: the step x_n → x_{n+1/2} → x_{n+1}; symplectic Euler (5.4) is used on the first half-step and symplectic Euler (5.3) on the second.]

x_n → x_{n+1/2}, use (5.4):   V_{n+1/2} = V_n + (h/2) f(Y_n) ,
                              Y_{n+1/2} = Y_n + (h/2) V_{n+1/2} ;

x_{n+1/2} → x_{n+1}, use (5.3):   Y_{n+1} = Y_{n+1/2} + (h/2) V_{n+1/2} ,
                                  V_{n+1} = V_{n+1/2} + (h/2) f(Y_{n+1}) .      (5.26)

Combining the above equations (simply add the 2nd and 3rd equations, and then add the 1st and 4th ones), we obtain:

Y_{n+1} = Y_n + h V_n + (h^2/2) f(Y_n) ,
V_{n+1} = V_n + (h/2) ( f(Y_n) + f(Y_{n+1}) ) .      (5.27)

Method (5.27) is called the Verlet method, after Dr. Loup Verlet, who "discovered" it in 1967. Later, however, Verlet himself found accounts of his method in works dated as far back as the late 18th century. In particular, in 1907, G. Stormer used higher-order versions of this method for computation of the motion of ionized particles in the Earth's magnetic field. About 50 years earlier, J.F. Encke had used method (5.27) for computation of planetary orbits. For this reason, this method is also sometimes associated with the names of Stormer and/or Encke.
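A minimal sketch of one Verlet step (5.27) in Python (placeholder names; note that f(Y_n) is passed in and f(Y_{n+1}) is passed out, so only one new force evaluation is made per step):

def verlet_step(f, y, v, fy, h):
    """One step of the Verlet method (5.27) for y'' = f(y).
    fy must equal f(y); the updated f-value is returned for reuse."""
    y_new = y + h * v + 0.5 * h * h * fy
    fy_new = f(y_new)                       # the single force evaluation
    v_new = v + 0.5 * h * (fy + fy_new)
    return y_new, v_new, fy_new

# Usage for the oscillator (5.25):
f = lambda y: -y
y, v, h = 0.0, 1.0, 0.02
fy = f(y)
for _ in range(1000):
    y, v, fy = verlet_step(f, y, v, fy, h)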

The Verlet method is extensively used in applications dealing with long-time computations, such as molecular dynamics, planetary motion, and computer animation^9. Its benefits are:

(i) It nearly conserves the energy of the modeled system;
(ii) It is second-order accurate; and
(iii) It requires only one function evaluation per step.

To make the value of these benefits evident, in a homework problem you will be asked to compare the performance of the Verlet method with that of the higher-order cRK method, which is not symplectic and does not have the property of near-conservation of the energy.

^9 For example, you may visit a game-developers' website at http://www.gamedev.net, go to their Forums, and there do a search for 'Verlet'.

We now complete the answer to the question asked above and show that the Verlet method is equivalent to the simple central-difference method. To this end, let us write the Verlet equations at two consecutive steps:

Y_{n+1} = Y_n + h V_n + (h^2/2) f(Y_n) ,
V_{n+1} = V_n + (h/2) ( f(Y_n) + f(Y_{n+1}) ) ,
Y_{n+2} = Y_{n+1} + h V_{n+1} + (h^2/2) f(Y_{n+1}) ,
V_{n+2} = V_{n+1} + (h/2) ( f(Y_{n+1}) + f(Y_{n+2}) ) .      (5.28)

In fact, we will only need the first three of the above equations. Subtracting the 1st equation from the 3rd and slightly rearranging the terms, we obtain:

Y_{n+2} - 2Y_{n+1} + Y_n = { h V_{n+1} + (h^2/2) f(Y_{n+1}) } - { h V_n + (h^2/2) f(Y_n) } .      (5.29)

We now use the 2nd equation of (5.28) to eliminate V_{n+1}. The straightforward calculation yields

Y_{n+2} - 2Y_{n+1} + Y_n = h^2 f(Y_{n+1}) ,

which is the simple central-difference method (5.12). Thus, we have shown that the simple central-difference method nearly conserves the energy of the system.

To conclude this section, we note that although the Verlet method nearly conserves the Hamiltonian of the simulated system, it may not always conserve or nearly-conserve other constants of the motion, whenever such exist. As an example, consider the Kepler two-body problem (two particles in each other's gravitational field):

q′′ = − q / (q² + r²)^{3/2} ,     r′′ = − r / (q² + r²)^{3/2} ,     (5.30)

where q and r are the Cartesian coordinates of a certain radius vector relative to the center of mass of the particles. Let us denote the velocities corresponding to q and r as Q and R, respectively. This problem has the following three constants of the motion:

Hamiltonian of (5.30):

H = (1/2)(Q² + R²) − 1/√(q² + r²) ,     (5.31)

Angular momentum of (5.30):

A = qR − rQ ,     (5.32)

Runge-Lenz vector of (5.30):

L = ~i ( R(qR − rQ) − q/√(q² + r²) ) + ~j ( −Q(qR − rQ) − r/√(q² + r²) ) .     (5.33)


It turns out that the Verlet method nearly conserves the Hamiltonian and exactly conserves the angular momentum A, but does not conserve the Runge-Lenz vector L. In a homework problem, you will be asked to examine what effect this nonconservation has on the numerical solution.
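As an aside, one can monitor all three quantities (5.31)-(5.33) along a Verlet trajectory of (5.30); the Python sketch below does this. The initial conditions, step size, and integration time are arbitrary illustrative choices.

    import numpy as np

    def accel(q, r):
        d3 = (q*q + r*r)**1.5
        return -q/d3, -r/d3                    # right-hand sides of (5.30)

    def invariants(q, r, Q, R):
        d = np.hypot(q, r)
        A = q*R - r*Q                          # (5.32)
        H = 0.5*(Q*Q + R*R) - 1/d              # (5.31)
        return H, A, R*A - q/d, -Q*A - r/d     # last two: components of (5.33)

    q, r, Q, R = 1.0, 0.0, 0.0, 0.8            # an elliptic orbit
    print(invariants(q, r, Q, R))
    h = 0.01
    aq, ar = accel(q, r)
    for n in range(20000):                     # Verlet method (5.27), componentwise
        q += h*Q + 0.5*h*h*aq
        r += h*R + 0.5*h*h*ar
        aq2, ar2 = accel(q, r)
        Q += 0.5*h*(aq + aq2)
        R += 0.5*h*(ar + ar2)
        aq, ar = aq2, ar2
    print(invariants(q, r, Q, R))

Comparing the two printed lines, H stays near its initial value and A is conserved to round-off, while the last two numbers (the components of L) are not protected; any drift in them is the effect examined in the homework problem.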

5.4 Stability of numerical methods for systems of ODEs and higher-order ODEs

Following the lines of the previous three subsections, we will first comment on the stability of general-purpose methods, and then on that of special methods for y′′ = f(y) and similar equations.

In Lecture 4, we showed that in order to analyze stability of numerical methods for a single first-order ODE y′ = f(x, y), we needed to consider that stability for the model problem (4.14), y′ = λy with λ = const. This is because any small error, coming either from the truncation error or from elsewhere, satisfies the linearized equation

ε′(x) = f_y(x, y) · ε(x) + driving terms ;     (5.34)

see Eqs. (4.17) and (4.18). Thus, if we replace the variable coefficient f_y(x, y), whose exact value at each x we do not know, with a constant λ (such that, hopefully, |λ| > |f_y(x, y)|), we then arrive at the model problem (4.14), which we can easily analyze.

Question: What is the counterpart of (5.34) for a system of ODEs?
Answer, stated for two ODEs:

~ε ′(x) = [ ∂(f^(1), f^(2)) / ∂(y^(1), y^(2)) ] ~ε(x) + driving terms ,     (5.35)

where

~ε = ( ε^(1) ; ε^(2) ) ,     ∂(f^(1), f^(2)) / ∂(y^(1), y^(2)) = [ ∂f^(1)/∂y^(1) , ∂f^(1)/∂y^(2) ; ∂f^(2)/∂y^(1) , ∂f^(2)/∂y^(2) ] .     (5.36)

(Here and below, a matrix or a column vector is written by rows, with the rows separated by semicolons.)

The last matrix is called the Jacobian of the r.h.s. of system (5.1). Equations (5.35) and (5.36) generalize straightforwardly for more than two equations.

We now give a brief derivation of Eqs. (5.35) and (5.36) which parallels that of Eqs. (4.18) for a single first-order ODE. For convenience of the reader, we re-state here the IVP system (5.1):

y^(1) ′ = f^(1)(x, y^(1), y^(2)) ,     y^(1)(x_0) = y^(1)_0 ,
y^(2) ′ = f^(2)(x, y^(1), y^(2)) ,     y^(2)(x_0) = y^(2)_0 ,     (5.1)

whose exact solutions are y^(1) and y^(2). Let Y^(1) and Y^(2) be the numerical solutions of system (5.1). They satisfy a related system:

Y^(1) ‘ = f^(1)(x, Y^(1), Y^(2)) + (1/h) · Local Truncation Error ,     Y^(1)(x_0) = y^(1)_0 ,
Y^(2) ‘ = f^(2)(x, Y^(1), Y^(2)) + (1/h) · Local Truncation Error ,     Y^(2)(x_0) = y^(2)_0 .     (5.37)

(Here we used the "other" symbol of prime, ‘, to denote the finite-difference approximation of the derivative, while we reserve the regular symbol of prime, ′, for the exact analytical derivative.) Next, we subtract each of Eqs. (5.37) from the corresponding equation in (5.1)


(assigning now the same meaning to the two kinds of primes) and obtain the following equations for the global errors ε^(k) = y^(k) − Y^(k), k = 1, 2:

ε^(k) ‘ = f^(k)(x, y^(1), y^(2)) − f^(k)(x, Y^(1), Y^(2)) − (1/h) · Local Truncation Error
        = (∂f^(k)/∂y^(1)) ε^(1) + (∂f^(k)/∂y^(2)) ε^(2) + O(ε²) − (1/h) · Local Truncation Error .     (5.38)

It is now easy to see that Eq. (5.38) is the same as Eqs. (5.35) and (5.36).
Since we do not know the values of the entries in the Jacobian matrix in Eq. (5.35), we simply replace that matrix by a matrix with constant terms. Thus, the model problem that one should use to analyze stability of numerical methods for systems of ODEs is

~y ′ = A~y , A is a constant matrix. (5.39)

Now, for a single first-order ODE, we had only one parameter, λ, in the model problem.
Question: How many parameters do we have in the model problem (5.39) for a system of N ODEs?
The answer depends on which of the two categories of methods one uses. Namely, in Sec. 5.1, we saw that some methods (e.g., the regular Euler (5.2) and the Modified Euler (5.5)) use the solution ~Y_n at x = x_n to simultaneously advance to the next step, x = x_{n+1}. Moreover, each component Y^(k) is obtained using the same discretization rule. To be consistent with the terminology of Sec. 5.1, we will call this first category of methods the general-purpose methods. Methods of the other category, which included the symplectic Euler and Verlet, obtain a component Y^(m)_{n+1} at x_{n+1} by using previously obtained components at x_{n+1}, Y^(k)_{n+1} with k < m, as well as the components Y^(p)_n with p ≥ m at x_n. In other words, they apply different discretization rules for different components. We will call methods from this category special methods.

Returning to the above question, we now show that for the general-purpose methods (regular Euler, modified Euler, cRK, etc.), the answer is 'N' (even though matrix A contains N² entries!). We will explain this using the regular Euler method as an example. Details for other general-purpose methods are more involved, but follow the same logic. First, we note that "most" matrices are diagonalizable¹⁰, which means that there exists a matrix S and a diagonal matrix D such that

A = S⁻¹ D S ;     D = diag(λ_1, λ_2, . . . , λ_N) .     (5.40)

Moreover, the diagonal entries of D are the eigenvalues of A. Substitution of (5.40) into (5.39) leads to the following chain of transformations:

~y ′ = A~y   ⇒   ~y ′ = S⁻¹ D S ~y   ⇒   S~y ′ = D S~y   ⇒   (S~y)′ = D (S~y)   ⇒   ~z ′ = D~z ,  where ~z = S~y .     (5.41)

¹⁰Some matrices, e.g. [1, 1; 0, 1], are not diagonalizable. However, if we perturb it as, say, [1.01, 1; 0, 1], this latter matrix is diagonalizable.


Therefore, the important (for the stability analysis) information about a diagonalizable matrix A is concentrated in its eigenvalues. Namely, given the diagonal form (5.40) of matrix D, the last equation in (5.41) can be written as

z^(k) ′ = λ^(k) z^(k) ,  for k = 1, . . . , N ,     (5.42)

which means that the matrix model problem (5.39) reduces to the model problem (4.14) for a single first-order ODE.

Now, when we apply the regular Euler method to system (5.39), we get

~Yn+1 − ~Yn = hA~Yn . (5.43)

Repeating now the steps of (5.41), we rewrite this as

~Zn+1 − ~Zn = hD~Zn , (5.44)

where ~Z_n is the numerical approximation to ~z_n. Given the diagonal form of D, for the components of ~Z_n we obtain:

Z^(k)_{n+1} − Z^(k)_n = h λ^(k) Z^(k)_n ,     (5.45)

which is just the simple Euler method applied separately to individual model problems (5.42). Thus, we have confirmed our earlier statement that for a general-purpose method, the stability analysis for a system of ODEs reduces to the stability analysis for a single equation.

We will now show that the above statement does not apply to special methods like the symplectic Euler etc. Indeed, let us apply symplectic Euler (5.3) to a 2 × 2 model problem (5.39), assuming that A = [a_11, a_12; a_21, a_22]. We have (verify):

~Y_{n+1} − ~Y_n = h [a_11, a_12; 0, 0] ~Y_n + h [0, 0; a_21, a_22] ~Y_{n+1} .     (5.46)

This can no longer be written in the form (5.43), and hence the subsequent calculations that led to (5.45) are no longer valid. Therefore, the only avenue to proceed with the stability analysis for special methods is to consider the original matrix model problem (5.39); obviously, this model problem has, in general, as many parameters as the matrix A, i.e. N².

Below we give an example of doing stability analyses for the regular and symplectic Euler methods. In a homework problem, you will be asked to do similar calculations for the modified Euler and Verlet methods. As a particular problem, we choose that of a simple harmonic oscillator (5.25). That problem can be written in the matrix form as follows:

( y ; v )′ = [0, 1; −1, 0] ( y ; v ) ,     (5.47)

so that y^(1) = y and y^(2) = v. The matrix in (5.47) has the eigenvalues λ_{1,2} = ±i:

| 0 − λ , 1 ; −1 , 0 − λ | = 0   ⇒   λ² + 1 = 0   ⇒   λ = ±i .


Thus, if we use the regular Euler, which is a general-purpose method, it suffices to study the stability of the simple Euler method for a single ODE

y′ = λy with λ = i or λ = −i. (5.48)

Before we proceed, let us perform a sanity check and confirm that Eq. (5.48) does indeed describe a solution that we expect of a harmonic oscillator, i.e.

y = c1 sin x + c2 cos x (5.49)

for some constants c1, c2. Indeed, the solution of (5.48) with, say, λ = i, is

y = eix . (5.50)

Using the Euler formula for complex numbers,

eix = cos x + i sin x ,

we see that solution (5.50) is indeed of the form (5.49) with c_1 = 1 and c_2 = i. Now, if we want both constants c_1, c_2 to be real, we need to also account for the solution with λ = −i and perform more tedious calculations. As a result, we would not only confirm that the y-component of the solution has the form (5.49), but would also show that y and v are given by

y = B sin(x + φ), v = C sin(x + ψ) , (5.51)

where B, C, φ, and ψ are some constants. We will not do these calculations here, since they would distract us from our main goal, which is the stability analysis of the regular Euler method applied to (5.47). We will be content with summarizing the above discussion by stating that the exact solutions (5.51) of the harmonic oscillator model describe oscillations with a constant amplitude (related to constants B and C).

Let us now return to the stability analysis for the regular Euler method applied to system (5.47). According to the discussion that led to Eq. (5.45), this reduces to the stability analysis of the single Eq. (5.48). The result is given by Eq. (4.21) of Lecture 4. Namely, we have that

|r| = |1 + h · i| = √(1 + h²) ≈ 1 + (1/2) h² ,     (5.52)

so that

|r|ⁿ = |r|^{x/h} ≈ ( 1 + (1/2) h² )^{x/h} = ( 1 + h · (1/2) h )^{x/h} ≈ e^{xh/2} .     (5.53)

Since the absolute value |r|ⁿ determines the amplitude of the numerical solution, we see that Eq. (5.53) shows that this amplitude grows with x, whereas the amplitude of the exact solution of (5.47) is constant (see (5.51))¹¹. Therefore, the regular Euler method applied to (5.47) is unstable for any step size h!

This result is corroborated by the figure accompanying Eq. (4.21). Namely, for λ = i or −i, the value hλ lies on the imaginary axis, which is outside the stability region for the simple Euler method. Since the magnitude of any error will then grow exponentially, so will the amplitude of the solution, because, as we discussed in Lecture 4, for linear equations the error and the solution satisfy the same equation. In a homework problem, you will be asked to show that the behavior of the Hamiltonian of the numerical solution shown in a figure in Sec. 5.3 quantitatively agrees with Eq. (5.53).

¹¹To better visualize what is going on, you may imagine that the numerical solution at each x_n is simply the exact solution multiplied by |r|ⁿ.
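The growth law (5.53) is easy to observe numerically; below is a short Python sketch (the step size and interval are arbitrary illustrative choices):

    import numpy as np

    h, x_end = 0.01, 20.0
    y, v = 1.0, 0.0
    for k in range(int(x_end/h)):
        y, v = y + h*v, v - h*y          # regular Euler applied to (5.47)
    # amplitude of the numerical solution vs the prediction e^{xh/2} of (5.53):
    print(np.hypot(y, v), np.exp(x_end*h/2))

The two printed numbers agree to several decimal places, while the amplitude of the exact solution stays equal to 1.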

Now we turn to the stability analysis of the symplectic Euler method (say, (5.3)). To that end, we will apply these finite-difference equations to a model problem, which we take as a slight generalization of (5.25):

y′′ = −ω² y .     (5.54)

The finite-difference equations are:

y_{n+1} = y_n + h v_n ,
v_{n+1} = v_n − h ω² y_{n+1}

⇒   [1, 0; hω², 1] ( y ; v )_{n+1} = [1, h; 0, 1] ( y ; v )_n

⇒   ( y ; v )_{n+1} = [1, 0; hω², 1]⁻¹ [1, h; 0, 1] ( y ; v )_n

⇒   ( y ; v )_{n+1} = [1, 0; −hω², 1] [1, h; 0, 1] ( y ; v )_n

⇒   ( y ; v )_{n+1} = [1, h; −hω², 1 − h²ω²] ( y ; v )_n .     (5.55)

Once we have obtained this matrix relation between the solutions at the nth and (n + 1)st steps, we need to obtain the eigenvalues of the matrix on the r.h.s. of (5.55). Indeed, it is known from Linear Algebra that the solution of (5.55) is

( y ; v )_n = ~u_1 r_1ⁿ + ~u_2 r_2ⁿ ,     (5.56)

where r_{1,2} are the eigenvalues of the matrix in question and ~u_{1,2} are the corresponding eigenvectors. (We have used the notation r_{1,2} instead of λ_{1,2} for the eigenvalues in order to emphasize the connection with the characteristic root r that arises in the stability analysis of a single ODE.) If we find that the modulus of either of the eigenvalues r_1 or r_2 exceeds 1, this would mean that the symplectic method is unstable (well, we know already that it is not, but we need to demonstrate that). A simple calculation similar to that found after Eq. (5.47) yields

r_{1,2} = 1 − (1/2) ( h²ω² ± √(h⁴ω⁴ − 4h²ω²) ) .     (5.57)

With some help from Mathematica, one can show that

|r_1| = |r_2| = 1   for −2 ≤ hω ≤ 2 ;
either |r_1| > 1 or |r_2| > 1 ,  for any other complex hω .     (5.58)

Thus, the symplectic Euler method is stable for the simple harmonic oscillator equation (and, in general, other oscillatory models), provided that h is sufficiently small, so that |hω| < 2. Note that ω in Eq. (5.54) is a counterpart of iλ in the model equation (4.14): indeed, simply differentiate (4.14) with λ being replaced by iλ one more time. Using this relation between λ and ω, we then observe that the stability region for the symplectic Euler method, given by the first line of (5.58), is reminiscent of the stability region of the Leap-frog method (see Eq. (4.36)). This may suggest that the Leap-frog method, applied to an oscillatory equation, will also have the property of near-conservation of the total energy; however, we will not consider this issue in detail.
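One does not need Mathematica to spot-check (5.57) and (5.58): the matrix in (5.55) has trace 2 − h²ω² and determinant 1, so its eigenvalues solve r² − (2 − h²ω²) r + 1 = 0 and depend on hω only. A minimal Python check (the sample values of hω are arbitrary):

    import numpy as np

    for h_omega in (0.5, 1.0, 1.9, 2.1):
        s = h_omega**2
        r = np.roots([1.0, -(2.0 - s), 1.0])   # r^2 - (2 - h^2 w^2) r + 1 = 0
        print(h_omega, np.abs(r))

For hω = 0.5, 1.0, 1.9 both moduli equal 1, while for hω = 2.1 one of them exceeds 1, in agreement with (5.58).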

To conclude this subsection, let us mention that for the simple central-difference method (5.12) and Numerov's method (5.18), the stability analysis should also be applied to the simple harmonic oscillator equation (5.54). For example, substituting Y_n = rⁿ into the simple central-difference equation, where f(y) = −ω²y, one finds

r² − (2 − h²ω²) r + 1 = 0 .     (5.59)

Again, with help from Mathematica, one can show that the two roots r_{1,2} of Eq. (5.59) satisfy Eq. (5.58). This is not at all surprising, given that the simple central-difference method is equivalent to the Verlet method (see the text around (5.28)) and the latter, in its turn, is simply a composition of two symplectic Euler methods.

Similarly, one can show that the stability region for Numerov’s method is given by

−√6 ≤ hω ≤ √6 ,  where |r_1| = |r_2| = 1 ,     (5.60)

whereas for any other complex hω, either |r_1| > 1 or |r_2| > 1.

5.5 Stiff equations

Here we will encounter, for the first time in this course, a class of equations that are very difficult to solve numerically. These equations are called numerically stiff. It is important to be able to recognize cases where one has to deal with such systems of equations; otherwise, the numerical solution that one would obtain will have no connection to the exact one.

Let us consider an IVP

( u ; v )′ = [998, 1998; −999, −1999] ( u ; v ) ,     ( u ; v )|_{x=0} = ( 1 ; 0 ) .     (5.61)

Its exact solution is

( u ; v ) = ( 2 ; −1 ) e^{−x} + ( −1 ; 1 ) e^{−1000x} .     (5.62)

The IVP (5.61) is an example of a stiff equation. Although there is no rigorous definition of numerical stiffness, it is usually accepted that a stiff system should satisfy the following two criteria:
(i) The system of ODEs must contain at least two groups of solutions, where solutions in one group vary rapidly relative to the solutions of the other group. That is, among the eigenvalues of the corresponding matrix A there must be two, λ_slow and λ_rapid, such that

|λ_rapid| / |λ_slow| ≫ 1 .     (5.63)

(ii) The rapidly changing solution(s) must be stable. That is, the eigenvalues λ_rapid of the matrix A in Eq. (5.39) that are large in magnitude must have Re λ_rapid < 0. As for the slowly changing solutions, they may be either stable or unstable.

Let us verify that system (5.61) is stiff. Indeed, criterion (i) above is satisfied because, of this system's two solutions, given by the two terms in (5.62), the first (with λ_slow = −1) varies slowly compared to the other term (with λ_rapid = −1000). Criterion (ii) is satisfied because the rapidly changing solution has λ_rapid < 0.


Another example of a stiff system is

( u ; v )′ = − [499, 501; 501, 499] ( u ; v ) ,     ( u ; v )|_{x=0} = ( 0 ; 2 ) ,     (5.64)

whose solution is

( u ; v ) = ( −1 ; 1 ) e^{2x} + ( 1 ; 1 ) e^{−1000x} .     (5.65)

Here, again, the first and second terms in (5.65) represent the slow and fast parts of the solution, with λ_slow = 2 and λ_rapid = −1000, so that |λ_rapid| ≫ |λ_slow|. Thus, criterion (i) is satisfied. Criterion (ii) is satisfied because the rapid solution is stable: λ_rapid < 0.

The difficulty with stiff equations can be understood from the above examples (5.61), (5.62) and (5.64), (5.65). Namely, the rapid parts of those solutions are important only very close to x = 0 and are almost zero everywhere else. However, in order to integrate, e.g., (5.61) using, say, the simple Euler method, one is required to keep h · 1000 ≤ 2 (see Eq. (4.20)), i.e. h ≤ 0.002. That is, we are forced to use a very small step size in order to avoid the numerical instability caused by the least important part of the solution!

Thus, in layman terms, a problem that involves processes evolving on two (or more) disparate scales, with the rapid process(es) being stable, is stiff. Moreover, as the above example shows, the meaning of stiffness is that one needs to work the hardest (i.e., use the smallest h) to resolve the least important part of the solution (i.e., the second terms on the r.h.s.'es of (5.62) and (5.65)).

An obvious way to deal with a stiff equation is to use an A-stable method (implicit or modified implicit Euler). This would eliminate the issue of numerical instability; however, the problem of (low) accuracy will still remain.

In practice, one strikes a compromise between the accuracy and stability of the method. Matlab, for example, uses a family of methods known as BDF (backward-difference formula) methods. Matlab's built-in solvers for stiff problems are ode15s (this uses a method of order between 1 and 5) and ode23s.
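The step-size barrier h ≤ 0.002 and the advantage of an A-stable method are easy to see in a short experiment. Below is a Python sketch for (5.61); the step sizes and the final point x = 0.1 are arbitrary illustrative choices, and the implicit Euler update is written as y_{n+1} = (I − hA)⁻¹ y_n.

    import numpy as np

    A = np.array([[998., 1998.], [-999., -1999.]])   # matrix of (5.61)
    y0 = np.array([1., 0.])
    x_end = 0.1
    exact = np.exp(-x_end)*np.array([2., -1.])       # (5.62): e^{-1000x} is negligible here

    def euler_explicit(h):
        y = y0.copy()
        for _ in range(round(x_end/h)):
            y = y + h*(A @ y)
        return y

    def euler_implicit(h):
        y = y0.copy()
        M = np.linalg.inv(np.eye(2) - h*A)
        for _ in range(round(x_end/h)):
            y = M @ y
        return y

    print(exact)
    print(euler_explicit(0.0025))    # h*1000 > 2: the fast mode blows up
    print(euler_explicit(0.00125))   # h*1000 < 2: stable and accurate
    print(euler_implicit(0.01))      # A-stable: accurate with a much larger h

With h = 0.0025 the output is wildly wrong (the least important, e^{−1000x}, part of the solution destroys the computation), while the implicit method is accurate with a step five times larger than the explicit stability limit.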

5.6 Appendix: Derivation of Eq. (5.12) with fn = const

Here we will derive the solution of Eq. (5.12) with its right-hand side being replaced by a constant:

Y_{n+1} − 2Y_n + Y_{n−1} = M ,     M = const.     (5.66)

This will provide a rigorous justification for the solution (5.15) of the system (5.13)-(5.14).
The method that we will use closely follows the lines of the method of variation of parameters for the second-order ODE

y′′ + By′ + Cy = F(x) .     (5.67)

In what follows we will refer to Eq. (5.67) as the continuous case. Namely, we first obtain the solutions of the homogeneous version of (5.66):

Y_n = c^(1) + c^(2) n ,     c^(1) and c^(2) are arbitrary constants.     (5.68)

Solution (5.68) was obtained by the substitution into (5.66) with M = 0 of the ansätze Y_n = rⁿ and Y_n = n rⁿ. This is analogous to how the solution y = c^(1) + c^(2) x of the ODE y′′ = 0 is obtained.


Next, to solve Eq. (5.66) with M ≠ 0, we allow the constants c^(1) and c^(2) to depend on n. Substituting the result into (5.66), we obtain:

( c^(1)_{n+1} − 2c^(1)_n + c^(1)_{n−1} ) + ( (n+1) c^(2)_{n+1} − 2n c^(2)_n + (n−1) c^(2)_{n−1} ) = M .     (5.69)

Now, similarly to how in the continuous case the counterparts of our c^(1) and c^(2) are set to satisfy an equation

( c^(1) )′ y^(1) + ( c^(2) )′ y^(2) = 0 ,

where y^(k), k = 1, 2 are the homogeneous solutions of (5.67), here we impose the following condition:

k = n :   ( c^(1)_{k+1} − c^(1)_k ) + k ( c^(2)_{k+1} − c^(2)_k ) = 0 .     (5.70)

Subtracting from (5.70) its counterpart for k = n− 1, one obtains:

( c^(1)_{n+1} − 2c^(1)_n + c^(1)_{n−1} ) + ( n c^(2)_{n+1} − (2n − 1) c^(2)_n + (n − 1) c^(2)_{n−1} ) = 0 .     (5.71)

Next, subtracting the last equation from (5.69), we obtain a recurrence equation for c^(2) only, which has a simple solution (assuming c^(2)_0 = 0):

c^(2)_{n+1} − c^(2)_n = M   ⇒   c^(2)_n = nM .     (5.72)

From (5.72) and (5.70) one obtains the solution for c^(1):

c^(1)_{n+1} − c^(1)_n = −nM   ⇒   c^(1)_n = − ( n(n − 1)/2 ) M .     (5.73)

(Again, we have assumed that c^(1)_0 = 0.) Finally, combining the results of (5.68), (5.72), and (5.73), we obtain the solution of Eq. (5.66):

Y_n = − ( n(n − 1)/2 ) M + n² M = ( n(n + 1)/2 ) M = O(n²) M .     (5.74)

The leading-order dependence on n of this solution is that claimed in formula (5.15).
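A quick numerical check of (5.74), with an arbitrary value of M (the starting values Y_0 = 0, Y_1 = M correspond to the choices c^(1)_0 = c^(2)_0 = 0 made above):

    M, N = 0.3, 50
    Y = [0.0, M]                       # Y_0 = 0, Y_1 = (1*2/2)*M = M
    for n in range(1, N):
        Y.append(M + 2*Y[n] - Y[n-1])  # recurrence (5.66)
    print(Y[N], N*(N+1)/2*M)           # both print 382.5 (up to round-off)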

5.7 Questions for self-assessment

1. Verify (5.6).

2. Verify (5.7).

3. What would be your first step to solve a 5th-order ODE using the methodology of Sec. 5.1?

4. Use Eq. (5.11) to explain why the local truncation error of the simple central-difference method (5.12) is O(h⁴).

5. Explain why Y_1 for that method needs to be calculated with accuracy O(h³).

6. What is the global error of the simple central-difference method?


7. How does the rate of the error accumulation for a second-order ODE differ from the rate of the error accumulation for a first-order ODE?

8. Explain the last term in the expression for Y1 in (5.17).

9. Why is Numerov’s method implicit?

10. What is the physical meaning of the Hamiltonian for a Newtonian particle?

11. Verify (5.24).

12. What is the advantage of the symplectic Euler methods over the regular Euler method?

13. State the observation that prompted us to combine the two symplectic Euler methods into the Verlet method.

14. Obtain (5.27) from (5.26).

15. Obtain (5.29) and the next (unnumbered) equation.

16. What is the model problem for the stability analysis for a system of ODEs?

17. Show that for a "general-purpose" method, the stability analysis for a system of ODEs reduces to the stability analysis for the model problem (4.14).

18. Why is this not so for the “special” methods, like the symplectic Euler?

19. Make sure you can follow (5.55).

20. Verify that (5.56) is the solution of (5.55). That is, substitute (5.56), with the corresponding subindices, into both sides of the last equation of (5.55). Then for the expression on the r.h.s., use the stated fact that ~u_1 and ~u_2 are the eigenvectors of the matrix appearing on the r.h.s. of that equation. (You do not need to use the explicit form of that matrix.)

21. Obtain (5.57).

22. Would you apply the Verlet method to the following strongly damped oscillator:

y′′ = −(2 + i)² y ?

Please explain.

23. Same question for Numerov’s method.

24. What is the numerical stiffness, in layman terms?


6 Boundary-value problems (BVPs): Introduction

A typical BVP consists of an ODE and a set of conditions that its solution has to satisfy at both ends of a certain interval [a, b]. For example:

y′′ = f(x, y, y′), y(a) = α, y(b) = β . (6.1)

Now, let us recall that for IVPs

y′ = f(x, y) ,     y(x_0) = y_0 ,

we saw (in Lecture 0) that there are certain conditions on the function f(x, y) which would guarantee that the solution y(x) of the IVP exists and is unique.

The situation with BVPs is considerably more complicated. Namely, relatively few theorems exist that can guarantee existence and uniqueness of a solution to a BVP. Below we will state, without proof, two such theorems. In this course we will always assume that f(x, y, y′) in (6.1) and in similar BVPs is a continuous function of its arguments.

Theorem 6.1 Consider a BVP of a special form:

y′′ = f(x, y), y(a) = α, y(b) = β . (6.2)

(Note that unlike (6.1), the function f in (6.2) is assumed not to depend on y′.)
If ∂f/∂y > 0 for all x ∈ [a, b] and all values of the solution y, then the solution y(x) to the BVP (6.2) exists and is unique.

This Theorem is not very useful for a general nonlinear function f(x, y), because we do not know the solution y(x) and hence cannot always determine whether ∂f/∂y > 0 or < 0 for the specific solution that we are going to obtain. Sometimes, however, as, e.g., when f(x, y) = f_0(x) + y³, we are guaranteed that ∂f/∂y > 0 for any y and hence the solution of the corresponding BVP does exist and is unique. Another useful and common case is when the BVP is linear. In this case, we have the following result.

Theorem 6.2 Consider a linear BVP:

y′′ + P (x)y′ + Q(x)y = R(x), y(a) = α, y(b) = β . (6.3)

Let the coefficients P(x), Q(x), and R(x) be continuous on [a, b] and, in addition, let Q(x) ≤ 0 on [a, b]. Then the BVP (6.3) has a unique solution.

Note that in addition to the Dirichlet boundary conditions considered above, i.e.

y(a) = α, y(b) = β, (Dirichlet b.c.)

there may also be boundary conditions for the derivative, called the Neumann boundary conditions:

y′(a) = α ,     y′(b) = β .     (Neumann b.c.)

Also, the boundary conditions may be of the mixed type:

A_1 y(a) + A_2 y′(a) = α ,     B_1 y(b) + B_2 y′(b) = β .     (mixed b.c.)

Boundary conditions that involve values of the solution at both boundaries are also possible; e.g., periodic boundary conditions:

y(a) = y(b) ,     y′(a) = y′(b) ;     (periodic b.c.)


however, we will only consider the boundary conditions of the first three types (Dirichlet, Neumann, and mixed) in this course.

Note that Theorems 6.1 and 6.2 are stated specifically for the Dirichlet boundary conditions. For the Neumann boundary conditions, they are not valid. Specifically, in the case of Theorem 6.2 applied to the linear ODE (6.3) with Neumann boundary conditions, the solution will exist but will only be unique up to an arbitrary constant. (That is, if y(x) is a solution, then so is y(x) + C, where C is any constant.)

A large collection of other theorems about existence and uniqueness of solutions of linear and nonlinear BVPs can be found in a very readable book by P.B. Bailey, L.F. Shampine, and P.E. Waltman, "Nonlinear two-point boundary value problems," Ser.: Mathematics in Science and Engineering, vol. 44 (Academic Press, 1968).

Unless the BVP satisfies the conditions of Theorems 6.1 or 6.2, it is not guaranteed to have a unique solution. In fact, depending on the specific combination of the ODE and the boundary conditions, the BVP may have: (i) no solutions, (ii) one solution, (iii) a finite number of solutions, or (iv) infinitely many solutions. Possibility (iii) can take place only for nonlinear BVPs, while the other three possibilities can take place for both linear and nonlinear BVPs. In the remainder of this lecture, we will focus on linear BVPs.

Thus, a linear BVP can have 0 solutions, 1 solution, or the ∞ of solutions. This is similar to how a matrix equation

M~x = ~b     (6.4)

can have 0 solutions, 1 solution, or the ∞ of solutions, depending on whether the matrix M is singular or not. The reason behind this similarity will become apparent as we proceed, an early indication of this reason appearing later in this lecture and more evidence appearing in the subsequent lectures. Below we give three examples where the BVP does not satisfy the conditions of Theorem 6.2, and each of the above three possibilities is realized.

y′′ + π²y = 1 ,     y(0) = 0 ,  y(1) = 0     (Problem I)

has 0 solutions.

y′′ + π²y = 1 ,     y(0) = 0 ,  y′(1) = 1     (Problem II)

has exactly 1 solution.

y′′ + π²y = 1 ,     y(0) = 0 ,  y(1) = 2/π²     (Problem III)

has the ∞ of solutions.

Below we demonstrate that the above statements about the numbers of solutions in Problems I-III are indeed correct.

One can verify that the general solution of the ODE

y′′ + π2y = 1 (6.5)

is

y = A sin πx + B cos πx + 1/π² .     (6.6)


The constants A and B are determined by the boundary conditions. Namely, substituting the solution (6.6) into the boundary conditions of Problem I above, we have:

y(0) = 0  ⇒  B + 1/π² = 0 ;     y(1) = 0  ⇒  −B + 1/π² = 0 ;     (6.7)

hence no such B (and hence no pair A, B) exists. Note that the above equations can be written as a linear system for the coefficients A and B:

[0, 1; 0, −1] ( A ; B ) = ( −1/π² ; −1/π² ) .     (6.8)

The coefficient matrix in (6.8) is singular and, in addition, the vector on the r.h.s. does not belong to the range (column space) of this coefficient matrix. Therefore, the linear system has no solution, as we have stated above.

Substituting now the solution (6.6) of the ODE (6.5) into the boundary conditions of Problem II, we arrive at the following linear system for A and B:

y(0) = 0  ⇒  B + 1/π² = 0 ;     y′(1) = 1  ⇒  −πA = 1 ,     (6.9)

or in the matrix form,

[0, 1; −π, 0] ( A ; B ) = ( −1/π² ; 1 ) .     (6.10)

This system obviously has the unique solution A = −1/π, B = −1/π²; note that the matrix in the linear system (6.10) is nonsingular.

Finally, substituting (6.6) into the boundary conditions of Problem III, one finds:

y(0) = 0  ⇒  B + 1/π² = 0 ;     y(1) = 2/π²  ⇒  −B + 1/π² = 2/π² ,     (6.11)

or in the matrix form,

[0, 1; 0, −1] ( A ; B ) = ( −1/π² ; 1/π² ) .     (6.12)

The solution of (6.12) is: A = arbitrary, B = −1/π². Although the matrix in (6.12) is singular, the vector on the r.h.s. of this equation belongs to the column space of this matrix (that is, it can be written as a linear combination of the columns), and hence the linear system in (6.12) has infinitely many solutions.

The above simple examples illustrate a connection between linear BVPs and systems of linear equations. We can use this connection to formulate an analogue of the well-known theorem for linear systems, namely:

Theorem in Linear Algebra: The linear system (6.4) has a unique solution if and only if the matrix M is nonsingular.
Equivalently, either the linear system has a unique solution, or the homogeneous linear system

M~x = ~0

has nontrivial solutions. (Recall that the second part of the previous sentence is one of the definitions of a singular matrix.)


Similarly, for BVPs we have
The Alternative Principle for BVPs: Either the homogeneous linear BVP (i.e. the one with both the r.h.s. R(x) = 0 and zero boundary conditions) has nontrivial solutions, or the original BVP has a unique solution.

In a homework problem, you will be asked to verify this principle for Problems I–III.

To conclude this introduction to BVPs, let us again consider Problem I and exhibit a "danger" associated with that case. By itself, the fact that the BVP in Problem I has no solutions is neither good nor bad; it is simply a fact of life. However, suppose that the coefficients in this problem are known only approximately (for example, because of the round-off error). Then the matrix in (6.8) is no longer singular (in general), but almost singular. This, in turn, makes the linear system (6.8) ill-conditioned: a tiny change of the vector on the r.h.s. will generically lead to a large change in the solution. Thus, any numerical results obtained for such a system will not be reliable. In a homework problem, you will be asked to consider a specific example illustrating this case.
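The following Python sketch illustrates this effect for a perturbed version of (6.8); the perturbation sizes are arbitrary illustrative choices, not the ones used in the homework problem.

    import numpy as np

    delta = 1e-10                                   # tiny perturbation of the matrix
    Mmat = np.array([[0.0, 1.0], [delta, -1.0]])    # almost singular version of (6.8)
    rhs = np.array([-1/np.pi**2, -1/np.pi**2])
    print(np.linalg.cond(Mmat))                     # huge condition number
    print(np.linalg.solve(Mmat, rhs))
    print(np.linalg.solve(Mmat, rhs + np.array([0.0, 1e-8])))  # tiny change of r.h.s.

The two computed "solutions" differ in their first component by about 10² even though the right-hand sides differ by only 10⁻⁸.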

Questions for self-assessment

1. Can you say anything about the existence and uniqueness of solution of the BVP

y′′ = arctan y, y(−1) = −π, y′(1) = π ?

2. Same question about the BVP

y′′ + (arctan x) y′ + (sin x) y = 1, y(−π) = −1, y(0) = 1 .

3. Same question about the BVP

y′′ + (arctan x) y′ − (sin x) y = 1, y′(0) = 0, y′(1) = 1 .

4. How many solutions are possible for a BVP? What if the BVP is linear?

5. Verify that (6.6) is the solution of (6.5).

6. Verify each of the equations (6.7)–(6.12).

7. Verify that the vector on the r.h.s. of (6.12) belongs to the range (column space) of the matrix on the l.h.s.

8. State the Alternative Principle for the BVPs.

9. What is the danger associated with the case when the BVP has no solution?


7 The shooting method for solving BVPs

7.1 The idea of the shooting method

In this and the next lectures we will only consider BVPs that satisfy the conditions of Theorems 6.1 or 6.2 and thus are guaranteed to have a unique solution.

Suppose we want to solve a BVP with Dirichlet boundary conditions:

y′′ = f(x, y, y′), y(a) = α, y(b) = β . (7.1)

We can rewrite this BVP in the form:

y′ = z ,
z′ = f(x, y, z) ,
y(a) = α ,
y(b) = β .     (7.2)

The BVP (7.2) will turn into an IVP if we replace the boundary condition at x = b with the condition

z(a) = θ , (7.3)

where θ is some number. Then we can solve the resulting IVP by any method that we have studied in Lecture 5, and obtain the value of its solution y(b) at x = b. If y(b) = β, then we have solved the BVP. Most likely, however, we will find that after the first try, y(b) ≠ β. Then we should choose another value for θ and try again. There is actually a strategy of how the values of θ need to be chosen. This strategy is simpler for linear BVPs, so this is the case we consider next.

7.2 Shooting method for the Dirichlet problem of linear BVPs

Thus, our immediate goal is to solve the linear BVP

y′′ + P (x)y′ + Q(x)y = R(x) with Q(x) ≤ 0, y(a) = α, y(b) = β . (7.4)

To this end, consider two auxiliary IVPs:

u′′ + Pu′ + Qu = R ,
u(a) = α ,  u′(a) = 0 ,     (7.5)

and

v′′ + Pv′ + Qv = 0 ,
v(a) = 0 ,  v′(a) = 1 ,     (7.6)

where we omit the arguments of P(x) etc. as this should cause no confusion. Next, consider the function

w = u + θv, θ = const . (7.7)

Using Eqs. (7.5) and (7.6), it is easy to see that

(u + θv)′′ + P (u + θv)′ + Q (u + θv) = R ,
(u + θv)(a) = α ,  (u + θv)′(a) = θ ,     (7.8)


i.e. w satisfies the IVP

w′′ + Pw′ + Qw = R ,
w(a) = α ,  w′(a) = θ .     (7.9)

Note that the only difference between (7.9) and (7.4) is that in (7.9), we know the value of w′ at x = a but do not know whether w(b) = β. If we can choose θ in such a way that w(b) does equal β, this will mean that we have solved the BVP (7.4).

To determine such a value of θ, we first solve the IVPs (7.5) and (7.6) by an appropriate method of Lecture 5 and find the corresponding values u(b) and v(b). We then choose the value θ = θ_0 by requiring that the corresponding w(b) = β, i.e.

w(b) = u(b) + θ_0 v(b) = β .     (7.10)

This w(x) is the solution of the BVP (7.4), because it satisfies the same ODE and the same boundary conditions at x = a and x = b; see Eqs. (7.9) and (7.10). Equation (7.10) yields the following equation for θ_0:

θ_0 = ( β − u(b) ) / v(b) .     (7.11)

Thus, solving only two IVPs (7.5) and (7.6) and constructing the new function w(x) according to (7.7) and (7.11), we obtain the solution to the linear BVP (7.4).
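Here is a minimal Python sketch of this procedure. The sample BVP (y′′ − y = x, y(0) = 0, y(1) = 1, i.e. P = 0, Q = −1, R = x) and the grid are arbitrary illustrative choices; the IVPs (7.5) and (7.6) are solved by the classical RK4 method.

    import numpy as np

    a, b, alpha, beta = 0.0, 1.0, 0.0, 1.0
    P = lambda x: 0.0
    Q = lambda x: -1.0
    R = lambda x: x

    def rk4(F, s0, x):
        # integrate s' = F(x, s) on the grid x; return the first component
        s, h = np.array(s0, float), x[1] - x[0]
        out = [s[0]]
        for xn in x[:-1]:
            k1 = F(xn, s); k2 = F(xn + h/2, s + h/2*k1)
            k3 = F(xn + h/2, s + h/2*k2); k4 = F(xn + h, s + h*k3)
            s = s + h/6*(k1 + 2*k2 + 2*k3 + k4)
            out.append(s[0])
        return np.array(out)

    F_u = lambda x, s: np.array([s[1], R(x) - P(x)*s[1] - Q(x)*s[0]])   # IVP (7.5)
    F_v = lambda x, s: np.array([s[1],        - P(x)*s[1] - Q(x)*s[0]]) # IVP (7.6)

    x = np.linspace(a, b, 201)
    u = rk4(F_u, [alpha, 0.0], x)
    v = rk4(F_v, [0.0, 1.0], x)
    theta0 = (beta - u[-1]) / v[-1]            # (7.11)
    w = u + theta0*v                           # (7.7)
    exact = -x + 2*np.sinh(x)/np.sinh(1.0)     # exact solution of the sample BVP
    print(np.abs(w - exact).max())             # small: w solves the BVP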

Consistency check: In (7.4), we have required that Q(x) ≤ 0, which guarantees that a unique solution of that BVP exists. What would have happened if we had neglected to impose that requirement? Then, according to Theorem 6.2, we could have run into a situation where the BVP would have had no solutions. This would occur if

v(b) = 0 . (7.12)

But then Eqs. (7.6) and (7.12) would mean that the homogeneous BVP

v′′ + Pv′ + Qv = 0 ,
v(a) = 0 ,  v(b) = 0 ,     (7.13)

must have a nontrivial¹² solution. The above considerations agree, as they should, with the Alternative Principle for the BVPs, namely: the BVP (7.4) may have no solutions if the corresponding homogeneous BVP (7.13) has nontrivial solutions.

7.3 Generalizations of the shooting method for linear BVPs

If the BVP has boundary conditions other than of the Dirichlet type, we will still proceed exactly as we did above. For example, suppose we need to solve the BVP

y′′ + Py′ + Qy = R ,
y(a) = α ,  y′(b) = β ,     (7.14)

which has the Neumann boundary condition at the right end point. Denote y_1 = y, y_2 = y′, and rewrite this BVP as

y′_1 = y_2 ,
y′_2 = −P y_2 − Q y_1 + R ,
y_1(a) = α ,  y_2(b) = β .     (7.15)

¹²Indeed, since we also know that v′(a) = 1, then v(x) cannot identically equal zero on [a, b].


Using the vector/matrix notations with

~y = ( y_1 ; y_2 ) ,

we can further rewrite this BVP as

~y ′ = [0, 1; −Q, −P] ~y + ( 0 ; R ) ,     y_1(a) = α ,  y_2(b) = β .     (7.16)

Now, in analogy with Eqs. (7.5) and (7.6), consider two auxiliary IVPs:

~u ′ = [0, 1; −Q, −P] ~u + ( 0 ; R ) ,     ~u(a) = ( α ; 0 ) ;     (7.17)

~v ′ = [0, 1; −Q, −P] ~v ,     ~v(a) = ( 0 ; 1 ) .     (7.18)

Solve these IVPs by an appropriate method and obtain the values ~u(b) and ~v(b). Next, consider the vector

~w = ~u + θ~v, θ = const. (7.19)

Using Eqs. (7.17)–(7.19), it is easy to see that this new vector satisfies the IVP

~w ′ = [0, 1; −Q, −P] ~w + ( 0 ; R ) ,     ~w(a) = ( α ; θ ) .     (7.20)

At x = b, its value is

~w(b) = ( u_1(b) + θ v_1(b) ; u_2(b) + θ v_2(b) ) ,     (7.21)

where u_1 is the first component of ~u etc. From the last equation in (7.16), it follows that we are to require that

w2(b) = β . (7.22)

Equations (7.21) and (7.22) together yield

u_2(b) + θ v_2(b) = β ,   ⇒     (7.23)

θ_0 = ( β − u_2(b) ) / v_2(b) .     (7.24)

Thus, the vector ~w given by Eq. (7.19) where ~u, ~v, and θ satisfy Eqs. (7.17), (7.18), and (7.24), respectively, is the solution of the BVP (7.16).

Also, the shooting method can be used for BVPs of order higher than the second. For example, consider the BVP

x³ y′′′ + x y′ − y = −3 + ln x ,
y(a) = α ,  y′(b) = β ,  y′′(b) = γ .     (7.25)

As in the previous example, denote y_1 = y, y_2 = y′, y_3 = y′′ and rewrite the BVP (7.25) in the matrix form:

~y ′ = A ~y + ~r ,     y_1(a) = α ,  y_2(b) = β ,  y_3(b) = γ ,     (7.26)


where

~y = ( y_1 ; y_2 ; y_3 ) ,     A = [0, 1, 0; 0, 0, 1; 1/x³, −1/x², 0] ,     ~r = ( 0 ; 0 ; (−3 + ln x)/x³ ) .

Consider now three auxiliary IVPs:

~u ′ = A ~u + ~r ,  ~u(a) = ( α ; 0 ; 0 ) ;     ~v ′ = A ~v ,  ~v(a) = ( 0 ; 1 ; 0 ) ;     ~w ′ = A ~w ,  ~w(a) = ( 0 ; 0 ; 1 ) .     (7.27)

Solve them and obtain ~u(b), ~v(b), and ~w(b). Then, construct ~z = ~u + θ~v + φ~w, where θ and φ are numbers which will be determined shortly. At x = b, one has

~z(b) = ( . . . ; u_2(b) + θ v_2(b) + φ w_2(b) ; u_3(b) + θ v_3(b) + φ w_3(b) ) .     (7.28)

If we require that ~z satisfy the BVP (7.26), we must have

~z(b) = ( . . . ; β ; γ ) .     (7.29)

From Eqs. (7.28) and (7.29) we form a system of two linear equations for the unknown coefficients θ and φ:

θ v_2(b) + φ w_2(b) = β − u_2(b) ,
θ v_3(b) + φ w_3(b) = γ − u_3(b) .     (7.30)

Solving this linear system, we obtain values θ_0 and φ_0 such that the corresponding ~z = ~u + θ_0~v + φ_0~w solves the BVP (7.26) and hence the original BVP (7.25).

7.4 Caveat with the shooting method, and its remedy, the multiple shooting method

Let us consider the BVP

y′′ = 302 (y − 1 + 2x) ,

y(0) = 1, y(b) = 1− 2b ; b > 0.(7.31)

Its exact solution isy = 1− 2x ; (7.32)

by Theorem 6.2 this solution is unique. Note that the general solution of only the ODE (withoutboundary conditions) in (7.31) is

y = 1− 2x + Ae30x + B e−30x . (7.33)


Now let us try to use the shooting method to solve the BVP (7.31). Following the lines of Sec. 7.2, we set up auxiliary IVPs

u′′ = 30² (u − 1 + 2x) ,     v′′ = 30² v ,
u(0) = 1 ,  u′(0) = 0 ;     v(0) = 0 ,  v′(0) = 1 ;     (7.34)

and solve them. The exact solutions of (7.34) are:

u = 1 − 2x + (1/30) ( e^{30x} − e^{−30x} ) ,     v = (1/60) ( e^{30x} − e^{−30x} ) .     (7.35)

Then Eq. (7.11) provides the value of the auxiliary parameter θ0:

θ_0 = [ (1 − 2b) − ( 1 − 2b + (1/30)(e^{30b} − e^{−30b}) ) ] / [ (1/60)(e^{30b} − e^{−30b}) ] ≈ [ −(1/30) e^{30b} ] / [ (1/60) e^{30b} ] = −2 .     (7.36)

Above we have used the '≈' sign to emphasize that we have kept only the largest terms in each of the numerator and denominator; but, as a matter of fact, simple algebra shows that θ_0 = −2 exactly. Indeed,

u(7.35) + (−2) · v(7.35) = 1− 2x = exact solution. (7.37)

However, in any realistic situation, the value of θ_0 will be determined with a round-off error, i.e. instead of (7.36) we should expect to get

θ0 = −2 + ε , (7.38)

where ε is the round-off error. Then the solution that one obtains from the auxiliary IVPs (7.34) and Eqs. (7.35) and (7.38) is:

y = u_{(7.35)} + (−2 + ε) · v_{(7.35)} = 1 − 2x + (ε/60) ( e^{30x} − e^{−30x} ) .     (7.39)

In Matlab, ε ∼ 10⁻¹⁵. Suppose b = 1.4; then the last term in (7.39) is

(ε/60) ( e^{30b} − e^{−30b} ) ≈ 10⁻¹⁵ · (1/60) · e^{30·1.4} ≈ 29 ,

which is much larger than either of the first two terms. Thus, the exact solution will be drowned in the round-off error amplified by a large exponential factor. This is illustrated in the figure on the right, where both the exact and the numerical solutions for b = 1.4 are shown.

[Figure: the exact solution y = 1 − 2x and the numerical solution on 0 ≤ x ≤ 1.4; the numerical solution follows the exact one until near x = 1.4, where it shoots up to about 60.]

The way in which the shooting method is to be modified in order to handle the above problem is suggested by this figure. Namely, one can see that the numerical solution is quite accurate up to the vicinity of the right end point, where x ≈ 1.4 and the factor e^{30x} overtakes the terms of the exact solution. Therefore, if we split the interval [0, 1.4] into two adjoining subintervals, [0, 0.7] and [0.7, 1.4], and perform shooting on each of these subintervals, then the corresponding exponential factor, e^{30·0.7} ∼ 10⁹, will be too small to distort the numerical solution, because then ε · e^{30·0.7} ∼ 10⁻¹⁵ · 10⁹ = 10⁻⁶ ≪ 1.


Below we show the implementation details of this approach, known as the multiple shooting method. (Obviously, the name comes from the fact that the shooting is performed in multiple (sub)intervals.) These details are worked out for the case of two subintervals, [0, b/2] and [b/2, b]; a generalization for the case of more subintervals is fairly straightforward.

Consider two sets of auxiliary IVPs that are similar to the IVPs (7.34):

On [0, b/2]:

u^(1) ′′ = 30² (u^(1) − 1 + 2x) ,     v^(1) ′′ = 30² v^(1) ,
u^(1)(0) = α ,  u^(1) ′(0) = 0 ;     v^(1)(0) = 0 ,  v^(1) ′(0) = 1 ;     (7.40)

On [b/2, b]:

u^(2) ′′ = 30² (u^(2) − 1 + 2x) ,     v^(2,1) ′′ = 30² v^(2,1) ,     v^(2,2) ′′ = 30² v^(2,2) ,
u^(2)(b/2) = 0 ,  u^(2) ′(b/2) = 0 ;     v^(2,1)(b/2) = 1 ,  v^(2,1) ′(b/2) = 0 ;     v^(2,2)(b/2) = 0 ,  v^(2,2) ′(b/2) = 1 .     (7.41)

Note 1: The initial condition for u^(1) at x = 0 is denoted as α, even though in the example considered α = 1. This is done to emphasize that the given initial condition is always used for u^(1) at the left end point of the original interval.
Note 2: Note that in the 2nd subinterval (and, in general, in the kth subinterval with k ≥ 2), the initial conditions for the u^(k) must be taken to always be zero. (This is stressed in the u-system in (7.41) by putting the initial condition for u^(2) in a box.)

Continuing with solving the IVPs (7.40) and (7.41), we construct solutions

w^(1) = u^(1) + θ^(1) v^(1) ,     w^(2) = u^(2) + θ^(2,1) v^(2,1) + θ^(2,2) v^(2,2) ,     (7.42)

where the numbers θ^(1), θ^(2,1), and θ^(2,2) are to be determined. Namely, these three numbers are determined from three requirements:

w^(1)(b/2) = w^(2)(b/2) ,
w^(1) ′(b/2) = w^(2) ′(b/2) ;     (the solution and its derivative must be continuous at x = b/2)     (7.43)

and

w^(2)(b) = β (= 1 − 2b) .     (the solution must satisfy the boundary condition at x = b)     (7.44)

Equations (7.43) and (7.44) yield:

w^(1)(b/2) = w^(2)(b/2)   ⇒   u^(1)(b/2) + θ^(1) v^(1)(b/2) = 0 + θ^(2,1) · 1 + θ^(2,2) · 0 ;
w^(1) ′(b/2) = w^(2) ′(b/2)   ⇒   u^(1) ′(b/2) + θ^(1) v^(1) ′(b/2) = 0 + θ^(2,1) · 0 + θ^(2,2) · 1 ;
w^(2)(b) = β   ⇒   u^(2)(b) + θ^(2,1) v^(2,1)(b) + θ^(2,2) v^(2,2)(b) = β .     (7.45)

In writing out the r.h.s.'s of the first two equations above, we have used the boundary conditions of the IVP (7.41).

The three equations (7.45) form a linear system for the three unknowns θ^(1), θ^(2,1), and θ^(2,2). (Recall that u^(1)(b/2) etc. are known from solving the IVPs (7.40) and (7.41).) Thus, finding the θ^(1), θ^(2,1), and θ^(2,2) from (7.45) and substituting them back into (7.42), we obtain the solution to the original BVP (7.31).
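A compact Python sketch of this two-subinterval scheme for the BVP (7.31) follows; the RK4 integrator and the number of steps are arbitrary implementation choices.

    import numpy as np

    b, alpha = 1.4, 1.0
    beta = 1 - 2*b
    F_inh = lambda x, s: np.array([s[1], 900*(s[0] - 1 + 2*x)])   # u-type IVPs
    F_hom = lambda x, s: np.array([s[1], 900*s[0]])               # v-type IVPs

    def shoot(F, s0, x0, x1, n=4000):
        # RK4 integration from x0 to x1; returns [value, derivative] at x1
        h, s = (x1 - x0)/n, np.array(s0, float)
        for k in range(n):
            xk = x0 + k*h
            k1 = F(xk, s); k2 = F(xk + h/2, s + h/2*k1)
            k3 = F(xk + h/2, s + h/2*k2); k4 = F(xk + h, s + h*k3)
            s = s + h/6*(k1 + 2*k2 + 2*k3 + k4)
        return s

    m = b/2
    u1  = shoot(F_inh, [alpha, 0.0], 0.0, m)      # (7.40)
    v1  = shoot(F_hom, [0.0, 1.0],   0.0, m)
    u2  = shoot(F_inh, [0.0, 0.0],   m, b)        # (7.41)
    v21 = shoot(F_hom, [1.0, 0.0],   m, b)
    v22 = shoot(F_hom, [0.0, 1.0],   m, b)

    # linear system (7.45) for theta^(1), theta^(2,1), theta^(2,2):
    Amat = np.array([[v1[0], -1.0,  0.0],
                     [v1[1],  0.0, -1.0],
                     [0.0, v21[0], v22[0]]])
    rhs = np.array([-u1[0], -u1[1], beta - u2[0]])
    th1, th21, th22 = np.linalg.solve(Amat, rhs)
    print(th1)                           # close to the exact slope y'(0) = -2
    print(u1[0] + th1*v1[0], 1 - 2*m)    # w at x = b/2 vs the exact value -0.4

Unlike the single-interval shooting of the previous page, the computed solution now stays close to the exact one: the residual round-off amplification is only of the order ε e^{30·0.7}, not ε e^{30·1.4}.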


Important note: The multiple shooting method, at least in the above form, can only be used for linear BVPs, because it is only for them that the linear superposition principle, allowing us to write

w = u + θv ,

can be used.

Further reading on the multiple shooting method can be found in:
• P. Deuflhard, "Recent advances in multiple shooting techniques," in Computation techniques for ODEs, I. Gladwell and D.K. Sayers, eds. (Academic Press, 1980);
• J. Stoer and R. Bulirsch, "Introduction to numerical analysis" (Springer Verlag, 1980);
• G. Hall and J.M. Watt, "Modern numerical methods for ODEs" (Clarendon Press, 1976).

7.5 Shooting method for nonlinear BVPs

As has just been noted above, for nonlinear BVPs, the linear superposition of the auxiliary solutions u and v cannot be used. However, one can still proceed with using the shooting method by following the general guidelines of Sec. 7.1.

As an example, let us consider the BVP

y′′ = y² / (2 + x) ,
y(0) = 1 ,  y(2) = 1 .     (7.46)

We again consider the auxiliary IVP

y′_1 = y_2 ,
y′_2 = y_1² / (2 + x) ;
y_1(0) = 1 ,  y_2(0) = θ .     (7.47)

The idea is now to find the right value(s) of θ iteratively. To motivate the iteration algorithm, let us actually solve the IVP (7.47) for an (equidistant) set of θ's inside some large interval and look at the result, which is shown in the figure on the right. (The reason this particular interval of θ values is chosen is simply because the instructor knows what the result should be.) This figure shows that the boundary condition at the right end point,

y(2)|_θ = 1 ,  with y(2)|_θ viewed as a function of θ ,     (7.48)

can be considered as a nonlinear algebraic equation with respect to θ. Correspondingly, we can employ well-known methods of solving nonlinear algebraic equations for solving nonlinear BVPs.

[Figure: y(2) as a function of θ for −15 ≤ θ ≤ 0; the curve crosses the prescribed boundary value y(2) = 1 at two points, θ̲ and θ̄.]


Probably the simplest such method is the secant method. Below we will show how to use it to find the values θ̲ and θ̄ for which y(2) = 1 (see the figure). Suppose we have tried two values, θ_1 and θ_2, and found the corresponding values y(2)|_{θ_1} and y(2)|_{θ_2}. Denote

F(θ_k) = y(2)|_{θ_k} − 1 ,     k = 1, 2 .     (7.49)

Thus, our goal is to find the roots of the equation

F (θ) = 0 . (7.50)

Given the first two values of F (θ) at θ = θ1,2, the secant method proceeds as follows:

θ_{k+1} = θ_k − F(θ_k) / [ ( F(θ_k) − F(θ_{k−1}) ) / ( θ_k − θ_{k−1} ) ] ,  and then compute F(θ_{k+1}) from (7.49).     (7.51)

The iterations are stopped when |F(θ_{k+1}) − F(θ_k)| becomes less than a prescribed tolerance. In this manner, one will find the values θ̲ and θ̄ and hence the corresponding two solutions of the nonlinear BVP (7.46). You will be asked to do so in a homework problem.
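A Python sketch of the whole loop (RK4 for the IVP (7.47) plus the secant iteration (7.51)) follows; the starting guesses are hypothetical and only meant to land the iteration near the two different roots visible in the figure.

    import numpy as np

    def y_at_2(theta, n=2000):
        # solve the IVP (7.47) by RK4 and return y(2)
        F = lambda x, s: np.array([s[1], s[0]**2/(2 + x)])
        h, s = 2.0/n, np.array([1.0, theta])
        for k in range(n):
            xk = k*h
            k1 = F(xk, s); k2 = F(xk + h/2, s + h/2*k1)
            k3 = F(xk + h/2, s + h/2*k2); k4 = F(xk + h, s + h*k3)
            s = s + h/6*(k1 + 2*k2 + 2*k3 + k4)
        return s[0]

    def secant(t0, t1, tol=1e-10, maxit=50):
        F0, F1 = y_at_2(t0) - 1, y_at_2(t1) - 1          # (7.49)
        for _ in range(maxit):
            if abs(F1 - F0) < tol:
                break
            t0, t1 = t1, t1 - F1/((F1 - F0)/(t1 - t0))   # (7.51)
            F0, F1 = F1, y_at_2(t1) - 1
        return t1

    print(secant(-1.0, -0.5))     # converges to one root of F(theta) = 0
    print(secant(-12.0, -11.0))   # starting guesses aimed at the other root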

7.6 Broader applicability of the shooting method

We will conclude with two remarks. The first will outline another case where the shooting method can be used. The other will mention an important case where this method cannot be used.

7.6.1 Shooting method for finding discrete eigenvalues

Consider a BVP

y′′ + ( 2 sech²x − λ² ) y = 0 ,     x ∈ (−∞, ∞) ,     y(|x| → ∞) → 0 .     (7.52)

Here the term 2 sech²x could be generalized to any "potential" V(x) that has one or several "humps" in its central region and decays to zero as |x| → ∞. Such a BVP is solvable (i.e., a y(x) can be found such that y(|x| → ∞) → 0) only for some special values of λ, called the eigenvalues of this BVP. The corresponding solution, called an eigenfunction, is a localized "blob", which has some "structure" in the region where the potential is significantly different from zero and which vanishes at the ends of the infinite line. An example of such an eigenfunction is shown on the right. Note that in general, an eigenfunction may have a more complicated structure at the center than just a single "hump".

[Figure: an eigenfunction y(x) on −10 ≤ x ≤ 10: a single hump centered at x = 0, vanishing toward both ends.]


A variant of the shooting method which can find these eigenvalues is the following. First, since one cannot literally model the infinite line interval (−∞, ∞), consider the above BVP on the interval [−R, R] for some reasonably large R (say, R = 10). For a given λ in the BVP, choose the initial conditions for the shooting as

y(−R) = y_0 ,     y′(−R) = λ · y(−R) ,     (7.53)

for some very small y_0 which we will discuss later. The reason behind the above relation between y′(−R) and y(−R) is this. Since the potential 2 sech²x (almost) vanishes at |x| = R, then (7.52) reduces to y′′ − λ²y ≈ 0, and hence y′ ≈ λy at x = −R. Note that of the two possibilities y′ ≈ λy and y′ ≈ −λy which are implied by y′′ − λ²y ≈ 0, we have chosen the former, because it is its solution,

y = e^{λx} ,     (7.54)

which agrees with the behavior of the eigenfunction at the left end of the real line (see the figure above).

The constant y0 in (7.53) can be taken as

y_0 = e^{−cR} ,     (7.55)

where the constant c is of order one. Often one can simply take c = 1.
Now, compute the solution of the IVP consisting of the ODE in (7.52) and the initial condition (7.53) and record the value y(R). This value can be denoted as G(λ) since it has been obtained for a particular value of λ: G(λ) ≡ y(R)|_λ. Repeat this process for values of λ = λ_min + jΔλ, j = 0, 1, 2, . . . in some specified interval [λ_min, λ_max]; as a result, one obtains a set of points representing a curve G(λ). Those values of λ where this curve crosses zero correspond to the eigenvalues¹³. Indeed, there y(R) = 0, which is the approximate relation satisfied by eigenfunctions of (7.52) at x = R for R ≫ 1.
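The scan over λ takes only a few lines in Python; for the potential 2 sech²x of (7.52), the BVP has the eigenvalue λ = 1 with eigenfunction sech x, so the sketch below should report a sign change of G(λ) near 1. (The scan range and step are arbitrary choices.)

    import numpy as np

    R, n = 10.0, 4000
    h = 2*R/n

    def G(lam):
        # shoot from x = -R with the initial conditions (7.53), (7.55), c = 1
        F = lambda x, s: np.array([s[1], (lam**2 - 2/np.cosh(x)**2)*s[0]])
        s = np.array([np.exp(-R), lam*np.exp(-R)])
        for k in range(n):
            xk = -R + k*h
            k1 = F(xk, s); k2 = F(xk + h/2, s + h/2*k1)
            k3 = F(xk + h/2, s + h/2*k2); k4 = F(xk + h, s + h*k3)
            s = s + h/6*(k1 + 2*k2 + 2*k3 + k4)
        return s[0]                      # y(R) for this lambda

    lams = np.arange(0.2, 2.0, 0.05)
    vals = np.array([G(l) for l in lams])
    print(lams[:-1][np.sign(vals[:-1]) != np.sign(vals[1:])])   # near 1.0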

7.6.2 Inapplicability of the shooting method in higher dimensions

Boundary value problems considered in this section are one-dimensional, in that they involve the derivative with respect to only one variable x. Such BVPs often arise in the description of one-dimensional objects such as beams and strings. A natural generalization of these to two dimensions are plates and membranes. For example, a classic Helmholtz equation

∂²u/∂x² + ∂²u/∂y² + k²u = 0 ,     (7.56)

where u = 0 along the boundaries of a square with vertices (x, y) = (0, 0), (1, 0), (1, 1), (0, 1), arises in the mathematical description of oscillations of a square membrane.

One could attempt to obtain a solution of this BVP by shooting from, say, the left side of this square to the right side. However, not only is this tedious to implement while accounting for all possible combinations of ∂u/∂x along the side (x = 0, 0 ≤ y ≤ 1), but also the results of such shooting will be dominated by numerical error and will have nothing in common with the true solution. The reason is that IVPs for certain two-dimensional equations, of which (7.56) is a particular case, are ill-posed. We will not go further into this issue¹⁴ since it would substantially rely on the material studied in a course on partial differential equations. What is important to remember out of this brief discussion is that the shooting method can be used only for one-dimensional BVPs.

¹³In practice, one uses a more accurate method, but the description of this technical detail is outside the scope of this lecture.

In the next lecture we will introduce alternative methods that can be used to solve BVPs both in one and many dimensions. However, we will only consider their applications in one dimension.

7.7 Questions for self-assessment

1. Explain the basic idea behind the shooting method (Sec. 7.1).

2. Why did we require that Q(x) ≤ 0 in (7.4)?

3. Verify (7.8).

4. Suppose that Q(x) > 0 and hence we can have v(b) = 0, as in (7.12). Does (7.12) mean only that the BVP (7.4) will have no solutions, or are there other possibilities?

5. Verify (7.15) and (7.16).

6. Verify (7.20).

7. Verify (7.26).

8. Suppose you need to solve a 5th-order linear BVP. How many auxiliary systems do you need to consider? How many parameters (analogues of θ) do you need to introduce and then solve for?

9. What allows us to say that by Theorem 6.2 the solution (7.32) is unique?

10. Verify (7.37) and (7.39).

11. What causes the large deviation of the numerical solution from the exact one in the figure found under Eq. (7.39)?

12. Describe the key idea behind the multiple shooting method.

13. Suppose we use 3 subintervals for multiple shooting. How many parameters analogous to θ^(1), θ^(2,1), and θ^(2,2) will we need? What are the meanings of the conditions from which these parameters can be determined?

14. Verify that the r.h.s.’es in (7.45) are correct.

15. Describe the key idea behind the shooting method for nonlinear BVPs.

16. For a general nonlinear BVP, can one tell how many solutions one will find?

¹⁴We will, however, arrive at the same conclusion, but from another viewpoint, in Lecture 11.


8 Finite-difference methods for BVPs

In this Lecture, we consider methods for solving BVPs whose idea is to replace the BVP with a system of approximating algebraic equations. In the first five sections, we will (mostly) deal with linear BVPs, so that the corresponding system of equations will be linear. The last section will show how nonlinear BVPs can be approached.

8.1 Matrix problem for the discretized solution

Let us begin by considering a linear BVP with Dirichlet boundary conditions:

y′′ + P(x) y′ + Q(x) y = R(x) ,
y(a) = α ,  y(b) = β .     (8.1)

As before, we assume that P, Q, and R are twice continuously differentiable, so that y is a four times continuously differentiable function of x. Also, we consider the case where Q(x) ≤ 0 on [a, b], so that the BVP (8.1) is guaranteed by Theorem 6.2 to have a unique solution.

Let us replace y′′ and y′ in (8.1) by their second-order accurate discretizations:

y′′ = (1/h²) ( y_{n+1} − 2y_n + y_{n−1} ) + O(h²) ,     (8.2)

y′ = (1/(2h)) ( y_{n+1} − y_{n−1} ) + O(h²) .     (8.3)

Upon omitting the O(h²)-terms and substituting (8.2) and (8.3) into (8.1), we obtain the following system of linear equations:

Y_0 = α ;
( 1 + (h/2)P_n ) Y_{n+1} − ( 2 − h²Q_n ) Y_n + ( 1 − (h/2)P_n ) Y_{n−1} = h²R_n ,   1 ≤ n ≤ N − 1 ;
Y_N = β .     (8.4)

In the matrix form, this is

A ~Y = ~r ,     (8.5)

where

A = [ −(2 − h²Q_1) ,  1 + (h/2)P_1 ,  0 ,  0 ,  · · · ,  0 ;
      1 − (h/2)P_2 ,  −(2 − h²Q_2) ,  1 + (h/2)P_2 ,  0 ,  · · · ,  0 ;
      0 ,  1 − (h/2)P_3 ,  −(2 − h²Q_3) ,  1 + (h/2)P_3 ,  0 ,  0 ;
      · · · ;
      0 ,  · · · ,  0 ,  1 − (h/2)P_{N−2} ,  −(2 − h²Q_{N−2}) ,  1 + (h/2)P_{N−2} ;
      0 ,  · · · ,  0 ,  0 ,  1 − (h/2)P_{N−1} ,  −(2 − h²Q_{N−1}) ] ,     (8.6)

~Y = [ Y_1, Y_2, . . . , Y_{N−1} ]^T , and

~r = [ h²R_1 − (1 − (h/2)P_1) α ,  h²R_2 ,  h²R_3 ,  · · · ,  h²R_{N−2} ,  h²R_{N−1} − (1 + (h/2)P_{N−1}) β ]^T ;     (8.7)

the superscript 'T' in (8.7) denotes the transpose.
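The assembly of (8.6)-(8.7) is mechanical; below is a Python sketch for a sample BVP (y′′ − y = x, y(0) = 0, y(1) = 1, i.e. P = 0, Q = −1, R = x, an arbitrary illustrative choice with a known exact solution).

    import numpy as np

    a, b, alpha, beta, N = 0.0, 1.0, 0.0, 1.0, 100
    h = (b - a)/N
    x = a + h*np.arange(N + 1)
    P, Q, R = np.zeros(N + 1), -np.ones(N + 1), x.copy()

    A = np.zeros((N - 1, N - 1))
    r = np.zeros(N - 1)
    for i, n in enumerate(range(1, N)):        # interior nodes n = 1, ..., N-1
        A[i, i] = -(2 - h**2*Q[n])
        if i > 0:     A[i, i - 1] = 1 - h/2*P[n]
        if i < N - 2: A[i, i + 1] = 1 + h/2*P[n]
        r[i] = h**2*R[n]
    r[0]  -= (1 - h/2*P[1])*alpha              # boundary terms of (8.7)
    r[-1] -= (1 + h/2*P[N - 1])*beta

    Y = np.linalg.solve(A, r)                  # dense solve; see Sec. 8.2 for better
    exact = -x[1:-1] + 2*np.sinh(x[1:-1])/np.sinh(1.0)
    print(np.abs(Y - exact).max())             # small, and O(h^2) as h is refined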


From Linear Algebra, it is known that the linear system (8.5) has a unique solution if (and only if) the matrix A is nonsingular. Therefore, below we list some results that will allow us to guarantee that under certain conditions, a particular A is nonsingular.

Gerschgorin Circles Theorem 8.1 Let a_{ij} be entries of an M × M matrix A and let λ_k, k = 1, . . . , M be the eigenvalues of A. Then:

(i) Each eigenvalue lies in the union of “row circles” Ri, where

R_i = { z :  |z − a_{ii}| ≤ Σ_{j=1, j≠i}^{M} |a_{ij}| } .     (8.8)

(In words, Σ_{j=1, j≠i}^{M} |a_{ij}| is the sum of all off-diagonal entries in the ith row.)

(ii) Similarly, since A and A^T have the same eigenvalues, then each eigenvalue also lies in the union of "column circles" C_j, where

C_j = { z :  |z − a_{jj}| ≤ Σ_{i=1, i≠j}^{M} |a_{ij}| } .     (8.9)

(In words, Σ_{i=1, i≠j}^{M} |a_{ij}| is the sum of all off-diagonal entries in the jth column.)

(iii) Let ⋃_{i=k}^{l} R_i be a cluster of (l − k + 1) row circles that is disjoint from all the other row circles. Then it contains exactly (l − k + 1) eigenvalues.
A similar statement holds for column circles.

Example Use the Gerschgorin Circles Theorem to estimate eigenvalues of a matrix

E = [ 1, 2, −1;  1, 1, 1;  1, 3, −1 ] .     (8.10)

Solution The circles are listed and sketched below:

R_1 = {z : |z − 1| ≤ |2| + |−1| = 3}     C_1 = {z : |z − 1| ≤ |1| + |1| = 2}
R_2 = {z : |z − 1| ≤ |1| + |1| = 2}     C_2 = {z : |z − 1| ≤ |2| + |3| = 5}
R_3 = {z : |z + 1| ≤ |1| + |3| = 4}     C_3 = {z : |z + 1| ≤ |1| + |−1| = 2}


[Figure: two sketches in the complex z-plane, showing the row circles R_1 and R_3 (left) and the column circles C_2 and C_3 (right).]

(Circles R_2 and C_1 are not sketched because they are concentric with, and lie entirely within, circles R_1 and C_2, respectively.)

Gerschgorin Circles Theorem says that the eigenvalues of E must lie within the intersection of ⋃_{i=1}^{3} R_i and ⋃_{j=1}^{3} C_j. For example, this gives that |Re λ| ≤ 4 for each and any of the eigenvalues.
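This prediction takes only a couple of lines to confirm numerically (a quick spot-check, not a substitute for the theorem):

    import numpy as np

    E = np.array([[1., 2., -1.], [1., 1., 1.], [1., 3., -1.]])
    lam = np.linalg.eigvals(E)
    print(lam, np.abs(lam.real).max())   # all real parts lie within [-4, 4]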

Before we can apply the Gerschgorin Circles Theorem, we need to introduce some new terminology.

Definition Matrix A is called diagonally dominant if

either  |a_{ii}| ≥ Σ_{j=1, j≠i}^{M} |a_{ij}|  or  |a_{ii}| ≥ Σ_{j=1, j≠i}^{M} |a_{ji}| ,   1 ≤ i ≤ M ,     (8.11)

with the strict inequality holding for at least one i.
A matrix is called strictly diagonally dominant (SDD) if

either  |a_{ii}| > Σ_{j=1, j≠i}^{M} |a_{ij}|  or  |a_{ii}| > Σ_{j=1, j≠i}^{M} |a_{ji}|  for all i = 1, . . . , M .     (8.12)

In other words, in an SDD matrix, the sums of the off-diagonal elements along either every row or every column are less than the corresponding diagonal entries.

Theorem 8.2 If a matrix A is SDD, then it is nonsingular.

Proof If A is SDD, then one of the inequalities (8.12) holds. Suppose the inequality for therows holds. Comparing that inequality with the r.h.s. of (8.8), we conclude, by the GerschgorinCircles Theorem, that point λ = 0 is outside of the union

⋃Mi=1 Ri of Gerschgorin circles. Hence

it is automatically outside of the intersection of the unions⋃M

i=1 Ri and⋃M

i=1 Ci. Therefore,λ = 0 is not an eigenvalue of A, hence A is nonsingular. q.e.d.

Theorem 8.3 Consider the BVP (8.1). If Q(x) ≤ 0, and if P (x) is bounded on [a, b] (i.e.|P (x)| ≤ P for some P), then the discrete version (8.4) of the BVP in question has a uniquesolution, provided that the step size satisfies hP ≤ 2.

Page 78: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 77

Proof Case (a): Q(x) < 0. In this case, matrix A in (8.6) is SDD, provided that hP ≤ 2.Indeed, A is tridiagonal, with the diagonal elements being

aii = −(2− h2Qi), ⇒ |aii| > 2 . (8.13)

The sum of the absolute values of the off-diagonal elements is

|ai,i+1|+ |ai,i−1| =

∣∣∣∣1 +h

2Pi

∣∣∣∣ +

∣∣∣∣1−h

2Pi

∣∣∣∣ =

(1 +

h

2Pi

)+

(1− h

2Pi

)= 2 . (8.14)

In removing the absolute value signs in the above equation, we have used the fact that hP ≤ 2.Now, comparing (8.13) with (8.14), we see that

|ai,i+1|+ |ai,i−1| < |aii| ,

which means that A is SDD and hence by Theorem 8.2, (8.4) has a unique solution. q.e.d.Case (b): Q(x) ≤ 0 requires a more involved proof, which we omit.

Note: We emphasize that Theorem 8.3 gives a bound for the step size,

h · maxx∈[a, b]

|P (x)| ≤ 2 , (8.15)

which is a sufficient condition for the discretization (8.4) (with Q(x) ≤ 0) to be solvable.

8.2 Thomas algorithm

In the previous section, we have considered the issue of the possibility to obtain the uniquesolution of the linear system (8.4), which (the system) approximates the BVP (8.1). In thissection, we will consider the issue of solving (8.4) in an efficient manner. The key fact thatwill allow us to do so is the tridiagonal form of A; that is, A has nonzero entries only on themain diagonal and on the subdiagonals directly above and below it. Let us recall that in orderto solve15 a linear system of the form (8.5) with a full M ×M matrix A, one requires O(M3)operations. However, this would be an inefficient way to solve a linear system with a tridiagonalA. Below we will present an algorithm, which has been discovered independently by severalresearchers in the 50’s, that allows one to solve a linear system with a tridiagonal matrix usingonly O(M) operations.

A common practical way to numerically solve a linear system of the form

A~y = ~r (8.5)

(with any matrix A) is via LU decomposition. Namely, one seeks two matrices L (lowertriangular) and U (upper triangular) such that 16

LU = A . (8.16)

Then (8.5) is solved in two steps:

Step 1 : L~z = ~r and Step 2 : U~y = ~z . (8.17)

15by Gaussian elimination or by any other direct (i.e., non-iterative) method16Disclaimer: The theory of LU decomposition is considerably more involved than the simple excerpt from

it given here. We will not go into further details of that theory in this course.

Page 79: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 78

The linear systems in (8.17) are then solved using forward (for Step 1) and backward (for Step2) substitutions; details of this will be provided later on.

When A is tridiagonal, both finding the matrices L and U and solving the systems in (8.17)requires only O(M) operations. You will be asked to demonstrate that in a homework problem.Here we will present details of the algorithm itself.

Let

A =

b1 c1 0 0 · · · 0

a2 b2 c2 0 · · · 0

0 a3 b3 c3 · · · 0

· · · · · · · ·0 · · · 0 aM−1 bM−1 cM−1

0 · · · 0 0 aM bM

; (8.18)

then we seek L and U in the form:

LU =

1 0 0 · 0

α2 1 0 · 0

0 α3 1 · 0

· · · · ·0 · 0 αM 1

β1 c1 0 · 0

0 β2 c2 · 0

· · · · ·0 · 0 βM−1 cM−1

0 · 0 0 βM

(8.19)

Multiplying the matrices in (8.19) and comparing the result with (8.18), we obtain:

Row 1: β1 = b1;

Row 2: α2β1 = a2, α2c1 + β2 = b2;

Row j: αjβj−1 = aj, αjcj−1 + βj = bj .

(8.20)

Equations (8.20) are easily solved for the unknown coefficients αj and βj:

β1 = b1,

αj = aj/βj−1, βj = bj − αjcj−1, j = 2, . . . , M .(8.21)

Finally, we show how the systems in (8.17) can be solved. Let

~z = [z1, z2, · · · , zM ]T , etc.

Then the forward substitution in L~z = ~r gives:

z1 = r1,

zj = rj − αjzj−1, j = 2, . . . , M .(8.22)

The backward substitution in U~y = ~z gives:

yM = zM/βM ,

yj = (zj − cjyj+1)/βj, j = M − 1, . . . , 1 .(8.23)

Thus, ~y is found in terms of ~r.

Page 80: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 79

The entire procedure (8.21) to (8.23) is fast, as we said earlier, and also requires the storageof only 8 one-dimensional arrays of size O(M) for the coefficients {aj}, {bj}, {cj}, {αj}, {βj},{rj}, {zj}, and {yj}. Moreover, it is possible to show that when A is SDD, i.e. when

|bj| > |aj|+ |cj|, j = 1, . . . ,M, (8.24)

then small round-off (or any other) errors do not get amplified by this algorithm; specifically,the numerical error remains small and independent of the size M of the problem17.

To conclude this section, we note that similar algorithms exist for other banded (e.g., pen-tadiagonal) matrices. The details of those algorithms can be found, e.g., in Sec. 3-2 of W.Ames, “Numerical methods for partial differential equations,” 3rd ed. (Academic Press, 1992).

8.3 Error estimates, and higher-order discretization

In this section, we will state, without proof, two theorems about the accuracy of the solutionsof systems of discretized equations approximating a given BVP.

Theorem 8.4 Let {Yn}N−1n=1 be the solution of the discretized problem (8.4) (or, which

is the same, (8.5)–(8.7)) and let y(x) be the exact solution of the original BVP (8.1); thenεn = y(xn) − Yn is the error of the numerical solution. In addition, let P = maxx∈[a, b] |P (x)|and also let Q(x) ≤ Q < 0 (recall that Q(x) ≤ 0 is required for the BVP to have a uniquesolution). Then the error satisfies the following estimate:

max |εn| ≤ 1

h2 (|Q|+ 8(b−a)2

)

(1

12h4(M4 + PM3) + 2ρ

), (8.25)

where M3 = maxx∈[a, b] |y′′′|, M4 = maxx∈[a, b] |y′′′′|, and ρ is the round-off error.When the round-off error is neglected, estimate (8.25) yields that the discrete approximation

(8.4) to the BVP produces the error on the order of O(h2) (i.e., in other words, is second-orderaccurate). At first sight, this is similar to how the finite-difference approximations (8.2) and(8.3) led to second-order accurate methods for IVPs. In fact, for both the IVPs and BVPs, theexpression inside the largest parentheses on the r.h.s. of (8.25) is the local truncation error.However, the interpretation of the O(1/h2) factor in front of those parentheses differs fromthe interpretation of the similar factor for IVPs. In the latter case, that factor arose fromaccumulation of the error over O(1/h) steps. On the contrary, for the BVPs, that factor is theso-called condition number of the matrix A in (8.5), defined as

cond A = ||A|| · ||A−1|| , (8.26)

where ||A|| is the norm of matrix A, subordinate to a particular vector norm18. In courses onNumerical Analysis, it is shown that the condition number is the factor by which a small errorin the vector ~r is amplified in the corresponding error of the solution ~Y:

A (~Y + ~δY) = ~r + ~δr ⇒ || ~δY||||~Y||

= condA · ||~δr||||~r|| . (8.27)

17In line with the earlier disclaimer about the LU decomposition, we also note that condition (8.24) is justthe simplest of possible conditions that guarantee boundedness of the error. Other conditions exist and arestudied in advanced courses in Numerical Analysis.

18Definition (8.26) is given here only for completemeness of the presentation; we will not need more precisedefinitions of the norms involved.

Page 81: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 80

One can show (but we will not do so here) that the condition number of the matrix A in (8.5)is O( 1

h2 ); this fact, along with Eq. (8.27) and the local truncation error ~δr being O(h4), impliesestimate (8.25) for the solution error.

The numerical error of the discrete approximation of the BVP (8.1) can be significantlyreduced if instead of the simple central-difference approximation to y′′ = f(x, y), as in (8.2),one uses the Numerov’s formula (5.18). Specifically, if y′ does not enter the BVP, i.e. when theBVP is

y′′ = f(x, y) , y(a) = α, y(b) = β, (8.28)

then Numerov’s formula leads to the following system of discrete equations:

Y0 = α;

Yn+1 − 2Yn + Yn−1 = h2

12(fn+1 + 10fn + fn−1) 1 ≤ n ≤ N − 1 ;

YN = β .

(8.29)

In particular, when f(x, y) = −Q(x)y+R(x), system (8.29) is linear. Since the local truncationerror of Numerov’s method is O(h6), the solution of system (8.29) must be a fourth-orderaccurate approximation to the exact solution of the BVP (8.28). More precisely, the followingtheorem holds true:

Theorem 8.5 Let {Yn}N−1n=1 be the solution of the discretized problem (8.29), where

f(x, y) = −Q(x)y + R(x) (so that this system is linear). Then the error εn = y(xn) − Yn

satisfies the following estimate:

max |εn| ≤ 1

h2 (|Q|+ 8(b−a)2

)

(1

240h6M6 + 2ρ

), (8.30)

where the notations are the same as in Theorem 8.4 and, in addition, M6 = maxx∈[a, b] |y(6)|.

8.4 Neumann and mixed boundary conditions

Suppose that at, say, x = a, the boundary condition is

A1y(a) + A2y′(a) = α , (8.31)

where A1 and A2 are some constants. (Recall that for A1 = 0, this is the Neumann boundarycondition, while for A1 · A2 6= 0, this condition is of the mixed type.) Below we present twomethods that allow one to construct generalizations of systems (8.4) (simple central-differencediscretization) and (8.29) (Numerov’s formula) for the case of the boundary condition (8.31)instead of

y(a) = α .

Note that the problem we need to handle for such a generalization is obtaining an approximationto y′(a) in (8.31) that would have the accuracy consistent with the accuracy of the discretizationscheme. That is, the accuracy needs to be second-order for a generalization of (8.4) and fourth-order for a generalization of (8.29).

Method 1This method is efficient only for a generalization of the second-order accurate approximation.Thus, we will look for a second-order accurate finite-difference approximation to y′(a).

Page 82: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 81

Introduce a fictitious point x−1 = a − h and let the approximate solution at that point beY−1. Once one knows Y−1, the approximation sought is

y′(a) =Y1 − Y−1

2h+ O(h2) , (8.32)

so that the equation approximating the boundary condition (8.31) is

x = a : A1Y0 + A2Y1 − Y−1

2h= α . (8.33)

The discrete analog of the ODE itself at x = a is:

(1 +

h

2P0

)Y1 − (2− h2Q0)Y0 +

(1− h

2P0

)Y−1 = h2R0 . (8.34)

Equations (8.33) and (8.34) should replace the first equation,

Y0 = α, (8.35)

in (8.4), so that the dimension of the resulting system of equations in (8.4) increases from(N − 1) × (N − 1) to (N + 1) × (N + 1). Note that adding two new equations (8.33) and(8.34) to system (8.4) is consistent with introducing two new unknowns Y−1 and Y0.

Later on we will see that, rather than dealing with two equations (8.33) and (8.34), it ismore practical to solve (8.33) for Y−1 and substitute the result in (8.34). Then, instead of twoequations (8.33) and (8.34), we will have one equation that needs to replace (8.35):

2Y1 −[2− h2Q0 − 2h

A1

A2

(1− h

2P0

)]Y0 = h2R0 + 2h

α

A2

(1− h

2P0

). (8.36)

(We are justified to assume in (8.36) that A2 6= 0, since otherwise the boundary condition (8.31)becomes Dirichlet.) The resulting system then contains 1 + (N − 1) = N equations for the N

unknowns: Y0 through YN−1, and hence can be solved for a unique solution {Yn}N−1n=0 , unless

the coefficient matrix of this system is singular. We will remark on the latter possibility afterwe describe the other method.

Method 2This method does not use a fictitious point to approximate the boundary condition with therequired accuracy. We will first show how this method can be used for the second-order accurateapproximation (8.4) and then indicate how it (the method) can be modified for the fourth-orderaccurate approximation (8.29).

In analogy with Eq. (3.10) of Lecture 3, one can obtain:

yn+1 − yn

h= y′n +

h

2y′′n + O(h2) . (8.37)

Then, using the ODE in (8.1) to express y′′ as R− Py′ −Qy, one finds:

y′n =yn+1 − yn

h− h

2(Rn − Pny

′n −Qnyn) + O(h2)

=yn+1 − yn

h− h

2

(Rn − Pn

[yn+1 − yn

h+ O(h)

]−Qnyn

)+ O(h2) . (8.38)

Page 83: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 82

Therefore, the boundary condition (8.31) can be approximated by

A1Y0 + A2

[Y1 − Y0

h− h

2

(R0 − P0

Y1 − Y0

h−Q0Y0

)]= α , (8.39)

with the discretization error being O(h2). To answer some of the Questions for self-assessment,you will need to collect the coefficients of Y0 and Y1 in (8.39) to put it in a form similar to thatof (8.36). Such a rearranged Eq. (8.39) should then be used along with the N − 1 equationsin (8.4) for {Yn}N−1

n=1 . Thus we have a system of N equations for N unknowns {Yn}N−1n=0 , which

can be solved (again, unless its coefficient matrix is singular).If one requires a fourth-order accurate approximation to y′(a), one needs to use two more

terms in expansion (8.37):

yn+1 − yn

h= y′n +

h

2y′′n +

h2

6y′′′n +

h3

24y′′′′n + O(h4) (8.40)

and replace the y′′ and higher-order derivatives using y′′n = f(xn, yn) ≡ fn, fn+1 = fn +h ddx

fn +. . . = y′′n + hy′′′n + . . ., etc. The result is:

y′n =yn+1 − yn

h− h

24(7fn + 6fn+1 − fn+2) + O(h4) . (8.41)

Equation (8.41) with n = 0 can then be used to obtain a fourth-order accurate counterpart of(8.39).

Remark 1: Recall that the coefficient matrix in the discretized BVP (8.4) with Dirichletboundary conditions is SDD and hence nonsingular, provided that h satisfies the condition ofTheorem 8.3. We will now state a requirement that the coefficients A1, A2 in the non-Dirichletboundary condition (8.31) must satisfy in order to guarantee that the corresponding BVP has aunique solution. To this end, we consider the situations arising in Methods 1 and 2 separately.

When Eq. (8.36) replaces the Dirichlet boundary condition (8.35) in Method 1, the resultingcoefficient matrix is SDD provided that

A1A2 ≤ 0, (8.42)

in addition to the earlier requirements of Q(x) < 0 and h·max |P (x)| ≤ 2. (You will be asked toverify (8.42) in one of the Questions for self-assessment.) On the other hand, if two individualequations (8.33) and (8.34) are used instead of their combined form (8.36), the correspondingcoefficient matrix is no longer SDD. This is one advantage of using (8.36) instead of (8.33) and(8.34).

Thus, condition (8.42) is sufficient, but not necessary, to guarantee that the correspondingdiscretized BVP has a unique solution. That is, even when (8.42) does not hold, one can stillattempt to solve the corresponding linear system, since strict diagonal dominance is only asufficient, but not necessary, condition for A to be nonsingular.

In the case of Method 2, by collecting the coefficients of Y0 and Y1 in Eq. (8.39), it isstraightforward (although, perhaps, a little tediuos) to show that the coefficient matrix is SDDif (8.42) and the two conditions stated one line below it, are satisfied. You will verify this ina homework problem. The fourth-order accurate analog of (8.39), based on Eq. (8.41), alsoyields the same conditions for strict diagonal dominance of the coefficient matrix.

Remark 2: In the case of the BVP with Dirichlet boundary conditions, the coefficientmatrix is tridiagonal (see Eq. (8.6)), and hence the corresponding linear system can be solved

Page 84: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 83

efficiently by the Thomas algorithm. In the case of non-Dirichlet boundary conditions, one canshow (and you will be asked to do so) that Method 1 based on Eq. (8.36) yields a tridiagonalsystem, but the same method using Eqs. (8.33) and (8.34) does not. This is the other advantageof using the single Eq. (8.36) in this method.

Method 2 for a second-order accurate approximation to the BVP gives a tridiagonal matrix.This can be straightforwardly shown by the same rearrangement of terms in Eq. (8.39) that wasused above to show that the corresponding matrix is SDD. However, the fourth-order accuratemodification of (8.39), based on (8.41), produces a matrix A that is no longer tridiagonal. Onecan handle this situation in two ways. First, if one is willing to sacrifice one order of accuracy inexchange for the convenience of having a tridiagonal coefficient matrix, then instead of (8.41),one can use

y′n =yn+1 − yn

h− h

6(2fn + fn+1) + O(h3) . (8.43)

Then the modification of (8.39) based on (8.43) does result in a tridiagonal coefficient matrix.Alternatively, one can see that matrix A obtained with the fourth-order accurate formula (8.41)differs from a tridiagonal one only in having its (1, 3)th entry nonzero (verify). Thus, A is insome sense “very close” to a tridiagonal matrix, and we can hope that this fact could be used tofind A−1 with only O(M) operations. This can indeed be done by using the algorithm describedin the next section.

8.5 Periodic boundary condition; Sherman–Morrison algorithm

In this section, we will consider the case when BVP (8.1) has periodic boundary conditions.We will see that the coefficient matrix arising in this case is not tridiagonal, but is, in somesense, close to it. We will then present an algorithm that will allow us to find the inverse ofthat matrix using only O(M) operations, where M ×M is the dimension of the matrix. Thesame method can also be used in other situations, including the inversion of matrix A definedin the last paragraph of the preceding section.

We first obtain the analogues of Eqs. (8.5)–(8.7) in the case of the BVP having periodicboundary conditions. Consider the corresponding counterpart of the BVP (8.1):

y′′ + P (x)y′ + Q(x)y = R(x) ,

y(a) = y(b).(8.44)

The corresponding counterpart of system (8.4) is

Y0 = YN ;

(1 + h2Pn)Yn+1 − (2− h2Qn)Yn + (1− h

2Pn)Yn−1 = h2Rn , 0 ≤ n ≤ N .

(8.45)

Note 1: The index n in (8.45) runs from 0 to N , while in (8.4) in runs from 1 to N − 1.Note 2: Since our problem has periodic boundary conditions, it is logical to let

Y−n = YN−n and YN+n = Yn, 0 ≤ n ≤ N .

In particular,Y−1 = YN−1 and YN = Y0 (8.46)

(the last equation here is just the original periodic boundary condition).

Page 85: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 84

In view of Eq. (8.46), the equations in system (8.45) with n = 0 and n = N − 1 can bewritten as follows:

(1 + h2P0)Y1 − (2− h2Q0)Y0 + (1− h

2P0)YN−1 = h2R0 ,

(1 + h2PN−1)Y0 − (2− h2QN−1)YN−1 + (1− h

2PN−1)YN−2 = h2RN−1 .

(8.47)

With this result, system (8.45) can be written in the matrix form (8.5), where now

~Y = [Y0, Y1, . . . , YN−1]T , (8.48)

A =

−(2− h2Q0) (1 + h2P0) 0 0 · · · (1− h

2P0)

(1− h2P2) −(2− h2Q2) (1 + h

2P2) 0 · · · 0

0 (1− h2P3) −(2− h2Q3) (1 + h

2P3) 0 0

· · · · · · · · · · · · · · · · · ·0 · · · 0 (1− h

2PN−2) −(2− h2QN−2) (1 + h

2PN−2)

(1 + h2PN−1) · · · 0 0 (1− h

2PN−1) −(2− h2QN−1)

(8.49)

and~r =

[h2R0, h2R1, h2R2, · · · , h2RN−2, h2RN−1

]T. (8.50)

Matrix A in Eq. (8.49) differes from its counterpart in Eq. (8.6) in two respects: (i) itsdimension is N × N rather than (N − 1) × (N − 1) and (ii) it is not tridiagonal due to theterms in its upper-right and lower-left corners. Such matrices are called circulant.

Thus, to obtain the solution of the BVP (8.44), we will need to solve the linear system (8.5with the non-tridiagonal matrix A. We will now show how this problem can be reduced tothe solution of a system with a tridiagonal matrix. To this end, we first make a preliminaryobservation. Let ~w be a vector whose only nonzero entry is its ith entry and equals wi. Let ~z bea vector whose only nonzero entry is its jth entry and equals zj. Then C = ~w~zT is an N ×N

matrix whose only nonzero entry is Cij = wizj (verify). Similarly, if ~w = [w1, 0, 0, . . . , wN ]T

and ~z = [z1, 0, 0, . . . , zN ]T , then

C = ~w~zT =

w1z1 0 · · · 0 w1zN

0 0 · · · 0 0

· · · · · · · · · · · · · · ·0 0 · · · 0 0

wNz1 0 · · · 0 wNzN

. (8.51)

Therefore, the circulant matrix A in Eq. (8.49) can be represented as:

A = Atridiag + ~w~zT , (8.52)

where Atridiag is some tridiagonal matrix and ~w and ~z are some properly chosen vectors. Notethat while the choice of ~w and ~z allows much freedom, the form of Atridiag is unique for a givencirculant matrix A, once ~w and ~z have been chosen. In a homework problem, you will be askedto make a choice of ~w and ~z and consequently come up with the expression for Atridiag, giventhat A is as in Eq. (8.49).

Linear systems (8.5) with the coefficient matrix A given by (8.52) can be time-efficiently— i.e., in O(M) operations — solved by the so-called Sherman–Morrison algorithm. Thisalgorithm can be found in most textbooks on Numerical Analysis, or online.

Page 86: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 85

8.6 Nonlinear BVPs

The analysis presented in this section is carried out with three restrictions. First, we consideronly BVPs with Dirichlet boundary conditions. Generalizations for boundary conditions of theform (8.31) can be done straightforwardly along the lines of the previous section.

Second, we only consider the BVPs

y′′ = f(x, y), y(a) = α, y(b) = β , (8.28)

which does not involve y′. Although the methods described below can also be adopted to amore general BVP

y′′ = f(x, y, y′), y(a) = α, y(b) = β , (7.1)

the analysis of convergence of those methods to a solution of the BVP (7.1) is significantly morecomplex than the one presented here. Thus, the methods that we will develop in this sectioncan be applied to (7.1) without a guarantee that one will obtain a solution of that BVP.

Third, we will focus our attention on the second-order accurate discretization of the BVPs.The analysis for the fourth-order accurate discretization scheme is essentially the same, andproduces similar results.

Consider the BVP (8.28). The counterpart of system (8.4) for this BVP has the same formas (8.5), i.e.:

A~Y = ~r , (8.5)

where now

A =

−2 1 0 0 · · · 0

1 −2 1 0 · · · 0

0 1 −2 1 · · · 0

· · · · · · · ·0 · · · 0 1 −2 1

0 · · · 0 0 1 −2

; ~r =

h2f(x1, Y1)− α

h2f(x2, Y2)

· · ·h2f(xN−2, YN−2)

h2f(xN−1, YN−1)− β

. (8.53)

Equations (8.5) and (8.53) constitute a system of nonlinear algebraic equations. Below weconsider three methods of iterative solution of such a system. Of these methods, Method 1 isan analogue of the fixed-point iteration method for solving a single linear equation19

A · y = r(y), (8.54)

Method 2 contains its modifications, and Method 3 is the analog of the Newton–Raphsonmethod for (8.54).

Method 1 (Picard iterations)The fixed-point iteration scheme, also called the Picard iteration scheme, for the single nonlinearequation (8.54) is simply

y(k+1) =1

Ar(y(k)), (8.55)

where y(k) denotes the kth iteration of the solution of (8.54). To start the iteration scheme(8.55), one, of course, needs an initial guess y(0).

19The constant A in (8.54) could have been absorbed into the function r(y), but we kept it so as to mimicthe notations of (8.5).

Page 87: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 86

Now, let ~Y(k) denote the kth iteration of the solution of the matrix nonlinear equation (8.5),(8.53). Then the corresponding Picard iteration scheme is

A~Y(k+1) = ~r(~x, ~Y(k)

), k = 0, 1, 2, . . . . (8.56)

Importantly, unlike the original nonlinear Eqs. (8.5) and (8.53), Eq. (8.56) is a linear system.Indeed, at the (k+1)th iteration, the unknown vector ~Y(k+1) enters linearly, while the nonlinearr.h.s. contains ~Y(k) that has been determined at the previous iteration.

Let us invetsigate the rate of convergence of the iteration scheme (8.56). For its one-variablecounterpart, Eq. (8.54), the convergence condition of the fixed-point iterations (8.55) is well-known to be ∣∣∣∣A−1 · dr(y)

dy

∣∣∣∣ < 1 (8.57)

for all y sufficiently close to the fixed point. The condition we will establish for (8.56) will bea direct analog of (8.57).

Before we proceed, let us remark that the analysis of convergence of Picard’s iterationscheme (8.56) can proceed in two slightly different ways. In one way, one can transform (8.56)into an iteration scheme for ~ε (k) = ~Y(k) − ~Y(k−1). Then the conditions that the iterationsconverge is expressed by

∥∥~ε (k)∥∥ <

∥∥~ε (k−1)∥∥ for all sufficiently large k, (8.58)

where, as in Lecture 4 (see (4.5)), || . . . || denotes the ∞-norm:

‖~ε‖ ≡ ‖~ε‖∞ = maxn|εn| .

Indeed, this condition implies that limk→∞∥∥~ε (k)

∥∥ = 0. This, in its turn, implies that the

sequence of iterations{

~Y(k)}

tends to a limit, which then, according to (8.56), must be a

solution of (8.5).We will, however, proceed in a slightly different way. Namely, we will assume that our

starting guess, ~Y(0), is sufficiently close to the exact solution, ~Y, of (8.5). Then, as long asthe iterations converge, ~Y(k) will stay close to ~Y for all k, and one can write

~Y(k) = ~Y + ~ε (k), where∥∥~ε (k)

∥∥ ¿ 1. (8.59)

The condition that iterations (8.56) converge has exactly the same form, (8.58), as before.However, its interpretation is slightly different (although equivalent): Now the fact thatlimk→∞

∥∥~ε (k)∥∥ = 0 implies that limk→∞ ~Y(k) = ~Y, i.e. that the iterative solutions converge to

the exact solution of (8.5).Both methods of convergence analysis described above can be shown to yield the same

conclusions. We chose to follow the second method, based on (8.59), because it can be morenaturally related to the linearization of the nonlinear equation at hand (see below). Lineariza-tion results in replacing the analysis of the original nonlinear equation (8.5) by the analysis ofa system of linear equations, which can be carried out using well-developed methods of LinearAlgebra.

Thus, we begin the convergence analysis of the iteration scheme (8.56) by substitutingthere expression (8.59) and linearizing the right-hand side using the first two terms of theTaylor expansion near ~Y:

½½½A~Y + A~ε (k+1) = »»»»

~r(~x, ~Y) +∂~r

∂ ~Y~ε (k) + O

( ∥∥~ε (k)∥∥2 )

. (8.60)

Page 88: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 87

The first terms on both sides of the above equation cancel out by virtue of the exact equation(8.5). Then, upon discarding quadratically small terms and using the definition of ~r from (8.53),one obtains a linear system

A~ε (k+1) =∂~r

∂ ~Y~ε (k), where

∂~r

∂ ~Y= h2 diag

(∂f(x1, Y1)

∂Y1

, . . . ,∂f(xN−1, YN−1)

∂YN−1

). (8.61)

We will use this equation to relate the norms of ~ε (k+1) and ~ε (k) and thereby establish a sufficientcondition for convergence of iterations (8.56).

Let us assume that

max1≤n≤N−1

∣∣∣∣∂f(xn, Yn)

∂Yn

∣∣∣∣ = L . (8.62)

Multiplying both sides of (8.56) by A−1 and taking their norm, one obtains:

∥∥~ε (k+1)∥∥ ≤ ||A−1|| · h2L ·

∥∥~ε (k)∥∥ , (8.63)

where we have also used a known fact from Linear Algebra, stating that for any matrix A andvector ~z,

||A~z|| ≤ ||A|| · ||~z||(actually, the latter inequality follows directly from the definition of a matrix norm). Forcompleteness, let us mention that

||A||∞ = max1≤i≤M

M∑j=1

|aij| . (8.64)

Inequality (8.63) shows that

∥∥~ε (k)∥∥ ≤ (

h2L ||A−1||)k ∥∥~ε (0)∥∥ , (8.65)

which implies that Picard iterations converge when

h2L ||A−1|| < 1 . (8.66)

As promised earlier, this condition is analogous to (8.57).It now remains to find ||A−1||. Since matrix A shown in Eq. (8.53) arises in a great many

applications, the explicit form of its inverse has been calculated for any size N = (b − a)/h.The derivation of A−1 can be found on photocopied pages posted on the course website; thecorresponding result for ||A−1||, obtained with the use of (8.64), is:

||A−1|| = (b− a)2

8h2. (8.67)

Substituting (8.67) into (8.66), we finally obtain that for Picard iterations to converge, it issufficient (but not necessary) that

(b− a)2

8L < 1 . (8.68)

Thus, whether the Picard iterations converge to the discretized solution of the BVP (8.28)depends not only on the function f(x, y) but also on the length of the interval [a, b].

Method 2 (modified Picard iterations)The idea of the modified Picard iterations method can be explained using the example of the

Page 89: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 88

single equation (8.54), where we will set A = 1 for convenience and without loss of generality.Suppose the simple fixed-point iteration scheme (8.55) does not converge because at the fixedpoint y, dr(y)/dy ≈ κ > 1,20 so that the convergence condition (8.57) is violated. Then,instead of iterating (8.54), let us iterate

y−κy = r(y)−κy ⇒ (1−κ)y(k+1) = r(y(k))−κy(k) ⇒ y(k+1) =r(y(k))− κy(k)

1− κ.

(8.69)Note that the exact equation which we start with in (8.69) is equivalent to Eq. (8.54) (withA = 1), but the iteration equation in (8.69) is different from (8.55). If our guess κ at thetrue value of the derivative dr(y)/dy is “sufficiently close”, then the derivative of the r.h.s. of(8.69) is less than 1, and hence, by (8.57), the iteration scheme (8.69) converges, in contrastto (8.54), which diverges. In other words, by subtracting from both sides of the equation alinear term whose slope closely matches the slope of the nonlinear term at the solution y, onedrastically reduces the magnitude of the slope of the right-hand side of the iterative scheme,thereby making it converge.

Let us return to the iterative solution of Eq. (8.28), where now, instead of iterating, as inPicard’s method, the equation (

y(k+1))′′

= f(x, y(k)

), (8.70)

we will iterate the equation

(y(k+1)

)′′ − c y(k+1) = f(x, y(k)

)− c y(k) (8.71)

with some constant c. The corresponding linearized equation in vector form is easily obtainedfrom (8.61):

(A− h2c I)~ε (k+1) = h2 diag

(∂f(x1, Y1)

∂Y1

− c, . . . ,∂f(xN−1, YN−1)

∂YN−1

− c

)~ε (k), (8.72)

where I is the identity matrix of the same size as A. We will now address the question ofhow one should choose the constant c so as to ensure convergence of (8.72) and hence of themodified scheme (8.71).

The main difference between the multi-component equation (8.72) and the single-componentequation (8.69) is that no single value of c could simultaneously match all of the values∂f(xn, Yn)/∂Yn, which, in general, are distributed in some interval

L− ≤ ∂f(xn, Yn)

∂Yn

≤ L+ , n = 1, . . . , N − 1 . (8.73)

It may be intuitive to suppose that the optimal choice for c may be at the midpoint of thatinterval, i.e.,

copt = Lav =1

2

(L− + L+

). (8.74)

Below we will show that this is indeed the case.Specifically, let us only consider the case where ∂f/∂y > 0, when a unique solution of BVP

(8.28) is guaranteed to exist by Theorem 6.1 of Lecture 6. Then in (8.73), both

L± > 0, (8.75)

20Here ‘≈’ is used instead of ‘=’ because dr(y)/dy is usually not known exactly.

Page 90: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 89

and hence Lav > 0. Next, by following the steps of the derivation of (8.63) one obtains from(8.72):

∥∥~ε (k+1)∥∥ ≤

(||(A− h2c I)−1|| · h2 max

L−≤`≤L+|`− c|

)·∥∥~ε (k)

∥∥ . (8.76)

Here ` stands for any of ∂f(xn, Yn)/∂Yn, which satisfy (8.73). Note that the maximum aboveis taken with respect to ` while c is assumed to be fixed. Then:

L− − c ≤ `− c ≤ L+ − c ⇒ maxL−≤`≤L+

|`− c| = max{|c− L−|, |L+ − c|}. (8.77)

Our immediate goal is to determine for what c the coefficient multiplying∥∥~ε (k)

∥∥ in (8.76)is the smallest: this will yield the fastest convergence of the modified Picard iterations (8.71).The entire analysis of this question is a little teduious, since one will have to consider separatelythe cases where c ∈ [L−, L+] and c¡¡∈ [L−, L+]. Since we have announced that the answer,(8.74), corresponds to the former case, we will present the details only for it. Details for thecase c¡¡∈ [L−, L+] are similar, but will not yield an optimal value of c, and so we will omitthem.

Thus, we are looking to determine

K ≡ minL−≤ c≤L+

(h2 ||(A− h2c I)−1|| max{|c− L−|, |L+ − c|}) , (8.78)

for which we will first need to find the norm ||(A− h2c I)−1||. Since A is a symmetric matrix(see (8.53)), so are matrices (A− h2c I) and (A− h2c I)−1. In graduate-level courses on LinearAlgebra it is shown that the norm of a real symmetric matrix B equals the modulus of itslargest eigenvalue. Since (an eigenvalue of B−1) = 1/(an eigenvalue of B), then

||(A− h2c I)−1|| =1

min |eigenvalue of (A− h2c I) | . (8.79a)

To find the lower bound for the latter eigenvalue, we use the result of Problem 2 of HW 8(which is based on the Gerschgorin Circles Theorem of Section 8.1). Thus,

||(A− h2c I)−1|| =1

h2c(8.79b)

(recall that c > 0 because we assumed in (8.75) that L± > 0 and also since c ∈ [L−, L+]).Combining (8.78) and (8.79b), we will now determine

K = minL−≤ c≤L+

(max

{∣∣∣∣c− L−

c

∣∣∣∣ ,

∣∣∣∣L+ − c

c

∣∣∣∣})

= minL−≤ c≤L+

(max

{1− L−

c,

L+

c− 1

}). (8.80)

Page 91: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 90

c = L− c = L+ c = copt

1 − (L−/c)

(L+/c)−1 The corresponding optimal c is found as shown inthe figure on the left:

1− L−

copt

=L+

copt

− 1

⇒ copt =1

2

(L− + L+

),

which is (8.74).

Substituting (8.74) into (8.80) and then using the result in (8.76), one finally obtains:

||~ε(k+1)|| ≤(

L+ − L−

L+ + L−

)||~ε(k)|| . (8.81)

Since according to our assumption (8.75) L± > 0, then the factor (L+ − L−)/(L+ + L−) < 1.Thus, the modified Picard scheme (8.71), (8.74) converges.

The issues of using the modified Picard iterations on BVPs where conditions (8.75) arenot met, or using scheme (8.71) with a non-optimal constant c, are considered in homeworkproblems. Let us only emphasize that the modified Picard iterations can sometimes convergeeven when (8.75) and/or (8.74) do not hold.

Method 3 (Newton–Raphson method)Although this method can be described for a general BVP of the form (7.1), we will only doso for a particular BVP, encountered in one of the homework problems for Lecture 7. Namely,consider the BVP

y′′ =y2

2 + x, y(0) = 1, y(2) = 1. (8.82)

Let us begin by writing down the system of second-order accurate discrete equations for thisBVP:

Yn+1 − 2Yn + Yn−1 =h2 Y 2

n

2 + xn

, Y0 = 1, YN = 1 . (8.83)

Let Y(0)n be the initial guess for the solution of (8.83) at xn. Similarly to (8.59), we relate it to

the exact solution {Yn} of (8.83):

Y (0)n = Yn + ε(0)

n , |εn| ¿ 1 . (8.84)

Substituting (8.84) into (8.83) and using the fact that the exact solution {Yn} satisfies (8.83),we obtain an equivalent system:

[ε(0)n+1 − 2ε(0)

n + ε(0)n−1 −

h2 · 2Y (0)n ε

(0)n

2 + xn

]+

h2(ε(0)n

)2

2 + xn

=(Y

(0)n+1 − 2Y (0)

n + Y(0)n−1

)−

h2(Y

(0)n

)2

2 + xn

.

(8.85)

If we now assume that ε(0)n are sufficiently small for all n, then we neglect the last term on the

l.h.s. of (8.85), and obtain a linear system for ε(0)n :

ε(0)n+1 −

(2 +

2h2Y(0)n

2 + xn

)ε(0)n + ε

(0)n−1 =

(Y

(0)n+1 − 2Y (0)

n + Y(0)n−1

)−

h2(Y

(0)n

)2

2 + xn

. (8.86)

Page 92: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 91

If our initial guess satisfies the boundary conditions of the BVP (8.82), then we also have

ε(0)0 = 0, ε

(0)N = 0. (8.87)

System (8.86), (8.87) can be solved time-efficiently by the Thomas algorithm. Thus we

obtain ε(0)n . According to (8.84), this gives the next iteration for our solution {Yn}:

Yn ≈ Y (1)n = Y (0)

n − ε(0)n . (8.88)

We then substituteY (1)

n = Yn + ε(1)n (8.89)

into (8.83), obtain a system analogous to (8.85), and then solve it for ε(1)n in exactly the same

way we have solved (8.85) for ε(0)n . Repeating these steps, we stop when the process converges,

i.e.∥∥~Y(k+1) − ~Y(k)

∥∥ becomes less than a given tolerance.

Using iterative methods in more complicated situations

First, let us note that iterative methods, which we have described above for ODEs, areequally applicable to partial differential equations (PDEs). This distinguishes iterative methodsfrom the shoting method considered in Lecture 7: as we stressed at the end of that Lecture,the shooting method cannot be extended to solving BVPs for PDEs. Thus, iterative methodsremain the only group of method for solving nonlinear BVPs for PDEs. The ideas of thesemethods are the same as we have described above for ODEs.

Second, the fixed-point iteration methods (i.e., counterparts of the Picard and modifiedPicard described above) are not used (or are rarely used) in commercial software. The mainreason is that they are considerably slower than the Newton–Raphson method and its variationsand also than so-called Krylov subspace methods, studied in advanced courses on NumericalLinear Algebra. We will only mention that the most famous of those Krylov subspace methodsis the Conjugate Gradient method (CGM) for solving symmetric positive (or negative) definitelinear systems. An accessible introduction to this method can be found in many textbooksand also in an online paper by J.R. Shewchuk, “An Introduction to the Conjugate GradientMethod Without the Agonizing Pain”.21 An extension of the CGM to nonlinear BVPs whoselinearization yields symmetric matrices is known as the Newton–CGM or, more generally, as aclass of Newton–Krylov methods.

Another reason why Krylov subspace methods are used much more widely than fixed-pointiterative methods is that convergence conditions of the latter methods are typically restrictedby a condition analogous to (8.75). Some of Krylov subspace methods either do not haverestrictions, or are less sensitive to them. Also, the Newton–Raphson method does not haveany restrictions similar to (8.75). Yet, this method also has its own issues, and a number ofbooks are written about application of the Newton–Raphson method to systems of nonlinearequations.

8.7 Questions for self-assessment

1. Verify that (8.4) follows from (8.1)–(8.3).

21Some pain, however, is to be expected.

Page 93: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 92

2. Explain why the form of the first and last terms of ~r in (8.7) is different from the form ofthe other terms.

3. What does the Gerschgorin Circles Theorem allow one to do?

4. What is the difference between a diagonally dominant and strictly diagonally dominantmatrices?

5. Make sure you can follow the derivations in the Example about the Gerschgorin CirclesTheorem.

6. Make sure you can follow the proof of Theorem 8.2.

7. What can you say about a solution of a linear system (8.5) where A is SDD?

8. Under what condition(s) on h is the discretized BVP (8.4) guaranteed to have a uniquesolution?

9. Verify (8.20).

10. Verify (8.21) through (8.23).

11. Explain why the local truncation error of the discretized BVP is multiplied by a factorO( 1

h2 ) to give the error of the solution.

12. Describe the idea of Method 1 for handling non-Dirichlet boundary conditions.

13. Derive (8.36).

14. Describe the idea of Method 2 for handling non-Dirichlet boundary conditions.

15. Convince yourself (and be prepared to convince the instructor) that if condition (8.42) andthe two conditions stated one line below it, hold, then the coefficient matrix in Method 1based on Eq. (8.36) is SDD.Hint: Write the condition that you want to prove and then assume that (8.42) and theother two conditions hold.

16. Suppose that you need to solve a BVP with a mixed-type boundary condition and suchthat (8.42) does not hold. Will you attempt to solve such a BVP? What should yourexpectations be?

17. Suppose that the BVP has a non-Dirichlet boundary condition of the form (8.31) atx = b (the right end point of the interval) rather than at x = a. What will the analog ofcondition (8.42) be in this case?Hint 1: Remember that this is a QSA and not an exercise requiring calculations.Hint 2: To better visualize the situation, suppose [a, b] = [−1, 1]. What mathematicaloperation will take the left end point of this interval into the right end point? How willthis operation transform the terms on the l.h.s. of (8.31)?

18. Convince yourself (and be prepared to convince the instructor) that the statements inRemark 2 in Sec. 8.4 about the coefficient matrices for the second-order methods beingtridiagonal, are correct.

19. Describe the idea of the Picard iterations.

Page 94: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 93

20. Derive (8.63) as explained in the text.

21. What is the condition for the Picard iterations to converge? Is this a necessary or sufficientcondition? In other words, if that condition does not hold, should one still attempt tosolve the BVP?

22. Describe the idea behind the modified Picard iterations.

23. How is the finite-difference implementation of (8.71) different from that of (8.56)?

24. Make sure you can obtain (8.72).

25. Make sure you can obtain (8.76) and (8.77).

26. Explain how the result of Problem 2 of HW 8 leads to (8.79b).

27. Derive (8.74) from the explanation found after (8.80).

28. Obtain (8.81) as explained in the text.

29. Are (8.75) and (8.74) necessary or sufficient conditions for convergence of the modifiedPicard iterations (8.71)?

30. Make sure you can see where (8.85) comes from.

31. Write down a linear system satisfied by ε(1)n defined in (8.89).

32. Describe the idea behind the Newton–Raphson method.

33. Can one use iterative methods when solving nonlinear BVPs for PDEs?

Page 95: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 94

9 Concepts behind finite-element method

The power of the finite-element method (FEM) becomes evident when handling problems intwo or three spatial dimentions. In this lecture (and this course), we will only consider appli-cations of the FEM to problems in one spatial dimention (i.e., BVPs on an interval), so as todemonstrate some of the basic concepts behind this method.

9.1 General idea of FEM

Suppose we are looking for a solution of the BVP

y′′ + Q(x)y = R(x), y(0) = y(1) = 0 . (9.1)

This BVP has zero boundary conditions at both end points of the interval; however, thisdoes not restrict the generality of the subsequent exposition. Namely, a way to treat nonzeroboundary conditions is described in a homework problem.

The idea of the FEM is as follows. Select a set of linearly independent function {φj(x)}Mj=1

such that each φj satisfies the boundary conditions of the BVP, i.e.

φj(0) = φj(1) = 0, j = 1, . . . , M . (9.2)

Then, look for an approximate solution of the BVP in the form

Y (x) =M∑

j=1

cjφj(x) , (9.3)

where coefficients cj are to be determined. Note that since the φj’s satisfy the boundaryconditions of the BVP, so does the solution Y (x). The problem has now become the following:(i) decide which basis functions φj(x) to use and (ii) determine the coefficients cj so as to makethe error between Y (x) and the exact solution y(x) as small as possible.

The term ‘basis’ describing the set {φj} is used here to indicate that functions φj mustbe linearly independent and, moreover, their linear superpositions (i.e., the r.h.s. of (9.3))should be able to approximate functions (i.e., solutions of the BVP) from a sufficiently largeclass sufficiently closely. A quantitaive chracterization of the two ‘sufficiently’s in the previoussentence is a serious mathematical task, which we will not attempt to undertake. Instead, wewill proceed at the intuitive level.

One possible set of basis functions which satisfy boundary conditions (9.2) is

φj(x) = sin(jπx), j = 1, . . . , M . (9.4)

Another, more convenient, set will be introduced later on as we proceed. In general, theremay be many choices for {φj(x)}; the decision as to which one to use is usually made on aproblem-by-problem basis.

The problem of determining the coefficients cj can be handled in three different ways, whichwe will now describe.

Page 96: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 95

9.2 Collocation method

Let us substitute expansion (9.3) into BVP (9.1):

M∑j=1

cjφ′′j (x) + Q(x)

M∑j=1

cjφj(x) = R(x) . (9.5)

Recall that due to the choice (9.2), the boundary conditions are satisfied automatically.Now, ideally, we want Eq. (9.5) to hold identically, i.e. for all x ∈ [0, 1]. However, since we

only have M free parameters, {cj}Mj=1, at our disposal, we can only require that Eq. (9.5) be

satisfied at M points, called collocation points. That is, if {xk}Mk=1 is a set of such points on

[0, 1], then we require that the following system of (linear) equations hold:

M∑j=1

(φ′′j (xk) + Q(xk)φj(xk)

)cj = R(xk), k = 1, . . . , M . (9.6)

Upon solving this system of M equations for the M unknowns cj and substituting their valuesinto (9.3), one finds the approximate solution Y (x) of the BVP.

Linear system (9.6) can be written in the standard form

A~c = ~r , (9.7)

where(A)kj =

(φ′′j (xk) + Q(xk)φj(xk)

),

~r = (R(x1), . . . , R(xM))T .(9.8)

The coefficient matrix A is, in general, full (i.e. not tri- or pentadiagonal), whereas matrixA that arose in the finite-difference approach of Lecture 8 was tridiagonal. Thus it appearsthat the collocation methods leads to a system that is more difficult to solve than the systemproduced by the finite-difference approach. However, one can make matrix A of the collocationmethod to also be tridiagonal if one chooses the basis functions in a special form. Namely, forj = 2, . . . , M − 1, take φj’s to be the following cubic B-splines:

Bj(x) =

(∆xj−2)3

4h3, xj−2 ≤ x ≤ xj−1

1

4+

3∆xj−1

4h

[1 +

∆xj−1

h−

(∆xj−1

h

)2]

, xj−1 ≤ x ≤ xj

1

4− 3∆xj+1

4h

[1− ∆xj+1

h−

(∆xj+1

h

)2]

, xj ≤ x ≤ xj+1

−(∆xj+2)3

4h3, xj+1 ≤ x ≤ xj+2

0, otherwise

j = 2, . . . , M − 1,

(9.9)

where ∆xj = x−xj, and we have assumed for simplicity that all points are equidistant, so thath = xj+1 − xj. Recall that x0 = 0 and xM+1 = 1 for BVP (9.1). Functions Bj(x) (9.9) have

Page 97: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 96

continuous first and second derivatives everywhere on [0, 1]; a typical such function is shownin the middle plot of the accompanying figure.

B1(x)

Bj(x)

j=2,...,M−1

xj

BM

(x)

When j = 1 or j = M , these functions have to be slightly modified, so that, say, B1(x)satisfies the boundary condition at x0: B1(x0) = 0, but no condition is placed on its derivativeat that point. The plots of B1 and BM are shown in the left and right plots of the figure; theanalytical expression for, say, B1 being:

B1(x) =

1

2(4− 3h)

[3(6− 5h)

∆x0

h− 9(1− h)

(∆x0

h

)2

−(

∆x0

h

)3]

, x0 ≤ x ≤ x1

1

4− 3∆x2

4h

[1− ∆x2

h−

(∆x2

h

)2]

, x1 ≤ x ≤ x2

−(∆x3)3

4h3, x2 ≤ x ≤ x3

0, otherwise.(9.10)

Now, with φj(x) = Bj(x), system (9.6) has a tridiagonal matrix A because for any xk, onlyBk(xk) and Bk±1(xk) are nonzero, i.e. Bj(xk) = 0 for |k − j| ≥ 1. Then all one needs in orderto write out the explicit form of (9.6) are the values of Bk(xk) and Bk±1(xk). These are shownin the table below.

xk−1 xk xk+1

Bk(x) 1/4 1 1/4

B′′k(x) 3/(2h2) −3/h2 3/(2h2)

To conclude this subsection, we point out the advantage of the collocation method over thefinite-difference method of Lecture 8: Points xk do not need to be equidistant. This will haveno effect on the form of system (9.6); only the coefficients in (9.9) and (9.10) will be slightlymodified. One can use this freedom in distributing the collocation points over the interval sothat to place more of them in the region(s) where the solution is expected to change rapidly.

9.3 Galerkin method

This method allows one to use basis functions that are simpler than the B-splines consideredin the previous subsection. The solution obtained, of course, will not be as smooth as thatobtained by the collocation method.

Page 98: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 97

As we said above, the approximate solution Y (x) is sought as a linear combination of thebasis functions φj; see (9.3). We can now draw an analogy with Linear Algebra and call thebasis functions φj vectors which span (i.e., form) a linear, M -dimensional space SM . Then,according to (9.3), Y (x) is a vector in that space.

Let us again substitute (9.3) into BVP (9.1) and write the result as

M∑j=1

(φ′′j (x) + Q(x)φj(x)

)cj − R(x) = ρ(x) . (9.11)

(Recall that in the collocation method, we requiredthat the residual vector ρ(xk) = 0 at all the collo-cation points xk.) Now, the residual vector ρ(x)does not, in general, belong to the linear spaceSM (in other words, it is not a linear combinationof φj’s). Geometrically, we can represent this sit-uation (for M = 2) as in the figure on the right.Namely, vector ρ(x) has a component that belongsto SM and the other component that lies outsideof SM .

φ1

φ2 S

2

ρ

The idea of the Galerkin method is to select the coefficients cj so as to make the residualvector ρ(x) orthogonal to all of the basis functions φj, j = 1, . . . ,M . In that case, the projectionof ρ(x) on SM is zero, and hence the “length” of ρ(x) is minimized, since

“length” ρ(x) =

√(“length” ρ|| to SM

(x))2

+ (“length” ρ⊥ to SM(x))2 .

Thus, we need to specify what we mean by ‘orthogonal’ and ‘length’ for functions. Twofunctions f(x) and g(x) are called orthogonal if

∫ 1

0

f(x)g(x)dx = 0 . (9.12)

Two remarks are in order. First, note that the integral in (9.12) is over [0, 1]. This is becausethe BVP we are considering is defined over that interval. If a BVP is defined over [a, b], the

corresponding definition of orthogonality would contain∫ b

ainstead of

∫ 1

0. Second, (9.12) is not

the only definition of orthogonality of functions, but just one of those which are used frequently.For different applications, different definitions of function orthogonality may prove to be moreconvenient.

The definition of ‘length’ of a function is subordinate to that of orthogonality, namely:

||f(x)||2 =

√∫ 1

0

(f(x))2 dx. (9.13)

The subscript ‘2’ of ||...||2 is used because the l.h.s. of (9.13) is also known as the L2-norm ofa function, which is different from the ∞-norm that we have considered so far.

Thus, the Galerkin method requires that

∫ 1

0

ρ(x)φk(x) dx = 0 for k = 1, . . . , M. (9.14)

Page 99: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 98

With the account of (9.11), this gives:

M∑j=1

cj

∫ 1

0

(φ′′j (x) + Q(x)φj(x)

)φk(x) dx =

∫ 1

0

R(x)φk(x) dx ; k = 1, . . . , M. (9.15)

If we now define a matrix A to have the coefficients

akj =

∫ 1

0

(φ′′j (x) + Q(x)φj(x)

)φk(x) dx (9.16)

and the vector ~r to be

~r =

(∫ 1

0

R(x)φ1(x) dx, . . . ,

∫ 1

0

R(x)φM(x) dx

)T

, (9.17)

then the system of linear equations (9.15) takes on the familiar form (9.7).

So far, there has been no real advantage of the Galerkin method over the collocation method.Such an advantage arises when we use integration by parts to rewrite (9.16) in the form

akj = −∫ 1

0

φ′j(x)φ′k(x) dx +

∫ 1

0

Q(x)φj(x)φk(x) dx . (9.18)

In deriving (9.18), we have used the boundary conditions (9.2). From (9.18), which is equivalentto (9.16), we immediately observe two things, which were not evident from (9.16).• akj = ajk, i.e. the coefficient matrix in the Galerkin method is symmetric.• To calculate akj, one only requires φ′j, but not φ′′j , to exist. Moreover, one does not requireφ′j to be continuous; it suffices that it be integrable.

Then, the following simple choice of φj(x) can be made:

φj(x) =

1− |∆xj|h

, xj−1 ≤ x ≤ xj+1

0 otherwise

(9.19)

These functions φj(x) are called hat-functions, or linearB-splines.

φj(x)

j=1,...,M

xj

With this choice for φj’s, matrix A is tridiagonal. In-deed, in this case φ′j is as shown on the right, whenceone can calculate that

∫ 1

0

φ′j(x)φ′k(x) dx =

−1

h, k = j ± 1

2

h, k = j

0, otherwise.

(9.20)

Quantities∫ 1

0Q(x)φj(x)φk(x) dx are nonzero also only

for k = j − 1, j, j + 1, as is evident from the figureabove.

φ’j(x)

j=1,...,M

xj

0

1/h

1/h

Page 100: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 99

To conclude this subsection, let us remark on the issue of the calculation of the second inte-gral in (9.18) and the integrals in (9.17). For brevity, we will only speak here about the formertype of integrals; the same will apply to the latter ones. The integrals

∫ 1

0Q(x)φj(x)φk(x) dx

may be evaluated in one of the following ways. First, if their analytical expressions for allrequired pairs of k and j can be obtained (e.g., with Mathematica or another computer algebrapackage), the numeric values of these expressions should be used. If such expressions are notavailable, then the computation may differ depending on whether the φj’s are smooth and non-vanishing over the entire interval [0, 1], as, say, functions (9.4), or they are the hat-functions(or any other highly localized functions). In the former case, the integrals in question may becomputed by one of the standard methods (say, Simpson’s) using the existing subdivision of[0, 1], or by Matlab’s built-in inetrgators (quad or quadl). In the latter case, i.e. for highlylocalized φj’s, the integrals can be approximated as

∫ 1

0

Q(x)φj(x)φk(x) dx ≈ Q(xmid)

∫ 1

0

φj(x)φk(x) dx , (9.21)

where xmid is the middle of the interval over which the product φj(x)φk(x) is nonzero. Onecan show that the accuracy of approximation (9.21) is O(h2). In the case when φj’s are thehat-functions (9.19), the integral on the r.h.s. of (9.21) can be explicitly calculated to be:

∫ 1

0

φj(x)φk(x) dx =

h

6, k = j ± 1

2h

3, k = j

0, otherwise.

(9.22)

9.4 Rayleigh-Ritz method

This method replaces the problem of solving BVP (9.1) by a problem of finding the minimumof a certain functional. We will not consider this method in more detail, but only mentionthat: (i) Rayleigh-Ritz method can be shown to be equivalent to Galerkin method, and (ii)the functional mentioned in the previous sentence is the “length” of the residual vector ρ(x),defined according to (9.13).

9.5 Questions for self-assessment

1. Describe the idea behind the collocation method.

2. What condition (or conditions) should the basis functions in the collocation methodsatisfy?

3. In addition to the above condition(s), what other condition should the basis functionssatisfy in order to make the corresponding coefficient matrix tridiagonal?

4. What is the advantage of the collocation method over the finite-diference method?

5. Write down the explicit form of BM .

Page 101: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 100

6. Verify the entries in the Table in Sec. 9.2.

7. Describe the idea behind the Galerkin method.

8. Try to explain the analogy between definition (9.12) of orthogonality of functions and thedefinition of orthogonality of vectors in Rn. Hint: Interpret the integral as (the limit of)a finite, say, Riemann, sum. (The fact that it is the limit is not really important here.)

9. Consequently, explain why (9.13) is analogous to the Euclidean length of a vector.

10. Continuing from the last two questions, try to explain the close analogy between theGalerkin method and the least-squares solution of inconsistent linear systems.

11. What is the advantage of the Galerkin method over the collocation method?

Page 102: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 101

11 Classification of partial differentiation equations (PDEs)

In this lecture, we will begin studying differential equations involving more than one indepen-dent variable. Since they involve partial derivatives with respect to these variables, they arecalled partial differential equations (PDEs). Although this course is concerned with numericalmethods for solving such equations, we will first need to provide some analytical backgroundon where those equations arise and how their setup is different from that of ODEs. This willbe done in this lecture, while the subsequent lectures, except Lecture 16, will deal with thenumerical methods proper.

11.1 Classification of physical problems described by PDEs

The majority of problems in physics and engineering fall into one of the following categories:(i) equilibrium problems, (ii) eigenvalue problems, and (iii) evolution problems.

(i) Equilibrium problems are those where asteady-state spatial distribution of some quantityu inside a given domain D is to be determined bysolving a differential equation

L[u] = f(x, y), (x, y) ∈ D

subject to the boundary condition

B[u] = g(x, y), (x, y) ∈ ∂D,

where ∂D is the boundary of D. Here L is a differ-ential operator involving derivatives with respectto x and y (for the case of two spatial dimensions);B, in general, may also involve derivatives. TheseBVPs generalize, to two or more dimensions, theone-dimensional BVPs we studied in Lectures 6through 9.

PDE: L[u]=f

inside domain D Boundary conditions

B[u]=g

on boundary ∂D of D

Examples of equilibrium problems include: Steady flows of liquids and gases; steady tem-perature distributions; equilibrium stress distributions in elastic structures.

(ii) Eigenvalue problems are extensions of equilibrium problems with no external forceswhere nontrivial (i.e. not identically zero) steady-state distributions exist only for special valuesof certain parameters, called eigenvalues. These eigenvalues, denoted λ, are to be determinedalong with the steady-state distributions themselves. The simplest form of an eigenvalue prob-lem is

L[u] = λu for (x, y) ∈ D; B[u] = 0 for (x, y) ∈ ∂D.

In a more complex setup, the eigenvalue may enter into the PDE, and even into the boundarycondition, in a more complicated way.

Examples of eigenvalue problems include: Natural frequencies of vibrating strings andbeams; resonances in electric circuits, mechanics, and acoustics; energy levels in quantummechanics.

Page 103: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 102

(iii) Evolution problems are extensions of initial value problems, where the distribution ofa quantity u is not steady but exhibits transient behavior. Generally, the problem is to predictthe evolution of the system at any time given its initial state. This is done by solving the PDE

L[u] = f(x, t), x ∈ D for t > t0,

given the initial stateI[u] = h(x), x ∈ D for t = t0,

and the boundary conditions

B[u] = g(x, t), x ∈ ∂D and t ≥ t0.

The differential operator L now involves derivatives with respect to x and t.Examples of evolution problems include: Propagation of waves of any nature; diffusion of a

substance in a room; cooling down or heating an object.For example, the mathematical problem of determining the evolution of a temperature

distribution u(x, t) inside a rod of length 1 is set up as follows:

L[u] = f(x, t), 0 < x < 1, t > 0,

where the form of the operator L will be specifiedlater;

u(x, t = 0) = h(x), 0 ≤ x ≤ 1,

where h(x) is the initial temperature distributioninside the rod;

u(x = 0, t) = g0(t), u(x = 1, t) = g1(t), t ≥ 0,

where g0,1(t) are the temperature values main-tained at the two ends of the rod. 0 1

0

1

2t

x

D

∂ D

In subsequent lectures, we will consider exclusively evolution problems. To that end, wewould like to obtain an unambiguous, mathematically rigorous criterion which allows one todistinguish problems of different categories. This is done in the next subsection.

11.2 Classification of PDEs into three types; characteristics

Here we will consider the question of how many initial or boundary conditions can or should bespecified for a PDE, and where (in the (x, y)-space or (x, t)-space) it can or should be specified.We will concentrate on the case of two-dimensional spaces; generalizations to three- and four-(i.e., the time plus three spatial dimensions) dimensional cases are possible and for the mostpart straightforward. For definiteness, let us speak about the (x, y)-space until otherwise isindicated. (That is, for now, y may denote either the second spatial variable or the timevariable t.)

As a reference, let us recall the situation with ODEs and, for concreteness, consider asecond-order ODE. There, we could either specify the initial values for the dependent functiony and its derivative y′ at one point x = x0, or the values of y (or more complicated expressions,e.g., (8.31)) at two points, x = a and x = b. In the former case, we have an IVP, and in the

Page 104: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 103

latter case, a BVP. Since, as we said earlier, we will be concerned with the evolution problems,which are higher-dimensional counterparts of IVPs, we proceed to recall how we were able tosolve (conceptually, not technically) the latter, i.e.

y′′ = f(x, y, y′), y(x0) = y0, y′(x0) = v0 . (11.1)

Namely, for any point x0 + h sufficiently near x0, we could write the Taylor expansion

y(x0 + h) = y(x0) + hy′(x0) +1

2h2y′′(x0) +

1

6h3y′′′(x0) + . . . . (11.2)

The first two terms on the r.h.s of (11.2) are known from the initial condition; the third termis known from the ODE, which we assume to be satisfied at x = x0. The last term in (11.2)can then be found from

y′′′(x) =dy′′

dx=

df

dx=

∂f

∂x+ y′

∂f

∂y+ y′′

∂f

∂y′. (11.3)

All omitted higher-order terms in (11.2) can be found analogously to (11.3). Thus, y(x) can bedetermined for all points that are sufficiently close to x0.

When we move from one independent variable (as in ODEs) to two (as in PDEs), it isintuitive to suppose that now the initial and/or boundary conditions should be specified alongcertain curves in the (x, y)-space rather than at a point. (In that case, the dimensions of boththe differential equation and the initial/boundary condition are each increased by one.) Thus,let us assume that we know the dependent function u along some curve Γ in the (x, y)-planeand also the derivative ∂u/∂~n in the direction normal to Γ:

u(x, y) = g0(x, y),

∂u(x, y)

∂~n= G1(x, y),

(x, y) ∈ Γ. (11.4)

Note that if one knows u along Γ, one automatically knows also the derivative of u along Γ(simply take the directional derivative of u in the direction tangent to Γ at each of its points).Knowing the derivatives of u in both the normal and tangent directions to Γ is equivalent toknowing ux and uy separately at each point. (Here and below we use subscripts to denotepartial differentiation; i.e., ux ≡ ∂u/∂x and uy ≡ ∂u/∂y.) Thus, Eqs. (11.4) are equivalent to

{u(x, y) = g0(x, y),

ux(x, y) = g1(x, y) and uy(x, y) = g2(x, y),(x, y) ∈ Γ. (11.5)

In the remainder of this course, we will consider PDEs of the form

Auxx + 2Buxy + Cuyy + Dux + Euy + F = 0, (11.6)

where coefficients A, B, C may depend on any or all of x, y, u, ux, uy and coefficients D, E, F

may depend on x, y, u. Given the PDE (11.6) and the initial/boundary conditions (11.5) oncurve Γ, we would like to determine u at some point that is sufficiently near Γ. So, let (x0, y0) besome point on Γ and (x0+h, y0+k) with h, k ¿ 1 be a nearby point where we want to determineu. The fundamental question that we now ask is: What are the restrictions on curve Γ andon the coefficients A, B, C in (11.6), under which one can determine u(x0 + h, y0 + k)?

Page 105: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 104

We begin answering this question by writing the Taylor expansion for u(x0 +h, y0 + k) nearpoint (x0, y0) on Γ, where we know both u and its first derivatives from (11.5):

u(x0 + h, y0 + k) = u(x0, y0) + hux(x0, y0) + kuy(x0, y0) +

1

2h2uxx(x0, y0) + hkuxy(x0, y0) +

1

2k2uyy(x0, y0) +

1

6h3uxxx(x0, y0) + . . . (11.7)

This expansion is the analog of (11.2). Now, as we have said, all terms on the r.h.s. of the firstline of (11.7) are known from (11.5). If each of the three terms in the second line of (11.7) canbe found separately (i.e., as opposed to in the combination in which they enter Eq. (11.6)),then all the higher-order terms in expansion (11.7) can be found similarly to (11.3). Indeed,suppose we have found expressions for uxx, uxy, and uyy in the form that generalizes (11.1) totwo variables:

uxx = f1(x, y, u, ux, uy), uxy = f2(x, y, u, ux, uy), uyy = f3(x, y, u, ux, uy). (11.8)

Then the third-order partial derivatives (see the last line in (11.7)) can be computed using theChain Rule for a function of several variables. For example,

uyyy =∂uyy

∂y

∣∣∣∣x=const

=df3(x, y, u(x, y), ux(x, y), uy(x, y))

dy

∣∣∣∣x=const

=∂f3

∂y+

∂f3

∂u

∂u

∂y+

∂f3

∂ux

∂ux

∂y+

∂f3

∂uy

∂uy

∂y≡ ∂f3

∂y+

∂f3

∂uuy +

∂f3

∂ux

uxy +∂f3

∂uy

uyy . (11.9)

The first two terms on the r.h.s. of (11.9) are known from (11.5). Therefore, if we also knowuxy and uyy on Γ, we then can compute the last two terms in (11.9) and hence the uyyy. Otherthird- and higher-order derivatives in the Taylor expansion (11.7) can be computed analogously.Thus, will be able to find u(x0 + h, y0 + k) if and only if we know uxx, uxy, and uyy on Γ.

Now, we need three equations to be able to uniquely determine the three quantities uxx,uxy, and uyy. The first equation is the PDE (11.6). The other two equations are found bydifferentiating the two equations on the second line of (11.5) along Γ 22:

uxxdΓx + uxydΓy = dΓg1(x, y) (≡ g1,xdΓx + g1,ydΓy ) , (11.10)

uyxdΓx + uyydΓy = dΓg2(x, y) (≡ g2,xdΓx + g2,ydΓy ) , (11.11)

where the symbol dΓ denotes a differential (i.e., an infinitesimally small step) along Γ. Notethat both the r.h.s.’es and the ratio dΓy/dΓx (i.e., the slope of Γ) are known once Γ and thefunctions in (11.5) are known.

Further, if we assume u to be continuously differentiable at least twice, then

uxy = uyx, (11.12)

and Eqs. (11.6) and (11.10), (11.11) can be written as a linear system for uxx, uxy, and uyy atany point on Γ:

A 2B C

dΓx dΓy 0

0 dΓx dΓy

uxx

uxy

uyy

=

−Dux − Euy − F

dΓg1

dΓg2

. (11.13)

22Differentiating, or taking the directional derivative, along a curve is equivalent to computing the infenites-imal increments ux(x + dΓx, y + dΓy) − ux(x, y) and uy(x + dΓx, y + dΓy) − uy(x, y), which yields the l.h.s.’esof (11.10) and (11.11), respectively.

Page 106: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 105

This system yields a unique solution for uxx, uxy, and uyy provided that the coefficient matrixis nonsingular. The matrix would be singular if its determinant vanishes:

A (dΓy)2 − 2B dΓx dΓy + C (dΓx)2 = 0 ⇒

A

(dy

dx

)2

Γ

− 2B

(dy

dx

)

Γ

+ C = 0 . (11.14)

Thus, we have obtained the answer to the fundamental question posed above. Namely,if the initial/boundary conditions are prescribed on a curve Γ whose tangent at any pointsatisfies Eq. (11.14), then the corresponding initial-boundary value problem (IBVP) (11.4) and(11.6) cannot be solved for a twice-countinuously differentiable (see (11.12)) function u(x, y). Ifthe initial/boundary conditions (11.4) are prescribed along any other curve, the IBVP can besolved. Alternatively, the IBVP can still be solved if a smaller set of initial/boundary conditions(say, just the first line in (11.4)) is specified along Γ, or if uxy (or any lower-order derivative ofu) is allowed to be discontinuous across Γ.

Equation (11.14) gives one the mathematically rigorous criterion that separates all PDEs(11.6) into three types depending on the relation among A, B, and C.

B2 − AC < 0In this case, no real solution for the slope (dy/dx)Γ can be found from the quadratic equation(11.14). This means that one can specify the initial/boundary conditions (11.4) along anycurve in the plane, and be able to obtain the solution u sufficiently close to that curve. Suchequations are called elliptic. Physical problems leading to elliptic equations are the equilibriumand eigenvalue problems, described in Sec. 11.1. Typical examples of such problems are theLaplace and Helmholtz equations:

uxx + uyy = 0, (Laplace)

uxx + uyy = λu. (Helmholtz)

The boundary conditions for elliptic equations are usually imposed along the boundary of aclosed domain D, as in the first figure in Sec. 11.1. One can also show that to obtain thesolution inside the entire domain D rather than only “sufficiently close” to its boundary ∂D,one needs to impose only one of the conditions (11.4), but not both. On this remark, we leavethe elliptic equations and will not consider them again in this course.

B2 − AC > 0In this case, two real solutions for the slopes of curve Γ exist:

(dy

dx

)

Γ

=B ±√B2 − AC

A. (11.15)

These slopes specify two distinct directions in the (x, y)-plane, called characteristics. Thecorrespondinf PDEs are called hyperbolic. Physical problems that lead to hyperbolic equa-tions are the evolution problems dealing with propagation of waves (e.g., light or sound). Thecoordinates in this case are x, the spatial coordinate of propagation, and t, the time, ratherthan the second spatial coordinate y. The typical example is the Wave equation:

uxx − utt = 0 . (Wave)

The importance of characteristics in hyperbolic problems is two-fold: (i) the initial datafor a smooth solution cannot be prescribed on a characteristic, and (ii) initial disturbances

Page 107: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 106

propagate along the characteristics. We will consider this latter issue in more detail when webegin to study numerical methods for hyperbolic PDEs.

B2 − AC = 0In this case, only one value of the slope of Γ exists:

(dy

dx

)

Γ

=B

A. (11.16)

This gives only one direction of characteristics. The corresponding PDEs are called parabolic.Physical problems that lead to parabolic equations are usually diffusion-type problems. Thetypical example is the Heat equation,

uxx − ut = 0 , or ut = uxx , (Heat)

which describes, e.g., evolution of temperature inside a rod.Since in the next four lectures we will consider methods of numerical solution of the Heat

equation, let us discuss how boundary conditions can or should be set up for it. In fact, thiswas considered in the example at the end of Sec. 11.1. Namely, the initial condition for theHeat equation on x ∈ [0, 1] is

u(x, t = 0) = u0(x), 0 ≤ x ≤ 1,

(Initial condition

for Heat equation

)

and the boundary conditions are

u(0, t) = g0(t), u(1, t) = g1(t), t ≥ 0 .

(Boundary conditions

for Heat equation

)

Note that the initial condition is prescribed along a characteristic! Indeed, for the Heatequation, A = 1, B = C = 0, and Eq. (11.16) gives the slope of characteristic as dt/dx = 0,which means that any line t = const is a characteristic. The above, however, does not contradictthe results of analysis of this subsection, because the initial condition corresponds only to thefirst equation in (11.4), while the second equation is absent. Thus, one cannot prescribe therate of change ut at the initial moment for the Heat equation.

11.3 Questions for self-assessment

1. Give examples from physics of equilibrium, eigenvalue, and evolution problems.

2. Explain how system (11.13) is set up (i.e., where its equations come from).

3. What is the significance of characteristics?

4. What types of physical problems lead to elliptic, hyperbolic, and parabolic equations?

5. How many characteristics does the Wave equation have?

6. Why does prescribing initial data on a characteristic for the Heat equation not preventone from finding the solution of that IBVP?

Page 108: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 107

12 The Heat equation in one spatial dimension:

Simple explicit method and Stability analysis

12.1 Formulation of the IBVP and the minimax property of its so-lution

We begin by writing down the Heat equation (in its simplest form) on the interval x ∈ [0, 1]and the corresponding initial and boundary conditions. In fact, this is just a restatement fromthe end of Lecture 11.

ut = uxx 0 < x < 1, t > 0 ; (12.1)

u(x, t = 0) = u0(x) 0 ≤ x ≤ 1 ; (12.2)

u(0, t) = g0(t), u(1, t) = g1(t) t ≥ 0 . (12.3)

The IBVP (12.1)–(12.3) will be the subject of this and the next lectures. Boundary conditionsof a form more general than (12.3) will be considered in Lecture 14. Recall that in order toproduce a continuous solution, the boundary and initial conditions must match:

u0(0) = g0(0) and u0(1) = g1(0) . (12.4)

On physical grounds, in what follows we will always require that the matching conditions (12.4)be satisfied.

It is always useful to know what general properties one may expect of the analytical solutionof a given IBVP, so that one could verify that the corresponding numerical solution also hasthese properties (this is a basic sanity check for the numerical code). Such a property for IBVP(12.1)–(12.3), stated below, is proved in courses on PDEs.

Minimax principle Suppose ut (and hence uxx

and both ux and u) is continuous in the regionD = [0, 1] × [0, ∞) (see the figure on the right)a.Then the solution u of the IBVP (12.1)–(12.3)achieves its maximum and minimum values on ∂D(i.e. either for t = 0 or for x = 0 or x = 1).In other words, u cannot achieve its maximum orminimum values strictly inside D.

aNote that here domain D and its boundary ∂D aredefined slightly differently than in the figure at the end ofSec. 11.1.

0 1

0

1

2t

x

D

∂ D

Note that this, at least partially, agrees with our intuition in “real life”. Indeed, supposeone creates some distribution of nonnegative temperature in the rod at t = 0 while keeping theends of the rod at zero temperature at all times. Then we expect that the temperature insidethe rod at any t > 0 will be less than it was at t = 0 (because the rod will cool down); that is,the maximum temperature was observed somewhere along the rod at t = 0, i.e. at the bottompart of ∂D. On the other hand, we also expect that the temperature in this setup will not dropbelow zero; that is, the temperature will be minimum at the ends of the rod, i.e. at the sidesof ∂D.

Page 109: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 108

12.2 The simplest explicit method for the Heat equation

Let us cover the region D with a mesh (or grid), asshown on the right. Denote

xm = mh, m = 0, 1, . . . , M, (h = 1M

);

tn = nκ, n = 0, 1, . . . , N,(κ = Tmax

N

);

(12.5)

here Tmax is the maximum time until we want to com-pute the solution. Also, let Un

m be the solution computedat node (xm, tn). For simplicity, in this lecture we willassume that the boundary conditions are homogeneous:

g0(t) = g1(t) = 0 for all t ; (12.6)

note that this implies that u0(0) = u0(1) = 0.

When restricted to the grid, the initial and boundary conditions become:

(12.2) ⇒ U0m = u0(mh), 0 ≤ m ≤ M ; (12.7)

(12.3) ⇒{

Un0 = 0,

UnM = 0,

n ≥ 0 . (12.8)

Let us now use the simplest finite-difference approximations to replace the derivatives inthe Heat equation:

ut → Un+1m − Un

m

κ+ O(κ) , (12.9)

uxx → Unm+1 − 2Un

m + Unm−1

h2+ O(h2) . (12.10)

Substituting these formulae into (12.1) yileds the simplest explicit method for solving the Heatequation:

Un+1m − Un

m

κ=

Unm+1 − 2Un

m + Unm−1

h2+ O(κ + h2) , (12.11)

or, equivalently,Un+1

m = rUnm+1 + (1− 2r)Un

m + rUnm−1 , (12.12)

wherer =

κ

h2. (12.13)

The numerical solution at node (xm, tn+1) can thus befound if one knows the solution at nodes (xm, tn) and(xm±1, tn). These four nodes form a stencil for scheme(12.12), as shown schematically on the right.Given the initial and boundary conditions (12.7) and(12.8), one can advance the solution Un

m from time levelnumber n to time level number (n + 1) using the rec-curent formula of scheme (12.12).

n+1

n

m−1 m m+1

Page 110: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 109

12.3 Stability analysis

From Eq. (12.11) one can see that the simple explicit method is consistent with the PDE(12.1). Recall from Lecture 4 that consistency means that the solution of the finite-differencescheme approaches the solution of the differential equation as the step size(s), κ and h in thiscase, tend to zero. In other words, the local truncation error τ satisfies limκ,h→0 τ = 0.

From the study of ODEs, we know that consistency alone is not sufficient for the numericalsolution to converge to the analytical solution of the PDE. To assure the convergence, wemust also require that the finite-difference scheme be stable. Recall that stability means thatsmall errors made during one step of the computation must not grow at subsequent steps. ForODEs, we stated a theorem that said that “stability + consistency” implied convergence of thenumerical solution to the analytical one. For PDEs, a similar result also holds:

Lax Equivalence Theorem, 12.1 For a properly posed (as discussed in Lecture 11)IBVP and for a finite-difference scheme that is consistent with the IBVP, stability is a necessaryand sufficient condition for convergence.

As for ODEs, this theorem can be understood from the following simple consideration.Let un

m = u(xm, tn) be the exact solution of the PDE, Unm be the exact solution of the finite-

difference scheme, and Unm be the actually computed solution of that scheme. (It may differ

from the exact one because, e.g., of round-off errors.) Then

|unm − Un

m| =∣∣(un

m − Unm

)+

(Un

m − Unm

)∣∣ ≤∣∣un

m − Unm

∣∣ +∣∣Un

m − Unm

∣∣ . (12.14)

If the difference scheme is consistent, then the first term on the r.h.s. is small. If the differencescheme is stable, then the second term on the r.h.s. is small for all n (i.e., it does not grow).Thus, if the scheme is both consistent and stable, then the l.h.s. of (12.14) is small for alln, which, in words, means that the numerical solution of the finite-difference scheme closelyapproximates the analytical solution of the PDE.

Now we will show how stability of a finite-difference scheme for a PDE can be studied. Wewill do this using two alternative methods. Method 1 will show a relation between the stabilityanalysis for PDEs with that for systems of ODEs. Method 2 will be new. It is specific toPDEs and, quite pleasantly, is easier to apply than Method 1. However, nothing is free: thissimplicity comes at the price that this method gives less complete information than Method 1.We will provide more details after we will have described both methods.

Method 1 (Matrix stability analysis)One can view scheme (12.11) (and hence (12.12)) as the simple explicit Euler method appliedto the following coupled system of ODEs:

d

dt

u1

u2

··

uM−1

=1

h2

−2 1 0 · · 01 −2 1 0 · 0· · · · · ·0 · 0 1 −2 10 · · 0 1 −2

u1

u2

··

uM−1

. (12.15)

(In writing out (12.15), we have also used the homogeneous boundary conditions (12.8).) In-deed, Eqs. (12.11) are obtained by discretizing the time derivative in (12.15) according to(12.9). Thus, studying the stability of scheme (12.11) is equivalent to studying the stabilityof the simple Euler method for system (12.15). You will be asked to do so, using techniques

Page 111: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 110

of Lecture 5, in one of the homework problems. Below we will proceed in a slightly different,although, of course, equivalent, way.

We write Eqs. (12.12) in the matrix form:

Un+11

Un+12

··

Un+1M−1

=

1− 2r r 0 · · 0r 1− 2r r 0 · 0· · · · · ·0 · 0 r 1− 2r r0 · · 0 r 1− 2r

Un1

Un2

··

UnM−1

, (12.16)

or~U

n+1= A~Un, (12.17)

where r is defined by (12.13),

~Un =[Un

1 , Un2 , · · , Un

M−1

]T,

and A is the matrix on the r.h.s. of (12.16).Iteration scheme (12.16) will converge to a solution only if all the eigenvalues of A do

not exceed 1 in magnitude. Indeed, if any of these eigenvalues exceed 1 (say, λ1 > 1), then‖Un‖ = ‖AnU0‖ will grow as λn

1 . Therefore, to continue with the stability analysis, we needto know bounds for the eigenvalues of matrix A. In fact, for the matrix of the very specialform appearing in (12.16), exact eigenvalues are well known. We present the following resultwithout a proof (which can be found, e.g., in D. Kincaid and W. Cheney, Numerical Analysis:Mathematics of Scientific Computing, 3rd Ed. (Brooks/Cole, 2002); Sec. 9.1).

Lemma Let B be an N ×N tridiagonal matrix of the form

B =

b c 0 · · 0a b c 0 · 0· · · · · ·0 · 0 a b c0 · · 0 a b

. (12.18)

The eigenvalues and the corresponding eigenvectors of B are:

λj = b + 2√

ac cosπj

N + 1, ~vj =

(cb

)1/2sin 1·πj

N+1(cb

)2/2sin 2·πj

N+1

··(

cb

)N/2sin N ·πj

N+1

, j = 1, . . . , N . (12.19)

Using this Lemma, we immediately deduce that the eigenvalues of matrix A in (12.17) are

λj = 1− 2r + 2r cosπj

M, j = 1, . . . , M − 1 , (12.20)

whence

λmin = λM−1 = 1− 2r + 2r cosπ(M − 1)

M, (12.21)

λmax = λ1 = 1− 2r + 2r cosπ

M. (12.22)

Page 112: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 111

If π/M ¿ 1 (i.e., if there are sufficiently many grid points on the interval [0, 1]), the preceedingexpressions reduce to

λmin ≈ 1− 4r + r( π

M

)2

, (12.23)

λmax ≈ 1− r( π

M

)2

, (12.24)

where we have used the expansion cos α ≈ 1−12α2 for α ¿ 1. Then the condition for convergence

of the iterations (12.16), which is, as we said before the Lemma,

−1 ≤ λj ≤ 1, j = 1, . . . , M − 1, (12.25)

yields

λmin ≈ 1− 4r + r( π

M

)2

≥ −1 ;

λmax ≈ 1− r( π

M

)2

≤ 1 .

The second of these equations is satisfied automatically because r = κ/h2 > 0. The firstequation yields:

r ≤ 2

4− (πM

)2 ≡2

4− (πh)2≈ 1

2. (12.26)

This condition, in a simplified form

r ≤ 1

2, or κ ≤ 1

2h2 , (12.27)

is usually taken as the stability condition of the finite-difference scheme (12.12). This meansthat if κ ≤ 1

2h2, then all round-off errors will eventually decay, and the scheme is stable. The

corresponding numerical solution will converge to the solution of IBVP (12.1)–(12.3). If, onthe other hand, κ > 1

2h2, then the errors will grow, thereby making the scheme unstable. The

corresponding numerical solution, starting at some t > 0, will have nothing in common withthe exact solution of the IBVP.

Remark 1 Above we said that for stability of iterations (12.17), the eigenvalues of A must beless than 1 in magnitude. Let us stress that this is true only for diagonalizable (e.g., symmetric)matrices. For nondiagonalizable matrices, e.g., for

N =

1 −1 0 · · 00 1 −1 0 · 0· · · · · ·0 · 0 0 1 −10 · · 0 0 1

, (12.28)

an eigenvalue-based stability analysis will fail. Indeed, all of N ’s eigenvalues equal 1, yet onecan show (e.g., using Matlab’s command norm) that ‖N n‖ → ∞ as nM−1, where M is definedin (12.5). There is an entire field of matrix analysis that deals with such nondiagonalizablematrices (with the descriptive keyword being “pseudospectra”), but we will not go into itsdetails here.

Page 113: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 112

Condition (12.27) highlights the main drawback of the simple explicit scheme (12.12).Namely, in order for this scheme to be stable (and hence converge to the analytical solutionof the IBVP), one must take very small steps in time, κ ≤ 1

2h2. This will make the code very

time-consuming. We will consider alternative approaches, which do not face that problem, inthe next lecture.

Now we turn to the second method for stability analysis, announced earlier in this section.

Method 2 (von Neumann stability analysis)It is rare that eigenvalues of a matrix, like those of matrix A in (12.17), are available. Therefore,we would like to be able to deduce stability of a scheme without finding those eigenvalues. Tothat end, observe that, since the Heat equation and its discrete version (12.12) are linear, thecomputational errors satisfy the same equations as the solution itself. Let us denote the errorat node (mh, nκ) as εn

m. According to the above, it satisfies Eq. (12.12):

εn+1m = rεn

m+1 + (1− 2r)εnm + rεn

m−1 . (12.29)

At each time level, the error can be expanded as a linear superposition of Fourier harmonics:

εnm =

l

cl(n) exp(iβlxm) (here i ≡ √−1). (12.30)

The range of values for βl will be specified as we proceed.Since Eq. (12.29) is linear, we can substitute in it each individual term of the above

expansion. In doing so, we will also let

cl(n) = ρn,

where ρ is the number to be determined. Thus, substituting εnm = ρn exp(iβmh) into (12.29),

one obtainsρn+1eiβmh = rρneiβ(m+1)h + (1− 2r)ρneiβmh + rρneiβ(m−1)h . (12.31)

Let us make two remarks about the notations in (12.31). First, the superscript in εnm means

that the error ε is evaluated at the nth time level. On the other hand, the superscript in ρn

means that the factor ρ is raised to nth power. Second, we have dropped the subscript l of β

since we now deal with only one term in expansion (12.30).Continuing with our derivation, we divide all terms in (12.31) by ρn exp(iβmh) and obtain:

ρ = reiβh + (1− 2r) + re−iβh = 1− 2r + 2r cos(βh) . (12.32)

Condition |ρ| ≤ 1, which would guarantee that the errors do not grow, yields:

−1 ≤ 1− 2r + 2r cos(βh) ≤ 1 . (12.33)

To obtain a condition on r from this double inequality, we need to know what values theparameter β can take. Even though periodic boundary conditions, which are tacitly impliedby the use of the Fourier expansion (12.30) (as shown in graduate courses on Fourier analysis),yield certain discrete values for β, we will follow an alternative — and simplified — approach.Namely, we will assume that the cosine in (12.33) can take its full range of values:

−1 ≤ cos(βh) ≤ 1 ⇒ 0 ≤ βh ≤ π . (12.34)

Page 114: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 113

Using now the half-angle formula, valid for any α:

1− cos α = 2 sin2(α

2

),

one rewrites (12.33) as

−1 ≤ 1− 4r sin2

(βh

2

)≤ 1. (12.35)

The right-hand inequality in (12.35) holds automatically, while the left-hand one implies:

r sin2

(βh

2

)≤ 1/2. (12.36)

To guarantee stability of the method, this inequality must hold for all values of βh from(12.34). In particular, it must hold for the “worst”-case value that yields the largest value ofsin2(βh/2). The latter value is 1, occurring for βh = π. Then, the stability condition is

r · 1 ≤ 1

2, (12.27)

which is the simplified form of the stability condition obtained in Method 1 above.A few remarks are now in order.

Remark 2 The reason why the condition obtained by the von Neumann analysis is slightlydifferent from the exact condition (12.26) is that the latter, based on the eigenvalues of matrixA in (12.16), takes into account the boundary conditions (QSA: how?), while the von Neumannanalysis, based on expansion (12.30), ignores those conditions.

Remark 3, related to Remark 2. A condition on r obtained via the von Neumann analysisis a necessary, but not sufficient, condition for stability of a finite-difference scheme. That is, ascheme may be found to be stable according to the von Neumann analysis, but taking into ac-count the information about the boundary conditions may reveal that there still is an instability.A simple example of this can be found in R.D. Richtmyer and K.W. Morton, Difference methodsfor initial-value problems, 2nd Ed. (Interscience/John Wiley, New York, 1967); pp. 154–156(that book also contains a thorough and rather clear presentation of sufficient conditions forstability). We will not, however, consider the generalization of the von Neumann analysis thattakes into account boundary conditions. A simple, yet practical approach that one may take isto apply the von Neumann stability analysis to a given scheme, find the neceassary condition(usually on r) that is required for the scheme to be stable, and then test the scheme on theproblem of interest while monitoring if any modes localized near the boundaries tend to becomeunstable.

Note that Method 1 provides a sufficient condition for stability of the numerical scheme23,because it takes into account the boundary conditions when setting up matrix A. However,that method is difficult to apply in practice since it requires the knowledge of the eigenvaluesof A.

Remark 4 Note, however, that in finite-difference discretization of hyperbolic equations, wherethe counterpart of matrix A may turn out to be nondiagonalizable, the von Neumann analysiswould provide more information about the stability of the numerical scheme than Method 1. Anextreme example is that of matrix N in (12.28), for which the information about its eigenvalues

23We refer to the case of the Heat equation, where matrix A is diagonalizable and hence has a basis ofeigenvectors over which any initial condition ~U0 can be expanded.

Page 115: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 114

is useless for the stability analysis (see above). Yet, the von Neumann analysis in this case canbe shown to correctly predict stability or instability of the numerical scheme.

An important feature of the von Neumann analysis is that it tells the user which harmonics(or modes) of the numerical solution will first become unstable if the stability condition isslightly violated. For example, it follows from (12.33) and (12.36) that if r just slightly exceedsthe critical value of 1/2, then modes with β ≈ π/h will have the amplification factor ρ thatwill be slightly less than −1:

r >1

2⇒ ρ

(β ≈ π

h

)< −1 . (12.37)

Now recall that the modes are proportional toexp(iβmh), hence the unstable modes mentioned aboveare

exp(iβmh) = exp(iπ

h·mh

)= exp(iπm) . (12.38)

Therefore, with the account of eiπ = −1, the modechanges its sign from one node to the next, as shownon the right. In other words, it is modes with the high-est frequency that can cause numerical instability of thesimple explicit method for the Heat equation.

−1

0

1

m−2

m−1

m

m+1

m+2

12.4 Explicit methods of higher order

As it follows from (12.11), scheme (12.12) has the first order of consistency in t and the secondorder of consistency in x (i.e., the global error is O(κ + h2)). Note, however, that since thestability condition (12.27),

κ ≤ 1

2h2 , (12.27)

must hold, then one always has O(κ) = O(h2) for a stable scheme. In other words, it wouldnot make sense to derive a method with the global error of O(κ2 + h2) while keeping κ ≤ 1

2h2.

However, it will still be of value to derive a method with the truncation error O(κ2 +h4), whichwe will now do.

Remembering how we derived higher-order methods for ODEs, we start off by writing outthe Taylor expansions for the finite differences appearing in (12.9) and (12.10):

Un+1m − Un

m

κ=

∂tUn

m +κ

2

∂2

∂t2Un

m + O(κ2) , (12.39)

Unm+1 − 2Un

m + Unm−1

h2=

∂2

∂x2Un

m +h2

12

∂4

∂x4Un

m + O(h4) . (12.40)

Equation (12.39) is the counterpart of Eq. (8.37), and Eq. (12.40) was obtained in Problem 3of HW 5. Substituting (12.39) and (12.40) into the Heat equation (12.1), we obtain:

Un+1m − Un

m

κ−Un

m+1 − 2Unm + Un

m−1

h2=

(∂

∂tUn

m −∂2

∂x2Un

m

)+

2

∂2

∂t2Un

m −h2

12

∂4

∂x4Un

m

)+O(κ2+h4) .

(12.41)

Page 116: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 115

The first term on the r.h.s. of (12.41) vanishes, because Unm is assumed to satisfy the Heat

equation. By differentiating both sides of the Heat equation with respect to t and then usingthe Heat equation again, we obtain:

∂t(ut − uxx) = utt − ∂2

∂x2ut = utt − uxx xx , ⇒ utt = uxxxx . (12.42)

Note that in the middle part of the first equation above, we have used that uxxt = utxx, whichimplies that the solution has to be differentiable sufficiently many times with respect x and t.We will state some results of the effect of smoothness of the solution on the order of the errorin the next section.

Continuing with the derivation of a higher-order scheme, we use (12.42) to write the secondterm on the r.h.s. of (12.41) as

2utt − h2

12uxxxx

)=

2− h2

12

)uxxxx . (12.43)

Thus, if one chooses

κ =1

6h2, or r =

1

6, (12.44)

then the term (12.43) vanishes identically. Then the r.h.s. of (12.41) becomes O(κ2 + h4) =O(h4) (or O(κ2)), since κ and h2 are related by (12.44). Thus, scheme (12.12) with r = 1/6has the error O(κ2) = O(h4); it is sometimes called the Douglas method.

12.5 Effect of smoothness of initial condition (12.2) on accuracy ofscheme (12.12)

As has been noted after Eq. (12.42), the order of the truncation error of the numerical schemedepends on the smoothness of the solution, which, in its turn, is determined by the smoothnessof the initial and boundary data. Below we give a corresponding result, whose proof may befound in Sec. 1.7 of the book by Richtmyer and Morton, mentioned a couple of pages back.

Consider the IBVP (12.1)–(12.3) with constant boundary conditions (g0(t) = const andg1(t) = const). Let the initial condition u0(x) have (p− 1) continuous derivatives, while its pthderivative is discontinuous but bounded. Then for scheme (12.12) with r ≤ 1/2 and r 6= 1/6,there hold the following conservative estimates for the error of the numerical solution:

||εn|| =

O(κp/4) = O(hp/2), for 1 ≤ p ≤ 3;

O(κ| ln κ|) = O(h2| ln h|), for p = 4;

O(κ) = O(h2), for p > 4 .

(12.45)

For the Douglas method (i.e. scheme (12.12) with r = 1/6), the analogous error estimates are:

||εn|| =

O(κp/3) = O(h2p/3), for 1 ≤ p ≤ 5;

O(κ2 ln κ) = O(h4| ln h|), for p = 6;

O(κ2) = O(h4), for p > 6 .

(12.46)

Let us emphasize that these estimates are very conservative and, according to Richtmyerand Morton, more precise estimates can be obtained, which would show that the error tends

Page 117: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 116

to zero with κ and h faster than predicted by (12.45) and (12.46). These estimates, however,do show two important trends, namely:(i) If the initial condition is not sufficiently smooth, the numerical error will tend to zero slowerthan for a smooth initial condition. In other words, the “full potential” of a scheme in regardsto its accuracy can be utilized only for sufficiently smooth initial data; see the last lines in(12.45) and (12.46).(ii) The higher the (formally derived) order of the truncation error, the smoother the initialcondition needs to be for the numerical solution to actually achieve that order of accuracy.

It appears likely that similar statements also hold for boundary conditions; we will not,however, consider that issue.

Finally, let us mention that there is one more important trend in regards to the accuracy ofnumerical schemes, which estimates (12.45) and (12.46) do not illustrate. Namely, the accuracyof a scheme depends also on how close the parameter r is to the stability threshold (which is 1/2for scheme (12.12)). Intuitively, the reason for this dependency can be understood as follows.Note that when r is at the stability threshold, there is a mode that does not decay, because forit, the amplification factor satisfies: |ρ| = 1 (ρ was introduced before Eq. (12.31)). Accordingto the end of Sec. 12.3, such a mode for scheme (12.12) is the highest-frequency mode withβ = π/h = πM . It is intuitively clear that any jagged initial condition will contain such a modeand modes with similar values of β (i.e. β = π(M − 1), π(M − 2), etc.). For those modes,|ρ| will be just slightly less than 1, and hence they will decay very slowly, thereby lowering theaccuracy of the scheme. On the contrary, when r is, say, 0.4, i.e. less than the threshold bya finite amount, then all modes will decay at a finite rate, and the accuracy of the scheme isexpected to be higher than for r = 0.5. In a homework problem, you will be asked to use amodel initial datum to explore the effect of its smoothness, as well as the effect of the proximityof r to the stability threshold, on the accuracy of scheme (12.12).

12.6 Questions for self-assessment

1. State the minimax principle and provide its intuitive interpretation. When can thisprinciple be useful?

2. Obtain (12.12).

3. State the Lax Equivalence Theorem and provide a justification for it, based on (12.14).

4. Make sure you can obtain (12.15) as explained in the text below that equation. Whereare the boundary conditions (12.8) used in this derivation?

5. Make sure you can obtain (12.16) from (12.12).

6. Describe the idea behind Method 1 of stability analysis of the Heat equation.

7. What will happen to the solution of scheme (12.12) if condition (12.27) is not satisfied?

8. Describe the idea behind the von Neumann stability analysis.

9. Make sure you can obtain Eqs. (12.31) and (12.32).

10. Answer the QSA posed in Remark 2 after the description of the von Neumann stabilityanalysis.

11. Describe advantages and disadvantages of the von Neumann method relative to Method1.

Page 118: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 117

12. What piece of information would be required to turn a von Neumann-like analysis froma necessary to a sufficient condition of stability?

13. Which harmonics are “most dangerous” from the point of view of making scheme (12.12)unstable? How would you proceed answering this question for an arbitrary numericalscheme?

14. Make sure you can follow the derivation of (12.42).

15. Can you recall a counterpart of the Douglas method for ODEs?

16. Which factors affect the accuracy of a numerical scheme?

Page 119: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 118

13 Implicit methods for the Heat equation

13.1 Derivation of the Crank-Nicolson scheme

We continue studying numerical methods for the IBVP (12.1)–(12.3). In Sec. 12.3, we haveseen that the Heat equation (12.1) could be represented as a coupled system of ODEs (12.15).Moreover, in one of the problems in Homework #12, you were asked whether that systemwas stiff (and the answer was ‘yes, it is stiff’). As we remember from Lecture 5, the way todeal with stiff systems is by using implicit methods, which may be constructed so as to beunconditionally stable irrespective of the step size in the evolution variable. In Lecture 4, westated that the highest order that an unconditionally stable method can have is 2 (that is, inour current notations, the error can tend to zero no faster than O(κ2)). From Lecture 4 we alsorecall the particular example of an implicit, unconditionally stable method of order 2: this isthe modified implicit Euler method (3.45). In our current notations, it is:

~Un+1

= ~Un +κ

2

(~f(tn, ~Un) +~f(tn+1, ~Un+1)

). (13.1)

In the case of system (12.15), the form of function ~f is:

fm(~Un) =Un

m+1 − 2Unm + Un

m−1

h2. (13.2)

Since the operator on the r.h.s. of the above equation will appear very frequently in theremainder of this course, we introduce a special notation for it:

Um+1 − 2Um + Um−1

h2≡ 1

h2δ2xUm . (13.3)

Similarly, we denoteUn+1 − Un

κ≡ 1

κδtU

n . (13.4)

Then Eq. (13.1) with f given by (13.2) takes on the form:

Un+1m − Un

m

κ=

1

2

[Un

m+1 − 2Unm + Un

m−1

h2+

Un+1m+1 − 2Un+1

m + Un+1m−1

h2

], (13.5)

or, in the above shorthand notations,

1

κδtU

nm =

1

2h2

[δ2xU

nm + δ2

xUn+1m

]. (13.6)

The finite-difference equation (13.5) can be rewritten as

Un+1m − r

2

(Un+1

m+1 − 2Un+1m + Un+1

m−1

)= Un

m +r

2

(Un

m+1 − 2Unm + Un

m−1

); (13.7)

and correspondingly, Eq. (13.6), as

(1− r

2δ2x

)Un+1

m =(1 +

r

2δ2x

)Un

m, m = 1, . . . , M − 1 , (13.8)

wherer =

κ

h2. (12.13)

Page 120: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 119

Scheme (13.7) (or, equivalently, (13.8)) is called theCrank-Nicolson (CN) method. Its stencil is shownon the right. Both from the stencil and from the definingequations one can see that Un+1

m cannot be determined

in isolation. Rather, one has to determine ~U on the en-tire (n + 1)th time level. Using our standard notationfor the solution vector,

~Un =[Un

1 , Un2 , · · , Un

M−1

]T,

we rewrite Eq. (13.8) in the vector form:

n+1

n

m−1 m m+1

(I − r

2A

)~Un+1 =

(I +

r

2A

)~Un + ~b , (13.9)

where I is the unit matrix and

A =

−2 1 0 · · 01 −2 1 0 · 0· · · · · ·0 · 0 1 −2 10 · · 0 1 −2

and ~b =r

2

Un0 + Un+1

0

0·0

UnM + Un+1

M

≡ r

2

g0(tn) + g0(tn+1)0·0

g1(tn) + g1(tn+1)

.

(13.10)Thus, to find ~Un+1, we need to solve a tridiagonal linear system, which we can do by theThomas algorithm of Lecture 8, using only O(M) operations.

Above, we have derived the CN scheme using the analogy with the modified implicit Eulermethod for ODEs. This analogy allows us to expect that the two key features of the lattermethod: the unconditional stability and the second-order accuracy, are inherited by the CNmethod. Below we show that this is indeed the case.

13.2 Truncation error of the Crank-Nicolson method

The easiest (and, probably, quite instructive) way to de-rive the truncation error of the CN method is to usethe following observation. Note that the stencil forthis method is symmetric relative to the “virtual node”(mh, (n + 1

2)κ), marked by a cross in the figure on the

right. This motivates one to expand quantities Unm etc.

about that virtual node. Let us denote the value of thesolution at that point by U :

U ≡ u

(mh, (n +

1

2)κ

).

Then, using the Taylor expansion of a function of twovariables (see Lecture 0), we obtain:

n+1

n

m−1 m m+1

n+1/2 x

Page 121: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 120

for ε = −1, 0, or 1:

Un+1m+ε = Um +

2Um,t + εhUm,x

)

+1

2!

((κ

2

)2

Um,tt + 2κ

2εhUm,xt + (εh)2Um,xx

)

+1

3!

((κ

2

)3

Um,ttt + 3(κ

2

)2

εhUm,xtt + 3κ

2(εh)2Um,xxt + (εh)3Um,xxx

)

+O(κ4 + κ3h + κ2h2 + κh3 + h4) ,

(13.11)

where Um,t = ∂∂t

Um|t=(n+ 12)κ, etc. Similarly,

Unm+ε = Um +

(−κ

2Um,t + εhUm,x

)

+1

2!

((−κ

2

)2

Um,tt + 2(−κ

2

)εhUm,xt + (εh)2Um,xx

)

+1

3!

((−κ

2

)3

Um,ttt + 3(−κ

2

)2

εhUm,xtt + 3(−κ

2

)(εh)2Um,xxt + (εh)3Um,xxx

)

+O(κ4 + κ3h + κ2h2 + κh3 + h4) .(13.12)

In a homework problem you will be asked to provide details of the following derivation. Namely,substituting expressions (13.11) and (13.12) with ε = 0 into the l.h.s. of (13.6), one obtains:

1

κδtU

nm = Um,t + O(κ2) . (13.13)

Next, substituting expressions (13.11) and (13.12) into the r.h.s. of (13.5), one obtains:

1

2h2

(δ2xU

nm + δ2

xUn+1m

)= Um,xx + O(h2 + κ2) . (13.14)

Finally, combining the last two equations yields

1

κδtU

nm −

1

2h2

(δ2xU

nm + δ2

xUn+1m

)= Um,t − Um,xx + O(κ2 + h2) , (13.15)

which means that the CN scheme (13.6) is second-order accurate in time.

Remark Note that the notation O(κ2 + h2) for the truncation error of the CN method doesnot necessarily imply that the sizes of κ and h may be taken to be about the same in practice.One of the homework problems explores this issue in detail.

Let us now consider the following obvious generalization of the modified implicit Eulerscheme:

~Un+1 = ~Un + κ((1− θ)~f(tn, ~Un) + θ~f(tn+1, ~Un+1)

), (13.16)

where the constant θ ∈ [0, 1]. The corresponding method for the Heat equation is, instead of(13.9):

(I − rθA) ~Un+1 = (I + r(1− θ)A) ~Un + ~b . (13.17)

Page 122: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 121

Obviously, when:

θ = 12, ⇒ (13.17) is the CN method;

θ = 0, ⇒ (13.17) is the simple explicit method (12.12);θ = 1, ⇒ (13.17) is an analogue of the simple implicit

Euler method for the Heat equation;its stencil is shown on the right.

We will refer to methods (13.17) with all possible valuesof θ as the θ-family of methods.

n+1

n

m−1 m m+1

Following the derivation of the Douglas method in Sec. 12.4 and of Eq. (13.15) above, onecan show that the truncation error of the θ-family of methods is

truncation error of (13.17) =

((1

2− θ

)κ− h2

12

)uxxxx + O(κ2 + h4) . (13.18)

Then it follows that in addition to the special value θ = 12, which gives the second-order accurate

in time CN method, there is another special values of θ:

θ =1

2− 1

12r. (13.19)

When θ is given by the above formula, the first term on the r.h.s. of (13.18) vanishes, and thetruncation error becomes O(κ2 +h4). The corresponding scheme is called Crandall’s method orthe optimal method. For other values of θ, the truncation error of (13.17) is only O(κ + h2).

13.3 Stability of the θ-family of methods

Here we use Method 1 of Lecture 12 to study stability of scheme (13.17). In a homeworkproblem, you will be asked to obtain the same results using the von Neumann stability analysis(Method 2 of Lecture 12).

Before we begin with the analysis, let us recall the idea of Method 1. First, since the PDEwe deal with in this Lecture is linear, both the solution and the error of the numerical schemesatisfy the difference equation with the same homogeneous part: namely, in this case, Eq.(13.17) with ~b = 0. Therefore, to establish the condition for stability of a numerical scheme,we consider only its homogeneous part, because inhomogeneous terms (i.e., ~b in (13.17)) donot alter the stability properties. Next, with a difference scheme written in the matrix form as

~Un+1 = M~Un,

one needs to determine whether the magnitude of any eigenvalue of matrix M exceeds 1. Thefollowing possibilities can ocur.• If at least one eigenvalue λj exists such that |λj| > 1, then the scheme is unstable, becausesmall errors “aligned” along the corresponding eigenvector(s) will grow.• At most one eigenvalue of M satisfies |λj| = 1, with all the other eigenvalues being strictlyless than 1 in magnitude. Then the scheme is stable.• There are several eigenvalues satisfying |λj1| = · · · = |λjJ

| = 1, with the other eigenvaluesbeing strictly less than 1 in magnitude. Moreover, the eigenvectors corresponding to λj1 , ...

Page 123: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 122

λjJare all distinct24. Then the numerical scheme is also stable. If, however, some of the

aforementioned eigenvectors coincide (this may, although does not always have to, happenwhen some λjk

is a double eigenvalue), then the scheme is unstable. In the latter case, theerrors will grow, although very slowly.

Following the above outline, we rewrite Eq. (13.17) with ~b being set to zero as

~Un+1 = (I − rθA)−1 (I + r(1− θ)A) ~Un. (13.20)

Now recall from Linear Algebra that matrices A, aA + bI, and (cA + dI)−1 have the sameeigenvectors, with the corresponding eigenvalues being λ, aλ + b, and (cλ + d)−1. Then, sincethe eigenvectors of (I + r(1− θ)A) and (I − rθA)−1 are the same, the eigenvalues of the matrixappearing on the r.h.s. of (13.20) are easily found to be

1 + r(1− θ)λj

1− rθλj

, (13.21)

where λj are the eigenvalues of A. (You will be asked to confirm this in a homework problem.)According to the Lemma of Lecture 12,

λj = −2 + 2 cosπj

M= −4 sin2

(πj

2M

), j = 1, . . . , M. (13.22)

As pointed out above, for stability of the scheme, it is necessary that

∣∣∣∣1 + r(1− θ)λj

1− rθλj

∣∣∣∣ ≤ 1, j = 1, . . . ,M. (13.23)

Using (13.22) and denoting

φj ≡ πj

2M,

one rewrites (13.23) as

∣∣1− 4r(1− θ) sin2 φj

∣∣ ≤∣∣1 + 4rθ sin2 φj

∣∣ . (13.24)

Since θ ≥ 0 by assumption and r > 0 by definition, then the expression under the absolutevalue sign on the r.h.s. of (13.24) is positive, and therefore the above inequality can be writtenas

− (1 + 4rθ sin2 φj

) ≤ 1− 4r(1− θ) sin2 φj ≤ 1 + 4rθ sin2 φj. (13.25)

The right part of this double inequality is automatically satisfied for all φj. The left part issatisfied when

4r(1− 2θ) sin2 φj ≤ 2 . (13.26)

The strongest restriction on r (and hence on the step size in time) occurs when sin2 φj assumesits largest value, i.e. 1. In that case, (13.26) yields

(1− 2θ)r ≤ 1

2. (13.27)

24From Linear Algebra, we recall that a sufficient, but not necessary, condition for those eigenvectors to bedistinct is that all the λj1 , ... λjJ

are distinct.

Page 124: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 123

This inequality should be considered separately in two cases:

1

2≤ θ ≤ 1 ⇒ r is arbitrary. (13.28)

That is, scheme (13.17) is unconditionally stable for any r.

0 ≤ θ <1

2⇒ Scheme (13.17) is stable provided that

r ≤ 1

2(1− 2θ). (13.29)

The above results show that the CN method, as well as the purely implicit method (13.17)with θ = 1, are unconditionally stable. That is, no relation between κ and h must hold in orderfor these schemes to converge to the exact solution of the Heat equation. This is the mainadvantage of the CN method over the explicit methods of Lecture 12.

Let us also note that Crandall’s method “(13.17) + (13.19)” belongs to the conditionallystable case, (13.29). However, since for Crandall’s method,

r =1

6(1− 2θ)<

1

2(1− 2θ), (13.30)

then, according to (13.29), this method is stable.

13.4 Ways to improve on the accuracy of the Crank-Nicolson method

To improve on the accuracy of the CN method in time, one may use higher-order multi-stepmethods or implicit Runge-Kutta methods. Recall, however, that no method of order higherthan 2 is absolutely stable, and therefore any scheme that one may expect to obtain alongthese lines will be (at best) only conditionally stable. Note also that the stencil for a multi-stepgeneralization of the CN scheme will contain nodes on more than two time levels.

To improve the accuracy of the CN method in space, one can use the analogy with Numerov’smethod. Namely, we rewrite the Heat equation uxx = ut as

δ2xU

nm + δ2

xUn+1m

2h2=

1

12

(δtU

nm+1

κ+ 10

δtUnm

κ+

δtUnm−1

κ

). (13.31)

However, one can verify that the resulting scheme is nothing but Crandall’s method!

Finally, we also mention a method attributed to DuFort and Frankel:

Un+1m − Un−1

m

2κ=

1

h2

(Un

m+1 −[Un+1

m + Un−1m

]+ Un

m−1

). (13.32)

This method has the truncation error of

O

(κ2 + h2 +

h

)2)

,

which means that it is consistent with the Heat equation only when (κ/h) → 0. When (κ/h) →const 6= 0, the DuFort-Frankel method approximates a different, hyperbolic, PDE. Althoughthe DuFort-Frankel method can be shown to be unconditionally stable, it is not used for solutionof parabolic PDEs because of the aforementioned need to have κ ¿ h in order to provide ascheme consistent with the PDE.

Page 125: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 124

13.5 Questions for self-assessment

1. Why does one want to use an implicit method to solve the Heat equation?

2. Make sure you can obtain (13.7), and from it, (13.8) and (13.9).

3. How many (order of magnitude) operations is required to advance the solution of the Heatequation from one time level to the next using the CN method? Is the CN a time-efficientmethod?

4. Make sure you can obtain (13.11) and (13.12).

5. What is the order of truncation error of the CN method?

6. Make sure you can obtain (13.17).

7. Why do you think Crandall’s method is called the “optimal” method?

8. Verify the first equality in (13.22).

9. What is the significance of the inequality (13.27)?

10. What is the main advantage of the CN method over the simple explicit method of Lecture12?

11. Do you think Crandall’s method has the same advantage?Also, explain the origin of each part of formula (13.30).

12. Is it possible to derive an unconditionally stable method with accuracy O(κ3) (or better)for the Heat equation? If ‘yes’, then how? If ‘no’, why?

13. Draw the stencil for the DuFort-Frankel method.

14. Do you think the DuFort-Frankel method is time-efficient?

Page 126: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 125

14 Generalizations of the simple Heat equation

In this Lecture, we will consider the following generalizations of the IBVP (12.1)–(12.3), basedon the simple Heat equation:

• Derivative (Neumann and mixed-type) boundary conditions;

• The linear Heat equations with variable coefficients;

• Nonlinear parabolic equations.

14.1 Boundary conditions involving derivatives

Let us consider the modified IBVP (12.1)–(12.3) where the only modification concerns theboundary condition at x = 0:

ut = uxx 0 < x < 1, t > 0 ; (14.1)

u(x, t = 0) = u0(x) 0 ≤ x ≤ 1 ; (14.2)

ux(0, t) + p(t)u(0, t) = q(t), t ≥ 0 ; (14.3)

u(1, t) = g1(t), t ≥ 0 . (14.4)

The boundary condition involving the derivative can be handled by either of the two methodsdescribed in Section 8.4 for one-dimensional BVPs. Below we will describe in detail how thefirst of those methods can be applied to the Heat equation. We will proceed in two steps,whereby we will first consider a modification of the simple explicit scheme (12.12) and then, amodification for the Crank-Nicolson method (13.8), for the boundary condition (14.3).

Modification of the simple explicit scheme (12.12)For n = 0, i.e. for t = 0, U0

m, m = 0, 1, . . . ,M − 1, M are given by the initial condition(14.2). Then, discretizing (14.3) with the second order of accuracy in x as

U01 − U0

−1

2h+ p0U0

0 = q0, (14.5)

one immediately finds U0−1 (because p0 ≡ p(0) and q0 ≡ q(0) are given by the boundary

condition (14.3)). Thus, at the time level n = 0, one knows U0m, m = −1, 0, 1, . . . , M − 1,M .

For n = 1, we first determine U1m for m = 0, 1, . . . , M − 1 as prescribed by the scheme:

U1m = U0

m + r(U0

m−1 − 2U0m + U0

m+1

). (14.6)

(Note that the value U0−1 is used to determine the value of U1

0 .) Having thus found U10 and U1

1 ,we next find U1

−1 from the equation analogous to (14.5):

U11 − U1

−1

2h+ p1U1

0 = q1. (14.7)

Finally, U1M is given by the boundary condition (14.4).

For n ≥ 2, the above step is repeated.

Remark We used the second-order accurate approximation for ux in (14.5) and its counterpartsfor n > 0 because we wanted the order of the error at the boundary to be consistent with theorder of the error of the scheme, which is O(h2).

Page 127: NumericalMethods_UofV

MATH 337, by T. Lakoba, University of Vermont 126

Modification of the Crank-Nicolson scheme (13.8)

For n = 0, one finds $U^0_{-1}$ from Eq. (14.5). For n = 1, one has, from the boundary condition (14.3):

$$\frac{U^1_1 - U^1_{-1}}{2h} + p^1 U^1_0 = q^1\,; \qquad (14.7)$$

from the scheme (13.7):

$$U^1_m - \frac{r}{2}\left(U^1_{m-1} - 2U^1_m + U^1_{m+1}\right) = U^0_m + \frac{r}{2}\left(U^0_{m-1} - 2U^0_m + U^0_{m+1}\right), \quad m = 0, 1, \ldots, M-1. \qquad (14.8)$$

Equations (14.7) and (14.8) yield M + 1 equations for the M + 1 unknowns $U^1_{-1}, U^1_0, U^1_1, \ldots, U^1_{M-1}$. This system of linear equations can, in principle, be solved. However, as we know from Sec. 8.4 (see Remark 2 there), the coefficient matrix in such a system will not be tridiagonal, which would preclude a straightforward application of the time-efficient Thomas algorithm. The way around that problem was also indicated in the aforementioned Remark. Namely, one needs to eliminate $U^1_{-1}$ from (14.7) and the Eq. (14.8) with m = 0. For example, we can solve (14.7) for $U^1_{-1}$ and substitute the result into Eq. (14.8) with m = 0. This yields:

$$U^1_0 - \frac{r}{2}\left(\left[U^1_1 - 2h(q^1 - p^1 U^1_0)\right] - 2U^1_0 + U^1_1\right) = U^0_0 + \frac{r}{2}\left(\left[U^0_1 - 2h(q^0 - p^0 U^0_0)\right] - 2U^0_0 + U^0_1\right), \qquad (14.9)$$

where on the r.h.s. we have also used (14.5). Upon simplifying the above equation, one can write the linear system for the vector

$$\vec{U}^n = \left[U^n_0,\ U^n_1,\ \ldots,\ U^n_{M-1}\right]^T, \qquad n = 0 \text{ or } 1$$

in the form:

$$A\vec{U}^1 = B\vec{U}^0 + \vec{b}, \qquad (14.10)$$

where

$$A = \begin{pmatrix} 1 + r(1 - hp^1) & -r & 0 & \cdots & 0 \\ -r/2 & 1+r & -r/2 & \cdots & 0 \\ & \ddots & \ddots & \ddots & \\ 0 & \cdots & -r/2 & 1+r & -r/2 \\ 0 & \cdots & 0 & -r/2 & 1+r \end{pmatrix} \qquad (14.11)$$

and

$$B = \begin{pmatrix} 1 - r(1 - hp^0) & r & 0 & \cdots & 0 \\ r/2 & 1-r & r/2 & \cdots & 0 \\ & \ddots & \ddots & \ddots & \\ 0 & \cdots & r/2 & 1-r & r/2 \\ 0 & \cdots & 0 & r/2 & 1-r \end{pmatrix}, \qquad \vec{b} = \begin{pmatrix} -rh(q^0 + q^1) \\ 0 \\ \vdots \\ 0 \\ \frac{r}{2}\left(g^0_1 + g^1_1\right) \end{pmatrix}. \qquad (14.12)$$

System (14.10) with the tridiagonal matrix A given by (14.11) can now be efficiently solved by the Thomas algorithm.

For n ≥ 2, the above step is repeated.
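As an illustration, here is a minimal Matlab sketch of one step (14.10)–(14.12); the function handles p, q, g1 are assumptions of this sketch, and the backslash solve on a sparse tridiagonal matrix costs O(M) operations, i.e. it is equivalent here to the Thomas algorithm:

    % One CN step (14.10): A*U1 = B*U0 + b, for the unknowns U_0, ..., U_{M-1}.
    M = length(U0);                      % U0 holds [U_0; ...; U_{M-1}] at time t
    e = ones(M,1);
    A = spdiags([-r/2*e, (1+r)*e, -r/2*e], -1:1, M, M);
    B = spdiags([ r/2*e, (1-r)*e,  r/2*e], -1:1, M, M);
    A(1,1) = 1 + r*(1 - h*p(t+kappa));   A(1,2) = -r;   % first rows of (14.11)
    B(1,1) = 1 - r*(1 - h*p(t));         B(1,2) =  r;   % and (14.12)
    b = zeros(M,1);
    b(1)   = -r*h*( q(t) + q(t+kappa) );                % vector b in (14.12)
    b(end) =  r/2*( g1(t) + g1(t+kappa) );
    U1 = A \ (B*U0 + b);                 % tridiagonal (Thomas-equivalent) solve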


14.2 Linear parabolic PDEs with variable coefficients

Generalization of the explicit scheme (12.12) to such PDEs is straightforward. For example, if instead of the Heat equation (14.1) we have a PDE

$$u_t = a(x,t)\,u_{xx}, \qquad (14.13)$$

then we use the following obvious discretization:

$$a(x,t)\,u_{xx} \;\to\; a^n_m\,\frac{\delta^2_x U^n_m}{h^2}. \qquad (14.14)$$

For the CN method, only slightly more effort is required. Note that the main concern here is to maintain the $O(\kappa^2 + h^2)$ accuracy of the method. Maintaining this accuracy is achieved by using the (well-known to you by now) fact that

$$\frac{f(X+H) - f(X-H)}{2H} = f'(X) + O(H^2), \quad \text{or, equivalently,} \quad \frac{f(X+H) - f(X)}{H} = f'\!\left(X + \frac{H}{2}\right) + O(H^2), \qquad (14.15a)$$

where f(X) is any sufficiently smooth function, and X can stand for either x or t (then H stands for either h or κ, respectively). Similarly, using the Taylor expansion, you will be asked in a QSA to show that

$$\frac{f(X+H) + f(X)}{2} = f\!\left(X + \frac{H}{2}\right) + O(H^2). \qquad (14.15b)$$

In other words, we can use the values f(X) and f(X+H) to approximate the values of the function and its derivative at $(X + \frac{H}{2})$ — the midpoint between X and X + H — with accuracy $O(H^2)$. Using the idea expressed by (14.15), the schemes that we will list below can be shown to have the required accuracy of $O(\kappa^2 + h^2)$.

For the PDE

$$u_t = a(x,t)\,u_{xx} + b(x,t)\,u_x + c(x,t)\,u, \qquad (14.16)$$

we discretize the terms in a rather obvious way:

$$u_t \;\to\; \frac{1}{\kappa}\,\delta_t U^n_m,$$
$$a(x,t)\,u_{xx} \;\to\; \frac{1}{2h^2}\left(a^n_m\,\delta^2_x U^n_m + a^{n+1}_m\,\delta^2_x U^{n+1}_m\right),$$
$$b(x,t)\,u_x \;\to\; \frac{1}{4h}\left(b^n_m\left(U^n_{m+1} - U^n_{m-1}\right) + b^{n+1}_m\left(U^{n+1}_{m+1} - U^{n+1}_{m-1}\right)\right),$$
$$c(x,t)\,u \;\to\; \frac{1}{2}\left(c^n_m U^n_m + c^{n+1}_m U^{n+1}_m\right). \qquad (14.17)$$

Let us explain the origin of the expressions on the r.h.s.'es of the first and third lines above. The term on the first line approximates $u_t$ with accuracy $O(\kappa^2)$ at the virtual node $(mh, (n+\frac12)\kappa)$; this is just a straightforward corollary of the second line of (14.15a). The term on the third line has two parts. The first part (with 1/(2h) factored into it) approximates $b\,u_x$ with accuracy $O(h^2)$ at the node $(mh, n\kappa)$; this is just a straightforward corollary of the first line of (14.15a). Similarly, the second part approximates $b\,u_x$ with accuracy $O(h^2)$ at the node $(mh, (n+1)\kappa)$. Hence the average of these two parts approximates $b\,u_x$ with accuracy $O(\kappa^2 + h^2)$ at the virtual node $(mh, (n+\frac12)\kappa)$; this is a straightforward corollary of (14.15b). (If you still have difficulty following these explanations, draw the stencil for the CN method and then draw all the nodes mentioned above.)

Often, the PDE arises in a physical problem in the form

$$\gamma(x,t)\,u_t = \left(\alpha(x,t)\,u_x\right)_x + \beta(x,t)\,u\,. \qquad (14.18)$$

Instead of manipulating the terms so as to transform this to the form of (14.16) and then use the discretization (14.17), one can discretize (14.18) directly:

$$\gamma(x,t)\,u_t \;\to\; \frac{1}{2}\left(\gamma^n_m + \gamma^{n+1}_m\right)\frac{1}{\kappa}\,\delta_t U^n_m, \quad \text{or} \quad \gamma^{n+\frac12}_m\,\frac{1}{\kappa}\,\delta_t U^n_m,$$
$$\left(\alpha(x,t)\,u_x\right)_x \;\to\; \frac{1}{2h}\left(\alpha^n_{m+\frac12}\,\frac{\delta_x U^n_m}{h} - \alpha^n_{m-\frac12}\,\frac{\delta_x U^n_{m-1}}{h}\right) + \frac{1}{2h}\left(\alpha^{n+1}_{m+\frac12}\,\frac{\delta_x U^{n+1}_m}{h} - \alpha^{n+1}_{m-\frac12}\,\frac{\delta_x U^{n+1}_{m-1}}{h}\right),$$
$$\beta(x,t)\,u \;\to\; \frac{1}{2}\left(\beta^n_m U^n_m + \beta^{n+1}_m U^{n+1}_m\right). \qquad (14.19)$$

Here we only explain the term on the r.h.s. of the second line, since the other two discretizations are analogous to those presented in (14.17). The first term in the first parentheses approximates $\alpha u_x$ with accuracy $O(h^2)$ at the virtual node $((m+\frac12)h, n\kappa)$; this is a corollary of the second line of (14.15a). Similarly, the second term in the first parentheses approximates $\alpha u_x$ with accuracy $O(h^2)$ at the virtual node $((m-\frac12)h, n\kappa)$. Consequently, the entire expression in the first parentheses with 1/h factored in it approximates $(\alpha u_x)_x$ with accuracy $O(h^2)$ at the node $(mh, n\kappa)$; this is a corollary of the first line of (14.15a). Finally, the entire expression on the r.h.s. of the second line of (14.19) approximates $(\alpha u_x)_x$ with accuracy $O(\kappa^2 + h^2)$ at the virtual node $(mh, (n+\frac12)\kappa)$.
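To show how little code the conservative discretization (14.19) requires, here is a minimal Matlab sketch of one time step for (14.18); the homogeneous Dirichlet boundary conditions and the function handles alpha, beta, gamma are assumptions of this sketch:

    % One step of scheme (14.19) for gamma*u_t = (alpha*u_x)_x + beta*u.
    % Un holds the inner-node values U^n_1, ..., U^n_{M-1}.
    x   = (h:h:1-h)';   tp = t + kappa;   s = kappa/(2*h^2);
    apn = alpha(x+h/2, t);    amn = alpha(x-h/2, t);   % alpha^n at m +/- 1/2
    app = alpha(x+h/2, tp);   amp = alpha(x-h/2, tp);  % same at level n+1
    g   = gamma(x, t + kappa/2);                       % gamma^{n+1/2}_m
    LHS = diag( g + s*(app+amp) - (kappa/2)*beta(x,tp) ) ...
          - diag( s*amp(2:end), -1 ) - diag( s*app(1:end-1), 1 );
    RHS = diag( g - s*(apn+amn) + (kappa/2)*beta(x,t) ) ...
          + diag( s*amn(2:end), -1 ) + diag( s*apn(1:end-1), 1 );
    Un = LHS \ (RHS*Un);   t = tp;      % tridiagonal solve advances one level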

14.3 Von Neumann stability analysis for PDEs with variable coefficients

Let us recall that the idea of the von Neumann analysis was to expand the error of the PDE with constant coefficients into a set of exponentials $\rho^n \exp(i\beta x) = \rho^n \exp(i\beta m h)$, each of which exactly satisfies the discretized PDE for a certain ρ. Note also that for both the simple explicit scheme (12.12) and the modified Euler-like scheme considered in Problem 4 of Homework #12, the harmonics $\exp(i\beta m h)$ that would first become unstable, should the stability condition for the scheme be violated, are those with the largest spatial frequency, i.e. with β = π/h (see the figure at the end of Sec. 12.3). The same appears to be true for most other conditionally stable schemes.

Now let us consider the PDE (14.13) (or either of (14.16) and (14.18)) where the coefficient(s) does (do) not vary too rapidly. Then, such a coefficient can be considered to be almost constant in comparison to the highest-frequency harmonic that can potentially cause the instability. This simple consideration suggests that for PDEs with sufficiently smooth coefficients, the von Neumann analysis can be carried out without any changes, while assuming that at each point in space and time, the coefficients are constant.


For example, the stability criterion for the simple explicit method applied to (14.13) becomes

$$r \le \frac{1}{2a(x,t)}. \qquad (14.20)$$

This can be interpreted in the following two different ways.

(i) If the programmer decides to use constant values for κ and h, and hence r, over the entire grid, then he/she should ensure that

$$r \le \frac{1}{2\,\max_{x,t} a(x,t)} \qquad (14.21)$$

for the scheme to be stable.

(ii) If the programmer decides to vary the step sizes in x and/or t, then at every time level the step size κ is to be chosen so as to satisfy the condition

$$r(t) \le \frac{1}{2\,\max_{x} a(x,t)}. \qquad (14.22)$$
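In code, interpretation (ii) amounts to one line per time level; a minimal sketch, assuming a is a function handle, xgrid is the spatial grid, and 0.9 is a safety factor of our own choosing:

    % Choose the time step at the current level t from (14.22):
    kappa = 0.9 * h^2 / ( 2*max( a(xgrid, t) ) );   % r = kappa/h^2 <= 1/(2 max_x a)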

Let us now point out another issue, unrelated to the above one, which, however, is also specific to PDEs (14.16) and (14.18) and could not occur for the simple Heat equation. Namely, note that (14.16) and (14.18) may have exponentially growing solutions, which the Heat equation (14.1) or (14.13) does not have. For example, Eq. (14.16) where each of a, b, and c is constant has a solution $u = \exp(ct)$. In such a case, when carrying out the von Neumann analysis, one should not require that |ρ| ≤ 1 for the stability of the scheme, because this would preclude obtaining the above exponentially growing solution. Instead, one should stipulate that the largest²⁵ value of |ρ| satisfy (for the above example)

$$\max |\rho| = 1 + c\kappa + \text{“smaller terms”}, \qquad (14.23)$$

while all the other ρ's must be strictly less than 1 in absolute value. Equation (14.23) allows the (largest) amplification factor ρ corresponding to very low-frequency harmonics (i.e. those with β ≈ 0) to be greater than 1 because of the true nature of the solution. If one does not include the term $c\kappa$ in the modified definition of stability, Eq. (14.23), then it would not be possible to find a range of r where the scheme (12.12) could be stable.

For the above example of Eq. (14.16) with constant coefficients a, b, and c, the condition on r based on this modified stability criterion can be shown, by a straightforward but somewhat lengthy calculation, to be

$$r \le \frac{2 + c\kappa - \frac12 r^2 b^2 \pi^2 h^4}{4a} \approx \frac{1}{2a}, \qquad (14.24)$$

i.e. the same as (14.20).

14.4 Nonlinear parabolic PDEs: I. Explicit schemes, and the Newton–Raphson method for implicit schemes

Explicit schemes for nonlinear parabolic PDEs can be constructed straightforwardly. For example, for the PDE

$$u_t = \left(u^2 u_x\right)_x, \qquad (14.25)$$

²⁵ if more than one value of ρ for a given β exists, as for a multi-level scheme


the simple explicit scheme is

$$\frac{\delta_t U^n_m}{\kappa} = \frac{1}{h^2}\left[\left(\frac{U^n_{m+1} + U^n_m}{2}\right)^2\left(U^n_{m+1} - U^n_m\right) - \left(\frac{U^n_m + U^n_{m-1}}{2}\right)^2\left(U^n_m - U^n_{m-1}\right)\right]. \qquad (14.26)$$

The von Neumann stability analysis can no longer be rigorously justified for (most) nonlinear PDEs, but it can be justified approximately, if one assumes that the solution u(x,t) (and hence its numerical counterpart $U^n_m$) does not vary too rapidly. This is analogous to the condition on the coefficients of linear PDEs, mentioned in Sec. 14.3. Below we provide an intuitive explanation for this claim using (14.25) as a model problem, and then write down the stability criterion for that PDE.

Let us suppose that at a given moment t, we have obtained the solution u(x,t) to (14.25). We now want to use (14.25) to advance this solution in time by a small step κ. It is reasonable to expect that the solution at t + κ will be close to that at t:

$$u(x, t+\kappa) = u(x,t) + w(x,t;\kappa), \quad \text{where } |w| \ll |u|\,. \qquad (14.27)$$

Since w is small in the above sense, it must satisfy (14.25) linearized on the background of the initial solution u(x,t):

$$w_t = \left(u^2 w_x\right)_x + \left(2u\,w\,u_x\right)_x = u^2 w_{xx} + 2(u^2)_x w_x + (u^2)_{xx} w\,, \qquad (14.28)$$

where we have omitted terms that are quadratic and cubic in w. In other words, to advance the solution of a nonlinear PDE in time by a small amount, we need to solve a linear PDE. Note that the linear PDE (14.28) has the form (14.16), where the role of the known coefficients a(x,t) etc. is now played by the terms $u^2(x,t)$ etc. These terms are also known from the solution u(x,t) at time t. Then the stability condition is given by (14.24), which for (14.28) takes on the form

$$r \le \frac{1}{2u^2(x,t)}. \qquad (14.29)$$

Condition (14.29) means that the step size κ needs to be adjusted accordingly at each time level so as to maintain the stability of the scheme.
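A minimal Matlab sketch of one step of the explicit scheme (14.26), with κ chosen from (14.29) at each level; the safety factor 0.9 and the assumption of Dirichlet boundary conditions are our own choices:

    % U holds U^n_0, ..., U^n_M; boundary values are assumed known (Dirichlet).
    kappa = 0.9 * h^2 / ( 2*max(U.^2) );   % stability condition (14.29)
    r  = kappa/h^2;
    m  = 2:length(U)-1;                    % inner nodes
    Up = U;
    Up(m) = U(m) + r*( ((U(m+1)+U(m)).^2/4).*(U(m+1)-U(m)) ...
                     - ((U(m)+U(m-1)).^2/4).*(U(m)-U(m-1)) );   % scheme (14.26)
    U = Up;   t = t + kappa;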

As far as implicit methods for nonlinear PDEs are concerned, there are quite a few possibilities in which such methods can be designed. Here we will discuss in detail an equivalent of the Newton–Raphson method considered in Lecture 8 and also mention a couple of other methods: an ad hoc method suitable for a certain class of nonlinear parabolic PDEs and an operator-splitting method. In the Appendix, we will briefly describe a group of methods known as Implicit–Explicit (IMEX) methods.

The main difficulty that one faces with the Newton–Raphson method is, similarly to Lecture 8, the need to solve systems of algebraic nonlinear equations to obtain the solution at the "new" time level. We will now discuss approaches to this problem using Eq. (14.25) as the model PDE.

To begin, we can use the following slight modification of scheme (14.19) for the PDE (14.18), where now $\alpha = u^2$, $\beta = 0$, and $\gamma = 1$:

$$u_t \;\to\; \frac{1}{\kappa}\,\delta_t U^n_m\,;$$
$$\left(u^2 u_x\right)_x \;\to\; \frac{1}{2h}\left(\frac{(U^n_m)^2 + (U^n_{m+1})^2}{2}\,\frac{\delta_x U^n_m}{h} - \frac{(U^n_{m-1})^2 + (U^n_m)^2}{2}\,\frac{\delta_x U^n_{m-1}}{h}\right) + \frac{1}{2h}\left(\frac{(U^{n+1}_m)^2 + (U^{n+1}_{m+1})^2}{2}\,\frac{\delta_x U^{n+1}_m}{h} - \frac{(U^{n+1}_{m-1})^2 + (U^{n+1}_m)^2}{2}\,\frac{\delta_x U^{n+1}_{m-1}}{h}\right). \qquad (14.30)$$


Next, we substitute the discretized derivatives in (14.30) into Eq. (14.25). You will be asked to write down the resulting scheme in a homework problem. This scheme, which is just a nonlinear algebraic system of equations for $U^{n+1}_m$ with m = 1, ..., M−1, can be solved by any of the iterative methods of Sec. 8.6. We will show the details for the Newton–Raphson method. In fact, that method is essentially the linearization used in Eqs. (14.27) and (14.28). Namely, the Newton–Raphson method (as any other iterative method) requires one to use an initial guess for $U^{n+1}_m$, and an obvious candidate for such a guess is the known value of $U^n_m$. Then we let

$$\vec{U}^{n+1} = \vec{U}^n + \vec{\varepsilon}^{(0)}, \qquad \|\vec{\varepsilon}^{(0)}\| \ll \|\vec{U}^n\|\,; \qquad (14.31)$$

compare this with (8.84). Upon substituting (14.30) and (14.31) into (14.25) and discarding terms $O\big((\varepsilon^{(0)})^2\big)$, one obtains:

$$\varepsilon^{(0)}_m - \frac{\kappa}{2h}\left( \left(\varepsilon^{(0)}_m U^n_m + \varepsilon^{(0)}_{m+1} U^n_{m+1}\right)\frac{\delta_x U^n_m}{h} - \left(\varepsilon^{(0)}_{m-1} U^n_{m-1} + \varepsilon^{(0)}_m U^n_m\right)\frac{\delta_x U^n_{m-1}}{h} + \frac{(U^n_m)^2 + (U^n_{m+1})^2}{2}\,\frac{\delta_x \varepsilon^{(0)}_m}{h} - \frac{(U^n_{m-1})^2 + (U^n_m)^2}{2}\,\frac{\delta_x \varepsilon^{(0)}_{m-1}}{h} \right)$$
$$= \frac{\kappa}{h}\left( \frac{(U^n_m)^2 + (U^n_{m+1})^2}{2}\,\frac{\delta_x U^n_m}{h} - \frac{(U^n_{m-1})^2 + (U^n_m)^2}{2}\,\frac{\delta_x U^n_{m-1}}{h} \right). \qquad (14.32)$$

Let us outline how the above expression is obtained; you will be asked to fill in the missing details in a homework problem. Although (14.32) can be obtained by direct multiplication of terms in (14.30), the easier, and "mathematically literate", way is to use the following form of the familiar Product Rule from Calculus:

$$\Delta(fg) \equiv (f + \Delta f)(g + \Delta g) - fg \approx f\,\Delta g + g\,\Delta f, \quad \text{where } \Delta f \ll f,\ \Delta g \ll g\,. \qquad \text{(Product Rule)}$$

As a preliminary step of the calculation that you will need to complete on your own, consider the term $(U^{n+1}_m)^2$. Let us use the substitution (14.31) and denote $U^n_m \equiv f$ and $U^{n+1}_m = U^n_m + \varepsilon^{(0)}_m \equiv f + \Delta f$. Then, using the form of the Product Rule stated above with g = f, you can write

$$(U^{n+1}_m)^2 \approx (U^n_m)^2 + 2U^n_m\,\varepsilon^{(0)}_m. \qquad (14.33)$$

Next, consider the first term in the first large parentheses in (14.30) and denote $\left((U^n_m)^2 + (U^n_{m+1})^2\right)$ by f and $\delta_x U^n_m / h$ by g. (So, you denote f, g, Δf, and Δg anew each time you use the Product Rule.) Then it is reasonable to use the following names for the corresponding quantities in the second large parentheses in (14.30):

$$(U^{n+1}_m)^2 + (U^{n+1}_{m+1})^2 \equiv f + \Delta f, \qquad \delta_x U^{n+1}_m / h \equiv g + \Delta g, \qquad (14.34)$$

where Δf and Δg are proportional to $\varepsilon^{(0)}$. At home you will obtain the form of Δf in (14.34) using Eq. (14.31). Directly from Eq. (14.31) you will be able to obtain the Δg. Then all that remains is to use the Product Rule on these f, g, Δf, and Δg. The remaining terms in (14.30) should be handled similarly.

From (14.32), the vector $\vec{\varepsilon}^{(0)}$ can be solved for in a time-efficient manner (since the coefficient matrix is tridiagonal). In most circumstances, one iteration (14.31) is sufficient, but if need be, the iterations can be continued in complete analogy with the procedure described at the end of Sec. 8.6. Namely, we first compute

$$\vec{U}^{(1)} \equiv \vec{U}^n + \vec{\varepsilon}^{(0)} \qquad (14.35)$$


and then seek a correction to that solution in the form

$$\vec{U}^{n+1} = \vec{U}^{(1)} + \vec{\varepsilon}^{(1)}, \qquad \|\vec{\varepsilon}^{(1)}\| \ll \|\vec{U}^{(1)}\|\,. \qquad (14.36)$$

Substituting (14.36) along with (14.30) into (14.25), we obtain an equation similar to (14.32):

$$\varepsilon^{(1)}_m - \frac{\kappa}{2h}\left( \left(\varepsilon^{(1)}_m U^{(1)}_m + \varepsilon^{(1)}_{m+1} U^{(1)}_{m+1}\right)\frac{\delta_x U^{(1)}_m}{h} - \left(\varepsilon^{(1)}_{m-1} U^{(1)}_{m-1} + \varepsilon^{(1)}_m U^{(1)}_m\right)\frac{\delta_x U^{(1)}_{m-1}}{h} + \frac{(U^{(1)}_m)^2 + (U^{(1)}_{m+1})^2}{2}\,\frac{\delta_x \varepsilon^{(1)}_m}{h} - \frac{(U^{(1)}_{m-1})^2 + (U^{(1)}_m)^2}{2}\,\frac{\delta_x \varepsilon^{(1)}_{m-1}}{h} \right)$$
$$= -\varepsilon^{(0)}_m + \frac{\kappa}{2h}\left( \frac{(U^n_m)^2 + (U^n_{m+1})^2}{2}\,\frac{\delta_x U^n_m}{h} - \frac{(U^n_{m-1})^2 + (U^n_m)^2}{2}\,\frac{\delta_x U^n_{m-1}}{h} \right) + \frac{\kappa}{2h}\left( \frac{(U^{(1)}_m)^2 + (U^{(1)}_{m+1})^2}{2}\,\frac{\delta_x U^{(1)}_m}{h} - \frac{(U^{(1)}_{m-1})^2 + (U^{(1)}_m)^2}{2}\,\frac{\delta_x U^{(1)}_{m-1}}{h} \right). \qquad (14.37)$$

Recall that here, $U^{(1)}$, $U^n$, and $\varepsilon^{(0)}$ are known, and one's goal is to solve this linear equation for $\varepsilon^{(1)}$. This can be done time-efficiently, because the coefficient matrix of the equation for $\varepsilon^{(1)}$ is tridiagonal. Once $\varepsilon^{(1)}$ has been found, one can define, and solve for, $\varepsilon^{(2)}$, etc. These iterations can be carried out in the above manner as many times as need be.
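For concreteness, here is a minimal Matlab sketch of the first iteration (14.31)–(14.32); the coefficients below are obtained by collecting the $\varepsilon^{(0)}_{m-1}$, $\varepsilon^{(0)}_m$, $\varepsilon^{(0)}_{m+1}$ terms of (14.32), and homogeneous Dirichlet boundary conditions (ε = 0 at both ends) are an assumption of this sketch:

    % One Newton-Raphson iteration for scheme (14.30) applied to (14.25).
    % U holds U^n_0, ..., U^n_M as a column vector.
    m   = 2:length(U)-1;                       % inner nodes
    c   = kappa/(2*h^2);
    dUp = U(m+1) - U(m);    dUm = U(m) - U(m-1);       % delta_x U^n terms
    Sp  = ( U(m).^2 + U(m+1).^2 )/2;                   % averaged (u^2) factors
    Sm  = ( U(m-1).^2 + U(m).^2 )/2;
    lo  = -c*( -U(m-1).*dUm + Sm );                    % coefficient of eps_{m-1}
    di  = 1 - c*( U(m).*dUp - U(m).*dUm - Sp - Sm );   % coefficient of eps_m
    up  = -c*(  U(m+1).*dUp + Sp );                    % coefficient of eps_{m+1}
    rhs = (kappa/h^2)*( Sp.*dUp - Sm.*dUm );           % r.h.s. of (14.32)
    T   = diag(di) + diag(lo(2:end),-1) + diag(up(1:end-1),1);
    eps0 = T \ rhs;                                    % tridiagonal solve
    U(m) = U(m) + eps0;                                % U^{n+1} = U^n + eps^{(0)}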

As we have seen above, the strength of the Newton–Raphson method is that it can be applied to programming an implicit numerical scheme for any nonlinear equation or system of equations. However, a drawback of this method is that it is quite cumbersome (see, e.g., (14.32) and (14.37)). Therefore, a considerable amount of research has been done on finding other methods which, on one hand, would to a large extent retain the good stability properties of implicit methods while, on the other hand, would be much easier to program. Two such systematic alternatives to the Newton–Raphson method, which can be applied to a very wide class of equations and which do not require the solution of a system of nonlinear equations, are described in the next Section.

To conclude this Section, we will point out one issue that is specific to the discretization of nonlinear differential equations.

Remark 1 Let us continue using (14.25) as the model problem. Note that it can be written in an equivalent form:

$$u_t = \frac{1}{3}\,(u^3)_{xx}\,. \qquad (14.38)$$

We can use the following discretization that has the accuracy of $O(\kappa^2 + h^2)$:

$$\frac{1}{\kappa}\,\delta_t U^n_m = \frac{1}{3}\cdot\frac{1}{2h^2}\left(\delta^2_x (U^3)^n_m + \delta^2_x (U^3)^{n+1}_m\right); \qquad (14.39)$$

recall the definition (13.3) of the operator $\delta^2_x$. The point we want to make here is that the nonlinear system (14.39) is different from the nonlinear system obtained upon substitution of (14.30) into (14.25)!

The issue we have encountered can be understood from the following simple example, pertaining to a single time level (hence we omit the superscript of the functions). Consider a nonlinear function u³. Obviously,

$$(u^3)_x = 3u^2 u_x\,. \qquad (14.40)$$


With the second-order accuracy, the l.h.s. can be discretized as, e.g.,

$$(u^3)_x \;\to\; \frac{(U_{m+1})^3 - (U_{m-1})^3}{2h} = \frac{(U_{m+1} - U_{m-1})\left(U^2_{m+1} + U_{m+1}U_{m-1} + U^2_{m-1}\right)}{2h}. \qquad (14.41)$$

Using the same — central-difference — formula to discretize the derivative on the r.h.s. of (14.40), one obtains

$$3u^2\, u_x \;\to\; 3U^2_m \cdot \frac{U_{m+1} - U_{m-1}}{2h}, \qquad (14.42)$$

which, obviously, does not equal the r.h.s. of (14.41), although it differs from it by an amount $O(h^2)$.

Thus, a nonlinear term can have several representations, which are equivalent in the continuous limit (like the l.h.s. and r.h.s. of (14.40)). However, these different representations, when discretized using the same rule, can still lead to distinct finite-difference equations.
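A quick numerical experiment in Matlab makes this Remark tangible; here we compare (14.41) and (14.42) for the (arbitrarily chosen) function u = sin x at x = 1:

    u = @(x) sin(x);   x = 1;
    for h = [0.1 0.05 0.025]
        d1 = ( u(x+h)^3 - u(x-h)^3 )/(2*h);            % discretization (14.41)
        d2 = 3*u(x)^2 * ( u(x+h) - u(x-h) )/(2*h);     % discretization (14.42)
        fprintf('h = %5.3f   difference = %9.2e\n', h, d1 - d2)
    end

The printed differences decrease by a factor of about 4 each time h is halved, confirming that the two representations differ by an amount $O(h^2)$.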

14.5 Nonlinear parabolic PDEs: II. Semi-implicit, implicit-explicit (IMEX), and other methods

14.5.1 A semi-implicit method

Let us present a simple alternative to the Newton–Raphson method using (14.38) as the model problem. With the accuracy of $O(\kappa^2)$, the u³ term can be discretized as follows:

$$u^3 \;\to\; \left(U^{n+\frac12}\right)^2 \frac{U^n + U^{n+1}}{2}. \qquad (14.43)$$

The r.h.s. of (14.43) is now linear with respect to $U^{n+1}$, but the problem is that we do not yet know $U^{n+\frac12}$. The latter can be approximated by an explicit method that should have the local truncation error $O(\kappa^2)$, and hence the global accuracy of one order less, i.e. only $O(\kappa)$. That is, one can first compute $U^{n+\frac12}$ and then use it as a known value in (14.43). A simple, $O(\kappa^2)$-accurate way to compute $U^{n+\frac12}$ is by a multi-step method similar to (3.4):

$$U^{n+\frac12} = U^n + \frac12\left(U^n - U^{n-1}\right) = \frac32 U^n - \frac12 U^{n-1}\,. \qquad (14.44)$$

Then the scheme

$$\delta_t U^n_m = \frac{r}{6}\,\delta^2_x\left(\left(U^{n+\frac12}_m\right)^2\left(U^n_m + U^{n+1}_m\right)\right) \qquad (14.45)$$

becomes an implicit scheme for the linear equation.

Method (14.45), (14.44) is a member of a large class of semi-implicit methods. It can be straightforwardly generalized to the following class of equations:

$$u_t = a(u, u_x, x, t)\,u_{xx} + b(u, u_x, x, t)\,u_x\,, \qquad (14.46)$$

where, as stated above, the coefficients a and b may depend on the solution u and its derivative $u_x$. (Further generalizations of this form are possible, but for the purpose of our brief discussion, form (14.46) is sufficient.) An extension of scheme (14.45), (14.44) for (14.46) is:

$$\frac{\delta_t U^n_m}{\kappa} = a\!\left(U^{n+\frac12}_m,\ (U^{n+\frac12}_m)_x,\ x_m,\ t_{n+\frac12}\right)\frac{(U^n_m)_{xx} + (U^{n+1}_m)_{xx}}{2} + b\!\left(U^{n+\frac12}_m,\ (U^{n+\frac12}_m)_x,\ x_m,\ t_{n+\frac12}\right)\frac{(U^n_m)_x + (U^{n+1}_m)_x}{2}, \qquad (14.47)$$


where $U^{n+\frac12}_m$ is given by (14.44) and $(U^n_m)_x$ denotes the second-order accurate finite-difference approximation of $u_x(x_m, t_n)$, etc.

Since this scheme is not fully implicit, it cannot be unconditionally stable (see Theorem 4.2 at the end of Lecture 4). However, one can show that it is unconditionally stable on the background of the constant solution, u = C where C is any constant, of (14.46). To show this, let us consider an ansatz

$$U^n_m = C + \epsilon\, \rho^n e^{i\beta m h}\,, \qquad (14.48)$$

where ρ is the amplification factor in the von Neumann analysis and ε ≪ 1 indicates the smallness of the perturbation to the exact solution u = C. Then, according to (14.44),

$$U^{n+\frac12}_m = C + \epsilon\left(\frac32\rho - \frac12\right)\rho^{n-1} e^{i\beta m h}\,. \qquad (14.49)$$

When, however, (14.49) is substituted into (14.47) and terms of order O(ε²) and higher are neglected, the O(ε)-term from (14.49) drops out, since the terms that it multiplies are already O(ε) (the C-term is absent from $(U^n_m)_{xx}$ and similar terms due to the x-derivative). Then, in the equation that results from the stability analysis, a and b take on the forms $a(C, 0, x_m, t_{n+\frac12})$ and $b(C, 0, x_m, t_{n+\frac12})$, which is the same as in the case of linear parabolic PDEs with variable coefficients, considered in Section 14.2. Thus, method (14.47) is unconditionally stable on the background of the constant solution of Eqs. (14.46). While it may be unstable (for "too large" a time step) on the background of other, non-constant, solutions, it may still be a good first method to try since it is much easier to implement than the Newton–Raphson method.

14.5.2 The idea behind Implicit–Explicit (IMEX) methods

IMEX methods present another attractive alternative to the Newton–Raphson method because, like the semi-implicit method above, they also do not require the solution of a system of nonlinear algebraic equations. They do require the step size κ to be restricted since they are not fully implicit and hence cannot be unconditionally stable (see Lecture 4). However, such a restriction can be significantly weaker than that for a fully explicit method. Below we present only the basic idea of IMEX methods. A more detailed, and quite readable, exposition, as well as references, can be found in Section IV.4 of the book by W. Hundsdorfer and J.G. Verwer, "Numerical Solution of Time-Dependent Advection-Diffusion-Reaction Equations" (Springer Series in Comput. Math., vol. 33, Springer, 2003).

The idea behind IMEX methods can be explained without any explicit reference to spatial variables. Let the evolution equation that we want to solve have the form

$$u_t = F(u(t), t) \equiv F_0(u(t), t) + F_1(u(t), t)\,, \qquad (14.50)$$

where $F_0$ is a non-stiff term suitable for explicit time-integration and $F_1$ is a stiff term that requires implicit treatment. Usually, $F_0$ and $F_1$ include, respectively, the advection and diffusion terms (i.e., the second and first terms in (14.16) or (14.46), respectively; recall from Lecture 12 that the simple Heat equation $u_t = u_{xx}$ is a stiff problem). The last term in (14.16), which is usually referred to as the reaction term (because it often describes chemical reactions), can belong to either $F_0$ or $F_1$. To make the splitting (14.50) useful for a numerical implementation, which means avoiding the solution of a system of nonlinear equations, it suffices to require that $F_1$ be linear in u. Below we will proceed with this assumption, but at the end of our discussion will mention a generalization where $F_1$ may contain nonlinear terms. Let us also note that


our consideration applies equally well both to a single Eq. (14.50) and to a system of coupled equations whose r.h.s. can be split as a sum of non-stiff and stiff terms.

A simple first-order accurate IMEX method for (14.50) is:

$$\frac{U^{n+1} - U^n}{\kappa} = F_0(U^n, t_n) + (1-\theta)\,F_1(U^n, t_n) + \theta\,F_1(U^{n+1}, t_{n+1})\,, \qquad (14.51)$$

where θ is a parameter, as in Lecture 13. Note that since, by design, $F_1$ depends on $U^{n+1}$ linearly, scheme (14.51) does not require its user to solve any nonlinear algebraic equations.

The stability analysis for this scheme is done as follows. Instead of the model equation

$$u_t = \lambda u\,, \qquad (4.14)$$

which does not distinguish between the stiff and non-stiff parts, one considers a model equation

$$u_t = \lambda_0 u + \lambda_1 u\,, \qquad (14.52)$$

where $\lambda_0$ and $\lambda_1$ correspond to $F_0$ and $F_1$. Substituting $U^n = \rho^n$ into scheme (14.51) applied to Eq. (14.52), one finds that

$$\rho \equiv \rho(z_0, z_1) = \frac{1 + z_0 + (1-\theta)z_1}{1 - \theta z_1}\,, \qquad (14.53)$$

where $z_0 = \lambda_0\kappa$ and $z_1 = \lambda_1\kappa$. As usual, one requires

$$|\rho(z_0, z_1)| < 1 \qquad (14.54)$$

for stability. We will now explain that this condition can be interpreted in two different ways.

First interpretation of (14.54)

Suppose one has to design a method (14.51) that should be applicable to equations of the form (14.50) where parameters of $F_0$ cause the values of $\lambda_0$ to be anywhere in the left-half complex plane (i.e., not just on the negative real line). Then one should insist on using the full stability region of the explicit method, i.e., to have $|1 + z_0| < 1$, while being willing to give up some flexibility in selecting $z_1$. An example of such a situation is when $F_0$ contains terms describing nonlinear, but non-stiff, reaction or advection, while $F_1$ contains the simple diffusion term $u_{xx}$, for which all $\lambda_1$'s lie on the negative real axis (see Problem 2 in HW 12). Let us now explain why the restriction on the values of $z_1$ is expected to occur.

For the sake of argument, consider the value θ = 1/2 in (14.51), which would lead to the Crank–Nicolson scheme if $F_0$ were absent. Since that scheme is nothing but the implementation of the modified implicit Euler method for the Heat equation (see Sec. 13.1), its stability region is the entire left-half complex plane (recall the result of Problem 7 in HW 4). That is,

$$\frac{\left|1 + \frac12 z_1\right|}{\left|1 - \frac12 z_1\right|} \le 1 \qquad (14.55)$$

whenever $\mathrm{Re}\,(z_1) \le 0$. Graphically, this is illustrated in Figure (a) below. There, the expressions in the numerator and denominator on the l.h.s. of (14.55) are depicted by the thick vectors in the left-half and right-half planes, respectively. It is clear that the ratio of the lengths of those vectors is indeed always less than one.

[Figure: panels (a) and (b) show the complex $z_1$ plane (axes $\mathrm{Re}\,z_1$, $\mathrm{Im}\,z_1$); panel (a) depicts the vectors $1 + \frac12 z_1$ and $1 - \frac12 z_1$, and panel (b) the vectors $(1+z_0) + \frac12 z_1$ and $1 - \frac12 z_1$.]

On the other hand, the stability condition (14.54) with θ = 1/2 is

$$\frac{\left|(1 + z_0) + \frac12 z_1\right|}{\left|1 - \frac12 z_1\right|} \le 1\,, \qquad (14.56)$$

which must hold for all $z_0$ such that $|1 + z_0| \le 1$. As illustrated in Figure (b) above, condition (14.56) can be violated for some of such $z_0$ unless $\mathrm{Im}\,(z_1) = 0$. Thus, if one insists on having the full stability region for the explicit part of the IMEX method (14.51), the stability region of this method with respect to its implicit part can be less than the corresponding stability region of (14.51) with $F_0 \equiv 0$.

In general, in this case one can show that the stability condition (14.54) yields the inequality

$$1 + |(1-\theta)z_1| < |1 - \theta z_1|\,. \qquad (14.57)$$

With some effort, one can further show that the unconditional stability of the IMEX method (14.51) is attained only for θ = 1. For θ < 1/2, scheme (14.51) is unstable. For θ = 1/2, its stability region $D_1$, given by (14.57), collapses onto the negative real axis: $z_1 < 0$. (Above, we have illustrated this graphically.) However, already for θ just slightly exceeding the critical value of 1/2, the stability region $D_1$ becomes a sector with a significantly nonzero angle α on both sides of the negative real axis; for example, α ≈ 25° and α > 50° for θ = 0.51 and θ = 0.6, respectively (see, e.g., Fig. 4.1 in the book by Hundsdorfer and Verwer cited above).

Second interpretation of (14.54)

Alternatively, suppose that (14.50) is a system of coupled equations for variables $u^{(1)}, u^{(2)}, \ldots$, and suppose that $F_1$ contains both the diffusion term and the stiff part of the reaction term. Then, the eigenvalues $\lambda^{(1)}_1, \lambda^{(2)}_1, \ldots$ (and hence the corresponding values $z^{(1)}_1, z^{(2)}_1, \ldots$) of the Jacobian matrix $\partial(F^{(1)}, F^{(2)}, \ldots)/\partial(u^{(1)}, u^{(2)}, \ldots)$ (see Sec. 5.4 in Lecture 5) can be found anywhere in the left half of the complex plane, i.e. $\mathrm{Re}\, z^{(j)}_1 \le 0$ for all j. Thus, one may want to know for which complex $z_0$ one can fulfill condition (14.54) given that $z_1$ can be allowed anywhere in the left-half complex plane.

Similarly to the previous case, the corresponding nonempty region $D_0$ exists only for θ ≥ 1/2. That is, if θ < 1/2, then the IMEX method (14.51), where $z_1$ can be found anywhere in the


left-half plane, is unstable for any $z_0 \ne 0$ with $\mathrm{Re}\,(z_0) \le 0$! (This should be contrasted with the situation when $F_0 \equiv 0$, for which method (14.51) with θ < 1/2 is conditionally stable, as we showed in Sec. 13.3.) For θ < 1, the stability region $D_0$ of the IMEX method is smaller than the region $|1 + z_0| < 1$, which would result in the absence of the $F_1$ term in (14.50). For θ = 1/2, the region $D_0$ collapses into the segment $-2 < z_0 < 0$ along the negative real axis, while for θ = 1, the stability region of the explicit Euler method, i.e., $D_0(\theta = 1) = \{\text{all } z_0 \text{ such that } |1 + z_0| < 1\}$, is recovered (see, again, Fig. 4.1 in the book by Hundsdorfer and Verwer).

The book by Hundsdorfer and Verwer provides an overview of higher-order accurate members of the IMEX family, which are preferred in practice over the lowest-order method (14.51). Among them are, for example, IMEX Runge–Kutta and multistep IMEX methods. Below we will list two second-order accurate IMEX methods and briefly comment on their properties.

Second-order IMEX-Adams methods have the form:

$$\frac{U^{n+1} - U^n}{\kappa} = \frac32 F_0(U^n, t_n) - \frac12 F_0(U^{n-1}, t_{n-1}) + \theta F_1(U^{n+1}, t_{n+1}) + \left(\frac32 - 2\theta\right) F_1(U^n, t_n) + \left(\theta - \frac12\right) F_1(U^{n-1}, t_{n-1})\,. \qquad (14.58)$$

If we insist that it be stable for all $z_1$ in the left-half plane, its stability region with respect to $z_0$ depends on θ (similarly to what we discussed above in the second interpretation of (14.54)). For example, for θ = 1/2, this method is stable only when $z_0$ belongs to a segment along the negative real axis, $z_0 \in [-1, 0]$. For θ = 1, the stability region of the second-order Adams–Bashforth method is recovered (see Problem 4 in HW 4). For θ = 3/4, the stability region is an oval, part of whose boundary follows the imaginary axis most closely (out of all values of θ). Thus, the IMEX-Adams method with θ = 3/4 is preferred for equations that have $z_0$ both on, and to the left of, the imaginary axis.

If the $z_0$ are known to lie only on the imaginary axis, then the so-called IMEX-CNLF (Crank–Nicolson Leap-frog) method can be used. Its scheme is:

$$\frac{U^{n+1} - U^{n-1}}{2\kappa} = F_0(U^n, t_n) + \frac12\left(F_1(U^{n+1}, t_{n+1}) + F_1(U^{n-1}, t_{n-1})\right). \qquad (14.59)$$

This scheme is stable for all $z_1$ in the left-half plane and for $z_0 \in [-2, 2]$. An example of a non-stiff term $F_0$ for which $\lambda_0$ lies on the imaginary axis is the advection term $b(x,t,u)\,u_x$. (It is beyond the scope of this course to explain why this is so, but if you are familiar with Fourier analysis, you may figure it out on your own.) Thus, equations of the form (14.46) where a is independent of u can be solved by this method.

Finally, we note that the same considerations can often be generalized when $F_1$ is not a linear function of u. For example, consider Eq. (14.46) where now the coefficient a does depend on u. Then one can replace the implicit integration in (14.51) with an analogue of the semi-implicit method (14.47). This would still result in the equation for $U^{n+1}$ being linear, and hence easily solvable. Stability properties of such a method are not, however, clear, and may need to be verified by numerical experiments.

14.5.3 Comments on other methods

Let us mention a popular method called a split-step method, which we will illustrate with the example of the celebrated Nonlinear Schrodinger equation:

$$i u_t + u_{xx} + |u|^2 u = 0 \qquad \text{(note the } i = \sqrt{-1} \text{ in front of } u_t\text{)}, \qquad (14.60)$$


which appears in a great many applications involving propagation of wave packets. The split-step method is based on the observation that the linear and nonlinear parts of this equation can be solved exactly (we do not need to consider here how this can be done). Then the split-step algorithm is:

    Given $U^n(x) \equiv u(x, t_n)$,
    solve $i u_t + u_{xx} = 0$ from $t_n$ to $t_{n+1}$;  ⇒  get $U_{\rm aux}$;
    using $U_{\rm aux}$ as the initial condition,
    solve $i u_t + u|u|^2 = 0$ from $t_n$ to $t_{n+1}$;  ⇒  get $U^{n+1}$.        (14.61)

The last class of methods that we will mention are valuable only for PDEs that possess conserved quantities, like energy. Usually, such equations are hyperbolic PDEs or parabolic PDEs with "imaginary time", like the Nonlinear Schrodinger equation (14.60). Such equations are multi-dimensional counterparts of the harmonic oscillator equation. There are classes of numerical schemes that preserve some (or, in rare cases, all!) of the conserved quantities of those equations. Such schemes are relatives of symplectic methods for ODEs, discussed in Lecture 5. One can read about those conservation-laws-based schemes in, e.g., a textbook by J.W. Thomas, "Numerical partial differential equations: Conservation laws and elliptic equations" (Springer, 1999). "True" parabolic equations, like the Heat equation or, more generally, any equation with diffusion in real-valued time, do not have conserved quantities like energy, and hence conservation-laws-based schemes are not applicable to them.

14.6 Questions for self-assessment

1. In (14.5) and (14.7), why did we not use the simpler discretization

$$\frac{U^n_1 - U^n_0}{h} + p^n U^n_0 = q^n, \qquad n = 0, 1,$$

which would have eliminated the need to deal with the solution $U^n_{-1}$ at the virtual node?

2. Be able to explain the idea(s) behind handling the derivative boundary condition for both the simple explicit and Crank-Nicolson schemes.

3. Make sure you can obtain (14.9) and hence (14.10)–(14.12).

4. Obtain (14.15b).

5. Explain, argumentatively but without calculations, that discretization (14.17) produces a scheme of the accuracy stated in the text. (Drawing the stencil should help.)

6. Same question about (14.19).

7. What condition on the variable coefficients of a linear PDE should hold in order for the von Neumann stability analysis to proceed along the same lines as for the simple Heat equation? Why?

8. Describe two ways in which the person who is numerically solving PDE (14.13) may use the stability condition (14.20).

9. When and why does one need to modify the stability criterion to be (14.23)?


10. What is the order of accuracy of scheme (14.26)?

11. Explain qualitatively (i.e., without calculations) that discretization (14.30) produces a scheme of the accuracy $O(\kappa^2 + h^2)$. (Drawing the stencil should help.)

12. What is the main difficulty in solving nonlinear PDEs by implicit methods?

13. Using discretizations (14.30) as an example, explain the idea behind the Newton–Raphson method when applied to nonlinear PDEs.

14. Describe the issue about discretization of nonlinear terms, pointed out in Remark 1.

15. Describe the idea behind the semi-implicit method presented in Sec. 14.5.

16. Explain why the r.h.s. of (14.44) approximates the l.h.s. of that equation.

17. Obtain (14.53).

18. What are two possible interpretations of (14.54)?

19. Make sure you can follow the argument made around condition (14.56).

20. Why does method (14.58) have the name ‘Adams’ in it?

21. When can method (14.59) be used?


15 The Heat equation in 2 and 3 spatial dimensions

In this Lecture, which concludes our treatment of parabolic equations, we will develop numerical methods for the Heat equation in 2 and 3 dimensions in space. We will present the details of these developments for the 2-dimensional case, while for the 3-dimensional case, we will mention only those aspects which cannot be straightforwardly generalized from 2 to 3 spatial dimensions.

Since this Lecture is quite long, here we give a brief preview of its results. First, we will explain how the solution vector can be set up on a 3-dimensional grid (two dimensions in space and one in time). We will discuss both the conceptual part of this setup and its implementation in Matlab. Then we will present the simple explicit scheme for the 2D Heat equation and will show that it is even more time-inefficient than it was for the Heat equation in one dimension. In search of a time-efficient substitute, we will analyze the naive version of the Crank-Nicolson scheme for the 2D Heat equation, and will discover that that scheme is not time-efficient either! We will then show how a number of time-efficient generalizations of the Crank-Nicolson scheme to 2 and 3 dimensions can be constructed. These generalizations are known under the common name of Alternating Direction methods, and are a particular case of an even more general class of so-called operator-splitting methods. In Appendix 1 we will point out a relation between these methods and the IMEX methods mentioned in Lecture 14, as well as with the predictor-corrector methods considered in Lecture 3. Finally, we will also explain that prescribing boundary conditions (even the Dirichlet ones) for those time-efficient schemes is not always a trivial matter, and demonstrate how they can be prescribed.

15.1 Setting up the solution vector on a three-dimensional grid

In this Lecture, we study the following IBVP:

ut = uxx + uyy 0 < x < 1, 0 < y < Y t > 0 ; (15.1)

u(x, y, t = 0) = u0(x, y) 0 ≤ x ≤ 1 0 ≤ y ≤ Y ; (15.2)

u(0, y, t) = g0(y, t), u(1, y, t) = g1(y, t), 0 ≤ y ≤ Y, t ≥ 0 ; (15.3)

u(x, 0, t) = g2(x, t), u(x, Y, t) = g3(x, t), 0 ≤ x ≤ 1, t ≥ 0 . (15.4)

We will always assume that the boundary conditions are consistent with the initial condition:

g0(y, 0) = u0(0, y), g1(y, 0) = u0(1, y), g2(x, 0) = u0(x, 0), g3(x, 0) = u0(x, Y ), (15.5)

and, at the corners of the domain, with each other:

g0(0, t) = g2(0, t), g0(Y, t) = g3(0, t), g3(1, t) = g1(Y, t), g1(0, t) = g2(1, t), for t > 0.(15.6)


The figure below shows the two-dimensional spatial domain, where the Heat equation (15.1) holds, as well as the domain's boundary, where the boundary conditions (15.3), (15.4) are specified. Note that we have allowed the lengths of the domain in the x and y directions to be different (if Y ≠ 1). Although, in principle, one can always make Y = 1 by a suitable scaling of the spatial coordinates, we prefer not to do so in order to allow, later on, the step sizes in x and y to be the same. The latter is simply a matter of convenience.

[Figure: the spatial domain $D = [0,1]\times[0,Y]$ with boundary $\partial D$, on which $g_0$, $g_1$, $g_2$, $g_3$ are prescribed; and "Two levels of 2D-spatial grid for M=5, L=5" — the grids at time levels n and n+1, with the nodes of the lower level numbered 1–16 in lexicographic order.]

To discretize the Heat equation (15.1), we cover domain D with a two-dimensional grid. As we have just noted above, in what follows we will assume that the step sizes in the x and y directions are the same and equal h. We also discretize the time variable with a step size κ. Then the three-dimensional grid for the 2D Heat equation consists of points $(x = mh,\ y = lh,\ t = n\kappa)$, $0 \le m \le M = 1/h$, $0 \le l \le L = Y/h$, and $0 \le n \le N = t_{\max}/\kappa$. Two time levels of such a grid for the case M = 5 and L = 5 are shown in the figure above.

We will denote the solution on the above grid as

$$U^n_{ml} = u(mh,\ lh,\ n\kappa)\,, \qquad 0 \le m \le M, \; 0 \le l \le L, \; 0 \le n \le N. \qquad (15.7)$$

We expect that any numerical scheme that we will design will give some recurrence relation


between $U^{n+1}_{ml}$ and $U^n_{ml}$ (and, possibly, $U^{n-1}_{ml}$ etc.). As long as our grid is rectangular, the array of values $U^n_{ml}$ at each given n can be conveniently represented as an (M+1) × (L+1) matrix. In this Lecture, we will consider only this case of a rectangular grid. Then, to step from level n to level (n+1), we just apply the recurrence formula to each element of the matrix $U^n_{ml}$. For example:

    for m = 2 : mmax-1
        for ell = 2 : ellmax-1
            Unew(m,ell) = a*U(m,ell) + b*U(m+1,ell-1);
        end
    end
    U(2:mmax-1, 2:ellmax-1) = Unew(2:mmax-1, 2:ellmax-1);

If we want to record and keep the value of the solution at each time level, we can instead use: U(m,ell,n+1) = a*U(m,ell,n) + b*U(m+1,ell-1,n); .

On the other hand, if the spatial domain is not rectangular, as occurs in many practical problems, then defining $U^n_{ml}$ as a matrix is not possible, or at least not straightforward. In this case, one needs to reshape the two-dimensional array $U^n_{ml}$ into a one-dimensional vector. Even though it will not be needed for the purposes of this lecture or homework, we will still illustrate the idea and implementation behind this reshaping. For simplicity, we will consider the case where the grid is rectangular. This reshaping can be done in more than one way. Here we will consider only the so-called lexicographic ordering. In this ordering, the first (M−1) components of the solution vector $\vec{U}$ will be the values $U_{m,1}$ with m = 1, 2, ..., M−1. (Here, for brevity, we have omitted the superscript n, and also inserted a comma between the subscripts pertaining to the x and y axes for visual convenience.) The next M−1 components will be $U_{m,2}$ with m = 1, 2, ..., M−1, and so on. The resulting vector is:

$$\vec{U} = \Big[\underbrace{U_{1,1},\ U_{2,1},\ \ldots,\ U_{M-1,1}}_{\text{1st row of 2D level (along } y = 1\cdot h\text{, i.e. } l = 1)},\ \underbrace{U_{1,2},\ U_{2,2},\ \ldots,\ U_{M-1,2}}_{\text{2nd row of 2D level (along } y = 2\cdot h\text{, i.e. } l = 2)},\ \ldots,\ \underbrace{U_{1,L-1},\ \ldots,\ U_{M-1,L-1}}_{(L-1)\text{th row of 2D level (along } y = (L-1)\cdot h)}\Big]^T \qquad (15.8)$$

An example of the lexicographic ordering of the nodes of one time level is shown, for M = 5 and L = 5, in the grid figure at the beginning of this section (see the numbers next to the filled circles on the lower level).


Let us now show how one can set up one time level of the grid and construct a vector of the form (15.8), using built-in commands in Matlab. In order to avoid possible confusion, we will define the domain D slightly differently than was done above. Namely, we let x ∈ [0, 2] and y ∈ [3, 4]. Next, let us discretize the x coordinate as

>> x=[0 1 2]

x =

0 1 2

and the y coordinate as

>> y=[3 4]

y =

3 4

(Such a coarse discretization is quite sufficient for the demonstration of how Matlab commands can be used.) Now, let us construct a two-dimensional grid as follows:

>> [X,Y]=meshgrid(x,y)

X =

0 1 2

0 1 2

Y =

3 3 3

4 4 4

Thus, entries of matrix X along the rows equal the values of x, and these entries do not change along the columns. Similarly, entries of matrix Y along the columns equal the values of y, and these entries do not change along the rows.

Here is a function Z of two variables x and y, constructed with the help of the above matrices X and Y:

>> Z=100*Y+X

Z =

300 301 302

400 401 402

Now, if we need to reshape matrix Z into a vector, we simply say:

>> Zr=reshape(Z,prod(size(Z)),1)

Zr =

300

400

301

401

302

402

(Note that

>> size(Z)

ans =

2 3


and command prod simply computes the product of all entries of its argument.) If we want to go back and forth between using Z and Zr, we can use the reversibility of command reshape:

>> Zrr=reshape(Zr,size(X,1),size(X,2))

Zrr =

300 301 302

400 401 402

which, of course, gives you back the Z. Finally, if you want to plot the two-dimensional function Z(x, y), you can type either mesh(x,y,Z) or mesh(X,Y,Z). You may always look up the help for any of the above (or any other) commands if you have questions about them.

15.2 Simple explicit method for the 2D Heat equation

Construction of the simple explicit scheme for the 2D Heat equation is a fairly straightforward matter. Namely, we discretize the terms in (15.1) in the standard way:

$$u_t \;\to\; \frac{U^{n+1}_{m,l} - U^n_{m,l}}{\kappa} \equiv \frac{\delta_t U^n_{m,l}}{\kappa},$$
$$u_{xx} \;\to\; \frac{U^n_{m+1,l} - 2U^n_{m,l} + U^n_{m-1,l}}{h^2} \equiv \frac{\delta^2_x U^n_{m,l}}{h^2},$$
$$u_{yy} \;\to\; \frac{U^n_{m,l+1} - 2U^n_{m,l} + U^n_{m,l-1}}{h^2} \equiv \frac{\delta^2_y U^n_{m,l}}{h^2}, \qquad (15.9)$$

and substitute these expressions into (15.1) to obtain:

$$U^{n+1}_{ml} = U^n_{ml} + r\left(\delta^2_x U^n_{ml} + \delta^2_y U^n_{ml}\right) \equiv \left(1 + r\delta^2_x + r\delta^2_y\right) U^n_{ml}\,. \qquad (15.10)$$

Three remarks about notations in (15.9) and (15.10) are in order. First,

$$r = \frac{\kappa}{h^2},$$

as before. Second, we will use the notations $U_{m,l}$ and $U_{ml}$ (i.e. with and without a comma between m and l) interchangeably; i.e., they denote the same thing. Third, the operators $\delta^2_x$ and $\delta^2_y$ will be used extensively in this Lecture.

The stencil for the simple explicit scheme (15.10) is shown below. Implementation of this scheme is discussed in a homework problem.

[Figure: stencil of scheme (15.10) — the five nodes (m,l), (m±1,l), (m,l±1) on LEVEL n and the single node (m,l) on LEVEL n+1.]

Next, we perform the von Neumann stability analysis of scheme (15.10). To this end, we use the fact that this constant-coefficient difference equation is satisfied by the Fourier harmonics

$$U^n_{ml} = \rho^n\, e^{i\beta m h}\, e^{i\gamma l h}, \qquad (15.11)$$


which we substitute into (15.10) to find the amplification factor ρ. In this calculation, as well as in many other calculations in the remainder of this Lecture, we will use the following formulae:

$$\delta^2_x\left[e^{i\beta m h}\, e^{i\gamma l h}\right] = -4\sin^2\!\left(\frac{\beta h}{2}\right)\left[e^{i\beta m h}\, e^{i\gamma l h}\right], \qquad (15.12)$$

$$\delta^2_y\left[e^{i\beta m h}\, e^{i\gamma l h}\right] = -4\sin^2\!\left(\frac{\gamma h}{2}\right)\left[e^{i\beta m h}\, e^{i\gamma l h}\right]. \qquad (15.13)$$

(You will be asked to confirm the validity of these formulae in a homework problem.) Substituting (15.11) into (15.10) and using (15.12) and (15.13), one finds

$$\rho = 1 - 4r\left(\sin^2\frac{\beta h}{2} + \sin^2\frac{\gamma h}{2}\right). \qquad (15.14)$$

The harmonics most prone to instability are, as for the one-dimensional Heat equation, those with the highest spatial frequency, and for which

$$\sin^2\frac{\beta h}{2} = \sin^2\frac{\gamma h}{2} = 1\,.$$

For these harmonics, the stability condition |ρ| ≤ 1 implies

$$r \le \frac14 \quad \text{or, equivalently,} \quad \kappa \le \frac{h^2}{4}. \qquad (15.15)$$

Thus, in order to ensure the stability of the simple explicit scheme (15.10), one has to impose a restriction on the time step κ that is twice as strong as the analogous restriction in the case of the one-dimensional Heat equation. Therefore, the simple explicit scheme is computationally inefficient, and our next step is, of course, to look for a computationally efficient scheme. As the first candidate for that position, we will analyze the Crank-Nicolson scheme.
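For reference, one step of scheme (15.10) takes only a few lines of vectorized Matlab (with the boundary values assumed to be filled in from (15.3)–(15.4), and with r ≤ 1/4 per (15.15)):

    % U is the (M+1) x (L+1) array of the current time level; r = kappa/h^2.
    m = 2:size(U,1)-1;   l = 2:size(U,2)-1;        % inner nodes
    U(m,l) = U(m,l) + r*( U(m+1,l) - 2*U(m,l) + U(m-1,l) ...
                        + U(m,l+1) - 2*U(m,l) + U(m,l-1) );   % scheme (15.10)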

15.3 Naive generalization of the Crank-Nicolson scheme for the 2D Heat equation

Our main finding in this subsection will be that a naive generalization of the CN method (13.6) is also computationally inefficient. The underlying analysis will allow us to formulate specific properties that a computationally efficient scheme must possess.

The naive generalization to two dimensions of the CN scheme, (13.5) or (13.6), is:

$$U^{n+1}_{ml} = U^n_{ml} + \frac{r}{2}\left(\delta^2_x + \delta^2_y\right)\left(U^n_{ml} + U^{n+1}_{ml}\right), \qquad (15.16)$$

or, equivalently,

$$\left(1 - \frac{r}{2}\delta^2_x - \frac{r}{2}\delta^2_y\right) U^{n+1}_{ml} = \left(1 + \frac{r}{2}\delta^2_x + \frac{r}{2}\delta^2_y\right) U^n_{ml}\,. \qquad (15.17)$$

Following the lines of Lecture 13, one can show that the accuracy of this scheme is $O(\kappa^2 + h^2)$. Also, the von Neumann analysis yields the following expression for the error amplification factor:

$$\rho = \frac{1 - 2r\left(\sin^2\frac{\beta h}{2} + \sin^2\frac{\gamma h}{2}\right)}{1 + 2r\left(\sin^2\frac{\beta h}{2} + \sin^2\frac{\gamma h}{2}\right)}\,, \qquad (15.18)$$

so that |ρ| ≤ 1 for any r, and hence the CN scheme (15.17) is unconditionally stable.


We will now demonstrate that scheme (15.16) / (15.17) is computationally inefficient. To that end, we need to exhibit the explicit matrix form of that scheme. We begin by rewriting (15.16) in the form²⁶:

$$(1 + 2r)U^{n+1}_{m,l} - \frac{r}{2}\left(U^{n+1}_{m+1,l} + U^{n+1}_{m-1,l}\right) - \frac{r}{2}\left(U^{n+1}_{m,l+1} + U^{n+1}_{m,l-1}\right) = (1 - 2r)U^n_{m,l} + \frac{r}{2}\left(U^n_{m+1,l} + U^n_{m-1,l}\right) + \frac{r}{2}\left(U^n_{m,l+1} + U^n_{m,l-1}\right). \qquad (15.19)$$

To write down Eqs. (15.19) for all m and l in a compact form, we will need the following notations:

$$A = \begin{pmatrix} 2r & -r/2 & 0 & \cdots & 0 \\ -r/2 & 2r & -r/2 & \cdots & 0 \\ & \ddots & \ddots & \ddots & \\ 0 & \cdots & -r/2 & 2r & -r/2 \\ 0 & \cdots & 0 & -r/2 & 2r \end{pmatrix}, \qquad \vec{U}_{;\,l} = \begin{pmatrix} U_{1,l} \\ U_{2,l} \\ \vdots \\ U_{M-2,l} \\ U_{M-1,l} \end{pmatrix}, \qquad (15.20)$$

and

$$\vec{B}_k = \begin{pmatrix} (g_k)_1 \\ (g_k)_2 \\ \vdots \\ (g_k)_{M-1} \end{pmatrix}, \quad \text{for } k = 2, 3; \qquad \vec{b}^n_l = \begin{pmatrix} (g_0)^n_l + (g_0)^{n+1}_l \\ 0 \\ \vdots \\ 0 \\ (g_1)^n_l + (g_1)^{n+1}_l \end{pmatrix}. \qquad (15.21)$$

Using these notations, one can recast Eq. (15.19) in a matrix form. Namely, for l = 2, ..., L−2 (i.e. for layers with constant y and which are not adjacent to the boundaries), Eq. (15.19) becomes:

$$(I + A)\vec{U}^{n+1}_{;\,l} - \frac{r}{2} I\, \vec{U}^{n+1}_{;\,l+1} - \frac{r}{2} I\, \vec{U}^{n+1}_{;\,l-1} = (I - A)\vec{U}^n_{;\,l} + \frac{r}{2} I\, \vec{U}^n_{;\,l+1} + \frac{r}{2} I\, \vec{U}^n_{;\,l-1} + \frac{r}{2}\vec{b}^n_l\,, \qquad (15.22)$$

where I is the (M−1) × (M−1) identity matrix. Note that Eq. (15.22) is analogous to Eq. (13.9), although the meaning of the notation A is different in these two equations. Continuing, for the layer with l = 1 one obtains:

where I is the (M − 1)× (M − 1) identity matrix. Note that Eq. (15.22) is analogous to Eq.(13.9), although the meanings of notation A is different in these two equations. Continuing,for the layer with l = 1 one obtains:

(I + A)~Un+1; l − r

2I ~Un+1

; l+1 −r

2~Bn+1

2 = (I − A)~Un; l +

r

2I ~Un

; l+1 +r

2~Bn

2 +r

2~bn

l . (15.23)

The equation for l = L− 1 has a similar form. Combining now all these equations into one, weobtain:

(I +A)~Un+1 = (I − A)~Un + Bn, (15.24)

where ~U has been defined in (15.8), I is the [(M−1)(L−1)]× [(M−1)(L−1)] identity matrix,and

$$\mathcal{A} = \begin{pmatrix} A & -\frac{r}{2}I & O & \cdots & O \\ -\frac{r}{2}I & A & -\frac{r}{2}I & \cdots & O \\ & \ddots & \ddots & \ddots & \\ O & \cdots & -\frac{r}{2}I & A & -\frac{r}{2}I \\ O & \cdots & O & -\frac{r}{2}I & A \end{pmatrix}, \qquad B^n = \frac{r}{2}\begin{pmatrix} \vec{B}^n_2 + \vec{B}^{n+1}_2 + \vec{b}^n_1 \\ \vec{b}^n_2 \\ \vdots \\ \vec{b}^n_{L-2} \\ \vec{B}^n_3 + \vec{B}^{n+1}_3 + \vec{b}^n_{L-1} \end{pmatrix}. \qquad (15.25)$$

²⁶ Recall our convention to use notations $U_{ml}$ and $U_{m,l}$ interchangeably.


In (15.25), O stands for the (M−1) × (M−1) zero matrix; hopefully, the use of the same character here and in the O-symbol (e.g., $O(h^2)$) will not cause any confusion.

Now, the [(M−1)(L−1)] × [(M−1)(L−1)] matrix $\mathcal{A}$ in (15.25) is block-tridiagonal, but not tridiagonal. Namely, it has only 5 nonzero diagonals or subdiagonals, but the outer subdiagonals are not located next to the inner subdiagonals; rather, they are separated from them by a band of zeros, with the band's width being (M−2). Thus, the total width of the central nonzero band in matrix $\mathcal{A}$ is 2(M−2) + 3. Inverting such a matrix is not a computationally efficient process in the sense that it will require not O(ML), but O(ML)² or O(ML)³ operations. In other words, the number of operations required to solve Eq. (15.24) is much greater than the number of unknowns.²⁷

Let us summarize what we have established about the CN method (15.17) for the 2D Heat equation. The method: (i) has accuracy $O(\kappa^2 + h^2)$, (ii) is unconditionally stable, but (iii) requires many more operations per time step than the number of unknown variables. We are satisfied with features (i) and (ii), but not with (iii). In the remainder of this Lecture, we will be concerned with constructing methods that do not have the deficiency stated in (iii). For reference purposes, we will now repeat the properties that we want our "dream scheme" to have.

In order to be considered computationally efficient, the scheme:

(i) must have accuracy $O(\kappa^2 + h^2)$ (or better);

(ii) must be unconditionally stable;

(iii) must require a number of operations per time step that is proportional to the number of the unknowns.

(15.26)

In the next subsection, we will set the ground for obtaining such schemes.

15.4 Derivation of a computationally efficient scheme

In this section, we will derive a scheme which we will use later on to obtain methods that satisfy all the three conditions (15.26). Specifically, we pose the problem as follows: Find a scheme that (a) reduces to the Crank-Nicolson scheme (13.6) in the case of the one-dimensional Heat equation and (b) has the same order of truncation error, i.e. $O(\kappa^2 + h^2)$; or, in other words, satisfies property (i) of (15.26). Of course, there are many (probably, infinitely many) such schemes. A significant contribution by computational scientists in the 1950's was finding, among those schemes, the ones which are unconditionally stable (property (ii)) and could be implemented in a time-efficient manner (property (iii)). In the remainder of this section, we

²⁷ One might have reasoned that, since $\mathcal{A}$ in (15.25) is block-tridiagonal, one could solve Eq. (15.24) by the block-Thomas algorithm. This well-known generalization of the Thomas algorithm presented in Lecture 8 assumes that the coefficients $a_k$, $b_k$, $c_k$ and $\alpha_k$, $\beta_k$ in (8.18) and (8.19) are (M−1) × (M−1) square matrices. Then formulae (8.21)–(8.23) of the Thomas algorithm are straightforwardly generalized by assigning the matrix sense to all the operations in those formulae.

However, this naive idea of being able to solve (15.24) by the block-Thomas algorithm does not work. Indeed, consider the defining equation for $\alpha_2$ in (8.21). It involves $\beta_1^{-1}$. While the matrix $\beta_1 = b_1$ is tridiagonal, its inverse $\beta_1^{-1}$ is full. Hence $\alpha_2$ is also a full matrix. Then, by the last equation in (8.21), all subsequent $\beta_k$'s are also full matrices. But then finding the inverse of each $\beta_k$ in (8.21)–(8.23) would require $O(M^3)$ operations, and this would have to be repeated O(L) times. Thus, the total operation count in this naive approach is $O(M^3 L)$, which renders the approach computationally inefficient.


will concentrate on the derivation of a scheme, alternative to (15.17), that has property (i). We postpone the discussion of implementation of that scheme, as well as the demonstration of the unconditional stability of such implementations, until the next section.

Since we want to obtain a scheme that reduces to the Crank-Nicolson method for the one-dimensional Heat equation, it is natural to start with its naive 2D generalization, scheme (15.17). Now, note the following: When applied to the solution of the discretized equation, the operators $\frac{1}{\kappa}\delta_t$, $\frac{1}{h^2}\delta^2_x$, and $\frac{1}{h^2}\delta^2_y$ (see (15.9)) produce quantities of order O(1) (that is, not O(κ), O(κ⁻¹), or anything else):

$$\frac{\delta^2_x}{h^2}\, U^n_{ml} = O(1), \qquad \frac{\delta_t}{\kappa}\, U^n_{ml} = O(1), \qquad \frac{\delta_t}{\kappa}\,\frac{\delta^2_x}{h^2}\, U^n_{ml} = O(1), \quad \text{etc.} \qquad (15.27)$$

Before we proceed with the derivation, we will pause and make a number of comments about handling operators in equations. Note that the operators mentioned before (15.27) are simply the discrete analogues of the continuous operators ∂/∂t, ∂²/∂x², and ∂²/∂y², respectively. In the discrete case, the latter two operators become matrices; for example, in the one-dimensional case, operator $\delta^2_x$ coincides with matrix A in (13.10)²⁸. Therefore, when reading about, or writing yourself, formulae involving operators (which you will have to do extensively in the remainder of this Lecture), think of the latter as matrices. From this simple observation there follows an important practical conclusion: If a formula involves a product of two operators, the order of the operators in the product must not be arbitrarily changed, because different operators, in general, do not commute. This is completely analogous to the fact that for two matrices A and B,

$$AB \ne BA \quad \text{in general.}$$

(But, of course,

$$A + B = B + A,$$

and the same is true about any two operators.)

We conclude this detour about operator notations with two remarks.

Remark 1 One can show that the operators $\delta^2_x$ and $\delta^2_y$ actually do commute, as do their continuous prototypes. However, we will not use this fact in our derivation, so that the latter remains valid for more general operators that do not necessarily commute.

Remark 2 Any two operators which are (arbitrary) functions of the same primordial operator commute. That is, if O is any operator and f(·) and g(·) are any two functions, then

$$f(O)\, g(O) = g(O)\, f(O)\,. \qquad (15.28)$$

For example,

$$\left(a + b\delta^2_x\right)\left(c + d\delta^2_x\right) = \left(c + d\delta^2_x\right)\left(a + b\delta^2_x\right) \qquad (15.29)$$

for any scalars a, b, c, d.

We now return to the derivation of a suitable modification of (15.17). From (15.27) it follows that

$$\frac{\delta^2_x}{h^2}\,\frac{\delta^2_y}{h^2}\,\frac{\delta_t}{\kappa}\, U^n_{ml} = O(1), \quad \text{and so, for instance,} \quad \frac{\kappa^2}{4}\,\frac{\delta^2_x}{h^2}\,\frac{\delta^2_y}{h^2}\,\frac{U^{n+1}_{ml} - U^n_{ml}}{\kappa} = O(\kappa^2)\,. \qquad (15.30)$$

²⁸ In the two-dimensional case, the matrices for $\delta^2_x$ and $\delta^2_y$ are more complicated and depend on the order in which the grid points are arranged into the vector $\vec{U}$. Fortunately for us, we will not require the corresponding explicit forms of $\delta^2_x$ and $\delta^2_y$.


The accuracy of scheme (15.17) is $O(\kappa^2 + h^2)$, and therefore we can add to it any term of the same order without changing the accuracy of the scheme. Let us use this observation and add the term appearing on the l.h.s. of the second equation in (15.30) to the l.h.s. of scheme (15.17), whose both sides are divided by κ. The result is:

$$\frac{U^{n+1}_{ml} - U^n_{ml}}{\kappa} + \frac{\kappa^2}{4}\,\frac{\delta^2_x}{h^2}\,\frac{\delta^2_y}{h^2}\,\frac{U^{n+1}_{ml} - U^n_{ml}}{\kappa} = \frac{1}{2h^2}\left(\delta^2_x + \delta^2_y\right)\left(U^{n+1}_{ml} + U^n_{ml}\right). \qquad (15.31)$$

Note that scheme (15.31) still has accuracy O(κ² + h²). Next, we rewrite the last equation in the equivalent form:

\left(1 - \frac{r}{2}\delta_x^2 - \frac{r}{2}\delta_y^2 + \frac{r}{2}\delta_x^2\,\frac{r}{2}\delta_y^2\right) U^{n+1}_{ml} = \left(1 + \frac{r}{2}\delta_x^2 + \frac{r}{2}\delta_y^2 + \frac{r}{2}\delta_x^2\,\frac{r}{2}\delta_y^2\right) U^n_{ml}.    (15.32)

The operator expressions on both sides of the above equation can be factored, resulting in

\left(1 - \frac{r}{2}\delta_x^2\right)\left(1 - \frac{r}{2}\delta_y^2\right) U^{n+1}_{ml} = \left(1 + \frac{r}{2}\delta_x^2\right)\left(1 + \frac{r}{2}\delta_y^2\right) U^n_{ml}.    (15.33)

Note that when factoring the operator expressions, we did not change the order of the operators in their product.

Scheme (15.33) is the main result of this section. In the next section, we will show how this scheme can be implemented in a time-efficient manner. The methods that do so are called Alternating Direction Implicit (ADI) methods. Here we preview the basic idea common to all of them. Namely, the computations are split into two steps in the 2D case (three steps in the 3D case). In the first step, one applies an implicit method in the x-direction and an explicit method in the y-direction, producing an intermediate solution. The operations count for this step is as follows: one needs to solve (L − 1) tridiagonal (M − 1) × (M − 1) systems, which can be done with O(ML) operations (a tridiagonal-solver sketch is given below). In the second step, one applies an implicit method in the y-direction and an explicit method in the x-direction, which can also be implemented with O(ML) operations. Hence the total operations count is O(ML).
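The O(M) cost per tridiagonal solve is what makes the O(ML) count above possible. The following is a minimal sketch (our own addition, not part of the notes) of the standard Thomas algorithm in Python; the names `a`, `b`, `c` for the sub-, main, and super-diagonals are merely our convention.

```python
import numpy as np

def thomas(a, b, c, d):
    """Solve a tridiagonal system T x = d in O(n) operations.
    a: sub-diagonal (length n-1), b: main diagonal (length n),
    c: super-diagonal (length n-1), d: right-hand side (length n)."""
    n = len(b)
    cp = np.empty(n - 1)
    dp = np.empty(n)
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):                 # forward elimination
        m = b[i] - a[i - 1] * cp[i - 1]
        if i < n - 1:
            cp[i] = c[i] / m
        dp[i] = (d[i] - a[i - 1] * dp[i - 1]) / m
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):        # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```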

15.5 Alternating Direction Implicit methods

Peaceman–Rachford method

For this ADI method, the two steps mentioned at the end of the previous section are implemented as follows:

(a): \left(1 - \frac{r}{2}\delta_x^2\right) U^*_{ml} = \left(1 + \frac{r}{2}\delta_y^2\right) U^n_{ml},
(b): \left(1 - \frac{r}{2}\delta_y^2\right) U^{n+1}_{ml} = \left(1 + \frac{r}{2}\delta_x^2\right) U^*_{ml},    (15.34)

where U^*_{ml} denotes the intermediate solution.

Let us first show that this method is equivalent to (15.33). This will imply that it satisfies property (i) of the “dream scheme” conditions (15.26). Indeed, let us apply the operator \left(1 - \frac{r}{2}\delta_x^2\right) to both sides of (15.34b). Then we obtain the following sequence of equations:

\left(1 - \frac{r}{2}\delta_x^2\right)\left(1 - \frac{r}{2}\delta_y^2\right) U^{n+1}_{ml} = \left(1 - \frac{r}{2}\delta_x^2\right)\left(1 + \frac{r}{2}\delta_x^2\right) U^*_{ml}
\overset{(15.29)}{=} \left(1 + \frac{r}{2}\delta_x^2\right)\left(1 - \frac{r}{2}\delta_x^2\right) U^*_{ml}
\overset{(15.34a)}{=} \left(1 + \frac{r}{2}\delta_x^2\right)\left(1 + \frac{r}{2}\delta_y^2\right) U^n_{ml},    (15.35)

which proves that (15.34) is equivalent to (15.33).

It is easy to see that the Peaceman–Rachford method (15.34) possesses property (iii) of (15.26), i.e., it is computationally efficient. Indeed, in order to compute each of the L − 1 sub-vectors

\vec{U}^*_{;\,l} = \left[\, U^*_{1,l},\ U^*_{2,l},\ \ldots,\ U^*_{M-2,l},\ U^*_{M-1,l} \,\right]^T,    (15.36)

of the intermediate solution U^*_{ml}, one needs to solve a tridiagonal (M − 1) × (M − 1) system given by Eq. (15.34a) for each l. Thus, the step described by (15.34a) requires O(ML) operations. Specifically, for l = 2, …, L − 2 (i.e., away from the boundaries), such a system has the form

\left(1 - \frac{r}{2}\delta_x^2\right) \vec{U}^*_{;\,l} = \vec{U}^n_{;\,l} + \frac{r}{2}\left[\, \vec{U}^n_{;\,l+1} - 2\vec{U}^n_{;\,l} + \vec{U}^n_{;\,l-1} \,\right], \qquad 2 \le l \le L-2,    (15.37)

where \vec{U}^n_{;\,l} is defined in (15.20). The counterpart of (15.37) for the boundary rows (with l = 1 and l = L − 1) will be given in the next section. Continuing, the operator \delta_x^2 in (15.37) is an (M − 1) × (M − 1) tridiagonal matrix, whose specific form depends on the boundary conditions and will be discussed in the next section. Note that the operator \delta_y^2 on the r.h.s. of (15.34a) is not a matrix. Indeed, if it were a matrix, it would have to be (L − 1) × (L − 1), because the discretization along the y-direction contains L − 1 inner (i.e., non-boundary) points. However, it would then be impossible to multiply such a matrix with the (M − 1)-component vectors \vec{U}^n_{;\,l}. Therefore, in (15.34a), \delta_y^2 is interpreted not as a matrix but as the operation of addition and subtraction of the vectors \vec{U}^n_{;\,l}, as shown on the r.h.s. of (15.37).

Similarly, after all components of the intermediate solution have been determined, it remains to solve the (M − 1) equations (15.34b) for the unknown vectors

\vec{U}_{m;} = \left[\, U_{m,1},\ U_{m,2},\ \ldots,\ U_{m,L-2},\ U_{m,L-1} \,\right]^T, \qquad m = 1, \ldots, M-1.    (15.38)

Each of these equations is an (L − 1) × (L − 1) tridiagonal system of the form

\left(1 - \frac{r}{2}\delta_y^2\right) \vec{U}^{n+1}_{m;} = \vec{U}^*_{m;} + \frac{r}{2}\left[\, \vec{U}^*_{m+1;} - 2\vec{U}^*_{m;} + \vec{U}^*_{m-1;} \,\right], \qquad 1 \le m \le M-1,    (15.39)

where the \vec{U}^*_{m;} are defined similarly to \vec{U}_{m;}. Note that now the interpretations of the operators \delta_x^2 and \delta_y^2 have interchanged. Namely, the \delta_y^2 on the l.h.s. of (15.39) is an (L − 1) × (L − 1) matrix, while the \delta_x^2 has to be interpreted as an operation of addition and subtraction of (L − 1)-component vectors \vec{U}^*_{m;}. The solution of the M − 1 tridiagonal systems (15.39), and hence the implementation of step (15.34b), requires O(ML) operations; thus the total operations count for the Peaceman–Rachford method is O(ML).


Finally, it remains to show that the Peaceman–Rachford method is unconditionally stable, i.e., has property (ii) of (15.26). This can be done as follows. Equations (15.34) have constant (in x and y) coefficients, and hence their solution can be sought in the form:

U^n_{ml} = \rho^n e^{i\beta m h} e^{i\gamma l h}, \qquad U^*_{ml} = \rho^*\,\rho^n e^{i\beta m h} e^{i\gamma l h}.    (15.40)

Substituting (15.40) into (15.34) and using (15.12) and (15.13), one obtains:

\rho^* = \frac{1 - Y}{1 + X}, \qquad \rho = \rho^* \cdot \frac{1 - X}{1 + Y} = \frac{1 - X}{1 + X} \cdot \frac{1 - Y}{1 + Y},    (15.41)

where we have introduced two more shorthand notations:

X = 2r \sin^2 \frac{\beta h}{2}, \qquad Y = 2r \sin^2 \frac{\gamma h}{2}.    (15.42)

From the second of Eqs. (15.41) it follows that |ρ| ≤ 1 for all harmonics (i.e., for all β and γ), because

\left| \frac{1 - X}{1 + X} \right| \le 1 \quad \text{for all } X \ge 0.    (15.43)

This shows that the Peaceman–Rachford method for the 2D Heat equation is unconditionally stable. Altogether, the above has shown that this method satisfies all three conditions (15.26) of a “dream scheme”.
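As a quick numerical illustration of (15.41)–(15.43) (our own addition, not part of the notes), one can scan the amplification factor over a grid of X and Y values and confirm that its magnitude never exceeds 1:

```python
import numpy as np

# rho(X, Y) from (15.41); X, Y >= 0 by (15.42)
X, Y = np.meshgrid(np.linspace(0, 100, 501), np.linspace(0, 100, 501))
rho = (1 - X) / (1 + X) * (1 - Y) / (1 + Y)
print(np.abs(rho).max())   # stays <= 1 for all sampled harmonics
```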

A drawback of the Peaceman–Rachford method is that its generalization to 3 spatial dimensions is no longer unconditionally stable. Below we provide a sketch of the proof of this statement.

For the 3D Heat equation

u_t = u_{xx} + u_{yy} + u_{zz},    (15.44)

the generalization of the Peaceman–Rachford method is:

(a): \left(1 - \frac{r}{3}\delta_x^2\right) U^*_{mlj} = \left(1 + \frac{r}{3}\delta_y^2 + \frac{r}{3}\delta_z^2\right) U^n_{mlj},
(b): \left(1 - \frac{r}{3}\delta_y^2\right) U^{**}_{mlj} = \left(1 + \frac{r}{3}\delta_x^2 + \frac{r}{3}\delta_z^2\right) U^*_{mlj},
(c): \left(1 - \frac{r}{3}\delta_z^2\right) U^{n+1}_{mlj} = \left(1 + \frac{r}{3}\delta_x^2 + \frac{r}{3}\delta_y^2\right) U^{**}_{mlj},    (15.45)

where \delta_z^2 is defined similarly to \delta_x^2 and \delta_y^2. Substituting into (15.45) the ansätze

U^n_{mlj} = \rho^n e^{i\beta m h} e^{i\gamma l h} e^{i\xi j h}, \qquad U^*_{mlj} = \rho^*\,\rho^n e^{i\beta m h} e^{i\gamma l h} e^{i\xi j h}, \qquad U^{**}_{mlj} = \rho^{**}\,\rho^n e^{i\beta m h} e^{i\gamma l h} e^{i\xi j h},    (15.46)

one obtains, similarly to (15.41):

\rho = \frac{\left(1 - \frac{2}{3}(Y + Z)\right)\left(1 - \frac{2}{3}(X + Z)\right)\left(1 - \frac{2}{3}(X + Y)\right)}{\left(1 + \frac{2}{3}X\right)\left(1 + \frac{2}{3}Y\right)\left(1 + \frac{2}{3}Z\right)},    (15.47)

where X and Y have been defined in (15.42) and Z is defined similarly. The amplification factor (15.47) is not always less than 1 in magnitude. For example, when X, Y, and Z are all large numbers (and hence r is large), the value of the amplification factor is ρ ≈ −8 (you will be asked to verify this in a QSA), and hence the 3D Peaceman–Rachford method (15.45) is not unconditionally stable.

An alternative ADI method that has an unconditionally stable generalization to 3 spatial dimensions is described next.

Douglas method, a.k.a. Douglas–Gunn method²⁹

The equations of this method are:

(a): \left(1 - \frac{r}{2}\delta_x^2\right) U^*_{ml} = \left(1 + \frac{r}{2}\delta_x^2 + r\delta_y^2\right) U^n_{ml},
(b): \left(1 - \frac{r}{2}\delta_y^2\right) U^{n+1}_{ml} = U^*_{ml} - \frac{r}{2}\delta_y^2\, U^n_{ml}.    (15.48)

Let us now demonstrate that all three properties (15.26) hold for the Douglas method. To demonstrate property (i), it is sufficient to show that (15.48) is equivalent to scheme (15.33). One can do so following the idea(s) of (15.35); you will be asked to provide the details in a homework problem.

To demonstrate property (ii), one proceeds along the lines of (15.40) and (15.41). Namely, substituting (15.40) into (15.48) and using Eqs. (15.12), (15.13), and (15.42), one finds:

\rho^* = \frac{1 - X - 2Y}{1 + X}, \qquad \rho = \frac{\rho^* + Y}{1 + Y} = \frac{1 - X}{1 + X} \cdot \frac{1 - Y}{1 + Y}.    (15.49)

Thus, the amplification factor for the Douglas method in 2D is the same as that of the Peaceman–Rachford method, and hence the Douglas method is unconditionally stable in 2D.

Finally, property (iii) for the Douglas method is established in complete analogy with how that was done for the Peaceman–Rachford method (see the text around Eqs. (15.36)–(15.39)).

Let us now show that the generalization of the Douglas method to 3D is also unconditionally stable. The corresponding equations have the form:

(a): \left(1 - \frac{r}{2}\delta_x^2\right) U^*_{mlj} = \left(1 + \frac{r}{2}\delta_x^2 + r\delta_y^2 + r\delta_z^2\right) U^n_{mlj},
(b): \left(1 - \frac{r}{2}\delta_y^2\right) U^{**}_{mlj} = U^*_{mlj} - \frac{r}{2}\delta_y^2\, U^n_{mlj},
(c): \left(1 - \frac{r}{2}\delta_z^2\right) U^{n+1}_{mlj} = U^{**}_{mlj} - \frac{r}{2}\delta_z^2\, U^n_{mlj}.    (15.50)

Using the von Neumann analysis, one can show that the amplification factor for (15.50) is

\rho = 1 - \frac{2(X + Y + Z)}{(1 + X)(1 + Y)(1 + Z)},    (15.51)

so that, clearly, ρ ≤ 1. Using techniques from multivariable Calculus, it is easy to show that also ρ ≥ −1, and hence the 3D Douglas method is unconditionally stable.
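A quick numerical scan (our own illustration, not part of the notes) supports this claim: sampling (15.51) over random non-negative X, Y, Z shows ρ confined to [−1, 1].

```python
import numpy as np

rng = np.random.default_rng(0)
X, Y, Z = rng.uniform(0, 1e3, (3, 10**6))    # random non-negative harmonics
rho = 1 - 2 * (X + Y + Z) / ((1 + X) * (1 + Y) * (1 + Z))
print(rho.min(), rho.max())                  # stays within [-1, 1]
```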

²⁹This method was proposed by J. Douglas for the two- and three-dimensional Heat equation in [“On the numerical integration of u_{xx} + u_{yy} = u_t by implicit methods,” J. Soc. Indust. Appl. Math. 3, 42–65 (1955)] and in [“Alternating direction methods for three space variables,” Numerische Mathematik 4, 41–63 (1962)]. A general form of such methods was discussed by J. Douglas and J. Gunn in [“A general formulation of alternating direction methods, I. Parabolic and hyperbolic problems,” Numerische Mathematik 6, 428–453 (1964)].


To conclude this subsection, we mention two more methods for the 2D Heat equation.

D’yakonov method
The equations of this method are

(a): \left(1 - \frac{r}{2}\delta_x^2\right) U^*_{ml} = \left(1 + \frac{r}{2}\delta_x^2\right)\left(1 + \frac{r}{2}\delta_y^2\right) U^n_{ml},
(b): \left(1 - \frac{r}{2}\delta_y^2\right) U^{n+1}_{ml} = U^*_{ml}.    (15.52)

One can show, similarly to how that was done for the Peaceman–Rachford and Douglas methods, that the D’yakonov method possesses all three properties (15.26).

Fairweather–Mitchell scheme
This scheme is

\left(1 - \theta_0 \delta_x^2\right)\left(1 - \theta_0 \delta_y^2\right) U^{n+1}_{ml} = \left(1 + (1 - \theta_0)\delta_x^2\right)\left(1 + (1 - \theta_0)\delta_y^2\right) U^n_{ml}, \qquad \theta_0 = \frac{r}{2} - \frac{1}{12}.    (15.53)

This scheme improves scheme (15.33) in the same manner in which the Crandall method “(13.17)+(13.19)” for the 1D Heat equation improves the Crank–Nicolson method. Consequently, its accuracy is O(κ² + h⁴), and the scheme is stable. As for implementing this scheme in a time-efficient manner, this can be done straightforwardly by using suitable modifications of the Peaceman–Rachford or D’yakonov methods.

Generalizations
In Appendix 1 we will present two important generalizations. First, we will take another look at the Douglas method (15.48) and thereby observe its relation to the predictor–corrector methods considered in Lecture 3 and to the IMEX methods mentioned in Lecture 14.

Second, we will show how one can construct an unconditionally stable method whose global error is of the order O(κ² + h²) for a parabolic-type equation with a mixed-derivative term, e.g.:

u_t = a^{(xx)} u_{xx} + a^{(xy)} u_{xy} + a^{(yy)} u_{yy};    (15.54)

here a^{(xx)}, etc., are coefficients, and the mixed-derivative term is the one containing u_{xy}. Our construction will utilize the first generalization considered in Appendix 1. It is worth pointing out that the construction of a scheme with the aforementioned properties for (15.54) was not a trivial problem. This is attested by the fact that it was solved more than 30 years after the pioneering works by Peaceman, Rachford, Douglas, and others on the Heat equation (15.1). The paper³⁰ where this problem was solved is posted on the course website.

A good reference on finite difference methods in two and three spatial dimensions is the book by A.R. Mitchell and G.F. Griffiths, “The Finite Difference Method in Partial Differential Equations” (Wiley, 1980).

³⁰I.J.D. Craig and A.D. Sneyd, “An alternating-direction implicit scheme for parabolic equations with mixed derivatives,” Computers and Mathematics with Applications 16(4), 341–350 (1988).


15.6 Boundary conditions for the ADI methods

Here we will show how to prescribe boundary conditions for the intermediate solution U^*_{ml} appearing in the ADI methods considered above. We will do so for the Dirichlet boundary conditions (15.3) and (15.4) and for the Neumann boundary conditions

u_x(0, y, t) = g_0(y, t), \quad u_x(1, y, t) = g_1(y, t), \qquad 0 \le y \le Y,\ t \ge 0;    (15.55)

u_y(x, 0, t) = g_2(x, t), \quad u_y(x, Y, t) = g_3(x, t), \qquad 0 \le x \le 1,\ t \ge 0.    (15.56)

The corresponding generalizations for the mixed boundary conditions (14.3) can be obtained straightforwardly. Note that the counterpart of the matching conditions (15.5) between the boundary conditions on the one hand and the initial condition on the other, for the Neumann boundary conditions, has the form:

g_0(y, 0) = (u_0)_x(0, y), \quad g_1(y, 0) = (u_0)_x(1, y),
g_2(x, 0) = (u_0)_y(x, 0), \quad g_3(x, 0) = (u_0)_y(x, Y).    (15.57)

The counterpart of the requirement (15.6) that the boundary conditions match at the corners of the domain follows from the relation u_{xy}(x, y) = u_{yx}(x, y) and has the form:

(g_0)_y(0, t) = (g_2)_x(0, t), \quad (g_0)_y(Y, t) = (g_3)_x(0, t),
(g_3)_x(1, t) = (g_1)_y(Y, t), \quad (g_1)_y(0, t) = (g_2)_x(1, t), \qquad \text{for } t > 0.    (15.58)

Peaceman–Rachford method

Dirichlet boundary conditions

Note that in order to solve Eq. (15.34a) for the U^*_{ml} with 1 ≤ {m, l} ≤ {(M−1), (L−1)}, one requires the values of U^*_{0,l} and U^*_{M,l} with 1 ≤ l ≤ L−1. The corresponding nodes are shown as open circles in the figure. Note that one does not need the other boundary values, U^*_{m,0} and U^*_{m,L}, to solve (15.34b), because the l.h.s. of the latter equation is only defined for 1 ≤ l ≤ L − 1. Hence, one does not need (and cannot determine) the values U^*_{m,0} and U^*_{m,L} in the Peaceman–Rachford method.

(Figure: the computational domain 0 ≤ m ≤ M, 0 ≤ l ≤ L, with boundary data g₀ along m = 0, g₁ along m = M, g₂ along l = 0, and g₃ along l = L.)

Thus, how does one find the required boundary values U^*_{0,l} and U^*_{M,l}? To answer this question, note that the term in (15.34a) that produces U^*_{0,l} and U^*_{M,l} is \frac{r}{2}\delta_x^2 U^*_{ml} (with m = 1 and m = M − 1). Let us then eliminate this term using both Eqs. (15.34). The most convenient way to do so is to rewrite these equations in the equivalent form:

\left(1 - \frac{r}{2}\delta_x^2\right) U^*_{ml} = \left(1 + \frac{r}{2}\delta_y^2\right) U^n_{ml},
\left(1 + \frac{r}{2}\delta_x^2\right) U^*_{ml} = \left(1 - \frac{r}{2}\delta_y^2\right) U^{n+1}_{ml},

and then add them. The result is:

U^*_{ml} = \frac{1}{2}\left( U^n_{ml} + U^{n+1}_{ml} \right) + \frac{r}{4}\delta_y^2 \left( U^n_{ml} - U^{n+1}_{ml} \right).    (15.59)

Now, this equation, unlike (15.34a), can be evaluated for m = 0 and m = M, yielding

U^*_{\{0,M\},\,l} = \frac{1}{2}\left( U^n_{\{0,M\},\,l} + U^{n+1}_{\{0,M\},\,l} \right) + \frac{r}{4}\delta_y^2 \left( U^n_{\{0,M\},\,l} - U^{n+1}_{\{0,M\},\,l} \right)
= \frac{1}{2}\left( (g_{\{0,1\}})^n_l + (g_{\{0,1\}})^{n+1}_l \right) + \frac{r}{4}\delta_y^2 \left( (g_{\{0,1\}})^n_l - (g_{\{0,1\}})^{n+1}_l \right)    (15.60)
= \frac{1}{2}\left( (g_{\{0,1\}})^n_l + (g_{\{0,1\}})^{n+1}_l \right) + \frac{r}{4}\left( \left[ (g_{\{0,1\}})^n_{l+1} - 2(g_{\{0,1\}})^n_l + (g_{\{0,1\}})^n_{l-1} \right] - \left[ (g_{\{0,1\}})^{n+1}_{l+1} - 2(g_{\{0,1\}})^{n+1}_l + (g_{\{0,1\}})^{n+1}_{l-1} \right] \right)
\equiv (G_{\{0,1\}})_l, \qquad 1 \le l \le L-1.
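For concreteness, here is a small helper (our own sketch, not from the notes) that evaluates (15.60) for one boundary, given samples of the boundary function g at the nth and (n+1)st time levels on the nodes l = 0, …, L:

```python
import numpy as np

def boundary_G(g_n, g_np1, r):
    """Evaluate (G)_l of Eq. (15.60) for l = 1, ..., L-1.
    g_n, g_np1: boundary data g at time levels n and n+1, sampled at l = 0..L."""
    avg = 0.5 * (g_n + g_np1)
    d2 = lambda g: g[2:] - 2 * g[1:-1] + g[:-2]   # second difference in l
    return avg[1:-1] + (r / 4) * (d2(g_n) - d2(g_np1))
```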

It is now time to complete the discussion of the implementation of the operators \frac{r}{2}\delta_x^2 and \frac{r}{2}\delta_y^2 in each of the equations (15.34). Recall that we started this discussion after Eq. (15.36) but were unable to complete it then, because we did not yet have the information about boundary conditions. Let us begin with Eq. (15.34a). There, the operator \frac{r}{2}\delta_x^2, acting on the unknown vector, is the following (M − 1) × (M − 1) matrix:

\frac{r}{2}\delta_x^2 =
\begin{pmatrix}
-r & r/2 & 0 & \cdots & 0 \\
r/2 & -r & r/2 & \cdots & 0 \\
 & \ddots & \ddots & \ddots & \\
0 & \cdots & r/2 & -r & r/2 \\
0 & \cdots & 0 & r/2 & -r
\end{pmatrix},    (15.61)

which has been obtained in analogy with matrix A in (13.10). The operator \delta_y^2 should be interpreted not as a matrix but as an operation of adding and subtracting (M − 1)-component vectors, as was shown on the r.h.s. of (15.37). Below we present the generalization of (15.37) for the boundary rows l = 1 and l = L − 1:

On the r.h.s. of (15.34a):

\left(1 + \frac{r}{2}\delta_y^2\right) \vec{U}^n_{;\,l} = \vec{U}^n_{;\,l} + \frac{r}{2}\left[\, \vec{U}^n_{;\,l+1} - 2\vec{U}^n_{;\,l} + \vec{U}^n_{;\,l-1} \,\right], \qquad 2 \le l \le L-2;
\left(1 + \frac{r}{2}\delta_y^2\right) \vec{U}^n_{;\,1} = \vec{U}^n_{;\,1} + \frac{r}{2}\left[\, \vec{U}^n_{;\,2} - 2\vec{U}^n_{;\,1} + \vec{B}^n_2 \,\right],
\left(1 + \frac{r}{2}\delta_y^2\right) \vec{U}^n_{;\,L-1} = \vec{U}^n_{;\,L-1} + \frac{r}{2}\left[\, \vec{B}^n_3 - 2\vec{U}^n_{;\,L-1} + \vec{U}^n_{;\,L-2} \,\right],    (15.62)

where \vec{B}_2 and \vec{B}_3 have been defined in (15.21). Note that the r.h.s. of (15.34a) also contains terms contributed by U^*_{\{0,M\},\,l}, as we will show explicitly shortly.

In Eq. (15.34b), as has been mentioned earlier, the interpretations of \delta_x^2 and \delta_y^2 are reversed. Namely, now \frac{r}{2}\delta_y^2 has the form given by the r.h.s. of (15.61); the dimension of this matrix is (L − 1) × (L − 1). The operator \delta_x^2 is implemented as was shown on the r.h.s. of (15.39); below we show its form for completeness of the presentation:

On the r.h.s. of (15.34b):

\left(1 + \frac{r}{2}\delta_x^2\right) \vec{U}^*_{m;} = \vec{U}^*_{m;} + \frac{r}{2}\left[\, \vec{U}^*_{m+1;} - 2\vec{U}^*_{m;} + \vec{U}^*_{m-1;} \,\right], \qquad 1 \le m \le M-1.    (15.63)


Let us now summarize the above steps in the form of an algorithm.

Algorithm of solving the 2D Heat equation with Dirichlet boundary conditions by the Peaceman–Rachford method:

The following steps need to be performed inside the loop over n (i.e., advancing in time). Suppose that the solution has been computed at the nth time level.

Step 1 (set up boundary conditions):
Define the boundary conditions at the (n + 1)st time level:

U^{n+1}_{\{0,M\},\,l} = \left(g_{\{0,1\}}\right)^{n+1}_l, \quad 0 \le l \le L; \qquad U^{n+1}_{m,\{0,L\}} = \left(g_{\{2,3\}}\right)^{n+1}_m, \quad 1 \le m \le M-1.    (15.64)

(Recall that the boundary conditions match at the corners; see (15.6).) Next, determine the necessary boundary values of the intermediate solution:

U^*_{\{0,M\},\,l} = (G_{\{0,1\}})_l, \qquad 1 \le l \le L-1,    (15.65)

where the (G_{\{0,1\}})_l are defined in (15.60).

Step 2:
For each l = 1, …, L − 1, solve the tridiagonal system

\left(1 - \frac{r}{2}\delta_x^2\right) \vec{U}^*_{;\,l} = \vec{U}^n_{;\,l} + \frac{r}{2}\left[\, \vec{U}^n_{;\,l+1} - 2\vec{U}^n_{;\,l} + \vec{U}^n_{;\,l-1} \,\right] + \frac{r}{2}\vec{b}^*_{;\,l}, \qquad 1 \le l \le L-1,    (15.66)

where \frac{r}{2}\delta_x^2 is an (M − 1) × (M − 1) matrix of the form (15.61), and

\vec{U}^*_{;\,l} = \begin{pmatrix} U^*_{1,l} \\ U^*_{2,l} \\ \vdots \\ U^*_{M-2,l} \\ U^*_{M-1,l} \end{pmatrix}, \qquad \vec{U}^n_{;\,l} = \begin{pmatrix} U^n_{1,l} \\ U^n_{2,l} \\ \vdots \\ U^n_{M-2,l} \\ U^n_{M-1,l} \end{pmatrix}, \qquad \vec{b}^*_{;\,l} = \begin{pmatrix} (G_0)_l \\ 0 \\ \vdots \\ 0 \\ (G_1)_l \end{pmatrix}

(see (15.20) and (15.36)).

Note that \vec{U}^n_{;\,0} \equiv \vec{B}^n_2 and \vec{U}^n_{;\,L} \equiv \vec{B}^n_3 are determined from the boundary conditions at the nth time level.

Thus, combining the results of (15.64), (15.65), and (15.66), one has the following values of the intermediate solution:

U^*_{m,l} \quad \text{for } 0 \le m \le M,\ 1 \le l \le L-1.

Step 3:
The solution U^{n+1}_{m,l} with 1 ≤ {m, l} ≤ {M − 1, L − 1} is then determined from

\left(1 - \frac{r}{2}\delta_y^2\right) \vec{U}^{n+1}_{m;} = \vec{U}^*_{m;} + \frac{r}{2}\left[\, \vec{U}^*_{m+1;} - 2\vec{U}^*_{m;} + \vec{U}^*_{m-1;} \,\right] + \frac{r}{2}\vec{b}^{n+1}_{m;}, \qquad 1 \le m \le M-1.    (15.67)


Here \frac{r}{2}\delta_y^2 is the (L − 1) × (L − 1) matrix of the form (15.61), and

\vec{U}^*_{m;} = \begin{pmatrix} U^*_{m,1} \\ U^*_{m,2} \\ \vdots \\ U^*_{m,L-2} \\ U^*_{m,L-1} \end{pmatrix}, \qquad \vec{U}^{n+1}_{m;} = \begin{pmatrix} U^{n+1}_{m,1} \\ U^{n+1}_{m,2} \\ \vdots \\ U^{n+1}_{m,L-2} \\ U^{n+1}_{m,L-1} \end{pmatrix}, \qquad \vec{b}^{n+1}_{m;} = \begin{pmatrix} (g_2)^{n+1}_m \\ 0 \\ \vdots \\ 0 \\ (g_3)^{n+1}_m \end{pmatrix}.    (15.68)

This completes the process of advancing the solution by one step in time.
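To make the algorithm concrete, here is a minimal Python sketch of ours (not from the notes) of one Peaceman–Rachford step for homogeneous Dirichlet boundary conditions, in which the b-vectors and the G-corrections of (15.65)–(15.66) vanish; `U` holds the interior values U^n_{m,l} as an (M−1) × (L−1) array.

```python
import numpy as np
from scipy.linalg import solve_banded

def _banded(n, r):
    # Banded storage of the tridiagonal matrix 1 - (r/2)*delta^2 (n x n).
    ab = np.zeros((3, n))
    ab[0, 1:] = -r / 2      # super-diagonal
    ab[1, :] = 1 + r        # main diagonal
    ab[2, :-1] = -r / 2     # sub-diagonal
    return ab

def _smooth(U, r):
    # Apply (1 + (r/2)*delta^2) along axis 0; zero Dirichlet neighbors assumed.
    out = (1 - r) * U
    out[:-1] += (r / 2) * U[1:]
    out[1:] += (r / 2) * U[:-1]
    return out

def pr_step(U, r):
    """One Peaceman-Rachford step (15.34); U is (M-1) x (L-1) interior data."""
    rhs = _smooth(U.T, r).T                                     # (1 + (r/2) d2y) U^n
    Ustar = solve_banded((1, 1), _banded(U.shape[0], r), rhs)   # implicit in x
    rhs = _smooth(Ustar, r)                                     # (1 + (r/2) d2x) U*
    return solve_banded((1, 1), _banded(U.shape[1], r), rhs.T).T  # implicit in y
```

Each implicit sweep solves one tridiagonal system per row or column, so the cost per step is O(ML), as stated above.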

Neumann boundary conditions

This case is technically more involved than the case of Dirichlet boundary conditions. Therefore, here we only list the steps of the algorithm for advancing the solution from the nth to the (n+1)st time level, relegating the detailed derivation of these steps to Appendix 2. Also, note that you will not need to use this algorithm in any of the homework problems. It is presented here so that you will be able to use it whenever you have to solve a problem of this kind in your future career.

Algorithm of solving the 2D Heat equation with Neumann boundary conditions by the Peaceman–Rachford method:

Step 1 (set up boundary conditions):
Given the solution U^n_{m,l} with 0 ≤ {m, l} ≤ {M, L}, find the values at the virtual nodes (U^n_{-1,l} and U^n_{M+1,l} with −1 ≤ l ≤ L + 1, and U^n_{m,-1} and U^n_{m,L+1} with 0 ≤ m ≤ M) from (15.94) and (15.95) of Appendix 2:

U^n_{-1,l} = U^n_{1,l} - 2h(g_0)^n_l, \quad U^n_{M+1,l} = U^n_{M-1,l} + 2h(g_1)^n_l, \qquad 0 \le l \le L;
U^n_{m,-1} = U^n_{m,1} - 2h(g_2)^n_m, \quad U^n_{m,L+1} = U^n_{m,L-1} + 2h(g_3)^n_m, \qquad 0 \le m \le M;    (15.94)

U^n_{-1,-1} = U^n_{1,-1} - 2h(g_0)^n_{-1}, \quad U^n_{M+1,-1} = U^n_{M-1,-1} + 2h(g_1)^n_{-1},
U^n_{-1,L+1} = U^n_{1,L+1} - 2h(g_0)^n_{L+1}, \quad U^n_{M+1,L+1} = U^n_{M-1,L+1} + 2h(g_1)^n_{L+1}.    (15.95)

Define the auxiliary functions given by (15.97) and (15.98) of Appendix 2, which will later be used to compute the boundary values of the intermediate solution U^*:

For 0 ≤ l ≤ L:

(G_0)_l = \frac{1}{2}\left( (g_0)^n_l + (g_0)^{n+1}_l \right) + \frac{r}{4}\left( \left[ (g_0)^n_{l+1} - 2(g_0)^n_l + (g_0)^n_{l-1} \right] - \left[ (g_0)^{n+1}_{l+1} - 2(g_0)^{n+1}_l + (g_0)^{n+1}_{l-1} \right] \right),    (15.97)

(G_1)_l = \frac{1}{2}\left( (g_1)^n_l + (g_1)^{n+1}_l \right) + \frac{r}{4}\left( \left[ (g_1)^n_{l+1} - 2(g_1)^n_l + (g_1)^n_{l-1} \right] - \left[ (g_1)^{n+1}_{l+1} - 2(g_1)^{n+1}_l + (g_1)^{n+1}_{l-1} \right] \right).    (15.98)

(Note that the form of G₀ and G₁ above is the same as in the case of the Dirichlet boundary conditions, see (15.60), although the meanings of g₀ and g₁ are different in these two cases.)


Step 2:
For each 0 ≤ l ≤ L, solve the linear system whose form follows from (15.92) and (15.99) of Appendix 2:

\left(1 - \frac{r}{2}\delta_x^2\right) \vec{U}^*_{;\,l} = \vec{U}^n_{;\,l} + \frac{r}{2}\left[\, \vec{U}^n_{;\,l+1} - 2\vec{U}^n_{;\,l} + \vec{U}^n_{;\,l-1} \,\right] + \frac{r}{2}\vec{b}^*_{;\,l}, \qquad 0 \le l \le L,    (15.69)

where \frac{r}{2}\delta_x^2 is an (M + 1) × (M + 1) matrix of the form

\begin{pmatrix}
-r & \boxed{r} & 0 & \cdots & 0 \\
r/2 & -r & r/2 & \cdots & 0 \\
 & \ddots & \ddots & \ddots & \\
0 & \cdots & r/2 & -r & r/2 \\
0 & \cdots & 0 & \boxed{r} & -r
\end{pmatrix},    (15.70)

(here the entries that differ from the corresponding matrix for Dirichlet boundary conditions are shown in boxes),

\vec{U}^*_{;\,l} = \begin{pmatrix} U^*_{0,l} \\ U^*_{1,l} \\ \vdots \\ U^*_{M-1,l} \\ U^*_{M,l} \end{pmatrix}, \qquad \vec{U}^n_{;\,l} = \begin{pmatrix} U^n_{0,l} \\ U^n_{1,l} \\ \vdots \\ U^n_{M-1,l} \\ U^n_{M,l} \end{pmatrix}, \qquad \vec{b}^*_{;\,l} = \begin{pmatrix} -2h(G_0)_l \\ 0 \\ \vdots \\ 0 \\ 2h(G_1)_l \end{pmatrix},    (15.71)

and G₀ and G₁ are defined in (15.97) and (15.98) (see above and in Appendix 2).

Having thus determined the values of the intermediate solution U^*_{m,l} for 0 ≤ {m, l} ≤ {M, L},

find the values U^*_{-1,l} and U^*_{M+1,l} for 0 ≤ l ≤ L from (15.97) and (15.98) of Appendix 2:

U^*_{-1,l} = U^*_{1,l} - 2h(G_0)_l, \qquad 0 \le l \le L,    (15.97′)
U^*_{M+1,l} = U^*_{M-1,l} + 2h(G_1)_l, \qquad 0 \le l \le L.    (15.98′)

Thus, upon completing Step 2, one has the following values of the intermediate solution:

U^*_{m,l} \quad \text{for } -1 \le m \le M+1 \text{ and } 0 \le l \le L,

which are shown in Appendix 2 to be necessary and sufficient to find the solution at the (n+1)st time level.

Step 3:
The solution at the new time level, U^{n+1}_{m,l} with 0 ≤ {m, l} ≤ {M, L}, is determined from (15.85), (15.88), and (15.89) of Appendix 2, which constitute the following (L + 1) × (L + 1) linear systems, one for each m = 0, …, M:

\left(1 - \frac{r}{2}\delta_y^2\right) \vec{U}^{n+1}_{m;} = \vec{U}^*_{m;} + \frac{r}{2}\left[\, \vec{U}^*_{m+1;} - 2\vec{U}^*_{m;} + \vec{U}^*_{m-1;} \,\right] + \frac{r}{2}\vec{b}^{n+1}_{m;}, \qquad 0 \le m \le M.    (15.72)


Here \frac{r}{2}\delta_y^2 is the (L + 1) × (L + 1) matrix of the form (15.70), and

\vec{U}^*_{m;} = \begin{pmatrix} U^*_{m,0} \\ U^*_{m,1} \\ \vdots \\ U^*_{m,L-1} \\ U^*_{m,L} \end{pmatrix}, \qquad \vec{U}^{n+1}_{m;} = \begin{pmatrix} U^{n+1}_{m,0} \\ U^{n+1}_{m,1} \\ \vdots \\ U^{n+1}_{m,L-1} \\ U^{n+1}_{m,L} \end{pmatrix}, \qquad \vec{b}^{n+1}_{m;} = \begin{pmatrix} -2h(g_2)^{n+1}_m \\ 0 \\ \vdots \\ 0 \\ 2h(g_3)^{n+1}_m \end{pmatrix}.    (15.73)

This completes the process of advancing the solution by one step in time.

We conclude this subsection with the counterparts of Eq. (15.60) for the Douglas and D’yakonov methods. You will be asked to derive these results in a homework problem. We will not state any results for Neumann boundary conditions for the Douglas and D’yakonov methods.

Douglas method

The Dirichlet boundary conditions for the intermediate solution U^* have the form:

U^*_{\{0,M\},\,l} = U^{n+1}_{\{0,M\},\,l} + \frac{r}{2}\delta_y^2 \left( U^n_{\{0,M\},\,l} - U^{n+1}_{\{0,M\},\,l} \right), \qquad 1 \le l \le L-1.    (15.74)

D’yakonov method

The Dirichlet boundary conditions for the intermediate solution U^* have the form:

U^*_{\{0,M\},\,l} = \left(1 - \frac{r}{2}\delta_y^2\right) U^{n+1}_{\{0,M\},\,l}, \qquad 1 \le l \le L-1.    (15.75)

15.7 Appendix 1: A generalized form of the ADI methods, and a second-order ADI method for the parabolic equation with mixed derivatives, Eq. (15.54)

A brief preview of this section was given at the end of Section 15.5. The presentation below is based on the papers by K.J. in ’t Hout and B.D. Welfert, “Stability of ADI schemes applied to convection-diffusion equations with mixed derivative terms,” Applied Numerical Mathematics 57, 19–35 (2007), and by I.J.D. Craig and A.D. Sneyd, “An alternating-direction implicit scheme for parabolic equations with mixed derivatives,” Computers and Mathematics with Applications 16(4), 341–350 (1988). Both papers are posted on the course website.

Let us begin by writing a general form of the equation that includes the Heat equation (15.1) as a special case:

u_t = F \equiv F^{(0)} + F^{(1)} + F^{(2)},    (15.76)

where F^{(1)} and F^{(2)} are terms that contain only the derivatives of u with respect to x and y, respectively, and F^{(0)} contains all other terms (e.g., nonlinear or with mixed derivatives). For example, in (15.54),

F^{(0)} = a^{(xy)}(x, y)\, u_{xy}, \qquad F^{(1)} = a^{(xx)}(x, y)\, u_{xx}, \qquad F^{(2)} = a^{(yy)}(x, y)\, u_{yy}.


Next, note that the Douglas method (15.48), which we repeat here for the reader’s convenience:

\left(1 - \frac{r}{2}\delta_x^2\right) U^*_{ml} = \left(1 + \frac{r}{2}\delta_x^2 + r\delta_y^2\right) U^n_{ml},
\left(1 - \frac{r}{2}\delta_y^2\right) U^{n+1}_{ml} = U^*_{ml} - \frac{r}{2}\delta_y^2\, U^n_{ml},    (15.48)

can be written in an equivalent, but different, form:

W^{(0)} = U^n + r\delta_x^2 U^n + r\delta_y^2 U^n,
W^{(1)} = W^{(0)} + \frac{1}{2}\left( r\delta_x^2 W^{(1)} - r\delta_x^2 U^n \right),
W^{(2)} = W^{(1)} + \frac{1}{2}\left( r\delta_y^2 W^{(2)} - r\delta_y^2 U^n \right),
U^{n+1} = W^{(2)}.    (15.77)

Here, for brevity, we have omitted the subscripts {m, l} in U^n_{m,l}, etc. The correspondence between the notations of (15.48) and (15.77) is:

W^{(1)}_{(15.77)} = U^*_{(15.48)}.

In the notations introduced in (15.76), this can be written as

W^{(0)} = U^n + \kappa F(U^n),
W^{(k)} = W^{(k-1)} + \frac{1}{2}\kappa \left( F^{(k)}(W^{(k)}) - F^{(k)}(U^n) \right), \quad k = 1, 2;
U^{n+1} = W^{(2)}.    (15.78)

Recall that the F used in the first equation above is defined in (15.76).

Let us make two observations about scheme (15.78). First, it can be interpreted as a predictor–corrector method, which we considered in Lecture 3. Indeed, the first equation in (15.78) predicts the value of the solution at the next time level by the simple Euler method. The purpose of each of the subsequent steps is to stabilize the predictor step by employing an implicit modified Euler step in one particular direction (i.e., along either x or y). Indeed, if we set F^{(0)} = F^{(2)} = 0 in (15.76), then (15.78) reduces to the implicit modified Euler method. You will be asked to verify this in a QSA.

Second, (15.78) is seen to be closely related to the IMEX family of methods; see scheme (14.51) in Lecture 14.

For F^{(0)} ≠ 0, method (15.78) has accuracy O(κ + h²) (for F^{(0)} = 0, its accuracy is O(κ² + h²), as we know from the discussion of the Douglas method in Section 15.5). It is of interest and of considerable practical significance to construct an extension of this scheme that has accuracy O(κ² + h²) even when F^{(0)} ≠ 0. Two such schemes were presented in the paper by in ’t Hout and Welfert, who generalized schemes presented earlier by other researchers. The first scheme is:

W^{(0)} = U^n + \kappa F(U^n),
W^{(k)} = W^{(k-1)} + \frac{1}{2}\kappa \left( F^{(k)}(W^{(k)}) - F^{(k)}(U^n) \right), \quad k = 1, 2;
V^{(0)} = W^{(0)} + \frac{1}{2}\kappa \left( F^{(0)}(W^{(2)}) - F^{(0)}(U^n) \right),
V^{(k)} = V^{(k-1)} + \frac{1}{2}\kappa \left( F^{(k)}(V^{(k)}) - F^{(k)}(U^n) \right), \quad k = 1, 2;
U^{n+1} = V^{(2)}.    (15.79)


After one round of prediction and correction, accomplished by the first two lines of this scheme, it proceeds to another round of prediction and correction, given by the third and fourth lines. It appears that it is this second round that brings the accuracy of the scheme up to order O(κ²). Intuitively, the reason why this is so can be understood by making an analogy of these two rounds with the two steps of the modified explicit Euler method. Specifically, in a QSA you will be asked to show that scheme (15.79) with F^{(1)} = F^{(2)} = 0 reduces to the modified explicit Euler method.

The second scheme proposed by in ’t Hout and Welfert is obtained from (15.79) by replacing F^{(0)} in its third line by F.

The stability of the above schemes, as well as of the generalized Douglas scheme (15.78), has been investigated by in ’t Hout and Welfert. In particular, they showed that schemes (15.78) and (15.79) are unconditionally stable for the so-called convection–diffusion equation

u_t = c^{(x)} u_x + c^{(y)} u_y + a^{(xx)} u_{xx} + a^{(xy)} u_{xy} + a^{(yy)} u_{yy},    (15.80)

where all the coefficients may depend on x and y, and the quadratic form a^{(xx)} x² + a^{(xy)} xy + a^{(yy)} y² is positive definite. The third scheme, mentioned after (15.79), can also be made unconditionally stable upon replacing the coefficient 1/2 in the second and fourth lines by any number θ ≥ 3/4.

Below we specify scheme (15.79) for the case of equation (15.80) with c^{(x)} = c^{(y)} = 0:

W^{(0)} = U^n + r\left( a^{(xx)}\delta_x^2 + a^{(xy)}\delta_{xy} + a^{(yy)}\delta_y^2 \right) U^n,
\left(1 - \tfrac{1}{2} r a^{(xx)}\delta_x^2\right) W^{(1)} = W^{(0)} - \tfrac{1}{2} r a^{(xx)}\delta_x^2\, U^n,
\left(1 - \tfrac{1}{2} r a^{(yy)}\delta_y^2\right) W^{(2)} = W^{(1)} - \tfrac{1}{2} r a^{(yy)}\delta_y^2\, U^n,
V^{(0)} = W^{(0)} + \tfrac{1}{2} r \left( a^{(xy)}\delta_{xy} W^{(2)} - a^{(xy)}\delta_{xy} U^n \right),
\left(1 - \tfrac{1}{2} r a^{(xx)}\delta_x^2\right) V^{(1)} = V^{(0)} - \tfrac{1}{2} r a^{(xx)}\delta_x^2\, U^n,
\left(1 - \tfrac{1}{2} r a^{(yy)}\delta_y^2\right) U^{n+1} = V^{(1)} - \tfrac{1}{2} r a^{(yy)}\delta_y^2\, U^n.    (15.81)

Here all the coefficients are evaluated at node (m, l), and the mixed-derivative operator is:

\delta_{xy} U_{m,l} = \frac{1}{4h^2}\left( U_{m+1,l+1} + U_{m-1,l-1} - U_{m-1,l+1} - U_{m+1,l-1} \right).    (15.82)
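A vectorized form of the stencil (15.82) at the interior nodes (a sketch of ours; the 1/(4h²) factor follows the definition above):

```python
import numpy as np

def delta_xy(U, h):
    """Mixed-derivative stencil (15.82) at the interior nodes of a 2D array U."""
    return (U[2:, 2:] + U[:-2, :-2] - U[:-2, 2:] - U[2:, :-2]) / (4 * h**2)
```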

Scheme (15.81) was originally proposed by Craig and Sneyd in their paper cited above. There, it was given by their Eq. (7) in more condensed notations, which we write as:

\left(1 - \tfrac{1}{2} r a^{(xx)}\delta_x^2\right)\left(1 - \tfrac{1}{2} r a^{(yy)}\delta_y^2\right) W^{(2)} = \left(1 + \tfrac{1}{2} r a^{(xx)}\delta_x^2\right)\left(1 + \tfrac{1}{2} r a^{(yy)}\delta_y^2\right) U^n + r a^{(xy)}\delta_{xy} U^n,

\left(1 - \tfrac{1}{2} r a^{(xx)}\delta_x^2\right)\left(1 - \tfrac{1}{2} r a^{(yy)}\delta_y^2\right) U^{n+1} = \left(1 + \tfrac{1}{2} r a^{(xx)}\delta_x^2\right)\left(1 + \tfrac{1}{2} r a^{(yy)}\delta_y^2\right) U^n + \tfrac{1}{2} r a^{(xy)}\delta_{xy}\left( W^{(2)} + U^n \right).    (15.83)

In order to turn the Craig–Sneyd scheme (15.81) into a practical algorithm, one needs to specify which boundary conditions for its auxiliary variables are needed and how they can be found. I have been unable to find an answer to this question in the published literature; therefore, below I present my own answer. Let us start at the first line of (15.81). Its left-hand side can be computed only at the interior grid points, i.e., for 1 ≤ m ≤ M−1 and 1 ≤ l ≤ L−1, since the calculation of the terms on the right-hand side requires boundary values of U^n along the entire perimeter of the computational domain. Next, to determine W^{(1)} in the second line, we need its boundary values W^{(1)}_{\{0,M\},\,l} for 1 ≤ l ≤ L−1. Those can be found from the third line with m = 0 and m = M if one knows the boundary values W^{(2)}_{\{0,M\},\,l} for all l, i.e., for 0 ≤ l ≤ L. Thus, we focus on specifying or finding these latter boundary values. It turns out that one cannot find them. Indeed, the fourth line of (15.81) does not provide any information about W^{(2)}_{\{0,M\},\,l}; rather, to aggravate matters, it requires the values W^{(2)}_{m,\{0,L\}} at the other boundary in order to compute \delta_{xy}W^{(2)} at all interior points 1 ≤ m ≤ M−1, 1 ≤ l ≤ L−1. One can also verify that none of these boundary values can be computed if we start from the last line of (15.81), either. Thus, the only remaining option is to specify the values of W^{(2)} along the entire perimeter of the computational domain.

This option is consistent with the form (15.83) of the Craig–Sneyd scheme. Indeed, the first line of that scheme is nothing but the inhomogeneous version of (15.33) (with the inhomogeneity being the \delta_{xy}U^n term), and we know that to solve it, the boundary values of the variable on the left-hand side must be specified.

The way to specify the boundary values of W^{(2)} appears to be to set them equal to those of U^{n+1}:

W^{(2)}_{\{0,M\},\,l} = U^{n+1}_{\{0,M\},\,l}, \qquad 0 \le l \le L;
W^{(2)}_{m,\{0,L\}} = U^{n+1}_{m,\{0,L\}}, \qquad 1 \le m \le M-1.    (15.84)

To see this, let the mixed-derivative term in (15.81) vanish: a^{(xy)} = 0. Then, from the fourth line of that scheme, V^{(0)} = W^{(0)}, and W^{(2)} simply coincides with U^{n+1} at all nodes.

To summarize, we list the steps of implementing the algorithm (15.81), (15.84) in a code.

Step 1: Define the boundary values U^{n+1}_{\{0,M\},\,l} and W^{(2)}_{\{0,M\},\,l} for 0 ≤ l ≤ L. Next, compute the boundary values V^{(1)}_{\{0,M\},\,l} and W^{(1)}_{\{0,M\},\,l} for 1 ≤ l ≤ L−1 from the last and third lines of (15.81), respectively.

Step 2: Define the boundary values U^{n+1}_{m,\{0,L\}} and W^{(2)}_{m,\{0,L\}} for 1 ≤ m ≤ M − 1.

Step 3: Find the variables on the left-hand sides of the first two lines of (15.81). This can be done because the required boundary values of W^{(1)} have been computed in Step 1.

Step 4: Find W^{(2)} at all the interior points from the third line of (15.81). This can be done because the required boundary values of W^{(2)} have been defined in Step 2.

Step 5: Find the variables on the left-hand sides of the fifth and sixth lines of (15.81). This can be done because the required boundary values of V^{(1)} have been computed in Step 1 and the required boundary values of W^{(2)} have been defined in Steps 1 and 2.

Step 6: Find U^{n+1} at all the interior points from the last line of (15.81). This can be done because the required boundary values have been defined in Step 2.

Generalization of the Craig–Sneyd scheme (15.81) to three spatial dimensions is straightforward. Craig and Sneyd also generalized it to a system of coupled equations of the form (15.80); see “An alternating direction implicit scheme for parabolic systems of partial differential equations,” Computers and Mathematics with Applications 20(3), 53–62 (1990). Also, in ’t Hout and Welfert published a follow-up paper to the one cited above: “Unconditional stability of second-order ADI schemes applied to multi-dimensional diffusion equations with mixed derivative terms,” Applied Numerical Mathematics 59, 677–692 (2009). There, they consider a more restricted class of equations than (15.80) (diffusion only, no convection), but in exchange are able to prove unconditional stability of a certain scheme in any number of spatial dimensions. This has applications in, e.g., financial mathematics.

15.8 Appendix 2: Derivation of the Peaceman–Rachford algorithm for the 2D Heat equation with Neumann boundary conditions

In order for you to understand the details of this derivation better, it is recommended that you first review the corresponding derivation for the Crank–Nicolson method in Sec. 14.1, since it is that derivation on which the present one is based. It should also help to draw a single time level and refer to that drawing throughout the derivation.

We begin by determining which boundary values of the intermediate solution U^* are required to compute the solution U^{n+1}_{ml}, 0 ≤ {m, l} ≤ {M, L}, at the new time level. To that end, let us write down Eq. (15.34b) in detailed form:

U^{n+1}_{m,l} - \frac{r}{2}\left[ U^{n+1}_{m,l+1} - 2U^{n+1}_{m,l} + U^{n+1}_{m,l-1} \right] = U^*_{m,l} + \frac{r}{2}\left[ U^*_{m+1,l} - 2U^*_{m,l} + U^*_{m-1,l} \right].    (15.85)

We need to determine the solution on the l.h.s. for 0 ≤ {m, l} ≤ {M, L}. First, we note that we can set l = 0 in (15.85) despite the fact that U^{n+1}_{m,-1} will then appear in the last term on the l.h.s., and that value is not part of the solution. To eliminate that value, we use the boundary condition at y = 0:

\frac{U^{n+1}_{m,1} - U^{n+1}_{m,-1}}{2h} = (g_2)^{n+1}_m \quad \Rightarrow \quad U^{n+1}_{m,-1} = U^{n+1}_{m,1} - 2h(g_2)^{n+1}_m, \qquad 0 \le m \le M.    (15.86)

Similarly, at y = Y, we have

\frac{U^{n+1}_{m,L+1} - U^{n+1}_{m,L-1}}{2h} = (g_3)^{n+1}_m \quad \Rightarrow \quad U^{n+1}_{m,L+1} = U^{n+1}_{m,L-1} + 2h(g_3)^{n+1}_m, \qquad 0 \le m \le M.    (15.87)

Therefore, Eq. (15.85) for l = 0 and l = L is replaced by the respective equations:

U^{n+1}_{m,0} - \frac{r}{2}\left[ 2U^{n+1}_{m,1} - 2U^{n+1}_{m,0} - 2h(g_2)^{n+1}_m \right] = U^*_{m,0} + \frac{r}{2}\left[ U^*_{m+1,0} - 2U^*_{m,0} + U^*_{m-1,0} \right];    (15.88)

U^{n+1}_{m,L} - \frac{r}{2}\left[ 2h(g_3)^{n+1}_m - 2U^{n+1}_{m,L} + 2U^{n+1}_{m,L-1} \right] = U^*_{m,L} + \frac{r}{2}\left[ U^*_{m+1,L} - 2U^*_{m,L} + U^*_{m-1,L} \right].    (15.89)

Thus, in order to determine from (15.85), (15.88), and (15.89) the solution U^{n+1}_{m,l} for all 0 ≤ {m, l} ≤ {M, L}, we will need to know

U^*_{m,l} \quad \text{for } -1 \le m \le M+1 \text{ and } 0 \le l \le L.    (15.90)

The difficulty that we need to overcome is the determination of the boundary values

U^*_{-1,l} \ \text{and} \ U^*_{M+1,l} \quad \text{for } 0 \le l \le L.    (15.91)


Now let us see which of the values U^*_{m,l} we can determine directly from (15.34a). To that end, let us write down that equation in detailed form, similarly to (15.85):

U^*_{m,l} - \frac{r}{2}\left[ U^*_{m+1,l} - 2U^*_{m,l} + U^*_{m-1,l} \right] = U^n_{m,l} + \frac{r}{2}\left[ U^n_{m,l+1} - 2U^n_{m,l} + U^n_{m,l-1} \right].    (15.92)

From this equation, we see that in order to determine the values required in (15.90), we need to know

U^n_{m,l} \quad \text{for } -1 \le m \le M+1 \text{ and } -1 \le l \le L+1.    (15.93)

While those values for 0 ≤ {m, l} ≤ {M, L} are known from the solution at the nth time level, the values U^n_{m,l} with m or l equal to −1, M + 1, or L + 1 are not, and hence they need to be found from the boundary conditions. This is done similarly to (15.86) and (15.87):

U^n_{-1,l} = U^n_{1,l} - 2h(g_0)^n_l, \quad U^n_{M+1,l} = U^n_{M-1,l} + 2h(g_1)^n_l, \qquad 0 \le l \le L;
U^n_{m,-1} = U^n_{m,1} - 2h(g_2)^n_m, \quad U^n_{m,L+1} = U^n_{m,L-1} + 2h(g_3)^n_m, \qquad 0 \le m \le M.    (15.94)

Once the values in (15.94) have been found, we determine the values at the corners:

U^n_{-1,-1} = U^n_{1,-1} - 2h(g_0)^n_{-1}, \quad U^n_{M+1,-1} = U^n_{M-1,-1} + 2h(g_1)^n_{-1},
U^n_{-1,L+1} = U^n_{1,L+1} - 2h(g_0)^n_{L+1}, \quad U^n_{M+1,L+1} = U^n_{M-1,L+1} + 2h(g_1)^n_{L+1}.    (15.95)

Note that the smoothness of the solution at the corners is ensured by the matching conditions (15.58). Thus, with (15.94) and (15.95), we have all the values required in (15.93).

We now turn back to (15.92). From it, we see that with all the values (15.93) known, the l.h.s. of (15.92) can be determined from an (M + 1) × (M + 1) system of equations only if we know the values in (15.91). To determine these values, we use Eq. (15.59) in the following way: we subtract that equation with m → (m − 1) from the same equation with m → (m + 1) and divide the result by 2h. For m = 0, this yields:

\frac{U^*_{1,l} - U^*_{-1,l}}{2h} = \frac{1}{2}\left( \frac{U^n_{1,l} - U^n_{-1,l}}{2h} + \frac{U^{n+1}_{1,l} - U^{n+1}_{-1,l}}{2h} \right)
+ \frac{r}{4}\left( \left[ \frac{U^n_{1,l+1} - U^n_{-1,l+1}}{2h} - 2\,\frac{U^n_{1,l} - U^n_{-1,l}}{2h} + \frac{U^n_{1,l-1} - U^n_{-1,l-1}}{2h} \right] - \left[ \frac{U^{n+1}_{1,l+1} - U^{n+1}_{-1,l+1}}{2h} - 2\,\frac{U^{n+1}_{1,l} - U^{n+1}_{-1,l}}{2h} + \frac{U^{n+1}_{1,l-1} - U^{n+1}_{-1,l-1}}{2h} \right] \right),
\qquad \text{for } 0 \le l \le L.    (15.96)

If we can find each term on the r.h.s. of this equation, then we know the value of the term on the l.h.s. and hence can determine U^*_{-1,l}. But each of these terms can be found from the boundary conditions! Upon this observation, (15.96) can be rewritten as follows:

\frac{U^*_{1,l} - U^*_{-1,l}}{2h} = \frac{1}{2}\left( (g_0)^n_l + (g_0)^{n+1}_l \right) + \frac{r}{4}\left( \left[ (g_0)^n_{l+1} - 2(g_0)^n_l + (g_0)^n_{l-1} \right] - \left[ (g_0)^{n+1}_{l+1} - 2(g_0)^{n+1}_l + (g_0)^{n+1}_{l-1} \right] \right) \equiv (G_0)_l, \qquad \text{for } 0 \le l \le L.    (15.97)


Similarly,

\frac{U^*_{M+1,l} - U^*_{M-1,l}}{2h} = \frac{1}{2}\left( (g_1)^n_l + (g_1)^{n+1}_l \right) + \frac{r}{4}\left( \left[ (g_1)^n_{l+1} - 2(g_1)^n_l + (g_1)^n_{l-1} \right] - \left[ (g_1)^{n+1}_{l+1} - 2(g_1)^{n+1}_l + (g_1)^{n+1}_{l-1} \right] \right) \equiv (G_1)_l, \qquad \text{for } 0 \le l \le L.    (15.98)

Using Eqs. (15.97) and (15.98), the linear systems given, for each l = 0, …, L, by Eqs. (15.92) with 1 ≤ m ≤ M − 1 can be supplemented by the following equations for m = 0 and m = M:

U^*_{0,l} - \frac{r}{2}\left[ 2U^*_{1,l} - 2U^*_{0,l} - 2h(G_0)_l \right] = U^n_{0,l} + \frac{r}{2}\left[ U^n_{0,l+1} - 2U^n_{0,l} + U^n_{0,l-1} \right],
U^*_{M,l} - \frac{r}{2}\left[ 2h(G_1)_l - 2U^*_{M,l} + 2U^*_{M-1,l} \right] = U^n_{M,l} + \frac{r}{2}\left[ U^n_{M,l+1} - 2U^n_{M,l} + U^n_{M,l-1} \right], \qquad 0 \le l \le L.    (15.99)

From (15.99) and the remaining equations in (15.92), one can determine

U^*_{m,l} \quad \text{for } 0 \le \{m, l\} \le \{M, L\},    (15.100)

and the remaining values (15.91) are then determined from (15.97) and (15.98).

15.9 Questions for self-assessment

1. According to the lexicographic ordering, which quantity appears earlier in the vector \vec{U} in Eq. (15.8): U_{2,5} or U_{5,2}?

2. Verify Eq. (15.14).

3. Verify Eq. (15.15).

4. Verify Eq. (15.19).

5. What is the length of the vector \vec{b}^n_l in Eq. (15.21)?

6. Write down Eq. (15.19) for l = 2. Then verify that it is equivalent to Eq. (15.22) for the same value of l.

7. Obtain the analog of Eq. (15.23) for l = L − 1.

8. What is the length of the vector \vec{B}^n in Eq. (15.25)?

9. Why is the CN scheme (15.16) computationally inefficient?

10. State the three properties that a computationally efficient scheme for the 2D Heat equation must have.

11. What is the order of the truncation error of scheme (15.31)?


12. Verify that (15.33) is equivalent to (15.32).

13. What is the order of the truncation error of scheme (15.33)?

14. Make sure you can justify each step in (15.35).

15. What is the order of the truncation error of scheme (15.34)?

16. Explain in detail why (15.34) is computationally efficient (that is, which systems need to be solved at each step).

17. Obtain both equations in (15.41).

18. Why does one want to look for alternatives to the Peaceman–Rachford method?

19. Make sure you can obtain (15.47).

20. Produce an example of r, β, γ, and ξ such that the corresponding amplification factor (15.47) is greater than 1 in magnitude.

21. Explain in detail why (15.48) is computationally efficient (that is, which systems need to be solved at each step).

22. Consider the Peaceman–Rachford method for the Heat equation with Dirichlet boundary conditions. Explain which boundary values of the intermediate solution one requires, and why one does not need the other boundary values.

23. Verify that the scheme (15.78) with F^{(0)} = F^{(2)} = 0 reduces to the modified implicit Euler method.

24. Verify that the scheme (15.79) with F^{(1)} = F^{(2)} = 0 reduces to the modified explicit Euler method.

25. Make sure you can follow the argument made around (15.84).


16 Hyperbolic PDEs: Analytical solutions and characteristics

Hyperbolic PDEs describe the propagation of disturbances in space and time when the total energy of the disturbances remains conserved. It is this condition of energy conservation that makes hyperbolic equations different from the parabolic ones considered in Lectures 12 through 15. The following analogy with ODEs is intended to clarify the difference between hyperbolic and parabolic PDEs. Parabolic equations are multi-dimensional counterparts of the ODE

y' = -\lambda y, \qquad \mathrm{Re}\,\lambda > 0,    (16.1)

and thus describe processes of relaxation of the initial disturbance towards an equilibrium (which is y = 0 in the case of (16.1)). Hyperbolic equations are multi-dimensional counterparts of the ODE

y'' = -\lambda^2 y, \qquad \lambda^2 > 0,    (16.2)

which describes oscillations (see Lecture 5). However, hyperbolic PDEs describe not only oscillations, but also (and, in fact, much more often) the propagation of initial disturbances. Examples include the propagation of sound and light.

16.1 Solution of the Wave equation

In fact, the basic form (i.e., before any perturbations or specific details are included in the model) of the equation that governs the propagation of light and sound is the same. That same equation, called the Wave equation, also arises in a great variety of applications in physics and engineering. A classic example, considered in most textbooks, is the vibration of a string. The corresponding equation is

u_{tt} = c^2 u_{xx}.    (16.3)

In the above example of a string, c = \sqrt{T/\rho}, where T and ρ are the string’s tension and density, respectively. As we will see shortly, in general c is the speed of propagation of initial disturbances (e.g., the speed of sound for sound waves or the speed of light for light waves).

To solve Eq. (16.3), we need to supplement it with initial and boundary conditions. We will do so later on. For now, let us discuss the general solution of (16.3). We will use this analytic solution as a reference for the numerical solutions that we will obtain in Lecture 17.

Rewriting (16.3) in the form presented in Lecture 11,

1 \cdot u_{tt} + 2 \cdot 0 \cdot u_{xt} + (-c^2) \cdot u_{xx} = 0,

and then using Eq. (11.15), we obtain the equations for the two characteristics of (16.3):

\left( \frac{dx}{dt} \right)_{1,2} = \pm c.    (16.4)


Thus,

Along characteristics 1: \quad x - ct \equiv \xi = \text{const};    (16.5a)
Along characteristics 2: \quad x + ct \equiv \eta = \text{const}.    (16.5b)

(Figure: the two families of characteristics, ξ = const and η = const, in the (x, t)-plane.)

The significance of the characteristics follows from the fact that any piece of initial or boundary data propagates along the characteristics and thereby determines the solution of (16.3) at any point in space and time. We will derive parts of this result later, and you will be asked to complete that derivation in a homework problem. For now, it will be sufficient for our purposes to give the general solution of (16.3) without a derivation:

u(x, t) = F(x - ct) + G(x + ct).    (16.6)

Here F and G are functions determined by the initial and boundary conditions, as we will show shortly. The meaning of solution (16.6) is this: the solution of the Wave equation splits into two waveforms, each of which travels along its own characteristic.

Now let us show how F and G are found, assuming that the initial conditions

u(x, t=0) = \phi(x), \quad u_t(x, t=0) = \psi(x), \qquad -\infty < x < \infty    (16.7)

are prescribed on the infinite line. That is, for now we assume no explicit boundary conditions; implicitly, we assume that there is no disturbance coming into the region of finite x-values from either x = −∞ or x = +∞. Note that in (16.7), φ(x) can be interpreted as the initial shape of the disturbance and ψ(x) as its initial velocity. By substituting (16.6) into (16.7) and following a calculation outlined in Appendix 1, one obtains:

u(x, t) = \frac{1}{2}\left( \phi(x - ct) + \phi(x + ct) \right) + \frac{1}{2c} \int_{x-ct}^{x+ct} \psi(s)\, ds.    (16.8)

This formula is called the d’Alembert solution of the Wave equation (16.3) set up on the infinite line with the initial conditions (16.7). For example, when the initial velocity is zero everywhere, the solution (16.8) at any time is given by two replicas of the initial disturbance φ(x) that travel along the characteristics x − ct = const and x + ct = const. Note especially that if the initial disturbance is not smooth (e.g., is discontinuous), the discontinuities are not smoothed out during the propagation but simply propagate along the characteristics.
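As an illustration (our own sketch, not part of the notes), formula (16.8) can be evaluated directly on a grid; here the integral of ψ is handled via a user-supplied antiderivative, which avoids numerical quadrature:

```python
import numpy as np

def dalembert(phi, psi_antideriv, x, t, c):
    """Evaluate the d'Alembert solution (16.8).
    phi: initial shape; psi_antideriv: an antiderivative of the initial
    velocity psi; x: array of points; t, c: time and wave speed."""
    avg = 0.5 * (phi(x - c * t) + phi(x + c * t))
    return avg + (psi_antideriv(x + c * t) - psi_antideriv(x - c * t)) / (2 * c)

# Example: a Gaussian hump released at rest splits into two half-height replicas.
x = np.linspace(-10, 10, 401)
u = dalembert(lambda s: np.exp(-s**2), lambda s: 0 * s, x, t=3.0, c=1.0)
```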

Now let us show how formula (16.6) can be used to obtain a solution of (16.3) on a finite interval. Instead of the initial conditions (16.7), we now consider the initial conditions

u(x, t=0) = \phi(x), \quad u_t(x, t=0) = \psi(x), \qquad 0 \le x \le L    (16.9)

along with the boundary conditions

u(x=0, t) = g_0, \quad u(x=L, t) = g_L, \qquad t > 0.    (16.10)


Note that the boundary values of u_t need not be specified, because they are determined by (16.10). Also, on physical grounds, we require that the boundary and initial conditions match:

\phi(x=0) = g_0(t=0), \quad \phi(x=L) = g_L(t=0); \qquad \psi(x=0) = g_0'(t=0), \quad \psi(x=L) = g_L'(t=0).    (16.11)

In what follows, we illustrate the method of finding a solution of (16.3) on a finite interval for the special case when the boundary values g_0 and g_L do not depend on time. (The same method, but with additional effort, can be used in the general case of time-dependent boundary conditions.) When the boundary conditions are time-independent, we first show that they can be set to zero without loss of generality. Using a trick analogous to that used in the homework problems for Lecture 9, we consider the modified function

\hat{u} = u - \left( g_0 + \frac{g_L - g_0}{L}\, x \right),    (16.12)

which satisfies both the Wave equation (16.3) and the zero boundary conditions

\hat{u}(0, t) = 0, \qquad \hat{u}(L, t) = 0.

Thus we set g0 = gL = 0 in (16.10) in what follows.

Now we will use the so-called method of reflections, where we claim that the solution of (16.3), (16.9), (16.10) (with g_0 = g_L = 0) is given by formula (16.6) with φ(x) and ψ(x) replaced by their anti-symmetric, 2L-periodic extensions about the points x = 0 and x = L:

\hat{\phi}(x) = \begin{cases} \ \cdots \\ -\phi(-x), & -L \le x \le 0 \\ \phi(x), & 0 \le x \le L \\ -\phi(2L - x), & L \le x \le 2L \\ \ \cdots \end{cases}    (16.13)

and similarly for \hat{\psi}(x).

(Figure: the extension \hat{\phi}(x) plotted over −L ≤ x ≤ 2L.)

Then the solution

u(x, t) = \frac{1}{2}\left( \hat{\phi}(x - ct) + \hat{\phi}(x + ct) \right) + \frac{1}{2c} \int_{x-ct}^{x+ct} \hat{\psi}(s)\, ds    (16.14)

satisfies the PDE (16.3) (by virtue of (16.6)) and the initial condition (16.9) (by virtue of (16.13)). In a question for self-assessment you will be asked to verify that it also satisfies the zero boundary conditions at x = 0 and x = L.
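The extension (16.13) is easy to implement by folding the argument into one period; the following sketch (ours, not from the notes) does so for any φ given on [0, L], calling φ only with arguments inside [0, L]:

```python
import numpy as np

def extend(phi, L):
    """Anti-symmetric, 2L-periodic extension (16.13) of phi given on [0, L]."""
    def phi_hat(x):
        s = np.mod(x, 2 * L)                       # fold into [0, 2L)
        sign = np.where(s <= L, 1.0, -1.0)         # reflect and negate past x = L
        return sign * phi(np.where(s <= L, s, 2 * L - s))
    return phi_hat
```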

16.2 Wave equation as a system of first-order PDEs

Let us now present another point of view on the Wave equation. The numerical method developed in Lecture 17 will utilize this point of view.

If we denote u_t = p and c u_x = q, then Eq. (16.3) becomes

p_t - c q_x = 0.    (16.15a)

From the formula u_{xt} = u_{tx} we obtain

q_t - c p_x = 0.    (16.15b)

In matrix form, these equations are written as

\frac{\partial}{\partial t} \begin{pmatrix} p \\ q \end{pmatrix} - c \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \frac{\partial}{\partial x} \begin{pmatrix} p \\ q \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.    (16.16)

We proceed by diagonalizing the matrix in the above equation:

\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} = S^{-1} \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix} S, \qquad S = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} = S^{-1}.    (16.17)

Multiplying (16.16) on the left by S^{-1} and using (16.17), we arrive at a diagonal (i.e., decoupled) system of first-order hyperbolic equations:

\frac{\partial}{\partial t} \begin{pmatrix} v \\ w \end{pmatrix} - \begin{pmatrix} c & 0 \\ 0 & -c \end{pmatrix} \frac{\partial}{\partial x} \begin{pmatrix} v \\ w \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix},    (16.18)

where

\begin{pmatrix} v \\ w \end{pmatrix} = S^{-1} \begin{pmatrix} p \\ q \end{pmatrix} = \frac{1}{\sqrt{2}} \begin{pmatrix} p + q \\ p - q \end{pmatrix}.    (16.19)

In the component-by-component form, (16.18) is

vt − cvx = 0, (16.18a)′

wt + cwx = 0. (16.18b)′

In Appendix 2 we show that the general solutions of (16.18)′ are

v(x, t) = g(x + ct), (16.20a)

w(x, t) = f(x− ct). (16.20b)

Substituting these solutions into (16.19) and solving for p and q, we obtain

p = u_t = \frac{1}{\sqrt{2}}\left( g(x + ct) + f(x - ct) \right), \qquad q = c u_x = \frac{1}{\sqrt{2}}\left( g(x + ct) - f(x - ct) \right).    (16.21)

Integration of the latter equations yields

u(x, t) = F (x− ct) + G(x + ct), (16.6)′

where

F(\xi) = -\frac{1}{\sqrt{2}\,c} \int^{\xi} f(\xi)\, d\xi, \qquad G(\eta) = \frac{1}{\sqrt{2}\,c} \int^{\eta} g(\eta)\, d\eta.    (16.22)

Thus, we have reobtained the general solution (16.6) of the Wave equation.
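A two-line numerical check of the diagonalization (16.17) (our own illustration, not part of the notes):

```python
import numpy as np

S = np.array([[1, 1], [1, -1]]) / np.sqrt(2)   # S = S^{-1} from (16.17)
A = np.array([[0, 1], [1, 0]])
print(S @ A @ S)                                # yields diag(1, -1), as in (16.18)
```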

As we have noted, the value of representing the Wave equation (16.3) in the form of a system of first-order equations, (16.16), is that in the next Lecture we will develop methods of numerical solution of first-order hyperbolic PDEs. In preparation for this development, let us set up an initial-boundary value problem (IBVP) for the simplest first-order hyperbolic PDE,

w_t + c w_x = 0.    (16.18b)′

In what follows, we assume c > 0 unless stated otherwise. As shown in Appendix 2, the solution of (16.18b)′ is given by (16.20b). The characteristics

x - ct = \xi = \text{const}, \quad \text{or} \quad t = \frac{1}{c}(x - \xi),    (16.23)

of that equation are shown in the figure next to formulae (16.5). By looking at those characteristics, one sees that the IBVP for (16.18b)′ can be prescribed in two ways: either by an initial condition on the entire line,

w(x, t = 0) = φ(x), −∞ < x < ∞ (16.24)

or on the boundary of the first quadrant of the (x, t)-plane:

w(x, t=0) = \phi(x), \qquad x \ge 0;
w(x=0, t) = g(t), \qquad t \ge 0.    (16.25)

Then the initial and boundary (if applicable) values will propagate along the characteristics (16.23) and thereby determine the solution at any point inside the first quadrant (x ≥ 0, t ≥ 0). Note that the solution of (16.18b)′ cannot be defined in the second quadrant (x ≤ 0, t ≥ 0), because the characteristics do not extend there.

16.3 Appendix 1: Derivation of d’Alembert’s formula (16.8)

Substituting (16.6) into (16.7) and using the identities

F_t(x - ct) = -cF'(\xi) = -cF_x(\xi), \qquad G_t(x + ct) = cG'(\eta) = cG_x(\eta),    (16.26)

where F' ≡ dF/dξ and G' ≡ dG/dη, one obtains:

F(x) + G(x) = \phi(x), \qquad -cF_x(x) + cG_x(x) = \psi(x).    (16.27)

Upon differentiating the first of these equations with respect to x, one obtains a system of two equations for F_x(x) and G_x(x). In a homework problem you will be asked to verify that its solution, integrated over x, is

F(x) = \frac{1}{2}\left( \phi(x) - \frac{1}{c} \int_{-\infty}^{x} \psi(s)\, ds \right), \qquad G(x) = \frac{1}{2}\left( \phi(x) + \frac{1}{c} \int_{-\infty}^{x} \psi(s)\, ds \right).    (16.28a)

Hence

F(x - ct) = \frac{1}{2}\left( \phi(x - ct) - \frac{1}{c} \int_{-\infty}^{x-ct} \psi(s)\, ds \right), \qquad G(x + ct) = \frac{1}{2}\left( \phi(x + ct) + \frac{1}{c} \int_{-\infty}^{x+ct} \psi(s)\, ds \right).    (16.28b)

The substitution of (16.28b) into (16.6) yields (16.8).


16.4 Appendix 2: Solution of (16.18)′ is given by (16.20)

Let us begin by putting the PDE

w_t + c w_x = 0    (16.18b)′

in the form of an ODE. Consider the change of variables

(x, t) \to (\xi = x - ct,\ \eta = x + ct).    (16.29)

Using the Chain Rule for a function of several variables,

\frac{\partial}{\partial t} = \frac{\partial \xi}{\partial t}\frac{\partial}{\partial \xi} + \frac{\partial \eta}{\partial t}\frac{\partial}{\partial \eta}, \qquad \frac{\partial}{\partial x} = \frac{\partial \xi}{\partial x}\frac{\partial}{\partial \xi} + \frac{\partial \eta}{\partial x}\frac{\partial}{\partial \eta},    (16.30)

we obtain:

\left( -c\frac{\partial}{\partial \xi} + c\frac{\partial}{\partial \eta} \right) w + c\left( \frac{\partial}{\partial \xi} + \frac{\partial}{\partial \eta} \right) w = 0 \quad \Rightarrow \quad w_\eta = 0.    (16.31)

The last equation means that w depends only on ξ, which, in view of (16.23), implies (16.20b). Similarly, one shows that the solution of (16.18a)′ is given by (16.20a).

Thus, for any two “points” (x₁, t₁) and (x₂, t₂) in the (x, t)-plane that satisfy x₁ − ct₁ = x₂ − ct₂, the solution of (16.18b)′ satisfies

w(x_1, t_1) = w(x_2, t_2).    (16.32)

16.5 Questions for self-assessment

1. Suppose one wants to develop a second-order accurate finite-difference scheme for a hyperbolic PDE. Which of the two ODE methods should the scheme be modeled after: the modified Euler or the Leap-frog?

2. What is the meaning of the solution (16.6)?

3. What is the meaning of each piece of the initial conditions (16.7)?

4. Where would the trick based on substitution (16.12) cause problems if the boundary conditions were to depend on time?

5. Verify the statements made after formula (16.14).

6. Why did we want to diagonalize the matrix in (16.16)?

7. Verify that you obtain (16.6)′ and (16.22) from (16.21).

8. Verify (16.31).


17 Method of characteristics for solving hyperbolic PDEs

In this lecture we will describe a method of numerical integration of hyperbolic PDEs that uses the fact that all solutions of such PDEs propagate along characteristics.

17.1 Method of characteristics for a single hyperbolic PDE

Let us start the discussion with the simplest, first-order hyperbolic PDE

w_t + c w_x = 0,    (17.1)

where we take c > 0 for concreteness. For now, we assume that c = const; later on this restriction will be removed. The general solution of (17.1), derived in Appendix 2 of Lecture 16, is

w(x, t) = w(x - ct).    (17.2)

Thus, if the steps in x and t are related so that

\Delta x = c\, \Delta t,    (17.3)

then

w(x + \Delta x,\ t + \Delta t) = w\left( x + \Delta x - c(t + \Delta t) \right) = w(x - ct);    (17.4)

see also (16.32). This simply illustrates the fact that the solution does not change along the characteristic x − ct = ξ.

To put (17.4) at the foundation of a numerical method, consider the mesh

x_m = mh, \quad m = 0, 1, 2, \ldots; \qquad t_n = n\kappa, \quad n = 0, 1, 2, \ldots,    (17.5a)

where h and κ are related as per (17.3):

h = c\,\kappa.    (17.5b)

(Figure: the mesh in the (x, t)-plane, with the characteristics ξ = 0, ±h, ±2h, … passing through the grid nodes.)

The initial and boundary conditions for this problem are given by

w(x, t=0) = \phi(x), \quad x \ge 0; \qquad w(x=0, t) = g(t), \quad t \ge 0.    (16.25)

In discretized form, they are:

W^0_m = \phi(x_m), \quad m \ge 0; \qquad W^n_0 = g(t_n), \quad n \ge 0.    (17.6)

(Obviously, we require φ(0) = g(0) for the boundary and initial conditions to be consistent with each other.) Then, according to (17.4), the solution at the node (m, n) with m > 0 and n > 0 is found as

W^n_m = \begin{cases} W^0_{m-n} = \phi(x_{m-n}), & m \ge n; \\ W^{n-m}_0 = g(t_{n-m}), & n \ge m. \end{cases}    (17.7)


This method, called the method of characteristics, can be generalized to the equation

w_t + c(x, t, w)\, w_x = f(x, t, w).    (17.8a)

For the sake of clarity, we will work out this generalization in two steps. First, we consider the case f(x, t, w) ≡ 0, i.e.,

w_t + c(x, t, w)\, w_x = 0.    (17.8b)

Then, making the change of variables

(x, t) \longrightarrow (\xi, t), \qquad \xi = x - \int_0^t c\big(x(t'),\, t',\, w(x(t'), t')\big)\, dt',    (17.9)

and proceeding similarly³¹ to Appendix 2 of Lecture 16, one can show that

w_t = 0 \quad \Rightarrow \quad w(x, t) = w(\xi), \ \text{irrespective of the specific value of } t.    (17.10)

The equation for the characteristics ξ = const of (17.8) is obtained by differentiating the expression

x - \int_0^t c\big(x(t'), t', w(x(t'), t')\big)\, dt' = \text{const}

(see (17.9)) with respect to t. The result is:

\frac{dx}{dt} = c(x, t, w), \qquad w = \text{const},    (17.11)

where the last condition (w = const) appears because along a characteristic the solution w does not change (see (17.10)).

(Figure: curved characteristics emanating from the nodes m = −2, …, 2; the points x^n_m mark the intersections of the mth characteristic with the time levels t = nκ.)

Note that, unlike in the figure next to Eqs. (17.5), the characteristics corresponding to Eqs. (17.11) are curved, not straight, lines, as illustrated above.

The numerical solution of Eqs. (17.10) and (17.11) can be generated as follows. Let us denote by x^n_m the grid point at the intersection of the time level t = nκ and the characteristic ξ = mh (see the figure above for an illustration). Note that this definition of x^n_m is different from the definition of x_m in (17.5a). Namely, there, the x_m are fixed points of the spatial grid, defined independently of the time grid. In contrast, in scheme (17.12) below, x^n_m moves along the mth characteristic and hence is different at each time level. Continuing with setting up a scheme for (17.10) and (17.11), let W^n_m denote the value of w at the grid point x^n_m, i.e., W^n_m = w(ξ = mh, t = nκ). Then:

n = 0:   x^0_m = mh,   W^0_m = φ(mh),   m ≥ 0;    (17.12a)

n = 1:   x^1_{−1} = 0,   W^1_{−1} = g(κ);
         x^1_m = x^0_m + ∫_0^κ c(x, t, W^0_m) dt,   W^1_m = φ(mh),   m ≥ 0;    (17.12b)

^{31} E.g., ∂_t = (∂_t ξ) ∂_ξ + (∂_t t) ∂_t = −c ∂_ξ + ∂_t, where the last ∂_t is taken at fixed ξ.


n ≥ 2:   x^n_{−n} = 0,   W^n_{−n} = g(nκ);
         x^n_m = x^{n−1}_m + ∫_{(n−1)κ}^{nκ} c(x, t, W^{n−1}_m) dt,
         W^n_m = g(−mκ),   −(n−1) ≤ m < 0,
         W^n_m = φ(mh),    m ≥ 0.    (17.12c)

Above, the expression ∫_{(n−1)κ}^{nκ} c(x, t, w) dt is a symbol denoting the result of integration of the ODE (17.11) from t = (n − 1)κ to t = nκ. This integration may be performed either analytically (if the problem so allows) or numerically, using any of the numerical methods for ODEs. Note that this integration is the only computation required in (17.12); the rest of it is just the assignment of known values to the grid points at each time level.

Let us emphasize the meaning of scheme (17.12). First, it computes the values x^n_m along the respective characteristics for each m, as per the first equation in (17.11). Then, the value of w is kept constant along each characteristic, as specified by (17.10).
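As an illustration, here is a minimal Matlab sketch of scheme (17.12). The functions c, phi, g and all numerical values are hypothetical placeholders, and the integral in (17.12) is replaced by a single explicit Euler step of the ODE (17.11); as noted above, any ODE method could be used instead.

c   = @(x,t,w) 1 + 0.5./(1 + x.^2);   % placeholder variable speed, c > 0
phi = @(x) exp(-(x - 2).^2);          % placeholder initial condition
g   = @(t) 0*t;                       % placeholder boundary condition
h = 0.05;  kappa = 0.05;  M = 100;  N = 60;
X = (0:M)'*h;                 % x^0_m: the characteristics start at mh
W = phi(X);                   % W^0_m; w is constant along each characteristic
for n = 1:N
    X = X + kappa*c(X, (n-1)*kappa, W);   % Euler step of dx/dt = c(x,t,w)
    X = [0; X];  W = [g(n*kappa); W];     % new characteristic enters at x = 0
end
% At each level, the pairs (X, W) give the solution on the moving grid.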

Remark. If one wants to keep the number of grid points at each time level of (17.12) the same (say, (M + 1)), then one needs to “chop off” the right-most point at every time step. Recall also that one cannot prescribe a boundary condition at the right boundary x = Mh.

Let us illustrate the solution of (17.8b) by Eqs. (17.12) for the so-called shock wave equation^{32}

u_t + u u_x = 0,    (17.13)

which arises in a great many applications (e.g., in gas dynamics or in traffic flow modelling). As an initial condition, let us take

u(x, 0) ≡ φ(x) = { a sin²(πx),   0 ≤ x ≤ 1,
                 { 0,            otherwise,      (17.14)

where a is some constant. We will consider this problem on the infinite line, x ∈ (−∞, ∞), but in our numerical solution will only follow points where u ≠ 0.

According to Eqs. (17.10) and (17.9), the solution of problem (17.13), (17.14) is given by an implicit formula

u = φ( x − ∫_0^t u(x(t′), t′) dt′ ),    (17.15)

where we have used the initial condition u(x, t = 0) = φ(x). Recall that in (17.15), x(t) stands for the equation of one given characteristic; in other words, the integral is computed along that characteristic. We will now show that characteristics of (17.13) have a special form that allows the integral in (17.15) to be simplified. Indeed, the equation for the characteristics of (17.13) follows from (17.11):

dx/dt = u,   u = const.    (17.16)

Thus, since the u on the right-hand side of (17.15) is constant along the characteristic, (17.15) reduces to

u = φ(x− ut). (17.15′)

This is now an implicit algebraic equation for u which, in principle, can be solved for each pair (x, t).
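For instance, for a given pair (x, t), Eq. (17.15′) could be solved with Matlab's root finder fzero, as in the minimal sketch below; the values of a, x, and t are placeholders, and t is taken small enough that the root is unique.

a = 1;
phi = @(s) (s >= 0 & s <= 1).*a.*sin(pi*s).^2;   % the phi of (17.14)
x = 0.7;  t = 0.2;                               % example point (x, t)
u = fzero(@(v) v - phi(x - v*t), phi(x));        % initial guess u = phi(x)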

^{32} Another name of this equation, or, more precisely, of the more general equation u_t + c(u) u_x = 0, is the “simple wave” equation, with the adjective “simple” originating from physical considerations.


To obtain a numerical solution of (17.13), (17.14) explicitly, we use a modification of (17.12) which would allow us to keep track of only those grid points where u ≠ 0. The corresponding scheme on such a moving grid is:

n = 0:   x^0_m = mh,   U^0_m = φ(mh);    (17.17a)

n = 1:   x^1_m = x^0_m + U^0_m κ,   U^1_m = φ(mh);    (17.17b)

(where the meaning of the index m is clarified in (17.17d) below)

n ≥ 2:   x^n_m = x^{n−1}_m + U^{n−1}_m κ,   U^n_m = U^{n−1}_m = . . . = U^0_m.    (17.17c)

Note that in all these equations,

m = 0, 1, . . . , M,   and   h = 1/M   (see (17.14)),    (17.17d)

so that a particular value of m labels the characteristic emanating at point (x = mh, t = 0). This way of labeling is illustrated in the figure next to Eq. (17.11). It defines a grid which moves to the right (given that the initial velocity φ(x) ≥ 0). Also, at each time level except the one at t = 0, the internode spacing along x is not uniform. Note that (17.17) is the discretized form of the exact analytical solution (17.15′). You will be asked to plot solution (17.17) in a homework problem.
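A minimal Matlab sketch of scheme (17.17), with a, the step sizes, and the number of steps chosen as placeholder values, is:

a = 1;  M = 50;  h = 1/M;  kappa = 0.01;  N = 40;   % placeholder values
X = (0:M)'*h;           % x^0_m = mh, m = 0,...,M (the points where u ~= 0)
U = a*sin(pi*X).^2;     % U^0_m = phi(mh), per (17.14)
for n = 1:N
    X = X + U*kappa;    % x^n_m = x^{n-1}_m + U^{n-1}_m*kappa, per (17.17c)
end                     % U^n_m = U^0_m requires no computation
plot(X, U)              % the grid drifts right and becomes nonuniform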

Let us now return to Eq. (17.8) with f ≢ 0. Similarly to (17.10), one then obtains

w_t = f(ξ, t, w),    (17.18a)

where f(ξ, t, w) is obtained from f(x, t, w) by the change of variables (17.9). (For example, if f(x, t) = x + t² and c = 3, then f(ξ, t) = ξ + 3t + t².) Equation (17.18a) says that w(ξ, t) is no longer a constant along a characteristic

ξ = const (17.18b)

but instead varies along it in the prescribed manner. When solving (17.18a), ξ should be considered as a constant parameter. To find the equation of the characteristics, one needs to solve the first equation in (17.11), where instead of the second equation in (17.11), i.e. w = const, one now needs to use (17.18a). Thus, the solution of the original PDE (17.8a) reduces to the solution of two coupled ODEs: (17.18) and

dx/dt = c(ξ, t, w),   x(t = 0) = ξ,    (17.19)

where c(ξ, t, w) is obtained from c(x, t, w) by the change of variables (17.9) (see the clarification after (17.18a)). An implementation of the solution of (17.18) and (17.19) that assumes the boundary conditions (17.6) is given below:

n = 0:   x^0_m = mh,   ξ_m = x^0_m,   W^0_m = φ(mh);    (17.20a)

n = 1:   x^1_{−1} = 0,   W^1_{−1} = g(κ);

( x^1_m, W^1_m )^T = ( x^0_m, W^0_m )^T + ∫_0^κ ( c(ξ_m, t, w_m(t)), f(ξ_m, t, w_m(t)) )^T dt,   m ≥ 0;    (17.20b)


n ≥ 2:   x^n_{−n} = 0,   W^n_{−n} = g(nκ);

( x^n_m, W^n_m )^T = ( x^{n−1}_m, W^{n−1}_m )^T + ∫_{(n−1)κ}^{nκ} ( c(ξ_m, t, w_m(t)), f(ξ_m, t, w_m(t)) )^T dt,   m ≥ −n + 1.    (17.20c)

Here the expression

∫_{(n−1)κ}^{nκ} ( c(ξ_m, t, w_m(t)), f(ξ_m, t, w_m(t)) )^T dt

is a symbol that denotes the result of integration of the coupled ODEs (17.18) and (17.19). In practice, this integration can be done numerically by any suitable ODE method. Also, w_m(t) above means the solution along the characteristic ξ = ξ_m (see (17.20a)).

The meaning of scheme (17.20) is the following. Like scheme (17.12) before it, it computes the characteristic curves x_m(t), which emanate from the points ξ_m, as per (17.19). However, unlike (17.12), now the value of w is not constant along each characteristic but instead varies according to (17.18). Note that now the equations for the characteristic and for the solution w are coupled and need to be solved simultaneously.
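A minimal Matlab sketch of scheme (17.20) is given below. The functions and values are hypothetical placeholders, and the integral in (17.20) is replaced by one explicit Euler step of the coupled ODEs (17.18) and (17.19); note that c and f are evaluated at the current point of the characteristic, which is equivalent to evaluating their transformed versions at (ξ_m, t).

c   = @(x,t,w) 1 + 0*x;       % placeholder speed
f   = @(x,t,w) -0.5*w;        % placeholder r.h.s.: decay along characteristics
phi = @(x) exp(-(x - 2).^2);  g = @(t) 0*t;      % placeholder data
h = 0.05;  kappa = 0.05;  M = 100;  N = 60;
X = (0:M)'*h;  W = phi(X);    % x^0_m and W^0_m
for n = 1:N
    t0 = (n-1)*kappa;
    Xn = X + kappa*c(X, t0, W);          % Euler step of (17.19)
    Wn = W + kappa*f(X, t0, W);          % Euler step of (17.18): w now varies
    X = [0; Xn];  W = [g(n*kappa); Wn];  % new characteristic enters at x = 0
end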

17.2 Method of characteristics for a system of hyperbolic PDEs

In this section, we will first point out technical difficulties that can arise when using the method of characteristics for a system of PDEs. Then we will work out an example where those difficulties do not occur.

If we attempt to generalize the approach that led to schemes (17.12) and (17.20) to the case where one has a coupled system of two PDEs with intersecting families of curved (i.e., not straight-line) characteristics, one is likely to encounter a problem depicted in the figure on the right. Namely, suppose the characteristics of the two families are chosen to intersect at level t = 0. In the figure, the intersection points are x_{m−1}, x_m, x_{m+1}, etc. However, these characteristics no longer intersect at subsequent time levels; this is especially visible at levels t = 2κ and t = 3κ.

[Figure: two families of curved characteristics that intersect each other at the points x_{m−1}, x_m, x_{m+1}, . . . at the level t = 0 but not at subsequent time levels.]


An analogous problem can also occur if one has three or more characteristics, even if they are straight lines. The only case where this will not occur is where all the characteristics can be chosen to intersect at uniformly spaced points at each level. An example of such a special situation is shown on the right. Note that the vertical characteristics are just the lines

dx/dt = 0  ⇒  x(t) = ξ_m ≡ const.    (17.21)

[Figure: an example of straight-line characteristics, including the vertical lines (17.21), chosen so that they intersect at uniformly spaced points x_{m−1}, x_m, x_{m+1}, . . . at each time level.]

Characteristics (17.21) occur whenever the system of PDEs includes an equation

w_{j, t} = f(x, t, w⃗ ).    (17.22)

You will be asked to verify this in a QSA.

A way around this issue is to interpolate the values of the solution at each time level. For example, suppose one is to solve a system of two PDEs for w_1 and w_2 on the segment 0 ≤ x ≤ 1, with the characteristics for w_1 (w_2) going northeast (northwest). Let there be (M − 1) internal points, x_m = mh, m = 1, . . . , (M − 1), at the initial time level t_0 = 0. Suppose that the characteristics for w_j, j = 1, 2, intersect the next time level t_1 = κ at points x^{(j)}_m. Then one can interpolate the set of values w^{(j)}_m from the respective nonuniform grid x^{(j)}_m onto the same grid x_m as at the initial time level. This interpolation process is then repeated at every time level.

Matlab’s command to interpolate a vector y from a grid defined by a vector x (such that length(x) = length(y)) onto a vector xx is: yy = spline(x, y, xx).
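For instance, re-interpolating solution values from a distorted (nonuniform) grid back onto the original uniform grid could look as follows; all names and values here are illustrative.

Xuni   = linspace(0, 1, 21);           % the uniform grid x_m
Xmoved = Xuni + 0.01*sin(pi*Xuni);     % example nonuniform grid x^{(j)}_m
W      = exp(-10*(Xmoved - 0.5).^2);   % example values w^{(j)}_m on Xmoved
Wuni   = spline(Xmoved, W, Xuni);      % values back on the uniform grid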

We will now work out a solution of a system of two PDEs with straight-line characteristics:

w_{1, t} + c w_{1, x} = f_1(x, t, w_1, w_2),
w_{2, t} − c w_{2, x} = f_2(x, t, w_1, w_2),
w_j(x, 0) = φ_j(x),   0 ≤ x ≤ 1,   j = 1, 2;
w_1(0, t) = g_1(t),   w_2(1, t) = g_2(t),   t ≥ 0.    (17.23)

The two characteristic directions of (17.23) are

ξ_j = x − c_j t,   j = 1, 2;   c_1 = c,   c_2 = −c.    (17.24)

If f_j for j = 1 and/or 2 in (17.23) vanishes, then the respective w_j will not change along its characteristic ξ_j. Therefore, for f_j ≠ 0, it is convenient to calculate the change of w_j along the characteristic ξ_j. Then, similarly to (17.18a), we can write the first two equations of (17.23) as

w_{j, t} = f_j( ξ_j + c_j t, t, w_1(ξ_1, t), w_2(ξ_2, t) )   along   ξ_j = const,   j = 1, 2.    (17.25)


Note that while integrating, say, the equation with j = 1, the argument ξ_2 of w_2 should not be considered as constant. At the moment, this prescription is rather vague, but later on we will present a specific example of how it can be implemented.

The formal numerical implementation of the solution of (17.25) is given below on the grid (17.5), where the maximum value of m = M corresponds to the right boundary x = 1. Note that this grid is stationary and hence is different from the moving grids used in schemes (17.12), (17.17), and (17.20). In particular, in this stationary grid, m does not label a particular characteristic.

The scheme for (17.25) is:

n = 0:   (ξ_j)^0_m = x_m,   (W_j)^0_m = φ_j(mh),   j = 1, 2;    (17.26a)

n ≥ 1:   (ξ_1)^n_m = x_m − c_1 κn ≡ (m − n)h   (see (17.5a,b)),
         (W_1)^n_0 = g_1(nκ),
         (W_1)^n_m = (W_1)^{n−1}_{m−1} + ∫_{(n−1)κ}^{nκ} f_1( (ξ_1)^{n−1}_{m−1} + c_1 t, t, W_1, W_2 ) dt,   m = 1, . . . , M;
         (ξ_2)^n_m = x_m − c_2 κn ≡ (m + n)h,
         (W_2)^n_M = g_2(nκ),
         (W_2)^n_m = (W_2)^{n−1}_{m+1} + ∫_{(n−1)κ}^{nκ} f_2( (ξ_2)^{n−1}_{m+1} + c_2 t, t, W_1, W_2 ) dt,   m = 0, . . . , M − 1.    (17.26b)

Note that with the step sizes along the temporal and spatial coordinates being related by (17.5b), the values of ξ_1 and ξ_2 stay constant along the lines m − n = const and m + n = const, respectively.

To turn scheme (17.26) into a useful tool, we need to specify how the integrals

∫_{(n−1)κ}^{nκ} f_j( (ξ_j)^{n−1}_{m+(−1)^j} + c_j t, t, W_1, W_2 ) dt,   j = 1, 2

can be computed. Recall that these integrals are just the symbols denoting the increment of the solutions of (17.25) from t = (n − 1)κ to t = nκ along the respective characteristic ξ_j = const. Below we show how this can be done by the modified explicit Euler method. We will write the equations first and then will comment on their meaning.

W̄_1 = (W_1)^{n−1}_{m−1} + κ f_1( (ξ_1)^{n−1}_{m−1} + c_1 κ(n−1), (n−1)κ, (W_1)^{n−1}_{m−1}, (W_2)^{n−1}_{m−1} ),
W̄_2 = (W_2)^{n−1}_{m+1} + κ f_2( (ξ_2)^{n−1}_{m+1} + c_2 κ(n−1), (n−1)κ, (W_1)^{n−1}_{m+1}, (W_2)^{n−1}_{m+1} );    (17.27a)

(W_1)^n_m = (1/2) [ (W_1)^{n−1}_{m−1} + W̄_1 + κ f_1( (ξ_1)^n_m + c_1 κn, nκ, W̄_1, W̄_2 ) ],
(W_2)^n_m = (1/2) [ (W_2)^{n−1}_{m+1} + W̄_2 + κ f_2( (ξ_2)^n_m + c_2 κn, nκ, W̄_1, W̄_2 ) ].    (17.27b)

Here W̄_1 and W̄_2 denote the intermediate (“predictor”) values of the solution at the node (x = x_m, t = t_n).

Note that the notations (ξ_1)^{n−1}_{m−1} + c_1 κ(n−1) and (ξ_2)^{n−1}_{m+1} + c_2 κ(n−1) in (17.27a) have been used only to mimic the corresponding terms in (17.25). Those terms, as evident from the first two equations of (17.23) and from (17.25), must equal x_{m−1} and x_{m+1} for j = 1 and j = 2, respectively. Indeed:

(ξ_1)^{n−1}_{m−1} + c_1 κ(n−1) = ( h(m−1) − cκ(n−1) ) + cκ(n−1) = x_{m−1},
(ξ_2)^{n−1}_{m+1} + c_2 κ(n−1) = ( h(m+1) + cκ(n−1) ) − cκ(n−1) = x_{m+1},

where we have used the equations for (ξ_j)^n_m from (17.26b). Similarly, (ξ_j)^n_m + c_j κn in (17.27b) equals x_m for both j = 1 and 2.

The meaning of the first equation in (17.27a) is the following. The change of W_1 is computed along the characteristic ξ_1 = (ξ_1)^{n−1}_{m−1} by the simple Euler approximation, whereby all arguments of f_1 are evaluated at the “starting” node (x = x_{m−1}, t = t_{n−1}). Since, as we have said, this change occurs along the characteristic ξ_1 = (ξ_1)^{n−1}_{m−1}, which is labeled “ξ_1 = const” in the figure on the right, the “final” node of this step is (x = x_m, t = t_n).

[Figure: stencil of scheme (17.27): the characteristic ξ_1 = const runs from node m−1 at level n−1 to node m at level n, and the characteristic ξ_2 = const runs from node m+1 at level n−1 to the same node m at level n.]

Similarly, the change of W_2 in (17.27a) is computed along the characteristic ξ_2 = (ξ_2)^{n−1}_{m+1} by the simple Euler approximation; hence all arguments of f_2 are evaluated at the “starting” node (x = x_{m+1}, t = t_{n−1}) for that characteristic (which is labeled “ξ_2 = const” in the figure above). The step along this characteristic ends at the same node (x = x_m, t = t_n).

Finally, the equations in (17.27b) are the standard “corrector” equations of the explicit modified Euler method.
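To summarize, one full time step of (17.26)–(17.27) could be coded as in the Matlab sketch below. The functions f1, f2, g1, g2 and the initial data are hypothetical placeholders. Also, at the boundary nodes, where the predictor is not defined by (17.27a), this sketch simply substitutes the prescribed boundary values; this closure is an assumption of the sketch, not a prescription of the scheme.

c = 1;  M = 50;  h = 1/M;  kappa = h/c;  N = 100;   % grid per (17.5b)
f1 = @(x,t,w1,w2) -w2;   f2 = @(x,t,w1,w2) w1;      % placeholder couplings
g1 = @(t) 0;  g2 = @(t) 0;                          % placeholder boundary data
x  = (0:M)*h;
W1 = sin(pi*x);  W2 = cos(pi*x);                    % placeholder phi_1, phi_2
for n = 1:N
    t0 = (n-1)*kappa;  t1 = n*kappa;
    % Predictor (17.27a): Euler steps along the two characteristics;
    % f1 is evaluated at the starting node (x_{m-1}, t_{n-1}),
    % f2 at the starting node (x_{m+1}, t_{n-1}).
    B1 = zeros(1,M+1);  B2 = zeros(1,M+1);
    B1(2:M+1) = W1(1:M)   + kappa*f1(x(1:M),   t0, W1(1:M),   W2(1:M));
    B2(1:M)   = W2(2:M+1) + kappa*f2(x(2:M+1), t0, W1(2:M+1), W2(2:M+1));
    B1(1) = g1(t1);  B2(M+1) = g2(t1);              % boundary values at level n
    % Corrector (17.27b): f1, f2 are now evaluated at the final node (x_m, t_n)
    V1 = B1;  V2 = B2;
    V1(2:M+1) = 0.5*( W1(1:M)   + B1(2:M+1) + kappa*f1(x(2:M+1), t1, B1(2:M+1), B2(2:M+1)) );
    V2(1:M)   = 0.5*( W2(2:M+1) + B2(1:M)   + kappa*f2(x(1:M),   t1, B1(1:M),   B2(1:M)) );
    W1 = V1;  W2 = V2;            % V1(1) and V2(M+1) carry the boundary values
end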

Scheme (17.26), (17.27) can be straightforwardly generalized for more than two coupled first-order hyperbolic PDEs, as long as all the characteristics can be chosen to intersect at uniformly spaced points at each time level. An example of that situation is shown in the figure next to Eq. (17.21). Extending the scheme to use a Runge–Kutta type method of order higher than two (which is the order of the modified explicit Euler method) also appears to be straightforward.

17.3 Questions for self-assessment

1. What is the meaning of scheme (17.7)?

2. Where can one specify a boundary condition for Eqs. (17.8) and where can one not?

3. Why is w = const in (17.11)?

4. What is the meaning of scheme (17.12)?

5. What does the expression ∫_{(n−1)κ}^{nκ} c(x(t′), t′, w) dt′ in (17.12) stand for?

6. Explain where solution (17.15) comes from and then how it is reduced to (17.15′).

7. What is the meaning of scheme (17.20)?


8. Describe a technical problem that is likely to occur when solving a system of coupled PDEs by the method of characteristics. How can this problem be overcome?

9. Verify the statement found after Eq. (17.22).

10. What is the difference between the grids used in schemes (17.12) and (17.20), on one hand, and in scheme (17.26), on the other?

11. Why is W_2 in the first line of (17.27a) evaluated at x_{m−1}? Why is W_1 in the second line of (17.27a) evaluated at x_{m+1}?