
Numerical Mathematics II:

Numerical Solution of

Ordinary Differential Equations

Peter Philip∗

Lecture Notes

Originally Created for the Class of Spring Semester 2015 at LMU Munich,

Revised and Extended for Several Subsequent Classes

August 8, 2020

Contents

1 Setting and Motivation
  1.1 Initial Value Problems for ODE
  1.2 Approximation

2 Explicit Single-Step Methods
  2.1 Explicit Euler Method
  2.2 General Error Estimates
  2.3 Classical Explicit Runge-Kutta Method

3 General Runge-Kutta Methods
  3.1 Definition, Butcher Tableaus, First Properties
  3.2 Existence and Uniqueness
  3.3 Autonomous ODE
  3.4 Stability Functions
  3.5 Higher Orders of Consistency and Convergence

4 Accuracy Improvements
  4.1 The Idea of Extrapolation
  4.2 Asymptotic Error Expansions
  4.3 Extrapolation Methods
  4.4 Stepsize Control

5 Stiff Equations

6 Collocation Methods

A Lipschitz Continuity

B Ordinary Differential Equations (ODE)
  B.1 Regularity of Solutions
  B.2 Differentiability with Respect to Initial Conditions and Parameters

References

∗E-Mail: [email protected]


1 Setting and Motivation

We write K if a statement is meant to be valid for both K = R and K = C. Even if K = C, differentiability will always mean R-differentiability, identifying C^n ≅ R^{2n}.

1.1 Initial Value Problems for Ordinary Differential Equations

Definition 1.1. Let n ∈ N. Given G ⊆ R × K^n and f : G −→ K^n, we call

y′ = f(x, y) (1.1)

an (explicit) ordinary differential equation (ODE) of first order. A solution to this ODE is a differentiable function φ : I −→ K^n, defined on a nontrivial (bounded or unbounded, open or closed or half-open) interval I ⊆ R, satisfying the two conditions

(i) {(x, φ(x)) ∈ I × K^n : x ∈ I} ⊆ G,

(ii) φ′(x) = f(x, φ(x)) for each x ∈ I,

where condition (i) is necessary to even formulate condition (ii).

To single out a specific solution to an ODE, we need further requirements, e.g., so-called initial conditions:

Definition 1.2. Let n ∈ N. An initial value problem for (1.1) consists of the ODE (1.1) plus the initial condition

y(ξ) = η (1.2)

with given ξ ∈ R and η ∈ K^n. A solution φ to the initial value problem is a differentiable function φ as in Def. 1.1 that is a solution to the ODE and that also satisfies (1.2) (with y replaced by φ) – in particular, this requires ξ ∈ I.

It can also be of interest to consider higher-order ODE, i.e. ODE that involve higher derivatives of the unknown function y. However, as it turns out, one can write every higher-order ODE as an equivalent first-order ODE (see, e.g., [Phi16c, Sec. 3.1]) with a corresponding equivalence of initial value problems. Thus, we will restrict ourselves to first-order ODE in this class.

Remark 1.3. It is often useful to write an initial value problem as in Def. 1.2 as an equivalent integral equation. The following result is a simple consequence of the fundamental theorem of calculus: If G ⊆ R × K^n, n ∈ N, and f : G −→ K^n is continuous, then, for each (ξ, η) ∈ G, the initial value problem consisting of (1.1) and (1.2) is equivalent to the integral equation

y(x) = η + ∫_ξ^x f(t, y(t)) dt, (1.3)


in the sense that a differentiable function φ : I −→ K^n, with ξ ∈ I ⊆ R being a nontrivial interval, and φ satisfying Def. 1.1(i), is a solution to (1.1) and (1.2) if, and only if,

∀ x ∈ I: φ(x) = η + ∫_ξ^x f(t, φ(t)) dt,

i.e. if, and only if, φ is a solution to the integral equation (1.3).
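The integral form (1.3) is also of computational use: iterating its right-hand side is the Picard iteration, which can be carried out numerically. The following is a minimal sketch (not part of the notes), approximating the integral by a cumulative trapezoidal rule for the toy problem y′ = y, y(0) = 1, with exact solution e^x:

```python
import numpy as np

# Picard iteration phi_{k+1}(x) = eta + int_xi^x f(t, phi_k(t)) dt,
# suggested by the integral equation (1.3); a sketch, not from the notes.
f = lambda t, y: y                       # right-hand side of y' = y
xi, eta, b = 0.0, 1.0, 1.0               # initial condition, interval [xi, b]
x = np.linspace(xi, b, 201)
phi = np.full_like(x, eta)               # phi_0 = constant eta

for _ in range(10):
    integrand = f(x, phi)
    # cumulative trapezoidal approximation of int_xi^x integrand(t) dt
    increments = 0.5 * (integrand[1:] + integrand[:-1]) * np.diff(x)
    phi = eta + np.concatenate(([0.0], np.cumsum(increments)))

print(np.max(np.abs(phi - np.exp(x))))   # small: the iterates approach exp
```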

Under suitable hypotheses, initial value problems have a unique (maximal) solution:

Theorem 1.4. If G ⊆ R × K^n is open, n ∈ N, and f : G −→ K^n is continuous and locally Lipschitz with respect to y (cf. Def. A.1(b)), then, for each (ξ, η) ∈ G, the initial value problem consisting of (1.1) and (1.2) has a unique maximal solution (where a solution is called maximal provided it is defined on an open interval and cannot be extended to a solution on some larger open interval).

Proof. The theorem follows by combining [Phi16c, Th. 3.15] with [Phi16c, Th. 3.22].

For most ODE, no explicit solution formulas are available, and numerical methods are required to approximate solutions. Stable numerical methods virtually always require that small changes of the input data result in small changes of the output data. The following result Th. 1.6 shows that initial value problems for ODE are not hopeless in this regard. To prove Th. 1.6, we will make use of the following version of Gronwall's inequality:

Proposition 1.5 (Gronwall's Inequality). Let a, b ∈ R, a < b, α, φ ∈ C[a, b], α ≥ 0, C ∈ R. If φ satisfies the integral inequality

∀ x ∈ [a, b]: φ(x) ≤ C + ∫_a^x α(t)φ(t) dt, (1.4)

then φ can be estimated in the following way:

∀ x ∈ [a, b]: φ(x) ≤ C exp(∫_a^x α(t) dt). (1.5)

Proof. Fix ε > 0 and define the auxiliary function

g : [a, b] −→ R, g(x) := (C + ε) exp(∫_a^x α(t) dt).

We differentiate g to note it satisfies the ODE

∀ x ∈ [a, b]: g′(x) = (C + ε) α(x) exp(∫_a^x α(t) dt) = α(x) g(x),

and, equivalently,

∀ x ∈ [a, b]: g(x) = C + ε + ∫_a^x α(t) g(t) dt.


We claim that

∀ x ∈ [a, b]: φ(x) < g(x). (1.6)

Due to (1.4) and ε > 0, we know φ(a) < g(a). Seeking a contradiction, assume

s := sup{x ∈ [a, b] : φ(t) < g(t) for each a ≤ t < x} < b.

Then the continuity of φ and g implies s > a and φ(s) = g(s). On the other hand, by (1.4),

φ(s) ≤ C + ∫_a^s α(t)φ(t) dt < C + ε + ∫_a^s α(t)g(t) dt = g(s),

in contradiction to φ(s) = g(s). Thus, the assumption s < b was wrong, which means (1.6) is true. Since ε > 0 was arbitrary, (1.6) implies (1.5).

Theorem 1.6 (Continuity in Initial Conditions). If G ⊆ R × K^n is open, n ∈ N, and f : G −→ K^n is continuous and globally L-Lipschitz with respect to y, i.e.

∃ L ≥ 0 ∀ (x, y), (x, ȳ) ∈ G: ‖f(x, y) − f(x, ȳ)‖ ≤ L ‖y − ȳ‖, (1.7)

then the solutions to (1.1) depend continuously on the initial condition: If φ_1, φ_2 : I −→ K^n are both solutions to (1.1) defined on the same interval I ⊆ R with ξ ∈ I, then

∀ x ∈ I, x ≥ ξ: ‖φ_1(x) − φ_2(x)‖ ≤ e^{L(x−ξ)} ‖φ_1(ξ) − φ_2(ξ)‖. (1.8)

Proof. As both φ_1 and φ_2 satisfy an integral equation corresponding to (1.3), we obtain, for each x ∈ I with x ≥ ξ:

‖φ_1(x) − φ_2(x)‖ = ‖φ_1(ξ) − φ_2(ξ) + ∫_ξ^x (f(t, φ_1(t)) − f(t, φ_2(t))) dt‖
  ≤ ‖φ_1(ξ) − φ_2(ξ)‖ + L ∫_ξ^x ‖φ_1(t) − φ_2(t)‖ dt (by (1.7)).

We now note that we can apply Prop. 1.5 with a := ξ, C := ‖φ_1(ξ) − φ_2(ξ)‖, α ≡ L, and φ replaced by ‖φ_1 − φ_2‖ to obtain (1.8).

1.2 Approximation

The general idea of most numerical approximation methods for initial value problems for ODE y′ = f(x, y), y(ξ) = η, is to start at x_0 := ξ and to proceed by discrete steps h_0, h_1, . . . to x_1 := x_0 + h_0 > x_0, x_2 := x_1 + h_1 > x_1, . . . , while, simultaneously, starting with y_0 := η and proceeding to approximations y_1 of y(x_1), y_2 of y(x_2), . . . This is the idea giving rise to the following Def. 1.7. However, Def. 1.7 is formulated in a form that is not based on a given background ODE. Rather, the defined method can be seen as being a discrete analogon to the continuous ODE: While a solution φ to the ODE


(1.1) has to satisfy the equation at each point x of the solution interval I, a solution to a discrete method as defined in Def. 1.7 is a (finite) sequence that merely needs to satisfy a finite system of equations, one equation for each of a finite sequence of points x_0 < x_1 < · · · < x_N ∈ I. Of course, one can only expect the discrete solution to approximate a continuous solution to a given ODE in a reasonable way if the defining function ψ of the discrete method is suitably related to the right-hand side f of the considered ODE.

Definition 1.7. (a) Given a real interval [a, b] ⊆ R, a < b, the finite sequence

∆ = (x_0, . . . , x_N) ∈ R^{N+1}, N ∈ N,

is called a partition of [a, b] if, and only if, a = x_0 < x_1 < · · · < x_N = b (∆ is then also called a grid or a mesh on [a, b]). Given a partition ∆ of [a, b], define the numbers

∀ k ∈ {0, . . . , N−1}: h_k := x_{k+1} − x_k, (1.9a)

h_max(∆) := max{h_k : k ∈ {0, . . . , N − 1}}, (1.9b)

h_min(∆) := min{h_k : k ∈ {0, . . . , N − 1}}. (1.9c)

The numbers h_k are called step sizes and h_max(∆) is called the mesh size of ∆. Moreover, let Π([a, b]) denote the set of all partitions of [a, b].

(b) Let m, n ∈ N. A (general) m-step method is given via parameters α_0, . . . , α_{m−1} ∈ R and a defining function ψ (sometimes also called increment function) of the form

ψ : D_ψ −→ K^n, where D_ψ ⊆ R^{m+1} × (K^n)^{m+1} × R

is the domain of ψ. Given

((ξ_0, η_0), . . . , (ξ_{m−1}, η_{m−1})) ∈ (R × K^n)^m and h ∈ R^+ (1.10a)

such that ξ_0 < · · · < ξ_{m−1}, (1.10b)

the pair (ξ_m, η_m) ∈ R × K^n is called a local solution to the method if, and only if, (1.11) holds, where

(ξ_0, . . . , ξ_{m−1}, ξ_m, η_0, . . . , η_{m−1}, η_m, h) ∈ D_ψ, (1.11a)

ξ_m = ξ_{m−1} + h, (1.11b)

η_m = ∑_{j=0}^{m−1} α_j η_j + h ψ(ξ_0, . . . , ξ_{m−1}, ξ_m, η_0, . . . , η_{m−1}, η_m, h). (1.11c)

Given a real interval [a, b] ⊆ R, a < b, with a partition ∆ = (x_0, . . . , x_N) as defined in (a), the finite sequence

((x_0, y_0), . . . , (x_N, y_N)) ∈ (R × K^n)^{N+1}


is called a global solution to the method on ∆ if, and only if, it constitutes a sequence of local solutions in the sense that, for each k ∈ {0, . . . , N − m}, (1.12) holds, where

(x_k, . . . , x_{k+m−1}, x_{k+m}, y_k, y_{k+1}, . . . , y_{k+m−1}, y_{k+m}, h_{k+m−1}) ∈ D_ψ, (1.12a)

y_{k+m} = ∑_{j=0}^{m−1} α_j y_{k+j} + h_{k+m−1} ψ(x_k, . . . , x_{k+m−1}, x_{k+m}, y_k, y_{k+1}, . . . , y_{k+m−1}, y_{k+m}, h_{k+m−1}). (1.12b)

If m = 1, then one speaks of a (general) single-step or one-step method, for m > 1 of a (general) multistep method. An m-step method is called explicit if the increment function ψ does not, actually, depend on η_m (i.e. on y_{k+m} in (1.12)) and implicit otherwise. If m = 1 and α_0 = 1, then we speak of a standard single-step method or, simply, a single-step method.

Remark 1.8. (a) Note that, in (1.11c) and (1.12b), the sum involving the α_j and the h-factor in front of ψ could have been incorporated into the increment function. However, the form given in (1.11c) and (1.12b) seems to be the one commonly used in the literature; indeed, using this form is convenient for a large number of commonly considered methods (we will see examples below; for explicit single-step methods, also compare Lem. 2.4(c) below).

(b) If φ : I −→ K^n is the solution to an ODE, then, a priori, the methods defined in Def. 1.7 are designed to yield discrete approximations y_0, y_1, . . . to φ(x_0), φ(x_1), . . . for x_0, x_1, · · · ∈ I, but they do not yield an approximating function u : I −→ K^n of φ. To pass from the y_k to a function u requires interpolation. Different interpolation methods are possible, where interpolation by splines is often used, piecewise linear (i.e. affine) splines being the most simple possible choice.
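As a small illustration of this interpolation step (a sketch with assumed sample data, not from the notes): given grid points x_k and discrete approximations y_k, a piecewise linear spline u is obtained, e.g., via numpy:

```python
import numpy as np

xk = np.array([0.0, 0.25, 0.5, 0.75, 1.0])    # grid x_0, ..., x_N
yk = np.array([1.0, 1.28, 1.65, 2.12, 2.72])  # discrete approximations y_k

def u(x):
    """Piecewise linear (affine) spline through the points (x_k, y_k)."""
    return np.interp(x, xk, yk)

print(u(0.6))  # interpolated value between y_2 and y_3
```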

Remark 1.9. Consider the situation of Def. 1.7.

(a) Consider an explicit m-step method according to Def. 1.7(b). Given data as in (1.10), the method has a local solution (ξ_m, η_m) (which is then necessarily unique) if, and only if, (1.11a) is satisfied with ξ_m defined by (1.11b) and η_m defined by (1.11c): As the method is explicit, (1.11a) does not actually depend on η_m and, if (1.11a) holds, then η_m is well-defined by (1.11c). In consequence, if ∆ = (x_0, . . . , x_N) is a partition of [a, b] with N ≥ m and y_0, . . . , y_{m−1} ∈ K^n are initial data satisfying (1.12a) with k = 0, then (1.12b) yields a recursion for the computation of y_l, l ≥ m, that continues uniquely as long as (1.12a) holds. Thus, given initial data, an explicit m-step method has a global solution on ∆ (which is then necessarily unique) if, and only if, (1.12a) holds for each k ∈ {0, . . . , N − m}.

(b) Clearly, every explicit method can be written as an implicit method (e.g. by adding η_m on both sides of (1.11c), followed by a division by 2). The converse is, in general, not true, namely, if there are cases such that (1.11c) has either no solution or more than one solution.


(c) Explicit methods tend to be computationally much simpler to execute, as obtaining y_{k+m} from (1.12b) basically means one evaluation of ψ. For implicit methods, y_{k+m} must be obtained as a solution to (1.12b), which, in general, constitutes a coupled system of n nonlinear equations, and might be very difficult (or impossible) to solve. The system (1.12b) is typically again solved by a suitable numerical method, e.g. Newton's method. In particular, obtaining y_{k+m} usually requires several evaluations of ψ. On the other hand, depending on the ODE to be solved, the additional computational cost in each step of such an implicit method might be more than compensated by better convergence properties. The type of ODE where this occurs¹ is known as stiff ODE (cf. Sec. 5).

2 Explicit Single-Step Methods

The most simple methods considered in Def. 1.7(b) are explicit methods for m = 1, i.e. explicit single-step methods, which are the subject of the present section.

2.1 Explicit Euler Method

The explicit Euler method is the most simple of the explicit single-step methods. It is not really sufficiently accurate for practical use, but it is still instructive to study to familiarize ourselves with some of the relevant ideas and issues.

Definition 2.1. Let n ∈ N, G ⊆ R × K^n, f : G −→ K^n. In the situation of the initial value problem

y′ = f(x, y), y(ξ) = η, (ξ, η) ∈ G,

the (standard) explicit single-step method with defining function

ψ : G × R −→ K^n, ψ(x, y, h) := f(x, y),

is called the explicit Euler method for the ODE y′ = f(x, y). Let ∆ = (x_0, . . . , x_N), N ∈ N, be a partition of I := [ξ, b], ξ < b. Setting m = 1 and α_0 = 1 in Def. 1.7, we obtain

((ξ, η) = (x_0, y_0), . . . , (x_N, y_N)) ∈ (I × K^n)^{N+1}

to be a global solution to the method on ∆ if, and only if, the y_k satisfy the recursion

y_0 = η,
∀ k ∈ {0, . . . , N−1}: y_{k+1} = y_k + h_k f(x_k, y_k), (x_k, y_k) ∈ G, h_k = x_{k+1} − x_k. (2.1)

¹This statement is actually not mathematically precise, since the implicit methods useful for solving stiff ODE can be reformulated equivalently as explicit methods (with a new defining function) – a useful method must have the property that, in each step, one can uniquely select an admissible y_{k+m} to advance the discrete solution. In a related issue, there does not seem to exist a mathematically rigorous definition of stiff ODE in the literature.


Remark 2.2. (a) While, for (ξ, η) ∈ G and h ∈ R^+, the explicit Euler method always has the unique local solution

(x, y) := (ξ + h, η + h f(ξ, η)),

in general, there is no guarantee that the method has a global solution on ∆. More precisely, the explicit Euler method has a global solution on ∆ with y_0 = η if, and only if,

(x_0, y_0) = (ξ, η) and ∀ k ∈ {0, . . . , N−1}: (x_k, y_k) ∈ G

(if (x_k, y_k) ∉ G, then y_l is not defined for l > k, also cf. Rem. 1.9(a)).

(b) For n = 1 and K = R, the explicit Euler method can easily be visualized: If y is a solution to y′ = f(x, y), then, at x_k, its slope is y′(x_k) = f(x_k, y(x_k)). Thus, in step k+1, the Euler method approximates y by the line through (x_k, y_k) with slope f(x_k, y_k) (using that y_k is supposed to approximate y(x_k)). This line is described by

l(x) = y_k + (x − x_k) f(x_k, y_k),

i.e. y_{k+1} = l(x_{k+1}) = y_k + h_k f(x_k, y_k) as in (2.1).
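In code, the recursion (2.1) takes only a few lines. The following is a minimal sketch (not part of the notes), applied to the toy problem y′ = −2xy, y(0) = 1, whose exact solution is e^{−x²}:

```python
import numpy as np

def explicit_euler(f, xi, eta, b, N):
    """Explicit Euler recursion (2.1) on an equidistant partition of [xi, b]."""
    x = np.linspace(xi, b, N + 1)
    y = np.empty(N + 1)
    y[0] = eta
    for k in range(N):
        y[k + 1] = y[k] + (x[k + 1] - x[k]) * f(x[k], y[k])
    return x, y

x, y = explicit_euler(lambda x, y: -2 * x * y, 0.0, 1.0, 2.0, 100)
print(np.max(np.abs(y - np.exp(-x**2))))  # O(h) error, cf. Th. 2.12 below
```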

2.2 General Error Estimates

We will now introduce notions and results that help to gauge the accuracy (and the expected error) of explicit single-step methods when approximating a function φ : I −→ K^n, defined on an interval I ⊆ R. While we intend φ to be a solution of a given ODE, we, once again, formulate the following Def. 2.3 without providing a background ODE:

Definition 2.3. Let [a, b] ⊆ R, a < b, and φ : [a, b] −→ K^n, n ∈ N, η := φ(a). Moreover, consider a general explicit single-step method according to Def. 1.7(b), with α_0 ∈ R, and defining function

ψ : D_ψ −→ K^n, D_ψ ⊆ R × K^n × R.

Then, for each partition ∆ = (x_0, . . . , x_N) of [a, b], the global solution

((x_0, y_0), . . . , (x_N, y_N)) ∈ ([a, b] × K^n)^{N+1}

with y_0 = η (provided it exists) is given by the recursion

y_0 = η,
∀ k ∈ {0, . . . , N−1}: y_{k+1} = α_0 y_k + h_k ψ(x_k, y_k, h_k), h_k := x_{k+1} − x_k. (2.2)

(a) The method is said to have order of convergence p ∈ N with respect to φ if, and only if, there exists ε := ε(ψ, φ) ∈ R^+ such that the global solution satisfying (2.2) exists for each ∆ ∈ Π([a, b]) with h_max(∆) < ε and, moreover,

∃ C ≥ 0 ∀ ∆ ∈ Π([a, b]), h_max(∆) < ε: max{‖y_k − φ(x_k)‖ : k ∈ {0, . . . , N}} ≤ C (h_max(∆))^p, (2.3)


where the quantity on the left-hand side of the inequality in (2.3) is known as the method's global truncation error (it depends on ψ, φ, and ∆).

(b) For each x ∈ [a, b[ and each h ∈ ]0, b − x] such that (x, φ(x), h) ∈ D_ψ, we call

λ(x, h) := α_0 φ(x) + h ψ(x, φ(x), h) − φ(x + h) (2.4)

(cf. (2.2)) the method's local truncation error at point (x + h, φ(x + h)) with respect to the stepsize h. We call λ well-defined if, and only if, for each x ∈ [a, b[, there exists ε_λ(x) := ε_λ(ψ, φ, x) ∈ R^+ such that λ(x, h) is defined for each h ∈ I_x := ]0, ε_λ(x)[ ∩ ]0, b − x] – uniformly well-defined provided that ε_λ := ε_λ(x) can be chosen independently of x. If λ is well-defined, then the method is said to be consistent if, and only if,

∀ x ∈ [a, b[: lim_{h↓0} λ(x, h)/h = 0; (2.5)

and consistent of order p ∈ N if, and only if,

∃ C ≥ 0 ∀ x ∈ [a, b[ ∀ h ∈ I_x: ‖λ(x, h)‖ ≤ C h^{p+1}. (2.6)

Lemma 2.4. Consider the situation of Def. 2.3 and assume the local truncation error λ to be well-defined.

(a) If the method given by ψ and α_0 is consistent of order p ∈ N, then it is consistent.

(b) Let k ∈ {1, . . . , n}. If φ_k is differentiable at x ∈ [a, b[ and lim_{h↓0} λ_k(x, h)/h exists in K, then α_0 = 1 or φ_k(x) = 0 or lim_{h↓0} |ψ_k(x, φ(x), h)| = ∞.

(c) Let φ be differentiable and, for x ∈ [a, b[, assume lim_{h↓0} ψ(x, φ(x), h) exists in K^n. If lim_{h↓0} λ(x, h)/h exists in K^n, then α_0 = 1 or φ(x) = 0 (in particular, if the method is consistent, then φ ≡ 0 or α_0 = 1).

(d) Assume φ to be a solution to the ODE y′ = f(x, y) with f : G −→ K^n, G ⊆ R × K^n, n ∈ N. In view of (c), also assume α_0 = 1. Then the method given by ψ is consistent if, and only if,

∀ x ∈ [a, b[: lim_{h↓0} ψ(x, φ(x), h) = f(x, φ(x)). (2.7)

Proof. (a): If (2.6) holds, then

∀ x ∈ [a, b[: lim_{h↓0} ‖λ(x, h)‖/h ≤ lim_{h↓0} (C h^p) = 0,

proving the method to be consistent.

(b): Let x ∈ [a, b[ and 0 < h < min{ε_λ(x), b − x}. From (2.4), we obtain

λ_k(x, h)/h = (α_0 φ_k(x) − φ_k(x) + φ_k(x) − φ_k(x + h))/h + ψ_k(x, φ(x), h)
  = φ_k(x)(α_0 − 1)/h + (φ_k(x) − φ_k(x + h))/h + ψ_k(x, φ(x), h).


Note lim_{h↓0} (φ_k(x) − φ_k(x + h))/h = −φ′_k(x). Thus, as

φ_k(x)(α_0 − 1) ≠ 0 ⇒ lim_{h↓0} |φ_k(x)(α_0 − 1)|/h = ∞,

we see that lim_{h↓0} λ_k(x, h)/h can only exist in K, provided that lim_{h↓0} |ψ_k(x, φ(x), h)| = ∞ as well.

(c) is an immediate consequence of (b).

(d): Since φ is differentiable and α_0 = 1, in consequence of (2.4),

lim_{h↓0} λ(x, h)/h exists ⇔ lim_{h↓0} ψ(x, φ(x), h) exists.

Moreover, if the limits exist, then

lim_{h↓0} λ(x, h)/h = −φ′(x) + lim_{h↓0} ψ(x, φ(x), h) = −f(x, φ(x)) + lim_{h↓0} ψ(x, φ(x), h),

implying the equivalence between (2.5) and (2.7).

We will see in Cor. 2.8 below that, if ψ and φ are sufficiently regular, then consistency of order p implies order of convergence p as well. In preparation, we prove the following lemma:

Lemma 2.5. Let N ∈ N and consider the finite sequence (a_0, . . . , a_N) in R_0^+. If there are numbers β, h_0, . . . , h_{N−1} ∈ R_0^+ and L ∈ R^+ such that

∀ k ∈ {0, . . . , N−1}: a_{k+1} ≤ (1 + h_k L) a_k + h_k β, (2.8a)

then

∀ k ∈ {0, . . . , N}: a_k ≤ ((e^{L x_k} − 1)/L) β + a_0 e^{L x_k}, where x_k := ∑_{j=0}^{k−1} h_j. (2.8b)

Proof. We prove (2.8b) via induction on k: For k = 0, we have the true statement a_0 ≤ a_0. Thus, let k ∈ {0, . . . , N − 1}. Then, using the induction hypothesis and 1 + h_k L ≤ e^{h_k L}, one estimates

a_{k+1} ≤ (1 + h_k L) a_k + h_k β ≤ (1 + h_k L)(((e^{L x_k} − 1)/L) β + a_0 e^{L x_k}) + h_k β
  ≤ ((e^{L(x_k + h_k)} − 1 − h_k L)/L + h_k) β + a_0 e^{L(x_k + h_k)}
  = ((e^{L x_{k+1}} − 1)/L) β + a_0 e^{L x_{k+1}},

completing the proof.


Definition 2.6. The function ψ : D_ψ −→ K^n, n ∈ N, with D_ψ ⊆ R × K^n × R is called (globally) L-Lipschitz with respect to y if, and only if,

∃ L ≥ 0 ∀ (x, y, h), (x, ȳ, h) ∈ D_ψ: ‖ψ(x, y, h) − ψ(x, ȳ, h)‖ ≤ L ‖y − ȳ‖. (2.9)

The following Th. 2.7 does not only account for the truncation error, but also for a possible error in the initial value as well as possible rounding errors in each step.

Theorem 2.7. Consider the setting of Def. 2.3 with α_0 := 1. However, instead of the recursion (2.2), we consider the modification

v_0 = η + e_0,
∀ k ∈ {0, . . . , N−1}: v_{k+1} = v_k + h_k ψ(x_k, v_k, h_k) + r_k, h_k := x_{k+1} − x_k, (2.10)

where e_0 ∈ K^n represents a possible error in the initial value and the r_k ∈ K^n represent possible rounding errors in the kth step. We assume

∀ k ∈ {0, . . . , N−1}: ‖r_k‖ ≤ δ ∈ R_0^+. (2.11)

Moreover, assume the underlying explicit single-step method given by ψ is consistent of order p ∈ N with λ uniformly well-defined according to Def. 2.3(b), and assume ψ to be globally L-Lipschitz with respect to y, L > 0. Then, for each partition ∆ = (x_0, . . . , x_N) of [a, b] with h_max(∆) < ε_λ (ε_λ as in Def. 2.3(b)) and such that the recursion (2.10) is well-defined (i.e. (x_k, v_k, h_k) ∈ D_ψ for each k ∈ {0, . . . , N − 1}), the following estimate holds true:

max{‖v_k − φ(x_k)‖ : k ∈ {0, . . . , N}} ≤ ((e^{L(b−a)} − 1)/L)(C (h_max(∆))^p + δ/h_min(∆)) + ‖e_0‖ e^{L(b−a)}, (2.12)

C being the constant given by (2.6).

Proof. To simplify notation, we write h_max := h_max(∆) and h_min := h_min(∆). Introducing, for each k ∈ {0, . . . , N}, the abbreviations

φ_k := φ(x_k), e_k := v_k − φ_k,

and, for each k ∈ {0, . . . , N − 1},

λ_k := λ(x_k, h_k) = φ_k + h_k ψ(x_k, φ_k, h_k) − φ_{k+1} (by (2.4)), (2.13)

the idea is to apply Lem. 2.5 with a_k := ‖e_k‖. Using (2.13) together with the recursion (2.10) for v_k, we obtain, for each k ∈ {0, . . . , N − 1},

e_{k+1} = v_{k+1} − φ_{k+1} = e_k + h_k ψ(x_k, v_k, h_k) + r_k + λ_k − h_k ψ(x_k, φ_k, h_k).


Applying the norm and using the assumed Lipschitz continuity of ψ with respect to y together with (2.6) then yields

‖e_{k+1}‖ ≤ (1 + h_k L) ‖e_k‖ + h_k (C h_max^p + δ/h_min),

which is (2.8a) with a_k = ‖e_k‖ and β = C h_max^p + δ/h_min. Thus, (2.8b) yields

∀ k ∈ {0, . . . , N}: ‖e_k‖ ≤ ((e^{L(x_k−a)} − 1)/L)(C h_max^p + δ/h_min) + ‖e_0‖ e^{L(x_k−a)},

proving (2.12).

Corollary 2.8. As in Th. 2.7, consider the setting of Def. 2.3 with α_0 := 1. Assume ψ to be globally L-Lipschitz with respect to y, L > 0, and assume the explicit single-step method given by ψ is consistent of order p ∈ N.

(a) If λ is uniformly well-defined and if there exists ε ∈ R^+, ε ≤ ε_λ, such that the global solution

((x_0, y_0), . . . , (x_N, y_N)) ∈ ([a, b] × K^n)^{N+1},

satisfying

y_0 = η := φ(a),
∀ k ∈ {0, . . . , N−1}: y_{k+1} = y_k + h_k ψ(x_k, y_k, h_k), h_k := x_{k+1} − x_k,

exists for each ∆ ∈ Π([a, b]) with h_max(∆) < ε, then the method has order of convergence p as defined in Def. 2.3(a). More precisely,

∀ ∆ ∈ Π([a, b]), h_max(∆) < ε: max{‖y_k − φ(x_k)‖ : k ∈ {0, . . . , N}} ≤ K h_max^p, (2.14a)

where h_max := h_max(∆) and K = (C/L)(e^{L(b−a)} − 1), (2.14b)

C being the constant given by (2.6).

(b) If there exists r ∈ R^+ such that the domain D_ψ of ψ satisfies

D_ψ ⊇ {(x, y) ∈ [a, b] × K^n : ‖y − φ(x)‖ ≤ r} × [0, r], (2.15)

then λ is uniformly well-defined with ε_λ = r, an ε ≤ r, satisfying the hypothesis of (a), does exist, and the method has order of convergence p.

Proof. (a): One merely has to set e_0 := 0 and δ := 0 in Th. 2.7.

(b): If (x, h) ∈ [a, b] × [0, r], then (x, φ(x), h) ∈ D_ψ by (2.15), showing λ to be uniformly well-defined with ε_λ = r. Define

ε := r for K = 0, ε := min{r, (r/K)^{1/p}} for K > 0.


We can now apply Th. 2.7 inductively: Let ∆ ∈ Π([a, b]) with h_max(∆) < ε. Then (x_0, y_0, h_0) = (a, φ(a), h_0) ∈ D_ψ by (2.15). Now let k ∈ {0, . . . , N−1} and, by induction, assume (x_l, y_l, h_l) ∈ D_ψ for each l ∈ {0, . . . , k}. Applying Th. 2.7 with b replaced by x_{k+1} (and e_0 := 0, δ := 0), we obtain

‖y_{k+1} − φ(x_{k+1})‖ ≤ ((e^{L(x_{k+1}−a)} − 1)/L) C (h_max(∆))^p ≤ K ε^p ≤ K (r/K) = r,

showing (x_{k+1}, y_{k+1}, h_{k+1}) ∈ D_ψ by (2.15), completing the induction and the proof.

Corollary 2.9. If one considers the situation of Th. 2.7 with e_0 = 0 and equidistant stepsizes (i.e. h := h_min(∆) = h_max(∆) < ε_λ), then the bound for the total error given by (2.12) is minimized for

h := h_opt := (δ/p)^{1/(p+1)}, (2.16a)

leading to

max{‖v_k − φ(x_k)‖ : k ∈ {0, . . . , N}} ≤ (1 + p) K (δ/p)^{p/(p+1)}, (2.16b)

K := (max{C, 1}/L)(e^{L(b−a)} − 1). (2.16c)

Proof. For e_0 = 0 and equidistant stepsizes, the estimate in (2.12) becomes

max{‖v_k − φ(x_k)‖ : k ∈ {0, . . . , N}} ≤ ((e^{L(b−a)} − 1)/L)(C h^p + δ/h) ≤ K (h^p + δ/h) =: α(h)

(using (2.16c)). Clearly, α is differentiable on R^+ and α′(h) = K(p h^{p−1} − δ/h²). Thus, α′ has its only zero at h_opt, where the sign changes from negative to positive, proving α to be minimal at h_opt. Moreover,

α(h_opt) = K((δ/p)^{p/(p+1)} + δ(p/δ)^{1/(p+1)}) = K (δ/p)^{p/(p+1)} (1 + δ(p/δ)^{p/(p+1)}(p/δ)^{1/(p+1)}) = (1 + p) K (δ/p)^{p/(p+1)},

proving (2.16b).

Remark 2.10. We conclude from Cor. 2.9 that it is actually counterproductive to reduce the stepsize to values below h_opt if rounding errors are known to be of order δ.
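As a quick numerical illustration of (2.16a) (a sketch with assumed values, not from the notes): for rounding errors of size δ := 10^{−12},

```python
# Evaluate h_opt = (delta/p)^(1/(p+1)) from (2.16a) for an assumed
# rounding-error level delta and several consistency orders p.
delta = 1e-12
for p in (1, 2, 4):
    h_opt = (delta / p) ** (1 / (p + 1))
    print(p, h_opt)
# For p = 1, h_opt = 1e-6: reducing h below this value increases the
# total error bound, in accordance with Rem. 2.10.
```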

In the proof of Th. 2.12 below, we intend to make use of Cor. 2.8(b). In preparation, we provide the following lemma:


Lemma 2.11. Let n ∈ N and consider K^n with some arbitrary norm ‖ · ‖.

(a) Suppose ∅ ≠ C ⊆ O ⊆ K^n, where O is open and C is compact. Then there exists r ∈ R^+ such that

C_r := {y ∈ K^n : dist(y, C) ≤ r} ⊆ O,

where

dist(y, C) := min{‖y − c‖ : c ∈ C}

(note that the min exists, since C is compact and the norm is continuous).

(b) Let G ⊆ R × K^n be open. If a, b ∈ R, a < b, φ : [a, b] −→ K^n is continuous, and

C := {(x, φ(x)) : x ∈ [a, b]} ⊆ G,

then there exists r ∈ R^+ such that

C̃ := {(x, y) ∈ [a, b] × K^n : ‖y − φ(x)‖ ≤ r} ⊆ G.

Proof. (a): If O = K^n, then there is nothing to prove. Otherwise, let

δ := dist(C, K^n \ O) = min{dist(c, K^n \ O) : c ∈ C}

(this min exists, as C is compact and the map c ↦ dist(c, K^n \ O) is continuous, cf. [Phi16b, Ex. 2.6(b)]). Then δ > 0, as O is open. Set r := δ/2. Then C_r ⊆ O: Indeed, if y ∈ C_r, then there is c ∈ C with ‖y − c‖ ≤ r, and, for each z ∈ K^n \ O,

‖y − z‖ ≥ ‖c − z‖ − ‖c − y‖ ≥ δ − δ/2 = δ/2 > 0,

showing y ∈ O.

(b): As φ is continuous and [a, b] is compact, C must be compact as well. On R × K^n, consider the norm |·| + ‖·‖ and apply (a) with respect to this norm and the sets C ⊆ G ⊆ R × K^n, yielding r > 0 such that

C̃ ⊆ C_r ⊆ G,

as desired: Indeed, if (x, y) ∈ C̃, then (x, φ(x)) ∈ C and

dist((x, y), C) ≤ |x − x| + ‖y − φ(x)‖ = ‖y − φ(x)‖ ≤ r,

showing (x, y) ∈ C_r.

Theorem 2.12. Let n ∈ N, G ⊆ R × K^n, f : G −→ K^n. As in Def. 2.1, consider the explicit Euler method given by

ψ : G × R −→ K^n, ψ(x, y, h) := f(x, y),

for the ODE y′ = f(x, y). We now assume f ∈ C¹(G, K^n), G ⊆ R × K^n open, (ξ, η) ∈ G, and let φ : I −→ K^n be the unique maximal solution to the initial value problem y′ = f(x, y), y(ξ) = η, given by Th. 1.4 (note that f ∈ C¹(G, K^n) implies f to be locally Lipschitz with respect to y according to Prop. A.4 of the Appendix). If b ∈ I, b > ξ, then, with respect to φ on [ξ, b], the method is consistent of order 1 and it has order of convergence 1.


Proof. If f ∈ C¹(G, K^n), then φ, as a solution to y′ = f(x, y), must be C² according to Prop. B.1 in the Appendix. Thus, we can apply Taylor's theorem with the remainder term in integral form to obtain, for each x ∈ [ξ, b[ and each h ∈ ]0, b − x],

φ(x + h) = φ(x) + φ′(x) h + ∫_x^{x+h} (x + h − t) φ″(t) dt. (2.17)

As φ is defined on the entire interval [ξ, b], the local truncation error

λ(x, h) = φ(x) + h ψ(x, φ(x), h) − φ(x + h), where ψ(x, φ(x), h) = f(x, φ(x)) = φ′(x),

is well-defined for each x ∈ [ξ, b[, h ∈ ]0, b − x]. For such (x, h), applying (2.17) in the expression for λ(x, h) yields

λ(x, h) = −∫_x^{x+h} (x + h − t) φ″(t) dt,

and, thus,

‖λ(x, h)‖ ≤ C h², where C = (1/2) max{‖φ″(t)‖ : t ∈ [ξ, b]},

showing the method to be consistent of order 1 according to Def. 2.3(b). It now follows from Cor. 2.8(b) that the method has order of convergence 1 as well: As φ is continuous and G is open, Lem. 2.11(b) yields r > 0 such that

K := {(x, y) ∈ [ξ, b] × K^n : ‖y − φ(x)‖ ≤ r} × [0, r] ⊆ G × R = D_ψ,

proving the validity of (2.15). Moreover, f being continuous and locally Lipschitz with respect to y implies ψ to have the same properties. Thus, according to Prop. A.3 of the Appendix, ψ is globally Lipschitz with respect to y on the compact set K. In consequence, ψ satisfies all the hypotheses of Cor. 2.8(b), yielding the method to have order of convergence 1.

2.3 Classical Explicit Runge-Kutta Method

While the explicit Euler method discussed above is still relatively simple, it is also too inaccurate for most practical applications. The classical explicit Runge-Kutta (RK) method introduced below is slightly more involved, yielding a fourth-order method, often sufficiently good for serious use.

Definition 2.13. Let n ∈ N, G ⊆ R × K^n, f : G −→ K^n. In the situation of the initial value problem

y′ = f(x, y), y(ξ) = η, (ξ, η) ∈ G,


the (standard) explicit single-step method with defining function

ψ : D_ψ −→ K^n, ψ(x, y, h) := (1/6)(k_1(x, y) + 2 k_2(x, y, h) + 2 k_3(x, y, h) + k_4(x, y, h)),

k_1(x, y) := f(x, y),
k_2(x, y, h) := f(x + h/2, y + (h/2) k_1(x, y)),
k_3(x, y, h) := f(x + h/2, y + (h/2) k_2(x, y, h)),
k_4(x, y, h) := f(x + h, y + h k_3(x, y, h)),

is called the classical explicit Runge-Kutta (RK) method for the ODE y′ = f(x, y). The domain D_ψ of ψ consists of all (x, y, h) ∈ G × R such that all arguments of f, required by k_1(x, y), . . . , k_4(x, y, h), are in G. Let ∆ = (x_0, . . . , x_N), N ∈ N, be a partition of I := [ξ, b], ξ < b. Setting m = 1 and α_0 = 1 in Def. 1.7, we obtain

((ξ, η) = (x_0, y_0), . . . , (x_N, y_N)) ∈ (I × K^n)^{N+1}

to be a global solution to the method on ∆ if, and only if, the y_k satisfy the recursion

y_0 = η,
∀ k ∈ {0, . . . , N−1}: y_{k+1} = y_k + (h_k/6)(k_1(x_k, y_k) + 2 k_2(x_k, y_k, h_k) + 2 k_3(x_k, y_k, h_k) + k_4(x_k, y_k, h_k)),
h_k = x_{k+1} − x_k, (x_k, y_k, h_k) ∈ D_ψ. (2.18)

Remark 2.14. As for the explicit Euler method in Rem. 2.2(a), we have the issue that, in general, there is no guarantee y_{k+1} is well-defined by (2.18) (i.e., for a given partition ∆, a global solution (or even a local solution) might not exist). More precisely, y_{k+1} is well-defined by (2.18) if, and only if,

(x_k, y_k) ∈ G, (x_k + h_k/2, y_k + h_k k_1(x_k, y_k)/2) ∈ G,
(x_k + h_k/2, y_k + h_k k_2(x_k, y_k, h_k)/2) ∈ G,
and (x_k + h_k, y_k + h_k k_3(x_k, y_k, h_k)) ∈ G (2.19)

(i.e. the recursion (2.18) terminates with y_k if (2.19) fails).

Theorem 2.15. Let n ∈ N, G ⊆ R × K^n, f : G −→ K^n. For the ODE y′ = f(x, y), consider the classical explicit RK method of Def. 2.13. We now assume f ∈ C⁴(G, K^n), G ⊆ R × K^n open, (ξ, η) ∈ G, and, as in Th. 2.12, we let φ : I −→ K^n be the unique maximal solution to the initial value problem y′ = f(x, y), y(ξ) = η, given by Th. 1.4, also letting b ∈ I, b > ξ.


(a) With respect to φ on [ξ, b], the method is consistent of order 4.

(b) With respect to φ on [ξ, b], the method has order of convergence 4.

Proof. (a): The proof that the classical explicit RK method is consistent of order 4 is already somewhat tedious. It can be carried out using Taylor's theorem, using the same idea as in the proof of Th. 2.12, see [DB08, Sec. 4.2.2]. A more systematic approach that is also suitable for higher-order methods uses the graph-theoretic ideas of Sec. 3.5 below, where we will complete the proof in Ex. 3.39(b).

(b): If f is globally Lipschitz with respect to y on some set U ⊆ G, then so is ψ on U_ψ := D_ψ ∩ (U × [0, T]) for each T > 0: While this holds more generally (cf. Rem. 3.6(b) below), we will directly verify that ψ is globally Lipschitz with respect to y on U_ψ in the present situation: If f is L-Lipschitz with respect to y on U, L ≥ 0, and (x, y, h), (x, ȳ, h) ∈ D_ψ ∩ (U × R_0^+), then

‖k_1(x, y) − k_1(x, ȳ)‖ ≤ L ‖y − ȳ‖,
‖k_2(x, y, h) − k_2(x, ȳ, h)‖ ≤ (L + hL²/2) ‖y − ȳ‖,
‖k_3(x, y, h) − k_3(x, ȳ, h)‖ ≤ (L + hL²/2 + h²L³/4) ‖y − ȳ‖,
‖k_4(x, y, h) − k_4(x, ȳ, h)‖ ≤ (L + hL² + h²L³/2 + h³L⁴/4) ‖y − ȳ‖,
‖ψ(x, y, h) − ψ(x, ȳ, h)‖ ≤ (1/6)(6L + 3hL² + h²L³ + h³L⁴/4) ‖y − ȳ‖.

For h ∈ [0, T], we see ψ to be Lipschitz with respect to y. Now we can prove order of convergence 4 similarly to the corresponding part of the proof of Th. 2.12: As in the proof of Lem. 2.11(b), we observe that the continuity of φ implies the compactness of

C := {(x, φ(x)) : x ∈ [ξ, b]} ⊆ G,

and, using the norm |·| + ‖·‖ on R × K^n, Lem. 2.11(a) yields r_0 > 0 such that

C_{r_0} = {(x, y) ∈ R × K^n : dist((x, y), C) ≤ r_0} ⊆ G.

As f (and, thus, k_1) is continuous, ‖k_1‖ is bounded by some M ∈ R^+ on the compact set C_{r_0}. Letting r_1 := min{r_0/4, r_0/(2M)}, k_2 is defined on the set K_1 := C_{r_1} × [0, r_1]: Indeed, if (x, y, h) ∈ K_1, then there exists (t, z) ∈ C such that |x − t| + ‖y − z‖ ≤ r_1, implying

|x + h/2 − t| + ‖y + (h/2) k_1(x, y) − z‖ ≤ |x − t| + h/2 + ‖y − z‖ + (h/2)‖k_1(x, y)‖ ≤ r_0/4 + r_0/8 + r_0/4 + r_0/4 < r_0,

showing (x + h/2, y + (h/2) k_1(x, y)) ∈ C_{r_0} ⊆ G. In particular, k_2 is also bounded by M on K_1. Thus, we can repeat the same argument, first for k_3 and then for k_4, obtaining both to be defined and bounded by M on K_1. In consequence (cf. the proof of Lem. 2.11(b)),

K := {(x, y) ∈ [ξ, b] × K^n : ‖y − φ(x)‖ ≤ r_1} × [0, r_1] ⊆ K_1 ⊆ D_ψ,


proving the validity of (2.15). Moreover, f being continuous and locally Lipschitz with respect to y implies ψ to have the same properties (locally Lipschitz follows via the above argument applied to local neighborhoods, where f (and, thus, ψ) is globally Lipschitz). Thus, according to Prop. A.3 of the Appendix, ψ is globally Lipschitz with respect to y on the compact set K. In consequence, ψ satisfies all the hypotheses of Cor. 2.8(b), yielding the method to have order of convergence 4.

3 General Runge-Kutta Methods

3.1 Definition, Butcher Tableaus, First Properties

In generalization of the methods defined in Def. 2.1 and Def. 2.13, respectively, we define general RK methods:

Definition 3.1. Let s, n ∈ N, G ⊆ R × K^n, f : G −→ K^n, and consider the initial value problem

y′ = f(x, y), y(ξ) = η, (ξ, η) ∈ G.

(a) We call a (standard) explicit single-step method an s-stage Runge-Kutta (RK) method if, and only if, its defining function has the form

ψ : D_ψ −→ K^n, ψ(x, y, h) = ∑_{j=1}^s b_j k_j(x, y, h), (3.1a)

where, for each (x, y, h) ∈ D_ψ, the auxiliary vectors k_1(x, y, h), . . . , k_s(x, y, h) ∈ K^n satisfy the system

∀ j ∈ {1, . . . , s}: k_j(x, y, h) = f(x + c_j h, y + h ∑_{l=1}^s a_{jl} k_l(x, y, h)), (3.1b)

the coefficients b_1, . . . , b_s ∈ R are called weights, c_1, . . . , c_s ∈ R are called nodes, and the matrix A := (a_{jl}) ∈ M(s, K) is called the RK matrix. The coefficients are commonly compiled in a so-called Butcher tableau in the form

c_1 | a_{11} · · · a_{1s}
 ⋮  |   ⋮            ⋮
c_s | a_{s1} · · · a_{ss}
    | b_1    · · · b_s

or, in short,

c | A
  | b^t

with c := (c_1, . . . , c_s)^t, b := (b_1, . . . , b_s)^t. (3.2)

(b) An RK method is called explicit if, and only if, the RK matrix is strictly lower triangular (i.e. lower triangular with all a_{jj} = 0); implicit otherwise.

(c) We say that an RK method satisfies the consistency condition (cf. Rem. 3.6(a) and Th. 3.8(c) below) if, and only if, the sum of the weights equals 1, i.e. if, and only if,

∑_{j=1}^s b_j = 1. (3.3)


(d) We say that an RK method satisfies the node condition if, and only if,

∀ j ∈ {1, . . . , s}: c_j = ∑_{l=1}^s a_{jl}. (3.4)

It is sometimes useful to rewrite an RK method by replacing the vectors k_j by vectors u_j, making use of the following lemma:

Lemma 3.2. Let s, n ∈ N, G ⊆ R × K^n, f : G −→ K^n, b_j, c_j ∈ R, a_{jl} ∈ K for each j, l ∈ {1, . . . , s}.

(a) Let (x, y, h) ∈ R × K^n × R and consider vectors k_1, . . . , k_s, u_1, . . . , u_s ∈ K^n as well as the systems

∀ j ∈ {1, . . . , s}: k_j = f(x + c_j h, y + h ∑_{l=1}^s a_{jl} k_l), (3.5a)

∀ j ∈ {1, . . . , s}: u_j = y + h ∑_{l=1}^s a_{jl} f(x + c_l h, u_l). (3.5b)

If k_1, . . . , k_s satisfy (3.5a) (which, in particular, is supposed to mean, for each j ∈ {1, . . . , s}, that the used argument of f is in G) and

∀ j ∈ {1, . . . , s}: u_j := y + h ∑_{l=1}^s a_{jl} k_l, (3.6a)

then u_1, . . . , u_s satisfy (3.5b). Conversely, if u_1, . . . , u_s satisfy (3.5b) (which, in particular, is supposed to mean, for each j ∈ {1, . . . , s}, that the used argument of f is in G) and

∀ j ∈ {1, . . . , s}: k_j := f(x + c_j h, u_j), (3.6b)

then k_1, . . . , k_s satisfy (3.5a).

(b) A function ψ : D_ψ −→ K^n satisfies (3.1) for some (x, y, h) ∈ D_ψ if, and only if, it satisfies

ψ(x, y, h) = ∑_{j=1}^s b_j f(x + c_j h, u_j(x, y, h)), (3.7a)

where the auxiliary vectors u_1(x, y, h), . . . , u_s(x, y, h) ∈ K^n satisfy the system

∀ j ∈ {1, . . . , s}: u_j(x, y, h) = y + h ∑_{l=1}^s a_{jl} f(x + c_l h, u_l(x, y, h)). (3.7b)


Proof. (a): If k_1, . . . , k_s satisfy (3.5a) and u_1, . . . , u_s are given by (3.6a), then (3.5a) can be written as

∀ j ∈ {1, . . . , s}: k_j = f(x + c_j h, u_j).

Plugging these expressions for the k_j back into (3.6a) then yields (3.5b). Now, if u_1, . . . , u_s satisfy (3.5b) and k_1, . . . , k_s are given by (3.6b), then (3.5b) can be written as

∀ j ∈ {1, . . . , s}: u_j = y + h ∑_{l=1}^s a_{jl} k_l.

Plugging these expressions for the u_j back into (3.6b) then yields (3.5a).

(b) is now immediate from (a).

Definition 3.3. Consider the situation of Def. 3.1.

(a) If the defining function ψ of an RK method, as defined in Def. 3.1(a), is written in the form (3.1), then we say it is written in k-form. If ψ is written in the (equivalent) form (3.7), then we say it is written in u-form.

(b) An RK method is said to have unique local solutions if, and only if, for each (x, y) ∈ G, there exist γ(x, y) ∈ ]0, ∞] and r(x, y) ∈ R^+ such that, if h ∈ [0, γ(x, y)[, then the system (3.7b) has a unique (local) solution

u(x, y, h) ∈ B_{r(x,y)}(y, . . . , y) ⊆ (K^n)^s ≅ K^{ns}

(as all norms on K^{ns} are equivalent, the existence of γ(x, y) and r(x, y) does not depend on the chosen norm – however, the size of these numbers will, in general, depend on the chosen norm). If the RK method has unique local solutions, then we say it is in standard form if, and only if, for each (x, y) ∈ G and h ∈ [0, γ(x, y)[, we have (x, y, h) ∈ D_ψ and u(x, y, h) ∈ B_{r(x,y)}(y, . . . , y). We note that the present definition is consistent with Def. 1.7(b) in the sense that, if ψ : D_ψ −→ K^n is the defining function of the RK method with unique local solutions in standard form, then, given (x, y) ∈ G and h ∈ [0, γ(x, y)[, the pair

(x + h, y + h ∑_{j=1}^s b_j k_j(x, y, h)) = (x + h, y + h ∑_{j=1}^s b_j f(x + c_j h, u_j(x, y, h)))

is the unique local solution to the method in the sense of Def. 1.7(b).

Remark 3.4. (a) Let ∆ = (x_0, . . . , x_N), N ∈ N, be a partition of I := [ξ, b], ξ < b. Setting m = 1 and α_0 = 1 in Def. 1.7, we obtain

((ξ, η) = (x_0, y_0), . . . , (x_N, y_N)) ∈ (I × K^n)^{N+1}


to be a global solution to an RK method given according to Def. 3.1(a) if, and only if, the y_k satisfy the recursion y_0 = η ∈ K^n,

∀ k ∈ {0, . . . , N−1}: y_{k+1} = y_k + h_k ψ(x_k, y_k, h_k)
  = y_k + h_k ∑_{j=1}^s b_j k_j(x_k, y_k, h_k) (by (3.1a))
  = y_k + h_k ∑_{j=1}^s b_j f(x_k + c_j h_k, u_j(x_k, y_k, h_k)) (by (3.7a)), (3.8a)

h_k = x_{k+1} − x_k, (x_k, y_k, h_k) ∈ D_ψ, (3.8b)

where the k_j and u_j satisfy the systems given by (3.1b) and (3.7b), respectively.

(b) From (3.8a), we note that every RK method given by a defining function as in (3.1) (or as in (3.7)) is an explicit single-step method according to Def. 1.7(b), even if the a_{jl} yield an implicit RK method (as the right-hand side of (3.8a) does not depend on y_{k+1}). However, if the RK method is explicit, then k_1, . . . , k_s can be computed recursively from (3.1b), whereas, for an implicit RK method, the (in general, nonlinear) system of equations for the k_j will, in general, only have a unique solution under additional hypotheses (and, likewise, for the u_1, . . . , u_s). As mentioned before in Rem. 1.9(b), every explicit m-step method can equivalently be written as an implicit m-step method. For RK methods, another way this can be done is as follows: Rewrite (3.8a) in the implicit form

y_{k+1} = y_k + h_k ∑_{j=1}^s b_j f(x_k + c_j h_k, u_j(x_k, y_{k+1}, h_k)), (3.9a)

where (replacing y_k in the expression for u_j(x_k, y_k, h_k) via (3.8a))

u_j(x_k, y_{k+1}, h_k) := u_j(x_k, y_k, h_k) = y_{k+1} + h_k ∑_{l=1}^s (a_{jl} − b_l) f(x_k + c_l h_k, u_l(x_k, y_{k+1}, h_k)). (3.9b)

While this representation is, in most cases, not of practical use, it shows, e.g., the equivalence of a suitable implicit RK method with the so-called implicit Euler method (see Ex. 3.5(c) below).

Example 3.5. (a) Comparing Def. 2.1 with Def. 3.1, we see that the explicit Euler method is a 1-stage RK method with Butcher tableau

0 | 0
  | 1


(b) Comparing Def. 2.13 with Def. 3.1, we see that the classical explicit RK method is a 4-stage RK method with Butcher tableau

0    |
1/2  | 1/2
1/2  | 0    1/2
1    | 0    0    1
     | 1/6  1/3  1/3  1/6

(c) Consider the implicit 1-stage RK method with Butcher tableau

1 | 1
  | 1

Using the representation (3.9), we obtain, since a_{11} = b_1 = 1,

u_1(x_k, y_{k+1}, h_k) = y_{k+1} + h_k (a_{11} − b_1) f(x_k + c_1 h_k, u_1(x_k, y_{k+1}, h_k)) = y_{k+1} (by (3.9b)).

Thus,

y_{k+1} = y_k + h_k f(x_k + c_1 h_k, u_1(x_k, y_{k+1}, h_k)) = y_k + h_k f(x_{k+1}, y_{k+1}),

showing that this RK method is precisely what is known in the literature as the implicit Euler method.

Remark 3.6. Let s ∈ N and assume ψ : D_ψ −→ K^n defines an explicit s-stage RK method according to Def. 3.1(a) (cf. Rem. 3.4). Recall f : G −→ K^n, G ⊆ R × K^n, D_ψ ⊆ R × K^n × R.

(a) Since the RK matrix A is strictly lower triangular, the k_j, j ∈ {1, . . . , s}, are recursively well-defined by (3.1b) as functions k_j : D_ψ −→ K^n (likewise, the u_j : D_ψ −→ K^n are recursively well-defined by (3.7b)). If f is continuous, then so is ψ: Indeed, if f is continuous, then an induction on j ∈ {1, . . . , s} shows each k_j (and each u_j) to be continuous, implying ψ to be continuous as well. Moreover, if f is continuous and G is open, then another induction shows each k_j(x, y, h) (and each u_j(x, y, h)) to be defined if (x, y) ∈ G and h is sufficiently small (in particular, the RK method then has unique local solutions in the sense of Def. 3.3(b)), and we obtain

∀ j ∈ {1, . . . , s}: lim_{h↓0} k_j(x, y, h) = f(x, y), lim_{h↓0} u_j(x, y, h) = y,

implying

lim_{h↓0} ψ(x, y, h) = f(x, y) ∑_{j=1}^s b_j.

Thus, if f is continuous on G open in the situation of Def. 2.3, where φ : [a, b] −→ K^n, [a, b] ⊆ R, a < b, is a solution to y′ = f(x, y), then the local truncation error


λ of Def. 2.3(b) is well-defined and, combining the above with (2.7), shows the RK method to be consistent in the sense of Def. 2.3(b) if it satisfies the consistency condition (3.3). The consistency condition is also necessary for the method to be consistent if there exists x ∈ [a, b] with f(x, φ(x)) ≠ 0.

(b) If f is globally Lipschitz with respect to y on some set U ⊆ G, then so is ψ on U_ψ := D_ψ ∩ (U × [0, T]) for each T > 0: Indeed, if f is globally Lipschitz with respect to y on U, then an induction on j ∈ {1, . . . , s} shows each k_j (and each u_j) to be globally Lipschitz with respect to y on U_ψ, implying ψ to be Lipschitz with respect to y on U_ψ as well.

(c) If f ∈ C^k(G, K^n) and D_ψ is open, then ψ ∈ C^k(D_ψ, K^n): One can show (again using straightforward inductions together with the chain rule) that each partial derivative of order m of k_j (and of u_j), m ∈ {1, . . . , k}, j ∈ {1, . . . , s}, is a polynomial in partial derivatives of order ≤ m of f, implying the same for the partials of ψ.

3.2 Existence and Uniqueness

We will now turn to the solvability question regarding the systems (3.1b) and (3.7b) for implicit RK methods. In Th. 3.8 below, we will prove existence and (local) uniqueness of solutions to (3.1b) and (3.7b), provided f is continuous and locally Lipschitz with respect to y. The proof will be based on the following Prop. 3.7, which can be viewed as a parametrized version of the Banach fixed point theorem (cf. [Phi16b, Th. 2.29]).

Proposition 3.7. Let (X, d), (Y, d) be metric spaces, where the metric space (Y, d) is complete. Let c ∈ Y, r ∈ R^+, consider the ball B := B_r(c) ⊆ Y and a continuous map F : X × B −→ Y that is Lipschitz with respect to y, satisfying

∀ (x, y_1), (x, y_2) ∈ X × B: d(F(x, y_1), F(x, y_2)) ≤ L d(y_1, y_2) (3.10)

with some Lipschitz constant 0 ≤ L < 1. If F also satisfies the condition

∀ x ∈ X: d(F(x, c), c) < r (1 − L), (3.11)

then, for each x ∈ X, there exists a unique g(x) ∈ B such that

g(x) = F(x, g(x)). (3.12)

Moreover, the map g : X −→ B, x ↦ g(x), is continuous and, for each x ∈ X, the sequence (y_k(x))_{k∈N_0}, recursively defined by y_0(x) := c and

y_{k+1}(x) := F(x, y_k(x)) for each k ∈ N_0, (3.13)

converges to g(x):

lim_{k→∞} y_k(x) = g(x). (3.14)


Proof. For L = 0, for each x ∈ X, the function y ↦ F(x, y) is constant, i.e.

∃ y_x ∈ B: F(x, ·) ≡ y_x.

Thus, g : X −→ B, g(x) := y_x, where the continuity of F implies the continuity of g. For the rest of the proof, let 0 < L < 1. Fix x ∈ X and let the sequence (y_k(x))_{k∈N_0} be defined by y_0(x) := c and (3.13). To verify that the sequence is well-defined, we prove, using an induction, that it is a sequence in B: y_0(x) = c ∈ B and

d(y_1(x), c) = d(F(x, c), c) < r (1 − L) < r (by (3.11))

implies y_1(x) ∈ B. Now let k, l ∈ N_0, k > l. Then, by induction, y_0(x), . . . , y_k(x) ∈ B, and, by (3.10),

d(y_{k+1}(x), y_k(x)) = d(F(x, y_k(x)), F(x, y_{k−1}(x))) ≤ L d(y_k(x), y_{k−1}(x)) ≤ L^{k−l} d(y_{l+1}(x), y_l(x)). (3.15)

Thus, for each l, j ∈ N_0 with l + j ≤ k + 1, using (3.15) and (3.11),

d(y_{l+j}(x), y_l(x)) ≤ ∑_{m=l}^{l+j−1} d(y_{m+1}(x), y_m(x)) ≤ ∑_{m=l}^{l+j−1} L^{m−l} d(y_{l+1}(x), y_l(x))
  ≤ (1/(1 − L)) d(y_{l+1}(x), y_l(x)) ≤ (L^l/(1 − L)) d(y_1(x), y_0(x))
  = (L^l/(1 − L)) d(y_1(x), c) < L^l r ≤ r. (3.16)

In particular, using l := 0 and j := k + 1 in (3.16) shows y_{k+1}(x) ∈ B, completing the induction proof of (y_k(x))_{k∈N_0} being a well-defined sequence in B. The sequences now give rise to the functions

f_k : X −→ B, f_k(x) := y_k(x),

defined for each k ∈ N_0. Using the assumed continuity of F together with an induction on k ∈ N_0, we verify the f_k to be continuous: f_0 ≡ c is constant and, hence, continuous. Now let k ∈ N_0. Since f_{k+1}(x) = F(x, f_k(x)) and f_k is continuous by induction, f_{k+1} is continuous as well. To proceed, we, once again, fix x ∈ X. As lim_{l→∞} L^l = 0, (3.16) proves (f_k(x))_{k∈N_0} to be a Cauchy sequence in Y. As Y is complete, we may define

g : X −→ Y, g(x) := lim_{k→∞} f_k(x).

Taking l = 0 in (3.16) shows

∀ x ∈ X: d(g(x), c) = lim_{j→∞} d(y_j(x), c) ≤ (1/(1 − L)) d(y_1(x), c) < r,

i.e. g maps into B, as desired. The continuity of F allows one to take limits in (3.13), showing, for each x ∈ X,

g(x) = lim_{k→∞} y_{k+1}(x) = lim_{k→∞} F(x, y_k(x)) = F(x, g(x)),


i.e. (3.12) is satisfied. We need to show g is continuous. To this end, we fix x ∈ X and l ∈ N_0 in (3.16). For j → ∞, we then obtain

d(g(x), f_l(x)) ≤ (L^l/(1 − L)) d(y_1(x), c) < L^l r,

showing f_l to converge uniformly to g. As a uniform limit of the continuous functions f_l, the function g is itself continuous (see, e.g., [Phi16c, Th. 3.5]).

Given x ∈ X, it remains to show the uniqueness of the g(x) satisfying (3.12). Suppose y ∈ B also satisfies y = F(x, y). Then

d(g(x), y) = d(F(x, g(x)), F(x, y)) ≤ L d(g(x), y) (by (3.10)),

which implies 1 ≤ L for d(g(x), y) > 0. Thus, L < 1 implies d(g(x), y) = 0 and g(x) = y.

Theorem 3.8. Let s, n ∈ N, let G ⊆ R × K^n be open and assume f : G −→ K^n to be continuous and locally Lipschitz with respect to y; let b_j, c_j ∈ R, a_{jl} ∈ K for each j, l ∈ {1, . . . , s}; and let ‖A‖_∞ denote the operator norm of the RK matrix A = (a_{jl}) with respect to ‖ · ‖_∞ on K^s.

(a) For each (x, y) ∈ G, there exist γ(x, y) ∈ ]0, ∞] and r(x, y) ∈ R^+ such that, if h ∈ ]−γ(x, y), γ(x, y)[, the system (3.5b) has a unique (local) solution

u(x, y, h) ∈ B_{r(x,y)}(y, . . . , y) ⊆ (K^n)^s ≅ K^{ns}

(in particular, the resulting RK method has unique local solutions in the sense of Def. 3.3(b)). Moreover, the function

g_{x,y} : [0, γ(x, y)[ −→ B_{r(x,y)}(y, . . . , y), g_{x,y}(h) := u(x, y, h),

is continuous with

g_{x,y}(0) = u(x, y, 0) = (y, . . . , y). (3.17)

If f ∈ C^p(G, K^n), p ∈ N, then one can choose γ(x, y) such that g_{x,y} is p times continuously differentiable.

(b) If, additionally, G = R × K^n and f is globally L-Lipschitz with respect to y (L ∈ R_0^+), then, for each h ∈ ]−γ, γ[ with

γ := 1/(L ‖A‖_∞) (1/0 := ∞),

the system (3.5b) has a unique (global) solution u(x, y, h) ∈ (K^n)^s ≅ K^{ns}.

(c) If the corresponding RK method is in standard form according to Def. 3.3(b) and (x, y) ∈ G, then

∀ j ∈ {1, . . . , s}: lim_{h↓0} u_j(x, y, h) = y, lim_{h↓0} k_j(x, y, h) = f(x, y),


implying

lim_{h↓0} ψ(x, y, h) = f(x, y) ∑_{j=1}^s b_j.

Moreover, if φ : [a, b] −→ K^n, [a, b] ⊆ R, a < b, is a solution to y′ = f(x, y), then the local truncation error λ of Def. 2.3(b) is well-defined and the RK method is consistent in the sense of Def. 2.3(b) if it satisfies the consistency condition (3.3). The consistency condition is also necessary for the method to be consistent if there exists x ∈ [a, b] with f(x, φ(x)) ≠ 0.

Proof. (a): Fix an arbitrary norm ‖ · ‖ on K^n and fix (x, y) ∈ G. Set ȳ := (y, . . . , y) ∈ K^{ns}. The idea is to find γ := γ(x, y) ∈ R^+ and r := r(x, y) ∈ R^+ such that we can write (3.5b) in the form

u = F(h, u)

(corresponding to (3.12)) with

F : X × B −→ K^{ns}, F_j(h, u) := y + h ∑_{l=1}^s a_{jl} f(x + c_l h, u_l) ∈ K^n (j ∈ {1, . . . , s}), (3.18)

where X := ]−γ, γ[ and B := B_r(ȳ) ⊆ K^{ns}, and to solve u = F(h, u) by applying Prop. 3.7. If A = 0, then F ≡ ȳ is constant and the assertions are trivially true. Thus, for the rest of the proof, we assume A ≠ 0, i.e. ‖A‖_∞ > 0. As G is open and f is locally Lipschitz with respect to y, there exist numbers α := α(x, y) ∈ R^+, r := r(x, y) ∈ R^+, and L := L(x, y) ∈ R_0^+ such that K := [x − α, x + α] × B̄_r(y) ⊆ G and

∀ (t, z_1), (t, z_2) ∈ K: ‖f(t, z_1) − f(t, z_2)‖ ≤ L ‖z_1 − z_2‖.

As the continuous map f is bounded on compact sets, we may choose S ∈ R^+ with

S > M := max{‖f(t, y)‖ : t ∈ [x − α, x + α]}.

We now fix some θ ∈ ]0, 1[ (the rest of the proof will work for each such θ ∈ ]0, 1[) and define

γ := min{α/‖c‖_∞, r(1 − θ)/(S‖A‖_∞), θ/(L‖A‖_∞)}, c := (c_1, . . . , c_s), 1/0 := ∞ (3.19)

(note 0 < γ < ∞, as ‖A‖_∞ > 0 implies r(1 − θ)/(S‖A‖_∞) < ∞). For the norm on K^{ns}, we choose the one defined by

∀ u = (u_1, . . . , u_s) ∈ K^{ns}: ‖u‖ := max{‖u_l‖ : l ∈ {1, . . . , s}}.

Then B = B_r(ȳ) = B_r(y) × · · · × B_r(y). The next step is the verification that F, as defined in (3.18), is well-defined and satisfies the hypotheses of Prop. 3.7. If h ∈ X = ]−γ, γ[, then, for 0 < ‖c‖_∞,

∀ l ∈ {1, . . . , s}: |x + c_l h − x| ≤ ‖c‖_∞ |h| < ‖c‖_∞ γ ≤ α.


For ‖c‖∞ = 0, |x + clh − x| = 0 < α also holds. Thus, in each case, if (h, u) ∈ X × B,then, for each l ∈ 1, . . . , s, (x + clh, ul) ∈]x − α, x + α[×Br(y) ⊆ G, i.e. F is well-defined. The continuity of F is then clear from the assumed continuity of f . Next, foreach j ∈ 1, . . . , s, each h ∈ X, and each u, u ∈ B, one estimates

∥∥Fj(h, u)− Fj(h, u)∥∥ ≤ γ

s∑

l=1

|ajl|∥∥f(x+ clh, ul)− f(x+ clh, ul)

∥∥

≤ γ Ls∑

l=1

|ajl| ‖ul − ul‖.

In consequence, we obtain

∥∥F (h, u)− F (h, u)∥∥ ≤ γ L ‖A‖∞ ‖u− u‖ ≤ θ ‖u− u‖,

proving F to satisfy (3.10) (where our 0 < θ < 1 plays the role of L in (3.10)). Moreover,for each each h ∈ X,

∥∥F (h, ~y)− ~y∥∥ = max

‖Fj(h, ~y)− y

∥∥ : j ∈ 1, . . . , s

≤ γ

s∑

l=1

|ajl|∥∥f(x+ clh, y)

∥∥ ≤ γ ‖A‖∞M < r (1− θ)

(recalling ‖A‖∞ to be the row sum norm of A). Thus, F satisfies (3.11) as well (whereour ~y plays the role of c in (3.11)). We now employ Prop. 3.7 to obtain, for each h ∈ X,a unique gx,y(h) ∈ B, satisfying

gx,y(h) = F(h, gx,y(h)

),

as desired. In addition, Prop. 3.7 yields the continuity of the map gx,y : X −→ B,h 7→ gx,y(h). In particular, (3.17) is satisfied due to F (0, ·) ≡ ~y. If f ∈ Cp(G,Kn),p ∈ N, then the map

Γ : X ×B −→ Kns, Γ(h, u) := u− F (h, u),

is Cp as well, and we can apply the implicit function theorem [Phi16b, Th. 4.49] to theequation

Γ(h, u) = u− F (h, u) = 0.

As all partials of F with respect to u-components vanish at h = 0, we have DuΓ(0, ~y) =Id, which is invertible. Hence, the implicit function theorem yields a neighborhoodU × V of (0, ~y) and a p times continuously differentiable map g : U −→ V such that

∀(h,u)∈U×V

(Γ(h, u) = 0 ⇔ u = g(h)

).

By possibly decreasing the numbers γ and r from above, we may assume U = X andV = B, in which case g = gx,y.


(b): As in (a), the case ‖A‖_∞ = 0 is clear, as F is then constant. Thus, we proceed, assuming ‖A‖_∞ > 0. If G = R × K^n and f is globally L-Lipschitz with respect to y, then, in the proof of (a), for each α > 0 and each r > 0, the map F is well-defined and f is globally L-Lipschitz with respect to y on K. Seeking a contradiction, let γ₀ := 1/(L ‖A‖_∞) and assume there exists h ∈ ]−γ₀, γ₀[ and v, w ∈ K^{ns} with v ≠ w, v = F(h, v), and w = F(h, w). Choose

  α := ‖c‖_∞/(L ‖A‖_∞) for ‖c‖_∞ > 0,  α := 1/(L ‖A‖_∞) for ‖c‖_∞ = 0,

and choose θ ∈ ]0, 1[ sufficiently close to 1 such that θ/(L ‖A‖_∞) > |h|. Then choose

  r > max{ ‖v − ~y‖, ‖w − ~y‖, S ‖A‖_∞ |h|/(1 − θ) }.

The number γ ∈ R⁺, defined according to (3.19), then satisfies γ > |h| and (a) implies the contradiction v = g_{x,y}(h) = w.

(c): If the RK method is in standard form, then, with g_{x,y} according to (a), u(x, y, h) = g_{x,y}(h) must hold for each sufficiently small h. Then the continuity of g_{x,y} implies

  lim_{h↓0} u(x, y, h) = lim_{h↓0} g_{x,y}(h) = g_{x,y}(0) = (y, …, y)  (by (3.17))

and, for each j ∈ {1, …, s}, also using the continuity of f,

  lim_{h↓0} k_j(x, y, h) = lim_{h↓0} f(x + c_j h, u_j(x, y, h)) = f(x, y).

Thus,

  lim_{h↓0} ψ(x, y, h) = lim_{h↓0} ∑_{j=1}^{s} b_j f(x + c_j h, u_j(x, y, h)) = f(x, y) ∑_{j=1}^{s} b_j,

as claimed. If x ∈ [a, b] and φ : [a, b] −→ K^n is a solution to y′ = f(x, y), then lim_{h↓0} ψ(x, φ(x), h) = f(x, φ(x)) if, and only if, ∑_{j=1}^{s} b_j = 1 or f(x, φ(x)) = 0. Thus, Lem. 2.4(d) establishes the case.
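The contraction argument in the proof of (a) is constructive: for sufficiently small |h|, the map F of (3.18) is a contraction, so the stage system (3.5b) can be solved numerically by fixed-point iteration. The following Python sketch illustrates this; the function name, the tolerance, and the iteration cap are illustrative choices and not part of the notes' formal development. For the implicit Euler method of Ex. 3.5(c), the iteration simply reads u ← y + h f(x + h, u).

    import numpy as np

    def rk_stages_fixed_point(f, x, y, h, A, c, tol=1e-12, max_iter=100):
        """Solve the stage system u_j = y + h * sum_l a_{jl} f(x + c_l h, u_l),
        i.e. u = F(h, u) with F from (3.18), by fixed-point iteration.
        By the proof of Th. 3.8(a), this converges for sufficiently small |h|."""
        s = len(c)
        y = np.atleast_1d(np.asarray(y, dtype=float))
        u = np.tile(y, (s, 1))                     # start at (y, ..., y), cf. (3.17)
        for _ in range(max_iter):
            fu = np.array([np.atleast_1d(f(x + c[l] * h, u[l])) for l in range(s)])
            u_new = y + h * (A @ fu)               # u_new[j] = F_j(h, u)
            if np.max(np.abs(u_new - u)) < tol:
                return u_new
            u = u_new
        raise RuntimeError("fixed-point iteration did not converge; reduce h")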

Remark 3.9. In the situation of Th. 3.8, assume f : K^n −→ K^n (i.e. f does not depend on x – the ODE is autonomous, cf. Sec. 3.3 below). Moreover, assume f to be continuous and locally Lipschitz, but not globally Lipschitz. According to Prop. A.3 of the Appendix, f is globally Lipschitz on each compact subset C ⊆ K^n. In consequence, for each y ∈ K^n and each r ∈ R⁺, there exists a smallest Lipschitz constant L_r(y) ∈ R⁺₀ such that f is L_r(y)-Lipschitz on B̄_r(y) (with lim_{r→∞} L_r(y) = ∞, as f is not globally Lipschitz). As in the proof of Th. 3.8(b), one sees that, for each h ∈ ]−γ, γ[ with γ := 1/(L_r(y) ‖A‖_∞), the system (3.5b) has a unique solution u(x, y, h) ∈ B̄_r(y). Thus, if u(h) ∈ K^n are such that, for each sufficiently small |h| ∈ R⁺, u(h) is a solution to (3.5b) with u(h) ≠ u(x, y, h), then

  lim_{h→0} ‖u(h)‖ = ∞.

One can actually still conduct a similar argument if f is defined on an arbitrary open Ω ⊆ K^n, in which case one obtains the u(h) to go to the boundary of Ω for h → 0, in the sense that they eventually leave every compact subset C of Ω for h → 0 (cf. [Phi16c, Def. 3.25]).

Example 3.10. We consider the implicit Euler method of Ex. 3.5(c) for the initial value problem

  y′ = µ(1 − y²),  y(ξ) = η,  (ξ, η) ∈ R²,  µ ∈ R⁺.

Here, we have f : R² −→ R, f(x, y) := µ(1 − y²) (which is locally Lipschitz, but not globally Lipschitz), and b₁ = c₁ = a₁₁ = 1. For (x, y, h) ∈ R² × (R \ {0}), (3.5b) reduces to

  u := u₁ = y + h f(x + h, u) = y + hµ(1 − u²)  (3.20)

or, equivalently,

  u² + u/(hµ) − (y + hµ)/(hµ) = 0.

This quadratic equation for u has the solutions

  u = u(h) = −1/(2hµ) ± √(1/(4h²µ²) + (y + hµ)/(hµ)) = (−1 ± √(1 + 4hµ(y + hµ)))/(2hµ),

which are real, provided |h| is sufficiently small, guaranteeing |4hµ(y + hµ)| ≤ 1. Consistently with Rem. 3.9, we obtain

  lim_{h→0} |(−1 − √(1 + 4hµ(y + hµ)))/(2hµ)| = ∞

and, consistently with (3.17), we obtain

  lim_{h→0} (−1 + √(1 + 4hµ(y + hµ)))/(2hµ) = lim_{h→0} 4hµ(y + hµ)/(2hµ(1 + √(1 + 4hµ(y + hµ)))) = y.

Thus, to obtain the RK method in standard form, we choose γ(y) ∈ R⁺ such that |4hµ(y + hµ)| ≤ 1 for each h ∈ ]−γ(y), γ(y)[, and

  u : R² × ]0, γ(y)[ −→ R,  u(x, y, h) := (−1 + √(1 + 4hµ(y + hµ)))/(2hµ).

Then the resulting ψ according to (3.7a) is ψ : R² × ]−γ(y), γ(y)[ −→ R,

  ψ(x, y, h) := f(x + h, u(x, y, h)) = µ(1 − u²(x, y, h)).

Given (ξ, η) ∈ R² with ξ = x₀ < x₁ < … and h_k := x_{k+1} − x_k sufficiently small, this yields the recursion y₀ = η,

  y_{k+1} = y_k + h_kµ(1 − u²(x_k, y_k, h_k)) = u(x_k, y_k, h_k) = (−1 + √(1 + 4h_kµ(y_k + h_kµ)))/(2h_kµ),

where the second equality holds by (3.20).
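The recursion above is easy to run. A minimal sketch, with µ, η, and the constant step size chosen arbitrarily for illustration; for |η| < 1, the exact solution of the initial value problem is y(x) = tanh(µx + artanh(η)), which tends to the equilibrium y = 1:

    import numpy as np

    def implicit_euler_step(y, h, mu):
        """One step of the implicit Euler method for y' = mu*(1 - y^2),
        using the standard-form root of the quadratic (3.20)."""
        return (-1.0 + np.sqrt(1.0 + 4.0 * h * mu * (y + h * mu))) / (2.0 * h * mu)

    mu, eta, h = 2.0, 0.0, 0.01
    y = eta
    for k in range(300):                  # integrate from x = 0 to x = 3
        y = implicit_euler_step(y, h, mu)
    print(y, np.tanh(mu * 3.0))           # both close to the equilibrium y = 1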


3.3 Autonomous ODE

Determining the parameters of RK methods such that a high order of convergence is obtained is, in general, a difficult task. It is somewhat simplified when one restricts oneself to so-called autonomous ODE:

Definition 3.11. If Ω ⊆ K^n, n ∈ N, and f : Ω −→ K^n, then the n-dimensional first-order ODE

  y′ = f(y)  (3.21)

is called autonomous.

As it turns out, every nonautonomous ODE can equivalently be written as an autonomous ODE:

Theorem 3.12. Let G ⊆ R × K^n, n ∈ N, and f : G −→ K^n. Then the nonautonomous ODE

  y′ = f(x, y)  (3.22a)

is equivalent to the autonomous ODE

  y′ = g(y),  (3.22b)

where

  g : G −→ K^{n+1},  g(x, y₁, …, y_n) := (1, f(x, y₁, …, y_n)),  (3.23)

in the following sense:

(a) If φ : I −→ K^n is a solution to (3.22a), then ψ : I −→ K^{n+1}, ψ(x) := (x, φ(x)), is a solution to (3.22b).

(b) If ψ : I −→ K^{n+1}, x ↦ (ψ₀(x), ψ₁(x), …, ψ_n(x)), is a solution to (3.22b) with the property

  ∃ x₀ ∈ I:  ψ₀(x₀) = x₀,  (3.24)

then φ : I −→ K^n, φ(x) := (ψ₁(x), …, ψ_n(x)), is a solution to (3.22a).

Proof. (a): If φ : I −→ K^n is a solution to (3.22a) and ψ : I −→ K^{n+1}, ψ(x) := (x, φ(x)), then

  ∀ x ∈ I:  ψ′(x) = (1, φ′(x)) = (1, f(x, φ(x))) = g(x, φ(x)) = g(ψ(x)),

showing ψ is a solution to (3.22b).

(b): If ψ : I −→ K^{n+1} is a solution to (3.22b) with the property (3.24) and φ : I −→ K^n, φ(x) := (ψ₁(x), …, ψ_n(x)), then (3.24) implies ψ₀(x) = x for each x ∈ I and, thus,

  ∀ x ∈ I:  φ′(x) = (ψ₁′(x), …, ψ_n′(x)) = f(x, ψ₁(x), …, ψ_n(x)) = f(x, φ(x)),

showing φ is a solution to (3.22a).
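In code, the equivalence of Th. 3.12 amounts to prepending the trivial component y₀′ = 1 to the right-hand side. A minimal Python sketch (the helper name is ours):

    import numpy as np

    def autonomize(f):
        """Turn a right-hand side f(x, y) into the autonomous g of (3.23),
        g(x, y_1, ..., y_n) = (1, f(x, y_1, ..., y_n))."""
        def g(z):
            x, y = z[0], z[1:]
            return np.concatenate(([1.0], np.atleast_1d(f(x, y))))
        return g

    # Solving z' = g(z) with z(x0) = (x0, eta) reproduces x in the first
    # component (cf. (3.24)) and the solution of y' = f(x, y) in the rest.
    g = autonomize(lambda x, y: -x * y)
    print(g(np.array([0.5, 2.0])))  # [ 1. -1.]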


The following Prop. 3.14 shows that RK methods are consistent with the equivalence of Th. 3.12 (such that designing RK methods for autonomous problems does not really constitute a restriction), provided they satisfy both the consistency and the node condition. The proof of Prop. 3.14 makes use of the following lemma:

Lemma 3.13. Let s, n ∈ N, G ⊆ R × K^n, f : G −→ K^n, c_j ∈ R, a_{jl} ∈ K for each j, l ∈ {1, …, s}. Let g : G −→ K^{n+1} be defined as in (3.23). Let (x, y, h) ∈ R × K^n × R and consider vectors k_{f,1}, …, k_{f,s} ∈ K^n, k_{g,1}, …, k_{g,s} ∈ K^{n+1}, k̄_{g,1}, …, k̄_{g,s} ∈ K^n, where k_{g,j} = (k_{g,j,0}, k̄_{g,j}) for each j ∈ {1, …, s}. Moreover, consider the system (3.5a), which we rewrite here as

  ∀ j ∈ {1, …, s}:  k_{f,j} = f(x + c_j h, y + h ∑_{l=1}^{s} a_{jl} k_{f,l})  (3.25a)

and

  ∀ j ∈ {1, …, s}:  k_{g,j} = g((x, y) + h ∑_{l=1}^{s} a_{jl} k_{g,l}) = (1, f(x + h ∑_{l=1}^{s} a_{jl} k_{g,l,0}, y + h ∑_{l=1}^{s} a_{jl} k̄_{g,l})).  (3.25b)

If k_{g,1}, …, k_{g,s} satisfy (3.25b) (which, in particular, is supposed to mean, for each j ∈ {1, …, s}, that the used argument of f is in G), then k_{g,1,0} = ··· = k_{g,s,0} = 1. If, in addition, the node condition (3.4) holds (i.e. c_j = ∑_{l=1}^{s} a_{jl} for each j ∈ {1, …, s}), then k_{f,1} := k̄_{g,1}, …, k_{f,s} := k̄_{g,s} satisfy (3.25a) (i.e. (3.5a)).

Proof. The first component of (3.25b) immediately yields k_{g,1,0} = ··· = k_{g,s,0} = 1. Using this and the node condition in (3.25b), we see that the remaining components of (3.25b) become (3.25a) with k_{f,j} replaced by k̄_{g,j}.

Proposition 3.14. Let s ∈ N and consider an s-stage RK method according to Def. 3.1 that has unique local solutions as defined in Def. 3.3(b) (sufficient conditions are that the RK method is explicit or that f satisfies the conditions of Th. 3.8). Moreover, let the RK method be in standard form (also defined in Def. 3.3(b)). Let ψ_f be the defining function corresponding to f and ψ_g the defining function corresponding to g of (3.23) (as (3.22b) is autonomous, g and, thus, ψ_g, do not depend on x). Let (ξ, η) ∈ G, b > ξ, and let (x₀, …, x_N) be a partition of [ξ, b]. Assume

  y₀ = η,  ∀ k ∈ {0, 1, …, N−1}:  y_{k+1} = y_k + h_k ψ_f(x_k, y_k, h_k),  h_k := x_{k+1} − x_k,  (3.26a)

and

  z₀ = (ξ, η),  ∀ k ∈ {0, 1, …, N−1}:  z_{k+1} = z_k + h_k ψ_g(z_k, h_k).  (3.26b)

Then

  ∀ k ∈ {0, 1, …, N}:  z_k = (x_k, y_k)  (3.26c)

if the RK method satisfies both the consistency condition (3.3) and the node condition (3.4).

Proof. For each j ∈ {1, …, s}, we denote by k_{f,j} ∈ K^n (resp. by k_{g,j} = (1, k̄_{g,j}) ∈ K^{n+1}, k̄_{g,j} ∈ K^n) the auxiliary vectors of the RK method corresponding to ψ_f (resp. to ψ_g). Suppose the method satisfies (3.3) and (3.4). We show (3.26c) via induction on k. First, (x₀, y₀) = (ξ, η) = z₀ holds by the definition of y₀ and z₀, respectively. Now fix k ∈ {0, …, N−1}. As the k_{g,j} must satisfy the condition (3.1b) (with f replaced by g), we have (also using the induction hypothesis z_k = (x_k, y_k), and that the first components k_{g,l,0} equal 1, cf. Lem. 3.13)

  ∀ j ∈ {1, …, s}:  k_{g,j}(z_k, h_k) = g(z_k + h_k ∑_{l=1}^{s} a_{jl} k_{g,l}(z_k, h_k)) = (1, f(x_k + h_k ∑_{l=1}^{s} a_{jl}, y_k + h_k ∑_{l=1}^{s} a_{jl} k̄_{g,l}(z_k, h_k))).

Thus, the k_{g,j}(z_k, h_k) satisfy (3.25b). As we also assume the node condition (3.4), Lem. 3.13 implies

  ∀ j ∈ {1, …, s}:  k̄_{g,j}(z_k, h_k) = f(x_k + c_j h_k, y_k + h_k ∑_{l=1}^{s} a_{jl} k̄_{g,l}(z_k, h_k)).  (3.27)

As the k_{f,j}(x_k, y_k, h_k) also satisfy (3.27) (which is merely condition (3.1b)) and since the k_{f,j}(x_k, y_k, h_k) are the unique solution to (3.27) in the considered domain (also using that the RK method is assumed to be in standard form), we obtain

  ∀ j ∈ {1, …, s}:  k_{g,j}(z_k, h_k) = (1, k_{f,j}(x_k, y_k, h_k)).  (3.28)

Using (3.28), we are now in a position to complete the induction on k: We compute

  z_{k+1} = (x_k, y_k) + h_k ∑_{j=1}^{s} b_j k_{g,j}(z_k, h_k)  (by the induction hypothesis)
  = (x_k, y_k) + h_k ∑_{j=1}^{s} b_j (1, k_{f,j}(x_k, y_k, h_k))  (by (3.28))
  = (x_k + h_k ∑_{j=1}^{s} b_j, y_k + h_k ∑_{j=1}^{s} b_j k_{f,j}(x_k, y_k, h_k))
  = (x_{k+1}, y_k + h_k ∑_{j=1}^{s} b_j k_{f,j}(x_k, y_k, h_k))  (by (3.3))
  = (x_{k+1}, y_{k+1}),  (3.29)

as desired.


Remark 3.15. In Prop. 3.14, the consistency condition (3.3) is clearly also necessary for (3.26c) to hold: Indeed, otherwise, the computation (3.29) shows for k = 0 that the first component of z₁ does not equal x₁ if (3.3) is not satisfied. In general, the node condition (3.4) will also be necessary for (3.26c) to hold. However, for certain f and/or certain discretizations, (3.27) (and, hence, (3.28) and (3.26c)) can be true, even if the node condition fails.

3.4 Stability Functions

Example 3.16. Let λ ∈ K, f : R × K −→ K, f(x, y) := λy, and consider the one-dimensional autonomous linear initial value problem

  y′ = f(x, y) = λy,  y(ξ) = η,  (ξ, η) ∈ R × K.

Now let s ∈ N and consider the s-stage RK method given by weights b₁, …, b_s ∈ R and RK matrix A = (a_{jl}) ∈ M(s, K) (as f does not depend on x, the following considerations are independent of the nodes c_j of the RK method). In u-form, we have, for the defining function,

  ψ : D_ψ −→ K,  ψ(x, y, h) = ∑_{j=1}^{s} b_j λ u_j(x, y, h),

where the u₁(x, y, h), …, u_s(x, y, h) ∈ K satisfy the linear system

  ∀ j ∈ {1, …, s}:  u_j(x, y, h) = y + h ∑_{l=1}^{s} a_{jl} λ u_l(x, y, h) = y + hλ ∑_{l=1}^{s} a_{jl} u_l(x, y, h).  (3.30)

For fixed (x, y, h) ∈ R × K × R, we set u := u(x, y, h) ∈ K^s, such that (3.30) now reads

  ∀ j ∈ {1, …, s}:  u_j = y + hλ ∑_{l=1}^{s} a_{jl} u_l

or, in matrix form,

  u = y1 + hλAu  ⇔  (Id − hλA)u = y1,

where 1 := (1, …, 1)ᵗ ∈ K^s. As Id is invertible and we know the set of invertible matrices, i.e. the general linear group GL_s(K) = det⁻¹(K \ {0}), to be an open subset of M(s, K) (since the determinant is a continuous map (even a polynomial, cf. [Phi19b, Rem. 4.33]) from K^{s²} ≅ M(s, K) into K), there exists ǫ ∈ R⁺ (depending only on λ and A) such that (Id − hλA) is invertible for each h ∈ ]−ǫ, ǫ[. In consequence, if |h| < ǫ, then

  u = (Id − hλA)⁻¹ 1 y.

Thus, letting b := (b₁, …, b_s)ᵗ, we obtain

  ψ : R × K × [0, ǫ[ −→ K,  ψ(x, y, h) = λ bᵗu = λ bᵗ(Id − hλA)⁻¹ 1 y,

and the resulting recursion is (cf. Rem. 3.4(a)) y₀ = η,

  y_{k+1} = y_k + h_kλ bᵗ(Id − h_kλA)⁻¹ 1 y_k = R(h_kλ) y_k,

where

  R : D_R −→ K,  R(z) := 1 + z bᵗ(Id − zA)⁻¹ 1,  D_R ⊆ K,  (3.31)

defined on a superset D_R of {z ∈ K : |z| < ǫ|λ|}. In the following Def. and Rem. 3.17, we will see that the so-called stability function R is always a rational function.
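Numerically, R(z) can be evaluated directly from (3.31) by a single linear solve. A sketch, using the Butcher data of the classical explicit RK method from Ex. 3.5(b), whose polynomial stability function is computed by hand in Ex. 3.20(c) below:

    import numpy as np

    def stability_R(z, A, b):
        """Evaluate R(z) = 1 + z * b^t (Id - z A)^{-1} 1 from (3.31)."""
        s = len(b)
        return 1.0 + z * (b @ np.linalg.solve(np.eye(s) - z * A, np.ones(s)))

    A = np.array([[0, 0, 0, 0], [0.5, 0, 0, 0], [0, 0.5, 0, 0], [0, 0, 1, 0.]])
    b = np.array([1/6, 1/3, 1/3, 1/6])
    z = -0.7
    print(stability_R(z, A, b))                  # matches the polynomial
    print(1 + z + z**2/2 + z**3/6 + z**4/24)     # from Ex. 3.20(c) below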

Definition and Remark 3.17. Let s ∈ N, b ∈ R^s, A ∈ M(s, K).

(a) The function R defined in (3.31) above is called the stability function of the RK method with RK matrix A and weights vector b. Recall from Linear Algebra (cf. [Phi19b, Rem. 4.33]) that the determinant det : M(s, K) −→ K, B ↦ det B, is a polynomial of degree s with real coefficients in the entries b_{jl} of B (the coefficients are actually ±1). Moreover, recall from Linear Algebra (cf. [Phi19b, Th. 4.29(c)]) that, for invertible B ∈ GL_s(K),

  B⁻¹ = (det B)⁻¹ B̃,

where, for s > 1, the entries of the adjugate matrix B̃ of B are, up to a sign ±1, the determinants of the (s−1) × (s−1) submatrices of B. Thus, each entry of B⁻¹ is a rational function R_{jl} of the entries b_{jl} of B, and R_{jl} = P_{jl}/det, where P_{jl} : K^{s²} −→ K is a polynomial of degree s−1 with real coefficients. Letting

  ∀ d ∈ N₀:  Pol_d(K) := { (P : K −→ K) : P(z) = ∑_{j=0}^{d} λ_j z^j;  λ₀, …, λ_d ∈ K }  (3.32)

and coming back to (3.31), we see that R is a rational function, where R = P̃/Q̃ with polynomials P̃, Q̃ ∈ Pol_s(R). As R(0) = 1, one has P̃(0) = Q̃(0) ≠ 0, and, dividing numerator and denominator of R = P̃/Q̃ by P̃(0) as well as by any common prime factors (cf. [Phi19b, Cor. 7.31, Th. 7.12(c)]), we obtain

  R = P/Q,  P, Q ∈ Pol_s(R),  P(0) = Q(0) = 1,  P, Q mutually prime,  (3.33)

and P, Q are uniquely determined by (3.33). In particular, D_R = {z ∈ K : Q(z) ≠ 0}.

(b) Note that, if the RK method is explicit, i.e. if A is strictly lower triangular, then, in (3.33), Q ≡ 1 and R = P is a polynomial with real coefficients, 0 ≤ deg P ≤ s: This can be seen directly from (3.30), using an induction, or, alternatively, by observing that, for each z ∈ K, det(Id − zA) = det Id = 1 in this case.

Remark 3.18. Another useful fact to recall from Linear Algebra is that, given a polynomial P : K −→ K, we can substitute numbers with matrices, yielding the following matrix mapping (still denoted by P for the simplicity of notation):

  P : M(s, K) −→ M(s, K),  B ↦ P(B)

(where, as usual, for P ≡ λ ∈ K, P(B) := λ Id). Moreover, given polynomials P, Q : K −→ K, the rational function P/Q also yields a matrix mapping, namely

  (P/Q) : {B ∈ M(s, K) : det(Q(B)) ≠ 0} −→ M(s, K),  B ↦ P(B)(Q(B))⁻¹.

This mapping is independent of the representation of the rational function in the sense that, if P/Q = P̃/Q̃, det(Q(B)) ≠ 0, and det(Q̃(B)) ≠ 0, then (P/Q)(B) = (P̃/Q̃)(B).

Example 3.19. Somewhat surprisingly, everything of Ex. 3.16 still works analogously for higher-dimensional autonomous linear initial value problems, if one makes use of the matrix mappings of Rem. 3.18 above. Let n ∈ N, B ∈ M(n, K). Let f : R × K^n −→ K^n, f(x, y) := By, and consider the linear initial value problem

  y′ = f(x, y) = By,  y(ξ) = η,  (ξ, η) ∈ R × K^n.

As B and f do not depend on x, the above linear initial value problem has so-called constant coefficients (cf. [Phi16c, Sec. 4.6.2]). As in Ex. 3.16, let s ∈ N and consider the s-stage RK method given by weights b₁, …, b_s ∈ R and RK matrix A = (a_{jl}) ∈ M(s, K) (as f does not depend on x, the following considerations are independent of the nodes c_j of the RK method). In u-form, we have, for the defining function,

  ψ : D_ψ −→ K^n,  ψ(x, y, h) = ∑_{j=1}^{s} b_j B u_j(x, y, h),

where the u₁(x, y, h), …, u_s(x, y, h) ∈ K^n satisfy the linear system

  ∀ j ∈ {1, …, s}:  u_j(x, y, h) = y + h ∑_{l=1}^{s} a_{jl} B u_l(x, y, h) = y + hB ∑_{l=1}^{s} a_{jl} u_l(x, y, h).  (3.34)

For fixed (x, y, h) ∈ R × K^n × R, if we define the larger vectors u := (u₁ᵗ, …, u_sᵗ)ᵗ ∈ K^{ns}, ~y := (yᵗ, …, yᵗ)ᵗ ∈ K^{ns}, and the block matrix

  hÂ := h (a_{jl} B)_{j,l=1,…,s} = (a_{jl} hB)_{j,l=1,…,s} ∈ M(ns, K),

then, using blockwise matrix multiplication (cf. [Phi19a, Sec. 7.5]), (3.34) can be written in matrix form as

  u = ~y + hÂu  ⇔  (Id − hÂ)u = ~y.

As in Ex. 3.16, there exists ǫ ∈ R⁺ (now depending only on B and A) such that (Id − hÂ) is invertible for each h ∈ ]−ǫ, ǫ[. Thus, if |h| < ǫ, then

  u = (Id − hÂ)⁻¹ ~y = (Id − hÂ)⁻¹ (Id_n, …, Id_n)ᵗ y,

where the j-th row block of this equation yields u_j ∈ K^n. Moreover, due to blockwise matrix multiplication, one obtains (Id − hÂ)⁻¹ from (Id − zA)⁻¹ ∈ M(s, K) by replacing each z ∈ K by hB ∈ M(n, K) and each 1 in Id by Id_n. Thus, we obtain ψ : R × K^n × [0, ǫ[ −→ K^n,

  ψ(x, y, h) = ∑_{j=1}^{s} b_j B u_j(x, y, h) = B (b₁ Id_n, …, b_s Id_n) (Id − hÂ)⁻¹ (Id_n, …, Id_n)ᵗ y,

and the resulting recursion is y₀ = η,

  y_{k+1} = y_k + h_k ψ(x_k, y_k, h_k) = R(h_kB)(y_k),

where R is the stability function as defined in (3.31), but now interpreted in the sense of Rem. 3.18, i.e. as a function mapping a subset D_R of M(n, K) into M(n, K). Using the representation R = P/Q of (3.33), we obtain D_R = {M ∈ M(n, K) : det(Q(M)) ≠ 0}, where Q(0) = Id implies D_R to contain some open neighborhood of 0. According to the choice of ǫ > 0 above, we also have h_kB ∈ D_R for each h_k ∈ ]−ǫ, ǫ[.

Example 3.20. We compute the stability function R of Def. and Rem. 3.17 for a few concrete RK methods:

(a) For the explicit Euler method, we have s = 1, b = (1), A = 0 (cf. Ex. 3.5(a)). Thus,

  R : K −→ K,  R(z) = 1 + z.

(b) For the implicit Euler method, we have s = 1, b = (1), A = Id = (1) (cf. Ex. 3.5(c)). Thus,

  R : K \ {1} −→ K,  R(z) = 1 + z/(1 − z) = 1/(1 − z).

(c) For the classical explicit RK method, we have s = 4, bᵗ = (1/6, 1/3, 1/3, 1/6), and

  A =
  ( 0    0    0   0
    1/2  0    0   0
    0    1/2  0   0
    0    0    1   0 )

(cf. Ex. 3.5(b)). Thus, for each z ∈ K,

  Id − zA =
  ( 1     0     0    0
    −z/2  1     0    0
    0     −z/2  1    0
    0     0     −z   1 ),

  (Id − zA)⁻¹ =
  ( 1     0     0   0
    z/2   1     0   0
    z²/4  z/2   1   0
    z³/4  z²/2  z   1 )

and R : K −→ K,

  R(z) = 1 + z (1/6, 1/3, 1/3, 1/6) (Id − zA)⁻¹ (1, 1, 1, 1)ᵗ
  = 1 + z/6 + (z/3 + z²/6) + (z/3 + z²/6 + z³/12) + (z/6 + z²/6 + z³/12 + z⁴/24)
  = 1 + z + z²/2 + z³/6 + z⁴/24.

(d) As an example of a 2-stage implicit RK method, consider the so-called implicit trapezoidal method, which has Butcher tableau

  0 | 0    0
  1 | 1/2  1/2
  --+----------
    | 1/2  1/2

The name comes from the fact that the recursion can be written as

  y_{k+1} = y_k + h_k ∑_{j=1}^{2} b_j f(x_k + c_j h_k, u_j) = y_k + (h_k/2)(f(x_k, u₁) + f(x_k + h_k, u₂)),

where

  u₁ = y_k + h_k ∑_{l=1}^{2} a_{1l} f(x_k + c_l h_k, u_l) = y_k,
  u₂ = y_k + h_k ∑_{l=1}^{2} a_{2l} f(x_k + c_l h_k, u_l) = y_k + (h_k/2)(f(x_k, u₁) + f(x_k + h_k, u₂)) = y_{k+1},

i.e.

  y_{k+1} = y_k + (h_k/2)(f(x_k, y_k) + f(x_{k+1}, y_{k+1}))

(if f is R-valued, then the term added to y_k is the area of the trapezoid with vertices (x_k, 0), (x_{k+1}, 0), (x_k, f(x_k, y_k)), (x_{k+1}, f(x_{k+1}, y_{k+1}))). To compute the stability function, we note that, for each z ∈ K \ {2},

  Id − zA =
  ( 1     0
    −z/2  1 − z/2 ),

  (Id − zA)⁻¹ =
  ( 1          0
    z/(2 − z)  2/(2 − z) )

and R : K \ {2} −→ K,

  R(z) = 1 + z (1/2, 1/2) (Id − zA)⁻¹ (1, 1)ᵗ = 1 + z/2 + (1/2) · z²/(2 − z) + z/(2 − z)
  = (4 − 2z + 2z − z² + z² + 2z)/(2(2 − z)) = (2 + z)/(2 − z).
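The stability functions of (a)–(d) can also be verified symbolically. A sketch using sympy (any computer algebra system would do; sympy may print equivalent forms, e.g. −1/(z − 1) instead of 1/(1 − z)):

    import sympy as sp

    def stability_R(A, b):
        """Symbolic R(z) = 1 + z * b^t (Id - z A)^{-1} 1, cf. (3.31)."""
        z = sp.symbols('z')
        s = len(b)
        u = (sp.eye(s) - z * sp.Matrix(A)).inv() * sp.ones(s, 1)
        return sp.simplify(1 + z * (sp.Matrix(1, s, b) * u)[0])

    half = sp.Rational(1, 2)
    print(stability_R([[0]], [1]))                            # 1 + z     (explicit Euler)
    print(stability_R([[1]], [1]))                            # 1/(1 - z) (implicit Euler)
    print(stability_R([[0, 0], [half, half]], [half, half]))  # (2 + z)/(2 - z)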


3.5 Higher Orders of Consistency and Convergence

The following Prop. 3.22 shows that, to obtain consistency of order p, one needs at least p stages for an explicit RK method, and at least ⌈p/2⌉ stages for an implicit RK method. The proof makes use of the following lemma.

Lemma 3.21. Let p ∈ N₀, let P, Q : K −→ K be polynomials with real coefficients, neither P nor Q the zero polynomial, and consider the rational function R := P/Q. If there exist ǫ > 0 and C ∈ R⁺₀ such that

  ∀ h ∈ ]0, ǫ[ \ Q⁻¹{0}:  |R(h) − e^h| ≤ C h^{p+1},  (3.35)

then

  p ≤ deg P + deg Q

(in other words, if the approximation of the exponential function by the rational function R is "consistent of order p", then p ≤ deg P + deg Q).

Proof. Let k := deg P, j := deg Q. Seeking a contradiction, assume (3.35) holds with p > k + j. Set D := ]0, ǫ[ \ Q⁻¹{0}. Then

  ∀ h ∈ D:  |(P(h) − Q(h)e^h)/h^{p+1}| ≤ |Q(h)| C.  (3.36)

We show, by induction on k ∈ N₀, that (3.36) must be false, which is the desired contradiction. More precisely, what we show by induction on k is

  lim_{x→0} |(P(x) − Q(x)e^x)/x^{p+1}| = ∞:

For k = 0, P is constant. Thus, applying l'Hôpital's rule [Phi16a, Th. 9.25] yields

  lim_{x→0} |(P(x) − Q(x)e^x)/x^{p+1}| = lim_{x→0} |(Q′(x) + Q(x))e^x/((p+1)x^p)| = 1 · lim_{x→0} |(Q′(x) + Q(x))/((p+1)x^p)| = ∞,

where the last equality holds as p > deg Q. Now consider k > 0. This time, we apply l'Hôpital's rule together with the induction hypothesis to obtain

  lim_{x→0} |(P(x) − Q(x)e^x)/x^{p+1}| = lim_{x→0} |(P′(x) − (Q′(x) + Q(x))e^x)/((p+1)x^p)| = ∞  (by the induction hypothesis),

where the induction hypothesis applies, as deg P′ = k − 1, deg(Q′ + Q) ≤ j, and p > k + j implies p − 1 > k − 1 + j. Thus, the induction and the proof of the lemma are complete.


Proposition 3.22. Let s ∈ N and consider an s-stage RK method according to Def. 3.1. Consider the identity f : K^n −→ K^n and the corresponding initial value problem

  y′ = y,  (3.37a)
  y(0) = η ∈ K^n \ {0},  (3.37b)

which, clearly, has the maximal solution

  φ : R −→ K^n,  φ(x) = e^x η.

Assume the RK method to be consistent of order p ∈ N with respect to φ : [0, b] −→ K^n, b > 0, in the sense of Def. 2.3(b).

(a) If the method is implicit, then p ≤ 2s.

(b) If the method is explicit, then p ≤ s.

Proof. According to Ex. 3.19 with B := Id, we obtain for the local truncation error

  λ(0, h) = φ(0) + hψ(0, φ(0), h) − φ(h) = R(h Id)(η) − ηe^h = (R(h) − e^h)η = (P(h)/Q(h) − e^h)η = ((P(h) − Q(h)e^h)/Q(h))η,

where R is the (rational) stability function of (3.31) and P, Q are the polynomials of degree at most s from (3.33). Now Def. 2.3(b) implies

  ∃ ǫ, C ∈ R⁺  ∀ h ∈ ]0, ǫ[:  ‖λ(0, h)‖ ≤ C h^{p+1}.

Thus, for η ≠ 0, Lem. 3.21 implies p ≤ deg P + deg Q ≤ s + s = 2s, proving (a). If the method is explicit, then we know Q ≡ 1, i.e. deg Q = 0, from Def. and Rem. 3.17. Thus, in this case, for η ≠ 0, Lem. 3.21 implies p ≤ deg P + deg Q ≤ s, proving (b).
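For the classical explicit RK method (s = 4), the bound of Prop. 3.22(b) is attained: |R(h) − e^h| behaves like h⁵/120, matching p = s = 4 and no more. A short numeric check (the step sizes are arbitrary illustrative choices):

    import numpy as np

    R = lambda h: 1 + h + h**2/2 + h**3/6 + h**4/24   # stability polynomial of Ex. 3.20(c)
    for h in [0.1, 0.05, 0.025]:
        err = abs(R(h) - np.exp(h))
        print(h, err, err / h**5)   # err/h^5 is roughly constant (about 1/120)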

Remark 3.23. (a) As defined in Def. 2.3(b), the order of consistency of a method will, in general, depend on the approximated function φ : [a, b] −→ K^n. So it might happen that an s-stage RK method is, by accident, consistent of order p > 2s (p > s in the explicit case) for some particular solution φ, solution to some ODE with some particular right-hand side f. However, as a consequence of Prop. 3.22, it can never be consistent of order p > 2s (p > s in the explicit case) for nontrivial solutions in the case f = Id.

(b) For implicit RK methods, the bound p ≤ 2s of Prop. 3.22(a) is optimal: There exist implicit RK methods, where p = 2s (see Ex. 6.13 below). For explicit RK methods, the bound p ≤ s of Prop. 3.22(b) is, for p > 4, not optimal: In general, finding the optimal bounds in this case is difficult, cf. final paragraphs of [DB08, Sec. 4.2.3].


To obtain higher-order RK methods, one has to determine the parameters of the RK method such that, when using Taylor expansions of the exact solution φ and of the defining function ψ to calculate the local truncation error, sufficiently many low-order terms cancel out (cf. proof of Th. 2.12). The difficulty lies in the fact that these Taylor expansions quickly become rather complicated. To help with the accounting of the relevant terms, Butcher developed a graph-theoretic method, which we present in the rest of this section. The key objects that are employed are so-called (unlabeled, finite) rooted trees. Here, we will directly develop the theory of such trees as far as necessary, without requiring any prior knowledge of graph theory. In regular graph theory, a (labeled) tree is a connected graph without any cycles (a graph without cycles, not necessarily connected, is called a forest). One can then show that each of our unlabeled trees represents an entire class of such labeled trees, but this is of no consequence to our considerations.

The central observation will then be that both the evaluation of derivatives and the construction of certain vectors in the Taylor expansion of the defining function can be related to the mentioned unlabeled, finite, rooted trees. We start with the definition of a multiset, followed by a recursive definition of the aforementioned trees and some related notions.

Definition and Remark 3.24. Let M be a set.

(a) Each function ℳ : M −→ N₀ is called an M-multiset or an unordered tuple with entries from M. The name multiset comes from the fact that each multiset in M can be interpreted as a set with elements from M, possibly containing certain elements multiple times. For example, consider the following multiset ℳ in {1, 2, 3}: ℳ(1) := 2, ℳ(2) := 0, ℳ(3) := 5. It can be written in the form ℳ = [1, 1, 3, 3, 3, 3, 3], where, here and in the following, we use brackets [ ] instead of braces to distinguish multisets from sets. Noting that we can also write the same multiset as ℳ = [3, 3, 1, 3, 1, 3, 3] explains the alternative name unordered tuple for a multiset. Also note that, for each set M, we can identify the empty set ∅ with the constant M-multiset ℳ ≡ 0 (i.e. [] = ∅) and the set M itself with the constant M-multiset ℳ ≡ 1.

(b) If ℳ is an M-multiset, then we define its order or cardinality |ℳ| by setting

  |ℳ| := ∑_{m ∈ M} ℳ(m) ∈ N₀ ∪ {∞}.

We call ℳ an unordered n-tuple if, and only if, |ℳ| = n ∈ N; we call ℳ finite if, and only if, |ℳ| < ∞.

(c) According to (a), the set of M-multisets is the same as F(M, N₀), the set of functions from M into N₀. We now define the combinatorial function

  δ : {ℳ ∈ F(M, N₀) : |ℳ| < ∞} −→ N,
  δ(ℳ) := #{ (f : {1, …, |ℳ|} −→ M) : ∀ m ∈ M:  #f⁻¹{m} = ℳ(m) }.

Thus, if ℳ is an unordered n-tuple, then δ(ℳ) is defined to be the number of different ordered n-tuples with precisely the same entries as ℳ. Note that δ(∅) = 1, since |∅| = 0 and the empty function f = ∅ is the only function f : ∅ −→ M. Also note δ(ℳ) = 1 if there exists m ∈ M with ℳ(m) = |ℳ| (i.e. if ℳ = [m, …, m]). If N ⊆ M is a finite subset of M, k := #N ∈ N₀, and ℳ = χ_N is the characteristic function of N (i.e. |ℳ| = k and all entries of ℳ are distinct), then δ(ℳ) = k!. Some further examples are

  δ[1, 2, 2] = 3,  δ[1, 1, 2, 2, 2] = 5!/(2! · 3!) = 10,  δ[1, 1, 2, 3] = 2 · 4!/(2! · 2!) = 12.
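Counting the functions in the definition of δ shows that δ(ℳ) equals the multinomial coefficient |ℳ|!/∏_{m∈M} (ℳ(m)!). A minimal sketch reproducing the examples above, representing a multiset by the list of its entries:

    from collections import Counter
    from math import factorial

    def delta(entries):
        """delta(M) = |M|! / prod_m M(m)!, the number of ordered tuples with
        exactly the entries of the multiset M (Def. and Rem. 3.24(c))."""
        result = factorial(len(entries))
        for multiplicity in Counter(entries).values():
            result //= factorial(multiplicity)
        return result

    print(delta([1, 2, 2]), delta([1, 1, 2, 2, 2]), delta([1, 1, 2, 3]))  # 3 10 12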

Definition and Remark 3.25. In the following, we define finite unlabeled rooted trees. However, as these are the only kinds of trees we will consider in this class, we will merely call them trees.

(a) We define, recursively, for each d ∈ N, the set T_d of trees of depth d, where we also set T_{≤d} := T₁ ∪ ··· ∪ T_d. We define T₁ := {∅}, i.e. the empty set ∅ is the only tree of depth 1. Now let d ∈ N, and assume T₁, …, T_d have already been defined. Define

  T_{d+1} := { (ℳ : T_{≤d} −→ N₀) : ∃ T ∈ T_d:  ℳ(T) ≥ 1,  |ℳ| < ∞ },

i.e. the set of trees of depth d + 1 consists precisely of all finite T_{≤d}-multisets containing at least one tree of depth d. Moreover, we define the set of all trees T := ⋃_{d∈N} T_d. Even though it is not necessary for the rigorous logical arguments that we conduct in the following, it can be useful to represent and visualize trees as graphs in the following way: In regular graph theory, a graph is a pair G = (V, E) consisting of a set of vertices V and a set of edges E, where E is a set of subsets of V having precisely two elements. We represent the tree ∅ by a single vertex (the root) and no edge. If ℳ : T_{≤d} −→ N₀ is a tree of depth d + 1, then we obtain the graph representing ℳ by adding a new root vertex and connecting this new root with each of the roots of the trees occurring in ℳ via a new edge, where the root is always drawn at the bottom. Using a drawn graph as a new notation for the unlabeled tree it represents (rather than for the graph consisting of vertices and edges that it also represents) is somewhat of an abuse of notation, as it does not constitute a set-theoretic equality. Thus, for example, we obtain the representations

  T₁ := [∅, ∅, ∅],  T₂ := [[∅]],  T₃ := [[∅, ∅], ∅, [[∅]]]

(the corresponding graph drawings are omitted here).


(b) Recursively, we define a function # : T −→ N that assigns to each tree T its order #T: Define #∅ := 1 and, for d ∈ N,

  ∀ ℳ ∈ T_{d+1}, ℳ : T_{≤d} −→ N₀:  #ℳ := 1 + ∑_{T ∈ T_{≤d}} ℳ(T) · #T,

where we note that the definition of ℳ ∈ T_{d+1} guarantees that only finitely many summands of the above sum are nonzero. In other words, if ℳ = [T₁, …, T_N], N ∈ N, then #ℳ = 1 + #T₁ + ··· + #T_N. The order of a tree T can, thus, be interpreted as the number of vertices in its graph representation described in (a). For example, for the trees T₁, T₂, T₃ from (a), we obtain

  #T₁ = 1 + 3 · #∅ = 4,
  #T₂ = 1 + #[∅] = 1 + 1 + #∅ = 3,
  #T₃ = 1 + #[∅, ∅] + #∅ + #T₂ = 1 + 3 + 1 + 3 = 8.

Similar to the notation introduced in (a), we define, for each k ∈ N,

  T^{#k} := {T ∈ T : #T = k},  T^{#≤k} := {T ∈ T : #T ≤ k};

and, for each d, k ∈ N,

  T_d^{#k} := T^{#k} ∩ T_d,  T_{≤d}^{#k} := T^{#k} ∩ T_{≤d},  etc.

(c) Recursively, we define a factorial function ! : T −→ N, T ↦ T!, as well as a weight function α : T −→ Q⁺: Define ∅! := α(∅) := 1 and, for d ∈ N,

  ∀ ℳ ∈ T_{d+1}, ℳ : T_{≤d} −→ N₀:
  ℳ! := #ℳ · ∏_{T ∈ T_{≤d}} (T!)^{ℳ(T)},  α(ℳ) := (δ(ℳ)/(|ℳ|)!) · ∏_{T ∈ T_{≤d}} (α(T))^{ℳ(T)},

where |ℳ| and δ(ℳ) are as defined in Def. and Rem. 3.24(b),(c), respectively, and where the definition of ℳ ∈ T_{d+1} guarantees that only finitely many factors of the above products are ≠ 1. In other words, if ℳ = [T₁, …, T_N], N ∈ N, then ℳ! = #ℳ · T₁! ··· T_N! and α(ℳ) = (δ(ℳ)/N!) · α(T₁) ··· α(T_N). Note that the tree factorial function can be interpreted as an extension of the usual factorial function on N, if one identifies each of the following trees τ_d with the number d ∈ N: Recursively, define, for each d ∈ N, τ₁ := ∅, τ_{d+1} := [τ_d]. A simple induction then shows that, for each d ∈ N, τ_d has both depth d and order d, and τ_d! = d! holds as well. Moreover, for the trees T₁, T₂, T₃ from (a), we obtain

  T₁! = 4 · 1³ = 4,  T₂! = 3! = 6,  T₃! = 8 · [∅, ∅]! · 1 · 3! = 8 · 3 · 6 = 144,
  α(T₁) = (1/3!) · α(∅)³ = 1/6,  α(T₂) = (1/1!) · (1/1!) · 1 = 1,  α(T₃) = (3!/3!) · α([∅, ∅]) · 1 = 1/2! = 1/2.
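The recursions for #T, T!, and α(T) translate directly into code once trees are encoded; below, a tree is encoded as the tuple of its (tuple-encoded) subtrees, with () standing for ∅. A sketch reproducing the values for T₃ from (a) and (c) above (the encoding and the use of exact fractions are our choices):

    from collections import Counter
    from fractions import Fraction
    from math import factorial

    def order(t):                     # #T = 1 + sum of subtree orders
        return 1 + sum(order(s) for s in t)

    def tree_factorial(t):            # T! = #T * product of subtree factorials
        result = order(t)
        for s in t:
            result *= tree_factorial(s)
        return result

    def delta(t):                     # orderings of the multiset [T_1, ..., T_N]
        result = factorial(len(t))
        for m in Counter(t).values():
            result //= factorial(m)
        return result

    def alpha(t):                     # alpha(T) = (delta(T)/N!) * prod alpha(T_j)
        result = Fraction(delta(t), factorial(len(t)))
        for s in t:
            result *= alpha(s)
        return result

    E = ()                                  # the tree of order 1 (a single root)
    T3 = ((E, E), E, ((E,),))               # T_3 = [[∅,∅], ∅, [[∅]]] from (a)
    print(order(T3), tree_factorial(T3), alpha(T3))   # 8 144 1/2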

Proposition 3.26. (a) For each T ∈ T_d, d ∈ N, one has

  d ≤ #T.

For the tree τ_d, defined in Def. and Rem. 3.25(c), one has τ_d ∈ T_d, #τ_d = d.

(b) One has T = ⋃_{k∈N} T^{#k}. For each d, k ∈ N, one has

  T_d = ⋃_{k∈N} T_d^{#k},  T^{#k} = ⋃_{d∈{1,…,k}} T_d^{#k}.

(c) #T₁ = #T^{#1} = 1. For each k ∈ N, T^{#k} is finite. For each d ∈ N with d ≥ 2, T_d is infinite and countable.

(d) T is infinite and countable.

Proof. (a): That d ≤ #T for each T ∈ T_d, d ∈ N, follows by induction on d ∈ N from the definition of # : T −→ N in Def. and Rem. 3.25(b), noting that, for each ℳ ∈ T_{d+1}, there must be at least one T ∈ T_d with ℳ(T) ≥ 1. As mentioned in Def. and Rem. 3.25(c), τ_d ∈ T_d, #τ_d = d also follows via a simple induction on d ∈ N.

(b): T = ⋃_{k∈N} T^{#k} follows, as it is clear from the definition of # : T −→ N in Def. and Rem. 3.25(b) that the order map is, indeed, N-valued. Then, for each d ∈ N,

  T_d = T_d ∩ T = T_d ∩ ⋃_{k∈N} T^{#k} = ⋃_{k∈N} T_d^{#k}.

Analogously, for each k ∈ N,

  T^{#k} = T^{#k} ∩ T = T^{#k} ∩ ⋃_{d∈N} T_d = ⋃_{d∈{1,…,k}} T_d^{#k},

where the last equality uses (a).

(c): While #T₁ = #T^{#1} = 1 is immediate, we show by induction on d ∈ N that each T_d^{#k} is finite: T₁^{#1} = T₁ and T₁^{#k} = ∅ for k ≥ 2. For d ∈ N, from the definition of # : T −→ N in Def. and Rem. 3.25(b) and the definition of T_{d+1} in Def. and Rem. 3.25(a), we obtain

  T_{d+1}^{#k} ⊆ F := F(T_{≤d}^{#≤(k−1)}, {0, 1, …, k−1})

and F is finite due to the induction hypothesis. Thus, T_{d+1}^{#k} is finite as well. In consequence, according to (b), for each k ∈ N, T^{#k} is a finite union of finite sets and, hence, finite. Clearly, T₂ is already infinite and, thus, each T_d, d ≥ 2, is infinite. On the other hand, according to (b), each T_d is a countable union of finite sets and, hence, countable.

(d): T is infinite as, by (c), each T_d, d ≥ 2, is infinite. On the other hand, by (b) and (c), T is a countable union of finite sets and, hence, countable.

We will now proceed to explain the relation between derivatives of f ∈ C^p(Ω, K^n), Ω ⊆ K^n (p, n ∈ N), and (finite unlabeled rooted) trees: First, we need to recall that we can interpret higher total derivatives D^α f of f as multilinear maps on (K^n)^α (see, e.g., [Phi16b, Sec. 4.6]). We restate the definition of D^α f, where, to simplify notation, we will also write f′ for Df, f^{(α)} for D^α f:


Definition 3.27. Let n, m, p, α ∈ N. Let Ω ⊆ K^n be open, f : Ω −→ K^m, f ∈ C^p(Ω, K^m). For α ≤ p, define the total derivative of order α of f as follows: For each y ∈ Ω,

  f^{(α)}(y) := D^α f(y) : (K^n)^α −→ K^m,
  (D^α f(y)(h¹, …, h^α))_l = ∑_{j₁,…,j_α=1}^{n} ∂_{j₁} ··· ∂_{j_α} f_l(y) h¹_{j₁} ··· h^α_{j_α},  l ∈ {1, …, m}  (3.38)

(note that the coordinate functions of f^{(α)} = D^α f are precisely the partials of order α of f). As usual, we also let f^{(0)} := D⁰f := f.

Next, we observe that, for m = n, we can apply total derivatives to (the results of) other total derivatives: For example, we can form the following expressions (assuming the highest-order derivative to exist):

  D₁ := f″(f, f),  D₂ := f″(f^{(4)}(f, f′(f), f, f), f),  D₃ := f‴(f′(f′(f)), f, f″(f, f)),

where we omitted the argument y ∈ Ω in the notation. As we assume all partials of f to be continuous, the order of differentiation does not matter and the f^{(α)} actually constitute symmetric multilinear maps, i.e. their results do not depend on the order of the arguments. For example, we can rewrite D₂ and D₃ from above as follows:

  D₂ = f″(f, f^{(4)}(f′(f), f, f, f)),  D₃ = f‴(f″(f, f), f′(f′(f)), f).

The key observation is now that the structure of such nested derivatives is precisely the same as that of the above-defined trees: To obtain the corresponding tree, one merely has to replace each f^{(0)} with ∅, replace parentheses () with brackets [], and omit derivatives f^{(α)}, α ≥ 1. For D₁, D₂, D₃ from above, we, thus, obtain

  D₁ = f″(f, f) = [∅, ∅],
  D₂ = f″(f, f^{(4)}(f′(f), f, f, f)) = [[∅, [∅], ∅, ∅], ∅],
  D₃ = f‴(f″(f, f), f′(f′(f)), f) = [[∅, ∅], [[∅]], ∅].

The above considerations lead us to the following definition of derivatives with respect to trees. Bearing in mind d ≤ #T for each T ∈ T_d according to Prop. 3.26(a), we define f^{(T)} for each f being p times continuously differentiable and for each T ∈ T^{#≤(p+1)}, recursively over d ∈ {1, …, p + 1}:

Definition 3.28. Let p ∈ N₀, n ∈ N, Ω ⊆ K^n open, f ∈ C^p(Ω, K^n). Define

  f^{(∅)} := f.

For d ∈ {1, …, p}, we recursively define

  ∀ T = [T₁, …, T_N] ∈ T_{d+1}^{#≤(p+1)}, N ≤ p,  ∀ y ∈ Ω:  f^{(T)}(y) := f^{(N)}(y)(f^{(T₁)}(y), …, f^{(T_N)}(y))

(note that f^{(T)}(y) is then well-defined due to the symmetry of f^{(N)}(y)).

Example 3.29. In this example, we construct all trees up to order 4. For each tree T, we provide its order #T, its factorial T!, its weight α(T), and the derivative f^{(T)}, where f ∈ C³(Ω, K^n), Ω ⊆ K^n open, n ∈ N.

(a) T₁₁ := ∅ is the only tree of order one. We have

  #T₁₁ = 1,  T₁₁! = 1,  α(T₁₁) = 1,  f^{(T₁₁)} = f.

(b) If T ∈ T^{#2}, then T = [S] with #S = 1. Thus S = T₁₁ and T₂₁ := [∅] is the only tree of order 2. We have

  #T₂₁ = 2,  T₂₁! = 2 · 1 = 2,  α(T₂₁) = 1/1! = 1,  f^{(T₂₁)} = f′(f).

(c) We show there are precisely 2 trees of order 3: If T ∈ T^{#3}, then T = [S] with #S = 2 or T = [S₁, S₂] with #S₁ = #S₂ = 1. Thus, we have T^{#3} = {T₃₁, T₃₂} with T₃₁ := [T₂₁] = [[∅]] and T₃₂ := [∅, ∅], where

  #T₃₁ = 3,  T₃₁! = 3 · T₂₁! = 6,  α(T₃₁) = (1/1!) · α(T₂₁) = 1,  f^{(T₃₁)} = f′(f′(f)),
  #T₃₂ = 3,  T₃₂! = 3 · 1 · 1 = 3,  α(T₃₂) = (1/2!) · 1 · 1 = 1/2,  f^{(T₃₂)} = f″(f, f).

(d) We show there are precisely 4 trees of order 4: If T ∈ T^{#4}, then T = [S] with #S = 3 or T = [S₁, S₂] with #S₁ + #S₂ = 3 (i.e. T = [∅, T₂₁]) or T = [S₁, S₂, S₃] with #S₁ = #S₂ = #S₃ = 1. Thus, we have T^{#4} = {T₄₁, T₄₂, T₄₃, T₄₄} with T₄₁ := [T₃₁] = [[[∅]]], T₄₂ := [T₃₂] = [[∅, ∅]], T₄₃ := [∅, T₂₁] = [∅, [∅]], T₄₄ := [∅, ∅, ∅], where

  #T₄₁ = 4,  T₄₁! = 4 · T₃₁! = 24,  α(T₄₁) = (1/1!) · α(T₃₁) = 1,  f^{(T₄₁)} = f′(f′(f′(f))),
  #T₄₂ = 4,  T₄₂! = 4 · T₃₂! = 12,  α(T₄₂) = (1/1!) · α(T₃₂) = 1/2,  f^{(T₄₂)} = f′(f″(f, f)),
  #T₄₃ = 4,  T₄₃! = 4 · 1 · 2 = 8,  α(T₄₃) = (2/2!) · 1 · 1 = 1,  f^{(T₄₃)} = f″(f, f′(f)),
  #T₄₄ = 4,  T₄₄! = 4 · 1³ = 4,  α(T₄₄) = (1/3!) · 1³ = 1/6,  f^{(T₄₄)} = f‴(f, f, f).


Proceeding with the strategy outlined at the beginning of the section, we now want to obtain Taylor expansions of the exact solution φ to y′ = f(y), y(ξ) = η, and of the defining function ψ of an RK method. We will obtain concise and structured forms of these expansions in Prop. 3.31 and Prop. 3.34 below, making use of the above-defined trees. In preparation, we provide a suitable version of Taylor's theorem:

Theorem 3.30 (Taylor). Let m, n ∈ N. Let Ω ⊆ K^n be open and f ∈ C^{p+1}(Ω, K^m) for some p ∈ N₀. Let y ∈ Ω and h ∈ K^n be such that the line segment S_{y,y+h} between y and y + h is a subset of Ω. Then the following formula, also known as Taylor's formula, holds:

  f(y + h) = ∑_{k=0}^{p} (1/k!) f^{(k)}(y)(h, …, h) + R_p(y, h),

where, in each summand, f^{(k)}(y) is applied to k copies of h, and where R_p(y, h) ∈ K^m with ‖R_p(y, h)‖ = O(‖h‖^{p+1}) for ‖h‖ → 0, i.e.

  lim sup_{h→0} ‖R_p(y, h)‖/‖h‖^{p+1} =: C(y) ∈ R⁺₀.

Moreover, the function h ↦ R_p(y, h) is continuous in a neighborhood N_y of 0.

Proof. According to [Phi16b, Th. 4.44], Taylor's formula holds in the stated form for Ω ⊆ R^n and K-valued f with the remainder term in integral form

  R_p(y, h) = ∫₀¹ ((1 − t)^p/p!) f^{(p+1)}(y + th)(h, …, h) dt.

It then also holds for K^m-valued f, since we can apply the K-valued version to each coordinate function f_l of f, l = 1, …, m. It then also extends to Ω ⊆ C^n, as we can interpret C^n as R^{2n} (here, as always in this class, we only consider R-differentiability, even if K = C). It remains to verify R_p(y, h) ∈ K^m with ‖R_p(y, h)‖ = O(‖h‖^{p+1}) for ‖h‖ → 0: Without loss of generality, we may interpret ‖·‖ as the max norm (due to norm equivalence). Moreover, there exists a compact neighborhood K of y with K ⊆ Ω. Each absolute value of the finitely many continuous partials of order p + 1 of f is bounded on K, say by C(y) ∈ R⁺₀. Then we can estimate, for y + h ∈ K,

  ∀ l ∈ {1, …, m}:  |R_p(y, h)_l| ≤ | ∫₀¹ ((1 − t)^p/p!) ∑_{j₁,…,j_{p+1}=1}^{n} ∂_{j₁} ··· ∂_{j_{p+1}} f_l(y + th) h_{j₁} ··· h_{j_{p+1}} dt | ≤ n^{p+1} C(y) ‖h‖_max^{p+1}/p!.

Taking the max over l ∈ {1, …, m} in the above estimate proves the claimed ‖R_p(y, h)‖ = O(‖h‖^{p+1}) for ‖h‖ → 0. Solving Taylor's formula for R_p(y, h), the continuity of h ↦ R_p(y, h) follows from the continuity of f and the continuity of the multilinear functions f^{(k)}(y).


Proposition 3.31. Let n ∈ N, p ∈ N₀, Ω ⊆ K^n open, f ∈ C^p(Ω, K^n). If I ⊆ R is an open interval and φ : I −→ K^n is a solution to y′ = f(y), then one has φ ∈ C^{p+1}(I, K^n) and, for each x, h ∈ R such that x, x + h ∈ I, φ admits the Taylor expansion

  φ(x + h) = φ(x) + ∑_{T ∈ T^{#≤p}} (h^{#T}/T!) α(T) f^{(T)}(φ(x)) + R_p(x, h),  (3.39)

where ‖R_p(x, h)‖ = O(|h|^{p+1}) for |h| → 0 and the map h ↦ R_p(x, h) is continuous in a neighborhood of 0.

Proof. We have φ ∈ C^{p+1}(I, K^n) according to Prop. B.1 of the Appendix. We now prove (3.39) via induction on p: For p = 0, the sum in (3.39) is empty and the formula holds due to Th. 3.30 (applied for p = 0 and with φ instead of f). Now fix p ∈ N₀, f ∈ C^{p+1}(Ω, K^n), and assume (3.39) to hold by induction hypothesis. We apply the induction hypothesis with h replaced by t ∈ [0, h] and set

  y := φ(x),  ȳ(t) := ∑_{T ∈ T^{#≤p}} (t^{#T}/T!) α(T) f^{(T)}(y) + R_p(x, t)

to obtain

  f(φ(x + t)) = f(y + ȳ(t))  (by the induction hypothesis)
  = ∑_{k=0}^{p} (1/k!) f^{(k)}(y)(ȳ(t), …, ȳ(t)) + R_{p,f}(y, ȳ(t))  (by Th. 3.30)
  = ∑_{k=0}^{p} (1/k!) f^{(k)}(y)( ∑_{T₁ ∈ T^{#≤p}} (t^{#T₁}/T₁!) α(T₁) f^{(T₁)}(y), …, ∑_{T_k ∈ T^{#≤p}} (t^{#T_k}/T_k!) α(T_k) f^{(T_k)}(y) ) + P_{p,f}(x, t) + R_{p,f}(y, ȳ(t)),

where, for the R_p(x, t) occurring in ȳ(t), we know ‖R_p(x, t)‖ = O(|t|^{p+1}) for |t| → 0. Moreover, due to (3.38), P_{p,f}(x, t) has the form of a finite sum of terms v ∏_{l=1}^{p} X_l, where v ∈ K^n and each X_l ∈ K is 1 or a component of ∑_{T ∈ T^{#≤p}} (t^{#T}/T!) α(T) f^{(T)}(y) or a component of R_p(x, t), where, in each summand, at least one X_l = (R_p(x, t))_j. Thus, for each summand, we have ‖v ∏_{l=1}^{p} X_l‖ = O(|t|^{p+1}) for |t| → 0, implying ‖P_{p,f}(x, t)‖ = O(|t|^{p+1}) for |t| → 0 as well. We also know ‖R_{p,f}(y, v)‖ = O(‖v‖^{p+1}) for ‖v‖ → 0, implying

  lim sup_{t→0} ‖R_{p,f}(y, ȳ(t))‖/|t|^{p+1} = lim sup_{t→0} (‖ȳ(t)‖^{p+1}/|t|^{p+1}) · (‖R_{p,f}(y, ȳ(t))‖/‖ȳ(t)‖^{p+1}) ∈ R⁺₀.

Thus, we have shown

  f(φ(x + t)) = ∑_{k=0}^{p} (1/k!) f^{(k)}(y)( ∑_{T₁ ∈ T^{#≤p}} (t^{#T₁}/T₁!) α(T₁) f^{(T₁)}(y), …, ∑_{T_k ∈ T^{#≤p}} (t^{#T_k}/T_k!) α(T_k) f^{(T_k)}(y) ) + R_{p,f,1}(x, t),

where ‖R_{p,f,1}(x, t)‖ = O(|t|^{p+1}) for |t| → 0. Then t ↦ R_{p,f,1}(x, t) is continuous in [0, h], due to the continuity of f, φ, and the multilinear maps f^{(k)}(y). Using the multilinearity of the f^{(k)}(y) once again yields

  f(φ(x + t)) = ∑_{k=0}^{p} (1/k!) ∑_{#T₁+···+#T_k ≤ p} t^{#T₁+···+#T_k} · (α(T₁) ··· α(T_k)/(T₁! ··· T_k!)) f^{(k)}(y)(f^{(T₁)}(y), …, f^{(T_k)}(y)) + R_{p,f,2}(x, t),

where ‖R_{p,f,2}(x, t)‖ = O(|t|^{p+1}) for |t| → 0 and t ↦ R_{p,f,2}(x, t) is continuous in [0, h]. Now, instead of summing over all ordered k-tuples, we can use the symmetry of f^{(k)}(y) to merely sum over all unordered k-tuples, using that each unordered k-tuple T = [T₁, …, T_k] corresponds to δ(T) ordered k-tuples. Thus, also making use of the recursive definitions of T, #T, T!, α(T), and f^{(T)}, we obtain

  f(φ(x + t)) = ∑_{k=0}^{p} ∑_{T ∈ T^{#≤(p+1)}, T=[T₁,…,T_k]} (#T · t^{#T−1}/T!) · (δ(T) α(T₁) ··· α(T_k)/k!) f^{(T)}(y) + R_{p,f,2}(x, t)
  = ∑_{T ∈ T^{#≤(p+1)}} (#T · t^{#T−1}/T!) α(T) f^{(T)}(y) + R_{p,f,2}(x, t),

where we used δ(T) α(T₁) ··· α(T_k)/k! = α(T). As φ is a solution to the ODE, we infer from the above equality

  φ(x + h) − φ(x) = ∫₀^h φ′(x + t) dt = ∫₀^h f(φ(x + t)) dt = ∑_{T ∈ T^{#≤(p+1)}} (h^{#T}/T!) α(T) f^{(T)}(φ(x)) + R_{p+1}(x, h),

where

  R_{p+1}(x, h) := ∫₀^h R_{p,f,2}(x, t) dt

(the integral exists, as t ↦ R_{p,f,2}(x, t) is continuous). Thus,

  lim sup_{h→0} ‖R_{p+1}(x, h)‖/|h|^{p+2} =: C(x) ∈ R⁺₀,

showing ‖R_{p+1}(x, h)‖ = O(|h|^{p+2}) for |h| → 0, and we have verified (3.39) to hold for p + 1. The continuity of h ↦ R_{p+1}(x, h) follows, again, by solving (3.39) for R_{p+1}(x, h). In consequence, the induction and the proof of the proposition are complete.

The striking and immensely useful observation due to Butcher is the fact that one can give the Taylor expansion of the defining function of an RK method an analogous structure to that of the above Taylor expansion of the solution φ. It turns out that certain coefficients in the Taylor expansion of the defining function can be computed recursively, using the above trees T, making use of auxiliary vectors v_A(T) ∈ K^s, depending on the RK matrix A ∈ M(s, K). Thus, before we can provide the Taylor expansion in Prop. 3.34 below, we still need to provide the recursive definition of the v_A(T):

Definition 3.32. Let s ∈ N, A ∈ M(s, K). Recursively, we define a function v_A : T −→ K^s: Define

  v_A(∅) := (1, …, 1)ᵗ ∈ K^s,

and, for d ∈ N,

  ∀ ℳ ∈ T_{d+1}, ℳ : T_{≤d} −→ N₀,  ∀ j ∈ {1, …, s}:  v_A(ℳ)_j := ∏_{T ∈ T_{≤d}} ((A v_A(T))_j)^{ℳ(T)},

where the definition of ℳ ∈ T_{d+1} guarantees that only finitely many factors of the above products are ≠ 1. In other words, if ℳ = [T₁, …, T_N], N ∈ N, then v_A(ℳ)_j = (A v_A(T₁))_j ··· (A v_A(T_N))_j for each j ∈ {1, …, s}.

Example 3.33. Let s ∈ N, A ∈ M(s, K). We compute v_A(T) for all the trees T of Ex. 3.29, i.e. for all trees of order at most 4. We define

  ∀ j ∈ {1, …, s}:  c_j := ∑_{l=1}^{s} a_{jl}  (3.40)

(while, here, (3.40) is merely the definition of the c_j, it is the same as the node condition (3.4), i.e. it is also satisfied if A and the c_j are parameters of an RK method that satisfies the node condition). The following trees T_{µν} are the same as in Ex. 3.29. We obtain, for each j ∈ {1, …, s},

  T₁₁ = ∅:  v_A(T₁₁)_j = 1,
  T₂₁ = [∅]:  v_A(T₂₁)_j = (A v_A(T₁₁))_j = ∑_{l=1}^{s} a_{jl} · 1 = c_j,
  T₃₁ = [T₂₁]:  v_A(T₃₁)_j = (A v_A(T₂₁))_j = ∑_{l=1}^{s} a_{jl} c_l,
  T₃₂ = [∅, ∅]:  v_A(T₃₂)_j = ((A v_A(T₁₁))_j)² = c_j²,
  T₄₁ = [T₃₁]:  v_A(T₄₁)_j = (A v_A(T₃₁))_j = ∑_{k,l=1}^{s} a_{jk} a_{kl} c_l,
  T₄₂ = [T₃₂]:  v_A(T₄₂)_j = (A v_A(T₃₂))_j = ∑_{l=1}^{s} a_{jl} c_l²,
  T₄₃ = [T₁₁, T₂₁]:  v_A(T₄₃)_j = (A v_A(T₁₁))_j (A v_A(T₂₁))_j = c_j ∑_{l=1}^{s} a_{jl} c_l,
  T₄₄ = [T₁₁, T₁₁, T₁₁]:  v_A(T₄₄)_j = ((A v_A(T₁₁))_j)³ = c_j³.
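With the tuple encoding of trees sketched after Def. and Rem. 3.25, the recursion of Def. 3.32 is a few lines of numpy, the componentwise products being realized by elementwise multiplication. A check of three of the above entries for the RK matrix of the classical explicit RK method (Butcher data from Ex. 3.5(b)):

    import numpy as np

    def v(A, t):
        """v_A(T) of Def. 3.32: componentwise product of the vectors A @ v_A(T_j)."""
        result = np.ones(A.shape[0])
        for s in t:
            result = result * (A @ v(A, s))
        return result

    A = np.array([[0, 0, 0, 0], [0.5, 0, 0, 0], [0, 0.5, 0, 0], [0, 0, 1, 0.]])
    c = A.sum(axis=1)                      # the nodes c_j of (3.40)
    E = ()
    print(v(A, (E,)) - c)                  # v_A(T21) = c                  -> zeros
    print(v(A, (E, E)) - c**2)             # v_A(T32) = c_j^2              -> zeros
    print(v(A, (E, (E,))) - c * (A @ c))   # v_A(T43) = c_j sum_l a_jl c_l -> zeros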

Proposition 3.34. Let n ∈ N, p ∈ N₀, Ω ⊆ K^n open, f ∈ C^p(Ω, K^n). Let s ∈ N and consider an s-stage RK method according to Def. 3.1 with weights vector bᵗ = (b₁, …, b_s) ∈ R^s and RK matrix A ∈ M(s, K). Assume

  ∀ y ∈ Ω  ∀ j ∈ {1, …, s}:  lim_{h↓0} k_j(y, h) = f(y)

(sufficient conditions are that the RK method is explicit or that f satisfies the conditions of Th. 3.8 and is in standard form²; also note that the k_j do not depend on x, as f does not depend on x). Then, for each y ∈ Ω and each sufficiently small h ∈ R⁺, the k_j admit Taylor expansions

  k_j(y, h) = ∑_{T ∈ T^{#≤p}} h^{#T−1} α(T) v_A(T)_j f^{(T)}(y) + R_{p,j}(y, h),  (3.41a)

where ‖R_{p,j}(y, h)‖ = O(h^p) for h → 0. Thus, the defining function ψ of the RK method admits the Taylor expansion

  ψ(y, h) = ∑_{T ∈ T^{#≤p}} h^{#T−1} α(T) bᵗv_A(T) f^{(T)}(y) + R_p(y, h),  (3.41b)

where ‖R_p(y, h)‖ = O(h^p) for h → 0.

² Due to Prop. A.4 of the Appendix, the conditions of Th. 3.8 are automatically satisfied for p ≥ 1.

Proof. Since

  ψ(y, h) = ∑_{j=1}^{s} b_j k_j(y, h),

(3.41b), clearly, follows from (3.41a). Thus, it suffices to show (3.41a). The proof of (3.41a) is conducted similarly to the proof of Prop. 3.31, using induction on p: For p = 0, the sum in (3.41a) is empty and the statement holds due to the assumption lim_{h↓0} k_j(y, h) = f(y). Now fix p ∈ N₀, f ∈ C^{p+1}(Ω, K^n), and assume (3.41a) to hold by induction hypothesis. We set

  R_{p,j,1}(y, h) := h ∑_{l=1}^{s} a_{jl} R_{p,l}(y, h),  ȳ_j(h) := ∑_{T ∈ T^{#≤p}} h^{#T} α(T) (A v_A(T))_j f^{(T)}(y) + R_{p,j,1}(y, h)

to obtain

  k_j(y, h) = f(y + h ∑_{l=1}^{s} a_{jl} k_l(y, h))  (by (3.1b))
  = f(y + ȳ_j(h))  (by the induction hypothesis)
  = ∑_{k=0}^{p} (1/k!) f^{(k)}(y)(ȳ_j(h), …, ȳ_j(h)) + R_{p,f}(y, ȳ_j(h))  (by Th. 3.30)
  = ∑_{k=0}^{p} (1/k!) f^{(k)}(y)( ∑_{T₁ ∈ T^{#≤p}} h^{#T₁} α(T₁) (A v_A(T₁))_j f^{(T₁)}(y), …, ∑_{T_k ∈ T^{#≤p}} h^{#T_k} α(T_k) (A v_A(T_k))_j f^{(T_k)}(y) ) + P_{p,f,j}(y, h) + R_{p,f}(y, ȳ_j(h)),

where, since ‖R_{p,l}(y, h)‖ = O(h^p) for h → 0 by the induction hypothesis, ‖R_{p,j,1}(y, h)‖ = O(h^{p+1}) for h → 0. Moreover, due to (3.38), P_{p,f,j}(y, h) has the form of a finite sum of terms v ∏_{l=1}^{p} X_l, where v ∈ K^n and each X_l ∈ K is 1 or a component of ∑_{T ∈ T^{#≤p}} h^{#T} α(T) (A v_A(T))_j f^{(T)}(y) or a component of R_{p,j,1}(y, h), where, in each summand, at least one X_l = (R_{p,j,1}(y, h))_α. Thus, for each summand, we have ‖v ∏_{l=1}^{p} X_l‖ = O(h^{p+1}) for h → 0, implying ‖P_{p,f,j}(y, h)‖ = O(h^{p+1}) for h → 0 as well. We also know ‖R_{p,f}(y, v)‖ = O(‖v‖^{p+1}) for ‖v‖ → 0, implying

  lim sup_{h↓0} ‖R_{p,f}(y, ȳ_j(h))‖/h^{p+1} = lim sup_{h↓0} (‖ȳ_j(h)‖^{p+1}/h^{p+1}) · (‖R_{p,f}(y, ȳ_j(h))‖/‖ȳ_j(h)‖^{p+1}) ∈ R⁺₀.

Thus, we have shown

  k_j(y, h) = ∑_{k=0}^{p} (1/k!) f^{(k)}(y)( ∑_{T₁ ∈ T^{#≤p}} h^{#T₁} α(T₁) (A v_A(T₁))_j f^{(T₁)}(y), …, ∑_{T_k ∈ T^{#≤p}} h^{#T_k} α(T_k) (A v_A(T_k))_j f^{(T_k)}(y) ) + R_{p,f,j}(y, h),

where ‖R_{p,f,j}(y, h)‖ = O(h^{p+1}) for h → 0. Using the multilinearity of the f^{(k)}(y) once again yields

  k_j(y, h) = ∑_{k=0}^{p} (1/k!) ∑_{#T₁+···+#T_k ≤ p} h^{#T₁+···+#T_k} (∏_{l=1}^{k} α(T_l)(A v_A(T_l))_j) f^{(k)}(y)(f^{(T₁)}(y), …, f^{(T_k)}(y)) + R_{p+1,j}(y, h),

where ‖R_{p+1,j}(y, h)‖ = O(h^{p+1}) for h → 0. Now, instead of summing over all ordered k-tuples, we can use the symmetry of f^{(k)}(y) to merely sum over all unordered k-tuples, using that each unordered k-tuple T = [T₁, …, T_k] corresponds to δ(T) ordered k-tuples. Thus, also making use of the recursive definitions of T, #T, v_A(T), α(T), and f^{(T)}, we obtain

  k_j(y, h) = ∑_{k=0}^{p} ∑_{T ∈ T^{#≤(p+1)}, T=[T₁,…,T_k]} h^{#T−1} v_A(T)_j (δ(T) α(T₁) ··· α(T_k)/k!) f^{(T)}(y) + R_{p+1,j}(y, h)
  = ∑_{T ∈ T^{#≤(p+1)}} h^{#T−1} α(T) v_A(T)_j f^{(T)}(y) + R_{p+1,j}(y, h),

where we used δ(T) α(T₁) ··· α(T_k)/k! = α(T). Hence, we have verified (3.41a) to hold for p + 1, completing the induction and the proof of the proposition.

Definition 3.35. Let s ∈ N, bᵗ = (b₁, …, b_s) ∈ R^s, A ∈ M(s, K). We say that the s-stage RK method with weights vector b and RK matrix A satisfies the consistency condition of order p ∈ N if, and only if,

  ∀ T ∈ T^{#≤p}:  bᵗv_A(T) = 1/T!.  (3.42)

Remark 3.36. We note that (3.42) is a generalization of the consistency condition (3.3): Since ∅ is the only tree of order 1, v_A(∅) = (1, …, 1)ᵗ, and ∅! = 1, the consistency condition of order 1 reads ∑_{j=1}^{s} b_j = 1, which is precisely (3.3).
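Combining the sketches after Def. and Rem. 3.25 and Ex. 3.33, condition (3.42) can be checked mechanically. The following snippet verifies the consistency condition of order 4 for the classical explicit RK method, with the eight trees of order at most 4 from Ex. 3.29 hard-coded for brevity:

    import numpy as np

    E = ()   # the tree of order 1; trees encoded as tuples of subtrees
    TREES = [E, (E,), ((E,),), (E, E),
             (((E,),),), ((E, E),), (E, (E,)), (E, E, E)]   # all orders <= 4

    def order(t):
        return 1 + sum(order(s) for s in t)

    def tree_factorial(t):
        r = order(t)
        for s in t:
            r *= tree_factorial(s)
        return r

    def v(A, t):
        r = np.ones(A.shape[0])
        for s in t:
            r = r * (A @ v(A, s))
        return r

    A = np.array([[0, 0, 0, 0], [0.5, 0, 0, 0], [0, 0.5, 0, 0], [0, 0, 1, 0.]])
    b = np.array([1/6, 1/3, 1/3, 1/6])
    for t in TREES:                        # check b^t v_A(T) = 1/T!, cf. (3.42)
        assert abs(b @ v(A, t) - 1.0 / tree_factorial(t)) < 1e-14
    print("order-4 consistency conditions (3.42) hold for the classical RK method")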

Theorem 3.37 (Butcher). Let n, s ∈ N, bᵗ = (b₁, …, b_s) ∈ R^s, A ∈ M(s, K). Consider an s-stage RK method with weights vector b and RK matrix A.

(a) Autonomous Case: Let Ω ⊆ K^n be open, let f ∈ C^p(Ω, K^n), p ∈ N, and assume the RK method to be in standard form. Moreover, let (ξ, η) ∈ R × Ω and choose b > ξ and φ : [ξ, b] −→ K^n such that φ is the unique solution to y′ = f(y), y(ξ) = η, on [ξ, b]. If the RK method satisfies the consistency condition of order p, i.e. (3.42), then it is consistent of order p.

(b) Nonautonomous Case: Let G ⊆ R × K^n be open, let f ∈ C^p(G, K^n), p ∈ N, and assume the RK method to be in standard form. Let (ξ, η) ∈ G and choose b > ξ and φ : [ξ, b] −→ K^n such that φ is the unique solution to y′ = f(x, y), y(ξ) = η, on [ξ, b]. If the RK method satisfies the consistency condition of order p, i.e. (3.42), plus the node condition (3.4) (with node vector c = (c₁, …, c_s)ᵗ ∈ R^s), then it is consistent of order p.

(c) If the RK method is consistent of order p ∈ N with respect to each solution φ : [0, b] −→ R^n, b > 0, to an autonomous initial value problem y′ = f(y), y(0) = 0, for each f ∈ C^∞(R^n, R^n), n ∈ N, then it must satisfy the consistency condition of order p, i.e. (3.42).

Proof. (a): Note that, according to Th. 3.8(c),

  ∀ y ∈ Ω  ∀ j ∈ {1, …, s}:  lim_{h↓0} k_j(y, h) = f(y),

and the local truncation error λ of (2.4) is well-defined. The main work of the proof has already been carried out in the proofs of Prop. 3.31 and Prop. 3.34, respectively. It remains to use these propositions in (2.4): Let x ∈ [ξ, b[ and set y := φ(x). According to (2.4), for each sufficiently small h ∈ ]0, b − x], we obtain the local truncation error

  λ(x, h) = y + hψ(y, h) − φ(x + h)
  = y + h ∑_{T ∈ T^{#≤p}} h^{#T−1} α(T) bᵗv_A(T) f^{(T)}(y) + hR_{p,ψ}(y, h) − (y + ∑_{T ∈ T^{#≤p}} (h^{#T}/T!) α(T) f^{(T)}(y) + R_{p,φ}(x, h))
  = hR_{p,ψ}(y, h) − R_{p,φ}(x, h),  (3.43)

where the last equality uses (3.42), and where ‖R_{p,ψ}(y, h)‖ = O(h^p) for h → 0 and ‖R_{p,φ}(x, h)‖ = O(h^{p+1}) for h → 0. Thus,

  lim sup_{h↓0} ‖λ(x, h)‖/h^{p+1} ∈ R⁺₀,

showing the method to be consistent of order p.

(b): We know from Th. 3.12 that y′ = f(x,y), y(ξ) = η, is equivalent to the autonomous initial value problem y′ = g(y), y(ξ) = (ξ,η), where

g : G → K^{n+1},  g(x, y_1, …, y_n) := (1, f(x, y_1, …, y_n)).

The RK method defining functions for the nonautonomous and the autonomous problem are, respectively,

ψ_f : D_{ψ_f} → K^n,      ψ_f(x,y,h) = ∑_{j=1}^s b_j k_{f,j}(x,y,h),
ψ_g : D_{ψ_g} → K^{n+1},  ψ_g((x,y),h) = ∑_{j=1}^s b_j k_{g,j}((x,y),h),

where the k_{f,j}(x,y,h) satisfy

∀_{j∈{1,…,s}}   k_{f,j}(x,y,h) = f(x + c_j h, y + h ∑_{l=1}^s a_{jl} k_{f,l}(x,y,h)),

and the k_{g,j}((x,y),h) satisfy

∀_{j∈{1,…,s}}   k_{g,j}((x,y),h) = g((x,y) + h ∑_{l=1}^s a_{jl} k_{g,l}((x,y),h)).

As we assume f ∈ C^p(G,K^n) with p ≥ 1, f is locally Lipschitz with respect to y by Prop. A.4 of the Appendix, i.e. f satisfies the conditions of Th. 3.8. Hence, as we also assume the RK method to be in standard form, according to Th. 3.8(c),

∀_{(x,y)∈G} ∀_{j∈{1,…,s}}   lim_{h↓0} k_{f,j}(x,y,h) = f(x,y),


and the local truncation error λ_f with respect to ψ_f and φ is well-defined. Moreover, we know from Lem. 3.13 (as we also assume the node condition), for each sufficiently small h > 0,

∀_{j∈{1,…,s}}   k_{g,j}((x,y),h) = (1, k_{f,j}(x,y,h))
⇒ ∀_{j∈{1,…,s}}   lim_{h↓0} k_{g,j}((x,y),h) = (1, f(x,y)) = g(x,y).

We know from Th. 3.12 that, if φ : [ξ,b] → K^n is the solution to y′ = f(x,y), y(ξ) = η, then ψ : [ξ,b] → K^{n+1}, ψ(x) = (x, φ(x)), is the solution to y′ = g(y), y(ξ) = (ξ,η), yielding, for the respective local truncation errors at x ∈ [ξ,b[ and sufficiently small h ∈ ]0, b−x]:

λ_f(x,h) = y + h ψ_f(x,y,h) − φ(x+h)

and

λ_g(x,h) = ψ(x) + h ψ_g(ψ(x),h) − ψ(x+h)
  = (x, φ(x)) + h (1, ψ_f(x,φ(x),h)) − (x+h, φ(x+h)) = (0, λ_f(x,h)).   (3.44)

According to (a), the RK method is consistent of order p for the autonomous problem and we obtain

lim sup_{h↓0} ‖λ_g(x,h)‖ / h^{p+1} ∈ R⁺₀
  (3.44)
  ⇒ lim sup_{h↓0} ‖λ_f(x,h)‖ / h^{p+1} ∈ R⁺₀,

showing the method to be consistent of order p for the nonautonomous problem as well.

(c) is a consequence of the following result:

∀_{T∈T} ∃_{f_T∈C^∞(R^{#T},R^{#T})} ∀_{S∈T}   (f_T^{(S)}(0))_1 = δ_{ST} := { 1 for S = T, 0 for S ≠ T }:   (3.45)

We will construct the f_T inductively (the construction will actually show we can obtain each component of each f_T to be a scalar multiple of a monomial): We start the construction by setting f_∅ : R → R, f_∅(y) := 1. Then f_∅^{(∅)}(0) = f_∅(0) = 1 and, for S ∈ T \ {∅}, f_∅^{(S)}(0) = 0, since f_∅^{(S)} involves a derivative of order ≥ 1 of f_∅, and all such derivatives vanish identically. Now let d ∈ N, assume f_T ∈ C^∞(R^{#T},R^{#T}) with the property of (3.45) has already been constructed for each T ∈ T_{≤d}, and let T = [T_1, …, T_N] ∈ T_{d+1}, N ∈ N. We need to define

f_T : R^{#T} → R^{#T},  y ↦ f_T(y).

As the following definition does actually depend on the order of T_1, …, T_N, we fix an enumeration α : T → N and assume T_1, …, T_N to be ordered according to α (i.e. α(T_1) ≤ α(T_2) ≤ ⋯ ≤ α(T_N)). We now partition

y = (y_1, …, y_{#T}) = (y_1, y^1, …, y^N),  where y_1 ∈ R,  ∀_{j∈{1,…,N}}  y^j ∈ R^{#T_j},


and define

f_T(y) := ((y_1)^N / N!, f_{T_1}(y^1), …, f_{T_N}(y^N)) ∈ R^{#T}.

Then, clearly, each component of f_T is a scalar multiple of a monomial, since we know each component of each f_{T_j}(y^j) to be a scalar multiple of a monomial by induction. Since T ≠ ∅, we have

(f_T^{(∅)}(0))_1 = (f_T(0))_1 = 0.

Now let S = [S_1, …, S_M] ∈ T, M ∈ N. We compute

(f_T^{(S)}(0))_1
  (Def. 3.28)
  = (f_T^{(M)}(0) (f_T^{(S_1)}(0), …, f_T^{(S_M)}(0)))_1
  (3.38)
  = ∑_{j_1,…,j_M=1}^{#T} (∂_{j_1} ⋯ ∂_{j_M} f_T(0))_1 (f_T^{(S_1)}(0))_{j_1} ⋯ (f_T^{(S_M)}(0))_{j_M}
  = ∑_{j_1,…,j_M=1}^{#T} ∂_{j_1} ⋯ ∂_{j_M} ((y_1)^N / N!)(0) (f_T^{(S_1)}(0))_{j_1} ⋯ (f_T^{(S_M)}(0))_{j_M}
  = { (f_{T_1}^{(S_1)}(0))_1 ⋯ (f_{T_N}^{(S_N)}(0))_1 for M = N,  0 for M ≠ N }
  (ind. hyp.)
  = { δ_{S_1 T_1} ⋯ δ_{S_N T_N} for M = N,  0 for M ≠ N }
  = δ_{ST},

completing the proof of (3.45). Now let p ∈ N. If S ∈ T_{#≤p}, then, due to (3.45), for f := f_S and (ξ,η) = (0,0), the first component of (3.43) becomes

(λ(0,h))_1 = h^{#S} (α(S) b^t v_A(S) − α(S)/S!) + h (R_{p,ψ}(0,h))_1 − (R_{p,φ}(0,h))_1,

where ‖R_{p,ψ}(0,h)‖ = O(h^p) for h → 0 and ‖R_{p,φ}(0,h)‖ = O(h^{p+1}) for h → 0. Hence, if

lim sup_{h↓0} ‖λ(0,h)‖ / h^{p+1} ∈ R⁺₀,   (3.46)

then (as #S ≤ p)

b^t v_A(S) = 1/S!,

proving (3.42).

Example 3.38. Let s ∈ N, b^t = (b_1, …, b_s) ∈ R^s, A ∈ M(s,K) and consider an s-stage RK method with weights vector b and RK matrix A. We use the results of Ex. 3.29 and Ex. 3.33 above, to formulate the consistency conditions (3.42) of up to order 4, where, as in Ex. 3.33, we define

∀_{j∈{1,…,s}}   c_j := ∑_{l=1}^s a_{jl}.


We already know from Rem. 3.36 that

∑_{j=1}^s b_j = 1   (3.47a)

constitutes the consistency condition of order 1. In the following, the notation for the trees is the same as in Ex. 3.29 and Ex. 3.33 above. Since T_{21} is the only tree of order 2,

b^t v_A(T_{21}) = ∑_{j=1}^s b_j c_j = 1/2   (3.47b)

is the only additional consistency condition of order 2. As the trees of order 3 are precisely T_{31} and T_{32}, there are precisely two additional equations for the consistency condition of order 3, namely

b^t v_A(T_{31}) = ∑_{j,l=1}^s b_j a_{jl} c_l = 1/6,   (3.47c)
b^t v_A(T_{32}) = ∑_{j=1}^s b_j c_j² = 1/3.   (3.47d)

The four trees of order 4 yield precisely the following four additional equations for the consistency condition of order 4:

b^t v_A(T_{41}) = ∑_{j,k,l=1}^s b_j a_{jk} a_{kl} c_l = 1/24,   (3.47e)
b^t v_A(T_{42}) = ∑_{j,l=1}^s b_j a_{jl} c_l² = 1/12,   (3.47f)
b^t v_A(T_{43}) = ∑_{j,l=1}^s b_j c_j a_{jl} c_l = 1/8,   (3.47g)
b^t v_A(T_{44}) = ∑_{j=1}^s b_j c_j³ = 1/4.   (3.47h)
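Since (3.47) is a finite list of polynomial identities in the weights b_j and the entries a_{jl}, it can be checked mechanically for any given Butcher tableau. The following Python sketch is our own addition for illustration (the name check_order_conditions is ours, not part of the original development); it evaluates (3.47a)-(3.47h) numerically:

import numpy as np

def check_order_conditions(A, b, tol=1e-12):
    """Check the consistency conditions (3.47a)-(3.47h) for an s-stage
    RK method with RK matrix A and weights vector b. Returns the largest
    q in {0,...,4} such that all conditions up to order q hold (up to tol)."""
    A, b = np.asarray(A, dtype=float), np.asarray(b, dtype=float)
    c = A.sum(axis=1)  # node vector, c_j = sum_l a_{jl}, as in Ex. 3.33
    conds = [
        [(b.sum(), 1.0)],                                    # (3.47a)
        [(b @ c, 1/2)],                                      # (3.47b)
        [(b @ A @ c, 1/6), (b @ c**2, 1/3)],                 # (3.47c), (3.47d)
        [(b @ A @ A @ c, 1/24), (b @ A @ c**2, 1/12),
         (b @ (c * (A @ c)), 1/8), (b @ c**3, 1/4)],         # (3.47e)-(3.47h)
    ]
    order = 0
    for group in conds:
        if all(abs(val - target) <= tol for val, target in group):
            order += 1
        else:
            break
    return order

The grouping by order mirrors Rem. 3.36 and the list above: an order is only credited if all conditions of the lower orders hold as well.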

Example 3.39. (a) Both the explicit Euler method of Ex. 3.5(a) and the implicit Euler method of Ex. 3.5(c) are 1-stage RK methods with b = (1), showing they satisfy the consistency condition (3.47a) of order 1. The implicit trapezoidal method of Ex. 3.20(d) has Butcher tableau

0 | 0    0
1 | 1/2  1/2
--+----------
  | 1/2  1/2

Thus, (3.47a) is clearly satisfied. Moreover,

∑_{j=1}^s b_j c_j = (1/2)·0 + (1/2)·1 = 1/2,

showing (3.47b) to be satisfied as well. On the other hand,

∑_{j,l=1}^s b_j a_{jl} c_l = b_2 a_{21} c_1 + b_2 a_{22} c_2 = (1/2)·(1/2)·0 + (1/2)·(1/2)·1 = 1/4 ≠ 1/6,

i.e. (3.47c) fails.

(b) As stated before in Ex. 3.5(b), the classical explicit RK method has Butcher tableau

0   |
1/2 | 1/2
1/2 | 0    1/2
1   | 0    0    1
----+---------------------
    | 1/6  1/3  1/3  1/6

We verify that it satisfies all 8 equations of (3.47), i.e. it satisfies the consistency condition of order 4:

∑_{j=1}^s b_j = 1/6 + 1/3 + 1/3 + 1/6 = 1,

∑_{j=1}^s b_j c_j = 2 · (1/3)·(1/2) + (1/6)·1 = 1/2,

∑_{j,l=1}^s b_j a_{jl} c_l = (1/3)·(1/2)·0 + (1/3)·(1/2)·(1/2) + (1/6)·1·(1/2) = 1/6,

∑_{j=1}^s b_j c_j² = 2 · (1/3)·(1/4) + (1/6)·1² = 1/3,

∑_{j,k,l=1}^s b_j a_{jk} a_{kl} c_l = b_3 a_{32} a_{21} c_1 + b_4 a_{43} a_{32} c_2
  = (1/3)·(1/2)·(1/2)·0 + (1/6)·1·(1/2)·(1/2) = 1/24,

∑_{j,l=1}^s b_j a_{jl} c_l² = (1/3)·(1/2)·0² + (1/3)·(1/2)·(1/4) + (1/6)·1·(1/4) = 1/12,

∑_{j,l=1}^s b_j c_j a_{jl} c_l = (1/3)·(1/2)·(1/2)·0 + (1/3)·(1/2)·(1/2)·(1/2) + (1/6)·1·1·(1/2) = 1/8,

∑_{j=1}^s b_j c_j³ = 2 · (1/3)·(1/8) + (1/6)·1³ = 1/4.

In combination with Th. 3.37(b), this, finally, provides the proof of Th. 2.15(a).
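Assuming the check_order_conditions sketch from Ex. 3.38 above is available, this hand computation can be cross-checked in a few lines (again our own illustration):

import numpy as np

# Classical explicit RK method (Butcher tableau from Ex. 3.5(b)).
A_rk4 = np.array([[0, 0, 0, 0],
                  [1/2, 0, 0, 0],
                  [0, 1/2, 0, 0],
                  [0, 0, 1, 0]])
b_rk4 = np.array([1/6, 1/3, 1/3, 1/6])

print(check_order_conditions(A_rk4, b_rk4))  # expected output: 4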

(c) Consider the following implicit 2-stage RK method with Butcher tableau

1/2 − √3/6 | 1/4         1/4 − √3/6
1/2 + √3/6 | 1/4 + √3/6  1/4
-----------+------------------------
           | 1/2         1/2


It is a so-called Gauss method. It is an example of a 2-stage method, satisfying the consistency condition of order 4: As in (b), we check the validity of all 8 equations of (3.47):

∑_{j=1}^s b_j = 1/2 + 1/2 = 1,

∑_{j=1}^s b_j c_j = (1/2)(1/2 − √3/6) + (1/2)(1/2 + √3/6) = 2 · 1/4 = 1/2,

∑_{j,l=1}^s b_j a_{jl} c_l
  = (1/2)(1/4 + 1/4 + √3/6)(1/2 − √3/6) + (1/2)(1/4 + 1/4 − √3/6)(1/2 + √3/6)
  = 2 · (1/2) · (1/2 − √3/6)(1/2 + √3/6) = 1/4 − 3/36 = 1/6,

∑_{j=1}^s b_j c_j² = (1/2)(1/2 − √3/6)² + (1/2)(1/2 + √3/6)²
  = 2 · (1/2) · (1/4) + 2 · (1/2) · (3/36) = 1/3,

∑_{j,k,l=1}^s b_j a_{jk} a_{kl} c_l
  = (1/2) ∑_{j=1}^s a_{j1} ((1/4)(1/2 − √3/6) + (1/4 − √3/6)(1/2 + √3/6))
    + (1/2) ∑_{j=1}^s a_{j2} ((1/4)(1/2 + √3/6) + (1/4 + √3/6)(1/2 − √3/6))
  = (1/2)(1/2 + √3/6)((1/4)(1/2 − √3/6) + (1/4 − √3/6)(1/2 + √3/6))
    + (1/2)(1/2 − √3/6)((1/4)(1/2 + √3/6) + (1/4 + √3/6)(1/2 − √3/6))
  = (1/4) · (1/6) + (1/4) · (1/4) + (1/4) · (3/36) − 3/36 = 6/48 − 1/12 = 1/24,

∑_{j,l=1}^s b_j a_{jl} c_l² = (1/2)(1/2 + √3/6)(1/2 − √3/6)² + (1/2)(1/2 − √3/6)(1/2 + √3/6)²
  = (1/2)(1/4 − 3/36) = 1/12,

∑_{j,l=1}^s b_j c_j a_{jl} c_l
  = (1/2)·(1/4)(1/2 − √3/6)² + (1/2)·(1/4)(1/2 + √3/6)² + (1/2)(1/2 + √3/6)(1/2 − √3/6)(1/4 + 1/4)
  = (1/12)·(1/2) + 1/12 = 1/8,

∑_{j=1}^s b_j c_j³ = (1/2)(1/2 − √3/6)³ + (1/2)(1/2 + √3/6)³
  = 2 · (1/2) · (1/8) + 2 · (1/2) · (1/2) · 3 · (3/36) = 1/4.

Thus, as the method, clearly, also satisfies the node condition, it is consistent of order 4 for each f that satisfies the hypotheses of Th. 3.37(a) or Th. 3.37(b).
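Numerically, the same cross-check as in (b) applies (our own illustration, again assuming check_order_conditions from Ex. 3.38; since the tableau entries involve √3/6, they are only represented up to floating-point error, so a tolerance is essential):

import numpy as np

s3 = np.sqrt(3) / 6
A_gauss = np.array([[1/4, 1/4 - s3],
                    [1/4 + s3, 1/4]])
b_gauss = np.array([1/2, 1/2])

print(check_order_conditions(A_gauss, b_gauss, tol=1e-10))  # expected output: 4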

4 Accuracy Improvements Based on Asymptotic Error Expansions

4.1 The Idea of Extrapolation

Extrapolation is a method that can be used to improve the accuracy of single-step methods. As before, the goal is to approximate the solution to the initial value problem y′ = f(x,y), y(ξ) = η, which is assumed to have a unique exact solution φ defined on some interval [ξ,b], b > ξ. We know that, given a defining function ψ and a partition ∆ = (x_0, …, x_N) of [ξ,b], we obtain a sequence (y_0, y_1, …) of approximations to (φ(x_0), φ(x_1), …).

Assume we have a defining function ψ that provides an explicit single-step method such that the unique global solution

((ξ,η) = (x_0,y_0), …, (x_N,y_N)) ∈ ([ξ,b] × K^n)^{N+1},

satisfying

y_0 = η,
∀_{k∈{0,…,N−1}}   y_{k+1} = y_k + h_k ψ(x_k, y_k, h_k),  h_k := x_{k+1} − x_k,

exists for each ∆ ∈ Π([ξ,b]) with h_max(∆) < ǫ for some ǫ > 0. We now fix x ∈ ]ξ,b] and are interested in approximating φ(x). We consider partitions ∆_h of [ξ,x] of decreasing (sufficiently small) equidistant stepsizes h < ǫ. If (h_j)_{j∈N₀} is a sequence of admissible stepsizes such that lim_{j→∞} h_j = 0, then, for a reasonable method (one of order of convergence p ∈ N, say), we expect lim_{j→∞} y_{h_j}(x) = φ(x), where y_{h_j}(x) is the approximation at x given by the partition ∆_{h_j} of [ξ,x] that has equidistant stepsizes h_j (only discrete values for h_j > 0 are admissible, namely those satisfying (x−ξ)/h_j ∈ N).

The idea of extrapolation is to improve the approximation at x, by computing y_{h_j}(x) for a number of different stepsizes h_0 > h_1 > ⋯ > h_M, interpolating the points (h_0, y_{h_0}(x)), …, (h_M, y_{h_M}(x)) via polynomial interpolation to obtain a polynomial P, h ↦ P(h), and using P(0) as the improved approximation of φ(x). The term extrapolation comes from the fact that 0 ∉ [h_M, h_0], i.e. one extrapolates P to some value outside the interval of interpolation points.

One can only expect the described idea to work well if y_h(x) (as a function of h) behaves like a polynomial, at least asymptotically for h → 0. This is the case, if the error y_h(x) − φ(x) has an asymptotic expansion in the sense of Def. 4.2 below.
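To make the idea concrete, here is a minimal Python sketch (entirely our own illustration; the helper euler and all names are ours, and y′ = −y is merely a convenient test problem). It computes explicit Euler approximations y_h(1) for three stepsizes, interpolates them as a function of h, and evaluates the interpolating polynomial at h = 0:

import numpy as np

def euler(f, xi, eta, x, N):
    """Explicit Euler with N equidistant steps on [xi, x]."""
    h, y = (x - xi) / N, eta
    for k in range(N):
        y += h * f(xi + k * h, y)
    return y

f = lambda x, y: -y                              # y' = -y, exact solution exp(-x)
x, exact = 1.0, np.exp(-1.0)

hs = np.array([0.1, 0.05, 0.025])                # stepsizes h, h/2, h/4
ys = np.array([euler(f, 0.0, 1.0, x, round(x / h)) for h in hs])

P = np.polynomial.Polynomial.fit(hs, ys, deg=2)  # interpolating quadratic in h
print(abs(ys[-1] - exact))                       # plain Euler error at h = 0.025
print(abs(P(0.0) - exact))                       # extrapolated error, much smaller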


4.2 Asymptotic Error Expansions

Notation 4.1. Let a, x ∈ R, a < x. Define the set of admissible stepsizes by

H_x := { h ∈ R⁺ : (x−a)/h ∈ N }.

For each h ∈ H_x, let N(h) := (x−a)/h ∈ N, i.e. N(h) is the number of steps in the partition ∆_h ∈ Π([a,x]) consisting of the N(h) + 1 equidistant points

x_k^h = a + kh,  k ∈ {0, …, N(h)},

that means

∆_h = (x_0^h, x_1^h, …, x_{N(h)}^h) = (a, a+h, …, a + hN(h) = x).

Definition 4.2. In the situation of Def. 2.3, i.e. given [a,b] ⊆ R, a < b, and φ : [a,b] → K^n, n ∈ N, η := φ(a), assume the explicit single-step method given by the defining function ψ : D_ψ → K^n, D_ψ ⊆ R × K^n × R, to be such that, for each partition ∆ = (x_0, …, x_N) of [a,b] with h_max(∆) < h(ψ), h(ψ) ∈ R⁺, the (unique) global solution

((x_0,y_0), …, (x_N,y_N)) ∈ ([a,b] × K^n)^{N+1}

with y_0 = η exists. Let x ∈ ]a,b]. For each h ∈ H_x ∩ ]0, h(ψ)[, denote

y_0^h := η,
∀_{k∈{0,1,…,N(h)−1}}   y_{k+1}^h := y_k^h + h ψ(x_k^h, y_k^h, h).

Also define

U := { (x,h) ∈ ]a,b] × ]0, h(ψ)[ : h ∈ H_x }

and let

u : U → K^n,  u(x,h) := y_{N(h)}^h.

We say that the method given by ψ admits an asymptotic expansion of the (global) error u(x,h) − φ(x) if, and only if, there exist p, r ∈ N such that

∃_{C∈R⁺₀} ∀_{j∈{0,…,r−1}} ∃_{c_{p+j}∈C^{r+1−j}([a,b],K^n), c_{p+j}(a)=0} ∀_{x∈]a,b]}
  lim sup_{h→0} ‖u(x,h) − φ(x) − ∑_{j=0}^{r−1} c_{p+j}(x) h^{p+j}‖ / h^{p+r} ≤ C,   (4.1a)

where the convergence in (4.1a) is often stated with a Landau symbol in the form

u(x,h) − φ(x) = c_p(x) h^p + c_{p+1}(x) h^{p+1} + ⋯ + c_{p+r−1}(x) h^{p+r−1} + O(h^{p+r})
  (for h → 0, uniformly in x ∈ ]a,b]).   (4.1b)


As an example of an application of the existence of an asymptotic expansion of the global error, we provide the following Prop. 4.3, which will be useful in Sec. 4.4 on stepsize control below (cf. the proof of Lem. 4.20).

Proposition 4.3. In the situation of Def. 2.3, i.e. given [a,b] ⊆ R, a < b, and φ : [a,b] → K^n, n ∈ N, η := φ(a), assume the explicit single-step method given by the defining function ψ : D_ψ → K^n, D_ψ ⊆ R × K^n × R, admits an asymptotic expansion of the global error as in (4.1), i.e.

∃_{p,r∈N} ∀_{j∈{0,…,r−1}} ∃_{c_{p+j}∈C^{r+1−j}([a,b],K^n), c_{p+j}(a)=0}
  u(x,h) − φ(x) = ∑_{j=0}^{r−1} c_{p+j}(x) h^{p+j} + O(h^{p+r})
  (for h → 0, uniformly in x ∈ ]a,b]).

Then

∀_{l∈N} ∀_{j∈{1,…,r−1}} ∃_{b_{p+j}∈K^n}
  u(a+lh, h) − φ(a+lh) = ∑_{j=1}^{r−1} b_{p+j} h^{p+j} + O(h^{p+r})  (for h → 0)   (4.2)

(note that, in contrast to (4.1), the expansion (4.2) is local in a positive neighborhood of a).

Proof. Fix l ∈ N. For each j ∈ {0, …, r−1}, we have the Taylor expansion (recalling c_{p+j}(a) = 0)

c_{p+j}(a+lh) = ∑_{k=1}^{r−j−1} c_{p+j}^{(k)}(a) (lh)^k / k! + O(h^{r−j})  (for h → 0),

which, plugged into (4.1), yields

u(a+lh, h) − φ(a+lh) = ∑_{j=0}^{r−1} c_{p+j}(a+lh) h^{p+j} + O(h^{p+r})
  = ∑_{j=0}^{r−1} (∑_{k=1}^{r−j−1} c_{p+j}^{(k)}(a) (lh)^k / k! + O(h^{r−j})) h^{p+j} + O(h^{p+r})
  (∗)
  = ∑_{s=1}^{r−1} (∑_{k=1}^{s} c_{p+s−k}^{(k)}(a) (lh)^k / k!) h^{p+s−k} + O(h^{p+r})
  = ∑_{s=1}^{r−1} (∑_{k=1}^{s} c_{p+s−k}^{(k)}(a) l^k / k!) h^{p+s} + O(h^{p+r})  (for h → 0),

where, for the equality at (∗), note that s = j + k ⇔ j = s − k. Thus, we have found (4.2) to hold with b_{p+j} := ∑_{k=1}^{j} c_{p+j−k}^{(k)}(a) l^k / k!.


We will show below (cf. Th. 4.9 and Th. 4.10(c)) that explicit single-step methods admit asymptotic expansions of the global error, provided that the defining function ψ is sufficiently regular and that φ is the solution to an ODE y′ = f(x,y) with f sufficiently regular as well. However, the proof is somewhat involved and needs some preparation. In Lem. 4.5 below, we will start with a first asymptotic expansion result regarding the local truncation error λ(x,h).

Definition 4.4. Let a, b ∈ R, a < b. We call D ⊆ [a,b] × R⁺₀ an admissible ǫ-domain, ǫ > 0, if, and only if,

D = { (x,h) ∈ R² : x ∈ [a,b], 0 ≤ h ≤ min{ǫ, b−x} }.

Then, for η : D → K^n, n ∈ N, we write η ∈ C^p(D,K^n), p ∈ N₀, if, and only if, η is p times continuously differentiable in the interior of D and η as well as all of its partials of order at most p extend continuously to D.

Lemma 4.5. Let p ∈ N, s ∈ N ∪ {∞}. In the situation of Def. 2.3, assume φ : [a,b] → K^n to be (p+s+1) times continuously differentiable on [a,b], and assume ψ : D_ψ → K^n to be a defining function of an explicit single-step method such that the function

η : D_η → K^n,  η(x,h) := ψ(x, φ(x), h),

is defined and (p+s+1) times continuously differentiable on some admissible ǫ-domain D_η, ǫ > 0 (i.e. η ∈ C^{p+s+1}(D_η,K^n) in the sense of Def. 4.4). If, moreover, ψ is consistent of order p, then there exists d ∈ C^s([a,b],K^n) such that the local truncation error satisfies

λ(x,h) = d(x) h^{p+1} + O(h^{p+2})  (for h → 0, uniformly in x ∈ [a,b[)   (4.3)

(cf. (4.1)).

Proof. In consequence of the hypothesis, λ is defined on D_η \ {(b,0)}:

λ : D_η \ {(b,0)} → K^n,  λ(x,h) = φ(x) + h ψ(x,φ(x),h) − φ(x+h).

For fixed x ∈ [a,b[, we apply Taylor's theorem to the function h ↦ λ(x,h) to obtain

∀_{(x,h)∈D_η\{(b,0)}}   λ(x,h) = ∑_{j=0}^{p+1} (∂_h^j λ(x,0) / j!) h^j + ∫_0^h ((h−t)^{p+1} / (p+1)!) ∂_h^{p+2} λ(x,t) dt.

As we also assume the method to be consistent of order p, by possibly making D_η smaller, we may assume, without loss of generality,

∃_{C_λ≥0} ∀_{(x,h)∈D_η\{(b,0)}}   ‖λ(x,h)‖ ≤ C_λ h^{p+1}.

Thus,

∀_{(x,h)∈D_η\{(b,0)}} ∀_{j∈{0,…,p}}   ∂_h^j λ(x,0) = 0,

implying

∀_{(x,h)∈D_η\{(b,0)}}   λ(x,h) = (∂_h^{p+1} λ(x,0) / (p+1)!) h^{p+1} + ∫_0^h ((h−t)^{p+1} / (p+1)!) ∂_h^{p+2} λ(x,t) dt.

In consequence, if we let

d : [a,b] → K^n,  d(x) := ∂_h^{p+1} λ(x,0) / (p+1)! = ∂_h^p ψ(x,φ(x),0) / p! − φ^{(p+1)}(x) / (p+1)!

(where we have used the general Leibniz product rule to obtain the first summand), then, since the continuous (extension of the) function ∂_h^{p+2} λ is bounded on the compact set D_η,

∃_{C≥0} ∀_{x∈[a,b[}   lim sup_{h→0} ‖λ(x,h) − d(x) h^{p+1}‖ / h^{p+2} ≤ C.

As the hypotheses on φ and ψ imply d ∈ C^s([a,b],K^n), the proof is complete.

Given a solution φ to y′ = f(x,y) with f (and ψ) sufficiently regular, the strategy for proving the existence of an asymptotic error expansion will now be to construct the functions c_{p+j} of (4.1) inductively for j = 0, …, r−1 as solutions to an inhomogeneous linear initial value problem

c′ = D_y f(x,φ(x)) c − d_j(x),  c(a) = 0,   (4.4)

where D_y denotes the derivative with respect to the y-components, i.e. D_y f(x,φ(x)) := (∂_{y_k} f_l(x,φ(x))) ∈ M(n,K), and d_j : [a,b] → K^n arises from a representation of the form (4.3) corresponding to a defining function ψ_j, the functions ψ =: ψ_0, ψ_1, …, ψ_r being inductively constructed simultaneously with the c_{p+j}, according to the construction given in the following Not. 4.6:

Notation 4.6. Let [a,b] ⊆ R, a < b, let n, q ∈ N, let ψ : D_ψ → K^n be a (defining) function, D_ψ ⊆ [a,b] × K^n × R⁺₀, and c : [a,b] → K^n another function. Then a new defining function is given by

Ψ(q,c) : D_{Ψ(q,c)} → K^n,  Ψ(q,c)(x,y,h) := ψ(x, y − h^q c(x), h) + (c(x+h) − c(x)) h^{q−1},   (4.5)

where

D_{Ψ(q,c)} = { (x,y,h) ∈ [a,b] × K^n × R⁺₀ : (x, y − h^q c(x), h) ∈ D_ψ, x+h ≤ b }.   (4.6)

The following Lem. 4.7 provides a number of basic properties of the construction from Not. 4.6, which we will have to make use of in our subsequent considerations:

Lemma 4.7. As in Not. 4.6, let [a,b] ⊆ R, a < b, let n, q ∈ N, and consider a defining function ψ : D_ψ → K^n, D_ψ ⊆ [a,b] × K^n × R⁺₀, as well as another function c : [a,b] → K^n. Let ψ_∗ := Ψ(q,c) be defined as in Not. 4.6.


(a) If (x,y,0) ∈ D_ψ, then (x,y,0) ∈ D_{ψ_∗} and

ψ_∗(x,y,0) = ψ(x,y,0).

(b) If ψ is globally L-Lipschitz with respect to y, L ∈ R⁺₀, then so is ψ_∗.

(c) If c is bounded by M ∈ R⁺₀ (i.e. sup{‖c(x)‖ : x ∈ [a,b]} ≤ M, which holds, e.g., if c is continuous) and ψ satisfies (2.15) of Cor. 2.8(b) with φ : [a,b] → K^n and r > 0, then ψ_∗ satisfies (2.15) with φ and r_∗ > 0, i.e.

D_{ψ_∗} ⊇ { (x,y) ∈ [a,b] × K^n : ‖y − φ(x)‖ ≤ r_∗ } × [0, r_∗],

where

r_∗ := min{ r/2, (r/(2M))^{1/q} }.

(d) Assume c(a) = 0. Let η ∈ K^n and, as in Def. 4.2, assume, for each partition ∆ = (x_0, …, x_N) of [a,b] with h_max(∆) < h(ψ), h(ψ) ∈ R⁺, the (unique) global solution

((x_0,y_0), …, (x_N,y_N)) ∈ ([a,b] × K^n)^{N+1}

with y_0 = η exists. Using notation from Not. 4.1 and Def. 4.2, for each x ∈ ]a,b] and each h ∈ H_x ∩ ]0, h(ψ)[, denote

y_0^h := z_0^h := η,
∀_{k∈{0,1,…,N(h)−1}}   y_{k+1}^h := y_k^h + h ψ(x_k^h, y_k^h, h),  z_{k+1}^h := z_k^h + h ψ_∗(x_k^h, z_k^h, h),

as well as

u(ψ), u(ψ_∗) : U → K^n,  u(ψ)(x,h) := y_{N(h)}^h,  u(ψ_∗)(x,h) := z_{N(h)}^h.

Then, for h < h(ψ), the z_k^h, k ∈ {0,1,…,N(h)}, are well-defined, u(ψ_∗) is well-defined, and

∀_{(x,h)∈U}   u(ψ_∗)(x,h) = u(ψ)(x,h) + c(x) h^q.   (4.7)

Proof. (a) is immediate from (4.5).

(b): If ψ is globally L-Lipschitz with respect to y and (x,y,h), (x,ȳ,h) ∈ D_{ψ_∗}, then

‖ψ_∗(x,y,h) − ψ_∗(x,ȳ,h)‖
  (4.5)
  = ‖ψ(x, y − h^q c(x), h) − ψ(x, ȳ − h^q c(x), h)‖ ≤ L ‖y − ȳ‖,

showing ψ_∗ to be L-Lipschitz with respect to y as well.

(c): If (x,y,h) ∈ [a,b] × K^n × R⁺₀ such that ‖y − φ(x)‖ ≤ r_∗ and h ≤ r_∗, then h < r and

‖y − h^q c(x) − φ(x)‖ ≤ ‖h^q c(x)‖ + ‖y − φ(x)‖ ≤ (r/(2M)) M + r/2 = r,

showing (x, y − h^q c(x), h) ∈ D_ψ and (x,y,h) ∈ D_{ψ_∗}.


(d): Clearly, it suffices to show that, for each (x,h) ∈ ]a,b] × (H_x ∩ ]0, h(ψ)[), we have

∀_{k∈{0,…,N(h)−1}}   (x_k^h, z_k^h, h) ∈ D_{ψ_∗},   ∀_{k∈{0,…,N(h)}}   z_k^h = y_k^h + c(x_k^h) h^q,

which we show via induction on k: As we have y_0^h = z_0^h = η, c(a) = 0 yields the base case k = 0. For the induction step, let k ∈ {0, …, N(h)−1} and assume (x_k^h, z_k^h, h) ∈ D_{ψ_∗} as well as z_k^h = y_k^h + c(a + kh) h^q via induction hypothesis. We then compute

z_{k+1}^h = z_k^h + h ψ_∗(x_k^h, z_k^h, h)
  (ind. hyp.)
  = y_k^h + c(x_k^h) h^q + h ψ_∗(x_k^h, y_k^h + c(x_k^h) h^q, h)
  (4.5)
  = y_k^h + c(x_k^h) h^q + h ψ(x_k^h, y_k^h, h) + h (c(x_{k+1}^h) − c(x_k^h)) h^{q−1}
  = y_{k+1}^h + c(x_{k+1}^h) h^q.

If k+1 ≤ N(h)−1, then we also note

(x_{k+1}^h, z_{k+1}^h − h^q c(x_{k+1}^h), h) = (x_{k+1}^h, y_{k+1}^h, h) ∈ D_ψ,  x_{k+1}^h + h ≤ b,

showing (x_{k+1}^h, z_{k+1}^h, h) ∈ D_{ψ_∗} and completing the induction.

The following Lem. 4.8 shows that, under suitable technical hypotheses, choosing c according to (4.4) and applying the construction of Not. 4.6 will allow to increase the order of consistency from q for the defining function ψ to q+1 for ψ_∗ := Ψ(q,c). It is underlined that this is meant as a technical construction for the purpose of proving the existence of an asymptotic expansion of the global error rather than as a construction that would be useful in practical applications to improve the accuracy of an approximation method: The new method will depend on the unknown function φ that one wants to approximate!

Lemma 4.8. Let q ∈ N. In the situation of Def. 2.3, assume φ : [a,b] → K^n to be a solution to y′ = f(x,y) with f ∈ C²(G,K^n), G ⊆ R × K^n open, and assume ψ ∈ C²(D_ψ,K^n) to be the defining function of an explicit single-step method, where

D_ψ = { (x,y) ∈ [a,b] × K^n : ‖y − φ(x)‖ ≤ r } × [0,r],  r ∈ R⁺   (4.8)

(cf. (2.15)). If, moreover, ψ is consistent of order q and satisfies (4.3) with p replaced by q, i.e.

λ(x,h) = d(x) h^{q+1} + O(h^{q+2})  (for h → 0, uniformly in x ∈ [a,b[)   (4.9)

with some d ∈ C¹([a,b],K^n), then ψ_∗ := Ψ(q,c), defined according to Not. 4.6 with c ∈ C²([a,b],K^n) being a solution to the inhomogeneous linear ODE

c′ = D_y f(x,φ(x)) c − d(x)   (4.10)

of the initial value problem (4.4), is consistent of order q+1.


Proof. Note that, if c is the solution to (4.10), then, indeed, c ∈ C²([a,b],K^n): As f is C², D_y f is C¹, φ is C³ by Prop. B.1, showing the right-hand side of the ODE for c to be C¹, i.e. c is C², again by Prop. B.1. Also note c to be defined on all of [a,b] by [Phi16c, Th. 4.8].

Due to (4.8) and Lem. 4.7(c), we may assume

D_{ψ_∗} = { (x,y) ∈ [a,b] × K^n : ‖y − φ(x)‖ ≤ r_∗ } × [0, r_∗]

with r_∗ > 0 given by Lem. 4.7(c). Moreover, the continuous functions

(x,h) ↦ ψ(x,φ(x),h),  (x,h) ↦ ψ_∗(x,φ(x),h),  (x,h) ↦ D_y ψ(x,φ(x),h),
(x,h) ↦ D_y² ψ(x,φ(x),h),  (x,h) ↦ ∂_h D_y ψ(x,φ(x),h)

are all defined and bounded on the (compact) admissible r_∗-domain (cf. Def. 4.4)

D_∗ := { (x,h) ∈ R² : x ∈ [a,b], 0 ≤ h ≤ min{r_∗, b−x} }.

For the local truncation error corresponding to ψ_∗, we obtain, for each (x,h) ∈ D_∗,

λ_∗(x,h) = φ(x) + h ψ_∗(x,φ(x),h) − φ(x+h)
  = φ(x) − φ(x+h) + h ψ(x, φ(x) − h^q c(x), h) + (c(x+h) − c(x)) h^q
  = λ(x,h) − h ψ(x,φ(x),h) + h ψ(x, φ(x) − h^q c(x), h) + (c(x+h) − c(x)) h^q
  = λ(x,h) + (c(x+h) − c(x)) h^q + h R(x,h),   (4.11)

with the remainder term

R(x,h) := ψ(x, φ(x) − h^q c(x), h) − ψ(x,φ(x),h).

For each x ∈ [a,b[ and h ∈ [0, b−x], we have the Taylor expansion

c(x+h) − c(x) = h c′(x) + ∫_0^1 (1−t) h² c″(x+th) dt = h c′(x) + O(h²)
  (for h → 0, uniformly in x ∈ [a,b[),   (4.12)

as c″ is bounded on [a,b]. Next, we use a Taylor expansion to rewrite R(x,h): We use Taylor's theorem on ψ, where we treat ψ as a function of its second argument, fixing the first and last variables as parameters, to obtain

∀_{(x,h)∈D_∗\{(b,0)}}   R(x,h) = ψ(x, φ(x) − h^q c(x), h) − ψ(x,φ(x),h)
  = −D_y ψ(x,φ(x),h) h^q c(x) + O(h^{2q})
  (for h → 0, uniformly in x ∈ [a,b[),   (4.13)

as c is bounded on [a,b] and (x,h) ↦ D_y² ψ(x,φ(x),h) is bounded on D_∗.


To still improve the expression for R(x,h) in (4.13), we use another Taylor expansion, this time for the function h ↦ D_y ψ(x,φ(x),h), to obtain

∀_{(x,h)∈D_∗\{(b,0)}}   D_y ψ(x,φ(x),h) = D_y ψ(x,φ(x),0) + O(h)
  (2.7)
  = D_y f(x,φ(x)) + O(h)  (for h → 0, uniformly in x ∈ [a,b[),   (4.14)

as (x,h) ↦ ∂_h D_y ψ(x,φ(x),h) is bounded on D_∗.

Using (4.9), (4.12), (4.13), (4.14) in (4.11), we obtain, for each (x,h) ∈ D_∗ \ {(b,0)},

λ_∗(x,h) = λ(x,h) + (c(x+h) − c(x)) h^q + h R(x,h)
  = d(x) h^{q+1} + c′(x) h^{q+1} − h D_y ψ(x,φ(x),h) h^q c(x) + O(h^{q+2})
  = (d(x) + c′(x) − D_y f(x,φ(x)) c(x)) h^{q+1} + O(h^{q+2})
  (4.10)
  = O(h^{q+2})  (for h → 0, uniformly in x ∈ [a,b[),   (4.15)

showing ψ_∗ to be consistent of order q+1.

We now have all preparations in place to prove the existence of an asymptotic expansion of the global error. As in [DB08, Sec. 4.3.2], for simplicity, we will assume f and ψ to be C^∞. While one can prove the following theorem under lower regularity assumptions, the proof becomes even more technical and intricate, while the ideas and strategy remain the same.

Theorem 4.9 (Asymptotic Expansion of the (Global) Error). In the situation of Def. 2.3, assume φ : [a,b] → K^n to be a solution to y′ = f(x,y) with f : G → K^n, G ⊆ R × K^n open, and assume ψ : D_ψ → K^n to be the defining function of an explicit single-step method, where

D_ψ = { (x,y) ∈ [a,b] × K^n : ‖y − φ(x)‖ ≤ ǫ } × [0,ǫ],  ǫ ∈ R⁺   (4.16)

(cf. (2.15)). Furthermore assume:

(i) The method given by ψ is consistent of order p for some p ∈ N.

(ii) f ∈ C^∞(G,K^n) and ψ ∈ C^∞(D_ψ,K^n).

Then, for each r ∈ N, the method given by ψ admits an asymptotic expansion of the (global) error u(x,h) − φ(x) of the form (4.1).

Proof. Let η := φ(a). According to Cor. 2.8(b),(a), there exists 0 < h(ψ) ≤ ǫ such that, for each partition ∆ = (x_0, …, x_N) of [a,b] with h_max(∆) < h(ψ), the (unique) global solution

((x_0,y_0), …, (x_N,y_N)) ∈ ([a,b] × K^n)^{N+1}

with y_0 = η exists. Fixing r ∈ N, we need to construct functions

c_{p+j} ∈ C^{r+1−j}([a,b],K^n)  for j ∈ {0, …, r−1}


such that (4.1) holds. We carry out the construction inductively over j, while, simultaneously, also constructing defining functions ψ_0, …, ψ_r of suitable explicit single-step methods. Corresponding to these methods, using notation from Not. 4.1 and Def. 4.2, for each x ∈ ]a,b], each h ∈ H_x ∩ ]0, h(ψ)[, and each j ∈ {0, …, r}, denote

y_0^{j,h} := η,
∀_{k∈{0,1,…,N(h)−1}}   y_{k+1}^{j,h} := y_k^{j,h} + h ψ_j(x_k^h, y_k^{j,h}, h),

as well as

u(ψ_j) : U → K^n,  u(ψ_j)(x,h) := y_{N(h)}^{j,h}.

More precisely, we construct the c_{p+j} and the ψ_j such that the following holds for each j ∈ {0, …, r−1}:

(1) c_{p+j} ∈ C^∞([a,b],K^n) with c_{p+j}(a) = 0.

(2) There exists ǫ_{j+1} ∈ R⁺ such that (4.16) holds with ψ replaced by ψ_{j+1} and ǫ replaced by ǫ_{j+1}.

(3) ψ_{j+1} ∈ C^∞(D_{ψ_{j+1}},K^n).

(4) ψ_{j+1} is globally L_{j+1}-Lipschitz with respect to y, L_{j+1} ∈ R⁺₀.

(5) For each h < h(ψ), the y_k^{j+1,h}, k ∈ {0,1,…,N(h)}, (as defined above) are well-defined, u(ψ_{j+1}) (as defined above) is well-defined, and

∀_{(x,h)∈U}   u(ψ_{j+1})(x,h) = u(ψ_j)(x,h) + c_{p+j}(x) h^{p+j}.   (4.17)

(6) The method given by ψ_{j+1} is consistent of order p+j+1.

We initialize our inductive construction by setting ψ_0 := ψ, ǫ_0 := ǫ, observing that the hypotheses of the theorem are such that ψ_0 satisfies conditions (2) – (6) (except (4.17)) with j+1 replaced by 0 (note that (4) follows from (3), as D_{ψ_{j+1}} is compact by (2)). For the inductive step, let j ∈ {0, …, r−1}. Assume the c_{p+l} have been constructed for l < j and the ψ_l have been constructed for l ≤ j such that conditions (1) – (6) hold for each l < j. We need to construct c_{p+j} and ψ_{j+1} such that (1) – (6) hold. In preparation, we apply Lem. 4.5 to φ and ψ_j with p replaced by p+j and s := ∞: According to (ii), φ is C^∞; according to (6), ψ_j is consistent of order p+j; as a consequence of (2) and (3), the function

η_j : D_{η_j} → K^n,  η_j(x,h) := ψ_j(x,φ(x),h),

is defined and C^∞ on the admissible ǫ_j-domain D_{η_j}. Thus, Lem. 4.5 applies, providing a function d_j ∈ C^∞([a,b],K^n) such that

λ_j(x,h) = d_j(x) h^{p+j+1} + O(h^{p+j+2})  (for h → 0, uniformly in x ∈ [a,b[),   (4.18)

where λ_j denotes the local truncation error corresponding to ψ_j. We now define c_{p+j} to be the solution to the initial value problem (4.4), i.e. to

c′ = D_y f(x,φ(x)) c − d_j(x),  c(a) = 0,

and let ψ_{j+1} := Ψ(p+j, c_{p+j}), with Ψ according to Not. 4.6, i.e.

ψ_{j+1} : D_{ψ_{j+1}} → K^n,
ψ_{j+1}(x,y,h) := ψ_j(x, y − h^{p+j} c_{p+j}(x), h) + (c_{p+j}(x+h) − c_{p+j}(x)) h^{p+j−1}.

As the right-hand side of the above ODE is C^∞, we obtain (1) by Prop. B.1 (noting c_{p+j} to be defined on all of [a,b] by [Phi16c, Th. 4.8]). The validity of (2) is due to Lem. 4.7(c). According to the above definition of ψ_{j+1}, as both ψ_j and c_{p+j} are C^∞, so is ψ_{j+1}, proving (3). But then, as mentioned before, (4) also holds, as ψ_{j+1} is C^∞ on the compact set D_{ψ_{j+1}}. The validity of (5) is due to Lem. 4.7(d). To obtain (6), we employ Lem. 4.8, where (4.9) is provided by (4.18), such that we, indeed, obtain ψ_{j+1} to be consistent of order p+j+1, completing our inductive construction.

Finally, putting everything together, we apply (4.17) to obtain

∀_{(x,h)∈U}   u(ψ_r)(x,h) = u(ψ_0)(x,h) + ∑_{j=0}^{r−1} c_{p+j}(x) h^{p+j}.   (4.19)

On the other hand, as ψ_r is consistent of order p+r, we know from Cor. 2.8 that

u(ψ_r)(x,h) − φ(x) = O(h^{p+r})  (for h → 0, uniformly in x ∈ ]a,b]).   (4.20)

Combining (4.19) and (4.20) yields

u(ψ_0)(x,h) − φ(x) = u(ψ_r)(x,h) − φ(x) − ∑_{j=0}^{r−1} c_{p+j}(x) h^{p+j}
  = −∑_{j=0}^{r−1} c_{p+j}(x) h^{p+j} + O(h^{p+r})  (for h → 0, uniformly in x ∈ ]a,b]),

which is (4.1) (where it is, clearly, not a problem that our functions here are the negatives of the functions in (4.1)).

Theorem 4.10. In the situation of Def. 2.3, assume φ : [a,b] → K^n and f : G → K^n, G ⊆ R × K^n open, to be both continuous with

C := { (x,φ(x)) : x ∈ [a,b] } ⊆ G,

and assume ψ : D_ψ → K^n to be the defining function of an explicit s-stage RK method with weights b_1, …, b_s ∈ R, nodes c_1, …, c_s ∈ R, and RK matrix A := (a_{jl}) ∈ M(s,K). Then the following holds:


(a) There exists ǫ ∈ R⁺ such that (2.15) is satisfied, i.e.

D_ψ ⊇ K := { (x,y) ∈ [a,b] × K^n : ‖y − φ(x)‖ ≤ ǫ } × [0,ǫ].

(b) If ψ is locally Lipschitz with respect to y (f ∈ C¹(G,K^n) is sufficient by Rem. 3.6(b),(c)) and, with respect to φ on [a,b], the method has order of consistency p ∈ N, then, with respect to φ on [a,b], the method has order of convergence p as well.

(c) Under the hypotheses of (b) with the additional assumption that φ is the solution to the initial value problem y′ = f(x,y), φ(a) = η, (a,η) ∈ G, and f ∈ C^∞(G,K^n), we obtain, for each r ∈ N, that the method given by ψ admits an asymptotic expansion of the (global) error u(x,h) − φ(x) of the form (4.1).

Proof. (a): We conduct the proof analogous to the proof of Th. 2.15(b) for the classical explicit RK method: First note that, according to (3.1b) and due to the method being explicit, we have

∀_{j∈{1,…,s}}   k_j(x,y,h) = f(x + c_j h, y + h ∑_{l=1}^{j−1} a_{jl} k_l(x,y,h)).

As in the proof of Lem. 2.11(b), we observe that the continuity of φ implies the compactness of

C := { (x,φ(x)) : x ∈ [a,b] } ⊆ G,

and, using the norm |·| + ‖·‖ on R × K^n, Lem. 2.11(a) yields ǫ_0 > 0 such that C_{ǫ_0} ⊆ G. As f is continuous, ‖f‖ is bounded by some M ∈ R⁺ on the compact set C_{ǫ_0}. Via induction on j ∈ {1,…,s}, we now show that there exist 0 < ǫ_s ≤ ⋯ ≤ ǫ_1 ≤ ǫ_0 such that, for each j ∈ {1,…,s},

C_{ǫ_j} = { (x,y) ∈ R × K^n : dist((x,y), C) ≤ ǫ_j } ⊆ C_{ǫ_0} ⊆ G,

and k_j is defined on K_j := C_{ǫ_j} × [0,ǫ_j]: Letting j ∈ {1,…,s} and ǫ_j := ǫ_{j−1} / (1 + |c_j| + M ‖A‖_∞), k_j is defined on the set K_j: Indeed, if (x,y,h) ∈ K_j, then there exists (t,z) ∈ C such that |x−t| + ‖y−z‖ ≤ ǫ_j. Recalling that ‖A‖_∞ denotes the row sum norm of A, this implies

|x + c_j h − t| + ‖y + h ∑_{l=1}^{j−1} a_{jl} k_l(x,y,h) − z‖ ≤ |x−t| + |c_j| h + ‖y−z‖ + h M ‖A‖_∞
  ≤ ǫ_j + ǫ_j |c_j| + ǫ_j M ‖A‖_∞ = ǫ_{j−1},

showing (x + c_j h, y + h ∑_{l=1}^{j−1} a_{jl} k_l(x,y,h)) ∈ C_{ǫ_0} ⊆ G. In particular, k_j is also bounded by M on K_j, and the induction is complete. In consequence (cf. the proof of Lem. 2.11(b)),

K := { (x,y) ∈ [a,b] × K^n : ‖y − φ(x)‖ ≤ ǫ_s } × [0,ǫ_s] ⊆ K_s ⊆ D_ψ,

proving the validity of (2.15).

(b) is now immediate from (a) in combination with Cor. 2.8(b),(a) (note ψ to be globally Lipschitz with respect to y on the compact set K).

(c) is also immediate from (a) and (b) in combination with Th. 4.9.

4.3 Extrapolation Methods

We now come back to the idea of extrapolation as described in Sec. 4.1 above.

Remark 4.11. In the setting of Def. 4.2 and using Not. 4.1, we consider a finite sequence (i.e. a vector)

h⃗ := (h_0, …, h_M) ∈ (H_x ∩ ]0, h(ψ)[)^{M+1},  h(ψ) > h_0 > ⋯ > h_M > 0,  M ∈ N,

of admissible stepsizes. We claim that, for each p ∈ N, there exists a unique (K^n-valued) polynomial

P := P(ψ, x, h⃗, p),  P : R → K^n,
P(h) = P(ψ, x, h⃗, p)(h) = a_0 + ∑_{j=0}^{M−1} a_{p+j} h^{p+j},  a_0, a_p, …, a_{p+M−1} ∈ K^n,

satisfying the interpolation conditions

∀_{j∈{0,…,M}}   P(h_j) = u(x, h_j),   (4.21)

where u is the function from Def. 4.2: First note that it suffices to show this for R-valued polynomials, since we can then apply the R-valued result to each component to obtain the K^n-valued polynomial: Let α ∈ {1,…,n} and let Op stand for either Re (real part) or Im (imaginary part). Then, for the component Q := Op P_α of P, (4.21) yields the M+1 conditions Q(h_j) = Op u_α(x,h_j). For p = 1, we then obtain Q from regular polynomial interpolation (cf. [Phi20, Th. 3.4]). For p > 1, note that Q also satisfies the p−1 conditions

∀_{j∈{1,…,p−1}}   Q^{(j)}(0) = 0,

yielding p+M conditions for the polynomial Q of degree at most p+M−1. However, regular Hermite interpolation (cf. [Phi20, Th. 3.10]) does not(!) apply, as the value for Q(0) is not prescribed. Instead, a unique polynomial satisfying these conditions will be provided by the following Prop. 4.13.

Notation 4.12. Given M, p ∈ N, define the set of polynomials

Pol_M^p(R) := { (P : R → R) : P(h) = a_0 + ∑_{j=0}^{M−1} a_{p+j} h^{p+j}; a_0, a_p, …, a_{p+M−1} ∈ R }.


Proposition 4.13. Let M, p ∈ N. Given (h_0,u_0), …, (h_M,u_M) ∈ R⁺ × R such that h_k ≠ h_l for k ≠ l, there exists a unique polynomial P ∈ Pol_M^p(R) such that

∀_{j∈{0,…,M}}   P(h_j) = u_j.   (4.22)

Proof. Each polynomial P ∈ Pol_M^p(R) corresponds to the unique pair

(P(0), h^{−p}(P − P(0))) ∈ R × Pol_{M−1}(R)

via the bijective linear map

Φ : Pol_M^p(R) → R × Pol_{M−1}(R),  Φ(a_0 + ∑_{j=0}^{M−1} a_{p+j} h^{p+j}) := (a_0, ∑_{j=0}^{M−1} a_{p+j} h^j).

We define the linear map

γ : Pol_{M−1}(R) → R^{M+1},  γ(Q) := (h_0^p Q(h_0), …, h_M^p Q(h_M)).

If γ(Q) = 0, then Q has M+1 distinct zeros, implying Q ≡ 0, as deg Q ≤ M−1. Thus, ker γ = {0} and γ is injective. In consequence, letting 1 := (1,…,1) ∈ R^{M+1} and u := (u_0,…,u_M) ∈ R^{M+1}, (4.22) is equivalent to

u = a_0 1 + γ(Q),  (a_0, Q) = Φ(P),   (4.23)

and we need to show

R^{M+1} = span{1} ⊕ Im γ,   (4.24)

which, due to (4.23), yields both existence and uniqueness of P (cf. [Phi19b, Prop. 5.2(ii)]). As we also know M = dim Pol_{M−1}(R) = dim ker γ + dim Im γ (cf. [Phi19a, Th. 6.8(a)]), we have dim Im γ = M, i.e., for (4.24), it remains to show 1 ∉ Im γ. Seeking a contradiction, assume 1 ∈ Im γ, implying

∃_{Q∈Pol_{M−1}(R)} ∀_{j∈{0,…,M}}   h_j^p Q(h_j) = 1.   (4.25)

Recalling the h_j to be positive, (4.25) yields that Q of degree at most M−1 interpolates the function g : R⁺ → R, g(h) := h^{−p}, at the M points h_0, …, h_{M−1}. Thus, we can employ the corresponding error formula of regular polynomial interpolation (cf. [Phi20, (3.31)]) to obtain

∀_{h∈R⁺} ∃_{ξ(h)∈R⁺}   g(h) = Q(h) + (g^{(M)}(ξ(h)) / M!) (h − h_0) ⋯ (h − h_{M−1}).

However, up to a nonzero factor, g^{(M)} is h ↦ h^{−(p+M)} > 0, showing g(h_M) ≠ Q(h_M). This contradiction to (4.25) shows 1 ∉ Im γ and completes the proof.


Theorem 4.14. In the setting of Def. 4.2 and using Not. 4.1, assume the explicit single-step method given by the defining function ψ : D_ψ → K^n, D_ψ ⊆ R × K^n × R, admits an asymptotic expansion of the global error as in (4.1), i.e.

∃_{p,r∈N} ∀_{j∈{0,…,r−1}} ∃_{c_{p+j}∈C^{r+1−j}([a,b],K^n), c_{p+j}(a)=0}
  u(x,h) − φ(x) = ∑_{j=0}^{r−1} c_{p+j}(x) h^{p+j} + O(h^{p+r})
  (for h → 0, uniformly in x ∈ ]a,b]).

If we take some

N⃗ := (N_0, …, N_M) ∈ N^{M+1},  N_0 < ⋯ < N_M,  1 ≤ M ≤ r,   (4.26a)

which, for each h ∈ H_x ∩ ]0, h(ψ)[, yields some

h⃗(N⃗) := (h_0, …, h_M) ∈ (H_x ∩ ]0, h(ψ)[)^{M+1},  h_j := h/N_j,   (4.26b)

as in Rem. 4.11, then there exist coefficients b_j ∈ R, j ∈ {p+M, …, p+r−1}, not depending on ψ, x, and h (but, in general, depending on N⃗ and p) such that the error between P(0) and φ(x) (where P : R → K^n is the polynomial defined in Rem. 4.11) can be written in the form

P(ψ, x, h⃗(N⃗), p)(0) − φ(x) = ∑_{j=p+M}^{p+r−1} b_j c_j(x) h^j + O(h^{p+r})
  (for h → 0, uniformly in x ∈ ]a,b]).   (4.27)

Proof. It suffices to show (4.27) for each component α ∈ {1,…,n}, as long as the resulting b_j do not depend on α. Thus, let α ∈ {1,…,n}. Then the interpolation condition (4.21) can be written as a linear system in matrix-vector form as

A_M · (a_{0α}, a_{pα} h^p, a_{p+1,α} h^{p+1}, …, a_{p+M−1,α} h^{p+M−1})^t
  = (u_α(x,h_0), u_α(x,h_1), …, u_α(x,h_M))^t,   (4.28a)

where A_M ∈ M(M+1,R) denotes the matrix whose row with index j ∈ {0,…,M} is

(1, 1/N_j^p, 1/N_j^{p+1}, …, 1/N_j^{p+M−1}).

As we know this linear system to have a unique solution according to Prop. 4.13 and Rem. 4.11, the matrix A_M must be invertible. On the other hand, we have the asymptotic expansion (4.1b) and evaluating component α of the expression in (4.1b) at the h_j yields, in matrix-vector form,

A_M · (φ_α(x), c_{pα}(x) h^p, c_{p+1,α}(x) h^{p+1}, …, c_{p+M−1,α}(x) h^{p+M−1})^t
  = (u_α(x,h_0), u_α(x,h_1), …, u_α(x,h_M))^t − r_α(x,h),   (4.28b)


where

r_α(x,h) = ∑_{j=p+M}^{p+r−1} (1/N_0^j, 1/N_1^j, …, 1/N_M^j)^t c_{jα}(x) h^j + O(h^{p+r})
  (for h → 0, uniformly in x ∈ ]a,b]).

Subtracting (4.28b) from (4.28a) yields

A_M · (a_{0α} − φ_α(x), (a_{pα} − c_{pα}(x)) h^p, (a_{p+1,α} − c_{p+1,α}(x)) h^{p+1}, …, (a_{p+M−1,α} − c_{p+M−1,α}(x)) h^{p+M−1})^t = r_α(x,h).

We now multiply the above equation by A_M^{−1} and claim that the first row of the resulting equation completes the proof of (4.27): Indeed, denoting the entries of the first row of A_M^{−1} by a_{M,1,l}^{−1}, l ∈ {0,…,M}, we obtain

P_α(ψ, x, h⃗(N⃗), p)(0) − φ_α(x) = a_{0α} − φ_α(x)
  = ∑_{l=0}^{M} ∑_{j=p+M}^{p+r−1} a_{M,1,l}^{−1} c_{jα}(x) h^j / N_l^j + O(h^{p+r})
  (for h → 0, uniformly in x ∈ ]a,b])

and, setting

∀_{j∈{p+M,…,p+r−1}}   b_j := ∑_{l=0}^{M} a_{M,1,l}^{−1} / N_l^j,

we have (4.27) (noting the b_j to be independent of α, ψ, x, and h).

Remark 4.15. Note that, according to (4.27), the approximation given by the extrapolation P(ψ, x, h⃗(N⃗), p)(0) now has order of convergence p+M in the sense that

∃_{C>0} ∀_{x∈]a,b]}   lim_{h→0} ‖P(ψ, x, h⃗(N⃗), p)(0) − φ(x)‖ / h^{p+M}
  = lim_{h→0} ‖∑_{j=p+M}^{p+r−1} b_j c_j(x) h^j‖ / h^{p+M} = |b_{p+M}| ‖c_{p+M}(x)‖ ≤ C.

Remark 4.16. Among others, the following rules have been applied in the literature to obtain decreasing sequences (h_j)_{j∈N₀} of stepsizes for extrapolation (cf. Rem. 4.11 and (4.26)). All sequences start with a given base stepsize h_0 := h > 0 and then apply different rules:

Harmonic Sequence:   ∀_{j∈N}   h_j := h_0 / (j+1).

Romberg Sequence:   ∀_{j∈N}   h_j := h_{j−1} / 2.

Bulirsch Sequence:   h_1 := h_0/2,  h_2 := h_0/3,  h_3 := h_0/4,  ∀_{j≥4}   h_j := h_{j−2} / 2.
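The three rules are easily tabulated; the following small Python sketch (our own illustration) prints the first members of the Bulirsch sequence in exact rational arithmetic:

from fractions import Fraction

def harmonic(h0, m):
    return [h0 / (j + 1) for j in range(m)]   # h0, h0/2, h0/3, ...

def romberg(h0, m):
    return [h0 / 2**j for j in range(m)]      # h0, h0/2, h0/4, ...

def bulirsch(h0, m):
    hs = [h0, h0 / 2, h0 / 3]                 # h0, h0/2, h0/3, then h_j = h_{j-2}/2
    while len(hs) < m:
        hs.append(hs[-2] / 2)
    return hs[:m]

print([str(h) for h in bulirsch(Fraction(1), 7)])
# ['1', '1/2', '1/3', '1/4', '1/6', '1/8', '1/12']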

In general, one can obtain the a_0, a_p, …, a_{p+M−1} from the linear systems (4.28a). However, as we are only interested in the value P(0), if possible, it is more efficient to compute P(0) directly, without computing the polynomial's coefficients. For p = 1, one can achieve this using a so-called Neville tableau, which was presented in the context of Hermite interpolation in [Phi20, Rem. 3.14]. Here, we specialize [Phi20, Rem. 3.14] to the situation of distinct x_0, …, x_M:

Remark 4.17. Let M ∈ N and consider points

(x_0,y_0), (x_1,y_1), …, (x_M,y_M) ∈ R²,

where x_0, …, x_M are all distinct. Let P ∈ Pol_M(R) denote the unique polynomial of degree ≤ M, satisfying the conditions

∀_{j∈{0,…,M}}   P(x_j) = y_j.

Fix x ∈ R. According to [Phi20, Rem. 3.14], if we define, recursively,

P_{α0} := y_α  for each α ∈ {0,…,M},
P_{αβ} := ((x − x_{α−β}) P_{α,β−1} − (x − x_α) P_{α−1,β−1}) / (x_α − x_{α−β})  for each 1 ≤ β ≤ α ≤ M,

then P_{MM} = P(x) and P_{MM} can be computed, in O(M²) steps, via the following Neville tableau:

y_0 = P_{00}
y_1 = P_{10} → P_{11}
y_2 = P_{20} → P_{21} → P_{22}
⋮
y_{M−1} = P_{M−1,0} → P_{M−1,1} → … → P_{M−1,M−1}
y_M = P_{M0} → P_{M1} → … → P_{M,M−1} → P_{MM},

where each entry P_{αβ} is computed from its left neighbor P_{α,β−1} and the entry P_{α−1,β−1} diagonally above it. As one can apply the above recursion componentwise, it still holds if the y_j and P are K^n-valued, n ∈ N.
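A direct transcription of this recursion into Python might look as follows (a sketch of ours; the name neville is our own). It overwrites a single array column by column, which is the usual way of evaluating the tableau in O(M²) operations:

import numpy as np

def neville(xs, ys, x):
    """Evaluate at x the interpolating polynomial through (xs[j], ys[j]),
    via the Neville recursion P_{a,b} from Rem. 4.17."""
    P = np.array(ys, dtype=float)          # column beta = 0
    M = len(xs) - 1
    for beta in range(1, M + 1):           # build columns beta = 1, ..., M
        for a in range(M, beta - 1, -1):   # bottom-up, so P[a-1] still holds column beta-1
            P[a] = ((x - xs[a - beta]) * P[a] - (x - xs[a]) * P[a - 1]) \
                   / (xs[a] - xs[a - beta])
    return P[M]                            # P_{MM} = P(x)

In the extrapolation context below, neville(hs, us, 0.0) is exactly the evaluation of the interpolating polynomial at h = 0 from Rem. 4.11 (for p = 1).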


Example 4.18. Suppose, we want to improve a single-step method given by some ψ at x = b by extrapolation with p = 1 and M = 2, using the Romberg sequence according to Rem. 4.16 for some sufficiently small and admissible h > 0. Employing a Neville tableau as in Rem. 4.17, we have to find P(0) = P_{22}, where the polynomial P of degree at most 2 satisfies conditions (4.21), which, in the current situation, are

∀_{j∈{0,1,2}}   P(h_j) = u(b, h_j) = y_{N(h_j)}^{h_j},  N(h_j) = (b−a)/h_j.

We obtain P(0) from a Neville tableau as follows:

P_{00} = u(b,h),
P_{10} = u(b,h/2),   P_{11} = (−h P_{10} + (h/2) P_{00}) / (h/2 − h) = 2u(b,h/2) − u(b,h),
P_{20} = u(b,h/4),   P_{21} = (−(h/2) P_{20} + (h/4) P_{10}) / (h/4 − h/2) = 2u(b,h/4) − u(b,h/2),

P_{22} = (−h P_{21} + (h/4) P_{11}) / (h/4 − h)
  = (4/3)(2u(b,h/4) − u(b,h/2)) − (1/3)(2u(b,h/2) − u(b,h))
  = (1/3)(8u(b,h/4) − 6u(b,h/2) + u(b,h)).

If Th. 4.14 applies with r = M = 2, then (4.27) yields

P(0) − φ(b) = O(h³)  for h → 0.
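Combining the euler sketch from Sec. 4.1 with neville from Rem. 4.17 (both our own illustrations, not part of the original notes), the O(h³) behavior can be observed numerically:

import numpy as np

b, exact = 1.0, np.exp(-1.0)              # y' = -y, y(0) = 1, so phi(b) = e^{-1}
for h in [0.1, 0.05, 0.025]:
    hs = [h, h / 2, h / 4]                # Romberg stepsizes
    us = [euler(lambda x, y: -y, 0.0, 1.0, b, round(b / hj)) for hj in hs]
    P0 = neville(hs, us, 0.0)             # equals (8u(b,h/4) - 6u(b,h/2) + u(b,h))/3
    print(h, abs(P0 - exact))             # errors drop roughly by a factor of 8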

Example 4.19. Suppose, we want to improve a single-step method given by some ψ at x ∈ ]a,b] by extrapolation with p ≥ 1 and M = 1, using some sufficiently small and admissible h > 0, letting h_0 := h, h_1 := h_0/N, N ∈ N, N ≥ 2. We have to find P(0), where the polynomial P of degree at most p satisfies conditions (4.21), which, in the current situation, are

P(h) = u(x,h),  P(h/N) = u(x,h/N).

For each α ∈ {1,…,n}, we can obtain a_{0α} and a_{pα} from (4.28a), which, in the current situation, reads

a_{0α} + a_{pα} h^p = u_α(x,h),
a_{0α} + a_{pα} h^p / N^p = u_α(x,h/N),

with the solution

P_α(0) = a_{0α} = u_α(x,h/N) + (u_α(x,h/N) − u_α(x,h)) / (N^p − 1),
a_{pα} = (u_α(x,h) − u_α(x,h/N)) / (h^p (1 − 1/N^p)).

If Th. 4.14 applies with r = 2, then (4.27) yields

P(0) − φ(x) = b_{p+1} c_{p+1}(x) h^{p+1} + O(h^{p+2})  for h → 0.
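This closed form is the basis of the error estimator in Sec. 4.4 below; as a one-line sketch (ours, with a hypothetical name):

def extrapolate_once(u_h, u_hN, N, p):
    """P(0) for M = 1: combine approximations with stepsizes h and h/N,
    where the underlying method has order p (cf. Ex. 4.19)."""
    return u_hN + (u_hN - u_h) / (N**p - 1)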


4.4 Stepsize Control

As before, let us consider an explicit single-step method given via a defining function ψ, for the initial value problem y′ = f(x,y), y(ξ) = η. Also, as before, assume the method to have global solutions for sufficiently small stepsizes 0 < h_k < h(ψ). Then, given a partition ∆ = (x_0, …, x_N) of [ξ,b] with h_max(∆) < h(ψ), ψ provides approximations y_0, …, y_N according to the recursion (2.2).

We are now interested in using adaptive stepsizes h_k. More precisely, we want to vary the stepsizes h_k, aiming at keeping the error in a prescribed range (reducing h_k if the error is too large, enlarging h_k (e.g. to make the computation more efficient) if the error is smaller than required). In general, this is a difficult problem and often involves the use of heuristics. Many different variants exist in the literature. Here, we follow [Pla06, Sec. 7.7]. The idea is to modify (2.2) such that, instead of using h_k in the kth step, we use h_k/2 twice. Then we use an extrapolation as in the previous section to estimate the local error, and adjust h_k if necessary. We will now provide the details to carry out this idea. The described modification of (2.2) leads to the recursion

y_0 := η,  x_0 := ξ,
∀_{k∈{0,…,N−1}}   w_k := y_k + (h_k/2) ψ(x_k, y_k, h_k/2),
∀_{k∈{0,…,N−1}}   y_{k+1} := w_k + (h_k/2) ψ(x_k + h_k/2, w_k, h_k/2),  x_{k+1} = x_k + h_k.

As before, we assume the initial value problem under consideration to have a unique exact solution φ, which we assume to be defined at the current point x_k ≥ ξ. Given an approximation y_k ≈ φ(x_k), the desired new stepsize h_k should be such that

‖y_{k+1} − γ(x_k + h_k)‖ ≈ ǫ,   (4.29)

where ‖·‖ denotes some arbitrary norm on K^n, ǫ > 0 is given, and γ is the exact solution to the initial value problem

y′ = f(x,y),  y(x_k) = y_k,

which is assumed to be unique and defined at x_k + h_k. As γ is not known, one will have to use a numerical approximation of γ to test the validity of (4.29). Thus, obtaining the new stepsize h_k involves iteration, starting with some initial guess h_k^{(0)} (where one would use h_k^{(0)} := h_{k−1} for k ≥ 1, and h_0^{(0)} can be set according to Rem. 4.21(a) below):

(a) Compute

w_k^{(l)} := y_k + (h_k^{(l)}/2) ψ(x_k, y_k, h_k^{(l)}/2),
y_{k+1}^{(l)} := w_k^{(l)} + (h_k^{(l)}/2) ψ(x_k + h_k^{(l)}/2, w_k^{(l)}, h_k^{(l)}/2).


(b) Compute an estimation of the error

δ_k^{(l)} ≈ ‖y_{k+1}^{(l)} − γ(x_k + h_k^{(l)})‖

(this can be done by an extrapolation according to (4.31) below). Stop the iteration and return h_k := h_k^{(l)} if

c_1 ǫ ≤ δ_k^{(l)} ≤ c_2 ǫ,

where 0 < c_1 < 1 < c_2 are prescribed constants.

(c) Set h_k^{(l+1)} < h_k^{(l)} if δ_k^{(l)} > c_2 ǫ and h_k^{(l+1)} > h_k^{(l)} if δ_k^{(l)} < c_1 ǫ, using some suitable rule (a possible rule will be given in (4.39) below). Proceed to (a) for the next iteration step.

We now describe how to obtain δ_k^{(l)} for (b): First we approximate z_k ≈ γ(x_k + h_k^{(l)}), using an extrapolation as in Ex. 4.19 with h_0 := h_k^{(l)} and h_1 := h_0/2. In the present situation, the formula for P(0) of Ex. 4.19 becomes

z_k = P(0) = y_{k+1}^{(l)} + (y_{k+1}^{(l)} − v_k^{(l)}) / (2^p − 1),  v_k^{(l)} := y_k + h_k^{(l)} ψ(x_k, y_k, h_k^{(l)}),   (4.30)

and, thus,

δ_k^{(l)} = ‖y_{k+1}^{(l)} − z_k‖ = ‖y_{k+1}^{(l)} − v_k^{(l)}‖ / (2^p − 1).   (4.31)

The rule for obtaining h_k^{(l+1)} in (c) above can be based on the following result:

Lemma 4.20. If Prop. 4.3 applies with p ∈ N and r = 2 to ψ and φ replaced by the solution γ : [x_k,b] → K^n to the initial value problem y′ = f(x,y), y(x_k) = y_k, at x := x_k + h ≤ b, then, writing

w(h) := y_k + (h/2) ψ(x_k, y_k, h/2),
u(h) := w(h) + (h/2) ψ(x_k + h/2, w(h), h/2),

and noting that, in (4.31), δ_k^{(l)} = δ_k^{(l)}(h_k^{(l)}) can be seen as a function of h_k^{(l)},

‖u(h) − γ(x_k + h)‖ = (h / h_k^{(l)})^{p+1} δ_k^{(l)}(h_k^{(l)}) + O((h_k^{(l)})^{p+2})
  (for h, h_k^{(l)} → 0 with 0 < h ≤ h_k^{(l)}).   (4.32)

Proof. We apply Prop. 4.3 with l := j := 1 to obtain b_{p+1} ∈ K^n such that

u(h) − γ(x_k + h) = u(x_k + h, h) − γ(x_k + h) = b_{p+1} h^{p+1} + O(h^{p+2})  (for h → 0).   (4.33)


Applying Th. 4.14 to (4.30) yields

z_k − γ(x_k + h) = β c_{p+1}(x_k + h) h^{p+1} + O(h^{p+2})  (for h → 0)   (4.34)

with β ∈ R and c_{p+1} ∈ C²([x_k,b],K^n), c_{p+1}(x_k) = 0. A Taylor expansion of c_{p+1} then provides

c_{p+1}(x_k + h) = c_{p+1}(x_k) + O(h) = O(h)  (for h → 0),

which, in (4.34), yields

z_k − γ(x_k + h) = O(h^{p+2})  (for h → 0).   (4.35)

Using (4.35) in (4.33) results in

u(h) − z_k = b_{p+1} h^{p+1} + O(h^{p+2})  (for h → 0).   (4.36)

Writing h := h_k^{(l)} in (4.36) together with

δ_k^{(l)}(h_k^{(l)})
  (4.31)
  = ‖y_{k+1}^{(l)} − z_k‖ = ‖u(h_k^{(l)}) − z_k‖

yields

| ‖b_{p+1}‖ − δ_k^{(l)}(h_k^{(l)}) / (h_k^{(l)})^{p+1} | ≤ ‖b_{p+1} − (u(h_k^{(l)}) − z_k) / (h_k^{(l)})^{p+1}‖
  (4.36)
  = O(h_k^{(l)})

and

‖b_{p+1}‖ = δ_k^{(l)}(h_k^{(l)}) / (h_k^{(l)})^{p+1} + O(h_k^{(l)})  (for h_k^{(l)} → 0).   (4.37)

Employing (4.37) in (4.33) leads to

‖u(h) − γ(x_k + h)‖ = (h / h_k^{(l)})^{p+1} δ_k^{(l)} + O(h_k^{(l)}) h^{p+1} + O(h^{p+2})
  = (h / h_k^{(l)})^{p+1} δ_k^{(l)} + O((h_k^{(l)})^{p+2})
  (for h, h_k^{(l)} → 0 with 0 < h ≤ h_k^{(l)}),

thereby proving (4.32).

If Lem. 4.20 applies and

δ_k^{(l)} ≈ ǫ ≫ (h_k^{(l)})^{p+2},  i.e.  h_k^{(l)} ≪ ǫ^{1/(p+2)},   (4.38)

then it makes sense to use (4.32) to obtain, by neglecting the remainder term,

h_k^{(l+1)} := (ǫ / δ_k^{(l)})^{1/(p+1)} h_k^{(l)}   (4.39)

for the new test stepsize in (c) above.


Remark 4.21. (a) In view of (4.38), one might choose for the initial test stepsize

h_0^{(0)} := α ǫ^{1/(p+2)}  with 0 < α < 1, e.g. α = 1/2 or α = 1/10.

(b) In general, there is no guarantee that the above iteration for finding the new stepsize h_k terminates. It terminates if, and only if, the condition in (b) is satisfied for some l ∈ N₀. In practice, one would set some bound L ∈ N for l and stop the iteration for l = L even if the condition in (b) fails to hold. One might then stop the method, as the stepsize control has failed, or one might revert to some default value for h_k, depending on some criterion for the severity of the failure.
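Putting (a) – (c) together with the estimator (4.31) and the update rule (4.39) yields the following sketch of one adaptive step (entirely our own illustration; c1, c2 and the iteration bound L are the hypothetical tuning parameters discussed above, and the fallback behavior of Rem. 4.21(b) is reduced to simply keeping the last attempt):

import numpy as np

def adaptive_step(psi, xk, yk, h, eps, p, c1=0.5, c2=2.0, L=10):
    """One adaptive step following (a)-(c): returns (x_{k+1}, y_{k+1}, accepted h).
    psi(x, y, h) is the defining function of the underlying method of order p."""
    for l in range(L):
        w = yk + (h / 2) * psi(xk, yk, h / 2)            # two half steps
        y1 = w + (h / 2) * psi(xk + h / 2, w, h / 2)
        v = yk + h * psi(xk, yk, h)                       # one full step
        delta = np.linalg.norm(y1 - v) / (2**p - 1)       # error estimate (4.31)
        accept = delta == 0.0 or c1 * eps <= delta <= c2 * eps
        if accept or l == L - 1:   # accept, or give up (cf. Rem. 4.21(b))
            return xk + h, y1, h
        h = (eps / delta) ** (1 / (p + 1)) * h            # update rule (4.39)

For instance, with the explicit Euler method one would pass psi = lambda x, y, h: f(x, y) and p = 1.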

5 Stiff Equations

In Rem. 1.9(b),(c), we discussed the relation between explicit and implicit methods. Wementioned that methods given in explicit form are typically computationally easier andare usually preferred, unless the ODE to be solved defies solution by explicit methods(which means that an accurate solution by explicit methods (usually meaning by explicitRunge-Kutta methods) needs stepsizes that are unacceptably small). Such ODE, as alsoalready mentioned in Rem. 1.9(c), are usually called stiff ODE. A mathematically precisedefinition of the notion of stiff ODE seems difficult, and the literature does not seem tohave reached a consensus on this. Here, we will follow [Pla06, Sec. 8.9].

For simplicity, in this section, we will restrict ourselves to initial value problems, wherethe right-hand side is Rn-valued and defined on the entire space, i.e.

y′ = f(x, y), f : R× Rn −→ Rn, (5.1a)

y(ξ) = η, η ∈ Rn. (5.1b)

Now, typically, (5.1) is stiff if there exists an equilibrium function ψe : I −→ Rn suchthat every solution φ : I −→ Rn to (5.1), ξ ∈ I, appoaches ψe rapidly in the sense thatφ(x) ≈ ψe(x) for each x > ξ + ǫ (x ∈ I) with a small ǫ > 0.

The following Def. 5.1 is an attempt at putting the notion of a stiff ODE into a somewhat more precise form. It is basically reproduced from [Pla06, Sec. 8.9.1].

Definition 5.1. Let ⟨·,·⟩ : R^n × R^n → R denote a fixed scalar product on R^n, n ∈ N, with induced norm ‖·‖ : R^n → R⁺₀. Let ξ ∈ R and let I ⊆ R be a (nontrivial) interval with ξ = min I.

(a) Then f : R × R^n → R^n (and the initial value problem (5.1)) satisfies an upper Lipschitz condition with respect to y and ⟨·,·⟩ on I if, and only if, there exists a continuous function M : I → R such that

∀_{(x,y),(x,ȳ)∈I×R^n}   ⟨f(x,y) − f(x,ȳ), y − ȳ⟩ ≤ M(x) ‖y − ȳ‖².   (5.2)

Moreover, the initial value problem is called dissipative on I if, and only if, (5.2) holds with M ≤ 0 (in the sense that M(x) ≤ 0 for each x ∈ I).


(b) The initial value problem (5.1) is stiff provided that it satisfies the following two conditions (i) and (ii):

(i) The problem is dissipative or, at least, (5.2) holds with an M that does not surpass a moderate positive size, say M ≤ 1.

(ii) The expression on the left-hand side of (5.2) divided by ‖y − ȳ‖² can become strongly negative, i.e.

∀_{x∈I}   m(x) := inf{ ⟨f(x,y) − f(x,ȳ), y − ȳ⟩ / ‖y − ȳ‖² : y, ȳ ∈ R^n, y ≠ ȳ } ≪ 0.

Remark 5.2. Note that the Cauchy-Schwarz inequality implies

∀_{(x,y),(x,ȳ)∈R×R^n, y≠ȳ}   |⟨f(x,y) − f(x,ȳ), y − ȳ⟩| / ‖y − ȳ‖² ≤ ‖f(x,y) − f(x,ȳ)‖ / ‖y − ȳ‖.

In consequence, the condition of Def. 5.1(b)(ii) means that f can be globally L-Lipschitz with respect to y only with a very large Lipschitz constant L ≥ |m(x)|. In particular, for the explicit Euler method and the explicit classical Runge-Kutta method, the constant K provided by (2.14) becomes very large (cf. Th. 2.12 and Th. 2.15), such that reasonable approximations from these methods can only be expected for exceedingly small stepsizes h.

Example 5.3. Let λ ∈ R. We consider (5.1) with n = 1, ξ = 0, and f = f_λ, where

f_λ : R × R → R,  f_λ(x,y) := λy − (1+λ) e^{−x},

i.e. the initial value problem

y′ = λy − (1+λ) e^{−x},   (5.3a)
y(0) = η,  η ∈ R.   (5.3b)

To compare with Def. 5.1 and Rem. 5.2, we compute (using the Euclidean scalar product, which is just multiplication in one dimension)

∀_{(x,y),(x,ȳ)∈R²}   ⟨f_λ(x,y) − f_λ(x,ȳ), y − ȳ⟩ = λ(y − ȳ)² = λ|y − ȳ|²,

obtaining M(x) = m(x) = λ. Thus, Def. 5.1(b) is satisfied for λ ≪ 0. We consider λ = −10 and λ = −1000, and compute approximations using the explicit Euler method (2.1) as well as the implicit Euler method of Ex. 3.5(c), comparing the results with the exact solution

φ_λ : R → R,  φ_λ(x) = e^{−x} + (η−1) e^{λx}

(φ_λ(0) = η is immediate and

φ_λ′(x) = −e^{−x} + λ(η−1) e^{λx} = λ e^{−x} + λ(η−1) e^{λx} − (1+λ) e^{−x} = λ φ_λ(x) − (1+λ) e^{−x},


h yh(1)− φ−10(1) yh(1)− φ−10(1)explicit Euler implicit Euler

2−4 = 0.0625 −1.247 · 10−3 1.308 · 10−3

2−6 ≈ 0.0156 −3.174 · 10−4 3.212 · 10−4

2−8 ≈ 0.0039 −7.971 · 10−5 7.994 · 10−5

2−10 ≈ 0.0010 −1.995 · 10−5 1.996 · 10−5

2−12 ≈ 0.0002 −4.989 · 10−6 4.990 · 10−6

Table 1: Numerical rusults to (5.3) with λ = −10, computed for several stepsizes h bythe explicit Euler method and by the implicit Euler method. The problem is not stiffand both methods perform equally well.

shows φλ is a solution to (5.3a)).

For the numerical comparison, we fix η := 1. Note

η = 1  ⇒  ∀ λ ∈ R ∀ x ∈ R :  φ_λ(x) = e^{−x}.

Given h > 0, the recursion for the explicit Euler method is

y₀ = 1,   ∀ k ∈ N₀ :  y_{k+1} = y_k + h ( λ y_k − (1+λ) e^{−x_k} ).

For the implicit Euler method, the equation for y_{k+1} is

y_{k+1} = y_k + h ( λ y_{k+1} − (1+λ) e^{−x_{k+1}} ),

i.e. the recursion is, for hλ ≠ 1,

y₀ = 1,   ∀ k ∈ N₀ :  y_{k+1} = ( y_k − h (1+λ) e^{−x_{k+1}} ) / (1 − hλ).

    h                  y_h(1) − φ_{−1000}(1)   y_h(1) − φ_{−1000}(1)
                       explicit Euler          implicit Euler
    2^{−4}  = 0.0625    1.283·10^{24}           1.175·10^{−5}
    2^{−6}  ≈ 0.0156    2.865·10^{69}           2.892·10^{−6}
    2^{−8}  ≈ 0.0039    8.014·10^{112}          7.202·10^{−7}
    2^{−10} ≈ 0.0010   −1.797·10^{−7}           1.799·10^{−7}
    2^{−12} ≈ 0.0002   −4.495·10^{−8}           4.496·10^{−8}

Table 2: Numerical results to (5.3) with λ = −1000, computed for several stepsizes h by the explicit Euler method and by the implicit Euler method. The problem is stiff and the explicit method is unstable for the largest three values of h.

We now apply both methods with increasingly small equidistant stepsizes

h = 2^{−4}, 2^{−6}, 2^{−8}, 2^{−10}, 2^{−12},

recomputing the results of [Pla06, Table 8.3], reproduced in Tables 1 and 2 for the reader's convenience, where we write

y_h(1) := y_{1/h} ≈ φ_λ(1) = e^{−1}.

For λ = −10, the problem (5.3) is actually not stiff and the results in Table 1 show that both the explicit and the implicit Euler method produce reasonable results even for the largest stepsize h = 2^{−4}.

For λ = −1000, (5.3) is stiff, and the results in Table 2 show that, while the implicit Euler method performs reasonably for each value of h, the error of the explicit Euler method appears to tend to infinity for the first three values of h, but then becomes reasonably small for h = 2^{−10} and h = 2^{−12} (one says the explicit Euler method is unstable for the larger values of h).
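The computations behind Tables 1 and 2 are easy to redo; the following is a minimal sketch (ours, assuming NumPy is available; the function name euler_errors is hypothetical), using the two recursions derived above, with the implicit Euler step in its closed form valid for hλ ≠ 1:

```python
import numpy as np

def euler_errors(lam, h, eta=1.0):
    """Integrate y' = lam*y - (1+lam)*exp(-x), y(0) = eta, on [0,1] with
    stepsize h; return the errors y_h(1) - phi_lam(1) of the explicit and
    the implicit Euler method (closed-form update, requires h*lam != 1)."""
    n = round(1.0 / h)
    ye = yi = eta
    for k in range(n):
        xk, xk1 = k * h, (k + 1) * h
        ye = ye + h * (lam * ye - (1 + lam) * np.exp(-xk))        # explicit Euler
        yi = (yi - h * (1 + lam) * np.exp(-xk1)) / (1 - h * lam)  # implicit Euler
    exact = np.exp(-1.0)  # phi_lam(1) = e^{-1} for eta = 1
    return ye - exact, yi - exact

for lam in (-10.0, -1000.0):
    for p in (4, 6, 8, 10, 12):
        ee, ei = euler_errors(lam, 2.0 ** (-p))
        print(f"lam={lam:8.1f}  h=2^-{p:<2d}  explicit {ee: .3e}  implicit {ei: .3e}")
```

Up to rounding, the output should match Tables 1 and 2, in particular the explosion of the explicit errors for λ = −1000 and the three largest stepsizes.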

Example 5.4. Let us consider the 2-dimensional example, where n = 2, ξ = 0, and

f : R × R² −→ R²,  f(x,y) := Ay = ( −100 y₁ + y₂ ; −(1/10) y₂ ),   A := ( −100  1 ; 0  −1/10 ),

i.e. the initial value problem

y′₁ = −100 y₁ + y₂,
y′₂ = −(1/10) y₂,   (5.4a)
y(0) = η,  η ∈ R².   (5.4b)

To compare with Def. 5.1 and Rem. 5.2, we compute (using the Euclidean scalar product)

∀ (x,y),(x,ȳ) ∈ R × R² :
⟨f(x,y) − f(x,ȳ), y − ȳ⟩ = ⟨A(y − ȳ), y − ȳ⟩
  = −100 (y₁ − ȳ₁)² + (y₂ − ȳ₂)(y₁ − ȳ₁) − (1/10)(y₂ − ȳ₂)²
  ≤ −( 10 |y₁ − ȳ₁| − (1/√10) |y₂ − ȳ₂| )² ≤ 0 ≤ ‖y − ȳ‖₂².

As we also have, for (y₁,y₂) = (1,0) and (ȳ₁,ȳ₂) = (0,0), that

⟨f(x,y) − f(x,ȳ), y − ȳ⟩ / ‖y − ȳ‖₂² = −100,

but, on the other hand³,

⟨f(x,y) − f(x,ȳ), y − ȳ⟩ ≥ −100 (y₁ − ȳ₁)² − (1/2)( (y₁ − ȳ₁)² + (y₂ − ȳ₂)² ) − (1/10)(y₂ − ȳ₂)² ≥ −(201/2) ‖y − ȳ‖₂²,

we obtain

−201/2 ≤ m(x) = inf{ ⟨f(x,y) − f(x,ȳ), y − ȳ⟩ / ‖y − ȳ‖₂² : y, ȳ ∈ R², y ≠ ȳ } ≤ −100.

Thus, M(x) = 0 and m(x) ≤ −100, i.e. the problem (5.4) has a chance of being stiff according to Def. 5.1(b).

³ Thanks to Julien Ricaud for pointing out this estimate from below.

As in the previous example, the goal is to assess the performance of both the explicit Euler method (2.1) and the implicit Euler method of Ex. 3.5(c), comparing their respective results with the exact solution. As (5.4a) is a linear ODE with constant coefficients, the exact solution can be obtained using corresponding results from the theory of ODE. To this end, one observes A to be diagonalizable,

D := W⁻¹AW = ( −100  0 ; 0  −1/10 ),   W = ( 1  1 ; 0  999/10 ),   W⁻¹ = ( 1  −10/999 ; 0  10/999 ):

Indeed,

W⁻¹AW = ( 1  −10/999 ; 0  10/999 ) ( −100  1 ; 0  −1/10 ) ( 1  1 ; 0  999/10 )
      = ( 1  −10/999 ; 0  10/999 ) ( −100  −100 + 999/10 ; 0  −999/100 )
      = ( −100  0 ; 0  −1/10 ).

As a consequence of [Phi16c, Th. 4.44(b)],

Ψ : R −→ M(2,R),  Ψ(x) := ( e^{−100x}  0 ; 0  e^{−x/10} ),

constitutes a fundamental matrix solution to y′ = Dy, and, by [Phi16c, Th. 4.47],

Φ : R −→ M(2,R),  Φ(x) := WΨ(x),

constitutes a fundamental matrix solution to (5.4a). Thus, by variation of constants [Phi16c, (4.29)] with x₀ = 0, y₀ = η and b(t) ≡ 0, we obtain the following solution to the initial value problem (5.4):

φ : R −→ R²,  φ(x) = Φ(x)Φ⁻¹(0) η = WΨ(x)Ψ⁻¹(0)W⁻¹ η = WΨ(x)W⁻¹ η.

Introducing the abbreviations

v = (v₁ ; v₂) := W⁻¹ η,   w₂₂ := 999/10,

one can rewrite φ as

φ : R −→ R²,  φ(x) = ( 1  1 ; 0  w₂₂ ) ( v₁ e^{−100x} ; v₂ e^{−x/10} ) = e^{−100x} (v₁ ; 0) + e^{−x/10} (v₂ ; w₂₂ v₂).


Given h > 0, the recursion for the explicit Euler method is

y₀ = η,   ∀ k ∈ N₀ :  y_{k+1} = y_k + hAy_k = (Id + hA) y_k,   (5.5)

implying

∀ k ∈ N₀ :  y_k = (Id + hA)^k y₀ = W (Id + hD)^k W⁻¹ η.

Using

∀ k ∈ N₀ :  (Id + hD)^k = ( (1 − 100h)^k  0 ; 0  (1 − h/10)^k )

and the above definitions of v and w₂₂, we obtain

∀ k ∈ N₀ :  y_k = (1 − 100h)^k (v₁ ; 0) + (1 − h/10)^k (v₂ ; w₂₂ v₂).   (5.6)

For h > 1/50, we have |1 − 100h| > 1 and (5.6) implies

v₁ ≠ 0  ⇒  lim_{k→∞} ‖y_k‖₂ = ∞,

whereas lim_{x→∞} ‖φ(x)‖₂ = 0, showing the instability of the explicit Euler method for h > 1/50. When using (5.5) instead of (5.6), numerically, due to roundoff errors, lim_{k→∞} ‖y_k‖₂ = ∞ will even occur for v₁ = 0.

For the implicit Euler method, the equation for y_{k+1} is

y_{k+1} = y_k + hAy_{k+1},

i.e. the recursion is

y₀ = η,   ∀ k ∈ N₀ :  y_{k+1} = (Id − hA)⁻¹ y_k,

implying

∀ k ∈ N₀ :  y_k = ( (Id − hA)⁻¹ )^k y₀ = W ( (Id − hD)⁻¹ )^k W⁻¹ η.

Using

∀ k ∈ N₀ :  ( (Id − hD)⁻¹ )^k = ( (1/(1+100h))^k  0 ; 0  (1/(1+h/10))^k )

and the above definitions of v and w₂₂, we obtain

∀ k ∈ N₀ :  y_k = (1/(1+100h))^k (v₁ ; 0) + (1/(1+h/10))^k (v₂ ; w₂₂ v₂),

such that

lim_{k→∞} ‖y_k‖₂ = 0,

independently of the size of h > 0. While this does not prove the convergence of the implicit Euler method, a convergence result is provided by the following Th. 5.7.
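The contrast between the explicit growth factor 1 − 100h and the implicit decay factor 1/(1 + 100h) can also be observed directly; a short sketch (ours, assuming NumPy), iterating both step matrices for (5.4) with h = 0.05 > 1/50:

```python
import numpy as np

A = np.array([[-100.0, 1.0],
              [0.0,    -0.1]])
I2 = np.eye(2)
h = 0.05                            # h > 1/50, so |1 - 100h| = 4 > 1

E_expl = I2 + h * A                 # explicit Euler step matrix, cf. (5.5)
E_impl = np.linalg.inv(I2 - h * A)  # implicit Euler step matrix

y_e = y_i = np.array([1.0, 1.0])    # eta
for k in range(1, 21):
    y_e, y_i = E_expl @ y_e, E_impl @ y_i
    if k % 5 == 0:
        print(f"k={k:2d}  ||y_k||_2: explicit {np.linalg.norm(y_e):9.2e},"
              f" implicit {np.linalg.norm(y_i):9.2e}")
```

The explicit norms should grow roughly like 4^k, while the implicit norms decay monotonically, as predicted by the two closed-form recursions above.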


Lemma 5.5. Consider the implicit Euler method of Ex. 3.5(c) for the initial value problem (5.1), i.e. y′ = f(x,y), y(ξ) = η, f : R × R^n −→ R^n, η ∈ R^n. Let b ∈ R, b > ξ, and assume φ : [ξ,b] −→ R^n to be the unique solution to y′ = f(x,y), y(ξ) = η, on [ξ,b]. We define

∀ x ∈ [ξ,b[ ∀ h ∈ ]0, b−x] :  λ̄(x,h) := φ(x) + h f( x+h, φ(x+h) ) − φ(x+h),

which constitutes a variant of the local truncation error λ(x,h) of Def. 2.3(b)⁴. If f ∈ C¹(R × R^n, R^n), then

∃ C ≥ 0 ∀ x ∈ [ξ,b[ ∀ h ∈ ]0, b−x] :  ‖λ̄(x,h)‖ ≤ C h².

⁴ In general, λ and λ̄ will not be identical, due to the fact that Def. 2.3(b) assumes the method to be written in explicit form. In particular, the result of the present lemma is not the same as the result obtained from Ex. 3.39(a) and Th. 3.37(b), which says that the method is consistent of order 1, when rewritten with a ψ in explicit (standard) form.

Proof. If f is C¹, then the solution φ is C² according to Prop. B.1 in the Appendix. Thus, we can apply Taylor's theorem with the remainder term in integral form to obtain, for each x ∈ [ξ,b[ and each h ∈ ]0, b−x],

φ(x) = φ(x+h−h) = φ(x+h) − φ′(x+h) h + ∫_{x+h}^x (x−t) φ″(t) dt.

Applying this in the definition of λ̄ and using f( x+h, φ(x+h) ) = φ′(x+h) yields, for each x ∈ [ξ,b[ and each h ∈ ]0, b−x],

λ̄(x,h) = φ(x) + h φ′(x+h) − φ(x+h) = −∫_x^{x+h} (x−t) φ″(t) dt,

and, thus, ‖λ̄(x,h)‖ ≤ C h², where

C = (1/2) max{ ‖φ″(t)‖ : t ∈ [ξ,b] },

completing the proof of the lemma.
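The O(h²) bound of Lem. 5.5 can be verified numerically on a problem with known solution; a sketch (ours, assuming NumPy), using (5.3) with λ = −10 and η = 1, so that φ(x) = e^{−x} and λ̄(x,h)/h² should approach φ″(x)/2:

```python
import numpy as np

lam = -10.0
phi = lambda x: np.exp(-x)                         # exact solution for eta = 1
f = lambda x, y: lam * y - (1 + lam) * np.exp(-x)

x = 0.5
for h in (0.1, 0.05, 0.025, 0.0125):
    lam_bar = phi(x) + h * f(x + h, phi(x + h)) - phi(x + h)
    print(f"h={h:6.4f}  lam_bar/h^2 = {lam_bar / h**2:.6f}")
print("phi''(x)/2 =", np.exp(-x) / 2)              # limit of the printed ratios
```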

Lemma 5.6. Let ⟨·,·⟩ : R^n × R^n −→ R denote a fixed scalar product on R^n, n ∈ N, with induced norm ‖·‖. Let ξ, b ∈ R with ξ < b and assume f : [ξ,b] × R^n −→ R^n to satisfy an upper Lipschitz condition with respect to y on [ξ,b] as defined in Def. 5.1(a), however with a constant function M(x) ≡ M ∈ R.

(a) If M ≤ 0, then

∀ x ∈ [ξ,b] ∀ y,ȳ ∈ R^n ∀ h ∈ R⁺₀ :  ‖y − ȳ‖ ≤ ‖y − ȳ − h ( f(x,y) − f(x,ȳ) )‖.



(b) If M > 0, then

∀ x ∈ [ξ,b] ∀ y,ȳ ∈ R^n ∀ h,H ∈ R⁺₀ with 0 ≤ h ≤ H < 1/M :  ‖y − ȳ‖ ≤ C_h ‖y − ȳ − h ( f(x,y) − f(x,ȳ) )‖,

where

C_h := 1 + hM/(1 − HM) = ( 1 + M(h − H) ) / (1 − HM).

Proof. Let x ∈ [ξ,b], let y, ȳ ∈ R^n, and let h ∈ R⁺₀. Then, according to Def. 5.1(a),

h ⟨f(x,y) − f(x,ȳ), y − ȳ⟩ ≤ hM ‖y − ȳ‖²,

implying

(1 − hM) ‖y − ȳ‖² ≤ ⟨y − ȳ, y − ȳ⟩ − h ⟨f(x,y) − f(x,ȳ), y − ȳ⟩
  = ⟨ y − ȳ − h ( f(x,y) − f(x,ȳ) ), y − ȳ ⟩
  ≤ ‖y − ȳ − h ( f(x,y) − f(x,ȳ) )‖ ‖y − ȳ‖,

where the Cauchy-Schwarz inequality was used for the last estimate. If M ≤ 0, then ‖y − ȳ‖ ≤ (1 − hM) ‖y − ȳ‖ ≤ ‖y − ȳ − h ( f(x,y) − f(x,ȳ) )‖, proving (a). Now let M > 0 and h ≤ H < 1/M. Then 1 − hM > 0 and we may divide the above estimate by this term without changing the inequality. Since

1/(1 − hM) = 1 + hM/(1 − hM) ≤ 1 + hM/(1 − HM) = C_h,

this proves (b).

Theorem 5.7. Let ⟨·,·⟩ : R^n × R^n −→ R denote a fixed scalar product on R^n, n ∈ N, with induced norm ‖·‖. Let ξ, b ∈ R with ξ < b and assume the initial value problem (5.1) satisfies an upper Lipschitz condition with respect to y on [ξ,b] as defined in Def. 5.1(a), however with a constant function M(x) ≡ M ∈ R. Moreover, assume (5.1) has a unique solution φ defined on [ξ,b] and consider the implicit Euler method of Ex. 3.5(c) for a partition Δ := (x₀,...,x_N) of [ξ,b], i.e. assume y₀,...,y_N ∈ R^n satisfy

y₀ = η,   ∀ k ∈ {0,...,N−1} :  y_{k+1} = y_k + h_k f(x_{k+1}, y_{k+1}),  h_k := x_{k+1} − x_k.

If λ̄(x,h) is defined as in Lem. 5.5 and

∃ C ≥ 0 ∀ x ∈ [ξ,b[ ∀ h ∈ ]0, b−x] :  ‖λ̄(x,h)‖ ≤ C h²

(according to Lem. 5.5, f ∈ C¹(R × R^n, R^n) is sufficient), then the global truncation error can be estimated by

max{ ‖y_k − φ(x_k)‖ : k ∈ {0,...,N} } ≤ K h_max(Δ),   (5.7a)

where

K := C (b − ξ)   for M ≤ 0,
K := (C/M) ( e^{M(b−ξ)/(1 − h_max(Δ)M)} − 1 )   for M > 0 and 0 < h_max(Δ) < 1/M,   (5.7b)

and where h_max(Δ) is the mesh size of Δ as defined in (1.9b). Note that C (and, hence, K) does not depend on either f or the partition. Moreover, for M ≤ 0, K tends to be of moderate size and it grows linearly with the length of the interval [ξ,b].

Proof. Introducing the abbreviations

∀ k ∈ {0,...,N} :  φ_k := φ(x_k),  e_k := y_k − φ_k,

and

∀ k ∈ {0,...,N−1} :  λ̄_k := λ̄(x_k, h_k) = φ_k + h_k f(x_{k+1}, φ_{k+1}) − φ_{k+1},

we obtain, for each k ∈ {0,...,N−1},

e_k + λ̄_k = y_k − φ_k + φ_k + h_k f(x_{k+1}, φ_{k+1}) − φ_{k+1}
  = e_{k+1} − y_{k+1} + φ_{k+1} + y_k + h_k f(x_{k+1}, φ_{k+1}) − φ_{k+1}
  = e_{k+1} − h_k ( f(x_{k+1}, y_{k+1}) − f(x_{k+1}, φ_{k+1}) ).   (5.8)

If M ≤ 0, then we apply (5.8) together with Lem. 5.6(a), yielding

‖e_{k+1}‖ ≤ ‖e_{k+1} − h_k ( f(x_{k+1}, y_{k+1}) − f(x_{k+1}, φ_{k+1}) )‖   (Lem. 5.6(a))
  = ‖e_k + λ̄_k‖ ≤ ‖e_k‖ + ‖λ̄_k‖ ≤ ‖e_k‖ + C h_k²   (by (5.8)).

Using this in an induction on k ∈ {0,...,N} yields

∀ k ∈ {0,...,N} :  ‖e_k‖ ≤ C (x_k − ξ) h_max(Δ):

Indeed, e₀ = y₀ − φ₀ = η − η = 0, and, for k ∈ {0,...,N−1}, the induction hypothesis yields

‖e_{k+1}‖ ≤ ‖e_k‖ + C h_k² ≤ C (x_k − ξ) h_max(Δ) + C h_k h_max(Δ) = C (x_{k+1} − ξ) h_max(Δ),

completing the induction and, in particular, proving (5.7) for M ≤ 0. Analogously, for M > 0, we apply (5.8) together with Lem. 5.6(b), yielding, for h_max(Δ) < 1/M,

‖e_{k+1}‖ ≤ C_{h_k} ( ‖e_k‖ + ‖λ̄_k‖ ) ≤ C_{h_k} ‖e_k‖ + ( 1/(1 − h_max(Δ)M) ) ‖λ̄_k‖ ≤ C_{h_k} ‖e_k‖ + ( C h_max(Δ)/(1 − h_max(Δ)M) ) h_k,

where

C_{h_k} = 1 + h_k M/(1 − h_max(Δ)M) = ( 1 + M(h_k − h_max(Δ)) ) / (1 − h_max(Δ)M).

We can now apply Lem. 2.5 with

a_k := ‖e_k‖,  a₀ = ‖e₀‖ = 0,  L := M/(1 − h_max(Δ)M),  β := C h_max(Δ)/(1 − h_max(Δ)M),

implying

∀ k ∈ {0,...,N} :  ‖e_k‖ ≤ ( (e^{L(x_k−ξ)} − 1)/L ) β = ( e^{M(x_k−ξ)/(1 − h_max(Δ)M)} − 1 ) C h_max(Δ)/M,

in particular, proving (5.7) for M > 0.
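For M ≤ 0, the bound (5.7) predicts that the maximal implicit Euler error decays proportionally to h_max(Δ); a quick experimental check (ours, assuming NumPy) for the dissipative problem (5.3) with λ = −1000 (so M = λ ≤ 0):

```python
import numpy as np

lam, eta = -1000.0, 1.0
phi = lambda x: np.exp(-x)          # exact solution for eta = 1

for p in (4, 6, 8, 10):
    h = 2.0 ** (-p)
    y, err = eta, 0.0
    for k in range(round(1.0 / h)):
        xk1 = (k + 1) * h
        y = (y - h * (1 + lam) * np.exp(-xk1)) / (1 - h * lam)  # implicit Euler
        err = max(err, abs(y - phi(xk1)))
    print(f"h=2^-{p:<2d}  max error {err:.3e}  error/h {err / h:.3e}")
```

The quotient error/h should stay roughly constant under halving of h, consistent with (5.7a).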


6 Collocation Methods

An important idea for constructing useful implicit methods (actually, implicit RK methods, as we will see below) is so-called collocation (explicit RK methods can also result, see, e.g., Ex. 6.7(a), but this is not the main importance of collocation methods). Given the initial value problem

y′ = f(x,y),  y(ξ) = η,  f : G −→ K^n,  G ⊆ R × K^n,  (ξ,η) ∈ G,

the idea is to approximate the exact solution φ : [ξ, ξ+h] −→ K^n via a K^n-valued polynomial of degree at most s that satisfies the ODE at least at the s collocation points ξ + c_j h, where 0 ≤ c₁ < ··· < c_s ≤ 1.

Definition 6.1. Let s, n ∈ N, G ⊆ R × K^n, f : G −→ K^n, let c := (c₁,...,c_s)ᵗ ∈ [0,1]^s with 0 ≤ c₁ < ··· < c_s ≤ 1.

(a) Let (x,y,h) ∈ R × K^n × R⁺. We call a K^n-valued polynomial P of degree at most s (i.e. each component of P is a K-valued polynomial of degree at most s) collocation polynomial, determined by c, f, x, y, and h, if, and only if, P satisfies the following conditions (i) – (ii):

(i) P(x) = y.

(ii) P′(x + c_j h) = f( x + c_j h, P(x + c_j h) ) for each j ∈ {1,...,s} (which, in particular, is supposed to mean, for each j ∈ {1,...,s}, that the used argument of f is in G).

(b) We call an explicit single-step method according to Def. 1.7(b) (with m = 1, α₀ = 1) a collocation method, determined by c, if, and only if, the defining function has the form

ψ : D_ψ −→ K^n,  ψ(x,y,h) = ( −y + P(x,y,h)(x+h) ) / h   (recall h > 0),

where, for each (x,y,h) ∈ D_ψ, P(x,y,h) is a collocation polynomial determined by c, f, x, y, and h according to (a).

With each collocation method as defined above, we will now associate an RK method. Then, in Th. 6.5 below, we will show that the resulting defining functions are identical.

Definition 6.2. Let s ∈ N and c := (c₁,...,c_s)ᵗ ∈ [0,1]^s with 0 ≤ c₁ < ··· < c_s ≤ 1. Let L₁,...,L_s : R −→ R be the Lagrange basis polynomials (cf. [Phi20, Th. 3.4]⁵) of degree at most s−1, satisfying

∀ j,l ∈ {1,...,s} :  L_j(c_l) = δ_{jl} = 1 for j = l,  0 for j ≠ l.

Define

∀ j ∈ {1,...,s} :  b_j := ∫₀¹ L_j(t) dt,   ∀ j,l ∈ {1,...,s} :  a_{jl} := ∫₀^{c_j} L_l(t) dt.

Then the RK method with weights b₁,...,b_s, nodes c₁,...,c_s, and RK matrix (a_{jl}) is called the RK method defined by c and collocation.

⁵ In [Phi20, Th. 3.4], the Lagrange basis polynomials are actually defined as algebraic polynomials, i.e. as elements of R[X], whereas, here, we are using the corresponding polynomial functions in Pol(R) (recall that, as R is an infinite field, the spaces R[X] and Pol(R) are, actually, isomorphic (cf. [Phi19b, Th. 7.12(c)])).
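Def. 6.2 is fully constructive, so the Butcher tableau of the RK method defined by c and collocation can be generated symbolically; a minimal sketch (ours, assuming SymPy; the helper name collocation_tableau is hypothetical):

```python
import sympy as sp

def collocation_tableau(c):
    """Return (b, A) of the RK method defined by the nodes c and
    collocation, via the Lagrange basis polynomials of Def. 6.2."""
    t = sp.Symbol('t')
    s = len(c)
    # Lagrange basis polynomials L_j with L_j(c_l) = delta_{jl}
    L = [sp.Mul(*[(t - c[l]) / (c[j] - c[l]) for l in range(s) if l != j])
         for j in range(s)]
    b = [sp.integrate(L[j], (t, 0, 1)) for j in range(s)]        # weights
    A = [[sp.integrate(L[l], (t, 0, c[j])) for l in range(s)]    # RK matrix
         for j in range(s)]
    return b, A

# c = (0, 1): should yield the implicit trapezoidal method, cf. Ex. 6.7(c)
b, A = collocation_tableau([sp.Integer(0), sp.Integer(1)])
print(b)  # [1/2, 1/2]
print(A)  # [[0, 0], [1/2, 1/2]]
```

The same routine applied to c = (1/2 − √3/6, 1/2 + √3/6) should reproduce the Gauss tableau computed by hand in Ex. 6.7(e) below.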

Lemma 6.3. Let s, n ∈ N and let c := (c₁,...,c_s)ᵗ ∈ [0,1]^s with 0 ≤ c₁ < ··· < c_s ≤ 1. Let (x,y,h) ∈ R × K^n × R⁺ and let P : R −→ K^n be a K^n-valued polynomial of degree at most s with P(x) = y. Defining

∀ j ∈ {1,...,s} :  k_j := k_j(x,y,h) := P′(x + c_j h),

we have

∀ t ∈ R :  P′(x + th) = Σ_{j=1}^s k_j L_j(t),   (6.1a)

∀ t ∈ R :  P(x + th) = y + h Σ_{j=1}^s k_j ∫₀ᵗ L_j(θ) dθ,   (6.1b)

where the L_j are defined as in Def. 6.2 above.

Proof. We have that (6.1a) holds, as each component of both sides constitutes a K-valued polynomial on R of degree at most s−1, where both sides agree at the s distinct values t = c₁,...,c_s (cf. [Phi20, Th. 3.4]). For (6.1b), we compute, for each t ∈ R,

P(x + th) = y + h ∫₀ᵗ P′(x + θh) dθ = y + h Σ_{j=1}^s k_j ∫₀ᵗ L_j(θ) dθ   (by (6.1a)),

thereby establishing the case.

Lemma 6.4. Let s, n ∈ N, G ⊆ R × K^n, f : G −→ K^n, let c := (c₁,...,c_s)ᵗ ∈ [0,1]^s with 0 ≤ c₁ < ··· < c_s ≤ 1. Let (x,y,h) ∈ R × K^n × R⁺. Define

K(x,y,h) := { k := (k₁,...,k_s) ∈ (K^n)^s : k satisfies (3.1b) },
P(x,y,h) := { (P : R −→ K^n) : P satisfies Def. 6.1(a)(i),(ii) }.

Then the map Φ : P(x,y,h) −→ K(x,y,h), given by the definition in Lem. 6.3, does, indeed, map into K(x,y,h), and it is bijective with the inverse given by Φ⁻¹ : K(x,y,h) −→ P(x,y,h), k ↦ P, where

P : R −→ K^n,  P(t) := y + h Σ_{j=1}^s k_j ∫₀^{(t−x)/h} L_j(θ) dθ.   (6.2)


Proof. If P ∈ P(x,y,h), then, from (6.1b) with t = c_j, we obtain

∀ j ∈ {1,...,s} :  P(x + c_j h) = y + h Σ_{l=1}^s a_{jl} k_l   (6.3)

and, thus, for each j ∈ {1,...,s},

k_j = P′(x + c_j h) = f( x + c_j h, P(x + c_j h) ) = f( x + c_j h, y + h Σ_{l=1}^s a_{jl} k_l )

(the second equality by Def. 6.1(a)(ii)), showing k = Φ(P) to satisfy (3.1b), i.e. k = Φ(P) ∈ K(x,y,h).

If k ∈ K(x,y,h) and P := Φ⁻¹(k) is given by (6.2), then, since each L_j is a polynomial of degree at most s−1, clearly, P is a (K^n-valued) polynomial of degree at most s. If t = x in the definition of P, then all the integrals vanish, showing P(x) = y, in accordance with Def. 6.1(a)(i). Moreover, using (6.2) and the chain rule, one obtains

P′ : R −→ K^n,  P′(t) = Σ_{j=1}^s k_j L_j( (t − x)/h ),   (6.4a)

implying

∀ j ∈ {1,...,s} :  P′(x + c_j h) = Σ_{l=1}^s k_l L_l(c_j) = k_j.   (6.4b)

In consequence, Lem. 6.3 applies, showing (6.3) to hold, once again. Thus, as k satisfies (3.1b), we obtain

P′(x + c_j h) = k_j = f( x + c_j h, y + h Σ_{l=1}^s a_{jl} k_l ) = f( x + c_j h, P(x + c_j h) ),

which is Def. 6.1(a)(ii), showing P = Φ⁻¹(k) ∈ P(x,y,h).

Φ ∘ Φ⁻¹ = Id: Let k ∈ K(x,y,h). Then P := Φ⁻¹(k) is given by (6.2) and

Φ(P) = (k̃₁,...,k̃_s), where ∀ j ∈ {1,...,s} :  k̃_j := P′(x + c_j h).

On the other hand, from the above argument, we know (6.4) to hold as well, showing

∀ j ∈ {1,...,s} :  k̃_j = P′(x + c_j h) = k_j,

proving Φ ∘ Φ⁻¹ = Id.

Φ⁻¹ ∘ Φ = Id: Let P ∈ P(x,y,h). Then k := Φ(P) is given by

∀ j ∈ {1,...,s} :  k_j := P′(x + c_j h),

and P̃ := Φ⁻¹(k) is given by

P̃ : R −→ K^n,  P̃(t) := y + h Σ_{j=1}^s k_j ∫₀^{(t−x)/h} L_j(θ) dθ.

As, by (6.4b),

∀ j ∈ {1,...,s} :  P̃′(x + c_j h) = k_j = P′(x + c_j h),

both P′ and P̃′ are polynomials of degree at most s−1 that agree at the s distinct points t = x + c_j h, implying P′ = P̃′. Since, also, P̃(x) = y = P(x), we conclude P = P̃, proving Φ⁻¹ ∘ Φ = Id and completing the proof of the lemma.
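The inverse map Φ⁻¹ of Lem. 6.4 is likewise explicit, so formula (6.2) can be evaluated directly; the following sketch (ours, assuming SymPy; the helper name inverse_Phi is hypothetical) assembles P = Φ⁻¹(k) for arbitrary stage slopes k and re-checks (6.4b):

```python
import sympy as sp

def inverse_Phi(c, k, x, y, h):
    """Assemble P = Phi^{-1}(k) from (6.2): the polynomial of degree <= s
    with P(x) = y and P'(x + c_j h) = k_j."""
    t, th = sp.symbols('t theta')
    s = len(c)
    P = sp.S(y)
    for j in range(s):
        Lj = sp.Mul(*[(th - c[l]) / (c[j] - c[l]) for l in range(s) if l != j])
        P += h * k[j] * sp.integrate(Lj, (th, 0, (t - x) / h))
    return sp.expand(P), t

# toy data with the trapezoidal nodes c = (0, 1) and arbitrary slopes
c = [sp.Integer(0), sp.Integer(1)]
k = [sp.Integer(2), sp.Integer(3)]
P, t = inverse_Phi(c, k, x=0, y=1, h=sp.Rational(1, 2))
dP = sp.diff(P, t)
print(P)                                                  # here: t**2 + 2*t + 1
print([dP.subs(t, cj * sp.Rational(1, 2)) for cj in c])   # recovers [2, 3]
```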

Theorem 6.5. Let s, n ∈ N, G ⊆ R × K^n, f : G −→ K^n, let c := (c₁,...,c_s)ᵗ ∈ [0,1]^s with 0 ≤ c₁ < ··· < c_s ≤ 1.

(a) If ψ : D_ψ −→ K^n, ψ(x,y,h) = ( −y + P(x,y,h)(x+h) ) / h, is the defining function of a collocation method, determined by c, as in Def. 6.1(b), then, defining b ∈ R^s and (a_{jl}) ∈ M(s,R) as in Def. 6.2 plus the k_j(x,y,h) as in Lem. 6.3, we have that ψ satisfies (3.1), i.e.

∀ (x,y,h) ∈ D_ψ :  ψ(x,y,h) = Σ_{j=1}^s b_j k_j(x,y,h)

and

∀ (x,y,h) ∈ D_ψ ∀ j ∈ {1,...,s} :  k_j(x,y,h) = f( x + c_j h, y + h Σ_{l=1}^s a_{jl} k_l(x,y,h) ),

i.e. ψ is the defining function of the RK method defined by c and collocation.

(b) If ψ : D_ψ −→ K^n is the defining function of the RK method defined by c and collocation, then it satisfies (3.1) with certain k_j := k_j(x,y,h) for each (x,y,h) ∈ D_ψ. With the notation of Lem. 6.4, we have k := (k₁,...,k_s) ∈ K(x,y,h) and, by Lem. 6.4, P := Φ⁻¹(k) ∈ P(x,y,h) is a collocation polynomial determined by c, f, x, y, and h, according to Def. 6.1(a). Moreover, ψ is the defining function of a collocation method, satisfying Def. 6.1(b), and

∀ k ∈ {1,...,s} :  Σ_{j=1}^s b_j c_j^{k−1} = 1/k,   (6.5a)

∀ k,l ∈ {1,...,s} :  Σ_{j=1}^s a_{lj} c_j^{k−1} = c_l^k / k,   (6.5b)

i.e., in particular, the RK method defined by c and collocation satisfies the consistency condition (3.3) as well as the node condition (3.4).


Proof. (a): Let (x,y,h) ∈ D_ψ. Then we obtain, from (6.1b) with t = 1,

P(x,y,h)(x+h) = y + h Σ_{j=1}^s k_j(x,y,h) ∫₀¹ L_j(θ) dθ = y + h Σ_{j=1}^s b_j k_j(x,y,h),   (6.6)

showing ψ to satisfy (3.1a). This proves (a), since we know from Lem. 6.4 that ψ satisfies (3.1b) as well.

(b): As in the proof of (a), Lem. 6.3 applies, yielding, once again, the validity of (6.6). Thus,

ψ(x,y,h) = Σ_{j=1}^s b_j k_j(x,y,h) = ( −y + P(x,y,h)(x+h) ) / h,

as claimed. To prove (6.5), we first note that

∀ t ∈ R ∀ k ∈ {1,...,s} :  t^{k−1} = Σ_{j=1}^s c_j^{k−1} L_j(t),   (6.7)

which holds, as both sides of (6.7) constitute polynomials of degree at most s−1 that agree at the s distinct points t = c₁,...,c_s. Thus,

Σ_{j=1}^s b_j c_j^{k−1} = Σ_{j=1}^s ∫₀¹ c_j^{k−1} L_j(t) dt = ∫₀¹ t^{k−1} dt = 1/k   (by (6.7)),

proving (6.5a), also yielding the consistency condition by setting k := 1. Similarly,

Σ_{j=1}^s a_{lj} c_j^{k−1} = Σ_{j=1}^s ∫₀^{c_l} c_j^{k−1} L_j(t) dt = ∫₀^{c_l} t^{k−1} dt = c_l^k / k   (by (6.7)),

proving (6.5b), also yielding the node condition by setting k := 1.

Corollary 6.6. Let s, n ∈ N, G ⊆ R × K^n, f : G −→ K^n, let c := (c₁,...,c_s)ᵗ ∈ [0,1]^s with 0 ≤ c₁ < ··· < c_s ≤ 1. Moreover, let G be open and assume f to be continuous and locally Lipschitz with respect to y.

(a) For each (x,y) ∈ G, there exist γ(x,y) ∈ ]0,∞] and r(x,y) ∈ R⁺ such that, if h ∈ ]0, γ(x,y)[, there exists a unique collocation polynomial P, determined by c, f, x, y, and h, with the additional property

∀ j ∈ {1,...,s} :  ‖P′(x + c_j h) − f(x,y)‖ < r(x,y).

(b) If, additionally, G = R × K^n and f is globally L-Lipschitz with respect to y (L ∈ R⁺₀), and ‖A‖_∞ denotes the operator norm of A = (a_{jl}) (a_{jl} defined as in Def. 6.2) with respect to ‖·‖_∞ on K^s, then, for each h ∈ ]0,γ[ with

γ := 1/(L ‖A‖_∞)   (1/0 := ∞),

there exists a unique collocation polynomial P, determined by c, f, x, y, and h (i.e. uniqueness is guaranteed without the additional condition of (a)).


Proof. Let (x,y) ∈ G. Under the hypotheses of the corollary, according to Th. 3.8 and Lem. 3.2, there exist γ(x,y) ∈ ]0,∞] and r(x,y) ∈ R⁺ such that, for each h ∈ ]0, γ(x,y)[, the system (3.1b) has a unique solution

k := (k₁,...,k_s) ∈ B_{r(x,y)}(y_f,...,y_f) ⊆ (K^n)^s,  where y_f := f(x,y),

and this solution is even unique in the entire space (K^n)^s under the additional hypotheses of (b). Now, by Lem. 6.4, each solution k to (3.1b) uniquely corresponds to a collocation polynomial P = Φ⁻¹(k), determined by c, f, x, y, and h, where the correspondence is given by the bijective map Φ of Lem. 6.4. Since, for each j ∈ {1,...,s}, k_j = P′(x + c_j h), this proves the corollary.
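Numerically, the stage system (3.1b) underlying Cor. 6.6 is typically solved by the fixed-point iteration behind Th. 3.8 (or by Newton's method); a sketch (ours, assuming NumPy; the name solve_stages is hypothetical), shown for the implicit trapezoidal method of Ex. 6.7(c) applied to y′ = λy:

```python
import numpy as np

def solve_stages(f, x, y, h, c, A, tol=1e-14, maxit=200):
    """Fixed-point iteration for the stage system (3.1b):
    k_j = f(x + c_j*h, y + h * sum_l A[j,l]*k_l).
    Expected to converge for h*L*||A||_inf < 1, cf. Cor. 6.6(b)."""
    s = len(c)
    k_new = k = np.array([f(x, y) for _ in range(s)], dtype=float)
    for _ in range(maxit):
        k_new = np.array([f(x + c[j] * h, y + h * (A[j] @ k)) for j in range(s)])
        if np.max(np.abs(k_new - k)) < tol:
            break
        k = k_new
    return k_new

# implicit trapezoidal method (Ex. 6.7(c)) on y' = lam*y, one step of size h
lam = -2.0
f = lambda x, y: lam * y
c = np.array([0.0, 1.0])
A = np.array([[0.0, 0.0], [0.5, 0.5]])
b = np.array([0.5, 0.5])
x, y, h = 0.0, 1.0, 0.1
k = solve_stages(f, x, y, h, c, A)
print("y1 =", y + h * (b @ k))   # compare exp(lam*h) = 0.81873...
```

Here hL‖A‖_∞ = 0.1·2·1 = 0.2 < 1, so the iteration is a contraction and converges to the unique stage vector.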

Example 6.7. (a) Consider s = 1 with the collocation point c₁ := 0. From the consistency and the node condition, we directly obtain b₁ = 1 and a₁₁ = 0, showing the resulting collocation method to be the explicit Euler method.

(b) Consider s = 1 with the collocation point c₁ := 1. From the consistency and the node condition, we directly obtain b₁ = a₁₁ = 1, showing the resulting collocation method to be the implicit Euler method.

(c) Consider s = 2 with the collocation points c₁ := 0, c₂ := 1. Then L₁(t) = 1 − t, L₂(t) = t,

b₁ = ∫₀¹ (1 − t) dt = [t − t²/2]₀¹ = 1/2,   b₂ = ∫₀¹ t dt = [t²/2]₀¹ = 1/2,

a₁₁ = ∫₀⁰ L₁(t) dt = a₁₂ = ∫₀⁰ L₂(t) dt = 0,

a₂₁ = ∫₀¹ L₁(t) dt = b₁ = 1/2,   a₂₂ = ∫₀¹ L₂(t) dt = b₂ = 1/2,

showing this collocation method to be the implicit trapezoidal method of Ex. 3.20(d).

(d) The classical RK method is not a collocation method, since it has c₂ = c₃ = 1/2.

(e) Let s ∈ N and let 0 < c₁ < ··· < c_s < 1 be the zeros of the sth orthogonal polynomial with respect to the L²[0,1]-scalar product (these orthogonal polynomials are obtained from t⁰, t¹, t², ... via the usual orthonormalization procedure, see, e.g., [Phi20, Sec. 4.6.2]). Then the collocation method defined by these collocation points is called the Gauss method corresponding to s (the name comes from the fact that the c_j also define a so-called Gaussian quadrature rule, cf. [Phi20, Def. 4.47]). Let us check that, for s = 2, we obtain the Gauss method of Ex. 3.39(c): The first 3 orthogonal polynomials are (using [Phi20, (4.73)]) p₀ ≡ 1,

p₁(t) = t − ⟨t,1⟩/‖p₀‖₂² = t − (∫₀¹ t dt)/(∫₀¹ dt) = t − 1/2,

p₂(t) = t² − ( ⟨t²,1⟩/‖p₀‖₂² ) p₀(t) − ( ⟨t², t − 1/2⟩/‖p₁‖₂² ) p₁(t)
  = t² − ∫₀¹ t² dt − ( (∫₀¹ (t³ − t²/2) dt) / (∫₀¹ (t − 1/2)² dt) ) (t − 1/2)
  = t² − 1/3 − ( (1/12)/(1/12) ) (t − 1/2) = t² − t + 1/6.

The zeros of p₂ are

c₁ = 1/2 − √(1/4 − 1/6) = 1/2 − √(1/12) = 1/2 − √3/6,   c₂ = 1/2 + √3/6.

Then

L₁(t) = −√3 t + √3/2 + 1/2,   L₂(t) = √3 t − √3/2 + 1/2,

b₁ = ∫₀¹ L₁(t) dt = 1/2,   b₂ = ∫₀¹ L₂(t) dt = 1/2,

a₁₁ = ∫₀^{c₁} L₁(t) dt = −(√3/2) c₁² + (√3/2 + 1/2) c₁
  = −(√3/2)(1/3 − √3/6) + √3/4 − 1/4 + 1/4 − √3/12 = −√3/6 + 1/4 + √3/4 − √3/12 = 1/4,

a₂₂ = ∫₀^{c₂} L₂(t) dt = (√3/2) c₂² + (−√3/2 + 1/2) c₂
  = (√3/2)(1/3 + √3/6) − √3/4 − 1/4 + 1/4 + √3/12 = √3/6 + 1/4 − √3/4 + √3/12 = 1/4,

a₁₂ = c₁ − a₁₁ = 1/4 − √3/6,   a₂₁ = c₂ − a₂₂ = 1/4 + √3/6,

showing that, for s = 2, we obtain the Gauss method of Ex. 3.39(c).
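This hand computation can be cross-checked mechanically: the c_j are precisely the roots of the shifted Legendre polynomial of degree s, and integrating the resulting Lagrange basis polynomials as in Def. 6.2 must return the tableau above; a sketch (ours, assuming SymPy):

```python
import sympy as sp

t = sp.Symbol('t')
s = 2
# c_1 < ... < c_s: zeros of the sth orthogonal polynomial w.r.t. the
# L^2[0,1] scalar product, i.e. of the shifted Legendre polynomial
c = sorted(sp.Poly(sp.legendre(s, 2 * t - 1), t).all_roots(), key=float)
print(c)   # [1/2 - sqrt(3)/6, 1/2 + sqrt(3)/6]

L = [sp.Mul(*[(t - c[l]) / (c[j] - c[l]) for l in range(s) if l != j])
     for j in range(s)]
b = [sp.simplify(sp.integrate(L[j], (t, 0, 1))) for j in range(s)]
A = [[sp.simplify(sp.integrate(L[l], (t, 0, c[j]))) for l in range(s)]
     for j in range(s)]
print(b)   # [1/2, 1/2]
print(A)   # [[1/4, 1/4 - sqrt(3)/6], [1/4 + sqrt(3)/6, 1/4]]
```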

The goal in the remainder of this section is to obtain results that allow one to determine the order of consistency of collocation methods by relating them to certain quadrature rules.

Definition and Remark 6.8. Let s ∈ N and c := (c₁,...,c_s)ᵗ ∈ [0,1]^s with 0 ≤ c₁ < ··· < c_s ≤ 1. We assume b₁,...,b_s ∈ R to be defined as in Def. 6.2. We call the functional

Q_c : F([0,1],R) −→ R,  Q_c(f) := Σ_{j=1}^s b_j f(c_j),

the quadrature rule defined by c and collocation (F([0,1],R) denotes the set of functions mapping from [0,1] into R). We remark that, for f integrable, Q_c(f) can, indeed, be considered as an approximation of ∫₀¹ f(t) dt.

Proposition 6.9. Let s ∈ N and c := (c₁,...,c_s)ᵗ ∈ [0,1]^s with 0 ≤ c₁ < ··· < c_s ≤ 1. Moreover, let x, h, b₁,...,b_s ∈ R, h > 0, and consider the quadrature rule

Q : F([x, x+h],R) −→ R,  Q(f) := h Σ_{j=1}^s b_j f(x + c_j h),   (6.8)

which, for integrable f : [x, x+h] −→ R, has the error term

R(f,h) := ∫_x^{x+h} f(t) dt − Q(f).   (6.9)

If Q is exact for each polynomial of degree at most m, where m ∈ N₀, m ≥ s−1, i.e.

∀ q ∈ Pol_m(R) :  R(q,h) = 0,

then m ≤ 2s−1 and

∃ C ∈ R⁺ ∀ f ∈ C^{m+1}[x, x+h] :  |R(f,h)| ≤ C h^{m+2} ‖f^{(m+1)}‖_∞.   (6.10)

Proof. We already know from [Phi20, Rem. 4.9(b)] that m ≤ 2s−1 if R(q,h) = 0 for each q ∈ Pol_m(R). It remains to prove (6.10). To this end, let f ∈ C^{m+1}[x, x+h] and choose c_{s+1},...,c_{m+1} ∈ [0,1] such that c₁,...,c_{m+1} are all distinct. Moreover, let q ∈ Pol_m(R) be the unique interpolating polynomial, satisfying

∀ j ∈ {1,...,m+1} :  q(x + c_j h) = f(x + c_j h).   (6.11)

Using R(q,h) = 0, we obtain

R(f,h) = ∫_x^{x+h} f(t) dt − Q(f) + R(q,h) = ∫_x^{x+h} ( f(t) − q(t) ) dt + Q(q) − Q(f)
  = h ∫₀¹ ( f(x + θh) − q(x + θh) ) dθ + h Σ_{j=1}^s b_j ( q(x + c_j h) − f(x + c_j h) )
  = h ∫₀¹ ( f(x + θh) − q(x + θh) ) dθ   (by (6.11)).

We now use the error formula [Phi20, (3.31)], which, for each θ ∈ [0,1], provides the existence of ξ(θ) ∈ [x, x+h] such that

f(x + θh) − q(x + θh) = ( f^{(m+1)}(ξ(θ)) / (m+1)! ) Π_{j=1}^{m+1} (x + θh − x − c_j h) = ( f^{(m+1)}(ξ(θ)) / (m+1)! ) h^{m+1} Π_{j=1}^{m+1} (θ − c_j),

implying

|R(f,h)| ≤ h ∫₀¹ |f(x + θh) − q(x + θh)| dθ ≤ h^{m+2} ( ‖f^{(m+1)}‖_∞ / (m+1)! ) ∫₀¹ Π_{j=1}^{m+1} |θ − c_j| dθ,

thereby proving (6.10).

Remark 6.10. We can view the quadrature rule of (6.8) as an explicit single-step method with defining function

ψ : D_ψ −→ R,  ψ(x,y,h) := Σ_{j=1}^s b_j f(x + c_j h),

for the ODE y′ = f(x), say, with f : [a,b] −→ R, a < b, f sufficiently regular. As the solution to the initial value problem y′ = f(x), y(a) = 0, is

φ : [a,b] −→ R,  φ(x) := ∫_a^x f(t) dt,

the corresponding local truncation error is, for x ∈ [a,b[, h ∈ ]0, b−x],

λ(x,h) = φ(x) + h ψ(x, φ(x), h) − φ(x+h) = h Σ_{j=1}^s b_j f(x + c_j h) − ∫_x^{x+h} f(t) dt = −R(f,h).

Hence, Prop. 6.9 states that, if Q of (6.8) is exact for each polynomial of degree at most m, then ψ is consistent of order m+1 with respect to φ.
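As a concrete instance of Prop. 6.9 and this remark, the midpoint rule (s = 1, c₁ = 1/2, b₁ = 1) is the collocation quadrature rule exact for polynomials of degree m = 1, so its error should scale like h^{m+2} = h³; a numerical sketch (ours, assuming NumPy):

```python
import numpy as np

x = 0.3
f = F = np.exp              # integrand f and an antiderivative F

for h in (0.2, 0.1, 0.05, 0.025):
    Q = h * f(x + 0.5 * h)              # midpoint rule: s=1, c_1=1/2, b_1=1
    R = (F(x + h) - F(x)) - Q           # error term (6.9)
    print(f"h={h:6.3f}  R/h^3 = {R / h**3:.6f}")
print("f''(x)/24 =", np.exp(x) / 24)    # classical limit of the ratio
```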

Proposition 6.11. Let s, n ∈ N, let G ⊆ R × K^n be open, let f ∈ C^{2s}(G,K^n), let c := (c₁,...,c_s)ᵗ ∈ [0,1]^s with 0 ≤ c₁ < ··· < c_s ≤ 1. Moreover, let (ξ,η) ∈ G, b ∈ R with b > ξ such that φ : [ξ,b] −→ K^n is the unique solution on [ξ,b] to the initial value problem

y′ = f(x,y),  y(ξ) = η.

Then

∃ h* ∈ ]0, b−ξ] ∃ C ∈ R⁺ ∀ h ∈ ]0,h*] ∀ k ∈ {0,...,s} :  max{ ‖(φ − P)^{(k)}(x)‖ : x ∈ [ξ, ξ+h] } ≤ C h^{s+1−k},   (6.12)

where P = P(ξ,η,h) is the collocation polynomial determined by c, f, ξ, η, and h according to Def. 6.1(a), for which we assume, in addition,

∀ γ ∈ R :  lim_{h↓0} P(ξ,η,h)(ξ + γh) = η   (6.13)

(in the situation of Def. 6.1(b), lim_{h↓0} k_j(ξ,η,h) = κ_j ∈ K^n is sufficient for (6.13)⁶).

⁶ From (6.1b) (with t = γ), we know P(ξ,η,h)(ξ + γh) = η + h Σ_{j=1}^s k_j(ξ,η,h) ∫₀^γ L_j(θ) dθ, implying lim_{h↓0} P(ξ,η,h)(ξ + γh) = η + 0 · Σ_{j=1}^s κ_j ∫₀^γ L_j(θ) dθ = η.

Proof. Fix k ∈ {0,...,s}. According to Cor. 6.6(a), there exists h*₀ ∈ R⁺ such that P(ξ,η,h) exists for each h ∈ ]0,h*₀[. Below, we will also use ξ + 2h ≤ b. To this end, we already choose h*₀ ≤ (b−ξ)/2. In the following, let h ∈ ]0,h*₀[. Since P(ξ,η,h) is given via interpolation of its derivative, it will be useful to write φ′ in terms of an interpolating polynomial as well: More precisely, we interpolate φ′ at the distinct points ξ + c₁h, ..., ξ + c_s h via a polynomial of degree at most s−1. Using the Lagrange formula [Phi20, (3.10)] together with φ being a solution to y′ = f(x,y) and the error formula [Phi20, (3.29)], we obtain, for each t ∈ [0,1],

φ′(ξ + th) = Σ_{j=1}^s φ′(ξ + c_j h) Π_{l=1, l≠j}^s (ξ + th − ξ − c_l h)/(ξ + c_j h − ξ − c_l h) + α(t,h) Π_{j=1}^s (ξ + th − ξ − c_j h)
  = Σ_{j=1}^s f( ξ + c_j h, φ(ξ + c_j h) ) L_j(t) + h^s α(t,h) ω(t),   (6.14)

where the L_j are the Lagrange basis polynomials of Def. 6.2,

α(t,h) := [φ′ | ξ + th, ξ + c₁h, ..., ξ + c_s h]

is the corresponding divided difference, and

ω(t) := Π_{j=1}^s (t − c_j)

is the corresponding Newton basis polynomial. According to the Hermite-Genocchi formula [Phi20, (3.28)], we have

α(t,h) = ∫_{Σ_s} φ^{(s+1)}( ξ + th + Σ_{j=1}^s r_j (ξ + c_j h − ξ − th) ) dr = ∫_{Σ_s} φ^{(s+1)}( ξ + th + Σ_{j=1}^s r_j (c_j h − th) ) dr,

where Σ_s is the simplex defined in [Phi20, Not. 3.21]. Using φ ∈ C^{2s+1}([ξ,b],K^n) (as f ∈ C^{2s}(G,K^n)) and differentiation of parameter-dependent integrals (e.g. according to [Phi17, Cor. 2.24]), we differentiate α with respect to t, k times, obtaining

(∂_t)^k α(t,h) = h^k ∫_{Σ_s} ( 1 − Σ_{j=1}^s r_j )^k φ^{(s+1+k)}( ξ + th + Σ_{j=1}^s r_j (c_j h − th) ) dr.

Since 0 ≤ Σ_{j=1}^s r_j ≤ 1 by the definition of Σ_s and ∫_{Σ_s} 1 dr = 1/s! by [Phi20, (3.26)], we can estimate

max{ ‖(∂_t)^k α(t,h)‖ : t ∈ [0,1] } ≤ h^k C_α(k,h),   C_α(k,h) := (1/s!) max{ ‖φ^{(s+1+k)}(x)‖ : x ∈ [ξ, ξ+2h] }.   (6.15)

Next, we integrate (6.14) to obtain, for each t ∈ [0,1],

φ(ξ + th) = η + h ∫₀ᵗ φ′(ξ + θh) dθ = η + h Σ_{j=1}^s f( ξ + c_j h, φ(ξ + c_j h) ) ∫₀ᵗ L_j(θ) dθ + h^{s+1} ∫₀ᵗ α(θ,h) ω(θ) dθ,

which is then used together with (6.1b) to yield, for each t ∈ [0,1],

φ(ξ + th) − P(ξ,η,h)(ξ + th)
  = φ(ξ + th) − η − h Σ_{j=1}^s k_j(ξ,η,h) ∫₀ᵗ L_j(θ) dθ   (by (6.1b))
  = φ(ξ + th) − η − h Σ_{j=1}^s f( ξ + c_j h, η + h Σ_{l=1}^s a_{jl} k_l(ξ,η,h) ) ∫₀ᵗ L_j(θ) dθ   (by (3.1b))
  = φ(ξ + th) − η − h Σ_{j=1}^s f( ξ + c_j h, P(ξ,η,h)(ξ + c_j h) ) ∫₀ᵗ L_j(θ) dθ   (by (6.3))
  = h Σ_{j=1}^s δf_j(ξ,h) ∫₀ᵗ L_j(θ) dθ + h^{s+1} ∫₀ᵗ α(θ,h) ω(θ) dθ,   (6.16)

where

∀ j ∈ {1,...,s} :  δf_j(ξ,h) := f( ξ + c_j h, φ(ξ + c_j h) ) − f( ξ + c_j h, P(ξ,η,h)(ξ + c_j h) ).

As many times before, we make use of the fact that f being continuously differentiable implies f to be locally Lipschitz. Thus, there exist ε, L ∈ R⁺ such that f is L-Lipschitz on B_ε(ξ,η). Using φ(ξ) = η and (6.13),

∃ h*₁ ∈ ]0,h*₀] ∀ h ∈ ]0,h*₁[ ∀ j ∈ {1,...,s} :  ( ξ + c_j h, φ(ξ + c_j h) ), ( ξ + c_j h, P(ξ,η,h)(ξ + c_j h) ) ∈ B_ε(ξ,η),

implying

‖δf_j(ξ,h)‖ ≤ L ‖φ(ξ + c_j h) − P(ξ,η,h)(ξ + c_j h)‖.   (6.17)

To be employed in further estimates, we now define the quantities

e(h) := max{ ‖φ(x) − P(ξ,η,h)(x)‖ : x ∈ [ξ, ξ+h] },

C₀ := max{ Σ_{j=1}^s |∫₀ᵗ L_j(θ) dθ| : t ∈ [0,1] },

∀ l ∈ {1,...,s} :  C_l := max{ Σ_{j=1}^s |L_j^{(l−1)}(t)| : t ∈ [0,1] },

∀ l ∈ {0,...,s} :  C_{ω,l} := ‖ω^{(l)}|_{[0,1]}‖_∞.

Also let

h* := min{ h*₁, 1/(2LC₀) }.

Applying the norm to (6.16), using the triangle inequality on the right-hand side, and then taking the max over t ∈ [0,1], we obtain, for each h ∈ ]0,h*[,

e(h) ≤ h max{ ‖δf_j(ξ,h)‖ : j ∈ {1,...,s} } C₀ + h^{s+1} max{ ‖α(t,h) ω(t)‖ : t ∈ [0,1] }
  ≤ h e(h) L C₀ + h^{s+1} C_α(0,h*) C_{ω,0}   (by (6.17) and (6.15))

and

e(h) (1 − hLC₀) ≤ h^{s+1} C_α(0,h*) C_{ω,0},

which, since

1 − hLC₀ ≥ 1 − h*LC₀ ≥ 1 − LC₀/(2LC₀) = 1/2  ⇒  1/(1 − hLC₀) ≤ 2,

implies

e(h) ≤ 2 h^{s+1} C_α(0,h*) C_{ω,0}.   (6.18)

For k = 0, (6.18) completes the proof of (6.12). For k ≥ 1, we differentiate (6.16) k times with respect to t, obtaining

h^k ( φ^{(k)}(ξ + th) − P^{(k)}(ξ + th) ) = h Σ_{j=1}^s δf_j(ξ,h) L_j^{(k−1)}(t) + h^{s+1} Σ_{l=0}^{k−1} (k−1 choose l) α^{(k−1−l)}(t,h) ω^{(l)}(t),

where the general Leibniz product rule was used to obtain the last summand. Dividing by h^k, applying the norm, taking the max over t ∈ [0,1], and applying our previous estimates (6.17), (6.18), and (6.15) yields

max{ ‖(φ − P)^{(k)}(x)‖ : x ∈ [ξ, ξ+h] }
  ≤ h^{1−k} L e(h) C_k + h^{s+1−k} Σ_{l=0}^{k−1} (k−1 choose l) h^{k−1−l} C_α(k−1−l, h*) C_{ω,l}
  ≤ h^{s+2−k} 2L C_α(0,h*) C_{ω,0} C_k + h^{s+1−k} Σ_{l=0}^{k−1} (k−1 choose l) h*^{k−1−l} C_α(k−1−l, h*) C_{ω,l}
  ≤ h^{s+1−k} ( 2 h* L C_α(0,h*) C_{ω,0} C_k + Σ_{l=0}^{k−1} (k−1 choose l) h*^{k−1−l} C_α(k−1−l, h*) C_{ω,l} ),

completing the proof of (6.12).

Theorem 6.12. Let s ∈ N, c := (c₁,...,c_s)ᵗ ∈ [0,1]^s with 0 ≤ c₁ < ··· < c_s ≤ 1, and consider the RK method defined by c and collocation. If the quadrature rule Q defined by c and collocation according to Def. and Rem. 6.8 is exact for each polynomial of degree at most p ∈ N₀, then the RK method satisfies the consistency condition of order p+1 (as defined in Def. 3.35). In particular, if the RK method is in standard form, then it is consistent of order p+1 with respect to each solution to each ODE, satisfying the hypotheses of Th. 3.37(a),(b).

Proof. According to the proof of Th. 3.37(c), it suffices to show that the method's local truncation error λ with respect to each solution φ : [0,b] −→ R^n, b > 0, to y′ = f(y), y(0) = 0, for each f ∈ C^∞(R^n,R^n), n ∈ N, satisfies (3.46), i.e.

lim sup_{h↓0} ‖λ(0,h)‖ / h^{p+2} ∈ R⁺₀.   (6.19)

Thus, consider such a solution φ to such an initial value problem with f ∈ C^∞(R^n,R^n). In the following, let h > 0 be sufficiently small such that the RK method has unique local solutions (cf. Th. 3.8(a)). Moreover, let the RK method be in standard form with defining function

ψ : D_ψ −→ R^n,  ψ(x,y,h) = ( −y + P(x,y,h)(x+h) ) / h,

P(x,y,h) denoting the collocation polynomial determined by c, f, x, y, and h according to Def. 6.1(a). Then, moreover, by Th. 3.8(c),

lim_{h↓0} k_j(0,0,h) = f(0),

showing (6.13) to be satisfied. Thus, by Prop. 6.11, we may take h > 0 to be sufficiently small such that the estimate of (6.12) holds as well. For each such sufficiently small h > 0, we need to estimate

λ(0,h) = φ(0) + h ψ(0, φ(0), h) − φ(h) = P(0,0,h)(h) − φ(h).


We now keep h fixed and set P := P(0,0,h). Define

ε : R −→ R^n,  ε(x) := P′(x) − f(P(x))

and

M := 1 + max{ ‖P(x)‖ : x ∈ [−h,2h] }.

Moreover, define

g : ]−h,2h[ × B_M(0) × ]−1,2[ −→ R^n,  g(x,y,μ) := f(y) + μ ε(x),

and consider the following initial value problem with parameter μ:

y′ = g(x,y,μ) = f(y) + μ ε(x),  y(0) = 0.   (6.20)

Since f is continuous, f is bounded on the compact set B_M(0), implying, for each μ ∈ ]−1,2[, that the solution φ_μ : ]−h,2h[ −→ R^n to (6.20) is defined on the entire interval ]−h,2h[ by [Phi16c, Th. 4.7]. We now set

Y : ]−h,2h[ × ]−1,2[ −→ R^n,  Y(x,μ) := φ_μ(x).

Clearly, φ₀ = φ. For μ = 1, (6.20) becomes

y′ = f(y) + ε(x) = f(y) + P′(x) − f(P(x)),  y(0) = 0,

showing φ₁ = P (note P(0) = 0 by Def. 6.1(a)(i)). Thus, as Y is continuously differentiable with respect to μ by Cor. B.5,

∀ x ∈ ]−h,2h[ :  P(x) − φ(x) = Y(x,1) − Y(x,0) = ∫₀¹ ∂_μ Y(x,μ) dμ.   (6.21)

Next, we differentiate

∂_x Y(x,μ) = f(Y(x,μ)) + μ ε(x)

with respect to μ to obtain

∂_x ∂_μ Y(x,μ) = Df(Y(x,μ)) ∂_μ Y(x,μ) + ε(x)   (6.22)

(as we chose f to be C^∞, Cor. B.5 actually implies Y to be C^∞ as well). Moreover, we also know from (6.20) that Y(0,μ) = 0 for each μ ∈ ]−1,2[, showing ∂_μ Y(x,μ) to satisfy the inhomogeneous linear initial value problem

y′ = Df(Y(x,μ)) y + ε(x),  y(0) = 0.   (6.23)

Now let Φ : ] − h, 2h[×] − 1, 2[−→ M(n,R) be the principal matrix solution to theparametrized linear ODE of (6.23) (i.e. the fundamental system with Φ(0, ·) ≡ Id).Then, by variation of constants (cf. [Phi16c, Th. 4.15]) and using ∂µY (0, µ) = 0, weobtain

∀(x,µ)∈]−h,2h[×]−1,2[

∂µY (x, µ) = Φ(x, µ)Φ−1(0, µ) 0 + Φ(x, µ)

∫ x

0

Φ−1(t, µ) ǫ(t) dt

= Φ(x, µ)

∫ x

0

Φ−1(t, µ) ǫ(t) dt . (6.24)


Considering (6.21) at x = h and replacing ∂_μ Y(h,μ) by means of (6.24) yields

P(h) − φ(h) = ∫₀¹ ∂_μ Y(h,μ) dμ = ∫₀¹ ∫₀ʰ Φ(h,μ) Φ⁻¹(t,μ) ε(t) dt dμ
  = ∫₀ʰ ( ∫₀¹ Φ(h,μ) Φ⁻¹(t,μ) dμ ) ε(t) dt = ∫₀ʰ γ(t) ε(t) dt,   (6.25)

where we used Fubini's theorem and where

γ : ]−h,2h[ −→ M(n,R),  γ(t) := ∫₀¹ Φ(h,μ) Φ⁻¹(t,μ) dμ.

We now use the quadrature rule Q and Prop. 6.9 to estimate the integral on the right-hand side of (6.25), obtaining, with R as defined in Prop. 6.9:

P(h) − φ(h) = ∫₀ʰ γ(t) ε(t) dt = Q(γε) + R(γε, h) = h Σ_{j=1}^s b_j γ(c_j h) ε(c_j h) + R(γε, h) = R(γε, h),   (6.26)

since

∀ j ∈ {1,...,s} :  ε(c_j h) = P′(c_j h) − f(P(c_j h)) = 0

by Def. 6.1(a)(ii). Since f is C^∞, so is ε, implying Y to be C^∞ by Cor. B.5 (as we already noted above), which also yields Φ and, thus, γ to be C^∞. Thus, Prop. 6.9 yields the estimate

‖λ(0,h)‖ = ‖P(h) − φ(h)‖ = ‖R(γε, h)‖ ≤ C h^{p+2} max{ ‖(γε)^{(p+1)}(t)‖ : t ∈ [0,h] }

with C ∈ R⁺. The proof is complete if we can show

M(h) := max{ ‖(γε)^{(p+1)}(t)‖ : t ∈ [0,h] }

to remain bounded for h → 0. While γ and its derivatives clearly remain bounded for h → 0, the derivatives of ε remain bounded for h → 0 by (6.12) of Prop. 6.11.

Example 6.13. Let s ∈ N. Then the Gauss method corresponding to s, as defined in Ex. 6.7(e), is an example of an implicit s-stage RK method satisfying the consistency condition of order 2s (which is the maximum by Prop. 3.22(a)): Comparing the definition of Q_c in Def. and Rem. 6.8 with [Phi20, Def. 4.47] shows Q_c to be a Gaussian quadrature rule, which, by [Phi20, Th. 4.48(a)], is exact for each polynomial of degree at most 2s−1. Then Th. 6.12 yields that the Gauss method satisfies the consistency condition of order 2s (for the Gauss method with s = 2, we checked this via an explicit calculation in Ex. 3.39(c)).
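The order statement can also be observed experimentally for s = 2: applying the Gauss method (with the tableau from Ex. 6.7(e)) to y′ = λy, where the stage system (3.1b) is linear and can be solved exactly, the global error at x = 1 should decay like h⁴; a sketch (ours, assuming NumPy):

```python
import numpy as np

r3 = np.sqrt(3.0)
A = np.array([[0.25,          0.25 - r3 / 6],
              [0.25 + r3 / 6, 0.25]])
b = np.array([0.5, 0.5])
lam = -1.0

def gauss2_step(y, h):
    # for f(y) = lam*y the stage system (3.1b) is linear:
    # (Id - h*lam*A) k = lam*y*(1,1)^t
    k = np.linalg.solve(np.eye(2) - h * lam * A, lam * y * np.ones(2))
    return y + h * (b @ k)

for n in (10, 20, 40, 80):
    h, y = 1.0 / n, 1.0
    for _ in range(n):
        y = gauss2_step(y, h)
    err = abs(y - np.exp(lam))
    print(f"n={n:3d}  error {err:.3e}  error*n^4 {err * n**4:.3e}")
```

The product error·n⁴ should remain roughly constant, consistent with convergence of order 2s = 4.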


A Lipschitz Continuity

Definition A.1. Let m,n ∈ N, G ⊆ R×Km, and f : G −→ Kn.

(a) The function f is called (globally) Lipschitz continuous or just (globally) Lipschitz with respect to y if, and only if,

∃ L ≥ 0 ∀ (x,y),(x,ȳ) ∈ G :  ‖f(x,y) − f(x,ȳ)‖ ≤ L ‖y − ȳ‖.   (A.1)

(b) The function f is called locally Lipschitz continuous or just locally Lipschitz with respect to y if, and only if, for each (x₀,y₀) ∈ G, there exists a (relatively) open set U ⊆ G such that (x₀,y₀) ∈ U (i.e. U is a (relatively) open neighborhood of (x₀,y₀)) and f is Lipschitz continuous with respect to y on U, i.e. if, and only if,

∀ (x₀,y₀) ∈ G ∃ open U with (x₀,y₀) ∈ U ⊆ G ∃ L ≥ 0 ∀ (x,y),(x,ȳ) ∈ U :  ‖f(x,y) − f(x,ȳ)‖ ≤ L ‖y − ȳ‖.   (A.2)

The number L occurring in (a),(b) is called Lipschitz constant. The norms on K^m and K^n in (a),(b) are arbitrary. If one changes the norms, then one will, in general, change L, but not the property of f being (locally) Lipschitz.

Caveat A.2. It is emphasized that f : G −→ K^n, (x,y) ↦ f(x,y), being Lipschitz with respect to y does not imply f to be continuous: Indeed, if I ⊆ R, ∅ ≠ A ⊆ K^m, and g : I −→ K^n is an arbitrary discontinuous function, then f : I × A −→ K^n, f(x,y) := g(x), is not continuous, but satisfies (A.1) with L = 0.

While the local neighborhoods U, on which a function that is locally Lipschitz (with respect to y) is actually Lipschitz continuous (with respect to y), can be very small, we will now show that a continuous function is locally Lipschitz (with respect to y) on G if, and only if, it is Lipschitz continuous (with respect to y) on every compact set K ⊆ G.

Proposition A.3. Let m, n ∈ N, G ⊆ R × K^m, and f : G −→ K^n be continuous. Then f is locally Lipschitz with respect to y if, and only if, f is (globally) Lipschitz with respect to y on every compact subset K of G.

Proof. First, assume f is not locally Lipschitz with respect to y. Then there exists (x₀,y₀) ∈ G such that

∀ N ∈ N ∃ (x_N, y_{N,1}), (x_N, y_{N,2}) ∈ G ∩ B_{1/N}(x₀,y₀) :  ‖f(x_N, y_{N,1}) − f(x_N, y_{N,2})‖ > N ‖y_{N,1} − y_{N,2}‖.   (A.3)

The set

K := {(x₀,y₀)} ∪ { (x_N, y_{N,j}) : N ∈ N, j ∈ {1,2} }

is clearly a compact subset of G (e.g. by the Heine-Borel property of compact sets, since every open set containing (x₀,y₀) must contain all, but finitely many, of the elements of K). Due to (A.3), f is not (globally) Lipschitz with respect to y on the compact set K (so, actually, continuity of f was not used for this direction).

Conversely, assume f to be locally Lipschitz with respect to y, and consider a compact subset K of G. Then, for each (x,y) ∈ K, there is some (relatively) open U_{(x,y)} ⊆ G with (x,y) ∈ U_{(x,y)} and such that f is Lipschitz with respect to y in U_{(x,y)}. By the Heine-Borel property of compact sets, there are finitely many U₁ := U_{(x₁,y₁)}, ..., U_N := U_{(x_N,y_N)}, N ∈ N, such that

K ⊆ ∪_{j=1}^N U_j.   (A.4)

For each j = 1,...,N, let L_j denote the Lipschitz constant for f on U_j and set L′ := max{L₁,...,L_N}. As f is assumed continuous and K is compact, we have

M := max{ ‖f(x,y)‖ : (x,y) ∈ K } < ∞.   (A.5)

Using the compactness of K once again, there exists a Lebesgue number δ > 0 for the open cover (U_j)_{j∈{1,...,N}} of K, i.e. δ > 0 such that

∀ (x,y),(x,ȳ) ∈ K :  ( ‖y − ȳ‖ < δ  ⇒  ∃ j ∈ {1,...,N} :  {(x,y),(x,ȳ)} ⊆ U_j ).   (A.6)

Define L := max{L′, 2M/δ}. Then, for every (x,y),(x,ȳ) ∈ K:

‖y − ȳ‖ < δ  ⇒  ‖f(x,y) − f(x,ȳ)‖ ≤ L_j ‖y − ȳ‖ ≤ L ‖y − ȳ‖,   (A.7a)
‖y − ȳ‖ ≥ δ  ⇒  ‖f(x,y) − f(x,ȳ)‖ ≤ 2M = (2M/δ) δ ≤ L ‖y − ȳ‖,   (A.7b)

completing the proof that f is Lipschitz with respect to y on K.

The following Prop. A.4 provides a useful sufficient condition for f : G −→ K^n, G ⊆ R × K^m open, to be locally Lipschitz with respect to y:

Proposition A.4. Let m, n ∈ N, let G ⊆ R × K^m be open, and f : G −→ K^n. A sufficient condition for f to be locally Lipschitz with respect to y is f being continuously (real) differentiable with respect to y, i.e., f is locally Lipschitz with respect to y provided that all partials ∂_{y_k} f_l; k = 1,...,m, l = 1,...,n (∂_{y_{k,1}} f_l, ∂_{y_{k,2}} f_l for K = C) exist and are continuous.

Proof. We consider the case K = R; the case K = C is included by using the identifications C^m ≅ R^{2m} and C^n ≅ R^{2n}. Given (x₀,y₀) ∈ G, we have to show f is Lipschitz with respect to y on some open set U ⊆ G with (x₀,y₀) ∈ U. Since G is open,

∃ b > 0 :  B := { (x,y) ∈ R × R^m : |x − x₀| ≤ b and ‖y − y₀‖₁ ≤ b } ⊆ G,

where ‖·‖₁ denotes the 1-norm on R^m. Since the ∂_{y_k} f_l, (k,l) ∈ {1,...,m} × {1,...,n}, are all continuous on the compact set B,

M := max{ |∂_{y_k} f_l(x,y)| : (x,y) ∈ B, (k,l) ∈ {1,...,m} × {1,...,n} } < ∞.   (A.8)


Applying the mean value theorem to the n components of the function

f_x : { y ∈ R^m : (x,y) ∈ B } −→ R^n,  f_x(y) := f(x,y),

we obtain η₁,...,η_n ∈ R^m such that

f_l(x,y) − f_l(x,ȳ) = Σ_{k=1}^m ∂_{y_k} f_l(x,η_l) (y_k − ȳ_k),   (A.9)

and, thus,

∀ (x,y),(x,ȳ) ∈ B :  ‖f(x,y) − f(x,ȳ)‖₁ = Σ_{l=1}^n |f_l(x,y) − f_l(x,ȳ)| ≤ Σ_{l=1}^n Σ_{k=1}^m M |y_k − ȳ_k| = Σ_{l=1}^n M ‖y − ȳ‖₁ = nM ‖y − ȳ‖₁,   (A.10)

where (A.8) and (A.9) were used for the estimate, i.e. f is Lipschitz with respect to y on B (where { (x,y) ∈ R × R^m : |x − x₀| < b and ‖y − y₀‖₁ < b } ⊆ B is an open neighborhood of (x₀,y₀)), showing f is locally Lipschitz with respect to y.

B Ordinary Differential Equations (ODE)

B.1 Regularity of Solutions

Proposition B.1. Let G ⊆ R × K^n be open, n ∈ N, and f : G −→ K^n. Let I ⊆ R be an open interval and let φ : I −→ K^n be a solution to the ODE

y′ = f(x,y).   (B.1)

If f has continuous partials up to order k ∈ N₀, then φ ∈ C^{k+1}(I, K^n).

Proof. As we are only considering real differentiability, we may assume K = R without loss of generality (the case K = C is included via the identification K^n ≅ R^{2n}). By the chain rule, if g : G −→ R is differentiable and

α : I −→ R,  α(x) := g(x, φ(x)),   (B.2)

then α is differentiable with

α′ : I −→ R,  α′(x) = ∂_x g(x, φ(x)) + Σ_{ν=1}^n ∂_{y_ν} g(x, φ(x)) φ′_ν(x) = ∂_x g(x, φ(x)) + Σ_{ν=1}^n ∂_{y_ν} g(x, φ(x)) f_ν(x, φ(x)).   (B.3)

Thus, an induction shows that, for each 1 ≤ j ≤ k+1 and each x ∈ I, φ^{(j)}(x) is a polynomial in partial derivatives of order at most j−1 of f, all evaluated at (x, φ(x)). In particular, if all partials up to order k of f are continuous, then so is φ^{(k+1)}.


B.2 Differentiability with Respect to Initial Conditions and Parameters

Definition B.2. Let G ⊆ R × K^n be open, n ∈ N, and let f : G −→ K^n be continuous and locally Lipschitz with respect to y. Given (ξ,η) ∈ G, let

φ_{(ξ,η)} : I_{(ξ,η)} −→ K^n

denote the unique maximal solution to the initial value problem

y′ = f(x,y),  y(ξ) = η.

We then call

Y : D_f −→ K^n,  Y(x,ξ,η) := φ_{(ξ,η)}(x),   (B.4)

defined on

D_f := { (x,ξ,η) ∈ R × G : x ∈ I_{(ξ,η)} },   (B.5)

the global or general solution to y′ = f(x,y). Note that the domain D_f of Y is determined entirely by f, which is notationally emphasized by its lower index f.

Theorem B.3. Let G ⊆ R × K^n be open, n ∈ N, and let f ∈ C^k(G, K^n), k ∈ N. Moreover, let Y : D_f −→ K^n be the global solution to y′ = f(x,y) as defined in Def. B.2. Then D_f is open and Y ∈ C^k(D_f, K^n).

Proof. We know D_f to be open from [Phi16c, Th. 3.35]. In [Tes12, Th. 2.10], it is shown that, for each (x,ξ,η) ∈ D_f, Y is C^k in an open neighborhood of (x,ξ,η), which suffices to conclude Y ∈ C^k(D_f, K^n) (for the case k = 1, also see [Mar04, Th. 7.9]).

We will now consider situations where the right-hand side f depends on some (vector of) parameters μ in addition to depending on x and y:

Definition B.4. Let G ⊆ R × K^n × K^l be open, n, l ∈ N, and let f : G −→ K^n be continuous and locally Lipschitz with respect to y. Given (ξ,η,μ) ∈ G, let

φ_{(ξ,η,μ)} : I_{(ξ,η,μ)} −→ K^n

denote the unique maximal solution to the initial value problem

y′ = f(x,y,μ),   (B.6a)
y(ξ) = η.   (B.6b)

We then call

Y : D_f −→ K^n,  Y(x,ξ,η,μ) := φ_{(ξ,η,μ)}(x),   (B.7)

defined on

D_f := { (x,ξ,η,μ) ∈ R × G : x ∈ I_{(ξ,η,μ)} },   (B.8)

the global or general solution to (B.6a).


Corollary B.5. Let G ⊆ R × K^n × K^l be open, n, l ∈ N, and let f ∈ C^k(G, K^n), k ∈ N. Moreover, let Y : D_f −→ K^n be the global solution to y′ = f(x,y,μ) as defined in Def. B.4. Then D_f is open and Y ∈ C^k(D_f, K^n).

Proof. To apply Th. B.3 to the present situation, define the auxiliary function

F : G −→ K^{n+l},  F_j(x,y) := f_j(x,y) for j = 1,...,n,  F_j(x,y) := 0 for j = n+1,...,n+l.   (B.9)

Then, since f ∈ C^k(G, K^n), we have F ∈ C^k(G, K^{n+l}) and, thus, we can apply Th. B.3 to

y′ = F(x,y),   (B.10a)
y(ξ) = (η,μ),   (B.10b)

where (ξ,η,μ) ∈ G. According to Th. B.3, the global solution Ỹ : D_F −→ K^{n+l} of (B.10a) is C^k on the open set D_F. Moreover, by the definition of F in (B.9), we have

∀ (x,ξ,η,μ) ∈ D_F :  Ỹ(x,ξ,η,μ) = ( Y(x,ξ,η,μ) ; μ ),

where Y is as defined in (B.7). In particular, D_f = D_F and Ỹ being C^k implies Y to be C^k as well.

References

[DB08] Peter Deuflhard and Folkmar Bornemann. Numerische Mathematik 2, 3rd ed. Walter de Gruyter, Berlin, Germany, 2008 (German).

[Mar04] Nelson G. Markley. Principles of Differential Equations. Pure and Applied Mathematics, Wiley-Interscience, Hoboken, NJ, USA, 2004.

[Phi16a] P. Philip. Analysis I: Calculus of One Real Variable. Lecture Notes, Ludwig-Maximilians-Universität, Germany, 2015/2016, available in PDF format at http://www.math.lmu.de/~philip/publications/lectureNotes/philipPeter_Analysis1.pdf.

[Phi16b] P. Philip. Analysis II: Topology and Differential Calculus of Several Variables. Lecture Notes, Ludwig-Maximilians-Universität, Germany, 2016, available in PDF format at http://www.math.lmu.de/~philip/publications/lectureNotes/philipPeter_Analysis2.pdf.

[Phi16c] P. Philip. Ordinary Differential Equations. Lecture Notes, Ludwig-Maximilians-Universität, Germany, 2016, available in PDF format at http://www.math.lmu.de/~philip/publications/lectureNotes/philipPeter_ODE.pdf.

[Phi17] P. Philip. Analysis III: Measure and Integration Theory of Several Variables. Lecture Notes, Ludwig-Maximilians-Universität, Germany, 2016/2017, available in PDF format at http://www.math.lmu.de/~philip/publications/lectureNotes/philipPeter_Analysis3.pdf.

[Phi19a] P. Philip. Linear Algebra I. Lecture Notes, Ludwig-Maximilians-Universität, Germany, 2018/2019, available in PDF format at http://www.math.lmu.de/~philip/publications/lectureNotes/philipPeter_LinearAlgebra1.pdf.

[Phi19b] P. Philip. Linear Algebra II. Lecture Notes, Ludwig-Maximilians-Universität, Germany, 2019, available in PDF format at http://www.math.lmu.de/~philip/publications/lectureNotes/philipPeter_LinearAlgebra2.pdf.

[Phi20] P. Philip. Numerical Mathematics I. Lecture Notes, Ludwig-Maximilians-Universität, Germany, 2019/2020, available in PDF format at http://www.math.lmu.de/~philip/publications/lectureNotes/philipPeter_NumericalMathematics1.pdf.

[Pla06] Robert Plato. Numerische Mathematik kompakt, 3rd ed. Vieweg Verlag, Wiesbaden, Germany, 2006 (German).

[Tes12] Gerald Teschl. Ordinary Differential Equations and Dynamical Systems. Graduate Studies in Mathematics, Vol. 140, American Mathematical Society, Providence, Rhode Island, 2012.