
Topics in Mathematics 1

Gerald Trutnau

Seoul National University

Non-Corrected version

May 31, 2015

This text is a summary of the lecture

Topics in Mathematics 1

held at Seoul National University, Spring Term 2015

Please email all misprints and mistakes to

[email protected]


Bibliography

1. Bauer, H., Probability theory, de Gruyter, 1996.

2. Bauer, H., Measure and integration theory, de Gruyter, 1996. ISBN 3-11-016719-0.

3. Billingsley, P., Probability and Measure, third edition, Wiley, 1995. ISBN 0-471-00710-2.

4. Billingsley, P., Convergence of probability measures, Wiley, 1999.

5. Chung, K. L., A Course in Probability Theory, third edition, Academic Press.

6. Dudley, R. M., Real analysis and probability, Cambridge University Press, 2002.

7. Feller, W., An introduction to probability theory and its applications, Vol. 1 & 2, Wiley, 1950.

8. Halmos, P. R., Measure Theory, Springer, 1974.

9. Jacod, J.; Protter, P., Probability essentials, second edition, Universitext, Springer. ISBN 3-540-43871-8.

10. Klenke, A., Probability theory. A comprehensive course, Universitext, Springer. ISBN 978-1-84800-047-6.

11. Shiryaev, A. N., Probability, Springer, 1996.


1 Basic Notions

1 Probability spaces

Probability theory is the mathematical theory of randomness. The basic notion is that of a random experiment: an experiment whose outcome is not predictable and can only be determined by performing it and observing the outcome.

Probability theory tries to quantify the possible outcomes by attaching a probability to every event. This is important, for example, for an insurance company asking what a fair price is for insurance against events like fire or death, which may but need not happen.

The set of all possible outcomes of a random experiment is denoted by Ω. The set Ω may be finite, countably infinite or even uncountable.

Example 1.1. Examples of random experiments and corresponding Ω:

(i) Coin tossing. The possible outcomes of tossing a coin are either “head” or “tail”. Denoting one outcome by “0” and the other by “1”, the set of all possible outcomes is given by Ω = {0, 1}.

(ii) Tossing a coin n times. Any sequence of zeros and ones (alias heads or tails) of length n is a possible outcome; hence

Ω = { (x1, x2, . . . , xn) | xi ∈ {0, 1} } =: {0, 1}^n

is the space of all possible outcomes.

(iii) Tossing a coin infinitely many times. In this case

Ω = { (xi)_{i∈N} | xi ∈ {0, 1} } =: {0, 1}^N.

Here Ω is uncountable, in contrast to the previous examples. We can identify Ω with the set [0, 1] ⊂ R using the binary expansion

x = ∑_{i=1}^∞ xi · 2^{−i}.

(iv) A random number between 0 and 1: Ω = [0, 1].

(v) Continuous stochastic processes, e.g. Brownian motion on R. Any continuous real-valued function defined on [0, 1] ⊂ R is a possible outcome. In this case Ω = C([0, 1]).

Example: [Figure: a continuous sample path t ↦ ω(t), t ∈ [0, 1].]

Events. Reasonable subsets A ⊂ Ω for which it makes sense to calculate a probability are called events (a precise definition will be given in 1.3 below). If we observe ω ∈ A in a random experiment, we say that A has occurred.

• elementary events A = {ω} for some ω ∈ Ω

• the impossible event A = ∅ and the certain event A = Ω

• “A does not occur”: Ac = Ω \ A

Combination of events

A1 ∪ A2, ⋃_i Ai   “at least one of the events Ai occurs”

A1 ∩ A2, ⋂_i Ai   “all of the events Ai occur”

lim sup_{n→∞} An := ⋂_n ⋃_{m>n} Am   “infinitely many of the Am occur”

lim inf_{n→∞} An := ⋃_n ⋂_{m>n} Am   “all but finitely many of the Am occur”

Example 1.2. (i) Coin tossing. “1 occurs”: A = {1}.

(ii) Tossing a coin n times. “tossing k ones”:

A = { (x1, . . . , xn) ∈ {0, 1}^n | ∑_{i=1}^n xi = k }.

(iii) Tossing a coin infinitely many times. “relative frequency of 1 equals p”:

A = { (xi)_{i∈N} ∈ {0, 1}^N | lim_{n→∞} (1/n) ∑_{i=1}^n xi = p }.

(iv) Random number between 0 and 1. “number ∈ [a, b]”: A = [a, b] ⊂ Ω = [0, 1].

(v) Continuous stochastic processes. “exceeding level c”:

A = { ω ∈ C([0, 1]) | max_{0≤t≤1} ω(t) ≥ c }.

[Figure: a continuous path ω(t) crossing the level c.]

Let Ω be countable. A probability distribution function p on Ω is a function

p : Ω → [0, 1]   with   ∑_{ω∈Ω} p(ω) = 1.

Given any subset A ⊂ Ω, its probability P(A) can then be defined by simply adding up:

P(A) = ∑_{ω∈A} p(ω).

In the uncountable case, however, there is no reasonable way of adding up an uncountable set of numbers, and there is no way to build a reasonable theory by starting with probability functions specifying the probability of individual outcomes. The way out is to specify directly the probabilities of events. In the uncountable case it is in general not possible to consider the power set P(Ω), i.e. the collection of all subsets of Ω (including the empty set ∅ and the whole set Ω), but only a certain subclass A. On the other hand, A should satisfy some minimal requirements, specified in the following definition:

Definition 1.3. A ⊂ P(Ω) is called a σ-algebra if


(i) Ω ∈ A.

(ii) A ∈ A implies Ac ∈ A.

(iii) Ai ∈ A, i ∈ N, implies ⋃_{i∈N} Ai ∈ A.

Remark 1.4. (i) Let A be a σ-algebra. Then:

• ∅ = Ωc ∈ A.

• Ai ∈ A, i ∈ N, implies ⋂_{i∈N} Ai = ( ⋃_{i∈N} Ai^c )^c ∈ A.

• A1, . . . , An ∈ A implies ⋃_{i=1}^n Ai ∈ A and ⋂_{i=1}^n Ai ∈ A

(consider Am = ∅ for all m > n, resp. Am = Ω for all m > n).

• Ai ∈ A, i ∈ N, implies ⋂_n ⋃_{m>n} Am ∈ A and ⋃_n ⋂_{m>n} Am ∈ A.

(ii) the power set P(Ω) is a σ-algebra.

(iii) Let I be an index set (not necessarily countable) and for any i ∈ I, let Ai be a σ-algebra. Then ⋂_{i∈I} Ai is again a σ-algebra.

(iv) Typical construction of a σ-algebra. Let A0 ≠ ∅ be a class of events. Then

σ(A0) := ⋂ { B | B σ-algebra in Ω, A0 ⊂ B }

is the smallest σ-algebra containing A0; σ(A0) is called the σ-algebra generated by A0.
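For a finite Ω the generated σ-algebra can be computed explicitly by closing A0 under complements and unions. The following Python sketch is only an illustration of this construction (the function name and the brute-force closure loop are ours, not part of the lecture):

```python
def sigma_algebra(omega, a0):
    """Smallest sigma-algebra on the finite set `omega` containing the sets in `a0`."""
    omega = frozenset(omega)
    sets = {frozenset(), omega} | {frozenset(a) for a in a0}
    changed = True
    while changed:
        changed = False
        current = list(sets)
        for a in current:                     # close under complements
            if omega - a not in sets:
                sets.add(omega - a)
                changed = True
        for a in current:                     # close under (finite) unions
            for b in current:
                if a | b not in sets:
                    sets.add(a | b)
                    changed = True
    return sets

# sigma({1}) on Omega = {1, 2, 3} is {emptyset, {1}, {2, 3}, {1, 2, 3}}
for s in sorted(sigma_algebra({1, 2, 3}, [{1}]), key=lambda s: (len(s), sorted(s))):
    print(set(s))
```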

Example 1.5. Let Ω be a topological space, and let Ao be the collection of open subsets of Ω. Then B(Ω) := σ(Ao) is called the Borel σ-algebra of Ω, or the σ-algebra of Borel subsets.

Examples of Borel subsets: closed sets, countable unions of closed sets, etc. Note: not every subset of a topological space is a Borel subset, e.g. B(R) ≠ P(R).

Definition 1.6. Let Ω ≠ ∅ and A ⊂ P(Ω) a σ-algebra. A mapping P : A → [0, 1] is called a probability measure (on (Ω, A)) if:

• P (Ω) = 1

• P( ⋃_{i∈N} Ai ) = ∑_{i=1}^∞ P(Ai)   (“σ-additivity”)

for all pairwise disjoint Ai ∈ A, i ∈ N.

In this case (Ω, A, P) is called a probability space and A ∈ A an event. The pair (Ω, A) of a set Ω together with a σ-algebra is called a measurable space.

Example 1.7. (i) Coin tossing. Let A := P(Ω) = { ∅, {0}, {1}, {0, 1} }.

Tossing a fair coin means “head” and “tail” have equal probability 0.5, hence:

P({0}) := P({1}) := 1/2,  P(∅) := 0,  P({0, 1}) := 1.

(iii) Tossing a coin infinitely many times. Ω = {0, 1}^N. Let A := σ(A0), where

A0 := { B ⊂ Ω | ∃ n ∈ N and B0 ∈ P({0, 1}^n) such that B = B0 × {0, 1} × {0, 1} × . . . }.

An event B is contained in A0 if it depends on finitely many tosses. Fix x̄1, . . . , x̄n ∈ {0, 1} and define

P( { (x1, x2, . . . ) ∈ {0, 1}^N | x1 = x̄1, . . . , xn = x̄n } ) := 2^{−n}   (this set belongs to A0).

P can be extended to a probability measure on A = σ(A0). For this probability measure we have that

P( { (x1, x2, . . . ) ∈ {0, 1}^N | lim_{n→∞} (1/n) ∑_{i=1}^n xi = 1/2 } ) = 1.

(Proof: later!)

(v) Continuous stochastic processes. Ω = C([0, 1]), P = Wiener measure (“Brownian motion”). For fixed t0 ∈ (0, ∞) and α, β ∈ R we have that

P( { ω | ω(t0) ∈ [α, β] } ) := (1/√(2πt0)) ∫_α^β e^{−x²/(2t0)} dx

(Gaussian or normal distribution). What is B(Ω) and how does one construct the Wiener measure? Answer to this question: see Introduction to Stochastic Differential Equations 2013! What is now the probability P( { ω | max_{0≤t≤1} ω(t) ≥ c } )? Answer to this question: maybe later!

[Figure: a Brownian sample path ω(t) with ω(t0) ∈ [α, β].]

Remark 1.8. Let (Ω, A, P) be a probability space, and let A1, . . . , An ∈ A be pairwise disjoint. Then

P( ⋃_{i≤n} Ai ) = ∑_{i=1}^n P(Ai)

(simply let Am = ∅ for all m > n). In particular:

A, B ∈ A, A ⊂ B ⇒ P(B) = P(A) + P(B \ A) ⇒ P(B \ A) = P(B) − P(A),

and P(Ac) = P(Ω \ A) = P(Ω) − P(A) = 1 − P(A).

P is subadditive, that is, for A, B ∈ A

P(A ∪ B) = P( A ∪ [B \ (A ∩ B)] ) = P(A) + P(B) − P(A ∩ B) ≤ P(A) + P(B),   (1.1)

and by induction one obtains Sylvester's formula: let I be a finite index set and Ai, i ∈ I, a collection of subsets in A (not necessarily disjoint). Then

P( ⋃_{i∈I} Ai ) = ∑_{J⊂I, J≠∅} (−1)^{|J|−1} · P( ⋂_{j∈J} Aj )   (1.2)

which for I = {1, . . . , n} equals ∑_{k=1}^n (−1)^{k−1} · ∑_{1≤i1<···<ik≤n} P(Ai1 ∩ · · · ∩ Aik).
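Sylvester's formula (1.2) is easy to check numerically on a finite Laplace space, where P(A) = |A|/|Ω|. The following Python sketch (our illustration, with hypothetical set sizes) compares the inclusion–exclusion sum with a direct count of the union:

```python
import random
from itertools import combinations

def inclusion_exclusion(sets, omega_size):
    """P(A1 ∪ ... ∪ An) via Sylvester's formula (1.2) under the uniform distribution."""
    n, total = len(sets), 0.0
    for k in range(1, n + 1):
        for idx in combinations(range(n), k):
            inter = frozenset.intersection(*(sets[i] for i in idx))
            total += (-1) ** (k - 1) * len(inter) / omega_size
    return total

omega = range(100)                                            # Laplace space, P(A) = |A| / 100
sets = [frozenset(random.sample(omega, 40)) for _ in range(3)]
print(inclusion_exclusion(sets, 100))                         # equals ...
print(len(frozenset.union(*sets)) / 100)                      # ... the direct probability of the union
```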

Proposition 1.9. Let A be a σ-algebra, and let P : A → R+ := [0, ∞) be a mapping with P(Ω) = 1. Then the following are equivalent:


(i) P is a probability measure.

(ii) P is additive, that is, A,B ∈ A, A ∩B = ∅ implies P (A ∪B) = P (A) + P (B),

and

continuous from below, that is, Ai ∈ A, i ∈ N, with Ai ⊂ Ai+1 for all i ∈ N, implies

P( ⋃_{i∈N} Ai ) = lim_{i→∞} P(Ai).

(iii) P is additive and continuous from above, that is, Ai ∈ A, i ∈ N, with Ai ⊃ Ai+1 for all i ∈ N, implies

P( ⋂_{i∈N} Ai ) = lim_{i→∞} P(Ai).

Corollary 1.10 (σ-subadditivity). Let Ai, i ∈ N, be a sequence of subsets in A (not necessarily pairwise disjoint). Then:

P( ⋃_{i=1}^∞ Ai ) ≤ ∑_{i=1}^∞ P(Ai).

Proof.

P( ⋃_{i=1}^∞ Ai ) = lim_{n→∞} P( ⋃_{i=1}^n Ai )   (by 1.9)
≤ lim_{n→∞} ∑_{i=1}^n P(Ai) = ∑_{i=1}^∞ P(Ai)   (by (1.1)).

Lemma 1.11 (Borel–Cantelli). Let Ai ∈ A, i ∈ N. Then

∑_{i=1}^∞ P(Ai) < ∞  ⇒  P( ⋂_{n∈N} ⋃_{m>n} Am ) = 0,   where ⋂_{n∈N} ⋃_{m>n} Am =: lim sup_{n→∞} An.

Proof. Since

⋃_{m>n} Am  ց  ⋂_{n∈N} ⋃_{m>n} Am   as n → ∞,

the continuity from above of P implies that

P( lim sup_{n→∞} An ) = lim_{n→∞} P( ⋃_{m>n} Am )   (by 1.9)
≤ lim_{n→∞} ∑_{m=n}^∞ P(Am) = 0   (by 1.10),

since ∑_{m=1}^∞ P(Am) < ∞.

Example 1.12. (i) Uniform distribution on [0, 1]. Let Ω = [0, 1] and A the Borel σ-algebra on Ω (= σ( { [a, b] | 0 ≤ a ≤ b ≤ 1 } )). Let P be the restriction of the Lebesgue measure on the Borel subsets of R to [0, 1]. Then (Ω, A, P) is a probability space. The probability measure P is called the uniform distribution on [0, 1], since P([a, b]) = b − a for any 0 ≤ a ≤ b ≤ 1 (translation invariance).

(ii) Dirac measure. Let Ω ≠ ∅ and ω0 ∈ Ω. Let A be an arbitrary σ-algebra on Ω (e.g. A = P(Ω)). Then

P(A) := 1_A(ω0) := 1 if ω0 ∈ A, 0 if ω0 ∉ A,

defines a probability measure on A. P is called the Dirac measure in ω0, denoted by P = δ_{ω0} or P = ε_{ω0}.

(iii) Convex combinations of probability measures. Let Ω ≠ ∅ and A a σ-algebra of subsets of Ω. Let I be a countable index set, Pi, i ∈ I, a family of probability measures on (Ω, A), and αi ∈ [0, 1], i ∈ I, such that ∑_{i∈I} αi = 1. Then P := ∑_{i∈I} αi · Pi is again a probability measure on (Ω, A).

This holds in particular for

P := ∑_{i∈I} αi · δ_{ωi}   if ωi ∈ Ω, i ∈ I.

2 Discrete models

Throughout the whole section

• Ω ≠ ∅ countable (i.e. finite or denumerable),

• A = P(Ω), and

• ω ∈ Ω (more precisely {ω} ⊂ Ω) an elementary event.

Proposition 2.1. (i) Let p : Ω → [0, 1] be a function with 0 ≤ p(ω) ≤ 1 for all ω ∈ Ω and ∑_{ω∈Ω} p(ω) = 1 (p is called a probability distribution function). Then

P(A) := ∑_{ω∈A} p(ω),  ∀ A ⊂ Ω,

defines a probability measure on (Ω, A).

(ii) Every probability measure P on (Ω, A) is of this form, with p(ω) := P({ω}) for all ω ∈ Ω.

Proof. (i)

P = ∑_{ω∈Ω} p(ω) · δ_ω.


(ii) Exercise.

Example 2.2 (Laplace probability space). This is the fundamental example in the discrete case; it forms the basis of many other discrete models.

Let Ω be a nonempty finite set (that is, 0 < |Ω| < ∞). Define

p(ω) = 1/|Ω|  ∀ ω ∈ Ω.

Then

P(A) = |A|/|Ω| = (number of elements in A) / (number of elements in Ω).

Hence measure-theoretic problems reduce to combinatorial problems in the discrete case. P is called the uniform distribution on Ω, because every elementary event {ω}, ω ∈ Ω, has the same probability 1/|Ω|.

Example 2.3. (i) Random permutations. Let M := {1, . . . , n} and Ω := {all permutations of M}. Then |Ω| = n!. Let P be the uniform distribution on Ω.

Problem: What is the probability P(“at least one fixed point”)?

Consider the event Ai := { ω | ω(i) = i } (fixed point at position i). Then Sylvester's formula (cf. (1.2)) implies that

P(“at least one fixed point”) = P( ⋃_{i=1}^n Ai )
= ∑_{k=1}^n (−1)^{k+1} · ∑_{1≤i1<···<ik≤n} P(Ai1 ∩ · · · ∩ Aik)   [by (1.2)]
= ∑_{k=1}^n (−1)^{k+1} · (n choose k) · (n − k)!/n!   [P(Ai1 ∩ · · · ∩ Aik) = (n − k)!/n!, the k chosen positions being fixed]
= − ∑_{k=1}^n (−1)^k / k!.

Consequently,

P(“no fixed point”) = 1 + ∑_{k=1}^n (−1)^k/k! = ∑_{k=0}^n (−1)^k/k!  →  e^{−1}   as n → ∞,

and thus for all k ∈ {0, . . . , n}:

P(“exactly k fixed points”) = (1/n!) · (n choose k) · (n − k)! · ∑_{j=0}^{n−k} (−1)^j/j! = (1/k!) · ∑_{j=0}^{n−k} (−1)^j/j!

(the factor 1/n! counts all possible outcomes, (n choose k) counts the choices of the k fixed positions, and (n − k)! · ∑_{j=0}^{n−k} (−1)^j/j! counts the permutations of the remaining n − k positions without fixed points).

Asymptotics as n → ∞:

P(“exactly k fixed points”) = (1/k!) · ∑_{j=0}^{n−k} (−1)^j/j!  →  (1/k!) · e^{−1}

(the number of fixed points is asymptotically Poisson distributed with parameter λ = 1).

The Poisson distribution with parameter λ > 0 on N ∪ {0} is given by

π_λ := e^{−λ} ∑_{j=0}^∞ (λ^j / j!) · δ_j.
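The Poisson limit for the number of fixed points is easy to observe empirically. A small Monte Carlo sketch (ours; the sample sizes are arbitrary) compares the simulated frequencies with the Poisson(1) weights e^{−1}/k!:

```python
import math, random
from collections import Counter

n, trials = 100, 50_000
counts = Counter()
for _ in range(trials):
    perm = random.sample(range(n), n)            # a uniformly random permutation of {0, ..., n-1}
    counts[sum(perm[i] == i for i in range(n))] += 1

for k in range(5):                               # relative frequencies vs. Poisson(1) weights
    print(k, counts[k] / trials, math.exp(-1) / math.factorial(k))
```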

(ii) n experiments with state space S, |S| < ∞.

Ω := { ω = (x1, . . . , xn) | xi ∈ S },  |Ω| = |S|^n.

Let P be the uniform distribution on Ω.

Fix a subset S0 ⊂ S, such that xi ∈ S0 is called a “success”; hence p := |S0|/|S| is the probability of success.

What is the probability of the event Ak = “(exactly) k successes”, k = 0, . . . , n?

P(Ak) = |Ak|/|Ω| = (n choose k) · |S0|^k · |S \ S0|^{n−k} / |S|^n = (n choose k) · p^k (1 − p)^{n−k}

(binomial distribution with parameters n, p).

The binomial distribution with parameters n, p on {0, . . . , n} is given by

B(n, p) := β^p_n := ∑_{k=0}^n (n choose k) · p^k (1 − p)^{n−k} · δ_k,

where p = P(“success in the i-th experiment”), i = 1, . . . , n (see above). Moreover, if we replace p by p_n := λ/n, then for k = 0, 1, 2, . . . the asymptotics of the binomial distribution as n → ∞ is given by

(n choose k) · (λ/n)^k · (1 − λ/n)^{n−k} = (λ^k/k!) · [ n(n−1)···(n−k+1)/n^k ] · (1 − λ/n)^{n−k}  →  (λ^k/k!) · e^{−λ},

since n(n−1)···(n−k+1)/n^k → 1 and (1 − λ/n)^{n−k} → e^{−λ} as n → ∞

(for n big and p small, the Poisson distribution with parameter λ = p · n is a good approximation for B(n, p)).
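The quality of this Poisson approximation can be checked directly from the two probability mass functions. A short sketch (ours, with illustrative parameters n = 1000, λ = 2):

```python
import math

def binom_pmf(n, p, k):
    """B(n, p)({k}) = (n choose k) p^k (1-p)^(n-k)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(lam, k):
    """pi_lambda({k}) = exp(-lambda) lambda^k / k!."""
    return math.exp(-lam) * lam**k / math.factorial(k)

n, lam = 1000, 2.0                    # p = lam / n = 0.002
for k in range(6):
    print(k, round(binom_pmf(n, lam / n, k), 6), round(poisson_pmf(lam, k), 6))
```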


(iii) Urn model (for example: opinion polls, samples, poker, lottery, . . . ). We consider an urn containing N balls, K red and N − K black. Suppose that n ≤ N balls are sampled without replacement. What is the probability that exactly k balls in the sample are red?

Typical application: suppose that a small lake contains an (unknown) number N of fish. To estimate N one can do the following: K fish are marked red, and after that n (n ≤ N) fish are “sampled” from the lake. If k is the number of marked fish in the sample, then N̂ := K · n/k is an estimate of the unknown number N. In this case the probability below, with N replaced by N̂, is also an estimate.

Model: let Ω be the collection of all subsets of {1, . . . , N} having cardinality n, hence

Ω := { ω ∈ P({1, . . . , N}) | |ω| = n },  |Ω| = (N choose n),

and let P be the uniform distribution on Ω. Consider the event Ak := “exactly k red”. Then

|Ak| = (K choose k) · (N − K choose n − k),

so that

P(Ak) = (K choose k)(N − K choose n − k) / (N choose n)   (k = 0, . . . , n)   (hypergeometric distribution).

Asymptotics for N → ∞, K → ∞ with p := K/N and n fixed:

P(Ak) → (n choose k) · p^k (1 − p)^{n−k}   (k = 0, . . . , n)

(a good approximation for N big and n/N small).
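The hypergeometric probabilities and the capture–recapture estimate can be computed directly. A short sketch (ours; the lake parameters below are hypothetical):

```python
import math

def hypergeom_pmf(N, K, n, k):
    """P(exactly k red) when n balls are drawn without replacement from N balls, K of them red."""
    return math.comb(K, k) * math.comb(N - K, n - k) / math.comb(N, n)

# capture-recapture: N = 1000 fish, K = 100 marked, n = 50 sampled
N, K, n = 1000, 100, 50
for k in (3, 5, 7):
    print(k, round(hypergeom_pmf(N, K, n, k), 4), "estimate N ≈", K * n // k)
```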

3 Transformations of probability spaces

Throughout this section let (Ω, A) and (Ω̃, Ã) be measurable spaces.

Definition 3.1. A mapping T : Ω → Ω̃ is called A/Ã-measurable (or simply measurable) if T^{−1}(Ã) ∈ A for all Ã ∈ Ã.

Notation:

{T ∈ Ã} := T^{−1}(Ã) = { ω ∈ Ω | T(ω) ∈ Ã }.

Remark 3.2. (i) Clearly, if A := P(Ω), then every mapping T : Ω → Ω̃ is measurable.

(ii) Sufficient criterion for measurability. Suppose that Ã := σ(Ã0) for some collection of subsets Ã0 ⊂ P(Ω̃). Then T is A/Ã-measurable if T^{−1}(Ã) ∈ A for all Ã ∈ Ã0.

(iii) Let Ω, Ω̃ be topological spaces, and A, Ã the associated Borel σ-algebras. Then:

T : Ω → Ω̃ continuous  ⇒  T is A/Ã-measurable.

(iv) Let (Ωi, Ai), i = 1, 2, 3, be measurable spaces, and Ti : Ωi → Ωi+1, i = 1, 2, measurable mappings. Then:

T2 ∘ T1 is A1/A3-measurable.

Proof. (ii) { Ã ∈ P(Ω̃) | T^{−1}(Ã) ∈ A } is a σ-algebra containing Ã0. Consequently,

σ(Ã0) ⊂ { Ã ∈ P(Ω̃) | T^{−1}(Ã) ∈ A }.

(iii) Easy consequence of (ii).

(iv) Exercise.

Definition 3.3. Let T : Ω → Ω̃ be a mapping and let Ã be a σ-algebra of subsets of Ω̃. The system

σ(T) := { T^{−1}(Ã) | Ã ∈ Ã }

is a σ-algebra of subsets of Ω; σ(T) is called the σ-algebra generated by T. More precisely: σ(T) is the smallest σ-algebra A such that T is A/Ã-measurable.

Proposition 3.4. Let T : Ω → Ω̃ be A/Ã-measurable and P a probability measure on (Ω, A). Then

P̃(Ã) := T(P)(Ã) := P( T^{−1}(Ã) ) =: P[T ∈ Ã],  Ã ∈ Ã,

defines a probability measure on (Ω̃, Ã). P̃ is called the induced measure on (Ω̃, Ã) or the distribution of T under P.

Notation: P̃ = P ∘ T^{−1} or P̃ = T(P).

Proof. Clearly, P̃(Ã) ≥ 0 for all Ã ∈ Ã, P̃(∅) = 0 and P̃(Ω̃) = 1. For pairwise disjoint Ãi ∈ Ã, i ∈ N, the sets T^{−1}(Ãi) are pairwise disjoint too, hence

P̃( ⋃_{i∈N} Ãi ) = P( T^{−1}( ⋃_{i∈N} Ãi ) ) = P( ⋃_{i∈N} T^{−1}(Ãi) ) = ∑_{i=1}^∞ P( T^{−1}(Ãi) ) = ∑_{i=1}^∞ P̃(Ãi),

using the σ-additivity of P.

Remark 3.5. Let T(Ω) be countable, say T(Ω) = { ω̃i | i ∈ I } with I finite or I = N. Then

P̃ = ∑_{i∈I} P[T = ω̃i] · δ_{ω̃i}.

Proof. Any Ã ∈ Ã can be written as

{T ∈ Ã} = ⋃_{i∈I, ω̃i∈Ã} {T = ω̃i},

so that

P̃(Ã) = P[T ∈ Ã] = ∑_{i∈I} P[T = ω̃i] · 1_Ã(ω̃i) = ( ∑_{i∈I} P[T = ω̃i] · δ_{ω̃i} )(Ã),

since 1_Ã(ω̃i) = δ_{ω̃i}(Ã).

Example 3.6. Infinitely many coin tosses: existence of a probability measure. Let Ω := [0, 1] and A the Borel σ-algebra on [0, 1]. Let P be the restriction of the Lebesgue measure to [0, 1]. Let

Ω̃ := { ω̃ = (xn)_{n∈N} | xi ∈ {0, 1} ∀ i ∈ N } = {0, 1}^N.

Define Xi : Ω̃ → {0, 1} by

Xi( (xn)_{n∈N} ) := xi,  i ∈ N,

and let

Ã := σ( { {Xi = 1} | i ∈ N } ).

Note that Ã = σ(Ã0), where Ã0 is the algebra of cylindrical subsets of Example 1.7 (iii). The binary expansion of ω ∈ [0, 1] defines a mapping

T : Ω → Ω̃,  ω ↦ T(ω) = (T1ω, T2ω, . . . ),

where Ti(ω) is the i-th binary digit of ω.

[Figure: graphs of T1(ω) and T2(ω) as step functions on [0, 1], with jumps at 1/2, resp. at 1/4, 1/2, 3/4.]

(And similarly for T3, T4, . . . .) Note that Ti = Xi ∘ T for all i ∈ N. T is A/Ã-measurable, since

T^{−1}( {Xi = 1} ) = {Ti = 1} = finite union of intervals ∈ A.

Define P̃ := P ∘ T^{−1}. For fixed (x1, . . . , xn) ∈ {0, 1}^n we now obtain

P̃[X1 = x1, . . . , Xn = xn] := P̃( ⋂_{i=1}^n {Xi = xi} ) = P( ⋂_{i=1}^n T^{−1}({Xi = xi}) )
= P( ⋂_{i=1}^n {Ti = xi} ) = P[interval of length 2^{−n}] = 2^{−n}.

Hence, for any fixed n, P̃ coincides with the probability measure for n coin tosses (= the uniform distribution on binary sequences of length n). We have thus shown the existence of a probability measure P̃ on (Ω̃, Ã) and solved part of the problem of 1.7.

Uniqueness of P̃: later!
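The construction can be imitated numerically: a single uniform random number on [0, 1] yields a whole sequence of fair coin tosses through its binary digits. A minimal sketch (ours):

```python
import random

def binary_digits(u, n):
    """First n binary digits of u in [0, 1): the coin tosses T1(u), ..., Tn(u)."""
    digits = []
    for _ in range(n):
        u *= 2
        d = int(u)
        digits.append(d)
        u -= d
    return digits

# relative frequency of the cylinder {X1=1, X2=0, X3=1} under the Lebesgue measure
trials = 100_000
hits = sum(binary_digits(random.random(), 3) == [1, 0, 1] for _ in range(trials))
print(hits / trials, "≈", 2 ** -3)
```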

4 Random variables

Let (Ω,A) be a measurable space and

R̄ := R ∪ {−∞, +∞},  B(R̄) := { B ⊂ R̄ | B ∩ R ∈ B(R) }.

Definition 4.1. A random variable (r.v.) on (Ω, A) is an A/B(R̄)-measurable map X : Ω → R̄.

Remark: We will mainly consider real-valued r.v. X : Ω → R. In this case it is of course enough to check A/B(R)-measurability.

Remark 4.2. (i) X : Ω → R̄ is a random variable if for all c ∈ R̄, {X ≤ c} ∈ A. If X : Ω → R is real-valued, it is enough to check {X ≤ c} ∈ A for all c ∈ R. It even suffices to consider only c ∈ Q ∪ {±∞} (instead of c ∈ R̄), resp. c ∈ Q (instead of c ∈ R), or instead of Q any other dense set S ⊂ R.

(ii) If A = P(Ω), then every function from Ω to R is a random variable on (Ω,A).

(iii) Let X be a random variable on (Ω, A) with values in R (resp. R̄) and let h : R → R (resp. h : R̄ → R̄) be a measurable mapping. Then h(X) is a random variable too.

Examples: |X|, X², |X|^p, e^X, . . .

(iv) The class of random variables on (Ω, A) is closed under countable operations. In particular, if X1, X2, . . . are random variables, then

• ∑_i αi · Xi (provided the sum makes sense, since ∞ − ∞ is not defined),

• sup_i Xi, inf_i Xi,

• lim sup_{i→∞} Xi, lim inf_{i→∞} Xi

are random variables too.

Proof. (i) Obvious, since σ({ [−∞, c] : c ∈ R̄ }) = B(R̄), resp. σ({ (−∞, c] : c ∈ R }) = B(R). This also holds if we replace c ∈ R̄ by c ∈ S ∪ {±∞}, resp. c ∈ R by c ∈ S, where S ⊂ R is any dense subset of R (exercise).

(ii) and (iii) Obvious.

(iv) For example:

• supremum:

{ (sup_{i∈N} Xi) ≤ c } = ⋂_{i∈N} {Xi ≤ c} ∈ A, since each {Xi ≤ c} ∈ A.

• sum (assume {X = +∞} ∩ {Y = −∞} = ∅ = {X = −∞} ∩ {Y = +∞}, with the usual conventions ∞ + ∞ = ∞, −∞ − ∞ = −∞, ∞ + α = ∞ for α ∈ R, etc.). Then for c ∈ R

{X + Y < c} = ⋃_{r,s∈Q, r+s<c} ( {X < r} ∩ {Y < s} ) ∈ A,

and for c = ∞ or c = −∞: {X + Y = c} = {X = c} ∪ {Y = c} by assumption.

Important examples

Example 4.3. (i) Indicator functions of an event A ∈ A:

ω ↦ 1_A(ω) := 1 if ω ∈ A, 0 if ω ∉ A   (alternative notation I_A)

is a random variable, because

{1_A ≤ c} = ∅ if c < 0,  A^c if 0 ≤ c < 1,  Ω if c ≥ 1.

(ii) Simple random variables:

X = ∑_{i=1}^n ci · 1_{Ai},  ci ∈ R, Ai ∈ A.

Note: any random variable with finitely many values is simple, because X(Ω) = {c1, . . . , cn} implies

X = ∑_i ci · 1_{Ai} with Ai := {X = ci} = X^{−1}({ci}) ∈ A.

Proposition 4.4 (Structure of random variables on (Ω, A)). Let X be a random variable on (Ω, A). Then:

(i) X = X⁺ − X⁻, with

X⁺ := max(X, 0), X⁻ := −min(X, 0)   (both random variables!).

(ii) Let X ≥ 0. Then there exists a sequence of simple random variables Xn, n ∈ N, with Xn ≤ Xn+1 and X = lim_{n→∞} Xn (in short: Xn ↗ X).

Proof (of (ii)).

Xn := ∑_{i=0}^{n·2^n − 1} (i/2^n) · 1_{ {i/2^n ≤ X < (i+1)/2^n} } + n · 1_{ {X ≥ n} }.


Let (Ω,A, P ) be a probability space.

Definition 4.5. Let X be a random variable on (Ω, A) with

min( ∫ X⁺ dP, ∫ X⁻ dP ) < ∞.   (1.3)

Then

E[X] := ∫ X dP   ( = ∫_Ω X dP )

is called the expectation of X (w.r.t. P).

Definition/Construction of the integral w.r.t. P. Let X be a random variable.

1. If X = 1_A, A ∈ A, define ∫ X dP := P(A).

2. If X = ∑_{i=1}^n ci · 1_{Ai}, ci ∈ R, Ai ∈ A, define

∫ X dP := ∑_{i=1}^n ci · P(Ai)

(independent of the particular representation of X!).

3. If X ≥ 0, then there exist simple Xn ≥ 0 (see 4.4) with Xn ↗ X. Define

∫ X dP := lim_{n→∞} ∫ Xn dP   (∈ [0, ∞])

(independent of the particular choice of the Xn!).

4. For general X, decompose X = X⁺ − X⁻ and define

( E[X] = ) ∫ X dP := ∫ X⁺ dP − ∫ X⁻ dP

(well-defined if (1.3) is satisfied, using the usual conventions for c ∈ R: −∞ + c := c − ∞ := −∞, c + ∞ := ∞ + c := ∞).
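As a numerical illustration of steps 2 and 3 (a sketch of ours, not part of the lecture), one can approximate E[X] for X ≥ 0 on a finite probability space by the staircase functions of Proposition 4.4; the approximations increase monotonically towards the exact value:

```python
def staircase(x, n):
    """Simple-function approximation X_n from Proposition 4.4, evaluated at x >= 0."""
    if x >= n:
        return n
    return int(x * 2**n) / 2**n          # value i/2^n on the dyadic interval containing x

# finite probability space: Omega = {0,...,5}, uniform; X(omega) = omega**2 / 3
omega = range(6)
p = {w: 1 / 6 for w in omega}
X = {w: w**2 / 3 for w in omega}

exact = sum(X[w] * p[w] for w in omega)
for n in (1, 4, 10):
    approx = sum(staircase(X[w], n) * p[w] for w in omega)   # E[X_n] = sum_i c_i P(A_i)
    print(n, round(approx, 4), "→", round(exact, 4))
```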

Definition 4.6. (i) The set of all P-integrable random variables is defined by

ℒ¹ := ℒ¹(Ω, A, P) := { X r.v. | E[|X|] < ∞ }.

(ii) A property E of points ω ∈ Ω holds P-almost surely (P-a.s.) if there exists a measurable null set N, i.e. a set N ∈ A with P[N] = 0, such that every ω ∈ Ω \ N has property E.

If

N := { X r.v. | X = 0 P-a.s. },

then the quotient space

L¹ := ℒ¹/N   (X ∼ Y :⇔ X = Y P-a.s.)

is a Banach space w.r.t. the norm E[|X|].

Remark 4.7. Special case: X a random variable, X ≥ 0, X(Ω) countable. Then (with the usual conventions (+∞)·0 := (−∞)·0 := 0, (+∞)·α := +∞ if α > 0, (+∞)·α := −∞ if α < 0, etc.) we have

E[X] = E[ ∑_{x∈X(Ω)} x · 1_{{X=x}} ] = ∑_{x∈X(Ω)} x · P[X = x]   (1.4)

by steps “3.” and “2.” above. Similarly, for general X (not necessarily ≥ 0) with E[X] well-defined:

E[X] = ∑_{x∈X(Ω), x>0} x · P[X = x] − ∑_{x∈X(Ω), x<0} (−x) · P[X = x].

If, in addition, Ω is countable and X ≥ 0, then

X = ∑_{ω∈Ω} X(ω) · 1_{{ω}},  and

E[X] = ∑_{ω∈Ω} X(ω) · E[1_{{ω}}] = ∑_{ω∈Ω} X(ω) · P({ω}) = ∑_{ω∈Ω} p(ω) · X(ω),  where p(ω) := P({ω}).

Example 4.8. Infinitely many coin tosses with a fair coin. Let Ω = {0, 1}^N, and let A and P be as in 3.6.

(i) Expectation of the i-th coin toss Xi( (xn)_{n∈N} ) := xi:

E[Xi] = 1 · P[Xi = 1] + 0 · P[Xi = 0] = 1/2   (by (1.4)).

(ii) Expectation of the number of “successes”. Let

Sn := X1 + · · · + Xn = number of “successes” (= ones) in n tosses.

Then

P[Sn = k] = ∑_{(x1,...,xn)∈{0,1}^n, x1+···+xn=k} P[X1 = x1, . . . , Xn = xn] = (n choose k) · 2^{−n},  k = 0, 1, . . . , n.

Hence

E[Sn] = ∑_{k=0}^n k · P[Sn = k] = ∑_{k=1}^n k · (n choose k) · 2^{−n} = n/2   (by (1.4)).

Easier: once we have noticed that E[·] is linear (see the next proposition):

E[Sn] = ∑_{i=1}^n E[Xi] = n/2   (by (i)).

(iii) Waiting time until the first success. Let

T(ω) := min{ n ∈ N | Xn(ω) = 1 } = waiting time until the first success

(min ∅ := +∞; T is measurable — why?). Then

P[T = k] = P[X1 = · · · = X_{k−1} = 0, Xk = 1] = 2^{−k},

and P(T = ∞) = 0, so that

E[T] = ∑_{k=1}^∞ k · P[T = k] = ∑_{k=1}^∞ k · 2^{−k} = (1/2) ∑_{k=1}^∞ k · (1/2)^{k−1} = 2   (by (1.4)).

(Recall: d/dq ( q/(1−q) ) = d/dq ∑_{k=1}^∞ q^k = ∑_{k=1}^∞ k q^{k−1}.)
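The value E[T] = 2 can be checked by simulation; a minimal sketch of ours:

```python
import random

def waiting_time():
    """Number of fair-coin tosses until the first 1 (the random variable T)."""
    n = 1
    while random.random() >= 0.5:
        n += 1
    return n

trials = 100_000
print(sum(waiting_time() for _ in range(trials)) / trials)   # ≈ E[T] = 2
```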

Remark 4.9. X = Y P -a.s., i.e. P [X = Y ] = 1, implies E[X] = E[Y ].

Proposition 4.10. Let X, Y be r.v. satisfying (1.3). Then

(i) 0 ≤ X ≤ Y P-a.s. ⇒ 0 ≤ E[X] ≤ E[Y].

(ii) α, β ∈ R, Y ∈ L¹ ⇒ E[αX + βY] = α E[X] + β E[Y].

In particular, (i) implies: X ≤ Y P-a.s. ⇒ E[X] ≤ E[Y] (only (1.3) is necessary!).

Proof. See textbooks on measure theory.

In addition, X ↦ E[X] is continuous w.r.t. monotone increasing sequences, i.e. the following proposition holds:

Proposition 4.11 (Monotone convergence, B. Levi). Let Xn be random variables with 0 ≤ X1 ≤ X2 ≤ . . . . Then:

lim_{n→∞} E[Xn] = E[ lim_{n→∞} Xn ].

Proof. See textbooks on measure theory.

Corollary 4.12. Let Xn ≥ 0, n ∈ N, be random variables. Then

E[ ∑_{n=1}^∞ Xn ] = ∑_{n=1}^∞ E[Xn].

Lemma 4.13 (Fatou's lemma). Let Xn, n ∈ N, be a sequence of random variables satisfying (1.3) and let Y ∈ L¹. Then

(i) Xn ≥ Y P-a.s. for all n ∈ N ⇒ E[ lim inf_{n→∞} Xn ] ≤ lim inf_{n→∞} E[Xn].

(ii) Xn ≤ Y P-a.s. for all n ∈ N ⇒ E[ lim sup_{n→∞} Xn ] ≥ lim sup_{n→∞} E[Xn].

(Remark: (i) and (ii) are mostly applied with Y = 0.)

Proof. (i) By Remark 4.9 and Corollary 5.6 below we may assume that |Y(ω)| < ∞ for all ω ∈ Ω. Then

E[ lim inf_{n→∞} Xn ] = E[ lim_{n→∞} inf_{k≥n} (Xk − Y) ] + E[Y]   (by 4.10 (ii); the inner random variable satisfies (1.3))
= lim_{n→∞} E[ inf_{k≥n} (Xk − Y) ] + E[Y]   (Levi)
≤ lim_{n→∞} inf_{k≥n} E[Xk − Y] + E[Y]   (by 4.10)
= lim inf_{n→∞} E[Xn]   (by 4.10 (ii)).

(ii) is similar to (i).

Proposition 4.14 (Lebesgue’s dominated convergence theorem, DCT). Let Xn, n ∈ Nbe random variables and Y ∈ L1 with |Xn| 6 Y P -a.s. Suppose that the pointwiselimit limn→∞ Xn exists P -a.s. Then

E[lim

n→∞Xn

]= lim

n→∞E[Xn].

Proof. Since −Y 6 Xn 6 Y P -a.s. for any n and lim infn→∞ Xn = lim supn→∞ Xn =limn→∞ Xn P -a.s, it follows

E[lim

n→∞Xn

] 4.9= E

[lim infn→∞

Xn

] Fatou6 lim inf

n→∞E[Xn] 6 lim sup

n→∞E[Xn]

Fatou6 E

[lim supn→∞

Xn

] 4.9= E

[lim

n→∞Xn

].

Example 4.15. Tossing a fair coin. Consider the following simple game: a fair coin is thrown and the player can invest an arbitrary amount of KRW on either “head” or “tail”. If the right side shows up, the player gets twice his investment back, otherwise nothing.

Suppose now a player plays the following bold strategy: he doubles his investment until his first success. Assuming the initial investment was 1000 KRW, the investment in the n-th round is given by

In = 2^{n−1} · 1_{{T > n−1}},

where T is the waiting time until the first “1”. Then

E[In] = 2^{n−1} · P[T > n − 1] = 2^{n−1} · (1/2)^{n−1} = 1,

whereas on the other hand lim_{n→∞} In = 0 P-a.s. (more precisely: for all ω ≠ (0, 0, 0, . . . )).
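This is a nice case where the limit and the expectation cannot be interchanged, and it is easy to see in simulation. A sketch of ours (the number of rounds and trials are arbitrary):

```python
import random

def investment_path(rounds):
    """I_n = 2^(n-1) while the first success has not yet occurred, 0 afterwards."""
    path, success = [], False
    for n in range(1, rounds + 1):
        path.append(0 if success else 2 ** (n - 1))
        if not success and random.random() < 0.5:
            success = True
    return path

trials, rounds = 100_000, 12
sums = [0.0] * rounds
for _ in range(trials):
    for n, inv in enumerate(investment_path(rounds)):
        sums[n] += inv

print([round(s / trials, 2) for s in sums])   # every entry ≈ E[I_n] = 1, yet I_n -> 0 a.s.
```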

5 Inequalities

Let (Ω,A, P ) be a probability space.

Proposition 5.1 (Jensens’ inequality). Let h be a convex function defined on someinterval I ⊆ R, X in L1 with X(Ω) ⊂ I. Then E(X) ∈ I and

h(E[X]

)6 E

[h(X)

].

Proof. W.l.o.g. we may asssume that X not P -a.s. equal to a constant function.Then x0 := E[X] ∈ I. Since h is convex, there exists an affine linear function ℓ withℓ(x0) = h(x0) and ℓ 6 h (“support tangent line”).

ℓh

Consequently,

h(E[X]

)= ℓ(E[X]

) linearity= E

[ℓ(X)

] monotonicity

6 E[h(X)

].

Example 5.2.

E[X]² ≤ E[X²].

More generally, for 0 < p ≤ q:

‖X‖_p := E[ |X|^p ]^{1/p} ≤ E[ |X|^q ]^{1/q} =: ‖X‖_q.

Proof. h(x) := |x|^{q/p} is convex. Since (|X| ∧ n)^p ∈ L¹ for n ∈ N, we obtain that

( E[ (|X| ∧ n)^p ] )^{q/p} ≤ E[ (|X| ∧ n)^q ],

which implies the assertion by B. Levi, taking the limit n → ∞.

Definition 5.3. For 1 ≤ p < ∞ let

ℒ^p := { X | X r.v. and E[ |X|^p ] < ∞ }.

ℒ^p is called the set of p-integrable random variables.

Remark 5.4. (i) If 1 ≤ p ≤ q then ℒ^q ⊂ ℒ^p.

(ii) Let N = { X | X r.v. and X = 0 P-a.s. } and p ≥ 1. Then N ⊂ ℒ^p is a linear subspace and the quotient space

L^p := ℒ^p/N

is a Banach space w.r.t. ‖ · ‖_p (i.e. a complete normed vector space).

Proposition 5.5. Let X be a random variable and h : R → [0, ∞] increasing. Then

h(c) · P[X ≥ c] ≤ E[ h(X) ]  ∀ c > 0.

Proof.

h(c) · P[X ≥ c] ≤ h(c) · P[ h(X) ≥ h(c) ] = E[ h(c) · 1_{{h(X) ≥ h(c)}} ] ≤ E[ h(X) ].

Corollary 5.6. (i) Markov's inequality. Choose h(x) = x · 1_{[0,∞)}(x) and replace X by |X| in 5.5. Then

P[ |X| ≥ c ] ≤ (1/c) · E[ |X| ]  ∀ c > 0.

In particular,

E[ |X| ] = 0 ⇒ |X| = 0 P-a.s.,
E[ |X| ] < ∞ ⇒ |X| < ∞ P-a.s.

(ii) Chebychev's inequality. Choose h(x) = x² · 1_{[0,∞)}(x) and replace X by |X − E[X]| in 5.5. Then X ∈ L² implies

P[ |X − E[X]| ≥ c ] ≤ (1/c²) · E[ (X − E[X])² ]   ( = var(X)/c² ).

6 Variance and Covariance

Let (Ω, A, P) be a probability space. E[X] is the “average value” of X(ω), ω ∈ Ω (a prediction).

Remark 6.1. Let P be the uniform distribution on Ω = {ω1, . . . , ωn}. Then

E[X] = (1/n) ∑_{i=1}^n X(ωi) = arithmetic mean of X(ω1), . . . , X(ωn).

Definition 6.2. Let X ∈ L¹. Then

var(X) := σ²(X) := E[ (X − E[X])² ]   ( ∈ [0, ∞] )

is called the variance of X (mean square prediction error). The variance is a measure of the fluctuations of X around E[X]. It indicates the risk that one takes when a prognosis is based on the expectation.

σ(X) := √var(X) is called the standard deviation.

Remark 6.3. (i) var(X) = E[ (X − E[X])² ] = E[X²] − (E[X])².

(ii) var(X) = 0 ⇔ P[ X = E[X] ] = 1, i.e. X behaves deterministically.

(iii) var(X) < ∞ ⇔ X ∈ L².

Definition 6.4. Let X, Y ∈ L². Then

cov(X, Y) := E[ (X − E[X])(Y − E[Y]) ] = E[XY] − E[X] · E[Y]   (1.5)

is called the covariance of X and Y. If additionally X, Y are not deterministic, then

ρ(X, Y) := cov(X, Y) / ( σ(X) · σ(Y) )

is called the correlation of X and Y.

Remark 6.5 (Properties of the covariance). (i) Let X ∈ L¹ and a, b ∈ R. Then

var(aX + b) = a² · var(X).

(ii) Let X, Y ∈ L². Then

var(X + Y) = var(X) + var(Y) + 2 cov(X, Y).

Definition 6.6. Two random variables X, Y ∈ L² are called uncorrelated if

cov(X, Y) = 0   ( ⇔ var(X + Y) = var(X) + var(Y) ).   (1.6)

Proposition 6.7 (Cauchy–Schwarz). Let X, Y ∈ L². Then

X · Y ∈ L¹  and  |cov(X, Y)| ≤ σ(X) · σ(Y).

In particular, ρ(X, Y) ∈ [−1, 1].

Proof. Let X, Y ∈ L². Then X + Y ∈ L², hence

2 · XY = (X + Y)² − X² − Y² ∈ L¹.

cov(·, ·) : L² × L² → R, (X, Y) ↦ cov(X, Y), is bilinear, symmetric and positive semidefinite. Hence by the Cauchy–Schwarz inequality

|cov(X, Y)| ≤ cov(X, X)^{1/2} · cov(Y, Y)^{1/2} = σ(X) · σ(Y).

Example 6.8. Tossing a coin with success probability p ∈ [0, 1]. Let

Ω = { ω = (x1, x2, . . . ) | xi ∈ {0, 1} } = {0, 1}^N,
Xi : Ω → {0, 1} with Xi( (xn)_{n∈N} ) = xi, i ∈ N,
A = σ( { {Xi = 1} | i ∈ N } ).

Then there exists a unique probability measure P = Pp on (Ω, A) with

Pp[X_{i1} = x1, . . . , X_{in} = xn] = p^{∑_{i=1}^n xi} · (1 − p)^{n − ∑_{i=1}^n xi}

for any 1 ≤ i1 < · · · < in (existence for p = 1/2 in Example 3.6; existence for general p ≠ 1/2 later, uniqueness later). Then P[Xi = 1] = p and P[Xi = 1, Xj = 1] = p² for all i ≠ j. Consequently,

E[Xi] = p  and  var(Xi) = E[Xi²] − E[Xi]² = p − p² = p(1 − p),

and for i ≠ j

cov(Xi, Xj) = E[Xi Xj] − p² = 0,

so that X1, X2, . . . are pairwise uncorrelated (in fact even independent, see below). Let Sn := X1 + · · · + Xn be the number of successes. Then

E[Sn] = np  and  var(Sn) = np(1 − p).

If X := ∑_{n=1}^∞ 2^{−n} Xn, then

E[X] = E[ ∑_{n=1}^∞ 2^{−n} Xn ] = ∑_{n=1}^∞ E[2^{−n} Xn] = ∑_{n=1}^∞ 2^{−n} p = p   (Levi),

and using Levi and the fact that X1, X2, . . . are pairwise uncorrelated, we conclude that

var(X) = ∑_{n=1}^∞ 2^{−2n} · p(1 − p) = (1/3) · p(1 − p).

Finally, let T be the waiting time until the first “success”. Then

Pp[T = n] = Pp[ X1 = · · · = X_{n−1} = 0, Xn = 1 ] = (1 − p)^{n−1} p   (geometric distribution),

and then

E[T] = E[ ∑_{n=1}^∞ n · 1_{{T=n}} ] = ∑_{n=1}^∞ n · Pp[T = n] = ∑_{n=1}^∞ n · (1 − p)^{n−1} · p = 1/p

(using the derivative of the geometric series), and analogously

var(T) = · · · = (1 − p)/p².

7 The (strong and the weak) law of large numbers

Let

• (Ω, A, P) be a probability space,

• X1, X2, . . . ∈ L² r.v. with

  – Xi uncorrelated, i.e. cov(Xi, Xj) = 0 for i ≠ j,

  – bounded variances, i.e. sup_{i∈N} var(Xi) < ∞, where var(Xi) = σ²(Xi) =: σi².

Let

Sn := X1 + · · · + Xn,

so that Sn(ω)/n is the arithmetic mean of the first n observations X1(ω), . . . , Xn(ω) (“empirical mean”).

Our aim in this section is to show that the randomness in the empirical mean vanishes as n increases, i.e.

Sn(ω)/n ≈ E[Sn]/n   for n large,

resp.

Sn(ω)/n ≈ m   for n large, if E[Xi] = m.

Remark 7.1. W.l.o.g. we may assume that E[Xi] = 0 for all i, because X̃i := Xi − E[Xi] (“centered”) satisfies:

• X̃i ∈ L²,

• cov(X̃i, X̃j) = cov(Xi, Xj) = 0 for i ≠ j,

• var(X̃i) = var(Xi),

• Sn/n − E[Sn]/n = S̃n/n − E[S̃n]/n, where S̃n := ∑_{i=1}^n X̃i (the “centered sum”).

Proposition 7.2.

lim_{n→∞} E[ ( Sn/n − E[Sn]/n )² ] = 0

(resp. lim_{n→∞} E[ ( Sn/n − m )² ] = 0 if E[Xi] = m).

Proof.

E[ ( Sn/n − E[Sn]/n )² ] = var( Sn/n ) = (1/n²) · var(Sn) = (1/n²) ∑_{i=1}^n σi²   (Bienaymé)
≤ (1/n) · const. → 0   as n → ∞.

Remark 7.3. This is a purely functional-analytic fact: in the Hilbert space L² = ℒ²/∼ (with the scalar product (X, Y) = E[X · Y] and norm ‖ · ‖ = (·, ·)^{1/2}), the arithmetic mean of bounded, orthogonal vectors converges to zero:

‖ Sn/n ‖² = (1/n²) · ‖Sn‖² = (1/n²) ∑_{i=1}^n ‖Xi‖² → 0   as n → ∞.

Chebychev’s inequality immediately implies the following

Proposition 7.4 ("Weak law of large numbers"). Let X1, X2, . . . ∈ L2(Ω,A, P ) un-correlated r.v. with bounded variances and E[Xi] = m ∀ i. Then for all ε > 0:

limn→∞

P

[∣∣∣Sn

n−m

∣∣∣ > ε

]

= 0 .

"convergence in probability of Sn

ntowards m"

Proof. Chebychev’s inequality implies

P

[∣∣∣Sn

n−m

∣∣∣ > ε

]

61

ε2· var

(Sn

n

)

→ 0 if n → ∞.

Example 7.5. Bernoulli experiments with parameter p ∈ [0, 1]. Let Xi(ω) = xi and Pp[Xi = 1] = p, hence Ep[Xi] = p and var(Xi) = p(1 − p) (≤ 1/4). Then

Pp[ | Sn/n − p | > ε ] → 0   as n → ∞,   (1.7)

where Sn/n is the relative frequency of “1” (J. Bernoulli: Ars Conjectandi). Interpretation of the success probability p as a relative frequency.

Problem: to infer p from (1.7) one uses a probabilistic statement w.r.t. a probability measure Pp that is itself defined with the help of p.
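Statement (1.7) is easily visible numerically: for a fixed ε the probability of a large deviation of the relative frequency shrinks as n grows. A Monte Carlo sketch of ours (parameters p = 0.3, ε = 0.05 chosen for illustration):

```python
import random

def excess_prob(n, p, eps, trials=5_000):
    """Monte Carlo estimate of P[|S_n/n - p| > eps] for n Bernoulli(p) trials."""
    bad = 0
    for _ in range(trials):
        s = sum(random.random() < p for _ in range(n))
        bad += abs(s / n - p) > eps
    return bad / trials

for n in (10, 100, 1000):
    print(n, excess_prob(n, p=0.3, eps=0.05))   # decreases towards 0 as n grows
```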

Example 7.6. Application: uniform approximation of f ∈ C([0, 1]) by Bernstein polynomials. Let

Bn(p) := ∑_{k=0}^n f(k/n) · (n choose k) p^k (1 − p)^{n−k} = ∑_{k=0}^n f(k/n) · Pp[Sn = k] = Ep[ f(Sn/n) ].

Let ε > 0. Since f is uniformly continuous, there exists δ = δ(ε) > 0 such that

sup_{x,y∈[0,1], |x−y|≤δ} | f(x) − f(y) | ≤ ε.

Consequently,

|Bn(p) − f(p)| = | Ep[ f(Sn/n) − f(p) ] | ≤ Ep[ | f(Sn/n) − f(p) | ]
= Ep[ | f(Sn/n) − f(p) | · 1_{{|Sn/n − p| ≤ δ}} ] + Ep[ | f(Sn/n) − f(p) | · 1_{{|Sn/n − p| > δ}} ]
≤ ε · Pp[ | Sn/n − p | ≤ δ ] + 2‖f‖_∞ · Pp[ | Sn/n − p | > δ ],

where Pp[ | Sn/n − p | > δ ] ≤ p(1 − p)/(δ²n) ≤ 1/(4δ²n). Consequently,

lim sup_{n→∞} ‖Bn − f‖_∞ ≤ ε for all ε > 0, hence lim_{n→∞} ‖Bn − f‖_∞ = 0.
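The uniform convergence of the Bernstein polynomials is visible already for moderate n. A small numerical sketch of ours (the test function and grid are arbitrary):

```python
import math

def bernstein(f, n, p):
    """n-th Bernstein polynomial B_n(p) = E_p[f(S_n/n)] of f on [0, 1]."""
    return sum(f(k / n) * math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n + 1))

f = lambda x: abs(x - 0.5)                   # continuous but not smooth
for n in (10, 50, 200):
    err = max(abs(bernstein(f, n, i / 200) - f(i / 200)) for i in range(201))
    print(n, round(err, 4))                  # sup-norm error decreases with n
```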

From convergence in probability to P -a.s.-convergence:

Lemma 7.7. Let Z1, Z2, . . . be r.v. on (Ω, A, P). Then

P( lim sup_{n→∞} |Zn| ≥ ε ) = 0 ∀ ε > 0  ⇔  P( lim_{n→∞} Zn = 0 ) = 1.

In particular,

∑_{n=1}^∞ P[ |Zn| > ε ] < ∞ ∀ ε > 0   (“fast convergence in probability towards 0”)

implies

P( { ω | lim_{n→∞} Zn(ω) = 0 } ) = 1   (“almost sure convergence towards 0”).

Proof. “⇒”: Suppose there exists a measurable N with P(N) = 0 and Zn(ω) → 0 as n → ∞ for all ω ∈ Ω \ N. Let ε > 0 be arbitrary and An := { |Zn| > ε }. If ω ∉ N, then ω ∈ An^c for all but finitely many n, thus P( { ω | ω ∈ An for infinitely many n } ) = P( lim sup_{n→∞} An ) = 0.

“⇐”: If ∑_{n=1}^∞ P[ |Zn| > ε ] < ∞ for all ε > 0, then P( lim sup_{n→∞} { |Zn| > ε } ) = 0 by the Borel–Cantelli lemma (Lemma 1.11). Define Nk := lim sup_{n→∞} { |Zn| ≥ 1/k } and N := ⋃_{k≥1} Nk. Then N ∈ A and P(N) = 0. Moreover, for all ω ∉ N: lim_{n→∞} |Zn(ω)| = 0.

Proposition 7.8 (“Strong law of large numbers”). Let X1, X2, . . . ∈ L²(Ω, A, P) be uncorrelated with sup_{i∈N} σ²(Xi) = c < ∞. Then:

lim_{n→∞} ( Sn/n − E[Sn]/n ) = 0  P-a.s.

(resp., if E[Xi] = m: lim_{n→∞} Sn/n = m P-a.s.).

Proof. Again, w.l.o.g. we may assume that E[Xi] = 0.

Step 1. Fast convergence in probability towards 0 along the subsequence nk = k². For all ε > 0,

P[ | S_{k²}/k² | > ε ] ≤ (1/(ε²k⁴)) · var(S_{k²}) ≤ c/(ε²k²)   (Chebychev).

Consequently, Lemma 7.7 implies that

lim_{k→∞} S_{k²}(ω)/k² = 0  ∀ ω ∉ N1, with P[N1] = 0.

Step 2. Let Dk := max_{k² ≤ l < (k+1)²} |Sl − S_{k²}|. We show fast convergence in probability of Dk/k² towards 0: for all ε > 0,

P[ Dk/k² > ε ] = P( ⋃_{l=k²+1}^{k²+2k+1} { |Sl − S_{k²}| > εk² } )
≤ ∑_{l=k²+1}^{k²+2k+1} P( |Sl − S_{k²}| > εk² )
≤ ∑_{l=k²+1}^{k²+2k+1} (l − k²) · c/(ε²k⁴)   (Chebychev; l − k² ≤ 2k + 1)
≤ (2k + 1)(2k + 1) · c/(ε²k⁴) ≤ 9c/(ε²k²).

Lemma 7.7 now implies that

lim_{k→∞} Dk(ω)/k² = 0  ∀ ω ∉ N2, with P[N2] = 0.

Step 3. For n ∈ N and k = k(n) ∈ N with k² ≤ n < (k+1)² we obtain that

|Sn(ω)|/n ≤ ( |S_{k²}(ω)| + Dk(ω) )/k² → 0   as n → ∞, for all ω ∉ N1 ∪ N2.

Example 7.9. Bernoulli experiments with p ∈ [0, 1].

(1/n) ∑_{i=1}^n Xi → p  Pp-a.s.   (E. Borel 1909; improves Bernoulli's result).

Consider the experiment of tossing a fair coin (p = 1/2) and set Yi := 2Xi − 1. Then

Sn = Y1 + · · · + Yn = position of a particle undergoing a “random walk” on Z.

Increasing refinement (plus rescaling and linear interpolation) of the random walk yields Brownian motion.

The strong law of large numbers implies that Sn(ω)/n → 0 P-a.s. In particular, the fluctuations grow slower than linearly.

A precise description of the order of the fluctuations is provided by the law of the iterated logarithm:

lim sup_{n→∞} Sn(ω)/√(2n log log n) = +1  P-a.s.,
lim inf_{n→∞} Sn(ω)/√(2n log log n) = −1  P-a.s.
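A single long simulated path already shows the sublinear growth of the random walk. A minimal sketch of ours:

```python
import random

n, s = 1_000_000, 0
for i in range(1, n + 1):
    s += 1 if random.random() < 0.5 else -1      # Y_i = 2 X_i - 1
    if i in (10**2, 10**4, 10**6):
        print(i, s / i)                          # S_n / n -> 0 (strong law of large numbers)
```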

8 Convergence and uniform integrability

Definition 8.1. Let X, X1, X2, . . . be r.v. on (Ω, A, P).

(i) L^p-convergence (p ≥ 1):

lim_{n→∞} E[ |Xn − X|^p ] = 0

(alternative notation: lim_{n→∞} ‖Xn − X‖_p = 0).

(ii) Convergence in probability:

∀ ε > 0: lim_{n→∞} P[ |Xn − X| > ε ] = 0.

(iii) P-a.s. convergence:

P[ lim_{n→∞} Xn = X ] = 1.

Proposition 8.2 (Comparison of the three types of convergence).

(i) ⇒ (ii);  (iii) ⇒ (ii);  (ii) ⇒ (iii) along some subsequence;  (iii) ⇒ (i) if sup_{n∈N} |Xn| ∈ ℒ^p (resp. if (|Xn|^p)_{n∈N} is uniformly integrable).

Proof. (i)⇒(ii): Chebychev's inequality implies:

P[ |Xn − X| > ε ] ≤ E[ |Xn − X|^p ] / ε^p.

(iii)⇒(ii):

{ lim_{n→∞} Xn = X } = ⋂_{k=1}^∞ Ak,  where Ak := ⋃_{m=1}^∞ ⋂_{n≥m} { |Xn − X| ≤ 1/k }.

Then P[ lim_{n→∞} Xn = X ] = 1 implies P(Ak) = 1 for all k ∈ N. Continuity of P from below (cf. Proposition 1.9) implies that

1 = P[Ak] = lim_{m→∞} P( ⋂_{n≥m} { |Xn − X| ≤ 1/k } ) ≤ lim inf_{m→∞} P[ |Xm − X| ≤ 1/k ] ≤ lim sup_{m→∞} P[ |Xm − X| ≤ 1/k ] ≤ 1.

Consequently,

lim_{m→∞} P[ |Xm − X| > 1/k ] = 0.

(iii)⇒(i): Y := sup_{n∈N} |Xn| ∈ ℒ^p and lim_{n→∞} Xn = X P-a.s. imply |X| ≤ Y. In particular, |Xn − X|^p ≤ 2^p Y^p ∈ ℒ¹. Now lim_{n→∞} |Xn − X|^p = 0 P-a.s. implies, with Lebesgue's dominated convergence,

lim_{n→∞} E[ |Xn − X|^p ] = 0.

(ii)⇒(iii): For each k ∈ N there exists nk ∈ N with

P[ |X_{nk} − X| > 1/k ] < (1/2)^k.

(We may and do choose nk increasing in k.) Let ε > 0. Then

∑_{k≥1} P[ |X_{nk} − X| > ε ] = ∑_{k: 1/k ≥ ε} P[ |X_{nk} − X| > ε ] + ∑_{k: 1/k < ε} P[ |X_{nk} − X| > ε ]
≤ const. + ∑_{k: 1/k < ε} P[ |X_{nk} − X| > 1/k ] ≤ const. + 1 < ∞.

Lemma 7.7 now implies

lim_{k→∞} X_{nk} = X  P-a.s.

Remark 8.3. The diagram can be complemented as follows:

• (ii)⇒(i) holds if sup_{n∈N} |Xn| ∈ ℒ^p (resp. if (|Xn|^p)_{n∈N} is uniformly integrable); see Proposition 8.4 and Remark 8.8 below.

• In general (i) does not imply (iii), and (iii) does not imply (i) (hence (ii) does not imply (i) either). For examples see the Exercises.

The next proposition is the definitive version of Lebesgue's theorem on dominated convergence.

Proposition 8.4. Let Xn ∈ L¹ and X be a r.v. Then the following statements are equivalent:

(i) lim_{n→∞} Xn = X in L¹.

(ii) lim_{n→∞} Xn = X in probability and (Xn)_{n∈N} is uniformly integrable.

Corollary 8.5. lim_{n→∞} Xn = X P-a.s. and (Xn)_{n∈N} uniformly integrable imply

lim_{n→∞} E[ |Xn − X| ] = 0, hence lim_{n→∞} E[Xn] = E[X].

Definition 8.6. Let I be an index set. A family (Xi)_{i∈I} ⊂ L¹ of r.v. is called uniformly integrable if

lim_{c→∞} sup_{i∈I} ∫_{{|Xi|>c}} |Xi| dP = 0.

(Note that ∫_{{|Xi|>c}} |Xi| dP = E[ 1_{{|Xi|>c}} · |Xi| ] → 0 for any fixed i as c → ∞.)

Lemma 8.7 (ε–δ criterion). Let (Xi)_{i∈I} ⊂ L¹. Then the following statements are equivalent:

(i) (Xi)_{i∈I} is uniformly integrable.

(ii) sup_{i∈I} E[ |Xi| ] < ∞ and for every ε > 0 there exists δ > 0 such that

P(A) < δ ⇒ ∫_A |Xi| dP < ε  ∀ i ∈ I.

Proof. (i)⇒(ii): There exists c such that sup_{i∈I} ∫_{{|Xi|>c}} |Xi| dP ≤ 1. Consequently,

sup_{i∈I} ∫ |Xi| dP = sup_{i∈I} ( ∫_{{|Xi|≤c}} |Xi| dP + ∫_{{|Xi|>c}} |Xi| dP ) ≤ c + 1 < ∞.

Let ε > 0. Then there exists c > 0 such that

sup_{i∈I} ∫_{{|Xi|>c}} |Xi| dP < ε/2.

For δ := ε/(2c) and A ∈ A with P[A] < δ we now conclude

∫_A |Xi| dP = ∫_{A∩{|Xi|≤c}} |Xi| dP + ∫_{A∩{|Xi|>c}} |Xi| dP ≤ c ∫_A dP + ∫_{{|Xi|>c}} |Xi| dP < c · P[A] + ε/2 < ε.

(ii)⇒(i): Let ε > 0 and δ be as in (ii). Using Markov's inequality, we get for any i ∈ I

P[ |Xi| > c ] ≤ (1/c) · E[ |Xi| ] < δ   if c > ( sup_{i∈I} E[ |Xi| ] + 1 ) / δ,

hence

∫_{{|Xi|>c}} |Xi| dP < ε  ∀ i ∈ I.

Remark 8.8. (i) The existence of a dominating integrable r.v. implies uniform integrability: if |Xi| ≤ Y ∈ L¹ for all i ∈ I, then

∫_{{|Xi|>c}} |Xi| dP ≤ ∫_{{Y>c}} Y dP = E[ 1_{{Y>c}} · Y ] → 0   as c ↗ ∞  (DCT),

since 1_{{Y>c}} · Y → 0 P-a.s. as c → ∞ (Markov's inequality). In particular, if I is finite, then (Xi)_{i∈I} ⊂ L¹ is uniformly integrable.

(ii) Let (Xi)_{i∈I}, (Yi)_{i∈I} be uniformly integrable and α, β ∈ R. Then (αXi + βYi)_{i∈I} is uniformly integrable (see the Exercises).

Proof of Proposition 8.4. (i)⇒(ii): see the Exercises (hint: use Lemma 8.7).

(ii)⇒(i): a) X ∈ L¹, because there exists a subsequence (nk) such that lim_{k→∞} X_{nk} = X P-a.s., so that

E[ |X| ] = E[ lim inf_{k→∞} |X_{nk}| ] ≤ lim inf_{k→∞} E[ |X_{nk}| ] ≤ sup_{n∈N} E[ |Xn| ] < ∞   (Fatou).

b) W.l.o.g. X = 0 (because (Xn)_{n∈N} uniformly integrable implies (Xn − X)_{n∈N} uniformly integrable too, by Remark 8.8).

Let ε > 0. Then there exists δ > 0 such that for all A ∈ A with P(A) < δ it follows that ∫_A |Xn| dP < ε/2.

Since Xn → 0 in probability, there exists n0 ∈ N such that P[ |Xn| > ε/2 ] < δ for all n ≥ n0. Hence, for n ≥ n0,

E[ |Xn| ] = ∫_{{|Xn|≤ε/2}} |Xn| dP + ∫_{{|Xn|>ε/2}} |Xn| dP < ε/2 + ε/2 = ε,

and thus lim_{n→∞} E[|Xn|] = 0.

Corollary 8.9. If lim_{n→∞} Xn = X in probability and ( |Xn|^p )_{n∈N} is uniformly integrable, p > 0, then

lim_{n→∞} Xn = X in L^p.

Proof. |Xn − X|^p → 0 in probability, and since

|Xn − X|^p ≤ 2^p · ( |Xn|^p + |X|^p ),

where the right-hand side is uniformly integrable, ( |Xn − X|^p )_{n∈N} is uniformly integrable too. Proposition 8.4 implies

lim_{n→∞} E[ |Xn − X|^p ] = 0.

Proposition 8.10. Let g : [0, ∞) → [0, ∞) be measurable with lim_{x→∞} g(x)/x = ∞. Then

sup_{i∈I} E[ g(|Xi|) ] < ∞ ⇒ (Xi)_{i∈I} uniformly integrable.

Proof. Let ε > 0. Choose c > 0 such that g(x)/x ≥ (1/ε)( sup_{i∈I} E[ g(|Xi|) ] + 1 ) for x ≥ c. Then for all i ∈ I

∫_{{|Xi|≥c}} |Xi| dP = ∫_{{|Xi|≥c}} g(|Xi|) · ( |Xi| / g(|Xi|) ) dP ≤ ( ε / ( sup_{j∈I} E[ g(|Xj|) ] + 1 ) ) · ∫_{{|Xi|≥c}} g(|Xi|) dP ≤ ε.

Example 8.11. (i) If p > 1 and sup_{i∈I} E[ |Xi|^p ] < ∞, then (Xi)_{i∈I} is uniformly integrable.

(ii) (“Finite entropy condition”)

sup_{i∈I} E[ |Xi| · log⁺(|Xi|) ] < ∞ ⇒ (Xi)_{i∈I} uniformly integrable.   (1.8)

Example 8.12. Application to the strong law of large numbers. Let X1, X2, . . . be r.v. in L¹(Ω, A, P) with E[Xi] = m for all i ∈ N. Suppose that

Sn/n = (1/n) ∑_{i=1}^n Xi → m  P-a.s. as n → ∞.   (1.9)

Question: Which additional condition implies L¹-convergence in (1.9)?

(We have seen: if sup_{i∈N} E[Xi²] < ∞ and the Xi are uncorrelated, then (1.9) holds (cf. Proposition 7.8). In this case even L²-convergence holds in (1.9), by Proposition 7.2 (“L²-case”). Later we will see that (1.9) always holds if the Xi ∈ L¹ are pairwise independent and identically distributed.)

Answer to the question:

sup_{i∈N} E[ |Xi| · log⁺(|Xi|) ] < ∞  implies  lim_{n→∞} Sn/n = m in L¹.

Proof: g(x) := x · log⁺(x) is monotone increasing and convex. Consequently,

E[ g( |Sn|/n ) ] ≤ E[ g( (1/n) ∑_{i=1}^n |Xi| ) ]   (monotonicity)
≤ E[ (1/n) ∑_{i=1}^n g(|Xi|) ]   (convexity)
≤ sup_{1≤i≤n} E[ g(|Xi|) ]   for all n,

and so

sup_{n∈N} E[ g( |Sn|/n ) ] ≤ sup_{i∈N} E[ g(|Xi|) ] < ∞.

Consequently, ( Sn/n )_{n∈N} is uniformly integrable and thus

lim_{n→∞} Sn/n = m in L¹.

One complementary remark concerning Lebesgue’s dominated convergence theorem.

Proposition 8.13. Let Xn ≥ 0 and lim_{n→∞} Xn = X P-a.s. (or in probability). Then

lim_{n→∞} Xn = X in L¹  ⇔  lim_{n→∞} E[Xn] = E[X] and E[X] < ∞.

Proof. “⇒”: Obvious.

“⇐”:

X + Xn = X ∨ Xn + X ∧ Xn,  where X ∨ Xn := sup{X, Xn} and X ∧ Xn := inf{X, Xn}.

Then

lim_{n→∞} E[X ∧ Xn] = E[X]   (Lebesgue),

and thus

lim_{n→∞} E[X ∨ Xn] = E[X].

Now |Xn − X| = (X ∨ Xn) − (X ∧ Xn) implies

lim_{n→∞} E[ |Xn − X| ] = E[X] − E[X] = 0.

L^p-completeness

Proposition 8.14 (Riesz–Fischer). Let 1 ≤ p < ∞ and Xn ∈ L^p with

lim_{n,m→∞} ∫ |Xn − Xm|^p dP = 0.

Then there exists a r.v. X ∈ L^p such that

(i) lim_{k→∞} X_{nk} = X P-a.s. along some subsequence,

(ii) lim_{n→∞} Xn = X in L^p.

Proof. See textbooks on measure theory.

9 Distribution of random variables

Let (Ω, A, P) be a probability space and X : Ω → R̄ a r.v. Let µ be the distribution of X (under P), i.e. µ(A) = P[X ∈ A] for all A ∈ B(R̄). Assume that P[X ∈ R] = 1 (in particular, X is P-a.s. finite, and µ is a probability measure on (R, B(R))).

Definition 9.1. The function F : R → [0, 1] defined by

F(b) := P[X ≤ b] = µ( (−∞, b] ),  b ∈ R,   (1.10)

is called the distribution function of X, resp. of µ.

Proposition 9.2. (i) F is

monotone increasing: a ≤ b ⇒ F(a) ≤ F(b);
right continuous: F(a) = lim_{b↓a} F(b);
normalized: lim_{a↓−∞} F(a) = 0, lim_{b↑+∞} F(b) = 1.

(ii) For any such function F there exists a unique probability measure µ on (R, B(R)) with (1.10).

Proof. (i) Monotonicity is obvious.

Right continuity: if b ↓ a then (−∞, b] ↓ (−∞, a], hence by continuity of µ from above (cf. Proposition 1.9):

F(a) = µ( (−∞, a] ) = lim_{b↓a} µ( (−∞, b] ) = lim_{b↓a} F(b).

Similarly, (−∞, a] ↓ ∅ if a ↓ −∞ (resp. (−∞, b] ↑ R if b ↑ ∞), and thus

lim_{a↓−∞} F(a) = lim_{a↓−∞} µ( (−∞, a] ) = 0   (resp. lim_{b↑∞} F(b) = lim_{b↑∞} µ( (−∞, b] ) = 1).

(ii) Existence: Let λ be the Lebesgue measure on (0, 1). Define the “inverse function” G of F : R → [0, 1] by

G : (0, 1) → R,  G(y) := inf{ x ∈ R | F(x) ≥ y }.

Since y < F(x) ⇒ G(y) ≤ x, we have (0, F(x)) ⊂ {G ≤ x}. By definition, G(y) ≤ x ⇒ ∃ xn ↓ x with F(xn) ≥ y, hence by right continuity F(x) ≥ y, so that {G ≤ x} ⊂ (0, F(x)]. Combining both inclusions we obtain

(0, F(x)) ⊂ {G ≤ x} ⊂ (0, F(x)],

so that G is measurable.

Let µ := G(λ) = λ ∘ G^{−1} (a probability measure on (R, B(R))). Then

µ( (−∞, x] ) = λ( {G ≤ x} ) = λ( (0, F(x)] ) = F(x)  ∀ x ∈ R.

Uniqueness: later.

Remark 9.3. (i) Let µ be an arbitrary probability measure on (R, B(R)). Then there exists a probability space (Ω, A, P) and a random variable X : Ω → R such that µ = P ∘ X^{−1}. Indeed, we just have to choose (Ω, A, P) = ((0, 1), B((0, 1)), λ) and X = G (with λ, G as in the proof of Proposition 9.2 (ii)).

(ii) (In the situation of the proof of Proposition 9.2 (ii).) Let Y be a r.v. with uniform distribution on (0, 1); then X = G(Y) has distribution µ. In particular: simulating the uniform distribution on (0, 1) gives, by transformation with G, a simulation of µ.

(iii) Some authors define the distribution function F by F(x) := µ( (−∞, x) ). In this case F is left continuous, not right continuous.
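Remark 9.3 (ii) is the basis of inverse-transform sampling. As an illustration (ours), for the exponential distribution of Example 9.7 (ii) below the inverse of F(x) = 1 − e^{−αx} can be written down explicitly:

```python
import math, random

def sample_exponential(alpha):
    """Inverse transform: G(y) = -log(1 - y)/alpha inverts F(x) = 1 - exp(-alpha x)."""
    return -math.log(1.0 - random.random()) / alpha

alpha, trials = 2.0, 100_000
xs = [sample_exponential(alpha) for _ in range(trials)]
print(sum(xs) / trials, "≈", 1 / alpha)                              # mean of Exp(alpha)
print(sum(x <= 1 for x in xs) / trials, "≈", 1 - math.exp(-alpha))   # empirical F(1)
```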

Remark 9.4. (i) Let F be a distribution function and x ∈ R. Then

F(x) − F(x−) = lim_{n↑∞} µ( (x − 1/n, x] ) = µ({x})

is called the jump (step) height of F at x. In particular:

F continuous ⇔ µ({x}) = 0 for all x ∈ R   (“µ is continuous”).

(ii) If F is monotone increasing and bounded, then F has at most countably many points of discontinuity.

Definition 9.5. (i) F (resp. µ) is called discrete if there exists a countable set S ⊂ R with µ(S) = 1. In this case µ is uniquely determined by the weights µ({x}), x ∈ S, and F is a step function of the following type:

F(x) = ∑_{y∈S, y≤x} µ({y}).

(ii) F (resp. µ) is called absolutely continuous if there exists a measurable function f ≥ 0, called the “density”, such that

F(x) = ∫_{−∞}^x f(t) dt,   (1.11)

(resp., for all A ∈ B(R):

µ(A) = ∫_A f(t) dt = ∫_{−∞}^{+∞} 1_A · f dt).   (1.12)

In particular, ∫_{−∞}^{+∞} f(t) dt = 1.

Remark 9.6. (i) Every measurable function f ≥ 0 with ∫_{−∞}^{+∞} f(t) dt = 1 defines a probability measure on (R, B(R)) by A ↦ ∫_A f(t) dt.

(ii) In the previous definition “(1.11)⇒(1.12)”, because A ↦ ∫_A f(t) dt defines a probability measure on (R, B(R)) with distribution function F; uniqueness in 9.2 implies the assertion.

Example 9.7. (i) Uniform distribution on [a, b]. Let f := (1/(b − a)) · 1_{[a,b]}. The associated distribution function is given by

F(x) := 0 if x ≤ a,  (x − a)/(b − a) if x ∈ [a, b],  1 if x ≥ b

(the continuous analogue of the discrete uniform distribution on a finite set).

(ii) Exponential distribution with parameter α > 0:

f(x) := α e^{−αx} if x ≥ 0, 0 if x < 0;   F(x) := 1 − e^{−αx} if x ≥ 0, 0 if x < 0.

[Figure: graph of the exponential density f; f(0) = α.]

(Continuous analogue of the geometric distribution:

∫_k^{k+1} f(x) dx = F(k + 1) − F(k) = e^{−αk} (1 − e^{−α}) = (1 − p)^k · p = P(X = k),  with p := 1 − e^{−α}.)

(iii) Normal distribution N(m, σ²), m ∈ R, σ² > 0:

f_{m,σ²}(x) = (1/√(2πσ²)) · e^{−(x−m)²/(2σ²)}.

The associated distribution function is given by

F_{m,σ²}(x) = (1/√(2πσ²)) ∫_{−∞}^x e^{−(y−m)²/(2σ²)} dy = (1/√(2π)) ∫_{−∞}^{(x−m)/σ} e^{−z²/2} dz = F_{0,1}( (x − m)/σ ),

using the substitution z = (y − m)/σ. Φ := F_{0,1} is called the distribution function of the standard normal distribution N(0, 1).

[Figure: the standard normal density ϕ, with maximum 1/√(2π) at 0.]

The expectation E[X] (or, more generally, E[h(X)]) can be calculated with the help of the distribution µ of X:

Proposition 9.8. Let h ≥ 0 be measurable. Then

E[ h(X) ] = ∫_{−∞}^{+∞} h(x) µ(dx)
= ∫_{−∞}^{+∞} h(x) · f(x) dx   if µ is absolutely continuous with density f,
= ∑_{x∈S} h(x) · µ({x})   if µ is discrete with µ(S) = 1 and S countable.

Proof. See the Exercises (Assignment no. 2, Exercise 1).


Example 9.9. Let X be N(m, σ²)-distributed. Then

E[X] = ∫ x · f_{m,σ²}(x) dx = m + ∫ (x − m) · f_{m,σ²}(x) dx = m,

since the last integral vanishes by symmetry. The pth central moment of X is given by

E[|X − m|^p] = ∫ |x − m|^p · f_{m,σ²}(x) dx
             = ∫ |x|^p · f_{0,σ²}(x) dx
             = 2 ∫_0^∞ x^p · 1/√(2πσ²) · e^{−x²/(2σ²)} dx
             = 1/√π · 2^{p/2} · σ^p · ∫_0^∞ y^{(p+1)/2 − 1} e^{−y} dy   (substitution y = x²/(2σ²))
             = 1/√π · 2^{p/2} · σ^p · Γ((p+1)/2).

In particular:

p = 1:  E[|X − m|]   = σ · √(2/π),
p = 2:  E[|X − m|²]  = σ²,
p = 3:  E[|X − m|³]  = 2^{3/2} · σ³ / √π,
p = 4:  E[|X − m|⁴]  = 3σ⁴.

10 Weak convergence of probability measures

Let S be a topological space and S be the Borel σ-algebra on S. Let µ, µ_n, n ∈ N, be probability measures on (S, S). What is a reasonable notion of convergence of the sequence µ_n towards µ? The notion of "pointwise convergence" in the sense that µ_n(A) → µ(A) as n → ∞ for all A ∈ S is too strong for many applications.

Definition 10.1. Let µ and µ_n, n ∈ N, be probability measures on (S, S). The sequence (µ_n) converges to µ weakly if for all f ∈ C_b(S) (= the space of bounded continuous functions on S) it follows that

∫ f dµ_n → ∫ f dµ   as n → ∞.

Example 10.2. (i) x_n → x in S implies δ_{x_n} → δ_x weakly.


(ii) Let S := R¹ and µ_n := N(0, 1/n). Then µ_n → δ_0 weakly, since for all f ∈ C_b(R)

∫ f dµ_n = ∫ f(x) · 1/√(2π·(1/n)) · e^{−x²/(2·(1/n))} dx
         = ∫ f(y/√n) · 1/√(2π) · e^{−y²/2} dy   (substitution x = y/√n)
         → f(0) = ∫ f dδ_0   (n → ∞, by Lebesgue's dominated convergence theorem).
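A quick numerical illustration of this example (my own sketch, not part of the notes): for the bounded continuous test function f(x) = cos(x) one has ∫ f dN(0, 1/n) = e^{−1/(2n)} → cos(0) = 1; the names below are illustrative only.

```python
import math
import random

def integral_against_normal(f, var, n_samples=200_000):
    # Monte Carlo approximation of the integral of f with respect to N(0, var).
    sd = math.sqrt(var)
    return sum(f(random.gauss(0.0, sd)) for _ in range(n_samples)) / n_samples

for n in (1, 10, 100, 1000):
    approx = integral_against_normal(math.cos, 1.0 / n)
    exact = math.exp(-0.5 / n)                     # closed form of ∫ cos dN(0, 1/n)
    print(n, round(approx, 3), round(exact, 3))    # both tend to cos(0) = 1
```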

Proposition 10.3 (Portmanteau theorem). Let S be a metric space with metric d. Then the following statements are equivalent:

(i) µ_n → µ weakly.

(ii) ∫ f dµ_n → ∫ f dµ for all f bounded and uniformly continuous (w.r.t. d).

(iii) lim sup_{n→∞} µ_n(F) ≤ µ(F) for all F ⊂ S closed.

(iv) lim inf_{n→∞} µ_n(G) ≥ µ(G) for all G ⊂ S open.

(v) lim_{n→∞} µ_n(A) = µ(A) for all µ-continuity sets A, i.e. for all A ∈ S with µ(∂A) = 0.

(vi) ∫ f dµ_n → ∫ f dµ for all f bounded, measurable and µ-a.s. continuous.

Proof. (i)⇒(ii), (iii)⇔(iv), (vi)⇒(i) are obvious.

(ii)⇒(iii): Let F ⊂ S be closed. The sets

G_m := {x ∈ S | d(x, F) < 1/m},  m ∈ N,  are open,

and G_m ↘ ∩_m G_m = F, hence µ(G_m) ↘ µ(F). In particular: for all ε > 0 there exists some m ∈ N with

µ(G_m) < µ(F) + ε.

Define

ϕ(x) := 1 if x ≤ 0,   1 − x if x ∈ [0, 1],   0 if x ≥ 1,

[Figure: graph of ϕ]

and let f := ϕ(m · d(·, F)).

f is Lipschitz, hence uniformly continuous, 0 ≤ f ≤ 1, f = 0 on G_m^c and f = 1 on F, and thus

lim sup_{n→∞} µ_n(F) ≤ lim sup_{n→∞} ∫ f dµ_n = ∫ f dµ   (by (ii))
                     ≤ µ(G_m) < µ(F) + ε.

(Idea: ϕ(m · d(·, F)) ↘ 1_F as m ↗ ∞, i.e. the indicator of F is approximated pointwise from above through uniformly continuous functions.)

(iii)⇒(v): For a subset A ⊂ S we denote the closure by Ā, the interior by Å, and the boundary by ∂A. Let A be such that µ(Ā \ Å) = µ(∂A) = 0. Then

µ(A) = µ(Å) ≤ lim inf_{n→∞} µ_n(Å)   (by (iv))
     ≤ lim inf_{n→∞} µ_n(A) ≤ lim sup_{n→∞} µ_n(A)
     ≤ lim sup_{n→∞} µ_n(Ā) ≤ µ(Ā)   (by (iii))
     = µ(A).

(v)⇒(vi): Let f be as in (vi). The distribution function F(x) = µ(f ≤ x) has at most countably many jumps. Thus D := {x ∈ R | µ(f = x) ≠ 0} is at most countable, and so R \ D ⊂ R is dense. By denseness and since f is bounded: for any ε > 0 we find c_0 < ... < c_m ∈ R \ D with

|c_{k+1} − c_k| ≤ ε for all k = 0, ..., m − 1,   and   c_0 ≤ f ≤ c_m.

Let A_k := {f ∈ [c_k, c_{k+1})}, k = 0, ..., m − 2, and A_{m−1} := {f ∈ [c_{m−1}, c_m]}. Then A_k is a µ-continuity set, because

µ(Ā_k) = µ({x | x = lim x_n, (x_n)_{n∈N} ⊂ A_k})
       = µ({x | x = lim x_n, (x_n)_{n∈N} ⊂ A_k, f is continuous in x})
       ≤ µ(f ∈ [c_k, c_{k+1}]) = µ(A_k)   since c_k, c_{k+1} ∈ R \ D,

hence µ(Ā_k) = µ(A_k), and similarly µ(Å_k) = µ(A_k). Let g := Σ_{k=0}^{m−1} c_k · 1_{A_k}. Then ‖f − g‖_∞ ≤ ε and

|∫ f dµ − ∫ f dµ_n| ≤ ∫ |f − g| dµ + |∫ g dµ − ∫ g dµ_n| + ∫ |g − f| dµ_n
                    ≤ 2ε + Σ_{k=0}^{m−1} |c_k| · |µ(A_k) − µ_n(A_k)|  →  2ε   (n → ∞, by (v)).


Corollary 10.4. Let X, X_n, n ∈ N, be measurable mappings from (Ω, A, P) to (S, S) with distributions µ, µ_n, n ∈ N. Then:

X_n → X in probability   ⇒   µ_n → µ weakly.

Here, lim_{n→∞} X_n = X in probability, if lim_{n→∞} P(d(X, X_n) > δ) = 0 for all δ > 0.

Proof. Let f ∈ C_b(S) be uniformly continuous and ε > 0. Then there exists a δ = δ(ε) > 0 such that:

x, y ∈ S with d(x, y) ≤ δ implies |f(x) − f(y)| < ε.

Hence

|∫ f dµ − ∫ f dµ_n| = |E[f(X)] − E[f(X_n)]|
    ≤ ∫_{{d(X,X_n) ≤ δ}} |f(X) − f(X_n)| dP + ∫_{{d(X,X_n) > δ}} |f(X) − f(X_n)| dP
    ≤ ε + 2‖f‖_∞ · P[d(X_n, X) > δ],

and the last term tends to 0 as n → ∞. Since f was an arbitrary bounded uniformly continuous function, the assertion follows from the Portmanteau theorem (ii)⇒(i).

Corollary 10.5. Let S = R¹ and let µ, µ_n, n ∈ N, be probability measures on (R, B(R)) with distribution functions F, F_n. Then the following statements are equivalent:

(i) µ_n → µ vaguely, i.e. lim_{n→∞} ∫ f dµ_n = ∫ f dµ for all f ∈ C_0(R¹) (= the space of continuous functions with compact support).

(ii) µ_n → µ weakly.

(iii) F_n(x) → F(x) for all x where F is continuous.

(iv) µ_n((a, b]) → µ((a, b]) for all (a, b] with µ({a}) = µ({b}) = 0.

Proof. (i)⇒(ii): Exercise.

(ii)⇒(iii): Let x be such that F is continuous in x. Then µ(x) = 0, which impliesby the Portmanteau theorem:

Fn(x) = µn

((−∞, x]

) n→∞−−−−→ µ((−∞, x]

)= F (x).

(iii)⇒(iv): Let (a, b] be such that µ(a) = µ(b) = 0 then F is continuous in a andb and thus

µ((a, b]

)= F (b)− F (a)

(iii)= lim

n→∞Fn(b)− lim

n→∞Fn(a)

= limn→∞

µn

((a, b]

).


(iv)⇒(i): Let D :=x ∈ R

∣∣ µ(x) = 0

. Then R \D is countable, hence D ⊂ R

dense. Let f ∈ C0(R), then f is uniformly continuous, hence for ε > 0 we findc0 < · · · < cm ∈ D such that

∥∥∥f −

m∑

k=1

f(ck−1) · I(ck−1,ck]

︸ ︷︷ ︸

=:g

∥∥∥∞

6 sup0≤k≤m−1

supx∈[ck−1,ck]

∣∣f(x)− f(ck−1)

∣∣ ≤ ε.

Then∣∣∣∣

f dµ−∫

f dµn

∣∣∣∣

6

|f − g| dµ︸ ︷︷ ︸

≤ε

+

∣∣∣∣

g dµ−∫

g dµn

∣∣∣∣+

|f − g| dµn

︸ ︷︷ ︸

≤ε

6 2ε+

m∑

k=1

|f(ck−1)| ·∣∣∣µ((ck−1, ck]

)− µn

((ck−1, ck]

)∣∣∣

(iv)n→∞−−−−→ 2ε.

11 Dynkin-systems and Uniqueness of probability measures

Let Ω ≠ ∅.

Definition 11.1. A collection of subsets D ⊂ P(Ω) is called a Dynkin-system, if:

(i) Ω ∈ D.

(ii) A ∈ D ⇒ A^c ∈ D.

(iii) A_i ∈ D, i ∈ N, pairwise disjoint, then ⋃_{i∈N} A_i ∈ D.

Example 11.2. (i) Every σ-algebra A ⊂ P(Ω) is a Dynkin-system.

(ii) Let P_1, P_2 be probability measures on (Ω, A). Then

D := {A ∈ A | P_1(A) = P_2(A)}

is a Dynkin-system.

Remark 11.3. (i) Let D be a Dynkin-system. Then

A, B ∈ D, A ⊂ B   ⇒   B \ A = (B^c ∪ A)^c ∈ D.


(ii) Every Dynkin-system which is closed under finite intersections (short notation:∩-stable), is a σ-algebra, because:

(a) A,B ∈ D ⇒ A ∪B = A ∪(B \ (A ∩B)

︸ ︷︷ ︸

∈D︸ ︷︷ ︸

(i)∈D

)∈ D.

(b) Ai ∈ D, i ∈ N ⇒⋃

i∈N

Ai =⋃

i∈N

[

Ai ∩( i−1⋃

n=1

An

︸ ︷︷ ︸

(a)∈D

)c

︸ ︷︷ ︸

∈ D by ass.,pairwise disjoint

]

∈ D.

Proposition 11.4. Let B ⊂ P(Ω) be a ∩-stable collection of subsets. Then

σ(B) = D(B),

where

D(B) := ⋂ {D | D Dynkin-system, B ⊂ D}

is called the Dynkin-system generated by B.

Proof. See textbooks on measure theory.

Proposition 11.5 (Uniqueness of probability measures). Let P_1, P_2 be probability measures on (Ω, A), and B ⊂ A be a ∩-stable collection of subsets. Then:

P_1(A) = P_2(A) for all A ∈ B   ⇒   P_1 = P_2 on σ(B).

Proof. The collection of subsets

D := {A ∈ A | P_1(A) = P_2(A)}

is a Dynkin-system containing B. Consequently,

σ(B) = D(B) ⊂ D   (by 11.4).

Example 11.6. (i) For p ∈ (0, 1) the probability measure P_p on (Ω := {0, 1}^ℕ, A) is uniquely determined by

P_p[X_1 = x_1, ..., X_n = x_n] = p^k (1 − p)^{n−k},   with k := Σ_{i=1}^n x_i,

for all x_1, ..., x_n ∈ {0, 1}, n ∈ N, because the collection of cylindrical sets

{X_1 = x_1, ..., X_n = x_n},  n ∈ N_0, x_1, ..., x_n ∈ {0, 1},

is ∩-stable, generating A (cf. Example 1.7).

(Existence of P_p for p = 1/2 see Example 3.6. Existence for p ∈ (0, 1) \ {1/2} later.)


(ii) A probability measure µ on (R, B(R)) is uniquely determined through its distribution function F (:= µ((−∞, · ])), because

µ((a, b]) = F(b) − F(a),

and the collection of intervals (a, b], a, b ∈ R, is ∩-stable, generating B(R).


2 Independence

1 Independent events

Let (Ω,A, P ) be a probability space.

Definition 1.1. A collection of events A_i ∈ A, i ∈ I, is said to be independent (w.r.t. P), if for any finite subset J ⊂ I

P(⋂_{j∈J} A_j) = ∏_{j∈J} P(A_j).

A family of collections of subsets B_i ⊂ A, i ∈ I, is said to be independent, if for all finite subsets J ⊂ I and for all subsets A_j ∈ B_j, j ∈ J,

P(⋂_{j∈J} A_j) = ∏_{j∈J} P(A_j).

Proposition 1.2. Let Bi, i ∈ I, be independent collections of subsets that are closedunder intersections. Then:

(i) σ(Bi), i ∈ I, are independent.

(ii) Let Jk, k ∈ K, be a partition of the index set I. Then the σ-algebras

σ( ⋃

i∈Jk

Bi

)

, k ∈ K,

are independent.

Proof. (i) Let J ⊂ I, J finite, be of the form J = j1, . . . , jn. Let Aj1 ∈σ(Bj1), . . . , Ajn ∈ σ(Bjn).

We have to show that

P (Aj1 ∩ · · · ∩Ajn) = P (Aj1) · · ·P (Ajn). (2.1)

To this end suppose first that Aj2 ∈ Bj2 , . . . , Ajn ∈ Bjn , and define

Dj1 :=A ∈ σ(Bj1)

∣∣ P (A ∩Aj2 ∩ · · · ∩Ajn)

= P (A) · P (Aj2) · · ·P (Ajn).

Then Dj1 is a Dynkin system (!) containing Bj1 . Proposition 1.11.4 now implies

σ(Bj1) = D(Bj1) ⊂ Dj1 ,

hence σ(Bj1) = Dj1 . Iterating the above argument for Dj2 , Dj3 , implies (2.1).


(ii) For k ∈ K define

Ck :=⋂

j∈J

Aj

∣∣∣ J ⊂ Jk, J finite, Aj ∈ Bj

.

Then Ck is closed under intersections and the collection of subsets Ck, k ∈K, are independent, because: given k1, . . . , kn ∈ K and finite subsets J1 ⊂Jk1

, . . . , Jn ⊂ Jkn, then

P

(( ⋂

i∈J1

Ai

︸ ︷︷ ︸

∈Ck1

)

∩ · · · ∩( ⋂

i∈Jn

Ai

︸ ︷︷ ︸

∈Ckn

)) Bi,i∈Iind.=

n∏

j=1

P( ⋂

i∈Jj

Ai

)

.

(i) now implies that σ(Ck), k ∈ K is independent. But Ck ⊂ σ(⋃

i∈JkBi

)

since

Ck only contains finite intersections of events in σ(⋃

i∈JkBi

)

and Ck ⊃ ⋃i∈JkBi.

Hence

σ(Ck) = σ( ⋃

i∈Jk

Bi

)

, k ∈ K,

which implies the assertion.

Example 1.3. Let Ai ∈ A, i ∈ I, be independent. Then σ(Ai) = ∅, Ai, Aci ,Ω,

i ∈ I, is an independent family of σ-algebras, but the collection of events Ai, Aci | i ∈

I, is in general not independent.

Remark 1.4. Pairwise independence does not imply independence in general.
Example: Consider two tosses with a fair coin, i.e.

Ω := {(i, k) | i, k ∈ {0, 1}},   P := uniform distribution.

Consider the events

A := "1st toss is 1" = {(1, 0), (1, 1)},
B := "2nd toss is 1" = {(0, 1), (1, 1)},
C := "1st and 2nd toss equal" = {(0, 0), (1, 1)}.

Then P(A) = P(B) = P(C) = 1/2 and A, B, C are pairwise independent:

P(A ∩ B) = P(B ∩ C) = P(C ∩ A) = 1/4.

But on the other hand

P(A ∩ B ∩ C) = 1/4 ≠ P(A) · P(B) · P(C).


Example 1.5. Independent 0-1-experiments with success probability p ∈ [0, 1].Let Ω := 0, 1N, Xi(ω) := xi and ω := (xi)i∈N. Let Pp be a probability measure onA := σ

(Xi = 1, i = 1, 2, . . .

), with

(i) Pp[Xi = 1] = p (hence Pp[Xi = 0] = Pp

(Xi = 1c

)= 1− p).

(ii) Xi = 1, i ∈ N, are independent w.r.t. Pp.

Existence of such a probability measure later! Then for any x1, . . . , xn ∈ 0, 1:

Pp[Xi1 = x1, . . . , Xin = xn]

(ii) and1.3=

n∏

j=1

Pp[Xij = xj ](i)= pk(1− p)n−k,

where k := Σ_{i=1}^n x_i. Hence P_p is uniquely determined by (i) and (ii).

Proposition 1.6 (Kolmogorov’s Zero-One Law). Let Bn, n ∈ N, be independent σ-algebras, and

B∞ :=

∞⋂

n=1

σ( ∞⋃

m=n

Bm

)

be the tail σ-algebra (resp. σ-algebra of terminal events). Then

P (A) ∈ 0, 1 ∀A ∈ B∞

i.e., P is deterministic on B∞.

Illustration: Independent 0-1-experimentsLet Bi = σ

(Xi = 1

). Then

B∞ =⋂

n∈N

σ( ⋃

m>n

Bm

)

is the σ-algebra containing the events of the remote future, e.g.

lim supi→∞

Xi = 1 = “infinitely many ‘1” ’

ω ∈ 0, 1N∣∣∣∣limn→∞

1

n

n∑

i=1

Xi(ω)

︸ ︷︷ ︸

=:Sn(ω)

n

exists

Proof of the Zero-One Law. Proposition 1.2 implies that for all n

B1,B2, . . . ,Bn−1, σ( ∞⋃

m=n

Bm

)

are independent. Since B∞ ⊂ σ(⋃

m>n Bm

)

, this implies that for all n

B1,B2, . . . ,Bn−1,B∞


are independent. By definition this implies that

B∞ ,Bn , n ∈ N are independent

and now Proposition 1.2(ii) implies that

σ(⋃

n∈N

Bn

)

and B∞

are idependent. Since B∞ ⊂ σ(⋃

n>1 Bn

)

we finally obtain that B∞ and B∞ are

independent. The conclusion now follows from the next lemma.

Lemma 1.7. Let B ⊂ A be a σ-algebra such that B is independent from B. Then

P (A) ∈ 0, 1 ∀A ∈ B .

Proof. For all A ∈ B

P (A) = P (A ∩A) = P (A) · P (A) = P (A)2.

Hence P (A) = 0 or P (A) = 1.

For any sequence An, n ∈ N, of independent events in A, Kolmogorov’s Zero-OneLaw implies in particular for

A∞ :=⋂

n∈N

m>n

Am

(=: lim sup

n→∞An

)

that P (A∞) ∈ 0, 1.Proof: The σ-algebras Bn := σAn = ∅,Ω, A,Ac, n ∈ N, are independent byProposition 1.2 and A∞ ∈ B∞.

Lemma 1.8 (Borel-Cantelli). (i) Let A_i ∈ A, i ∈ N. Then

Σ_{i=1}^∞ P(A_i) < ∞   ⇒   P(lim sup_{i→∞} A_i) = 0.

(ii) Assume that A_i ∈ A, i ∈ N, are independent. Then

Σ_{i=1}^∞ P(A_i) = ∞   ⇒   P(lim sup_{i→∞} A_i) = 1.

Proof. (i) See Lemma 1.1.11.

(ii) It suffices to show that

P( ∞⋃

m=n

Am

)

= 1 resp. P( ∞⋂

m=n

Acm

)

= 0 ∀n .


The last equality follows from the fact that

P( ∞⋂

m=n

Acm

)

= limk→∞

P(n+k⋂

m=n

Acm

)

︸ ︷︷ ︸

=∏

n+km=n P (Ac

m) ind.

= limk→∞

n+k∏

m=n

(1− P (Am)) ≤ limk→∞

exp

(

−n+k∑

m=n

P (Am)

)

= 0

where we used the inequality 1− α 6 e−α for all α ∈ R.

Example 1.9. Independent 0-1-experiments with success probability p ∈ (0, 1).
Let (x_1, ..., x_N) ∈ {0, 1}^N ("binary text of length N").

P_p["text occurs"] = ?

To calculate this probability we partition the infinite sequence ω = (y_n) ∈ {0, 1}^ℕ into consecutive blocks of length N,

(y_1, ..., y_N | y_{N+1}, ..., y_{2N} | ...) ∈ Ω := {0, 1}^ℕ,

and consider the events A_i = "text occurs in the ith block". Clearly, A_i, i ∈ N, are independent events (!) by Proposition 1.2(ii), with equal probability

P_p(A_i) = p^K (1 − p)^{N−K} =: α > 0,

where K := Σ_{i=1}^N x_i is the total number of ones in the text. In particular, Σ_{i=1}^∞ P_p(A_i) = Σ_{i=1}^∞ α = ∞, and now Borel-Cantelli implies P_p(A_∞) = 1, where

A_∞ = lim sup_{i→∞} A_i = "text occurs infinitely many times".

Moreover: since the indicator functions 1_{A_1}, 1_{A_2}, ... are uncorrelated (since A_1, A_2, ... are independent) with uniformly bounded variances, the strong law of large numbers implies that

(1/n) Σ_{i=1}^n 1_{A_i} → E[1_{A_i}] = α   P_p-a.s.,

i.e. the relative frequency of the given text among the blocks of the infinite sequence is strictly positive.
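An illustrative simulation of this block argument (my own sketch; text and parameters are chosen for illustration): with p = 1/2 and the text 1011, so N = 4, K = 3 and α = (1/2)⁴ = 1/16, the relative frequency of blocks equal to the text stabilizes around α.

```python
import random

def block_frequency(text, p=0.5, n_blocks=200_000, seed=0):
    # Relative frequency of consecutive disjoint blocks of length len(text)
    # that equal the given text in an i.i.d. 0-1 sequence.
    rng = random.Random(seed)
    n = len(text)
    hits = 0
    for _ in range(n_blocks):
        block = tuple(1 if rng.random() < p else 0 for _ in range(n))
        if block == text:
            hits += 1
    return hits / n_blocks

text = (1, 0, 1, 1)                  # K = 3 ones, N = 4
alpha = 0.5 ** len(text)             # p^K (1-p)^(N-K) with p = 1/2
print(block_frequency(text), alpha)  # both close to 1/16 = 0.0625
```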

2 Independent random variables

Let (Ω,A, P ) be a probability space.


Definition 2.1. A family X_i, i ∈ I, of r.v. on (Ω, A, P) is said to be independent, if the σ-algebras

σ(X_i) := X_i⁻¹(B(R)) ( = {{X_i ∈ A} | A ∈ B(R)} ),  i ∈ I,

are independent, i.e. for all finite subsets J ⊂ I and any Borel subsets A_j ∈ B(R)

P(⋂_{j∈J} {X_j ∈ A_j}) = ∏_{j∈J} P[X_j ∈ A_j].

Remark 2.2. Let Xi, i ∈ I, be independent and hi : R → R, i ∈ I, B(R)/B(R)-measurable. Then Yi := hi(Xi), i ∈ I, are again independent, because σ (Yi) ⊂ σ (Xi)for all i ∈ I.

Proposition 2.3. Let X1, . . . , Xn be independent r.v., ≥ 0. Then

E[X1 · · ·Xn] = E[X1] · · ·E[Xn].

Proof. W.l.o.g. n = 2. (Proof of the general case by induction, using the fact thatX1 · . . . · Xn−1 and Xn are independent , since X1 · . . . · Xn−1 is measurable w.r.tσ(σ(X1)∪ · · · ∪σ(Xn−1)

)and σ

(σ(X1)∪ · · · ∪σ(Xn−1)

)and σ(Xn) are independent

by Proposition 1.2.)

It therefore suffices to consider two independent r.v. X,Y , ≥ 0, and we have to showthat

E[XY ] = E[X] · E[Y ]. (2.2)

W.l.o.g. X,Y simple

(for general X and Y there exist increasing sequences of simple r.v. Xn (resp. Yn),which are σ(X)-measurable (resp. σ(Y )-measurable), converging pointwise to X (resp.Y ).

Then E[XnYn] = E[Xn] · E[Yn] for all n implies (2.2) using monotone integration.)

But for X, Y simple, hence

X =

m∑

i=1

αi1Aiand Y =

n∑

j=1

βj1Bj,

with αi, βj > 0 and Ai ∈ σ(X) resp. Bj ∈ σ(Y ) it follows that

E[XY ] =∑

i,j

αiβj · P (Ai ∩Bj) =∑

i,j

αiβj · P (Ai) · P (Bj) = E[X] · E[Y ].

Corollary 2.4. X,Y independent, X,Y ∈ L1

⇒ XY ∈ L1 and E[XY ] = E[X] · E[Y ] .


Proof. Let ε1, ε2 ∈ +,−. Then Xε1 and Y ε2 are independent by Remark 2.2 andnonnegative. Proposition 2.3 implies

E[Xε1 · Y ε2 ] = E[Xε1 ] · E[Y ε2 ].

In particular Xε1 · Y ε2 in L1, because E[Xε1 ] · E[Y ε2 ] < ∞. Hence

X · Y = X+ · Y + +X− · Y − − (X+ · Y − +X− · Y +) ∈ L1,

and E[XY ] = E[X] · E[Y ].

Remark 2.5. (i) In general the converse to the above corollary does not hold: Forexample let X be N(0, 1)-distributed and Y = X2. Then X and Y are notindependent, but

E[XY ] = E[X3] = E[X] · E[Y ] = 0 .

(ii) X, Y ∈ L² independent ⇒ X, Y uncorrelated,

because

cov(X,Y ) = E[XY ]− E[X] · E[Y ] = 0 .

Corollary 2.6 (to the strong law of large numbers ). Let X1, X2, · · · ∈ L2 be indepen-dent with supi∈N var(Xi) < ∞. Then

limn→∞

1

n

n∑

i=1

(Xi − E[Xi]

)= 0 P-a.s.

If E[Xi] ≡ m then limn→∞

1

n

n∑

i=1

Xi = m P -a.s.

3 Kolmogorov’s law of large numbers

Proposition 3.1 (Kolmogorov, 1930). Let X_1, X_2, ... ∈ L¹ be independent, identically distributed (i.i.d.), m = E[X_i]. Then

(1/n) Σ_{i=1}^n X_i   (the empirical mean)   → m   P-a.s.   (n → ∞).

Proposition 3.1 follows from the following more general result:

Proposition 3.2 (Etemadi, 1981). Let X_1, X_2, ... ∈ L¹ be pairwise independent, identically distributed, m = E[X_i]. Then

(1/n) Σ_{i=1}^n X_i → m   P-a.s.   (n → ∞).


Proof. W.l.o.g. Xi > 0(otherwise consider X+

1 , X+2 , . . . (pairwise independent, identically distributed)

and X−1 , X−

2 , . . . (pairwise independent, identically distributed))

1. Replace Xi by Xi := 1Xi<iXi.

Clearly,

Xi = hi(Xi) with hi(x) :=

x if x < i

0 if x > i

Then X1, X2, . . . are pairwise independent by Remark 2.2. For the proof it is nowsufficient to show that for Sn :=

∑ni=1 Xi we have that

Sn

n

n→∞−−−−→ m P -a.s.

Indeed,

∞∑

n=1

P [Xn 6= Xn] =

∞∑

n=1

P [Xn > n] =

∞∑

n=1

P [X1 > n]

=∞∑

n=1

∞∑

k=n

P[X1 ∈ [k, k + 1)

]=

∞∑

k=1

k · P[X1 ∈ [k, k + 1)

]

=

∞∑

k=1

E[k · 1X1∈[k,k+1)︸ ︷︷ ︸

6X1·1X1∈[k,k+1)

]6 E[X1] < ∞

implies by the Borel-Cantelli lemma

P [Xn 6= Xn infinitely often] = 0 .

2. Reduce the proof to convergence along the subsequence kn = ⌊αn⌋ (= largestnatural number ≤ αn), α > 1.

We will show in Step 3. that

Skn− E[Skn

]

kn

n→∞−−−−→ 0 P-a.s. (2.3)

This will imply the assertion of the Proposition, because

E[Xi] = E[1Xi<i ·Xi

]= E

[1X1<i ·X1

] i→∞ր E[X1](= m)

hence

1

kn· E[Skn

] =1

kn

kn∑

i=1

E[Xi]n→∞−−−−→ m,


and thus by (2.3)

1

kn· Skn

(ω)n→∞−−−−→ m ∀ω ∈ N c

α where Nα ∈ A with P (Nα) = 0.

It follows for l ∈ N ∩ [kn, kn+1), n big, and ω ∈ N cα

knkn+1︸ ︷︷ ︸n→∞−−−−→ 1

α

· Skn

kn(ω)

︸ ︷︷ ︸n→∞−−−−→m

6Sl

l(ω) 6

Skn+1

kn+1(ω)

︸ ︷︷ ︸n→∞−−−−→m

· kn+1

kn︸ ︷︷ ︸n→∞−−−−→α

.

Hence for all ω /∈ Nα

1

α·m 6 lim inf

l→∞Sl(ω)

l6 lim sup

l→∞

Sl(ω)

l6 α ·m.

Finally choose a sequence αn ց 1. Then for all ω ∈ N :=⋂

n>1 Ncαn

= Ω \⋃

n>1 Nαn

liml→∞

Sl(ω)

l= m.

3. Due to Lemma 1.7.7 it suffices for the proof of (2.3) to show that

∀ε > 0 :

∞∑

n=1

P

[∣∣∣∣

Skn− E[Skn

]

kn

∣∣∣∣> ε

]

< ∞

(fast convergence in probability towards 0)

Pairwise independence of Xi implies Xi pairwise uncorrelated, hence

P

[∣∣∣∣

Skn− E[Skn

]

kn

∣∣∣∣> ε

]

61

k2nε2· var(Skn

) =1

k2nε2

kn∑

i=1

var(Xi)

61

k2nε2

kn∑

i=1

E[(Xi)

2].

It therefore suffices to show that

s :=∞∑

n=1

(1

k2n

kn∑

i=1

E[(Xi)

2])

=∑

(i,n)∈N2,i6kn

1

k2n· E[(Xi)

2]< ∞ .

To this end note that

s =

∞∑

i=1

(∑

n : kn>i

1

k2n

)

· E[(Xi)

2].


We will show in the following that there exists a constant c such that

n : kn>i

1

k2n6

c

i2. (2.4)

This will then imply that

s(2.4)

6 c∞∑

i=1

1

i2· E[(Xi)

2]= c

∞∑

i=1

1

i2· E[1X1<i ·X2

1

]

6 c

∞∑

i=1

(1

i2

i∑

l=1

l2 · P[X1 ∈ [l − 1, l)

])

= c∞∑

l=1

(

l2 ·( ∞∑

i=l

1

i2

)

︸ ︷︷ ︸

62l−1

·P[X1 ∈ [l − 1, l)

]

)

6 2c

∞∑

l=1

l · P[X1 ∈ [l − 1, l)

]= 2c

∞∑

l=1

E[

l · 1X1∈[l−1,l)︸ ︷︷ ︸

6(X1+1)·1X1∈[l−1,l)

]

6 2c ·(E[X1] + 1

)< ∞,

where we used the fact that

∞∑

i=l

1

i26

1

l2+

∞∑

i=l+1

1

(i− 1)i=

1

l2+

∞∑

i=l+1

(1

i− 1− 1

i

)

=1

l2+

1

l6

2

l.

It remains to show (2.4). To this end note that

⌊αn⌋ = kn 6 αn < kn + 1

⇒ kn > αn − 1α>1> αn − αn−1 =

(α− 1

α

)

︸ ︷︷ ︸

=:cα

αn.

Let ni be the smallest natural number satisfying kni= ⌊αni⌋ > i, hence αni > i,

then

n : kn>i

1

k2n6 c−2

α

n>ni

1

α2n= c−2

α · 1

1− α−2· α−2ni 6

c−2α

1− α−2· 1

i2.

Corollary 3.3. Let X1, X2, . . . be pairwise independent, identically distributed withXi > 0. Then

limn→∞

1

n

n∑

i=1

Xi = E[X1](∈ [0,∞]

)P-a.s.


Proof. W.l.o.g. E[X1] = ∞. Then 1n

∑ni=1

(Xi(ω) ∧N

) n→∞−−−−→ E[X1 ∧N ], P -a.s. forall N , hence

1

n

n∑

i=1

Xi >1

n

n∑

i=1

(Xi ∧N

) n→∞−−−−→ E[X1 ∧N ]N→∞ր E[X1] P -a.s.

Example 3.4. Growth in random media. Let Y_1, Y_2, ... be i.i.d., Y_i > 0, with m := E[Y_i] (existence of such a sequence later!).

Define X_0 = 1 and inductively X_n := X_{n−1} · Y_n.

Clearly, X_n = Y_1 ··· Y_n and E[X_n] = E[Y_1] ··· E[Y_n] = m^n, hence

E[X_n] → +∞ if m > 1 (exponential growth, supercritical),
E[X_n] = 1  if m = 1 (critical),
E[X_n] → 0  if m < 1 (exponential decay, subcritical).

What will be the long-time behaviour of X_n(ω)? Surprisingly, in the supercritical case m > 1, one may observe that lim_{n→∞} X_n = 0 with positive probability.

Explanation: Suppose that log Y_i ∈ L¹. Then

(1/n) log X_n = (1/n) Σ_{i=1}^n log Y_i → E[log Y_1] =: α   P-a.s.   (n → ∞),

and

α < 0: there exists ε > 0 with α + ε < 0, so that X_n(ω) ≤ e^{n(α+ε)} for all n ≥ n_0(ω), hence P-a.s. exponential decay;

α > 0: there exists ε > 0 with α − ε > 0, so that X_n(ω) ≥ e^{n(α−ε)} for all n ≥ n_0(ω), hence P-a.s. exponential growth.

Note that by Jensen's inequality

α = E[log Y_1] ≤ log E[Y_1] = log m,

and typically the inequality is strict, i.e. α < log m, so that it might happen that α < 0 although m > 1 (!)

Illustration. As a particular example let

Y_i := (1 + c)/2 with probability 1/2,   Y_i := 1/2 with probability 1/2,

so that E[Y_i] = (1 + c)/4 + 1/4 = (c + 2)/4 (supercritical if c > 2). On the other hand

E[log Y_1] = (1/2) · [log((1 + c)/2) + log(1/2)] = (1/2) · log((1 + c)/4) < 0 for c < 3.

Hence X_n → 0 P-a.s. with exponential rate for c < 3, whereas at the same time, for c > 2, E[X_n] = m^n ↗ ∞ with exponential rate.
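A small simulation of this illustration (my own sketch, with c = 2.5 so that 2 < c < 3): a typical sample path collapses to 0 even though the expectation m^n explodes. Working with log X_n avoids numerical underflow.

```python
import math
import random

def simulate_log_path(c=2.5, n=2000, seed=1):
    # log X_n for one path of X_n = Y_1 * ... * Y_n,
    # where Y_i = (1+c)/2 or 1/2, each with probability 1/2.
    rng = random.Random(seed)
    log_x = 0.0
    for _ in range(n):
        y = (1 + c) / 2 if rng.random() < 0.5 else 0.5
        log_x += math.log(y)
    return log_x

c, n = 2.5, 2000
m = (c + 2) / 4                              # = 1.125 > 1, supercritical
alpha = 0.5 * math.log((1 + c) / 4)          # = E[log Y_1] < 0 for c < 3
print("log E[X_n] =", n * math.log(m))       # large and positive
print("n * alpha  =", n * alpha)             # large and negative
print("log X_n    =", simulate_log_path(c, n))  # close to n * alpha
```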


Back to Kolmogorov’s law of large numbers:Let X1, X2, . . . ∈ L1 i.i.d. with m := E[Xi]. Then

1

n

n∑

i=1

Xi(ω)n→∞−−−−→ E[X1] P -a.s.

Define the "random measure"

n(ω,A) :=1

n

n∑

i=1

1A(Xi(ω)

)

= "relative frequency of the event Xi ∈ A"

Then

n(ω, · ) =1

n

n∑

i=1

δXi(ω)

is a probability measure on(R,B(R)

)for fixed ω and it is called the empirical distri-

bution of the first n observations

Proposition 3.5. For P -almost every ω ∈ Ω:

n(ω, · ) n→∞−−−−→ µ := P X−11 weakly.

Proof. Clearly, Kolmogorov’s law of large numbers implies that for any x ∈ R

Fn(ω, x) := n(ω, (−∞, x]

)=

1

n

n∑

i=1

1(−∞,x]

(Xi(ω)

)

→ E[1(−∞,x](X1)] = P [X1 ≤ x] = µ((−∞, x]

)=: F (x)

P -a.s., hence for every ω /∈ N(x) for some P -null set N(x).Then

N :=⋃

r∈Q

N(r).

is a P -null set too, and for all x ∈ R and all s, r ∈ Q with s < x < r and ω /∈ N :

F (s) = limn→∞

Fn(ω, s) 6 lim infn→∞

Fn(ω, x)

6 lim supn→∞

Fn(ω, x) 6 limn→∞

Fn(ω, r) = F (r).

Hence, if F is continuous at x, then for ω /∈ N

limn→∞

Fn(ω, x) = F (x).

Now the assertion follows from Corollary 10.5.
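A small numerical companion to Proposition 3.5 (my own sketch, not from the notes): for i.i.d. Exp(1) observations, the empirical distribution function evaluated at a few points approaches F(x) = 1 − e^{−x} as n grows.

```python
import math
import random

def empirical_df(sample, x):
    # Empirical distribution function: relative frequency of observations <= x.
    return sum(1 for s in sample if s <= x) / len(sample)

rng = random.Random(0)
for n in (100, 10_000, 1_000_000):
    sample = [rng.expovariate(1.0) for _ in range(n)]
    for x in (0.5, 1.0, 2.0):
        print(n, x, round(empirical_df(sample, x), 3), round(1 - math.exp(-x), 3))
```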


4 Joint distribution and convolution

Let Xi ∈ L1 i.i.d. Kolmogorov’s law of large numbers implies that

1

n

n∑

i=1

Xi(ω)

︸ ︷︷ ︸

=:Sn

n→∞−−−−→ E[X1] P -a.s.

hence∫

f(x) d

(

P (Sn

n

)−1)

(x) = E

[

f

(Sn

n

)]

(Lebesgue)n→∞−−−−→ f

(E[X1]

)=

f(x) dδE[X1](x) ∀f ∈ Cb(R)

i.e., the distribution of Sn

nconverges weakly to δE[X1]. This is not surprising, because

at least for Xi ∈ L2

var

(Sn

n

)

=1

n2

n∑

i=1

var(Xi)︸ ︷︷ ︸

=var(X1)

n→∞−−−−→ 0.

We will see later that if we rescale Sn appropriately, namely 1√nSn, so that var

(1√nSn

)=

var(X1)), the sequence of distributions of 1√nSn is asymptotically distributed as a nor-

mal distribution.One problem in this context is: How to calculate the distribution of Sn? Since Sn is afunction of X1, . . . , Xn, we need to consider their joint distribution in the sense of thefollowing definition:

Definition 4.1. Let X1, . . . , Xn be real-valued r.v. on (Ω,A, P ). Then the distributionµ := P X−1 of the transformation

X : Ω → Rn, ω 7→ X(ω) :=(X1(ω), . . . , Xn(ω)

)

under P is said to be the joint distribution of X1, . . . , Xn.Note that µ is a probability measure on

(Rn,B(Rn)

)with µ(A) = P [X ∈ A] for all

A ∈ B(Rn).

Remark 4.2. (i) µ is well-defined, because X : Ω → Rn is A/B(Rn)-measurable.

Proof:

B(Rn) = σ(

A1 × · · · ×An

∣∣ Ai ∈ B(R)

)

(= σ(

A1 × · · · ×An

∣∣ Ai = (−∞, xi] , xi ∈ R

))

and if Ai ∈ B(R), i = 1, ..., n, then

X−1(A1 × · · · ×An) =n⋂

i=1

Xi ∈ Ai︸ ︷︷ ︸

∈A

∈ A

which implies the measurability of the transformation X.


(ii) Proposition 1.11.5 implies that µ is uniquely determined by

µ(A1 × · · · ×An) = P( n⋂

i=1

Xi ∈ Ai)

.

Example 4.3. (i) Let X,Y be r.v., uniformly distributed on [0, 1]. Then

• X, Y independent ⇒ joint distribution = uniform distribution on [0, 1]2

• X = Y ⇒ joint distribution = uniform distribution on the diagonal

(ii) Let X,Y be independent, N(m,σ2) distributed. The following Proposition showsthat the joint distribution of X and Y has the density

f(x, y) =1

2πσ2· exp

[

− 1

2σ2·((x−m)2 + (y −m)2

)]

which is a particular example of a 2-dimensional normal distribution.

In the case m = 0 it follows that

R :=√

X2 + Y 2,

Φ := arctanY

X,

are independent and

Φ has a uniform distribution on]−π

2 ,π2

[,

R has a density

rσ2 exp

(− r2

2σ2

)if r > 0

0 if r < 0.

Definition 4.4. (Products of probability spaces) The product of measurable spaces(Ωi,Ai), i = 1, . . . n, is defined as the measurable space (Ω,A) given by

Ω := Ω1 × . . .× Ωn

endowed with the smallest σ-algebra

A := σ(A1 × . . .×An | Ai ∈ Ai, 1 ≤ i ≤ n)

generated by measurable cylindrical sets. A is said to be the product σ-algebra of Ai

(notation:⊗n

i=1 Ai).Let Pi, i = 1, . . . , n, be probability measures on (Ωi,Ai). Then there exists a unique

probability measure P on the product space (Ω,A) satisfying

P (A1 × · · · ×An) = P1(A1) · . . . · Pn(An)

for every measurable cylindrical set. P is called the product measure of Pi (notation:⊗n

i=1 Pi).(Uniqueness of P follows from 1.11.5, existence later!)


Proposition 4.5. Let X1, . . . , Xn be r.v. on (Ω,A, P ) with distributions µ1, . . . , µn

and joint distribution µ. Then

X1, . . . , Xn independent ⇔ µ =

n⊗

i=1

µi,

(i.e., µ(A1 × · · · ×An) =∏

i µi(Ai) if Ai ∈ B(R)).In this case:

(i) µ is uniquely determined by µ1, . . . , µn.

(ii)∫

ϕ(x1, . . . , xn) dµ(x1, . . . , xn)

=

∫(

· · ·(∫

ϕ(x1, . . . , xn) µi1(dxi1)

)

· · ·)

µin(dxin).

for all B(Rn)-measurable functions ϕ : Rn → R with ϕ ≥ 0 or ϕ µ-integrable.

(iii) If µi is absolutely continuous with density fi, i = 1, . . . , n, then µ is absolutelycontinuous with density

f(x) :=n∏

i=1

fi(xi).

Proof. The equivalence is obvious.

(i) Obvious from part (ii) of the previous Remark 4.2.

(ii) See text books on measure theory.

(iii) f is nonnegative and measurable on Rn w.r.t. B(Rn), and

Rn

f(x) dx =

n∏

i=1

R

fi(xi) dxi = 1.

Hence,

µ(A) :=

A

f(x) dx, A ∈ B(Rn),

defines a probability measure on (Rn,B(Rn)). For A1, . . . , An ∈ B(R) it followsthat

µ(A1 × · · · ×An) =

n∏

i=1

µi(Ai) =

n∏

i=1

Ai

fi(xi) dxi

(ii)=

1A1×···×An(x) · f(x) dx = µ(A1 × · · · ×An).

Hence µ = µ by 1.11.5.


Let X1, . . . , Xn be independent, Sn := X1 + · · ·+Xn

How to calculate the distribution of Sn with the help of the distribution of Xi?In the following denote by Tx : R1 → R1, y 7→ x+ y, the translation by x ∈ R.

Proposition 4.6. Let X1, X2 be independent r.v. with distributions µ1, µ2. Then:

(i) The distribution of X1 +X2 is given by the convolution

µ1 ∗ µ2 :=

µ1(dx1) µ2 T−1x1

, i.e.

µ1 ∗ µ2(A) =

∫ ∫

1A(x1 + x2) µ1(dx1) µ2(dx2)

=

µ1(dx1) µ2(A− x1) ∀A ∈ B(R1) .

(ii) If one of the distributions µ1, µ2 is absolutely continuous, e.g. µ2 with density f2,then µ1 ∗ µ2 is absolutely continuous again with density

f(x) :=

µ1(dx1) f2(x− x1)

(

=

f1(x1) · f2(x− x1) dx1 =: (f1 ∗ f2)(x) if µ1 = f1 dx1.

)

Proof. (i) Let A ∈ B(R), and define A :=(x1, x2) ∈ R2

∣∣ x1 + x2 ∈ A

. Then

P [X1 +X2 ∈ A] = P[(X1, X2) ∈ A

]= (µ1 ⊗ µ2)(A)

=

∫∫

1A(x1, x2) d(µ1 ⊗ µ2)(x1, x2)

=

∫∫

1A(x1 + x2) d(µ1 ⊗ µ2)(x1, x2)

=

∫ (∫

1A−x1(x2) µ2(dx2)

)

µ1(dx1)

=

µ2(A− x1) µ1(dx1) = (µ1 ∗ µ2)(A).

(ii)

(µ1 ∗ µ2)(A) =

µ1(dx1) µ2(A− x1) =

µ1(dx1)

A−x1

f2(x2) dx2

change of variablex−x1=x2

=

µ1(dx1)

A

f2(x− x1) dx

4.5=

A

(∫

µ1(dx1) f2(x− x1)

)

dx.


Example 4.7.

(i) Let X_1, X_2 be independent r.v. with Poisson distributions π_{λ_1} and π_{λ_2}. Then X_1 + X_2 has Poisson distribution π_{λ_1+λ_2}, because

(π_{λ_1} ∗ π_{λ_2})({n}) = Σ_{k=0}^n π_{λ_1}({k}) · π_{λ_2}({n − k})
    = e^{−(λ_1+λ_2)} Σ_{k=0}^n (λ_1^k / k!) · (λ_2^{n−k} / (n − k)!)
    = e^{−(λ_1+λ_2)} (1/n!) Σ_{k=0}^n (n choose k) · λ_1^k λ_2^{n−k}
    = e^{−(λ_1+λ_2)} · (λ_1 + λ_2)^n / n!.
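A quick numerical check of this convolution identity (my own sketch): convolving the weights of π_2 and π_3 reproduces the weights of π_5.

```python
import math

def poisson_pmf(lam, n):
    return math.exp(-lam) * lam**n / math.factorial(n)

def convolution_pmf(lam1, lam2, n):
    # (pi_lam1 * pi_lam2)({n}) = sum_{k=0}^{n} pi_lam1({k}) * pi_lam2({n-k})
    return sum(poisson_pmf(lam1, k) * poisson_pmf(lam2, n - k) for k in range(n + 1))

for n in range(6):
    print(n, round(convolution_pmf(2.0, 3.0, n), 6), round(poisson_pmf(5.0, n), 6))
```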

(ii) Let X1, X2 be independent r.v. with normal distributions N(mi, σ2i ), i = 1, 2.

Then X1+X2 has normal distribution N(m1+m2, σ21+σ2

2), because fm1+m2,σ21+σ2

2=

fm1,σ21∗ fm2,σ

22

(Exercise!)

(iii) The Gamma distribution Γα,p is defined through its density γα,p given by

γα,p(x) =

1

Γ(p) · αpxp−1e−αx if x > 0

0 if x 6 0

If X1, X2 are independent with distribution Γα,pi, i = 1, 2, then X1 + X2 has

distribution Γα,p1+p2. (Exercise!)

In the particular case pi = 1: The sum Sn = T1 + . . .+Tn of independent r.v. Ti

with exponential distribution with parameter α has Gamma-distribution Γα,n, i.e.

γα,n(x) =

αn

(n−1)! · e−αxxn−1 if x > 0

0 if x 6 0.

Example 4.8 (The waiting time paradox). Let T_1, T_2, ... be independent, exponentially distributed waiting times (e.g. the time between the reception of two phone calls in a call center) with parameter α > 0, so that in particular

E[T_i] = ∫_0^∞ x · αe^{−αx} dx = 1/α.

Fix some time t > 0. Let X denote the time-interval from the preceding event to t, and Y denote the time-interval from t to the next event.

[Diagram: timeline of the intervals T_1, T_2, ..., with the fixed time t, the interval X before t and the interval Y after t]

Question: How long on average is the waiting time from t until the next event, i.e., how big is E[Y]?

Naive guess: E[Y] = 1/(2α) — this is wrong.

Correct answer: E[Y] = 1/α, and

E[X] = (1/α)(1 − e^{−αt}) ≈ 1/α for large t.

More precisely:

(i) Y has exponential distribution with parameter α.

(ii) X has exponential distribution with parameter α, "compressed to" [0, t], i.e.:

P[X > s] = e^{−αs} for all 0 ≤ s ≤ t,   P[X = t] = e^{−αt}.

In particular,

E[X] = ∫_0^t s · αe^{−αs} ds + t · e^{−αt} = (1/α)(1 − e^{−αt}),

and [t − X, t + Y] has on average length (1/α)(2 − e^{−αt}) (≈ 2 · 1/α for large t).

(iii) X, Y are independent.

Descriptive: Choosing t "randomly", it is more likely to pick a large interval.

Proof. Let us first determine the joint distribution of X and Y : Fix 0 6 x 6 t andy > 0. Then for Sn := T1 + · · ·+ Tn, S0 := 0:

P [X > x, Y > y]

= P( ⋃

n≥0

t− Sn > x, Sn+1 − t > y)

= P [T1 > y + t] +

∞∑

n=1

P[Sn 6 t− x, Tn+1 > y + t− Sn

]

= e−α(t+y) +

∞∑

n=1

∫∫

1[0,t−x]×[y+t−s,∞)(s, t) · γα,n(s) · αe−αr ds dr

= e−α(t+y) +

∞∑

n=1

∫ t−x

0

γα,n(s) · e−α(y+t−s) ds

= e−α(t+y)

(

1 +

∫ t−x

0

eαs∞∑

n=1

γα,n(s)

︸ ︷︷ ︸

ds

)

= e−α(t+y)

(

1 +

∫ t−x

0

αeαs ds

)

= e−α(t+y) · eα(t−x) = e−αy · e−αx.


Consequently:

(i) Setting x = 0 we get: Y is exponentially distributed with parameter α.

(ii) Setting y = 0 we get: X is exponentially distributed with parameter α, compressed to [0, t].

(iii) X, Y are independent.
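A simulation sketch of the paradox (my own, with α = 1 and t = 10): the interarrival interval straddling the fixed time t is on average about twice as long as a typical interarrival time 1/α.

```python
import random

def straddling_interval_means(alpha=1.0, t=10.0, n_runs=100_000, seed=0):
    rng = random.Random(seed)
    total_x, total_y = 0.0, 0.0
    for _ in range(n_runs):
        s = 0.0
        while True:
            gap = rng.expovariate(alpha)
            if s + gap > t:             # this interarrival interval straddles t
                total_x += t - s        # X: time from the preceding event to t
                total_y += s + gap - t  # Y: time from t to the next event
                break
            s += gap
    return total_x / n_runs, total_y / n_runs

ex, ey = straddling_interval_means()
print(ex, ey, ex + ey)   # roughly 1, 1, 2  -- while E[T_i] = 1/alpha = 1
```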

5 Characteristic functions

Let M¹₊(Rⁿ) be the set of all probability measures on (Rⁿ, B(Rⁿ)). For a given µ ∈ M¹₊(Rⁿ) define its characteristic function as the complex-valued function µ̂ : Rⁿ → C given by

µ̂(u) := ∫ e^{i⟨u,y⟩} µ(dy) := ∫ cos(⟨u, y⟩) µ(dy) + i ∫ sin(⟨u, y⟩) µ(dy).

sin(〈u, y〉)µ(dy) .

Proposition 5.1. Let µ ∈ M1+(R

n). Then

(i) µ(0) = 1.

(ii) |µ| 6 1.

(iii) µ(−u) = µ(u).

(iv) µ is uniformly continuous.

(v) µ is positive definite, i.e. for all c1, . . . , cm ∈ C, u1, . . . , um ∈ Rn, m > 1:

m∑

j,k=1

cj ck · µ(uj − uk) > 0.

Proof. Exercise.

Proposition 5.2 (Uniqueness theorem). Let µ1, µ2 ∈ M1+(R

n) with µ1 = µ2. Thenµ1 = µ2.

Proposition 5.3 (Bochner’s theorem). Let ϕ : Rn → C be a continuous, positivedefinite function with ϕ(0) = 1. Then there exists one (and only one) µ ∈ M1

+(Rn)

with µ = ϕ.

Proposition 5.4 (Lévy’s continuity theorem). Let (µm)m∈N be a sequence in M1+(R

n).Then

(i) limm→∞ µm = µ weakly implies limm→∞ µm = µ uniformly on every compactsubset of Rn.


(ii) Conversely, if (µm)m∈N converges pointwise to some function ϕ : Rn → C whichis continuous in u = 0, then there exists a unique µ ∈ M1

+(Rn) such that µ = ϕ

and limm→∞ µm = µ weakly.

Proof. For the proofs or references where to find the proofs of the three previous propo-sitions see Klenke Theorem 15.9, 15.23, and 15.29.

Let (Ω,A, P ) be a probability space and X : Ω → Rn be A/B(Rn)-measurable. LetPX (:= P X−1) be the distribution of X. Then

ϕX(u) := PX(u) =

ei〈u,y〉 PX(dy) =

ei〈u,X〉dP = E[

ei〈u,X〉]

is called the characteristic function of X.

Remark 5.5. X1, . . . , Xn are independent if and only if

P(X1,...,Xn)︸ ︷︷ ︸

=ϕ(X1,...,Xn)

(u1, . . . , un) = ( ⊗nj=1PXj

)(u1, . . . , un) ( =n∏

j=1

PXj(uj)

︸ ︷︷ ︸

=ϕXj(uj)

),

i.e.: P(X1,...,Xn) =

n∏

j=1

PXj Prj , where Prj(u) = uj .

Proposition 5.6. Let X1, . . . , Xn be independent r.v., α ∈ R and S := α∑n

k=1 Xk.Then for all u ∈ R:

ϕS(u) =

n∏

k=1

ϕXk(αu).

Proof.

ϕS(u) =

eiuS dP =

∫ n∏

k=1

eiαuXk dPIndep.=

n∏

k=1

eiαuXk dP =

n∏

k=1

ϕXk(αu).

Proposition 5.7. For all u ∈ Rn:

(1

)n2∫

ei〈u,y〉e−12‖y‖

2

dy = e−12‖u‖

2

.

Proof. See Theorem 15.12 in Klenke.

Example 5.8. (i) δ̂_a(u) = e^{iua}.

(ii) Let µ := Σ_{i=0}^∞ α_i δ_{a_i} (α_i ≥ 0, Σ_{i=0}^∞ α_i = 1). Then

µ̂(u) = Σ_{i=0}^∞ α_i e^{iua_i},   u ∈ R.

Special cases:

a) Binomial distribution β_n^p = Σ_{k=0}^n (n choose k) p^k q^{n−k} δ_k. Then for all u ∈ R:

β̂_n^p(u) = Σ_{k=0}^n (n choose k) p^k q^{n−k} · e^{iuk} = (q + pe^{iu})^n.

b) Poisson distribution π_α = Σ_{n=0}^∞ e^{−α} (α^n / n!) δ_n. Then for all u ∈ R:

π̂_α(u) = e^{−α} Σ_{n=0}^∞ (αe^{iu})^n / n! = e^{α(e^{iu} − 1)}.
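A numerical sanity check of case b) (my own sketch): summing the defining series term by term agrees with the closed form e^{α(e^{iu}−1)}.

```python
import cmath

def char_poisson_series(alpha, u, n_terms=100):
    # Truncated series: sum_n  e^{-alpha} * alpha^n / n! * e^{iun}.
    total = 0j
    term = cmath.exp(-alpha)          # n = 0 term: e^{-alpha}
    for n in range(n_terms):
        total += term * cmath.exp(1j * u * n)
        term *= alpha / (n + 1)       # update alpha^n / n!  ->  alpha^{n+1} / (n+1)!
    return total

def char_poisson_closed(alpha, u):
    return cmath.exp(alpha * (cmath.exp(1j * u) - 1))

for u in (0.0, 0.5, 1.0, 3.0):
    print(u, char_poisson_series(2.0, u), char_poisson_closed(2.0, u))
```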

6 Central limit theorem

Definition 6.1. Let X_1, X_2, ... ∈ L² be independent r.v. with strictly positive variances, S_n := X_1 + ... + X_n and

S_n* := (S_n − E[S_n]) / √(var(S_n))   ("standardized sum")

(so that in particular E[S_n*] = 0 and var(S_n*) = 1). The sequence X_1, X_2, ... of r.v. is said to have the central limit property (CLP), if

lim_{n→∞} P_{S_n*} = N(0, 1) weakly,

or equivalently

lim_{n→∞} P[S_n* ≤ b] = (1/√(2π)) ∫_{−∞}^b e^{−x²/2} dx = Φ(b)   for all b ∈ R.

Proposition 6.2 (Central limit theorem). Let X_1, X_2, ... ∈ L² be independent r.v., σ_n² := var(X_n) > 0 and

s_n := (Σ_{k=1}^n σ_k²)^{1/2}.

Assume that (X_n)_{n∈N} satisfies Lindeberg's condition

lim_{n→∞} L_n(ε) := lim_{n→∞} Σ_{k=1}^n ∫_{{|X_k − E[X_k]| > ε s_n}} ((X_k − E[X_k]) / s_n)² dP = 0   for all ε > 0.   (L)

Then (X_n)_{n∈N} has the CLP.

Remark 6.3. (i) (Xn)n∈N i.i.d. ⇒ (Xn)n∈N satisfies (L).

Proof: Let m := E[Xn], σ2 := var(Xn). Then s2n = nσ2, so that

Ln(ε) = σ−2

|X1−m|>ε√nσ

(X1 −m)2 dPLebesguen→∞−−−−→ 0.


(ii) The following stronger condition, known as Lyapunov’s condition, is often easierto check in applications:

∃ δ > 0 : limn→∞

n∑

k=1

E[∣∣Xk − E[Xk]

∣∣2+δ]

s2+δn

= 0 . (Lya)

To see that Lyapunov’s condition implies Lindeberg’s condition note that for allε > 0:

∣∣Xk − E[Xk]

∣∣

εsn> 1

⇒∣∣Xk − E[Xk]

∣∣2+δ

(εsn)δ>∣∣Xk − E[Xk]

∣∣2

and therefore

Ln(ε) 61

εδ·

n∑

k=1

E[∣∣Xk − E[Xk]

∣∣2+δ]

s2+δn

.

(iii) Let (Xn) be bounded and suppose that sn → ∞. Then (Xn) satisfies Lyapunov’scondition for any δ > 0, because

|Xk| 6α

2

⇒∣∣Xk − E[Xk]

∣∣ 6 α

n∑

k=1

E[∣∣Xk − E[Xk]

∣∣2+δ]

s2+δn

6

n∑

k=1

E[∣∣Xk − E[Xk]

∣∣2αδ]

s2nsδn

=

sn

)δ1

s2n·

n∑

k=1

E[∣∣Xk − E[Xk]

∣∣2]

︸ ︷︷ ︸

=s2n

=

sn

.

Lemma 6.4. Suppose that (Xn) satisfies Lindeberg’s condition. Then

limn→∞

max16k6n

σk

sn= 0 . (2.5)

Proof. For all 1 6 k 6 n

(σk

sn

)2

=

∫ (Xk − E[Xk]

sn

)2

dP 6

|Xk−E[Xk]|sn

(Xk − E[Xk]

sn

)2

dP + ε2

6 Ln(ε) + ε2.


The proof of Proposition 6.2 requires some further preparations.

Lemma 6.5. For all t ∈ R and n ∈ N:∣∣∣∣eit − 1− it

1!− (it)2

2!− · · · − (it)n−1

(n− 1)!

∣∣∣∣6

|t|nn!

.

Proof. Define f(t) := eit, then f (k)(t) = ikeit, and the Taylor series expansion, appliedto real and imaginary part, implies that

∣∣∣∣eit − 1− · · · − (it)n−1

(n− 1)!

∣∣∣∣=∣∣Rn(t)

∣∣

with

∣∣Rn(t)

∣∣ =

∣∣∣∣

1

(n− 1)!

∫ t

0

(t− s)n−1ineis ds

∣∣∣∣6

1

(n− 1)!

∫ |t|

0

sn−1 ds =|t|nn!

.

Proposition 6.6. Let X ∈ L2. Then ϕX(u) =∫eiuX dP is two times continuously

differentiable with

ϕ′X(u) = i

XeiuX dP , ϕ′′X(u) = −

X2eiuX dP .

In particular

ϕ′X(0) = i · E[X], ϕ′′

X(0) = −E[X2], |ϕ′′X | 6 E[X2].

Moreover, for all u ∈ R

ϕX(u) = 1 + iu · E[X] +1

2· θ(u)u2 · E[X2]

with∣∣θ(u)

∣∣ 6 1 and θ(u) ∈ C.

Proof. Clearly,

(eiuX)′ = iX · eiuX , (eiuX)′′ = −X2eiuX , |eiuX | = 1 .

Now, Lebesgue’s dominated convergence theorem implies all assertions up to the lastone. For the proof of the last assertion note that the previous lemma implies in thecase n = 2, t = uX that

|eiuX − 1− iuX| 6 1

2· u2X2.

Hence

∣∣ϕX(u)− 1− iu · E[X]

∣∣ =

∣∣∣∣

eiuX − 1− iuX dP

∣∣∣∣6

1

2· u2 · E[X2].

Now define θ(u) := 0 if u = 0 or E[X2] = 0, and θ(u) := ϕX(u)−1−iu·E[X]12 ·u2·E[X2]

otherwise.


From now on assume that X1, X2, · · · ∈ L2 are independent and

E[Xn] = 0, σ2n := var(Xn) > 0, sn =

( n∑

k=1

σ2k

) 12

, n ≥ 1.

Proposition 6.7. Suppose that

(a) limn→∞

max1≤k≤n

σk

sn= 0 and

(b) limn→∞

n∑

k=1

(

ϕXk

(u

sn

)

− 1

)

= −1

2u2 ∀u ∈ R .

Then (Xn) has the CLP.

Proof. It is sufficient to show that

limn→∞

n∏

k=1

ϕXk

(u

sn

)

= e−12u

2

. (2.6)

because for S∗n = 1

sn

∑nk=1 Xk we have that

ϕS∗n(u) =

n∏

k=1

ϕXk

(u

sn

)

,

and ϕS∗n(u)

n→∞−−−−→ e−12u

2

= N(0, 1)(u) pointwise, implies by Lévy’s continuity theoremthat limn→∞ PS∗

n= N(0, 1) weakly.

For the proof of (2.6) we need to show that for all u ∈ R

limn→∞

(n∏

k=1

ϕXk

(u

sn

)

−n∏

k=1

exp

[

ϕXk

(u

sn

)

− 1

]

︸ ︷︷ ︸

=exp[∑··· ]→exp[− 1

2u2]

)

= 0 .

To this end fix u ∈ R and note that |ϕXk| 6 1, hence

∣∣∣∣∣exp

[

ϕXk

(u

sn

)

− 1

]∣∣∣∣∣= exp

[

ReϕXk

(u

sn

)

− 1

]

6 1.

Note that for a1, . . . , an, b1, . . . , bn ∈z ∈ C

∣∣ |z| 6 1

∣∣∣

n∏

k=1

ak −n∏

k=1

bk

∣∣∣ =

∣∣(a1 − b1) · a2 · · · an + b1 · (a2 − b2) · a3 · · · an + . . .

+ b1 · · · bn−1 · (an − bn)∣∣

6

n∑

k=1

|ak − bk|.


Consequently,

∣∣∣∣∣

n∏

k=1

ϕXk

(u

sn

)

−n∏

k=1

exp

[

ϕXk

(u

sn

)

− 1

]∣∣∣∣∣

6

n∑

k=1

∣∣∣∣∣ϕXk

(u

sn

)

− exp

[

ϕXk

(u

sn

)

− 1

]∣∣∣∣∣=: Dn.

If we define zk := ϕXk( usn)− 1, we can write

Dn =

n∑

k=1

|zk + 1− ezk |.

Fix ε ∈ (0, 12 ]. Then

|z + 1− ez| 6 |z|2 6 ε|z| ∀ |z| 6 ε .

Note that E[Xk] = 0 and E[X2k ] = σ2

k. The previous proposition now implies that forall k

|zk| 61

2

(u

sn

)2

σ2k,

and moreover by (a) we can find n0 ∈ N such that for all n > n0 and 1 6 k 6 n

1

2

(u

sn

)2

σ2k < ε.

Hence for all n > n0

Dn 6 ε

n∑

k=1

|zk| 6 εu2

2

n∑

k=1

σ2k

s2n= ε · u

2

2.

Consequently, limn→∞ Dn = 0.

Proof of Proposition 6.2. W.l.o.g. assume that E[Xn] = 0 for all n ∈ N. We will useProposition 6.7. Since (L) ⇒ (a) by Lemma 6.4 it remains to show (b) of Proposition6.7. We will show that Lindeberg’s condition implies (b), i.e. we show that (L) implies

limn→∞

n∑

k=1

(

ϕXk

(u

sn

)

− 1

)

= −1

2· u2.

Let u ∈ R, n ∈ N, 1 6 k 6 n. Lemma 6.5 implies that

Yk :=

∣∣∣∣∣exp

(

i · u

sn·Xk

)

− 1− i · u

sn·Xk

︸ ︷︷ ︸

E[... ]=0

+1

2· u

2

s2n·X2

k

∣∣∣∣∣6

1

6

∣∣∣∣

u

sn·Xk

∣∣∣∣

3

,


and

Yk 6

∣∣∣∣∣exp

(

i · u

sn·Xk

)

− 1− i · u

sn·Xk

∣∣∣∣∣+

1

2· u

2

s2n·X2

k 6u2

s2n·X2

k .

Then∣∣∣∣∣

n∑

k=1

(

ϕXk

(u

sn

)

− 1

)

+1

2· u2

∣∣∣∣∣=

∣∣∣∣∣

n∑

k=1

(

ϕXk

(u

sn

)

− 1 +1

2· u

2

s2n· σ2

k

)∣∣∣∣∣

6

n∑

k=1

E[Yk],

and for any ε > 0

E[Yk] =

|Xk|>εsnYk dP +

|Xk|<εsnYk dP

6u2

s2n

|Xk|>εsnX2

k dP +|u|36s3n

|Xk|<εsn|Xk|3 dP.

Note that

1

s3n

|Xk|<εsn|Xk|3 dP 6

ε

s2n

X2k dP = ε · σ

2k

s2n,

so that we obtain

n∑

k=1

E[Yk] 6 u2n∑

k=1

|Xksn

|>ε

(Xk

sn

)2

dP +|u|36

· εn∑

k=1

σ2k

s2n︸ ︷︷ ︸

=1

= u2Ln(ε) +|u|36

· ε .

Consequently

limn→∞

n∑

k=1

E[Yk] = 0 ,

and thus

limn→∞

(n∑

k=1

[

ϕXk

(u

sn

)

− 1

]

+1

2· u2

)

= 0.

Example 6.8 (Applications). (i) "Ruin probability"Consider a portfolio of n contracts of a risc insurance (e.g. car insurance, fireinsurance, health insurance, ...). Let Xi > 0 be the claim size (or claim severity)

74

Page 75: trutnau/lecturetim12015.pdf · Bibliography 1. Bauer, H., Probability theory, de Gruyter, 1996. 2. Bauer, H., Measure and integration theory, de Gruyter, 1996, ISBN: 3-11-016719-0

of the ith contract, 1 6 i 6 n. We assume that X1, . . . , Xn ∈ L2 are i.i.d. withm := E[Xi] and σ2 := var(Xi).

Suppose the insurance holder has to pay he following premium

Π := m+ λσ2

= average claim size + safety loading.

After some fixed amount of time:

Income: nΠ

Expenditures: Sn =

n∑

i=1

Xi.

Suppose that K is the initial capital of the insurance company. What is theprobability P (R), where

R := Sn > K + nΠ denotes the ruin ?

We assume here that:

• No interest rate.

• Payments due only at the end of the time period.

Let

S∗n :=

Sn − nm√nσ

.

The central limit theorem implies for large n that S∗n ∼ N(0, 1), so that

P (R) = P

[

S∗n >

K + nΠ− nm√nσ

]

= P

[

S∗n >

K + nλσ2

√nσ

]

CLT≈ 1− Φ

(K + nλσ2

√nσ

︸ ︷︷ ︸n→∞−−−−→∞

)

,

where Φ denotes the distribution of the standard normal distribution. Note thatthe ruin probability decreases with an increasing number of contracts.

ExampleAssume that n = 2000, σ = 60, λ = 0.5‰.

(a) K = 0 ⇒ P (R) ≈ 1− Φ(1.342) ≈ 9%.

(b) K = 1500 ⇒ P (R) ≈ 3%.

How large do we have to choose n in order to let the probability of ruin P (R) fallbelow 1‰?Answer: Φ(. . .) > 0.999, hence n > 10 611.


(ii) Stirling's formula. Remark: Stirling proved the following formula

n! ≈ √(2π) · n^{n+1/2} · e^{−n}   (2.7)

in the year 1730, and De Moivre used it in his proof of the CLT for Bernoulli experiments.

Conversely, in 1977, Weng provided an independent proof of the formula, usingthe CLT (note that we did not use Stirling’s formula in our proof of the CLT).Here is Weng’s proof:

Let X1, X2, . . . be i.i.d. with distribution π1, i.e.,

PXn= e−1

∞∑

k=0

1

k!δk .

Then Sn := X1 + · · ·+Xn has Poisson distribution πn, i.e.,

PSn= e−n

∞∑

k=0

nk

k!δk,

and in particular E[Sn] = var(Sn) = n. Thus

S∗n =

Sn − n√n

,

so that S∗n = tn Sn for tn(x) :=

x−n√n

. Then

f dPS∗n= E

[f(S∗

n)]= E

[(f tn)(Sn)

]=

f tn d PSn︸︷︷︸

=πn

In particular, for

f∞(x) := x− (= (−x) ∨ 0

)

it follows that∫

f∞ dPS∗n=

R

f∞

(x− n√

n

)

︸ ︷︷ ︸

=0 if x>n

= n−x√n

if x6n

πn(dx) = e−n

n∑

k=0

nk

k!· n− k√

n︸ ︷︷ ︸

=f∞(k)

=e−n

√n

·(

n+n∑

k=1

nk(n− k)

k!

)

=e−n

√n

·(

n+

n∑

k=1

nk+1

k!− nk

(k − 1)!︸ ︷︷ ︸

=nn+1

n! −n1

0!

)

=e−n · nn+ 1

2

n!.


Moreover,

f∞ dN(0, 1) =1√2π

∫ 0

−∞(−x) · e− x2

2 dx =1√2π

· e− x2

2

∣∣∣∣

0

−∞=

1√2π

.

Hence, Stirling’s formula (2.7) would follow, once we have shown that

f∞ dPS∗n

n→∞−−−−→∫

f∞ dN(0, 1). (2.8)

Note that this is not implied by the weak convergence in the CLT since f∞ iscontinuous but unbounded. Hence, we consider for given m ∈ N

fm := f∞ ∧m(∈ Cb(R)

).

The CLT now implies that

fm dPS∗n

n→∞−−−−→∫

fm dN(0, 1).

Define gm := f∞ − fm (≥ 0). (2.8) then follows from a "3ε-argument", oncewe have shown that

(0 6)

gm dPS∗n6

1

m∀ m,

(0 6)

gm dN(0, 1) 61

m∀m.

The first inequality follows from

gm dPS∗n

=

]−∞,−m[

(|x| −m

)dPS∗

n6

]−∞,−m]

|x| dPS∗n

|x|m

>1

61

m

]−∞,−m]

x2 dPS∗n6

1

m· var(S∗

n)︸ ︷︷ ︸

=1

,

the second inequality can be shown similarly.
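A short numerical check (my own sketch) of the two sides of (2.7), and of the quantity ∫ f_∞ dP_{S_n*} = e^{−n} · n^{n+1/2} / n! computed above, which by the argument converges to 1/√(2π) ≈ 0.3989; the second computation is done in log space to avoid overflow.

```python
import math

def stirling(n):
    return math.sqrt(2 * math.pi) * n ** (n + 0.5) * math.exp(-n)

for n in (5, 10, 20):
    print(n, math.factorial(n), round(stirling(n), 2))

# e^{-n} * n^{n+1/2} / n!  ->  1/sqrt(2*pi) ~ 0.39894
for n in (10, 100, 1000):
    log_value = -n + (n + 0.5) * math.log(n) - math.lgamma(n + 1)  # lgamma(n+1) = log(n!)
    print(n, round(math.exp(log_value), 5))
```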


3 Conditional probabilities and

expectations

1 Elementary definitions

Let (Ω,A, P ) be a probability space.

Definition 1.1. Let B ∈ A with P(B) > 0. Then

P[A | B] := P(A ∩ B) / P(B),   A ∈ A,

is called the conditional probability of A given B. In the case P(B) = 0 we simply define P[A | B] := 0. For P(B) > 0 the probability measure

P_B := P[· | B]

on (Ω, A) is called the conditional distribution given B.

Remark 1.2. (i) P (A) is called the a priori probability of A.

P [A |B] is called the a posteriori probability of A, given the information that Boccurred.

(ii) In the case of Laplace experiments

P [A |B] =|A ∩B||B| = proportion of elements in A that are in B.

(iii) If A and B are disjoint (hence A ∩B = ∅), then P [A |B] = 0.

(iv) If A and B are independent, then

P [A |B] =P (A) · P (B)

P (B)= P (A).

Example 1.3. (i) Suppose that a family has two children. Consider the followingtwo events: B := "at least one boy" and A := "two boys". Then P [A |B] = 1

3 ,because

Ω =(boy, boy), (girl, boy), (boy, girl), (girl, girl)

,

P = uniform distribution,

and thus

P [A |B] =|A ∩B||B| =

1

3.


(ii) Let X1, X2 be independent r.v. with Poisson distribution with parameters λ1, λ2.Then

P [X1 = k |X1 +X2 = n] =

0 if k > n

? if 0 6 k 6 n.

According to Example 4.7 X1 +X2 has Poisson distribution with parameter λ :=λ1 + λ2. Consequently,

P [X1 = k |X1 +X2 = n] =P [X1 = k, X2 = n− k]

P [X1 +X2 = n]

=e−λ1 λk

1

k! · e−λ2λn−k2

(n−k)!

e−λ λn

n!

=

(n

k

)

·(λ1

λ

)k(λ2

λ

)n−k

,

i.e., P [X1 ∈ · | X1 +X2 = n] is the binomial distribution with parameters n andp = λ1

λ1+λ2.

(iii) Consider n independent 0-1-experiments X1, . . . , Xn with success probability p ∈(0, 1). Let

Sn := X1 + . . .+Xn

and

Xi : Ω := 0, 1n → 0, 1,(x1, . . . , xn) 7→ xi.

For given (x1, . . . , xn) ∈ 0, 1n and fixed k ∈ 0, . . . , n

P [X1 = x1, . . . , Xn = xn |Sn = k]

=

0 if∑

i xi 6= kpk(1−p)n−k

(nk)pk(1−p)n−k=(nk

)−1otherwise

It follows that the conditional distribution ν := P [(X1, ..., Xn) ∈ · | Sn = k] isthe uniform distribution on

Ωk :=

(x1, . . . , xn)∣∣∣

n∑

i=1

xi = k

,

because |Ωk| =(nk

)and for each (x1, ...xn) ∈ Ωk we have ν((x1, ...xn)) =

(nk

)−1.

Proposition 1.4 (Formula for total probability). Let B_1, ..., B_n be disjoint, B_i ∈ A for all 1 ≤ i ≤ n. Then for all A ⊂ ⋃_{i=1}^n B_i, A ∈ A:

P(A) = Σ_{i=1}^n P[A | B_i] · P(B_i).


Proof. Clearly, A = ∪i6n(A ∩Bi). Consequently,

P (A) =

n∑

i=1

P (A ∩Bi) =

n∑

i:P (Bi)>0

P (A ∩Bi)

P (Bi)P (Bi) =

n∑

i=1

P (A |Bi)P (Bi) .

Example 1.5 (Simpson's paradox). Consider applications of male (M) and female (F) students at a university in the United States:

     Applications   accepted
M    2084           1036        P_M[A] ≈ 0.49
F    1067           349         P_F[A] ≈ 0.33

Is this an example of discrimination against female students? A closer look at the biggest four faculties B_1, ..., B_4:

         male                              female
     Appl.   acc.   P_M[A | B_i]      Appl.   acc.   P_F[A | B_i]
B1   826     551    0.67              108     89     0.82
B2   560     353    0.63              25      17     0.68
B3   325     110    0.34              593     219    0.37
B4   373     22     0.06              341     24     0.07
     2084    1036                     1067    349

It follows that for all four faculties the probability of being accepted was higher for female students than it was for male students:

P_M[A | B_i] < P_F[A | B_i].

Nevertheless, the preference turns into its opposite if one looks at the total probability of admission:

P_F(A) = Σ_{i=1}^4 P_F[A | B_i] · P_F(B_i) < Σ_{i=1}^4 P_M[A | B_i] · P_M(B_i) = P_M(A).

For an explanation consider the distributions of applications:

P_M(B_1) = |B_1 ∩ M| / |M| = 826/2084 ≈ 4/10,   P_F(B_1) = |B_1 ∩ F| / |F| = 108/1067 ≈ 1/10,

etc., and observe that male students mainly applied at faculties with a high probability of admission, whereas female students mainly applied at faculties with a low probability of admission.

Proposition 1.6 (Bayes' theorem). Let B_1, ..., B_n ∈ A be disjoint with P(B_i) > 0 for i = 1, ..., n. Let A ∈ A, A ⊂ ⋃_{i=1}^n B_i with P(A) > 0. Then:

P[B_i | A] = P[A | B_i] · P(B_i) / Σ_{j=1}^n P[A | B_j] · P(B_j).


Proof.

P [Bi |A] =P (A ∩Bi)

P (A)

1.4=

P [A | Bi] · P (Bi)n∑

j=1

P [A | Bj ] · P (Bj)

.

Example 1.7 (A posteriori probabilities in medical tests). Suppose that one out of 145 persons of the same age has the disease D, i.e. the a priori probability of having D is P[D] = 1/145.

Suppose now that a medical test for D is given which detects D in 96% of all cases, i.e.

P[positive | D] = 0.96.

However, the test is also positive in 6% of the cases where the person does not have D, i.e.

P[positive | D^c] = 0.06.

Suppose now that the test is positive. What is the a posteriori probability of actually having D? So we are interested in the conditional probability P[D | positive]:

P[D | positive] = P[positive | D] · P[D] / (P[positive | D] · P[D] + P[positive | D^c] · P[D^c])   (by 1.6)
                = (0.96 · 1/145) / (0.96 · 1/145 + 0.06 · 144/145)
                = 1 / (1 + (6/96) · 144) = 1/10.

Note: in only one out of ten cases, a person with a positive result actually has D.

Another conditional probability of interest in this context is the probability of not having D, once the test is negative, i.e., P[D^c | negative]:

P[D^c | negative] = P[negative | D^c] · P[D^c] / (P[negative | D] · P[D] + P[negative | D^c] · P[D^c])
                  = (0.94 · 144/145) / (0.04 · 1/145 + 0.94 · 144/145)
                  = 94 · 144 / (4 + 94 · 144) ≈ 0.9997.

Note: The two conditional probabilities interchange if the a priori probability of not having D is low (e.g. 1/145). If the risk of having D is high and one wants to test whether or not one has D, the a posteriori probability of not having D, given that the test was negative, is only 0.1.
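The two a posteriori probabilities can be reproduced with a tiny Bayes helper (my own sketch; the function and parameter names are illustrative only):

```python
def posterior(prior, sensitivity, false_positive_rate, result_positive=True):
    # P[D | test result] via Bayes' theorem with the two hypotheses D and D^c.
    if result_positive:
        num = sensitivity * prior
        den = num + false_positive_rate * (1 - prior)
    else:
        num = (1 - sensitivity) * prior
        den = num + (1 - false_positive_rate) * (1 - prior)
    return num / den

prior = 1 / 145
print(posterior(prior, 0.96, 0.06, result_positive=True))        # P[D  | positive] = 0.1
print(1 - posterior(prior, 0.96, 0.06, result_positive=False))   # P[D^c | negative] ~ 0.9997
```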

Example 1.8 (computing total probabilities with conditional probabilities). Let S be a finite set, Ω := S^{n+1}, n ∈ N, and P be a probability measure on Ω. Let Xi : Ω → S, i = 0, . . . , n, be the canonical projections Xi(ω) := xi for ω = (x0, . . . , xn).

If we interpret 0, 1, . . . , n as time points, then (Xi)_{0≤i≤n} may be seen as a stochastic process, and (X0(ω), . . . , Xn(ω)) is said to be a sample path (or a trajectory) of the process.


For all ω ∈ Ω we either have P(ω) = 0 or

$$\begin{aligned} P(\omega) &= P[X_0 = x_0, \dots, X_n = x_n] \\ &= P[X_0 = x_0, \dots, X_{n-1} = x_{n-1}] \cdot P[X_n = x_n \mid X_0 = x_0, \dots, X_{n-1} = x_{n-1}] \\ &\ \ \vdots \\ &= P[X_0 = x_0] \cdot P[X_1 = x_1 \mid X_0 = x_0] \cdot P[X_2 = x_2 \mid X_0 = x_0, X_1 = x_1] \cdots P[X_n = x_n \mid X_0 = x_0, \dots, X_{n-1} = x_{n-1}]. \end{aligned}$$

Note: P(ω) ≠ 0 implies P[X0 = x0, . . . , Xk = xk] ≠ 0 for all k ∈ {0, . . . , n}, so all conditional probabilities above are well-defined.

Conclusion: A probability measure P on Ω is uniquely determined by the following:

Initial distribution: μ := P ∘ X_0^{−1}.

Transition probabilities: the conditional distributions

P[Xk = xk | X0 = x0, . . . , X_{k−1} = x_{k−1}]

for any k ∈ {1, . . . , n} and (x0, . . . , xk) ∈ S^{k+1}.

Existence of P for a given initial distribution and given transition probabilities is shown in Section 3.

Example 1.9. A stochastic process is called a Markov chain if

P[Xk = xk | X0 = x0, . . . , X_{k−1} = x_{k−1}] = P[Xk = xk | X_{k−1} = x_{k−1}],

i.e. if the transition probabilities for Xk only depend on X_{k−1}.

If we denote by X_{k−1} the "present", by Xk the "future" and by X0, . . . , X_{k−2} the "past", then we can state the Markov property as: given the "present", the "future" of the Markov chain is independent of the "past".
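As a small illustration of Examples 1.8 and 1.9, the following Python sketch (the two-state toy chain and all names are ours, not from the lecture) computes the probability of a sample path from an initial distribution and one-step transition probabilities, exactly as in the product formula above:

```python
import numpy as np

mu0 = np.array([0.5, 0.5])          # initial distribution of X_0 on S = {0, 1}
K = np.array([[0.9, 0.1],           # K[x, y] = P[X_k = y | X_{k-1} = x]
              [0.2, 0.8]])

def path_probability(path):
    """P[X_0 = x_0, ..., X_n = x_n] via the chain rule of Example 1.8."""
    p = mu0[path[0]]
    for x_prev, x_next in zip(path, path[1:]):
        p *= K[x_prev, x_next]      # Markov chain: conditioning only on X_{k-1}
    return p

print(path_probability([0, 0, 1, 1]))   # 0.5 * 0.9 * 0.1 * 0.8 = 0.036
```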

2 Transition probabilities and Fubini’s theorem

Let (S1, S1) and (S2, S2) be measurable spaces.

Definition 2.1. A mapping

K : S1 × S2 → [0, 1], (x1, A2) ↦ K(x1, A2)

is said to be a transition probability (from (S1, S1) to (S2, S2)), if

(i) ∀x1 ∈ S1: K(x1, · ) is a probability measure on (S2, S2).


(ii) ∀A2 ∈ S2: K( · , A2) is S1-measurable.

Example 2.2. (i) For a given probability measure μ on (S2, S2) define

K(x1, · ) := μ ∀ x1 ∈ S1 (no coupling!).

(ii) Let T : S1 → S2 be an S1/S2-measurable mapping, and

K(x1, · ) := δ_{T(x1)} ∀ x1 ∈ S1.

(iii) Stochastic matrices. Let S1, S2 be countable and Si = P(Si), i = 1, 2. In this case, any transition probability from (S1, S1) to (S2, S2) is given by

K(x1, x2) := K(x1, {x2}), x1 ∈ S1, x2 ∈ S2,

where K : S1 × S2 → [0, 1] is a mapping such that Σ_{x2 ∈ S2} K(x1, x2) = 1 for all x1 ∈ S1. Consequently, K can be identified with a stochastic matrix (or transition matrix), i.e. a matrix with nonnegative entries and row sums equal to one.

Example 2.3. (i) Transition probabilities of the random walk on Z^d. S1 = S2 = S := Z^d with S := P(Z^d), and

$$K(x, \cdot) := \frac{1}{2d} \sum_{y \in N(x)} \delta_y, \qquad x \in \mathbb{Z}^d,$$

where

N(x) := {y ∈ Z^d | ‖x − y‖ = 1}

denotes the set of nearest neighbours of x.

(ii) Ehrenfest model. Consider a box containing N balls. The box is divided into two halves ("left" and "right"). A ball is selected at random and put into the other half. (A short simulation sketch follows after this example.)

On the "microscopic level" the state space is S := {0, 1}^N with x = (x1, . . . , xN) ∈ S defined by

xi := 1 if the i-th ball is contained in the "left" half, and xi := 0 if the i-th ball is contained in the "right" half;

the transition probability for x = (x1, . . . , xN) ∈ S is given by

$$K(x, \cdot) := \frac{1}{N} \sum_{i=1}^{N} \delta_{(x_1, \dots, x_{i-1},\, 1 - x_i,\, x_{i+1}, \dots, x_N)}.$$

On the "macroscopic level" the state space is S := {0, . . . , N}, where j ∈ S denotes the number of balls contained in the left half. The transition probabilities are given by

$$K(j, \cdot) := \frac{N - j}{N} \, \delta_{j+1} + \frac{j}{N} \, \delta_{j-1}.$$


(iii) Transition probabilities of the Ornstein-Uhlenbeck process. S = S1 = S2 = R, K(x, · ) := N(αx, σ²) with α ∈ R, σ² > 0.
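The kernels of Example 2.3 are easy to sample from; here is a minimal Python sketch (helper names are ours) drawing one step of the random walk on Z^d and one step of the macroscopic Ehrenfest chain:

```python
import random

def random_walk_step(x):
    """One step of K(x, .) = (1/2d) * sum of point masses at the nearest neighbours of x."""
    d = len(x)
    i = random.randrange(d)            # choose a coordinate ...
    s = random.choice((-1, 1))         # ... and a direction, each pair with probability 1/(2d)
    return tuple(xj + s if j == i else xj for j, xj in enumerate(x))

def ehrenfest_step(j, N):
    """One step of K(j, .) = ((N-j)/N) * delta_{j+1} + (j/N) * delta_{j-1}."""
    return j + 1 if random.random() < (N - j) / N else j - 1

print(random_walk_step((0, 0, 0)))     # a nearest neighbour of the origin in Z^3
print(ehrenfest_step(10, 20))          # 9 or 11, each with probability 1/2 here
```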

We now turn to Fubini's theorem. To this end, let μ1 be a probability measure on (S1, S1) and K( · , · ) be a transition probability from (S1, S1) to (S2, S2). Our aim is to construct a probability measure P (:= μ1 ⊗ K) on the product space (Ω, A), where

Ω := S1 × S2,
A := S1 ⊗ S2 := σ(X1, X2) = σ({A1 × A2 | A1 ∈ S1, A2 ∈ S2}) (!),

and

Xi : Ω = S1 × S2 → Si, (x1, x2) ↦ xi, i = 1, 2,

satisfying

$$P(A_1 \times A_2) = \int_{A_1} K(x_1, A_2) \, \mu_1(dx_1)$$

for all A1 ∈ S1 and A2 ∈ S2.

Proposition 2.4 (Fubini). Let μ1 be a probability measure on (S1, S1), K a transition probability from (S1, S1) to (S2, S2), and

Ω := S1 × S2, (3.1)
A := σ({A1 × A2 | Ai ∈ Si}) =: S1 ⊗ S2. (3.2)

Then there exists a probability measure P (=: μ1 ⊗ K) on (Ω, A) such that for all A-measurable functions f ≥ 0

$$\int_\Omega f \, dP = \int \Big( \int f(x_1, x_2) \, K(x_1, dx_2) \Big) \mu_1(dx_1), \qquad (3.3)$$

in particular, for all A ∈ A

$$P(A) = \int K(x_1, A_{x_1}) \, \mu_1(dx_1). \qquad (3.4)$$

Here

A_{x1} = {x2 ∈ S2 | (x1, x2) ∈ A}

is called the section of A by x1. Furthermore, for A1 ∈ S1, A2 ∈ S2:

$$P(A_1 \times A_2) = \int_{A_1} K(x_1, A_2) \, \mu_1(dx_1), \qquad (3.5)$$

and P is uniquely determined by (3.5).


[Figure: a set A ⊂ S1 × S2 with its sections, where A_{x1} = A^{(1)}_{x1} ∪ A^{(2)}_{x1} ∪ A^{(3)}_{x1} and A_{x2} = A^{(1)}_{x2} ∪ A^{(2)}_{x2}.]

Proof. Uniqueness: Clearly, the collection of cylindrical sets A1 × A2 with Ai ∈ Si is stable under intersections and generates A, so that the uniqueness now follows from Proposition 1.11.5.

Existence: For given x1 ∈ S1 let

φ_{x1}(x2) := (x1, x2).

φ_{x1} : S2 → Ω is measurable, because for A1 ∈ S1, A2 ∈ S2

$$\varphi_{x_1}^{-1}(A_1 \times A_2) = \begin{cases} \emptyset & \text{if } x_1 \notin A_1 \\ A_2 & \text{if } x_1 \in A_1. \end{cases}$$

It follows that for any A-measurable f : Ω → R and any x1 ∈ S1, the mapping

f_{x1} := f ∘ φ_{x1} : S2 → R, x2 ↦ f(x1, x2),

is S2/B(R)-measurable. Suppose now that f ≥ 0 is A-measurable. Then

$$x_1 \mapsto \int f(x_1, x_2) \, K(x_1, dx_2) \ \Big( = \int f_{x_1}(x_2) \, K(x_1, dx_2) \Big) \qquad (3.6)$$


is well-defined. We will show in the following that this function is S1-measurable. We prove the assertion first for f = 1_A, A ∈ A; for general f the measurability then follows by measure-theoretic induction.

Note that for f = 1_A we have that

$$\int \underbrace{1_A(x_1, x_2)}_{= 1_{A_{x_1}}(x_2)} K(x_1, dx_2) = K(x_1, A_{x_1}).$$

Hence, in the following we consider

D := {A ∈ A | x1 ↦ K(x1, A_{x1}) is S1-measurable}.

D is a Dynkin system (!) and contains all cylindrical sets A = A1 × A2 with Ai ∈ Si, because

K(x1, (A1 × A2)_{x1}) = 1_{A1}(x1) · K(x1, A2).

Since the measurable cylindrical sets are stable under intersections, we conclude that D = A.

It follows that for all nonnegative A-measurable functions f : Ω → R the integral

$$\int \Big( \int f(x_1, x_2) \, K(x_1, dx_2) \Big) \mu_1(dx_1)$$

is well-defined. For all A ∈ A we can now define

$$P(A) := \int \Big( \int \underbrace{1_A(x_1, x_2)}_{= 1_{A_{x_1}}(x_2)} K(x_1, dx_2) \Big) \mu_1(dx_1) = \int K(x_1, A_{x_1}) \, \mu_1(dx_1).$$

P is a probability measure on (Ω, A), because

$$P(\Omega) = \int K(x_1, S_2) \, \mu_1(dx_1) = \int 1 \, \mu_1(dx_1) = 1.$$

For the proof of the σ-additivity, let A1, A2, . . . be pairwise disjoint sets in A. It follows that for all x1 ∈ S1 the sections (A1)_{x1}, (A2)_{x1}, . . . are pairwise disjoint too, hence

$$P\Big(\bigcup_{n \in \mathbb{N}} A_n\Big) = \int K\Big(x_1, \Big(\bigcup_{n \in \mathbb{N}} A_n\Big)_{x_1}\Big) \, \mu_1(dx_1) = \int \sum_{n=1}^{\infty} K\big(x_1, (A_n)_{x_1}\big) \, \mu_1(dx_1) = \sum_{n=1}^{\infty} \int K\big(x_1, (A_n)_{x_1}\big) \, \mu_1(dx_1) = \sum_{n=1}^{\infty} P(A_n).$$

In the second equality we used that K(x1, ·) is a probability measure for all x1, and in the third equality we used monotone integration.

Finally, (3.3) follows from measure-theoretic induction.


2.1 Examples and Applications

Remark 2.5. The classical Fubini theorem is a particular case of Proposition 2.4: K(x1, · ) = μ2. In this case, the measure μ1 ⊗ K constructed in Fubini's theorem is called the product measure of μ1 and μ2 and is denoted by μ1 ⊗ μ2. Moreover, in this case

$$\int f \, dP = \int \Big( \int f(x_1, x_2) \, \mu_2(dx_2) \Big) \mu_1(dx_1).$$

Remark 2.6 (Marginal distributions). Let Xi : Ω → Si, i = 1, 2, be the natural projections Xi((x1, x2)) := xi. The distributions of the Xi under the measure μ1 ⊗ K are called the marginal distributions, and they are given by

$$(P \circ X_1^{-1})(A_1) = P[X_1 \in A_1] = P(A_1 \times S_2) = \int_{A_1} \underbrace{K(x_1, S_2)}_{=1} \, \mu_1(dx_1) = \mu_1(A_1)$$

and

$$(P \circ X_2^{-1})(A_2) = P[X_2 \in A_2] = P(S_1 \times A_2) = \int K(x_1, A_2) \, \mu_1(dx_1) =: (\mu_1 K)(A_2).$$

So, the marginal distributions are

P ∘ X_1^{−1} = μ1, P ∘ X_2^{−1} = μ1K.
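In the countable setting of Example 2.2(iii), μ1 ⊗ K and its marginals reduce to finite sums; a small NumPy sketch (toy numbers and names are ours) that builds the joint distribution and the second marginal μ1K:

```python
import numpy as np

mu1 = np.array([0.3, 0.7])        # distribution on S1 = {0, 1}
K = np.array([[0.5, 0.5],         # stochastic matrix: K[x1, x2] = K(x1, {x2})
              [0.1, 0.9]])

joint = mu1[:, None] * K          # (mu1 (x) K)({(x1, x2)}) = mu1({x1}) * K(x1, {x2})
print(joint.sum())                # 1.0: a probability measure on S1 x S2
print(joint.sum(axis=1))          # first marginal  = mu1
print(joint.sum(axis=0))          # second marginal = mu1 K
print(mu1 @ K)                    # the same, as a row vector times the matrix
```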

Definition 2.7. Let S1 = S2 = S and S1 = S2 = S. A probability measure μ on (S, S) is said to be an equilibrium distribution for K (or invariant distribution under K) if μ = μK.

Example 2.8. (i) Ehrenfest model (macroscopic). Let S = {0, 1, . . . , N} and

$$K(y, \cdot) = \frac{y}{N} \, \delta_{y-1} + \frac{N - y}{N} \, \delta_{y+1}.$$

In this case, the binomial distribution μ with parameters N, 1/2 is an equilibrium distribution, because for any x ∈ S

$$\begin{aligned} (\mu K)(x) &= \sum_{y \in S} \mu(y) \cdot K(y, x) \\ &= \mu(x+1) \cdot \frac{x+1}{N} + \mu(x-1) \cdot \frac{N - (x-1)}{N} \\ &= 2^{-N} \binom{N}{x+1} \cdot \frac{x+1}{N} + 2^{-N} \binom{N}{x-1} \cdot \frac{N - (x-1)}{N} \\ &= 2^{-N} \left[ \binom{N-1}{x} + \binom{N-1}{x-1} \right] = 2^{-N} \binom{N}{x} = \mu(x). \end{aligned}$$


(ii) Ornstein-Uhlenbeck process. Let S = R and K(x, · ) = N(αx, σ²) with |α| < 1. Then

$$\mu = N\Big(0, \frac{\sigma^2}{1 - \alpha^2}\Big)$$

is an equilibrium distribution. (Exercise.)
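Both equilibrium claims of Example 2.8 can be checked numerically; the following sketch (NumPy/SciPy, code our own) verifies μ = μK exactly for the macroscopic Ehrenfest kernel and checks the Ornstein-Uhlenbeck case through the variance identity α² · σ²/(1 − α²) + σ² = σ²/(1 − α²):

```python
import numpy as np
from scipy.stats import binom

# (i) Ehrenfest, macroscopic: mu = Binomial(N, 1/2) satisfies mu K = mu.
N = 20
mu = binom.pmf(np.arange(N + 1), N, 0.5)
K = np.zeros((N + 1, N + 1))
for j in range(N + 1):
    if j < N:
        K[j, j + 1] = (N - j) / N        # a "right" ball moves to the left
    if j > 0:
        K[j, j - 1] = j / N              # a "left" ball moves to the right
print(np.allclose(mu @ K, mu))           # True

# (ii) Ornstein-Uhlenbeck: if X ~ N(0, s2/(1 - a^2)) and the innovation is N(0, s2),
# then a*X + innovation has variance a^2 * s2/(1 - a^2) + s2 = s2/(1 - a^2).
a, s2 = 0.8, 1.0
v = s2 / (1 - a**2)
print(np.isclose(a**2 * v + s2, v))      # True
```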

We now turn to the converse problem: given a probability measure P on the product space (Ω, A), can we "disintegrate" P, i.e. can we find a probability measure μ1 on (S1, S1) and a transition probability K from S1 to S2 such that

P = μ1 ⊗ K ?

Answer: In most cases yes, e.g. if S1 and S2 are Polish spaces (i.e. topological spaces having a countable basis whose topology is induced by some complete metric), using conditional expectations (see below).

Example 2.9. In the particular case when S1 is countable (and S1 = P(S1)), we can disintegrate P explicitly as follows: necessarily, μ1 has to be the distribution of the projection X1 onto the first coordinate. To define the kernel K, let ν be any probability measure on (S2, S2) and define

$$K(x_1, A_2) := \begin{cases} P[X_2 \in A_2 \mid X_1 = x_1] & \text{if } \mu_1(\{x_1\}) \, (= P[X_1 = x_1]) > 0 \\ \nu(A_2) & \text{if } \mu_1(\{x_1\}) = 0. \end{cases}$$

Then

$$\begin{aligned} P(A_1 \times A_2) &= P[X_1 \in A_1, X_2 \in A_2] = \sum_{x_1 \in A_1} P[X_1 = x_1, X_2 \in A_2] \\ &= \sum_{x_1 \in A_1,\ \mu_1(\{x_1\}) > 0} P[X_1 = x_1] \cdot P[X_2 \in A_2 \mid X_1 = x_1] \\ &= \sum_{x_1 \in A_1} \mu_1(\{x_1\}) \cdot K(x_1, A_2) = \int_{A_1} K(x_1, A_2) \, \mu_1(dx_1) = (\mu_1 \otimes K)(A_1 \times A_2), \end{aligned}$$

hence P = μ1 ⊗ K.
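For a finite joint distribution this explicit disintegration is a two-line computation; a sketch (toy numbers and names are ours): recover μ1 as the first marginal and K as the conditional distribution of the second coordinate.

```python
import numpy as np

# Joint distribution P on S1 x S2 = {0, 1} x {0, 1, 2} (rows: x1, columns: x2).
P = np.array([[0.10, 0.20, 0.10],
              [0.30, 0.15, 0.15]])

mu1 = P.sum(axis=1)                          # distribution of X1
nu = np.full(P.shape[1], 1 / P.shape[1])     # arbitrary measure for the mu1 = 0 case

K = np.array([P[x1] / mu1[x1] if mu1[x1] > 0 else nu
              for x1 in range(P.shape[0])])  # K[x1, x2] = P[X2 = x2 | X1 = x1]

print(np.allclose(mu1[:, None] * K, P))      # True: P = mu1 (x) K
```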

3 The canonical model for the evolution of a stochastic system in discrete time

Consider the following situation: suppose we are given


• measurable spaces (Si, Si), i = 0, 1, 2, . . ., and we define

S^n := S0 × S1 × · · · × Sn (the product space),
S^n := S0 ⊗ S1 ⊗ · · · ⊗ Sn = σ({A0 × · · · × An | Ai ∈ Si}) (the product σ-algebra);

• – an initial distribution μ0 on (S0, S0),

– transition probabilities

Kn((x0, . . . , x_{n−1}), dxn)

from (S^{n−1}, S^{n−1}) to (Sn, Sn), n = 1, 2, . . ..

Using Fubini's theorem, we can then define probability measures P^n on S^n, n = 0, 1, 2, . . ., as follows:

P^0 := μ0 on S^0,
P^n := P^{n−1} ⊗ Kn on S^n = S^{n−1} × Sn.

Note that Fubini's theorem (see Proposition 2.4) implies that for any S^n-measurable function f : S^n → R_+:

$$\begin{aligned} \int f \, dP^n &= \int P^{n-1}\big(d(x_0, \dots, x_{n-1})\big) \int K_n\big((x_0, \dots, x_{n-1}), dx_n\big) \, f(x_0, \dots, x_{n-1}, x_n) \\ &= \cdots \\ &= \int \mu_0(dx_0) \int K_1(x_0, dx_1) \cdots \int K_n\big((x_0, \dots, x_{n-1}), dx_n\big) \, f(x_0, \dots, x_n). \end{aligned}$$
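For finite state spaces the recursion P^n = P^{n−1} ⊗ Kn is just an iterated array computation; a minimal sketch (time-homogeneous toy chain from the earlier example, names our own) that builds the joint distribution of (X0, . . . , Xn):

```python
import numpy as np

mu0 = np.array([0.5, 0.5])             # initial distribution on S = {0, 1}
K = np.array([[0.9, 0.1],              # one-step kernel (time-homogeneous case)
              [0.2, 0.8]])

def joint_distribution(n):
    """P^n as an array of shape (2,) * (n + 1): P^n = P^{n-1} (x) K."""
    P = mu0
    for _ in range(n):
        # new last axis for the next coordinate; broadcasting pairs the previous
        # last axis of P with the rows of K
        P = P[..., None] * K
    return P

P2 = joint_distribution(2)             # joint law of (X_0, X_1, X_2)
print(P2.sum())                        # 1.0
print(P2[0, 0, 1])                     # 0.5 * 0.9 * 0.1 = 0.045
```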

3.1 The canonical model

Let Ω := S0 × S1 × . . . be the set of all paths (or trajectories) ω = (x0, x1, . . . ) with xi ∈ Si, and

Xn(ω) := xn (projection onto the n-th coordinate),
An := σ(X0, . . . , Xn) (⊂ A),
A := σ(X0, X1, . . . ) = σ(∪_{n=1}^∞ An).

Our main goal in this section is to construct a probability measure P on (Ω, A) satisfying

$$\int f(X_0, \dots, X_n) \, dP = \int f \, dP^n \qquad \forall n = 0, 1, 2, \dots,$$

i.e. the "finite dimensional distributions" P ∘ (X0, . . . , Xn)^{−1} of P, i.e. the joint distributions of X0, . . . , Xn under P, are given by P^n for any n ≥ 0.


Proposition 3.1 (Ionescu-Tulcea). There exists a unique probability measure P on (Ω, A) such that for all n ≥ 0 and all S^n-measurable functions f : S^n → R_+:

$$\int_{S^n} f \, dP_{(X_0, \dots, X_n)} = \int_{\Omega} f(X_0, \dots, X_n) \, dP = \int_{S^n} f \, dP^n. \qquad (3.7)$$

In other words: there exists a unique P on (Ω, A) such that P^n = P ∘ (X0, . . . , Xn)^{−1} for all n ≥ 0.

Proof. Uniqueness: Obvious, because the collection of finite cylindrical subsets

E := { ∩_{i=0}^n {Xi ∈ Ai} | n ≥ 0, Ai ∈ Si }

is closed under intersections and generates A.

Existence: Let A ∈ An, hence

A = (X0, . . . , Xn)^{−1}(A^n) for some A^n ∈ S^n.

In particular, 1_A = 1_{A^n}(X0, . . . , Xn). Therefore, in order to have (3.7) we must define

P(A) := P^n(A^n). (3.8)

We have to check that P is well-defined. Indeed, since A ∈ An ⊂ A_{n+1}, we can also write A = (X0, . . . , X_{n+1})^{−1}(A^{n+1}), where A^{n+1} = A^n × S_{n+1} ∈ S^{n+1}. But

$$P^{n+1}(A^{n+1}) = P^{n+1}(A^n \times S_{n+1}) = \int_{A^n} \underbrace{K_{n+1}\big((x_0, \dots, x_n), S_{n+1}\big)}_{=1} \, dP^n = P^n(A^n).$$

It follows that P is well-defined by (3.8) on B := ∪_{n=0}^∞ An. B is an algebra (i.e. a collection of subsets of Ω containing Ω that is closed under complements and finite (!) unions), and P is finitely additive on B, since P is (σ-)additive on An for every n. To extend P to a σ-additive probability measure on A = σ(B) with the help of Carathéodory's extension theorem, it suffices now to show that P is ∅-continuous, i.e. that the following condition is satisfied:

Bn ∈ B, Bn ↓ ∅ ⇒ P(Bn) → 0 as n → ∞.

(For Carathéodory's extension theorem see textbooks on measure theory, or Theorem 1.41 in Klenke.)

W.l.o.g. B0 = Ω and Bn ∈ An (if Bn ∈ Am \ An with m > n, just repeat B_{n−1} m times!). Then

Bn = A^n × S_{n+1} × S_{n+2} × . . . for some A^n ∈ S^n,

with

A^{n+1} ⊂ A^n × S_{n+1},


and we have to show that

(P(Bn) =) P^n(A^n) → 0 as n → ∞

(i.e. inf_n P^n(A^n) = 0).

In order to do so, we will assume

inf_{n∈N} P^n(A^n) > 0

and show that this implies

∩_{n=0}^∞ Bn ≠ ∅.

Note that

$$P^n(A^n) = \int \mu_0(dx_0) \, f_{0,n}(x_0)$$

with

$$f_{0,n}(x_0) := \int K_1(x_0, dx_1) \cdots \int K_n\big((x_0, \dots, x_{n-1}), dx_n\big) \, 1_{A^n}(x_0, \dots, x_n).$$

It is easy to see that the sequence (f_{0,n})_{n∈N} is decreasing, because

$$\int K_{n+1}\big((x_0, \dots, x_n), dx_{n+1}\big) \, 1_{A^{n+1}}(x_0, \dots, x_{n+1}) \le \int K_{n+1}\big((x_0, \dots, x_n), dx_{n+1}\big) \, 1_{A^n \times S_{n+1}}(x_0, \dots, x_{n+1}) = 1_{A^n}(x_0, \dots, x_n),$$

hence

$$f_{0,n+1}(x_0) = \int K_1(x_0, dx_1) \cdots \int K_{n+1}\big((x_0, \dots, x_n), dx_{n+1}\big) \, 1_{A^{n+1}}(x_0, \dots, x_{n+1}) \le \int K_1(x_0, dx_1) \cdots \int K_n\big((x_0, \dots, x_{n-1}), dx_n\big) \, 1_{A^n}(x_0, \dots, x_n) = f_{0,n}(x_0).$$

In particular, using Lebesgue's theorem,

$$\int \inf_{n \in \mathbb{N}} f_{0,n} \, d\mu_0 = \inf_{n \in \mathbb{N}} \int f_{0,n} \, d\mu_0 = \inf_{n \in \mathbb{N}} P^n(A^n) > 0.$$

Therefore we can find some x0 ∈ S0 with

inf_{n∈N} f_{0,n}(x0) > 0.

On the other hand we can write

$$f_{0,n}(x_0) = \int K_1(x_0, dx_1) \, f_{1,n}(x_1)$$


with

$$f_{1,n}(x_1) := \int K_2\big((x_0, x_1), dx_2\big) \cdots \int K_n\big((x_0, x_1, \dots, x_{n-1}), dx_n\big) \, 1_{A^n}(x_0, x_1, \dots, x_n).$$

Using the same argument as above (now with μ1 = K1(x0, ·)) we can find some x1 ∈ S1 with

inf_{n∈N} f_{1,n}(x1) > 0.

Iterating this procedure, we find for every i = 0, 1, . . . some xi ∈ Si such that for all m ≥ 1

$$\inf_{n \in \mathbb{N}} \int K_m\big((x_0, \dots, x_{m-1}), dx_m\big) \cdots \int K_n\big((x_0, \dots, x_{m-1}, x_m, \dots, x_{n-1}), dx_n\big) \, 1_{A^n}(x_0, \dots, x_{m-1}, x_m, \dots, x_n) > 0.$$

In particular, for m = n,

$$0 < \int K_m\big((x_0, \dots, x_{m-1}), dx_m\big) \, 1_{A^m}(x_0, \dots, x_{m-1}, x_m) \le 1_{A^{m-1}}(x_0, \dots, x_{m-1}),$$

so that (x0, . . . , x_{m−1}) ∈ A^{m−1} for all m ≥ 1. Thus

ω := (x0, x1, . . . ) ∈ B_{m−1} = A^{m−1} × Sm × S_{m+1} × · · · for all m ≥ 1,

i.e.

ω ∈ ∩_{m=0}^∞ Bm.

Hence the assertion is proven.

Definition 3.2. Suppose that (Si, Si) = (S, S) for all i = 0, 1, 2, . . .. Then (Xn)_{n≥0} on (Ω, A, P) (with P as in the previous proposition) is called a stochastic process (in discrete time) with state space (S, S), initial distribution μ0 and transition probabilities (Kn( · , · ))_{n∈N}.


3.2 Examples

1) Infinite product measures: Consider the situation of Definition 3.2 and let

Kn((x0, . . . , x_{n−1}), · ) = μn, n ∈ N (independent of (x0, . . . , x_{n−1})!).

Then

⊗_{n=0}^∞ μn := P

is called the product measure associated with μ0, μ1, . . .. For all n ≥ 0 and A0, . . . , An ∈ S we have that

$$P[X_0 \in A_0, \dots, X_n \in A_n] \overset{\text{I.-T.}}{=} P^n(A_0 \times \dots \times A_n) = \int \mu_0(dx_0) \int \mu_1(dx_1) \cdots \int \mu_n(dx_n) \, 1_{A_0 \times \dots \times A_n}(x_0, \dots, x_n) = \mu_0(A_0) \cdot \mu_1(A_1) \cdots \mu_n(A_n).$$

In particular, P_{Xn} = μn for all n, and the canonical projections X0, X1, . . . are independent. We thus have the following:

Proposition 3.3. Let (μn)_{n≥0} be a sequence of probability measures on a measurable space (S, S). Then there exists a probability space (Ω, A, P) and a sequence (Xn)_{n≥0} of independent r.v. with P_{Xn} = μn for all n ≥ 0.

We have thus proven in particular the existence of a probability space modelling infinitely many independent 0-1-experiments!

2) Markov chains:

Kn((x0, . . . , x_{n−1}), · ) = Kn(x_{n−1}, · ) (time-inhomogeneous),

and time-homogeneous if Kn = K for all n. For a given initial distribution μ and transition probabilities (Kn)_{n≥1} (resp. K) there exists a unique probability measure P on (Ω, A), which is said to be the canonical model for the time evolution of a time-inhomogeneous (resp. time-homogeneous) Markov chain.

Example 3.4. Let S = R, β > 0, x0 ∈ R \ {0}, μ0 = δ_{x0} and K(x, · ) = N(0, βx²) for x ≠ 0, K(0, · ) = δ0.

For which β does the sequence (Xn) converge and what is its limit? For n ≥ 1,

$$E[X_n^2] \overset{\text{I.-T.}}{=} \int x_n^2 \; P^n\big(d(x_0, \dots, x_n)\big) = \int \Big( \underbrace{\int x_n^2 \; K(x_{n-1}, dx_n)}_{= \beta x_{n-1}^2, \text{ since } K(x_{n-1}, dx_n) = N(0, \beta x_{n-1}^2)} \Big) \, P^{n-1}\big(d(x_0, \dots, x_{n-1})\big) = \beta \cdot E[X_{n-1}^2] = \dots = \beta^n x_0^2.$$


If β < 1 it follows that

$$E\Big[\sum_{n=1}^{\infty} X_n^2\Big] = \sum_{n=1}^{\infty} E[X_n^2] = \sum_{n=1}^{\infty} \beta^n x_0^2 < \infty,$$

hence Σ_{n=1}^∞ X_n² < ∞ P-a.s., and therefore

lim_{n→∞} Xn = 0 P-a.s.

A similar calculation as above for the first absolute moment yields

$$E\big[|X_n|\big] = \dots = \sqrt{\tfrac{2}{\pi}\,\beta} \; \cdot E\big[|X_{n-1}|\big] = \dots = \Big(\tfrac{2}{\pi}\,\beta\Big)^{\frac n2} \cdot \underbrace{E\big[|X_0|\big]}_{=|x_0|},$$

because

$$\int |x_n| \; K(x_{n-1}, dx_n) = \sqrt{\tfrac{2}{\pi}\,\beta} \; |x_{n-1}|.$$

Consequently,

$$E\Big[\sum_{n=1}^{\infty} |X_n|\Big] = \sum_{n=1}^{\infty} \Big(\tfrac{2}{\pi}\,\beta\Big)^{\frac n2} \cdot |x_0|,$$

so that also for β < π/2:

lim_{n→∞} Xn = 0 P-a.s.

In fact, if we define

$$\beta_0 := \exp\Big( - \frac{4}{\sqrt{2\pi}} \int_0^\infty \log x \; e^{-\frac{x^2}{2}} \, dx \Big) = 2\,e^{C} \approx 3.56,$$

where

$$C := \lim_{n \to \infty} \Big( \sum_{k=1}^{n} \frac{1}{k} - \log n \Big) \approx 0.577$$

denotes the Euler-Mascheroni constant, it follows that

∀ β < β0 : Xn → 0 P-a.s. with exponential rate,
∀ β > β0 : |Xn| → ∞ P-a.s. with exponential rate.

Proof. It is easy to see that for all n: Xn ≠ 0 P-a.s. For n ∈ N we can then define

$$Y_n := \begin{cases} \dfrac{X_n}{X_{n-1}} & \text{on } \{X_{n-1} \neq 0\} \\[4pt] 0 & \text{on } \{X_{n-1} = 0\}. \end{cases}$$


Then Y1, Y2, . . . are independent r.v. with (identical) distribution N(0, β), because for all measurable functions f : R^n → R_+

$$\begin{aligned} \int f(Y_1, \dots, Y_n) \, dP &\overset{\text{I.-T.}}{=} \int f\Big(\frac{x_1}{x_0}, \dots, \frac{x_n}{x_{n-1}}\Big) \cdot \Big(\frac{1}{2\pi\beta}\Big)^{\frac n2} \cdot \Big(\frac{1}{x_0^2 \cdots x_{n-1}^2}\Big)^{\frac 12} \cdot \exp\Big(-\frac{x_1^2}{2\beta x_0^2} - \dots - \frac{x_n^2}{2\beta x_{n-1}^2}\Big) \, dx_1 \dots dx_n \\ &= \int f(y_1, \dots, y_n) \cdot \Big(\frac{1}{2\pi\beta}\Big)^{\frac n2} \cdot \exp\Big(-\frac{y_1^2 + \dots + y_n^2}{2\beta}\Big) \, dy_1 \dots dy_n. \end{aligned}$$

Note that

|Xn| = |x0| · |Y1| · · · |Yn|

and thus

$$\frac{1}{n} \log|X_n| = \frac{1}{n} \log|x_0| + \frac{1}{n} \sum_{i=1}^{n} \log|Y_i|.$$

Note that the (log|Yi|)_{i∈N} are independent and identically distributed with

$$E\big[\log|Y_i|\big] = 2 \cdot \frac{1}{\sqrt{2\pi\beta}} \int_0^\infty \log x \; e^{-\frac{x^2}{2\beta}} \, dx.$$

Kolmogorov's law of large numbers now implies that

$$\lim_{n \to \infty} \frac{1}{n} \log|X_n| = \frac{2}{\sqrt{2\pi\beta}} \int_0^\infty \log x \; e^{-\frac{x^2}{2\beta}} \, dx \qquad P\text{-a.s.}$$

Consequently,

|Xn| → 0 with exponential rate if the right-hand side is < 0,
|Xn| → ∞ with exponential rate if the right-hand side is > 0.

Note that

$$\frac{2}{\sqrt{2\pi\beta}} \int_0^\infty \log x \; e^{-\frac{x^2}{2\beta}} \, dx \;\overset{y = x/\sqrt{\beta}}{=}\; \frac{2}{\sqrt{2\pi}} \int_0^\infty \log(\sqrt{\beta}\, y) \, e^{-\frac{y^2}{2}} \, dy = \frac{1}{2} \log\beta + \frac{2}{\sqrt{2\pi}} \int_0^\infty \log y \; e^{-\frac{y^2}{2}} \, dy \;<\; 0 \quad \Longleftrightarrow \quad \beta < \beta_0.$$

It remains to check that

$$-\frac{4}{\sqrt{2\pi}} \int_0^\infty \log x \; e^{-\frac{x^2}{2}} \, dx = \log 2 + C,$$

where C is the Euler-Mascheroni constant (Exercise!).
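The constant β0 and the threshold behaviour are easy to check numerically; a short sketch (SciPy quadrature plus a simulation of the strong law used in the proof, code our own) assuming the setup of Example 3.4:

```python
import numpy as np
from scipy.integrate import quad

# beta_0 = exp(-(4/sqrt(2*pi)) * int_0^infinity log(x) * exp(-x^2/2) dx) = 2*e^C
integral, _ = quad(lambda x: np.log(x) * np.exp(-x**2 / 2), 0, np.inf)
beta0 = np.exp(-4 / np.sqrt(2 * np.pi) * integral)
print(beta0, 2 * np.exp(np.euler_gamma))      # both ~ 3.5622

# Monte Carlo check of the exponential rate: as in the proof,
# (1/n) * log|X_n| -> E[log|Y_1|] with Y_i i.i.d. N(0, beta).
rng = np.random.default_rng(0)
def empirical_rate(beta, n=200_000):
    y = rng.normal(0.0, np.sqrt(beta), size=n)
    return np.log(np.abs(y)).mean()

print(empirical_rate(3.0))    # < 0: X_n -> 0 with exponential rate (beta < beta_0)
print(empirical_rate(4.5))    # > 0: |X_n| -> infinity with exponential rate
```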


Example 3.5. Consider independent 0-1-experiments with success probability p ∈ [0, 1], but suppose that p is unknown. In the canonical model:

Si := {0, 1}, i ∈ N; Ω := {0, 1}^N; Xi : Ω → {0, 1}, i ∈ N, the projections;
μi := p ε1 + (1 − p) ε0, i ∈ N; P_p := ⊗_{i=1}^∞ μi.

An and A are defined as above. Since p is unknown, we choose an a priori distribution μ on ([0, 1], B([0, 1])) (as a distribution for the unknown parameter p).

Claim: K(p, · ) := P_p( · ) is a transition probability from ([0, 1], B([0, 1])) to (Ω, A).

Proof. We only need to show that for given A ∈ A the mapping p ↦ P_p(A) is measurable on [0, 1]. To this end define

D := {A ∈ A | p ↦ P_p(A) is B([0, 1])-measurable}.

Then D is a Dynkin system and contains all finite cylindrical sets

{X1 = x1, . . . , Xn = xn}, n ∈ N, x1, . . . , xn ∈ {0, 1},

because

$$P_p(X_1 = x_1, \dots, X_n = x_n) = p^{\sum_{i=1}^{n} x_i} (1-p)^{\,n - \sum_{i=1}^{n} x_i}$$

is measurable (even continuous) in p! The claim now follows from the fact that the finite cylindrical sets are closed under intersections and generate A.

Let P̄ := μ ⊗ K on Ω̄ := [0, 1] × Ω with the σ-algebra B([0, 1]) ⊗ A. Using Remark 2.6 it follows that P̄ has marginal distributions μ and

$$P( \cdot ) := \int P_p( \cdot ) \, \mu(dp) \qquad (3.9)$$

on (Ω, A). The integral can be seen as a mixture of the P_p according to the a priori distribution μ.

Note: The Xi are no longer independent under P!

We now calculate the initial distribution P_{X1} and the transition probabilities in the particular case where μ is the Lebesgue measure (i.e. the uniform distribution on the unknown parameter p):

$$P \circ X_1^{-1} = \int \big(p\,\varepsilon_1 + (1-p)\,\varepsilon_0\big)(\cdot) \, \mu(dp) = \int p \, \mu(dp) \cdot \varepsilon_1 + \int (1-p) \, \mu(dp) \cdot \varepsilon_0 = \frac{1}{2}\,\varepsilon_1 + \frac{1}{2}\,\varepsilon_0.$$


For given n ∈ N and x1, . . . , xn ∈ {0, 1} with k := Σ_{i=1}^n xi it follows that

$$\begin{aligned} P[X_{n+1} = 1 \mid X_1 = x_1, \dots, X_n = x_n] &= \frac{P[X_{n+1} = 1, X_n = x_n, \dots, X_1 = x_1]}{P[X_n = x_n, \dots, X_1 = x_1]} \overset{(3.9)}{=} \frac{\int p^{k+1} (1-p)^{n-k} \, \mu(dp)}{\int p^{k} (1-p)^{n-k} \, \mu(dp)} \\ &= \frac{\Gamma(k+2)\,\Gamma(n-k+1)}{\Gamma(n+3)} \cdot \frac{\Gamma(n+2)}{\Gamma(k+1)\,\Gamma(n-k+1)} = \frac{k+1}{n+2} \\ &= \underbrace{\Big(1 - \frac{n}{n+2}\Big) \cdot \frac{1}{2} + \frac{n}{n+2} \cdot \frac{k}{n}}_{\text{convex combination}}. \end{aligned}$$
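This is Laplace's rule of succession; a quick simulation check under the mixture measure P of (3.9): draw p uniformly, then a 0-1 sequence, and compare the empirical conditional frequency with (k + 1)/(n + 2). The code (our own) conditions on the number of successes, which gives the same value by exchangeability:

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 5, 400_000

# Sample (p, X_1, ..., X_{n+1}) under P = integral of P_p over the uniform prior.
p = rng.uniform(size=trials)
X = (rng.uniform(size=(trials, n + 1)) < p[:, None]).astype(int)

k = 3
cond = X[:, :n].sum(axis=1) == k       # condition: k successes among X_1, ..., X_n
est = X[cond, n].mean()                # empirical P[X_{n+1} = 1 | ...]
print(est, (k + 1) / (n + 2))          # both ~ 4/7 ~ 0.571
```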

Proposition 3.6. Let P be a probability measure on (Ω, A) ("canonical model", see Section 3.1), and

μn := P ∘ X_n^{−1}, n ∈ N0.

Then:

$$X_n,\ n \in \mathbb{N}_0, \text{ independent} \quad\Longleftrightarrow\quad P = \bigotimes_{n=0}^{\infty} \mu_n \quad \Big(\text{i.e. } P^n = \bigotimes_{k=0}^{n} \mu_k \ \ \forall n \ge 0\Big).$$

Proof. Let P̃ := ⊗_{n=0}^∞ μn. Then

P = P̃

if and only if for all n ∈ N0 and all A0 ∈ S0, . . . , An ∈ Sn

$$P[X_0 \in A_0, \dots, X_n \in A_n] = \tilde P[X_0 \in A_0, \dots, X_n \in A_n] = \prod_{i=0}^{n} \mu_i(A_i) = \prod_{i=0}^{n} P[X_i \in A_i],$$

which is the case if and only if the Xn, n ∈ N0, are independent.

Definition 3.7. Let Si := S, i ∈ N0, let (Ω, A) be the canonical model and let P be a probability measure on (Ω, A). In particular, (Xn)_{n≥0} is a stochastic process in the sense of Definition 3.2. Let J ⊂ N0, |J| < ∞. Then the distribution of (Xj)_{j∈J} under P,

μ_J := P ∘ ((Xi)_{i∈J})^{−1},

is said to be the finite dimensional distribution (w.r.t. J) on (S^J, S^J).

Remark 3.8. P is uniquely determined by its finite-dimensional distributions resp. by

µ0,...,n, n ∈ N .


4 Stationarity

Let (S, S) be a measurable space, Ω = S^{N0}, and let (Ω, A) be the associated canonical model. Let P be a probability measure on (Ω, A).

Definition 4.1. The mapping T : Ω → Ω, defined by

ω = (x0, x1, . . . ) ↦ Tω := (x1, x2, . . . )

is called the shift-operator on Ω.

Remark 4.2. For all n ∈ N0, A0, . . . , An ∈ S

T^{−1}({X0 ∈ A0, . . . , Xn ∈ An}) = {X1 ∈ A0, . . . , X_{n+1} ∈ An}.

In particular, T is A/A-measurable.

Definition 4.3. The measure P is said to be stationary (or shift-invariant) if

P ∘ T^{−1} = P.

Proposition 4.4. The measure P is stationary if and only if for all k, n ∈ N0:

μ_{0,...,n} = μ_{k,...,k+n}.

Proof.

$$\begin{aligned} P \circ T^{-1} = P \;&\Longleftrightarrow\; P \circ T^{-k} = P \quad \forall k \in \mathbb{N}_0 \\ &\overset{3.8}{\Longleftrightarrow}\; (P \circ T^{-k}) \circ (X_0, \dots, X_n)^{-1} = P \circ (X_0, \dots, X_n)^{-1} \quad \forall k, n \in \mathbb{N}_0 \\ &\Longleftrightarrow\; \mu_{k,\dots,k+n} = \mu_{0,\dots,n} \quad \forall k, n \in \mathbb{N}_0. \end{aligned}$$
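Proposition 4.4 can be illustrated numerically with the Ehrenfest chain of Example 2.8(i) started in its equilibrium distribution: the two-dimensional distributions μ_{0,1} and μ_{k,k+1} then coincide for every k. A sketch (NumPy/SciPy, code our own):

```python
import numpy as np
from scipy.stats import binom

N = 10
mu = binom.pmf(np.arange(N + 1), N, 0.5)       # invariant initial distribution
K = np.zeros((N + 1, N + 1))
for j in range(N + 1):
    if j < N:
        K[j, j + 1] = (N - j) / N
    if j > 0:
        K[j, j - 1] = j / N

def mu_pair(k):
    """Joint law of (X_k, X_{k+1}) when X_0 ~ mu: entries (mu K^k)(x) * K(x, y)."""
    muk = mu.copy()
    for _ in range(k):
        muk = muk @ K
    return muk[:, None] * K

print(np.allclose(mu_pair(0), mu_pair(5)))     # True: mu_{0,1} = mu_{5,6}
```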

Remark 4.5. (i) In the particular case

$$P = \bigotimes_{n=0}^{\infty} \mu_n \quad \text{with } \mu_n := P \circ X_n^{-1},$$

the last proposition implies that

P is stationary ⟺ μn = μ0 ∀ n ≥ 0.

(ii) If P = ⊗ μn as in (i), hence X0, X1, X2, . . . are independent, then Kolmogorov's zero-one law implies that P = 0-1 on the tail field

$$\mathcal{A}^* := \bigcap_{n \ge 0} \sigma(X_n, X_{n+1}, \dots).$$


Proposition 4.6. Let P = ⊗_{n=0}^∞ μn, μn := P ∘ X_n^{−1}, n ∈ N0. Then P is ergodic, i.e.

P = 0-1 on I := {A ∈ A | T^{−1}(A) = A}.

I is called the σ-algebra of shift-invariant sets.

Proof. Using part (ii) of the previous remark, it suffices to show that I ⊂ A*. But

A ∈ I ⇒ A = T^{−n}(A) ∈ σ(Xn, X_{n+1}, . . . ) ∀ n ∈ N ⇒ A ∈ A*.
