
Probability for Finance

Students and instructors alike will benefit from this rigorous, unfussy text, which keeps a clear focus on the basic probabilistic concepts required for an understanding of financial market models, including independence and conditioning. Assuming only some calculus and linear algebra, the text applies key results of measure and integration to probability spaces and random variables, culminating in Central Limit Theory. Consequently it provides essential pre-requisites to graduate-level study of modern finance and, more generally, to the study of stochastic processes.

Results are proved carefully and the key concepts are motivated by concrete examples drawn from financial market models. Students can test their understanding through the large number of exercises that are integral to the text.

Ekkehard Kopp is Emeritus Professor of Mathematics at the University of Hull. He has published over 50 research papers and five books on measure and probability, stochastic analysis and mathematical finance. He has taught in the UK, Canada and South Africa, and he serves on the editorial board of the AIMS Library Series.

Jan Malczak has published over 20 research papers and taught courses in analysis, differential equations, measure and probability, and the theory of stochastic differential processes. He is currently Professor of Mathematics at AGH University of Science and Technology in Krakow, Poland.

Tomasz Zastawniak holds the Chair of Mathematical Finance at the University of York. He has authored about 50 research publications and four books. He has supervised four PhD dissertations and around 80 MSc dissertations in mathematical finance.


Mastering Mathematical Finance

Mastering Mathematical Finance is a series of short books that cover all core topics and the most common electives offered in Master's programmes in mathematical or quantitative finance. The books are closely coordinated and largely self-contained, and can be used efficiently in combination but also individually.

The MMF books start financially from scratch and mathematically assume only undergraduate calculus, linear algebra and elementary probability theory. The necessary mathematics is developed rigorously, with emphasis on a natural development of mathematical ideas and financial intuition, and the readers quickly see real-life financial applications, both for motivation and as the ultimate end for the theory. All books are written for both teaching and self-study, with worked examples, exercises and solutions.

[DMFM] Discrete Models of Financial Markets, Marek Capinski, Ekkehard Kopp
[PF] Probability for Finance, Ekkehard Kopp, Jan Malczak, Tomasz Zastawniak
[SCF] Stochastic Calculus for Finance, Marek Capinski, Ekkehard Kopp, Janusz Traple
[BSM] The Black–Scholes Model, Marek Capinski, Ekkehard Kopp
[PTRM] Portfolio Theory and Risk Management, Maciej J. Capinski, Ekkehard Kopp
[NMFC] Numerical Methods in Finance with C++, Maciej J. Capinski, Tomasz Zastawniak
[SIR] Stochastic Interest Rates, Daragh McInerney, Tomasz Zastawniak
[CR] Credit Risk, Marek Capinski, Tomasz Zastawniak
[FE] Financial Econometrics, Marek Capinski
[SCAF] Stochastic Control Applied to Finance, Szymon Peszat, Tomasz Zastawniak

Series editors: Marek Capinski, AGH University of Science and Technology, Krakow; Ekkehard Kopp, University of Hull; Tomasz Zastawniak, University of York


Probability for Finance

EKKEHARD KOPP
University of Hull, Hull, UK

JAN MALCZAK
AGH University of Science and Technology, Krakow, Poland

TOMASZ ZASTAWNIAK
University of York, York, UK


University Printing House, Cambridge CB2 8BS, United Kingdom

Published in the United States of America by Cambridge University Press, New York

Cambridge University Press is part of the University of Cambridge.

It furthers the University's mission by disseminating knowledge in the pursuit of education, learning, and research at the highest international levels of excellence.

www.cambridge.org
Information on this title: www.cambridge.org/9781107002494

© Ekkehard Kopp, Jan Malczak and Tomasz Zastawniak 2014

This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 2014

Printed in the United Kingdom by Clays, St Ives plc

A catalogue record for this publication is available from the British Library

Library of Congress Cataloguing in Publication data

ISBN 978-1-107-00249-4 Hardback
ISBN 978-0-521-17557-9 Paperback

Additional resources for this publication at www.cambridge.org/9781107002494

Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this publication, and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.


Contents

Preface page vii

1 Probability spaces 1
1.1 Discrete examples 1
1.2 Probability spaces 6
1.3 Lebesgue measure 11
1.4 Lebesgue integral 13
1.5 Lebesgue outer measure 33

2 Probability distributions and random variables 39
2.1 Probability distributions 39
2.2 Random variables 46
2.3 Expectation and variance 56
2.4 Moments and characteristic functions 62

3 Product measure and independence 66
3.1 Product measure 67
3.2 Joint distribution 73
3.3 Iterated integrals 75
3.4 Random vectors in Rn 81
3.5 Independence 83
3.6 Covariance 96
3.7 Proofs by means of d-systems 98

4 Conditional expectation 106
4.1 Binomial stock prices 106
4.2 Conditional expectation: discrete case 112
4.3 Conditional expectation: general case 119
4.4 The inner product space L2(P) 130
4.5 Existence of E(X | G) for integrable X 137
4.6 Proofs 142

5 Sequences of random variables 147
5.1 Sequences in L2(P) 147
5.2 Modes of convergence for random variables 156
5.3 Sequences of i.i.d. random variables 167
5.4 Convergence in distribution 170
5.5 Characteristic functions and inversion formula 174
5.6 Limit theorems for weak convergence 176
5.7 Central Limit Theorem 180

Index 187


Preface

Mathematical models of financial markets rely in fundamental ways on the concepts and tools of modern probability theory. This book provides a concise but rigorous account of the probabilistic ideas and techniques most commonly used in such models. The treatment is self-contained, requiring only calculus and linear algebra as pre-requisites, and complete proofs are given – some longer constructions and proofs are deferred to the ends of chapters to ensure the smooth flow of key ideas.

New concepts are motivated through examples drawn from finance. The selection and ordering of the material are strongly guided by the applications we have in mind. Many of these applications appear more fully in later volumes of the 'Mastering Mathematical Finance' series, including [SCF], [BSM] and [NMFC]. This volume provides the essential mathematical background of the financial models described in detail there.

In adding to the extensive literature on probability theory we have not sought to provide a comprehensive treatment of the mathematical theory and its manifold applications. We focus instead on the more limited objective of writing a fully rigorous, yet concise and accessible, account of the basic concepts underlying widely used market models. The book should be read in conjunction with its partner volume [SCF], which describes the properties of stochastic processes used in these models.

In the first two chapters we introduce probability spaces, distributions and random variables from scratch. We assume a basic level of mathematical maturity in our description of the principal aspects of measures and integrals, including the construction of the Lebesgue integral and the important convergence results for integrals. Beginning with discrete examples familiar to readers of [DMFM], we motivate each construction by means of specific distributions used in financial modelling. Chapter 3 introduces product measures and random vectors, and highlights the key concept of independence, while Chapter 4 is devoted to a thorough discussion of conditioning, moving from the familiar discrete setting via the properties of inner product spaces and the Radon–Nikodym theorem to the construction of general conditional expectations for integrable random variables. The final chapter explores key limit theorems for sequences of random variables, beginning with orthonormal sequences of square-integrable functions, followed by a discussion of the relationships between various modes of convergence, and concluding with an introduction to weak convergence and the Central Limit Theorem for independent identically distributed random variables of finite mean and variance.

Concrete examples and the large number of exercises form an integral part of this text. Solutions to the exercises and further material can be found at www.cambridge.org/9781107002494.


1

Probability spaces

1.1 Discrete examples
1.2 Probability spaces
1.3 Lebesgue measure
1.4 Lebesgue integral
1.5 Lebesgue outer measure

In all spheres of life we make decisions based upon incomplete information. Frameworks for predicting the uncertain outcomes of future events have been around for centuries, notably in the age-old pastime of gambling. Much of modern finance draws on this experience. Probabilistic models have become an essential feature of financial market practice.

We begin at the beginning: this chapter is an introduction to basic concepts in probability, motivated by simple models for the evolution of stock prices. Emphasis is placed on the collection of events whose probability we need to study, together with the probability function defined on these events. For this we use the machinery of measure theory, including the construction of Lebesgue measure on R. We introduce and study integration with respect to a measure, with emphasis on powerful limit theorems. In particular, we specialise to the case of the Lebesgue integral and compare it with the Riemann integral familiar to students of basic calculus.

1.1 Discrete examples

The crucial feature of financial markets is uncertainty related to the future prices of various quantities, such as stock prices, interest rates, foreign exchange rates, market indices, or commodity prices. Our goal is to build a mathematical model capturing this aspect of reality.


Example 1.1
Consider how we could model stock prices. The current stock price (the spot price) is usually known, say 10. We may be interested in the price at some fixed future time. This future price involves some uncertainty. Suppose first that in this period of time the stock price jumps a number of times, going either up or down by 0.50 (such a price change is called a tick). After two such jumps there will be three possible prices: 9, 10, 11. After 20 jumps there will be a wider range of possible prices: 0, 1, 2, . . . , 19, 20.

The set of all possible outcomes will be denoted by Ω and called the sample space. The elements of Ω will be denoted by ω. For now we assume that Ω is a finite set.

Example 1.2
If we are interested in the prices after two jumps, we could take Ω = {9, 10, 11}. If we want to describe the prices after 20 jumps, we would take Ω = {0, 1, 2, . . . , 19, 20}.

The next step in building a model is to answer the following question: for a subset A ⊂ Ω, called an event, what is the probability that the outcome lies in A? The number representing the answer will be denoted by P(A), and the convention is to require P(A) ∈ [0, 1] with P(Ω) = 1 and P(∅) = 0. We shall write p_ω = P({ω}) for any ω ∈ Ω. Given p_ω for all ω ∈ Ω, the function P is then constructed for any A ⊂ Ω by adding the values attached to the elements of A,

P(A) = ∑_{ω∈A} p_ω.

This immediately implies an important property of P, called additivity,

P(A ∪ B) = P(A) + P(B) for any disjoint events A, B.

By induction, it readily extends to

P(⋃_{i=1}^m A_i) = ∑_{i=1}^m P(A_i) for any pairwise disjoint events A_1, . . . , A_m.


Example 1.3
Consider Ω = {9, 10, 11}. The simplest choice is to assign equal probabilities p_9 = p_10 = p_11 = 1/3 to all single-element subsets of Ω.

Example 1.4
In the case of Ω = {0, 1, 2, . . . , 19, 20} we could, once again, try equal probabilities for all single-element subsets of Ω, namely p_0 = p_1 = · · · = p_20 = 1/21.

The uniform probability on a finite Ω assigns equal probabilities p_ω = 1/N for each ω ∈ Ω, where N is the number of elements in Ω.

Example 1.5
Uniform probability does not appear to be consistent with the scheme in Example 1.1, where the stock prices result from consecutive jumps by ±0.50 from an initial price 10. In the case of two consecutive jumps one might argue that the middle price 10 should carry more weight since it can be arrived at in two ways (up–down or down–up), while either of the other two values can occur in just one way (down–down for 9, up–up for 11). Hence price 10 would be twice as likely as 9 or 11.

To reflect these considerations on Ω = {9, 10, 11} we can take p_9 = 1/4, p_10 = 1/2, p_11 = 1/4.

Example 1.6
Similarly, for Ω = {0, 1, 2, . . . , 19, 20} we can take p_n = C(20, n)/2^20, where C(20, n) = 20!/(n!(20 − n)!) is the number of scenarios consisting of n upwards and 20 − n downwards price jumps of 0.50 from the initial price 10, with each scenario equally likely. This is illustrated in Figure 1.1.

In general, when for an (N + 1)-element Ω = {0, 1, . . . , N} we have p_n = C(N, n)/2^N, we call this the symmetric binomial probability. Clearly, ∑_{n=0}^N p_n = 1.


Figure 1.1 Binomial probability and additive jumps.

The mechanism of price jumps by constant additive ticks is not entirely satisfactory as a model for stock prices. After sufficiently many jumps, the range of possible prices will include negative values. To have a more realistic model we need to adjust this mechanism of price jumps.

Example 1.7
The first price jump of ±0.50 means that the price changes by ±5%. In subsequent steps we shall now allow the prices to go up or down by 5% rather than by a constant tick of 0.50. The possible prices after 20 jumps will then be Ω = {ω_n : n = 0, 1, 2, . . . , 19, 20}, with ω_n = 10 × 1.05^n × 0.95^{20−n}. The prices will remain positive for any number of jumps. We choose the probabilities in a similar manner as before, p_{ω_n} = C(20, n)/2^20. Compare Figure 1.2 with Figure 1.1 to observe a subtle but crucial shift in the distribution of stock prices.
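The short Python sketch below (added for illustration; it is not part of the original text, and the variable names are mine) tabulates the symmetric binomial probabilities p_n = C(20, n)/2^20 together with the additive price grid of Example 1.6 and the multiplicative grid of Example 1.7.

```python
# Sketch of Examples 1.6 and 1.7 (an illustration, not from the book):
# symmetric binomial probabilities on additive and multiplicative price grids.
from math import comb

N = 20
probs = [comb(N, n) / 2**N for n in range(N + 1)]                      # p_n = C(20, n) / 2^20
additive = [10 + 0.5 * (2 * n - N) for n in range(N + 1)]              # prices 0, 1, ..., 20
multiplicative = [10 * 1.05**n * 0.95**(N - n) for n in range(N + 1)]  # always positive

assert abs(sum(probs) - 1.0) < 1e-12                                   # the p_n add up to 1
print(min(additive), max(additive))                                    # 0.0 20.0
print(round(min(multiplicative), 4), round(max(multiplicative), 4))    # about 3.5849 and 26.533
```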

The above examples restrict the possible stock prices to a finite set. In an attempt to extend the model we might want to allow an infinite sequence of possible prices, that is, a countable set Ω.

Figure 1.2 Binomial probability and multiplicative jumps.

Example 1.8
Suppose that the number of stock price jumps occurring within a fixed time period is not prescribed, but can be an arbitrary integer N. To be specific, suppose that the probability of N jumps is

q_N = λ^N e^{−λ}/N!

with N = 0, 1, 2, . . . for some parameter λ > 0. The probability of large N is small, but there is no upper bound on N, allowing for some hectic trading. Clearly,

∑_{N=0}^∞ q_N = ∑_{N=0}^∞ λ^N e^{−λ}/N! = e^{−λ} ∑_{N=0}^∞ λ^N/N! = e^{−λ} e^λ = 1.

This is called the Poisson probability with parameter λ.

Furthermore, conditioned on there being N jumps, the possible final stock prices will be described by means of the binomial probability and multiplicative jumps. We assume, as in Example 1.7, that each jump increases/reduces the stock price by 5% with probability 1/2. The stock price at time T will become

S(T) = 10 × 1.05^n × 0.95^{N−n}

with probability

p_{N,n} = q_N C(N, n) (1/2^N),

that is, the probability q_N of N = 0, 1, 2, . . . jumps multiplied by the probability C(N, n)/2^N of n upwards price movements among those N jumps, where 0 ≤ n ≤ N. We take Ω to be the set of such pairs of integers N, n. The formula P(A) = ∑_{ω∈A} p_ω defining the probability of an event now includes infinite sets A ⊂ Ω.
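A hedged Python sketch of this two-stage construction follows (my own illustration; the value of λ and the truncation level are arbitrary choices of mine, not given in the text).

```python
# Probabilities p_{N,n} = q_N * C(N, n) / 2^N from Example 1.8 (illustrative sketch).
from math import comb, exp, factorial

LAM = 5.0      # assumed value of the Poisson parameter lambda (not specified in the text)
N_MAX = 80     # truncation of the infinite sample space; q_N is negligible beyond this for LAM = 5

def q(N):
    return LAM**N * exp(-LAM) / factorial(N)

def p(N, n):
    return q(N) * comb(N, n) / 2**N

def price(N, n):
    return 10 * 1.05**n * 0.95**(N - n)      # S(T) for the outcome (N, n)

total = sum(p(N, n) for N in range(N_MAX + 1) for n in range(N + 1))
print(total)                                  # close to 1, up to the truncation error
```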

This example shows that it is natural to consider a stronger version of the additivity property:

P(⋃_{i=1}^∞ A_i) = ∑_{i=1}^∞ P(A_i)

for any sequence of pairwise disjoint events A_1, A_2, . . . ⊂ Ω. This is known as countable additivity.

Example 1.9
Another example where a countable set emerges in a natural way is related to modelling the instant when something unpredictable may happen. Time is measured by the number of discrete steps (of some fixed but unspecified length). At each step there is an upward/downward price jump with probabilities p, 1 − p ∈ (0, 1), respectively. The probability that an upward jump occurs for the first time at the nth step can be expressed as p_n = (1 − p)^{n−1} p. It is easy to check that ∑_{n=1}^∞ p_n = 1, which gives a probability on Ω = {1, 2, . . .}. This defines the geometric probability.
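The check behind ∑_{n=1}^∞ p_n = 1 is just the geometric series; spelling it out (an added step, not in the original text):

```latex
\sum_{n=1}^{\infty} p_n
  = \sum_{n=1}^{\infty} (1-p)^{n-1} p
  = p \sum_{k=0}^{\infty} (1-p)^{k}
  = \frac{p}{1-(1-p)}
  = 1,
```

since 0 < 1 − p < 1, so the geometric series converges.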

1.2 Probability spaces

Countable additivity turns out to be the perfect condition for probability theory. The actual construction of a probability measure can present difficulties. In particular, it is sometimes impossible to define P for all subsets of Ω. The domain of P has to be specified, and it is natural to impose some restrictions on that domain ensuring that countable additivity can be formulated.

Definition 1.10
A probability space is a triple (Ω, F, P) as follows.

(i) Ω is a non-empty set (called the sample space, or set of scenarios).
(ii) F is a family of subsets of Ω (called events) satisfying the following conditions:
• Ω ∈ F;
• if A_i ∈ F for i = 1, 2, . . . , then ⋃_{i=1}^∞ A_i ∈ F (we say that F is closed under countable unions);
• if A ∈ F, then Ω \ A ∈ F (we say that F is closed under complements).
Such a family of sets F is called a σ-field on Ω.


(iii) P assigns numbers to events,

P : F → [0, 1],

and we assume that
• P(Ω) = 1;
• for all sequences of events A_i ∈ F, i = 1, 2, 3, . . . that are pairwise disjoint (A_i ∩ A_j = ∅ for i ≠ j) we have

P(⋃_{i=1}^∞ A_i) = ∑_{i=1}^∞ P(A_i).

This property is called countable additivity. A function P satisfying these conditions is called a probability measure (or simply a probability).

Exercise 1.1 Let F be a σ-field and A_1, A_2, . . . ∈ F. Show that ⋂_{i=1}^n A_i ∈ F for each n = 1, 2, . . . and that ⋂_{i=1}^∞ A_i ∈ F.

Exercise 1.2 Suppose that F is a σ-field containing all open intervals in [0, 1] with rational endpoints. Show that F contains all open intervals in [0, 1].

Before proceeding further we note some basic properties of probability measures.

Theorem 1.11
If P is a probability measure, then:

(i) P(⋃_{i=1}^n A_i) = ∑_{i=1}^n P(A_i) for any pairwise disjoint events A_i ∈ F, i = 1, 2, . . . , n (finite additivity);
(ii) P(Ω \ A) = 1 − P(A) for any A ∈ F; in particular, P(∅) = 0;
(iii) A ⊂ B implies P(A) ≤ P(B) for any A, B ∈ F (monotonicity);
(iv) P(⋃_{i=1}^n A_i) ≤ ∑_{i=1}^n P(A_i) for any A_i ∈ F, i = 1, 2, . . . , n (finite subadditivity);
(v) if A_{n+1} ⊃ A_n ∈ F for all n ≥ 1, then P(⋃_{n=1}^∞ A_n) = lim_{m→∞} P(A_m);
(vi) P(⋃_{i=1}^∞ A_i) ≤ ∑_{i=1}^∞ P(A_i) for any A_i ∈ F, i = 1, 2, . . . (countable subadditivity);
(vii) if A_{n+1} ⊂ A_n ∈ F for all n ≥ 1, then P(⋂_{n=1}^∞ A_n) = lim_{m→∞} P(A_m).


Proof (i) Let A_{n+1} = A_{n+2} = · · · = ∅ and apply countable additivity.

(ii) Use (i) with n = 2, A_1 = A, A_2 = Ω \ A.

(iii) Since B = A ∪ (B \ A) and we have disjoint components, we can apply (i), so

P(B) = P(A) + P(B \ A) ≥ P(A).

(iv) For n = 2,

P(A_1 ∪ A_2) = P(A_1 ∪ (A_2 \ A_1)) = P(A_1) + P(A_2 \ A_1) ≤ P(A_1) + P(A_2),

and then use induction to complete the proof for arbitrary n, where the induction step will be the same as the above argument.

(v) Using the above properties, we have (with A_0 = ∅)

P(⋃_{n=1}^∞ A_n) = P(⋃_{n=0}^∞ (A_{n+1} \ A_n)) = ∑_{n=0}^∞ P(A_{n+1} \ A_n)
= lim_{m→∞} ∑_{n=0}^m P(A_{n+1} \ A_n) = lim_{m→∞} P(⋃_{n=0}^m (A_{n+1} \ A_n))
= lim_{m→∞} P(A_{m+1}).

(vi) We put B_n = ⋃_{i=1}^n A_i, so that B_{n+1} ⊃ B_n ∈ F for all n ≥ 1, and using (v) we pass to the limit in the finite subadditivity relation (iv):

P(⋃_{n=1}^∞ A_n) = P(⋃_{n=1}^∞ B_n) = lim_{m→∞} P(B_m) = lim_{m→∞} P(⋃_{n=1}^m A_n)
≤ lim_{m→∞} ∑_{n=1}^m P(A_n) = ∑_{n=1}^∞ P(A_n).

(vii) Take A = ⋂_{n=1}^∞ A_n, note that P(A) = 1 − P(Ω \ A) by (ii), and apply (v):

P(A) = 1 − P(⋃_{n=1}^∞ (Ω \ A_n)) = 1 − lim_{m→∞} P(Ω \ A_m) = lim_{m→∞} P(A_m). □

The construction of interesting probability measures requires some labour, as we shall see. However, a few simple examples can be given immediately.

Example 1.12
Take any non-empty set Ω, fix ω ∈ Ω, and define δ_ω(A) = 1 if ω ∈ A and δ_ω(A) = 0 if ω ∉ A, for any A ⊂ Ω. It is a probability measure, called the unit mass, also known as the Dirac measure, concentrated at ω. If F is taken to be the family of all subsets of Ω, then (Ω, F, δ_ω) is a probability space.

Example 1.13
Let N be a positive integer. On Ω = {0, 1, . . . , N} define

P(A) = ∑_{n=0}^N C(N, n) (1/2^N) δ_n(A)

for any A ⊂ Ω, where δ_n is the unit mass concentrated at n from Example 1.12. We take F to be the family of all subsets of Ω. Then (Ω, F, P) is a probability space. This is clearly the symmetric binomial probability considered earlier.

More generally, for the same Ω and any p ∈ (0, 1), the binomial probability with parameters N, p is defined by setting

P(A) = ∑_{n=0}^N C(N, n) p^n (1 − p)^{N−n} δ_n(A).

It is immediate from the binomial theorem that P(Ω) = 1.

This example is often described as providing the probabilities of events relating to the repeated tossing of a coin (where successive tosses are assumed not to affect each other, in a sense that will be made precise later): if for any given toss the probability of 'Heads' is p, the probability of finding exactly k 'Heads' in N tosses is C(N, k) p^k (1 − p)^{N−k}.
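As a quick illustration (mine, not part of the text), the binomial probability of Example 1.13 can be coded directly as a weighted sum of unit masses.

```python
# Binomial probability with parameters N, p as a combination of Dirac measures (sketch).
from math import comb

def delta(n):
    # unit mass concentrated at n, as in Example 1.12
    return lambda A: 1.0 if n in A else 0.0

def binomial_measure(N, p):
    weights = [comb(N, n) * p**n * (1 - p)**(N - n) for n in range(N + 1)]
    return lambda A: sum(w * delta(n)(A) for n, w in enumerate(weights))

P = binomial_measure(20, 0.5)            # the symmetric case of Example 1.6
print(P(set(range(21))))                 # P(Omega) = 1, up to rounding
print(P({10}))                           # probability of exactly 10 'Heads', about 0.176
```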

Example 1.14
Fix λ > 0, let Ω = {0, 1, 2, . . .} and let F be the family of all subsets of Ω. For any A ∈ F put

P(A) = ∑_{n=0}^∞ (e^{−λ} λ^n/n!) δ_n(A),

where δ_n is the unit mass concentrated at n. Then (Ω, F, P) is a probability space. This gives the Poisson probability mentioned in Example 1.8.

In addition to subsets of R, for example the set [0, ∞) of all non-negative real numbers, it often proves convenient to consider sets containing ∞ or −∞ in addition to real numbers. For instance, we write [−∞, ∞] for the set of all real numbers in R together with ∞ and −∞, and [0, ∞] to denote the set of all non-negative real numbers together with ∞.

Probability measures belong to a wider class of countably additive set functions taking values in [0, ∞]. Let Ω be a non-empty set and let F be a σ-field of subsets of Ω.

Definition 1.15
We say that μ : F → [0, ∞] is a measure and call (Ω, F, μ) a measure space if

(i) μ(∅) = 0;
(ii) for any pairwise disjoint sets A_i ∈ F, i = 1, 2, . . .

μ(⋃_{i=1}^∞ A_i) = ∑_{i=1}^∞ μ(A_i).

Note that some of the terms μ(A_i) in the sum may be infinite, and we use the convention x + ∞ = ∞ for any x ∈ [0, ∞].

Moreover, we call μ a finite measure if, in addition, μ(Ω) < ∞.

The properties listed in Theorem 1.11 and their proofs can readily be adapted to the case of an arbitrary measure.

Corollary 1.16
Properties (i) and (iii)–(vi) listed in Theorem 1.11 remain true for any measure μ. If we assume in addition that μ(Ω) < ∞, then (ii) becomes μ(Ω \ A) = μ(Ω) − μ(A). Moreover, if μ(A_1) is finite, then (vii) still holds.

Example 1.17
For any non-empty set Ω and any A ⊂ Ω let

μ(A) = ∑_{ω∈A} δ_ω(A),

where δ_ω is the unit mass concentrated at ω. The sum is equal to the number of elements in A if A is a finite set, and ∞ otherwise. Then μ is a measure, called the counting measure, on Ω defined on the family F consisting of all subsets of Ω. It is not a probability measure, however, unless Ω is a one-element set.

1.3 Lebesgue measure

The discrete stock price models in Section 1.1 admit only a limited range of prices. To remove this restriction it is natural to allow future prices to take any value in some interval in R. Probability spaces capable of capturing this modelling choice require the notion of Lebesgue measure, introduced in this section. In particular, it will facilitate a study of log-normally distributed stock prices and it will prove instrumental in the development of stochastic calculus, which is of fundamental importance in mathematical finance.

To begin with, take any open interval I = (a, b) ⊂ R, where a ≤ b. We denote the length of the interval by

l(I) = b − a.

The family of such intervals will be denoted by I. Observe that ∅ = (a, a) ∈ I, and that l(∅) = 0.

However, I is not a σ-field, and the length l as a function defined on I is not a measure. Can this function be extended to a larger domain, a σ-field containing I, on which it will become a measure? The answer to this question is positive, as we shall see in Theorem 1.19, but not immediately obvious.

First of all, we need to identify the σ-field to provide the domain of such a measure. To make the task of extending the length function easier, we want the σ-field to be as small as possible, as long as it contains I. The intersection of all σ-fields containing I, denoted by

B(R) = ⋂ {F : F is a σ-field on R and I ⊂ F}    (1.1)

and called the family of Borel sets in R, is the smallest σ-field containing I, as shown in the next exercise.

Page 22: 0521175577_1107002494ProbabilityFina

12 Probability spaces

Exercise 1.3 Show that:
(1) B(R) is a σ-field on R such that I ⊂ B(R);
(2) if F is a σ-field on R such that I ⊂ F, then B(R) ⊂ F.

We could equally have begun with closed intervals [a, b], since this class of intervals leads to the same σ-field B(R). To see this we only need to note that [a, b] = ⋂_{n=1}^∞ (a − 1/n, b + 1/n) and (a, b) = ⋃_{n=1}^∞ [a + 1/n, b − 1/n].

In particular, singleton sets {a} = [a, a] belong to B(R) for all a ∈ R, and so do all finite or countable subsets of R. Hence the set Q of rationals and its complement R \ Q, the set of irrationals, also belong to B(R).

Definition 1.18
For each A ∈ B(R) we put

m(A) = inf { ∑_{k=1}^∞ l(J_k) : A ⊂ ⋃_{k=1}^∞ J_k },    (1.2)

where the infimum is taken over all sequences (J_k)_{k=1}^∞ consisting of open intervals. We call m the Lebesgue measure defined on B(R).

When A ⊂ ⋃_{k=1}^∞ J_k we say that A is covered by the sequence of sets (J_k)_{k=1}^∞. The idea is to cover A by a sequence of open intervals, consider the total length of these intervals as an overestimate of the measure of A, and take the infimum of the total length over all such coverings.

Theorem 1.19
m : B(R) → [0, ∞] is a measure such that m((a, b)) = b − a for all a ≤ b.

The proof can be found in Section 1.5. The details are not needed in the rest of this volume, but any serious student of mathematics applied in modern finance should have seen them at least once!

Remark 1.20
Lebesgue measure can be defined on a σ-field larger than B(R), but cannot be extended to a measure defined on all subsets of R.¹

Exercise 1.4 Find m([1/2, 2)) and m([−2, 3] ∪ [3, 8]).

¹ See, for example, M. Capinski and E. Kopp, Measure, Integral and Probability, 2nd edition, Springer-Verlag 2004.

Page 23: 0521175577_1107002494ProbabilityFina

1.4 Lebesgue integral 13

Exercise 1.5 Compute m(⋃_{n=2}^∞ (1/(n+1), 1/n]).

Exercise 1.6 Find m(N), m(Q), m(R \ Q), m({x ∈ R : sin x = cos x}).

Exercise 1.7 Show that the Cantor set C, constructed below, is uncountable, but that m(C) = 0.

The Cantor set is defined to be C = ⋂_{n=0}^∞ C_n, where C_0 = [0, 1], C_1 is obtained by removing from C_0 the 'middle third' (1/3, 2/3) of the interval [0, 1], C_2 is formed by similarly removing from C_1 the 'middle thirds' (1/9, 2/9), (7/9, 8/9) of the two intervals [0, 1/3], [2/3, 1], and so on. The set C_n consists of 2^n closed intervals, each of length (1/3)^n.

Exercise 1.8 Show that for any A ∈ B(R) and for any x ∈ R

m(A) = m(A + x),

where A + x = {a + x ∈ R : a ∈ A}. This property of Lebesgue measure is called translation invariance.

Lebesgue measure allows us to define a probability on any bounded interval Ω = [a, b] by writing, for any Borel set A ⊂ [a, b],

P(A) = m(A)/m(Ω).

This is called the uniform probability on [a, b].

1.4 Lebesgue integral

As we noticed in the discrete case, uniform probability does not lend itself well to modelling stock prices. Similarly, in the continuous case, we need more than just the uniform probability on an interval. A natural idea is to replace the sum P(A) = ∑_{ω∈A} p_ω, used in the discrete case to express the probability of an event A, by an integral understood in an appropriate sense. The simplest case is that of an integral of a continuous function on R when A = [a, b] is an interval. With this in mind, we briefly review some basic facts concerning integrals of continuous functions.

Figure 1.3 Approximating the area under the graph of f by rectangles.

Riemann integral

Let f : [a, b] → R be a continuous function. In this case f must be bounded, so the area under the graph of f is finite. To approximate this area we divide it into strips by choosing a sequence of numbers a = c_0 < c_1 < · · · < c_n = b and approximate each strip by a rectangle. We take the height of such a rectangle with base [c_{i−1}, c_i] to be f(x_i) for some x_i ∈ [c_{i−1}, c_i], see Figure 1.3. The total area of the rectangles is

S_n = ∑_{i=1}^n f(x_i)(c_i − c_{i−1}).

Let δ_n = max_{i=1,...,n} |c_i − c_{i−1}|. The sequence S_n for n = 1, 2, . . . converges to a limit independent of the way the c_i and x_i are selected, as long as lim_{n→∞} δ_n = 0. We call this limit the Riemann integral of f over [a, b] and denote it by

∫_a^b f(x) dx = lim_{n→∞} S_n.

The integral ∫_a^b f(x) dx exists and is finite for any continuous function f. The same applies to bounded functions having at most a countable number of points of discontinuity.
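A few lines of Python make the definition concrete (an added sketch, not from the book); here the c_i are equally spaced and x_i is taken to be the left endpoint, one of the admissible choices.

```python
# Riemann sum S_n with equally spaced points and left-endpoint heights (sketch).
def riemann_sum(f, a, b, n):
    h = (b - a) / n
    return sum(f(a + i * h) * h for i in range(n))

print(riemann_sum(lambda x: x * x, 0.0, 1.0, 10))      # 0.285
print(riemann_sum(lambda x: x * x, 0.0, 1.0, 10_000))  # close to 1/3
```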

There are, however, some fairly obvious functions for which the Riemann integral cannot be defined.

Example 1.21
Consider the function f : R → [0, ∞) defined as

f(x) = 0 if x ∈ Q, 1 if x ∈ R \ Q.

Fix a sequence 0 = c_0 < c_1 < · · · < c_n = 1 of points in the interval [0, 1]. Each subinterval [c_{i−1}, c_i] contains both rational and irrational numbers, so taking the x_i to be rationals we get

∑_{i=1}^n f(x_i)(c_i − c_{i−1}) = 0,

while for irrational x_i we get

∑_{i=1}^n f(x_i)(c_i − c_{i−1}) = 1.

As n approaches infinity we can get a different limit (or in fact no limit at all), depending on the choice of the x_i, which means that the Riemann integral ∫_0^1 f(x) dx does not exist.

The following result captures the relationship between derivatives and Riemann integrals.

Theorem 1.22
Let f : [a, b] → R be a continuous function. Then we have the following.

(i) The function defined for any x ∈ [a, b] by

F(x) = ∫_a^x f(y) dy

is differentiable and its derivative at any x ∈ [a, b] (at a and b we take right- or left-sided derivatives, respectively) is

F′(x) = f(x).    (1.3)

(ii) For any function F : [a, b] → R satisfying (1.3)

∫_a^b f(x) dx = F(b) − F(a).

A function F satisfying (1.3) is called an antiderivative of f. Such a function is unique up to a constant.
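For instance (a worked illustration added here, not part of the original), taking f(x) = x² on [0, 1] and F(x) = x³/3 in part (ii):

```latex
F(x) = \tfrac{1}{3}x^{3}, \qquad F'(x) = x^{2} = f(x),
\qquad
\int_{0}^{1} x^{2}\,dx = F(1) - F(0) = \tfrac{1}{3} - 0 = \tfrac{1}{3}.
```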


Figure 1.4 Normal density.

Some technicalities are involved to justify these claims about the Riemann integral, of course, but they are not needed in what follows. These elementary techniques of integration form part of any calculus course.

The Riemann integral makes it possible to describe the probability of any event represented by an interval A = [a, b] as the integral

P(A) = ∫_a^b f(x) dx    (1.4)

of a continuous (or piecewise continuous) function f : R → [0, ∞), as long as

∫_{−∞}^∞ f(x) dx = 1.    (1.5)

Here we have used the indefinite Riemann integral, defined as ∫_{−∞}^∞ f(x) dx = lim_{c→∞} ∫_{−c}^c f(x) dx.

Example 1.23
Take

f(x) = (1/(σ√(2π))) e^{−(x−μ)²/(2σ²)},    (1.6)

where μ ∈ R and σ > 0 are parameters. This is called a normal (or Gaussian) density.

In Figure 1.4 we sketch the graph for μ = 10 and σ = 2.236. As we shall see in Example 5.54, this choice of parameters is related to Example 1.6, where the stock prices change in an additive manner over 20 time steps, which was illustrated in Figure 1.1.
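A crude numerical check of (1.5) for this density (my own sketch, in the spirit of Exercise 1.9 below; the integration range and step count are arbitrary choices of mine):

```python
# Approximate integral of the normal density (1.6) with mu = 10, sigma = 2.236 (sketch).
from math import exp, sqrt, pi

mu, sigma = 10.0, 2.236

def f(x):
    return exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

a, b, n = mu - 10 * sigma, mu + 10 * sigma, 100_000   # the density is negligible outside this range
h = (b - a) / n
print(sum(f(a + i * h) for i in range(n)) * h)        # very close to 1
```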


Figure 1.5 Log-normal density.

From the standpoint of modelling stock prices a disadvantage of the normal density is that the probability of negative values is non-zero, which is hardly acceptable when modelling prices even if, as the graph suggests, this probability may be small.

Exercise 1.9 Verify that f given by (1.6) satisfies (1.5).

Example 1.24
An example related to the multiplicative changes of stock prices discussed in Example 1.7 is based on the choice of

f(x) = (1/(xσ√(2π))) e^{−(ln x−μ)²/(2σ²)} for x > 0, and f(x) = 0 for x ≤ 0,    (1.7)

called a log-normal density.

The graph for μ = 2.2776 and σ = 0.2238 is depicted in Figure 1.5. Compare this with Figure 1.2. These values are related to those in Example 1.7, as will be explained in Example 5.55.

Negative prices are excluded. The log-normal density is widely accepted by the financial community as providing a standard model of stock prices.


Exercise 1.10 Verify that f given by (1.7) satisfies (1.5).

Integral with respect to a measure

Difficulties arise when P given by the Riemann integral (1.4) needs to be extended from intervals to a measure on a σ-field. To minimize effort, we take the smallest σ-field containing intervals, that is, the σ-field of Borel sets B(R), just like we did when introducing Lebesgue measure m.

We outline the constructions involved in this extension, leaving routine verifications as exercises. While we aim primarily to construct an integral on the measure space (R, B(R), m), this is just as easy on an arbitrary measure space (Ω, F, μ), which is what we will now do, thereby achieving greater generality.

First we integrate so-called simple functions. The indicator function of any A ⊂ Ω will be denoted by

1_A(x) = 1 if x ∈ A, 0 if x ∈ Ω \ A.

Definition 1.25
By definition, a (non-negative) simple function has the form

s = ∑_{i=1}^n s_i 1_{A_i},

where A_1, . . . , A_n ∈ F are pairwise disjoint sets with ⋃_{i=1}^n A_i = Ω, and where s_i ≥ 0 for i = 1, . . . , n. We define

∫_Ω s dμ = ∑_{i=1}^n s_i μ(A_i).

It may happen that μ(A_i) = ∞ for some i, and then the conventions 0 · ∞ = 0 and x · ∞ = ∞ for x > 0 are used.
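On a finite Ω the definition can be computed directly; the following Python sketch (mine, not from the book) represents μ by its values on singletons and a simple function by its list of pairs (s_i, A_i).

```python
# Integral of a non-negative simple function s = sum_i s_i 1_{A_i} (illustrative sketch).
def integral_simple(pairs, mu):
    # pairs: list of (s_i, A_i) with the A_i pairwise disjoint; mu: dict of point masses
    return sum(s_i * sum(mu[w] for w in A_i) for s_i, A_i in pairs)

mu = {0: 0.2, 1: 0.3, 2: 0.5}            # a toy measure on Omega = {0, 1, 2}
s = [(4.0, {0, 1}), (7.0, {2})]          # s = 4 on {0, 1} and 7 on {2}
print(integral_simple(s, mu))            # 4 * 0.5 + 7 * 0.5 = 5.5
```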

Exercise 1.11 Show that if r, s are any non-negative simple functions and a, b ≥ 0 are any non-negative numbers, then ar + bs is a simple function and

∫_Ω (ar + bs) dμ = a ∫_Ω r dμ + b ∫_Ω s dμ.

Exercise 1.12 Show that if r, s are non-negative simple functions such that r ≤ s, then

∫_Ω r dμ ≤ ∫_Ω s dμ.

Proposition 1.26
Suppose that s is a non-negative simple function. Then s1_B is also a non-negative simple function for any B ∈ F, and

ν(B) = ∫_Ω s1_B dμ

is a measure on Ω defined on the σ-field F.

Proof Take any B ∈ F, and let s = ∑_{i=1}^n s_i 1_{A_i}, where A_1, . . . , A_n ∈ F are pairwise disjoint sets with ⋃_{i=1}^n A_i = Ω, and s_i ≥ 0 for i = 1, . . . , n. Then s1_B = ∑_{i=1}^{n+1} r_i 1_{C_i}, where r_i = s_i, C_i = A_i ∩ B for i = 1, . . . , n and where r_{n+1} = 0, C_{n+1} = Ω \ B, is also a simple function. Moreover,

ν(B) = ∫_Ω s1_B dμ = ∑_{i=1}^{n+1} r_i μ(C_i) = ∑_{i=1}^n s_i μ(A_i ∩ B).

In particular, for B = ∅

ν(∅) = ∑_{i=1}^n s_i μ(∅) = 0.

Now suppose that B = ⋃_{j=1}^∞ B_j, where B_j ∈ F for j = 1, 2, . . . are pairwise disjoint sets. Then countable additivity of μ gives

ν(B) = ∑_{i=1}^n s_i μ(A_i ∩ B) = ∑_{i=1}^n s_i ∑_{j=1}^∞ μ(A_i ∩ B_j)
= ∑_{j=1}^∞ ∑_{i=1}^n s_i μ(A_i ∩ B_j) = ∑_{j=1}^∞ ν(B_j).

We have proved that ν is a measure. □

Next, we need to identify the class of functions that will be integrated with respect to a measure. First we introduce some notation. For any B ⊂ R we denote by {f ∈ B} or by f^{−1}(B) the inverse image of B under f, that is,

{f ∈ B} = f^{−1}(B) = {ω ∈ Ω : f(ω) ∈ B}.

The notation extends to the case when the set of values of f is specified in terms of a certain property, for example, {f > a} = {ω ∈ Ω : f(ω) > a} is the inverse image of (a, ∞) under f.

Definition 1.27
We say that a function f : Ω → [−∞, ∞] is measurable (more precisely, measurable with respect to F, or F-measurable) if

{f ∈ B} ∈ F for all B ∈ B(R)

and

{f = ∞}, {f = −∞} ∈ F.

Exercise 1.13 Show that f : Ω → R is measurable if and only if {f > a} ∈ F for all a ∈ R.

Exercise 1.14 Show that every simple function is measurable.

Exercise 1.15 Show that the composition g ◦ f of a measurable function f : Ω → R with a continuous function g : R → R is measurable.

Exercise 1.16 If f_k are measurable functions for k = 1, . . . , n, show that max{f_1, . . . , f_n} and min{f_1, . . . , f_n} are also measurable.

Exercise 1.17 If f_n are measurable functions for n = 1, 2, . . . , show that sup_{n≥1} f_n and inf_{n≥1} f_n are also measurable.

Exercise 1.18 Suppose that f_n are measurable functions for n = 1, 2, . . . . Recall that, by definition,

lim sup_{n→∞} f_n = inf_{k≥1} (sup_{n≥k} f_n),    lim inf_{n→∞} f_n = sup_{k≥1} (inf_{n≥k} f_n).

Show that lim sup_{n→∞} f_n and lim inf_{n→∞} f_n are also measurable functions.

Exercise 1.19 For any sequence of measurable functions f_n, n = 1, 2, . . . , show that lim_{n→∞} f_n is also a measurable function.

Proposition 1.28
Let f : Ω → [0, ∞]. The following conditions are equivalent:

(i) f is a measurable function;
(ii) there is a non-decreasing sequence of non-negative simple functions s_n, n = 1, 2, . . . such that f = lim_{n→∞} s_n.

Proof (i) ⇒ (ii) Let f : Ω → [0, ∞] be a measurable function. For each n = 1, 2, . . . we put

s_n = ∑_{i=0}^{n2^n} i2^{−n} 1_{A_{i,n}},    (1.8)

where A_{i,n} = {i2^{−n} ≤ f < (i + 1)2^{−n}}. This defines a non-decreasing sequence of simple functions such that f = lim_{n→∞} s_n (see Exercise 1.20).

(ii) ⇒ (i) This is a consequence of Exercises 1.14 and 1.19. □
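The approximating functions (1.8) are easy to compute pointwise. The sketch below (my own, not from the text) rounds f(ω) down to the nearest multiple of 2^{−n} and caps the result at n; treating all values of f at or above n in this way is my reading of the formula, stated here as an assumption rather than as the book's wording.

```python
# Dyadic simple-function approximation of a non-negative f, in the spirit of (1.8).
from math import floor

def s_n(f, n):
    def s(omega):
        i = floor(f(omega) * 2**n)       # index of the dyadic level just below f(omega)
        return min(i, n * 2**n) / 2**n   # capped at n (assumption, see the note above)
    return s

f = lambda x: x * x
for n in (1, 2, 4, 8):
    print(n, s_n(f, n)(1.3))             # 1.0, 1.5, 1.6875, 1.6875: increases towards f(1.3) = 1.69
```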

Exercise 1.20 Verify that (1.8) defines a non-decreasing sequence of non-negative simple functions such that f = lim_{n→∞} s_n.

Exercise 1.21 Show that if f, g are non-negative measurable functions and a, b ≥ 0, then af + bg is measurable.

Exercise 1.22 Show that if f, g are non-negative measurable functions, then fg is measurable.

The next step is to define the integral of any non-negative measurable function.

Definition 1.29
For any non-negative measurable function f : Ω → [0, ∞] the integral of f is defined as

∫_Ω f dμ = sup { ∫_Ω s dμ : s is a simple function such that s ≤ f }.

If the supremum is finite, we say that f is integrable.

Exercise 1.23 For any non-negative measurable functions f, g such that f ≤ g show

∫_Ω f dμ ≤ ∫_Ω g dμ.

Remark 1.30
For any non-negative simple function s, Exercise 1.23 implies that Definitions 1.25 and 1.29 of the integral give the same result, so the same notation ∫_Ω s dμ can be used in both cases.

The monotone convergence theorem, stated here for non-negative measurable functions, is a standard tool for handling limit operations, something that the integral with respect to a measure can tackle with remarkable ease.

Theorem 1.31 (monotone convergence)
If f : Ω → [0, ∞] and f_n : Ω → [0, ∞] for n = 1, 2, . . . is a non-decreasing sequence of non-negative measurable functions such that f = lim_{n→∞} f_n, then f is a non-negative measurable function and

∫_Ω f dμ = lim_{n→∞} ∫_Ω f_n dμ.

Proof That f is measurable follows from Exercise 1.19. We put L = lim_{n→∞} ∫_Ω f_n dμ for brevity. The limit exists (but may be equal to ∞) and satisfies L ≤ ∫_Ω f dμ because ∫_Ω f_n dμ is a non-decreasing sequence and ∫_Ω f_n dμ ≤ ∫_Ω f dμ by Exercise 1.23.

To show that, on the other hand, L ≥ ∫_Ω f dμ we take any non-negative simple function s such that s ≤ f. We also take any α ∈ (0, 1) and put B_n = {f_n ≥ αs}. Because f_n ≥ f_n 1_{B_n} ≥ αs 1_{B_n}, using Exercise 1.23 once again, together with Exercise 1.11, we obtain

∫_Ω f_n dμ ≥ ∫_Ω αs 1_{B_n} dμ = α ∫_Ω s 1_{B_n} dμ = αν(B_n)

for each n, where ν is the measure on F defined in Proposition 1.26. Since f_n is a non-decreasing sequence, it follows that B_n ⊂ B_{n+1} for each n. Moreover, since lim_{n→∞} f_n = f ≥ s > αs, we have ⋃_{n=1}^∞ B_n = Ω. Because ν is a measure, we therefore have

L ≥ α lim_{n→∞} ν(B_n) = αν(Ω) = α ∫_Ω s dμ

from Theorem 1.11 (v), adapted to the case of a measure in Corollary 1.16. This is so for any α ∈ (0, 1) and any simple function s such that s ≤ f, which implies that L ≥ ∫_Ω f dμ, completing the proof. □

Exercise 1.24 Let f_n for n = 1, 2, . . . be a sequence of non-negative measurable functions. Show that ∑_{n=1}^∞ f_n is a non-negative measurable function and

∫_Ω (∑_{n=1}^∞ f_n) dμ = ∑_{n=1}^∞ ∫_Ω f_n dμ.

Proposition 1.32
Let f, g : Ω → [0, ∞] be non-negative measurable functions and take any a, b ≥ 0. Then af + bg is measurable and

∫_Ω (af + bg) dμ = a ∫_Ω f dμ + b ∫_Ω g dμ,

where we use the conventions x · 0 = 0, x · ∞ = ∞ for any x > 0, and x + ∞ = ∞ for any x ≥ 0.

Proof According to Proposition 1.28, there are non-decreasing sequences r_n and s_n of simple functions such that f = lim_{n→∞} r_n and g = lim_{n→∞} s_n. It follows that ar_n + bs_n is a non-decreasing sequence of simple functions and af + bg = lim_{n→∞} (ar_n + bs_n). By the monotone convergence theorem (Theorem 1.31) and Exercise 1.11, it follows that

∫_Ω (af + bg) dμ = lim_{n→∞} ∫_Ω (ar_n + bs_n) dμ
= a lim_{n→∞} ∫_Ω r_n dμ + b lim_{n→∞} ∫_Ω s_n dμ
= a ∫_Ω f dμ + b ∫_Ω g dμ. □

The final step in the construction of the integral with respect to a measure is to extend the definition from non-negative measurable functions to arbitrary ones by integrating their positive and negative parts separately.

Definition 1.33
Let f : Ω → [−∞, ∞] be a measurable function. If both the positive and negative parts

f^+ = max{f, 0},    f^− = max{−f, 0}    (1.9)

are integrable, we say that f itself is integrable. When at least one of the functions f^+, f^− is integrable, we define the integral of f as

∫_Ω f dμ = ∫_Ω f^+ dμ − ∫_Ω f^− dμ,

where the conventions x + ∞ = ∞ and x − ∞ = −∞ for any x ∈ R apply whenever one of the integrals on the right-hand side is equal to ∞. When neither f^+ nor f^− is integrable, the integral of f remains undefined.
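For instance (an added illustration, not from the text), if A, B ∈ F are disjoint sets of finite measure and f = 3·1_A − 2·1_B, then:

```latex
f^{+} = 3\cdot 1_{A}, \qquad f^{-} = 2\cdot 1_{B},
\qquad
\int_{\Omega} f\, d\mu
  = \int_{\Omega} f^{+}\, d\mu - \int_{\Omega} f^{-}\, d\mu
  = 3\mu(A) - 2\mu(B).
```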

Exercise 1.25 Let f be a measurable function. Show that f is integrable if and only if |f| is.

Exercise 1.26 For any integrable function f show that

|∫_Ω f dμ| ≤ ∫_Ω |f| dμ.

Exercise 1.27 Let f, g be integrable functions and let a, b ∈ R. Show that af + bg is integrable and

∫_Ω (af + bg) dμ = a ∫_Ω f dμ + b ∫_Ω g dμ.

Exercise 1.28 For any integrable functions f, g such that f ≤ g show

∫_Ω f dμ ≤ ∫_Ω g dμ.

Exercise 1.29 Extend the monotone convergence theorem (Theorem 1.31) to the case when f_n for n = 1, 2, . . . is a non-decreasing sequence of integrable functions and f = lim_{n→∞} f_n is also an integrable function.

It proves convenient to consider the integral over any B ∈ F rather than just over Ω.

Definition 1.34
For any B ∈ F and any measurable f : Ω → [−∞, ∞] we define the integral of f over B by

∫_B f dμ = ∫_Ω f 1_B dμ

whenever the integral ∫_Ω f 1_B dμ exists (including the cases when it is ∞ or −∞), and we say that f is integrable over B whenever this integral is finite.

The following result is an extension of Proposition 1.26.

Theorem 1.35
Suppose that f : Ω → [0, ∞] is measurable, and

ν(B) = ∫_B f dμ

for any B ∈ F. Then ν is a measure on Ω defined on the σ-field F.

Proof Suppose that B = ⋃_{i=1}^∞ B_i, where B_i ∈ F for i = 1, 2, . . . are pairwise disjoint sets. Then

g_n = f 1_{⋃_{i=1}^n B_i} = ∑_{i=1}^n f 1_{B_i}

is a non-decreasing sequence of measurable functions, and lim_{n→∞} g_n = f 1_B. It follows by the monotone convergence theorem (Theorem 1.31) that

ν(B) = ∫_B f dμ = ∫_Ω f 1_B dμ = ∫_Ω (lim_{n→∞} g_n) dμ
= lim_{n→∞} ∫_Ω g_n dμ = ∑_{i=1}^∞ ∫_Ω f 1_{B_i} dμ = ∑_{i=1}^∞ ν(B_i).

Moreover, for B = ∅ we have 1_∅ = 0, so

ν(∅) = ∫_Ω f 1_∅ dμ = 0.

This completes the proof. □

If a measurable function f : Ω → R has a property everywhere except on some set B ∈ F such that μ(B) = 0, we say that it has this property μ-almost everywhere (μ-a.e. for short) or, particularly when μ is a probability measure, μ-almost surely (μ-a.s. for short). For instance, if μ({f ≠ 0}) = 0, we say that f = 0, μ-a.e.

Proposition 1.36
Let f : Ω → [0, ∞] be a non-negative measurable function. Then f = 0, μ-a.e. if and only if ∫_Ω f dμ = 0.

Proof Suppose that f = 0, μ-a.e., that is, μ(B) = 0 for B = {f > 0}. For every simple function s such that s ≤ f we have s = 0 on Ω \ B. Writing the simple function as s = ∑_{i=1}^n s_i 1_{A_i}, where A_1, . . . , A_n ∈ F and s_i ≥ 0 for i = 1, . . . , n, we then have μ(A_i) = 0 if A_i ⊂ B, and s_i = 0 otherwise. This means that ∫_Ω s dμ = ∑_{i=1}^n s_i μ(A_i) = 0 because μ(A_i) = 0 or s_i = 0 for each i. Because ∫_Ω s dμ = 0 for every simple function s such that s ≤ f, it follows that ∫_Ω f dμ = 0.

Conversely, if ∫_Ω f dμ = 0, then B = {f > 0} = ⋃_{n=1}^∞ B_n, where B_n = {f ≥ 1/n}. This is an increasing sequence of sets in F, so μ(B) = lim_{n→∞} μ(B_n). The simple function s_n = (1/n) 1_{B_n} satisfies s_n ≤ f, so

(1/n) μ(B_n) = ∫_Ω s_n dμ ≤ ∫_Ω f dμ = 0,

which means that μ(B_n) = 0 for all n, hence μ(B) = 0. Therefore f = 0, μ-a.e. □

Exercise 1.30 Let f : Ω → [−∞, ∞] be a measurable function. Show that ∫_B f dμ = 0 for every B ∈ F if and only if f = 0, μ-a.e.

The origin of the next proposition is the familiar change of variables formula for transforming Riemann integrals. In the case of integrals with respect to a measure we have the following change of measure result.

Proposition 1.37
Suppose that (Ω, F, μ) and (Ω′, F′, μ′) are measure spaces, and ϕ : Ω → Ω′ is a function such that ϕ^{−1}(A′) ∈ F and μ(ϕ^{−1}(A′)) = μ′(A′) for every A′ ∈ F′. If g = g′ ◦ ϕ is the composition of ϕ and a measurable function g′ : Ω′ → [−∞, ∞], then g : Ω → [−∞, ∞] is a measurable function, the integral ∫_Ω g dμ exists if and only if ∫_{Ω′} g′ dμ′ exists (including the cases when the integrals are equal to ∞ or −∞), and

∫_Ω g dμ = ∫_{Ω′} g′ dμ′.

Proof Suppose that s′ = ∑_{i=1}^n s_i 1_{A′_i} is a non-negative simple function on Ω′, where s_i ∈ [0, ∞) and where A′_i ∈ F′ for i = 1, . . . , n are disjoint sets such that ⋃_{i=1}^n A′_i = Ω′. Then ϕ^{−1}(A′_i) ∈ F for i = 1, . . . , n are disjoint sets such that ⋃_{i=1}^n ϕ^{−1}(A′_i) = Ω, and s = s′ ◦ ϕ = ∑_{i=1}^n s_i 1_{ϕ^{−1}(A′_i)} is a non-negative simple function on Ω. It follows that

∫_Ω s dμ = ∑_{i=1}^n s_i μ(ϕ^{−1}(A′_i)) = ∑_{i=1}^n s_i μ′(A′_i) = ∫_{Ω′} s′ dμ′.

Now suppose that g′ is a non-negative measurable function on Ω′. By Proposition 1.28 there is a non-decreasing sequence of non-negative simple functions s′_k, k = 1, 2, . . . on Ω′ such that g′ = lim_{k→∞} s′_k. Then s_k = s′_k ◦ ϕ is a non-decreasing sequence of non-negative simple functions on Ω and g = lim_{k→∞} s_k is a measurable function on Ω. It follows by the monotone convergence theorem (Theorem 1.31) that

∫_Ω g dμ = lim_{k→∞} ∫_Ω s_k dμ = lim_{k→∞} ∫_{Ω′} s′_k dμ′ = ∫_{Ω′} g′ dμ′.

In general, if g′ is a measurable function on Ω′, then g = g′ ◦ ϕ is a measurable function on Ω. We can write g′ = g′^+ − g′^− and, correspondingly, g = g^+ − g^−, where g′^+, g′^− are non-negative measurable functions on Ω′ and g^+ = g′^+ ◦ ϕ, g^− = g′^− ◦ ϕ are non-negative measurable functions on Ω. From the above argument we know that

∫_Ω g^+ dμ = ∫_{Ω′} g′^+ dμ′,    ∫_Ω g^− dμ = ∫_{Ω′} g′^− dμ′.

It follows that ∫_Ω g dμ exists if and only if ∫_{Ω′} g′ dμ′ exists, and

∫_Ω g dμ = ∫_Ω g^+ dμ − ∫_Ω g^− dμ = ∫_{Ω′} g′^+ dμ′ − ∫_{Ω′} g′^− dμ′ = ∫_{Ω′} g′ dμ′. □

Lebesgue integral

Our aim when constructing an integral with respect to a measure was to extend the Riemann integral. To achieve this, we now specialise to the measure space (R, B(R), m) with Lebesgue measure m defined on the Borel sets B(R).

Definition 1.38
Let f : R → [−∞, ∞].

(i) We say that f is Borel measurable whenever it is measurable as a function on the measure space (R, B(R), m).
(ii) The integral ∫_R f dm, whenever it exists (including the cases when it is equal to ∞ or −∞), is called the Lebesgue integral of f.
(iii) When the integral ∫_R f dm is finite, we say that f is Lebesgue integrable.

We need to make sure that the Lebesgue integral is indeed what we are looking for, that is, that it coincides with the Riemann integral of any continuous function over an interval.

Proposition 1.39
For any continuous function f : R → R and any numbers a ≤ b

∫_a^b f(x) dx = ∫_{[a,b]} f dm,

with the Riemann integral on the left-hand side and the Lebesgue integral on the right-hand side.

Proof It is enough to consider f ≥ 0. Otherwise we can consider f^+ and f^− separately, and then combine the results.

For any n = 1, 2, . . . , take c_i = a + (b − a)i2^{−n} so that a = c_0 < c_1 < · · · < c_{2^n} = b. For each i = 1, . . . , 2^n, since f is a continuous function, it has a minimum on [c_{i−1}, c_i], which we can write as f(x_i) for some x_i ∈ [c_{i−1}, c_i]. We put S_n = ∑_{i=1}^{2^n} f(x_i)(c_i − c_{i−1}). Then, by the definition of the Riemann integral,

∫_a^b f(x) dx = lim_{n→∞} S_n.

On the other hand,

s_n = ∑_{i=1}^{2^n} f(x_i) 1_{[c_{i−1}, c_i)} + f(b) 1_{{b}}

is a non-decreasing sequence of simple functions, and lim_{n→∞} s_n = f 1_{[a,b]}. Moreover, ∫_R s_n dm = S_n. By the monotone convergence theorem (Theorem 1.31),

∫_{[a,b]} f dm = ∫_R f 1_{[a,b]} dm = lim_{n→∞} ∫_R s_n dm = lim_{n→∞} S_n,

completing the proof. □

While the Lebesgue integral coincides with the Riemann integral for continuous functions integrated over intervals, it is in fact much more general and covers various other cases. Here are a couple of relatively simple examples.

Example 1.40
In Example 1.21 we saw that the Riemann integral ∫_0^1 f(x) dx does not exist when the function f is defined as

f(x) = 0 if x ∈ Q, 1 if x ∈ R \ Q.

However, the Lebesgue integral ∫_{[0,1]} f dm does exist and equals 1, as the function f is the indicator of the Borel measurable set [0, 1] \ Q, which has Lebesgue measure 1.

Exercise 1.31 Recall the Cantor set C (Exercise 1.7). Suppose that f : [0, 1] → R is defined by setting f(x) = 0 for all x in C, and f(x) = k for all x in each of the 2^{k−1} intervals of length 3^{−k} removed from [0, 1] in forming C_k. Calculate the Lebesgue integral ∫_{[0,1]} f dm and show that the Riemann integral ∫_0^1 f(x) dx does not exist.

Exercise 1.32 For any integrable function f : R → [−∞, ∞] and any a ∈ R show that the function g(x) = f(x − a) defined for all x ∈ R is integrable and

∫_R g dm = ∫_R f dm.

This is known as translation invariance of the Lebesgue integral.

Hint. Refer to Exercise 1.8 concerning translation invariance for the Lebesgue measure m.

More convergence results

In addition to the monotone convergence theorem, there are other powerful results concerning limits of integrals. It will be important to have these ready in our toolbox. Once again we work with a general measure space (Ω, F, μ), and begin with the following two inequalities.

Lemma 1.41 (Fatou lemmas)
Let f_n : Ω → [0, ∞] be measurable functions for n = 1, 2, . . . .

(i) The inequality

∫_Ω (lim inf_{n→∞} f_n) dμ ≤ lim inf_{n→∞} ∫_Ω f_n dμ

holds.
(ii) If, moreover, f_n ≤ g for all n, where g : Ω → [0, ∞] is integrable, then

lim sup_{n→∞} ∫_Ω f_n dμ ≤ ∫_Ω (lim sup_{n→∞} f_n) dμ.

Proof (i) Set g_k = inf_{n≥k} f_n. Then g_k for k = 1, 2, . . . is a non-decreasing sequence, and

lim_{k→∞} g_k = sup_{k≥1} g_k = sup_{k≥1} inf_{n≥k} f_n = lim inf_{n→∞} f_n.

Moreover, g_k ≤ f_n and so ∫_Ω g_k dμ ≤ ∫_Ω f_n dμ whenever k ≤ n. Hence, for each k ≥ 1

∫_Ω g_k dμ ≤ inf_{n≥k} ∫_Ω f_n dμ.

Because g_k for k = 1, 2, . . . is a non-decreasing sequence, so is ∫_Ω g_k dμ, and it follows by the monotone convergence theorem (Theorem 1.31) that

∫_Ω (lim inf_{n→∞} f_n) dμ = ∫_Ω (lim_{k→∞} g_k) dμ = lim_{k→∞} ∫_Ω g_k dμ = sup_{k≥1} ∫_Ω g_k dμ
≤ sup_{k≥1} inf_{n≥k} ∫_Ω f_n dμ = lim inf_{n→∞} ∫_Ω f_n dμ.

(ii) Let h_n = g − f_n (where we set h_n(ω) = 0 for any ω ∈ Ω such that g(ω) = f_n(ω) = ∞). The functions h_n : Ω → [0, ∞] are measurable for all n = 1, 2, . . . , and we can apply (i) to get

∫_Ω (lim inf_{n→∞} h_n) dμ ≤ lim inf_{n→∞} ∫_Ω h_n dμ.

Because g is integrable and 0 ≤ f_n ≤ g for each n, it follows that f_n is integrable for each n. Moreover, it follows that 0 ≤ lim sup_{n→∞} f_n ≤ g, and so lim sup_{n→∞} f_n is also integrable. As a result, by Exercise 1.27,

∫_Ω (lim inf_{n→∞} h_n) dμ = ∫_Ω (g − lim sup_{n→∞} f_n) dμ = ∫_Ω g dμ − ∫_Ω (lim sup_{n→∞} f_n) dμ

on the left-hand side of the inequality, and

lim inf_{n→∞} ∫_Ω h_n dμ = lim inf_{n→∞} ( ∫_Ω g dμ − ∫_Ω f_n dμ ) = ∫_Ω g dμ − lim sup_{n→∞} ∫_Ω f_n dμ

on the right-hand side, completing the proof. □

Example 1.42
In general, we cannot expect equality in Lemma 1.41 (i). On the measure space (R, B(R), m) let f_n = 1_{(n, n+1]} for n = 1, 2, . . . . Then lim_{n→∞} f_n = 0 because for any fixed real number x we can find n > x. Hence ∫_R (lim inf_{n→∞} f_n) dm = 0, while lim inf_{n→∞} ∫_R f_n dm = 1 since ∫_R f_n dm = 1 for each n = 1, 2, . . . .

Exercise 1.33 Let (Ω, F, P) be a probability space. Use Fatou's lemma to show that for any sequence of events A_n ∈ F, where n = 1, 2, . . . , we have

P(⋃_{n≥1} ⋂_{k≥n} A_k) ≤ lim inf_{n→∞} P(A_n).

In situations when we need to integrate the limit of a non-monotone sequence of functions, the following result often comes to the rescue.

Theorem 1.43 (dominated convergence)
Suppose that f_n : Ω → [−∞, ∞] are measurable functions for n = 1, 2, . . . and there is an integrable function g : Ω → [0, ∞] such that |f_n| ≤ g for each n. Suppose further that lim_{n→∞} f_n = f. Then f and f_n for each n are integrable, and

lim_{n→∞} ∫_Ω f_n dμ = ∫_Ω f dμ.    (1.10)

Proof Since |f_n| ≤ g for each n, it follows that |f| ≤ g, where g is integrable. This means that f and f_n for each n are integrable, and so are f − f_n and |f − f_n| because |f − f_n| ≤ |f| + |f_n| ≤ 2g. The second Fatou lemma (Lemma 1.41 (ii)) gives

lim sup_{n→∞} ∫_Ω |f_n − f| dμ ≤ ∫_Ω (lim sup_{n→∞} |f_n − f|) dμ = 0

since lim sup_{n→∞} |f_n − f| = lim_{n→∞} |f_n − f| = 0. This completes the proof because

|∫_Ω f_n dμ − ∫_Ω f dμ| = |∫_Ω (f_n − f) dμ| ≤ ∫_Ω |f_n − f| dμ,

where Exercise 1.26 is used in the last inequality. □

Example 1.44
For an example with no integrable dominating function, consider the sequence of functions on the measure space (R, B(R), m) defined by f_n = n 1_{(0, 1/n]} for n = 1, 2, . . . . Here g = sup_{n≥1} f_n satisfies g = n on the interval (1/(n+1), 1/n], so ∫_R g dm = ∑_{n=1}^∞ n (1/n − 1/(n+1)) = ∑_{n=1}^∞ 1/(n+1) = ∞. We have lim_{n→∞} f_n = 0 and ∫_R f_n dm = 1 for each n, which means that (1.10) fails in this case.

Exercise 1.34 Let f_n for n = 1, 2, . . . be a sequence of integrable functions and suppose that ∑_{n=1}^∞ ∫_Ω |f_n| dμ is finite. Show that the series ∑_{n=1}^∞ f_n converges μ-a.e., that its sum is an integrable function, and that

∫_Ω (∑_{n=1}^∞ f_n) dμ = ∑_{n=1}^∞ ∫_Ω f_n dμ.

Exercise 1.35 Use the previous exercise to calculate ∫_0^∞ x/(e^x − 1) dx.

Exercise 1.36 Prove the following version of the dominated convergence theorem.
Suppose we are given real numbers a < b and a function f : Ω × [a, b] → R such that ω ↦ f(ω, s) is measurable for each s ∈ [a, b]. Suppose further that for some fixed t ∈ [a, b] we have f(ω, t) = lim_{s→t} f(ω, s) for each ω ∈ Ω, and there is an integrable function g : Ω → R such that |f(ω, s)| ≤ g(ω) for each ω ∈ Ω and each s ∈ [a, b]. Then

∫_Ω f(ω, t) dμ(ω) = lim_{s→t} ∫_Ω f(ω, s) dμ(ω).

1.5 Lebesgue outer measure

Definition 1.45
For any $A \subset \mathbb{R}$ we define
$$m^*(A) = \inf\Big\{\sum_{k=1}^\infty l(J_k) : A \subset \bigcup_{k=1}^\infty J_k\Big\},$$
where the infimum is taken over all sequences $(J_k)_{k=1}^\infty$ consisting of open intervals. This is called Lebesgue outer measure.

The Lebesgue outer measure $m^*$ extends the function $m$ defined on $\mathcal{B}(\mathbb{R})$ by (1.2) to the family of all subsets of $\mathbb{R}$. We have
$$m(A) = m^*(A)$$
for each $A \in \mathcal{B}(\mathbb{R})$. Despite its name, Lebesgue outer measure is not a measure on the subsets of $\mathbb{R}$.

Proposition 1.46
The Lebesgue outer measure $m^*$ has the following properties:
(i) $m^*(\emptyset) = 0$;
(ii) $A \subset B$ implies $m^*(A) \le m^*(B)$ for any $A, B \subset \mathbb{R}$ (monotonicity);
(iii) $m^*(\bigcup_{i=1}^\infty A_i) \le \sum_{i=1}^\infty m^*(A_i)$ for any $A_i \subset \mathbb{R}$, i = 1, 2, . . . (countable subadditivity);
(iv) $m^*([a, b]) = m^*((a, b)) = m^*((a, b]) = m^*([a, b)) = b - a$ for each $a \le b$.


Proof (i) Take $J_k = (a, a) = \emptyset$ for k = 1, 2, . . . . This sequence covers $\emptyset$ by open intervals with total length 0. It follows that $m^*(\emptyset) = 0$.

(ii) If $A \subset B$ and $B$ is covered by a sequence $(J_k)_{k=1}^\infty$ of open intervals, then $A$ is covered by the same sequence, which implies that $m^*(A) \le m^*(B)$.

(iii) To prove countable subadditivity, let $\varepsilon > 0$ be given. For each i = 1, 2, . . . there is a sequence $(J_{i,k})_{k=1}^\infty$ covering $A_i$ by open intervals and such that
$$\sum_{k=1}^\infty l(J_{i,k}) < m^*(A_i) + \frac{\varepsilon}{2^i}.$$
Summing over $i$, we have
$$\sum_{i=1}^\infty \sum_{k=1}^\infty l(J_{i,k}) < \sum_{i=1}^\infty \Big(m^*(A_i) + \frac{\varepsilon}{2^i}\Big) = \sum_{i=1}^\infty m^*(A_i) + \varepsilon.$$
But the double sequence $(J_{i,k})_{i,k=1}^\infty$ covers $\bigcup_{i=1}^\infty A_i$ by open intervals, so by the definition of Lebesgue outer measure,
$$m^*\Big(\bigcup_{i=1}^\infty A_i\Big) \le \sum_{i=1}^\infty \sum_{k=1}^\infty l(J_{i,k}) < \sum_{i=1}^\infty m^*(A_i) + \varepsilon.$$

This argument works for an arbitrary $\varepsilon > 0$, which proves the claim.

(iv) Let $a \le b$. Take any $\varepsilon > 0$, and put $J_1 = (a - \frac{\varepsilon}{2}, b + \frac{\varepsilon}{2})$ and $J_k = (a, a) = \emptyset$ for k = 2, 3, . . . . This sequence covers $[a, b]$ by open intervals, so
$$m^*([a, b]) \le \sum_{k=1}^\infty l(J_k) = l(J_1) = b - a + \varepsilon.$$
This is so for every $\varepsilon > 0$, which implies that $m^*([a, b]) \le b - a$.

To prove that, on the other hand, $m^*([a, b]) \ge b - a$ take any $\varepsilon > 0$ and a covering $(J_k)_{k=1}^\infty$ of $[a, b]$ by open intervals $J_k = (c_k, d_k)$ such that
$$\sum_{k=1}^\infty l(J_k) < m^*([a, b]) + \varepsilon.$$
For any $x \ge a$ we say that $[a, x]$ has a finite subcover whenever $[a, x] \subset \bigcup_{k=1}^K J_k$ for some positive integer $K$. We are going to show that $[a, b]$ has a finite subcover. To this end, define
$$s = \sup\{x \ge a : [a, x] \text{ has a finite subcover}\}.$$
Since $a \in J_k$ for some $k$ and $J_k$ is an open interval, we have $[a, a + \varepsilon] \subset J_k$ for some $\varepsilon > 0$, implying that $s > a$. Now suppose that $s \le b$. Since $(J_k)_{k=1}^\infty$ covers $[a, b]$, we can find $i$ such that $s \in J_i$, and as $J_i$ is open we can find $x_1, x_2 \in J_i$ with $a < x_1 < s < x_2$. Now $[a, x_1]$ has a finite subcover since $x_1 < s$. But that subcover, together with $J_i$, gives a finite subcover of $[a, x_2]$, and since $x_2 > s$, this contradicts the definition of $s$. Therefore we cannot have $s \le b$. It means that $s > b$. As a result, we have shown that $[a, b]$ has a finite subcover $(J_k)_{k=1}^K$ for some positive integer $K$.

Let
$$c = \min_{k=1,\dots,K} c_k, \qquad d = \max_{k=1,\dots,K} d_k.$$
Because $[a, b] \subset \bigcup_{k=1}^K (c_k, d_k)$, we must have $c < a \le b < d$. It follows that
$$b - a < d - c \le \sum_{k=1}^K (d_k - c_k) = \sum_{k=1}^K l(J_k) \le \sum_{k=1}^\infty l(J_k) < m^*([a, b]) + \varepsilon.$$
This must be so for every $\varepsilon > 0$, implying that $b - a \le m^*([a, b])$. We can conclude that $b - a = m^*([a, b])$.

Next, using (ii), (iii), (iv), we have
$$m^*((a, b)) \le m^*([a, b]) = m^*([a, a] \cup (a, b) \cup [b, b])$$
$$\le m^*([a, a]) + m^*((a, b)) + m^*([b, b])$$
$$= (a - a) + m^*((a, b)) + (b - b)$$
$$= m^*((a, b)),$$
so that $m^*((a, b)) = m^*([a, b])$. A similar argument shows that $m^*((a, b)) = m^*([a, b))$ and $m^*((a, b)) = m^*((a, b])$. □

Definition 1.47
A set $A \subset \mathbb{R}$ is said to be $m^*$-measurable if
$$m^*(E) \ge m^*(E \cap A) + m^*(E \cap (\mathbb{R} \setminus A)) \qquad (1.11)$$
for every $E \subset \mathbb{R}$. The collection of all $m^*$-measurable sets is denoted by $\mathcal{M}$.

Remarkably, as we shall see, this property will suffice for countable additivity.

Proposition 1.48
The following properties hold:
(i) $\mathcal{M}$ is a σ-field on $\mathbb{R}$;
(ii) $m^*$ restricted to $\mathcal{M}$ is a measure;
(iii) every interval $(a, b)$ belongs to $\mathcal{M}$, that is, $\mathcal{I} \subset \mathcal{M}$.


Proof (i) We already know that $m^*(\emptyset) = 0$, so for every $E \subset \mathbb{R}$
$$m^*(E) = m^*(E) + m^*(\emptyset) = m^*(E \cap \mathbb{R}) + m^*(E \cap (\mathbb{R} \setminus \mathbb{R})),$$
which means that $\mathbb{R} \in \mathcal{M}$. Moreover, because $A = \mathbb{R} \setminus (\mathbb{R} \setminus A)$, it follows from (1.11) that $A \in \mathcal{M}$ implies $\mathbb{R} \setminus A \in \mathcal{M}$.

Now let $A, B \in \mathcal{M}$. For any $E \subset \mathbb{R}$ we have by (1.11)
$$m^*(E) \ge m^*(E \cap A) + m^*(E \cap (\mathbb{R} \setminus A))$$
$$\ge m^*(E \cap A \cap B) + m^*(E \cap A \cap (\mathbb{R} \setminus B)) + m^*(E \cap (\mathbb{R} \setminus A) \cap B) + m^*(E \cap (\mathbb{R} \setminus A) \cap (\mathbb{R} \setminus B)),$$
where in the second inequality we use $E \cap A$ and $E \cap (\mathbb{R} \setminus A)$, respectively, in place of $E$ in (1.11). Since $A \cup B \subset (A \cap B) \cup (A \cap (\mathbb{R} \setminus B)) \cup ((\mathbb{R} \setminus A) \cap B)$, by the subadditivity of $m^*$, the sum of the first three terms is at least $m^*(E \cap (A \cup B))$. In the final term $(\mathbb{R} \setminus A) \cap (\mathbb{R} \setminus B) = \mathbb{R} \setminus (A \cup B)$. As a result,
$$m^*(E) \ge m^*(E \cap (A \cup B)) + m^*(E \cap (\mathbb{R} \setminus (A \cup B))),$$
which shows that $A \cup B \in \mathcal{M}$. We have shown that $A, B \in \mathcal{M}$ implies $A \cup B \in \mathcal{M}$. By induction, this extends to any finite number of sets. If $A_i \in \mathcal{M}$ for i = 1, . . . , n, then $\bigcup_{i=1}^n A_i \in \mathcal{M}$.

Finally, take any sequence $A_i \in \mathcal{M}$ for i = 1, 2, . . . , and put $D_n = \bigcup_{i=1}^n A_i$ and $D = \bigcup_{i=1}^\infty A_i$. It follows that $D_n \in \mathcal{M}$, so for any $E \subset \mathbb{R}$
$$m^*(E) \ge m^*(E \cap D_n) + m^*(E \cap (\mathbb{R} \setminus D_n)).$$
Clearly $D_n \subset D$, so $\mathbb{R} \setminus D_n \supset \mathbb{R} \setminus D$, and by the monotonicity of $m^*$
$$m^*(E \cap (\mathbb{R} \setminus D_n)) \ge m^*(E \cap (\mathbb{R} \setminus D)).$$
Next put $B_1 = A_1$ and $B_n = A_n \setminus D_{n-1}$ for n = 2, 3, . . . . From what has already been shown it follows that $B_n \in \mathcal{M}$ for all n. Using $E \cap D_n$ in place of $E$ in (1.11), we get
$$m^*(E \cap D_n) \ge m^*(E \cap D_n \cap B_n) + m^*(E \cap D_n \cap (\mathbb{R} \setminus B_n)) = m^*(E \cap B_n) + m^*(E \cap D_{n-1}).$$
We can repeat this for $m^*(E \cap D_i)$ with i = n − 1, n − 2, . . . , 1 and obtain
$$m^*(E \cap D_n) \ge \sum_{i=1}^n m^*(E \cap B_i).$$
It follows that
$$m^*(E) \ge \sum_{i=1}^n m^*(E \cap B_i) + m^*(E \cap (\mathbb{R} \setminus D))$$


for each n, and so
$$m^*(E) \ge \sum_{i=1}^\infty m^*(E \cap B_i) + m^*(E \cap (\mathbb{R} \setminus D)) \qquad (1.12)$$
$$\ge m^*\Big(E \cap \bigcup_{i=1}^\infty B_i\Big) + m^*(E \cap (\mathbb{R} \setminus D))$$
$$= m^*(E \cap D) + m^*(E \cap (\mathbb{R} \setminus D)),$$
where the second inequality is due to the countable subadditivity of $m^*$. We have shown that $D = \bigcup_{i=1}^\infty A_i \in \mathcal{M}$, completing the proof that $\mathcal{M}$ is a σ-field.

(ii) We already know that $m^*(\emptyset) = 0$. Let $A_i \in \mathcal{M}$ for i = 1, 2, . . . be a sequence of pairwise disjoint sets. Then $B_i = A_i$ for each $i$ and (1.12) with $E = D = \bigcup_{i=1}^\infty A_i$ gives
$$m^*(D) \ge \sum_{i=1}^\infty m^*(A_i).$$
Countable subadditivity, see Proposition 1.46 (iii), gives the reverse inequality, proving that $m^*$ is countably additive, and hence it is a measure on $\mathcal{M}$.

(iii) Let $\varepsilon > 0$ be given. There is a sequence $(J_k)_{k=1}^\infty$ of open intervals covering $E$ such that $m^*(E) + \varepsilon \ge \sum_{k=1}^\infty m^*(J_k)$. By subadditivity
$$m^*(E \cap (a, b)) + m^*(E \cap (\mathbb{R} \setminus (a, b)))$$
$$\le m^*(E \cap (a, b)) + m^*(E \cap (-\infty, a]) + m^*(E \cap [b, \infty))$$
$$\le \sum_{k=1}^\infty \big[m^*(J_k \cap (a, b)) + m^*(J_k \cap (-\infty, a]) + m^*(J_k \cap [b, \infty))\big].$$
Inside the square brackets we have the lengths of the disjoint intervals $J_k \cap (a, b)$, $J_k \cap (-\infty, a]$, $J_k \cap [b, \infty)$, which add up to give the length $l(J_k)$ of $J_k$. Hence
$$m^*(E \cap (a, b)) + m^*(E \cap (\mathbb{R} \setminus (a, b))) \le \sum_{k=1}^\infty l(J_k) \le m^*(E) + \varepsilon.$$
Since this holds for all $\varepsilon > 0$, we have shown that (1.11) with $A = (a, b)$ holds for every $E \subset \mathbb{R}$, which means that $(a, b) \in \mathcal{M}$. It follows that $\mathcal{I} \subset \mathcal{M}$. □

Finally, we are ready to prove Theorem 1.19.


Theorem 1.19
$m : \mathcal{B}(\mathbb{R}) \to [0,\infty]$ is a measure such that $m((a, b)) = b - a$ for all $a \le b$.

Proof Because $\mathcal{M}$ is a σ-field and $\mathcal{I} \subset \mathcal{M}$, we know that $\mathcal{B}(\mathbb{R}) \subset \mathcal{M}$ (see Exercise 1.3). Because $m^*$ is a measure on $\mathcal{M}$ and $m(A) = m^*(A)$ for every $A \in \mathcal{B}(\mathbb{R})$, it follows that $m$ is a measure on $\mathcal{B}(\mathbb{R})$. Moreover, for any $a \le b$, since $(a, b) \in \mathcal{B}(\mathbb{R})$, we have $m((a, b)) = m^*((a, b)) = b - a$. □


2

Probability distributions and random variables

2.1 Probability distributions
2.2 Random variables
2.3 Expectation and variance
2.4 Moments and characteristic functions

We again motivate our discussion through simple examples of pricing models. In such applications, we often have information about the probability distribution of future prices, so it is natural to begin our analysis with distribution functions and densities. Then we look at measurable functions defined on probability spaces, commonly known as random variables, and the probability distributions associated with them. One often has to infer the structure of the distribution from simpler data, such as the expectation, variance or higher moments, and methods for computing these thus play a major role. Finally, we introduce characteristic functions as a vehicle for analysing distributions and computing the moments of a given random variable. The full power of characteristic functions will become apparent in Chapter 5, where it will be shown that the characteristic function of a random variable determines its distribution.

2.1 Probability distributions

In Chapter 1 we looked at some examples of probabilities. These included the uniform, binomial, Poisson and geometric probabilities in a discrete setting, and the probabilities associated with the normal and log-normal densities in a continuous setting. These examples can be revisited using the notions of probability distribution and distribution function, which can also be used to describe a multitude of other useful probabilities in a unified manner.

Figure 2.1 Distribution function for the binomial distribution in Example 2.2 with N = 10 and p = 0.5.

Definition 2.1
A probability distribution is by definition a probability measure $P$ on $\mathbb{R}$ defined on the σ-field of Borel sets $\mathcal{B}(\mathbb{R})$. The function $F : \mathbb{R} \to [0, 1]$ defined as
$$F(x) = P((-\infty, x])$$
for each $x \in \mathbb{R}$ is called the (cumulative) distribution function.

Example 2.2
Let $0 < p < 1$ and fix N = 1, 2, . . . . For any $A \in \mathcal{B}(\mathbb{R})$ let
$$P(A) = \sum_{n=0}^N \binom{N}{n} p^n (1 - p)^{N-n} 1_A(n).$$
This defines a probability measure on $\mathbb{R}$, called the binomial distribution with parameters $N, p$. It corresponds to the binomial probability defined in Example 1.13. The corresponding distribution function is piecewise constant:
$$F(x) = \begin{cases} 0 & \text{for } x < 0, \\ \sum_{n=0}^k \binom{N}{n} p^n (1 - p)^{N-n} & \text{for } k = 0, 1, \dots, N-1 \text{ and } k \le x < k + 1, \\ 1 & \text{for } N \le x. \end{cases}$$
This function is shown in Figure 2.1. The dots represent the values $F(x)$ at x = 0, 1, . . . , 10, where the distribution function has discontinuities.
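To make the piecewise constant shape of $F$ concrete, here is a short sketch (an illustration added for this text, not part of the original; it assumes plain Python) that evaluates the binomial distribution function at an arbitrary point x for N = 10 and p = 0.5.

    import math

    def binomial_cdf(x, N=10, p=0.5):
        # F(x) = sum of binomial probabilities over all n <= x; 0 for x < 0, 1 for x >= N
        if x < 0:
            return 0.0
        k = min(int(math.floor(x)), N)
        return sum(math.comb(N, n) * p**n * (1 - p)**(N - n) for n in range(k + 1))

    # Jumps occur exactly at the integers 0, 1, ..., N:
    print(binomial_cdf(4.999), binomial_cdf(5.0))  # the value at 5 includes the jump (right-continuity)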

Figure 2.2 Distribution function for the Poisson distribution in Example 2.3 with λ = 2.

Example 2.3
Let λ > 0. The probability measure on $\mathbb{R}$ defined by
$$P(A) = \sum_{n=0}^\infty \frac{\lambda^n}{n!} e^{-\lambda} 1_A(n)$$
for any $A \in \mathcal{B}(\mathbb{R})$ is called the Poisson distribution with parameter λ. Compare this with the Poisson probability in Example 1.8. The corresponding distribution function
$$F(x) = \begin{cases} 0 & \text{for } x < 0, \\ \sum_{n=0}^k \frac{\lambda^n}{n!} e^{-\lambda} & \text{for } k = 0, 1, \dots \text{ and } k \le x < k + 1 \end{cases}$$
is depicted in Figure 2.2. The dots represent the values $F(x)$ at x = 0, 1, . . . , where the distribution function has discontinuities.

Definition 2.4
We say that a probability distribution $P$ is discrete whenever there is a sequence $x_1, x_2, \dots \in \mathbb{R}$ such that $\sum_{n=1}^\infty P(\{x_n\}) = 1$.

Example 2.5
The binomial distribution and the Poisson distribution are examples of discrete probability distributions.

According to Theorem 1.35, if $f : \mathbb{R} \to [0,\infty]$ is Borel measurable, then the integral $\int_B f\, dm$, considered as a function of $B \in \mathcal{B}(\mathbb{R})$, is a measure. If, in addition, $\int_{\mathbb{R}} f\, dm = 1$, then this is a probability measure. We have seen two examples of such functions $f$, the normal density (Example 1.23) and the log-normal density (Example 1.24).

Figure 2.3 Distribution function for the normal distribution in Example 2.8 (solid line) and for the log-normal distribution in Example 2.9 (broken line).

Definition 2.6
Any Borel measurable function $f : \mathbb{R} \to [0,\infty]$ such that $\int_{\mathbb{R}} f\, dm = 1$ is called a probability density.

When $f$ is a probability density, then
$$P(B) = \int_B f\, dm$$
defined for any Borel set $B \in \mathcal{B}(\mathbb{R})$ is a probability distribution with distribution function
$$F(x) = P((-\infty, x]) = \int_{(-\infty, x]} f\, dm.$$

Definition 2.7
We say that a probability distribution $P$ is continuous (sometimes referred to as absolutely continuous) if there is a density function $f$ such that for each $B \in \mathcal{B}(\mathbb{R})$
$$P(B) = \int_B f\, dm.$$

Example 2.8
The normal distribution is the probability distribution corresponding to the normal density specified in Example 1.23. See Figure 2.3 for the distribution function in this case.


Example 2.9
The probability distribution with the log-normal density in Example 1.24 is referred to as the log-normal distribution. The corresponding distribution function is shown as a broken line in Figure 2.3.

Example 2.10
The density
$$f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \ge 0, \\ 0 & \text{if } x < 0, \end{cases} \qquad (2.1)$$
yields the so-called exponential distribution with parameter λ > 0. The corresponding distribution function is given by
$$F(x) = \begin{cases} 1 - e^{-\lambda x} & \text{if } x \ge 0, \\ 0 & \text{if } x < 0, \end{cases}$$
and shown in Figure 2.4.
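As a quick numerical sanity check (an illustration added here, assuming Python with SciPy available), one can verify that integrating the density (2.1) up to a point t reproduces F(t) = 1 − e^{−λt}.

    import math
    from scipy.integrate import quad

    lam, t = 1.0, 2.5
    integral, _ = quad(lambda x: lam * math.exp(-lam * x), 0.0, t)  # integral of the density over [0, t]
    print(integral, 1 - math.exp(-lam * t))                          # both are approximately 0.9179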

The following proposition lists some properties shared by all distribution functions.

Proposition 2.11
The distribution function $F$ of any probability distribution has the following properties:
(i) $F(x) \le F(y)$ for every $x \le y$ ($F$ is non-decreasing);
(ii) $\lim_{x \searrow a} F(x) = F(a)$ for each $a \in \mathbb{R}$ ($F$ is right-continuous);
(iii) $\lim_{x \to -\infty} F(x) = 0$;
(iv) $\lim_{x \to \infty} F(x) = 1$.

Proof Let $F$ be the distribution function of a probability distribution $P$, that is, let $F(x) = P((-\infty, x])$ for each $x \in \mathbb{R}$.

(i) Note that $x \le y$ implies $(-\infty, x] \subset (-\infty, y]$, so $P((-\infty, x]) \le P((-\infty, y])$ by Theorem 1.11 (iii).

(ii) Take any non-increasing sequence of numbers $x_n$ such that $x_n > a$ and $\lim_{n\to\infty} x_n = a$. Then $(-\infty, a] = \bigcap_{n=1}^\infty (-\infty, x_n]$, and $P((-\infty, a]) = P\big(\bigcap_{n=1}^\infty (-\infty, x_n]\big) = \lim_{n\to\infty} P((-\infty, x_n])$ by Theorem 1.11 (vii).

(iii) If $x_n$ is a non-increasing sequence of numbers such that $\lim_{n\to\infty} x_n = -\infty$, then $\emptyset = \bigcap_{n=1}^\infty (-\infty, x_n]$, so $0 = P\big(\bigcap_{n=1}^\infty (-\infty, x_n]\big) = \lim_{n\to\infty} P((-\infty, x_n])$ once again by Theorem 1.11 (vii).

(iv) Using Theorem 1.11 (v) with $B_n = (-\infty, x_n]$, where $x_n$ is a non-decreasing sequence such that $\lim_{n\to\infty} x_n = \infty$, we have $\mathbb{R} = \bigcup_{n=1}^\infty (-\infty, x_n]$, and so $1 = P\big(\bigcup_{n=1}^\infty (-\infty, x_n]\big) = \lim_{n\to\infty} P((-\infty, x_n])$. □

Figure 2.4 Distribution function for the exponential distribution in Example 2.10 with λ = 1.

In general, $F$ does not have to be continuous. Because $F$ is non-decreasing, the left limit $F(a-) = \lim_{x \nearrow a} F(x)$ exists for each $a \in \mathbb{R}$. We have
$$P(\{a\}) = P\Big(\bigcap_{n=1}^\infty \big(a - \tfrac{1}{n}, a\big]\Big) = \lim_{n\to\infty} P\big(\big(a - \tfrac{1}{n}, a\big]\big) = \lim_{n\to\infty} \big(F(a) - F\big(a - \tfrac{1}{n}\big)\big) = F(a) - F(a-).$$
If $F$ has a discontinuity at $a$, then $F(a-) < F(a)$, and so $P(\{a\}) > 0$. On the other hand, if $F$ is continuous at $a$, then $F(a-) = F(a)$ and $P(\{a\}) = 0$.

Exercise 2.1 Suppose $F$ is a distribution function. Show that there are at most countably many $a \in \mathbb{R}$ such that $F(a-) < F(a)$. (We say that $F$ has at most countably many jump discontinuities.)


Example 2.12
Suppose that $F$ is a piecewise constant distribution function with a finite number of jumps at points $a_1 < a_2 < \cdots < a_N$ with $F(a_n) - F(a_n-) = p_n$, where $\sum_{n=1}^N p_n = 1$. So
$$F(x) = \begin{cases} 0 & \text{for } x < a_1, \\ \sum_{n=1}^k p_n & \text{for } k = 1, \dots, N-1 \text{ and } a_k \le x < a_{k+1}, \\ 1 & \text{for } a_N \le x. \end{cases}$$
The corresponding probability distribution satisfies $P(\{a_n\}) = p_n$ for each n, and $P(\{a_1, \dots, a_N\}) = 1$. It is a discrete probability distribution concentrated on a finite set. Example 2.2 falls into this category.

Example 2.13
In particular, if N = 1 and $a_1 = a$ for some $a \in \mathbb{R}$, we have
$$F(x) = 1_{[a,\infty)}(x).$$
The corresponding probability distribution is the unit mass concentrated at $a$ (see Example 1.12),
$$\delta_a(A) = \begin{cases} 1 & \text{if } a \in A, \\ 0 & \text{if } a \notin A, \end{cases}$$
for any Borel set $A \in \mathcal{B}(\mathbb{R})$.

Example 2.14
Let $F_{\text{log-norm}}$ denote the log-normal distribution function from Example 2.9. Fix a number $0 < p < 1$ and write
$$F(x) = \begin{cases} 0 & \text{for } x < 0, \\ p + (1 - p) F_{\text{log-norm}}(x) & \text{for } x \ge 0. \end{cases}$$
This is a distribution function, with a jump of size $p$ at 0. The resulting probability distribution is
$$P = p\delta_0 + (1 - p) P_{\text{log-norm}},$$
where $\delta_0$ is the unit mass concentrated at 0, and where $P_{\text{log-norm}}$ is the log-normal distribution with density given in Example 1.24. Here we have a mixture of discrete and continuous distributions. It can be viewed as a model of the stock price for a company which can suffer bankruptcy with probability $p$. The graph of $F$ with p = 0.3 is shown in Figure 2.5. The dot represents the value $F(x)$ at x = 0, where the distribution function has a discontinuity.

Figure 2.5 Distribution function in Example 2.14 with p = 0.3.

Exercise 2.2 Let $P_1, P_2, \dots$ be probability measures defined on a σ-field $\mathcal{F}$, and let $a_1, a_2, \dots > 0$. What condition on the sequence $a_n$ is needed to ensure that $P = \sum_{n=1}^\infty a_n P_n$ is a probability measure?

Remark 2.15
The converse of Proposition 2.11 is also true. If $F : \mathbb{R} \to [0, 1]$ satisfies conditions (i)–(iv) of this proposition, then $F$ is the distribution function of some probability distribution $P$. This provides a simple and convenient way to describe a probability distribution by specifying its distribution function.

2.2 Random variables

Derivative securities (also called contingent claims) play an important role in finance. They represent financial assets whose present value is determined by some future payoff. In the case of European derivative securities the payoff is available at a prescribed future time $T$ and depends on the value $S(T)$ at that time of a stock or some other risky asset, called the underlying security. We can write the payoff as $H = h(S(T))$ for some function $h$. For example, with $h(x) = (x - K)^+$ we have a (European) call option, while $h(x) = (K - x)^+$ gives a put option. Here $K$ is called the strike price.

An important step in studying derivative securities is to build a model of the future values of $S(T)$. Various such models have been proposed. Suppose that we have chosen a probability space $(\Omega, \mathcal{F}, P)$ where the set Ω represents all possible values of $S(T)$. It is natural to ask for the probability $P(\{H \in B\})$ that the payoff $H$ takes values in some Borel set $B \in \mathcal{B}(\mathbb{R})$, for example in an interval $B = [a, b]$. The set $\{H \in B\}$ should belong to the domain of $P$ (which is $\mathcal{F}$, the σ-field of events). Functions with this property were called measurable in Definition 1.27. When working with probability spaces we call them random variables.

Definition 2.16
A random variable is a function $X : \Omega \to \mathbb{R}$ such that for each Borel set $B \in \mathcal{B}(\mathbb{R})$
$$\{X \in B\} \in \mathcal{F}.$$
The family of all sets of the form $\{X \in B\}$ for some $B \in \mathcal{B}(\mathbb{R})$ is denoted by $\sigma(X)$ and is called the σ-field generated by $X$. We can see that $X$ is a random variable if and only if $\sigma(X) \subset \mathcal{F}$.

Exercise 2.3 Show that σ(X) is indeed a σ-field.

If $h : \mathbb{R} \to \mathbb{R}$ and $X : \Omega \to \mathbb{R}$, we shall often write $h(X)$ to denote the composition $h \circ X$. In fact this has already been done implicitly in the expression $h(S(T))$ above.

If $h$ is a Borel measurable function, then $h(X)$ is measurable with respect to $\sigma(X)$ since $h^{-1}(B)$ is a Borel set for any $B \in \mathcal{B}(\mathbb{R})$, and so
$$\{h(X) \in B\} = \{X \in h^{-1}(B)\} \in \sigma(X).$$
In particular, when $X$ is a random variable, it follows that $h(X)$ is also a random variable.

It turns out that all functions measurable with respect to $\sigma(X)$ are of the form $h(X)$ for some Borel measurable function $h$.


Exercise 2.4 Show that $Y : \Omega \to \mathbb{R}$ is measurable with respect to $\sigma(X)$ if and only if there is a Borel measurable function $h : \mathbb{R} \to \mathbb{R}$ such that $Y = h(X)$.

Example 2.17
The payoff functions of derivative securities provide many examples of random variables. For European options we can express the payoff in the form $h(S(T))$, so we need only to show that the function $h$ is Borel measurable. Familiar examples are call or put options with strike $K$, whose payoff functions are $h(x) = (x - K)^+$ and $h(x) = (K - x)^+$ respectively. Other popular options include the following.

(1) A bottom straddle, which consists of buying a call and a put with the same strike $K$, so that the payoff is $h(x) = |x - K|$.

(2) A strangle, where we buy a call and a put with different strikes $K_1 < K_2$, so that
$$h(x) = (x - K_1)^+ + (K_2 - x)^+.$$
This reduces to a straddle when $K_1 = K_2$.

(3) A bull spread, consisting of two calls, one long and one short, with strikes $K_1 < K_2$, so that
$$h(x) = (x - K_1)^+ - (x - K_2)^+ = \begin{cases} 0 & \text{if } x < K_1, \\ x - K_1 & \text{if } K_1 \le x \le K_2, \\ K_2 - K_1 & \text{if } x > K_2. \end{cases}$$

(4) A butterfly spread, where we buy calls at strikes $K_1 < K_3$ and sell two calls at strike $K_2 = \frac{1}{2}(K_1 + K_3)$. You should verify that the payoff is zero outside $(K_1, K_3)$ and equals $x - K_1$ on $[K_1, K_2]$ and $K_3 - x$ on $[K_2, K_3]$.

In all these cases the Borel measurability of $h$ follows at once from the exercises in Chapter 1, so the payoffs are random variables.
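For readers who like to experiment, here is a small sketch (added for illustration; it assumes plain Python and is not part of the original text) expressing these payoff functions; each is built from the positive-part function, which is Borel measurable.

    def pos(x):
        # positive part x^+ = max(x, 0)
        return max(x, 0.0)

    def call(x, K):          return pos(x - K)
    def put(x, K):           return pos(K - x)
    def straddle(x, K):      return call(x, K) + put(x, K)          # |x - K|
    def strangle(x, K1, K2): return call(x, K1) + put(x, K2)
    def bull_spread(x, K1, K2): return call(x, K1) - call(x, K2)
    def butterfly(x, K1, K3):
        K2 = 0.5 * (K1 + K3)
        return call(x, K1) - 2 * call(x, K2) + call(x, K3)

    print(bull_spread(12, 8, 10))   # 2 = K2 - K1, since 12 > K2
    print(butterfly(9, 8, 10))      # 1 = x - K1 on [K1, K2]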

Definition 2.18
With every random variable $X$ we can associate a probability distribution $P_X$, called the distribution of $X$, defined as
$$P_X(B) = P(\{X \in B\})$$
for any Borel set $B \in \mathcal{B}(\mathbb{R})$. The corresponding distribution function
$$F_X(x) = P_X((-\infty, x])$$
defined for each $x \in \mathbb{R}$ is called the (cumulative) distribution function of the random variable $X$.

Since we will frequently work with probabilities of the form $P(\{X \in B\})$, from now on we condense these to $P(X \in B)$ for ease of notation when there is no risk of ambiguity.

Exercise 2.5 Let $X$ be a random variable modelling a single coin toss, with two possible outcomes: 1 (heads) or −1 (tails). For a fair coin, the probabilities are $P(X = 1) = P(X = -1) = \frac{1}{2}$. Sketch the distribution function $F_X$.

Exercise 2.6 Let $X$ be the number of tosses of a fair coin up to and including the first toss showing heads. Compute and sketch the distribution function $F_X$.

Exercise 2.7 Suppose that $X_n$ for each n = 1, 2, . . . is a random variable having the binomial distribution with parameters $n, p$ (see Example 2.2), where $p = \frac{\lambda}{n}$ for some λ > 0. Find $P_{X_n}(\{k\})$ for k = 0, 1, . . . , and determine $\lim_{n\to\infty} P_{X_n}(\{k\})$.

The two classes of random variables that will mainly concern us can be distinguished by the nature of their distributions.

Definition 2.19
We say that a random variable $X$ is discrete if it has a discrete probability distribution $P_X$, that is, if there is a sequence $x_1, x_2, \dots \in \mathbb{R}$ such that
$$\sum_{n=1}^\infty P_X(\{x_n\}) = 1.$$

Definition 2.20
We say that a random variable $X$ is continuous if there exists an integrable function $f_X : \mathbb{R} \to [0,\infty]$ such that for every Borel set $B \in \mathcal{B}(\mathbb{R})$
$$P_X(B) = \int_B f_X\, dm.$$
We call $f_X$ the density of $X$.

In these cases the distribution function $F_X$ can be expressed as follows: for any $y \in \mathbb{R}$

(i) if $X$ is discrete,
$$F_X(y) = \sum_{x_n \le y} p_n, \quad \text{where } p_n = P_X(\{x_n\});$$
(ii) if $X$ is continuous,
$$F_X(y) = \int_{(-\infty, y]} f_X\, dm.$$

Example 2.21
In Example 1.9 we saw that $p_n = (1 - p)^{n-1} p$ (where $0 < p < 1$) defines a probability on $\Omega = \{1, 2, \dots\}$. This gives rise to a random variable $X(n) = n$ for n = 1, 2, . . . with geometric distribution $P_X(\{n\}) = P(X = n) = p_n$. Since $\sum_{n=1}^\infty p_n = 1$, it is clear that $X$ is a discrete random variable.

Exercise 2.8 As in Example 1.9, we can think of 1, 2, . . . as trading dates, and regard $p$ and $1 - p$ as the probabilities of upward/downward price moves on any given trading date. Let $Y$ be the number of trading dates needed to record $r$ upward price moves. Show that $P_Y(\{n\}) = \binom{n-1}{r-1} p^r (1 - p)^{n-r}$ for n = r, r + 1, . . . , and verify that this is a probability distribution. It is called the negative binomial distribution.

Example 2.22
We say that a random variable $X$ has normal distribution if the probability density takes the form introduced in Example 1.23:
$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}},$$
for some $\mu \in \mathbb{R}$ and $\sigma > 0$. In this case we say simply that $X$ has the $N(\mu, \sigma^2)$ distribution, abbreviated to $X \sim N(\mu, \sigma^2)$. The corresponding distribution function is
$$F_X(x) = \int_{-\infty}^x \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(y-\mu)^2}{2\sigma^2}}\, dy.$$
Since $f_X(x) > 0$ for all $x$, it follows that $F_X : \mathbb{R} \to [0, 1]$ is strictly increasing, hence invertible.

In particular, if $\mu = 0$ and $\sigma = 1$ we use the notation $N(x)$ for $F_X(x)$, and say that the random variable $X \sim N(0, 1)$ has the standard normal distribution.

Exercise 2.9 Given a random variable $X$ with standard normal distribution $N(0, 1)$, let $Y = \mu + \sigma X$ for some $\mu, \sigma \in \mathbb{R}$ such that $\sigma > 0$. Show that $Y$ has the normal distribution $N(\mu, \sigma^2)$.

Example 2.23
With $X \sim N(\mu, \sigma^2)$, write $Y(x) = F_X^{-1}(x)$. This function is a random variable on the probability space $[0, 1]$ with uniform probability given by the Lebesgue measure $m$. The random variable $Y$ has the same normal distribution $N(\mu, \sigma^2)$ since
$$F_Y(a) = m(\{x \in [0, 1] : Y(x) \le a\}) = m(\{x \in [0, 1] : F_X^{-1}(x) \le a\}) = m(\{x \in [0, 1] : x \le F_X(a)\}) = m([0, F_X(a)]) = F_X(a).$$
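This is the inverse transform method in miniature: composing the inverse distribution function with a uniform draw from [0, 1] produces a sample with the target distribution. The sketch below (an added illustration, assuming Python with NumPy and SciPy available) applies it to N(μ, σ²).

    import numpy as np
    from scipy.stats import norm

    mu, sigma = 0.5, 2.0
    rng = np.random.default_rng(0)
    u = rng.uniform(0.0, 1.0, size=100_000)      # uniform probability on [0, 1]
    y = norm.ppf(u, loc=mu, scale=sigma)         # Y = F_X^{-1}(u), the quantile function

    print(y.mean(), y.std())                      # close to mu and sigma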

The distributions (and the densities, when they exist) of simple algebraic functions of $X$ can often be found directly.

Example 2.24
Let $X$ be a random variable and let $Y = aX + b$ for some $a \ne 0$ and $b \in \mathbb{R}$. We find the distribution function $F_Y$ in terms of $F_X$. For $a > 0$
$$F_Y(x) = P(aX + b \le x) = P\Big(X \le \frac{x - b}{a}\Big) = F_X\Big(\frac{x - b}{a}\Big).$$


On the other hand, for $a < 0$
$$F_Y(x) = P(aX + b \le x) = P\Big(X \ge \frac{x - b}{a}\Big) = 1 - P\Big(X < \frac{x - b}{a}\Big) = 1 - \lim_{y \nearrow \frac{x-b}{a}} F_X(y) = 1 - F_X\Big(\frac{x - b}{a}-\Big).$$

Exercise 2.10 Suppose that $X$ has continuous distribution with density $f_X$, and let $Y = aX + b$ for some $a, b \in \mathbb{R}$ such that $a \ne 0$. Show that
$$f_Y(x) = \frac{1}{|a|} f_X\Big(\frac{x - b}{a}\Big).$$

Exercise 2.11 Suppose that $X$ is a random variable having uniform distribution on the interval $[-1, 1]$, i.e. such that the density of $X$ is $f_X = \frac{1}{2} 1_{[-1,1]}$. Find the distribution function of $Y = \frac{1}{X}$.

Example 2.25
We say that a random variable $Y > 0$ has log-normal distribution if $X = \ln Y$ has normal distribution. We shall find the density of $Y$. Take any $0 < a < b$, which is sufficient because $Y = e^X$ takes only positive values, and employ the normal density $f_X$:
$$P(a \le Y \le b) = P(\ln a \le X \le \ln b) = \int_{\ln a}^{\ln b} \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx = \int_a^b \frac{1}{y\sigma\sqrt{2\pi}} e^{-\frac{(\ln y - \mu)^2}{2\sigma^2}}\, dy,$$
where we make the substitution $x = \ln y$ in the integral. The probability density of $Y$ has the familiar form of the log-normal density from Example 1.24.


Exercise 2.12 Suppose that $X$ is a random variable with known density $f_X$. Find the density of $Y = g(X)$ if $g : \mathbb{R} \to \mathbb{R}$ is continuously differentiable and $g'(x) \ne 0$ for all $x \in \mathbb{R}$.

Application to stock prices and option payoffs

If $X$ is a random variable with standard normal distribution $N(0, 1)$ defined on any probability space, then $\mu T + \sigma\sqrt{T} X$ has the normal distribution $N(\mu T, \sigma^2 T)$ (see Exercise 2.9), and
$$S(T) = S(0)\, e^{\mu T + \sigma\sqrt{T} X} \qquad (2.2)$$
can be used as a model of log-normally distributed stock prices.
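Since (2.2) is a strictly increasing function of X, the distribution function of S(T) can be written down explicitly as F_{S(T)}(x) = N((ln(x/S(0)) − μT)/(σ√T)) for x > 0. The fragment below (added for illustration; it assumes Python with SciPy and hypothetical parameter values) evaluates this closed form.

    import math
    from scipy.stats import norm

    S0, mu, sigma, T = 10.0, 0.05, 0.2, 1.0   # hypothetical parameters
    def F_ST(x):
        # F_{S(T)}(x) = P(S(0) exp(mu*T + sigma*sqrt(T)*X) <= x) = N((ln(x/S0) - mu*T)/(sigma*sqrt(T)))
        return norm.cdf((math.log(x / S0) - mu * T) / (sigma * math.sqrt(T)))

    print(F_ST(12.0))   # probability that the stock price ends up at or below 12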

Example 2.26
If we want to represent the random future stock price $S(T)$ on a concrete probability space, we can for example take $\Omega = \mathbb{R}$ with $P$ given by $P(B) = \int_B f\, dm$, where $f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ is the standard normal density. Then $X$ such that $X(\omega) = \omega$ for each $\omega \in \mathbb{R}$ is a random variable on $(\mathbb{R}, \mathcal{B}(\mathbb{R}), P)$ with the standard normal distribution $N(0, 1)$, and $S(T)$ given by (2.2) is a log-normally distributed random variable defined on this probability space.

Various choices of the probability space can lead to the same distribution. There is no single universally accepted probability space, allowing much flexibility in selecting one to suit particular needs.

Example 2.27
Let $X$ be a random variable with standard normal distribution $N(0, 1)$ defined on the unit interval $\Omega = [0, 1]$ with Lebesgue measure $m$ as the uniform probability. Such a random variable was constructed in Example 2.23. Then the log-normally distributed stock price $S(T)$ modelled by (2.2) will also be a random variable on $\Omega = [0, 1]$. This lends itself well to applying a numerical technique known as Monte Carlo simulation. An approximation of the log-normal distribution function generated in this way is shown in Figure 2.6. Here T = 1, the parameters μ and σ of the log-normal distribution are as in Example 1.24, and a sample of 100 points drawn from $[0, 1]$ is used. For comparison, the dotted line shows the exact distribution function.

Figure 2.6 Monte Carlo simulation of the log-normal distribution function.
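A sketch of such a simulation (added here for illustration; it assumes Python with NumPy/SciPy and the hypothetical parameter values used above, since the values from Example 1.24 are not repeated in this excerpt) draws points from [0, 1], maps them through the construction of Examples 2.23 and 2.27, and builds the empirical distribution function of S(T).

    import numpy as np
    from scipy.stats import norm

    S0, mu, sigma, T, n = 10.0, 0.05, 0.2, 1.0, 100
    rng = np.random.default_rng(1)
    omega = rng.uniform(0.0, 1.0, size=n)               # sample of points from Omega = [0, 1]
    X = norm.ppf(omega)                                  # standard normal values, as in Example 2.23
    S_T = S0 * np.exp(mu * T + sigma * np.sqrt(T) * X)   # stock prices from (2.2)

    def empirical_cdf(x):
        # proportion of simulated prices not exceeding x approximates F_{S(T)}(x)
        return np.mean(S_T <= x)

    print(empirical_cdf(12.0))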

The payoff $H = h(S(T))$ of a European derivative security is a random variable defined on the same probability space as $S(T)$.

Example 2.28
The distribution function $F_H$ of the call option payoff $H = (S(T) - K)^+$ can be written as
$$F_H(x) = P\{(S(T) - K)^+ \le x\} = \begin{cases} 0 & \text{if } x < 0, \\ P\{S(T) \le K + x\} & \text{if } x \ge 0 \end{cases} = \begin{cases} 0 & \text{if } x < 0, \\ F_{S(T)}(K + x) & \text{if } x \ge 0 \end{cases}$$
in terms of the distribution function $F_{S(T)}$ of $S(T)$.

In Figure 2.7 we can see the distribution function $F_H$ for a call option with expiry time T = 1 and strike price K = 8 written on a log-normally distributed stock given by (2.2) with parameters μ and σ as in Example 1.24. The dot indicates the value $F_H(0) = F_{S(1)}(K)$ at x = 0, where $F_H$ has a discontinuity. For comparison, the log-normal distribution function $F_{S(1)}$ is shown as a dotted line.
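Continuing the earlier sketch (an added illustration with the same hypothetical parameters), the relation F_H(x) = F_{S(T)}(K + x) for x ≥ 0 translates directly into code.

    K = 8.0

    def F_H(x):
        # distribution function of the call payoff H = (S(T) - K)^+
        if x < 0:
            return 0.0
        return F_ST(K + x)   # F_ST as defined in the sketch following (2.2)

    print(F_H(0.0), F_H(4.0))   # F_H(0) = F_{S(T)}(K), which is also the size of the jump at 0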


Figure 2.7 Distribution function for the call option payoff in Example 2.28.

Exercise 2.13 Find and sketch the distribution function of the payoff $H = (K - S(T))^+$ of a put option with expiry time T = 1 and strike price K = 8 written on a stock having a log-normal distribution with parameters μ and σ as in Example 1.24.

As a simple alternative to the log-normal model, we consider discrete stock prices by revisiting and extending Example 1.7. Suppose that the initial stock price is positive, $S(0) > 0$, and assume that there are $N$ multiplicative price changes by a factor $1 + U$ or $1 + D$ (where $-1 < D < U$) with respective probabilities $p$ and $1 - p$ (with $0 < p < 1$), so that at a future time $T > 0$ the stock price $S(T)$ will reach one of the possible values $S(0)(1 + U)^n (1 + D)^{N-n}$ with probability $\binom{N}{n} p^n (1 - p)^{N-n}$ for n = 0, 1, . . . , N; see [DMFM].

Example 2.29
A simple choice of the probability space for such a discrete model is $\Omega = \{0, 1, \dots, N\}$ equipped with the binomial probability; see Example 1.13. The future stock price can be considered as a random variable on Ω defined by $S(T)(\omega) = S(0)(1 + U)^\omega (1 + D)^{N-\omega}$ for each $\omega \in \{0, 1, \dots, N\}$. The payoff $H = h(S(T))$ of any European option is a function of the stock price $S(T)$ at expiry time $T$, so it can also be considered as a random variable on $\Omega = \{0, 1, \dots, N\}$.


Example 2.30
Next we turn our attention to path-dependent options, whose payoff depends not just on the stock price $S(T)$ at expiry time $T$ but also on the stock prices $S(t)$ at intermediate times $0 \le t \le T$. In the discrete model the time interval is divided into $N$ steps $0 < t_1 < \cdots < t_N = T$. An arithmetic Asian call with payoff
$$H = \Big(\frac{1}{N} \sum_{i=1}^N S(t_i) - K\Big)^+$$
can serve as an example of a path-dependent option. It is a call option on the average of stock prices sampled at times $t_1, \dots, t_N$.

To describe $H$ as a random variable we need a richer probability space than that in Example 2.29. Let us take Ω to be the set of all sequences of length $N$ consisting of symbols U or D to keep track of up and down stock price moves. To any such sequence $\omega = (\omega_1, \dots, \omega_N) \in \Omega$ we assign the probability
$$P(\{\omega\}) = p^{k_N} (1 - p)^{N - k_N},$$
where $k_N$ is the number of occurrences of U and $N - k_N$ is the number of occurrences of D in the sequence ω. For any n = 1, . . . , N we then put
$$S(t_n)(\omega) = S(0)(1 + U)^{k_n} (1 + D)^{n - k_n},$$
where $k_n$ is the number of occurrences of U and $n - k_n$ is the number of occurrences of D among the first $n$ entries in the sequence $\omega = (\omega_1, \dots, \omega_N) \in \Omega$. This is the binomial tree model studied in [DMFM].
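The finite sample space makes it possible to enumerate all 2^N paths explicitly. The sketch below (added for illustration; the parameters are hypothetical and plain Python is assumed) computes the probability of each path and the corresponding Asian call payoff, and hence the expected payoff.

    from itertools import product

    S0, U, D, p, K, N = 10.0, 0.1, -0.05, 0.6, 10.0, 4   # hypothetical parameters

    expected_payoff = 0.0
    for omega in product("UD", repeat=N):                 # all sequences of U/D moves
        kN = omega.count("U")
        prob = p**kN * (1 - p)**(N - kN)                  # P({omega})
        prices, S = [], S0
        for move in omega:                                # build S(t_1), ..., S(t_N) along the path
            S *= (1 + U) if move == "U" else (1 + D)
            prices.append(S)
        H = max(sum(prices) / N - K, 0.0)                 # arithmetic Asian call payoff
        expected_payoff += prob * H

    print(expected_payoff)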

2.3 Expectation and variance

The probability distribution provides detailed information about the values of a random variable along with the associated probabilities. Sometimes simplified information is more practical, hence the need for some numerical characteristics of random variables.

For a discrete random variable $X$ such that $\sum_{n=1}^\infty p_n = 1$ with $p_n = P(X = x_n)$ for some $x_1, x_2, \dots \in \mathbb{R}$, the expectation of $X$ is the weighted average
$$E(X) = \sum_{n=1}^\infty x_n p_n.$$
In particular, in the case of a finite $\Omega = \{x_1, \dots, x_N\}$ with uniform probability we obtain the arithmetic average
$$E(X) = \frac{1}{N} \sum_{n=1}^N x_n.$$

Exercise 2.14 Find a discrete random variable $X$ such that the expectation of $X$ is undefined.

Exercise 2.15 A random variable $X$ on $\Omega = \{0, 1, 2, \dots\}$ has the Poisson distribution with parameter λ > 0 if $p_n = e^{-\lambda} \frac{\lambda^n}{n!}$ for n = 0, 1, 2, . . . ; see Example 2.3. Find $E(X)$.

The above definition of the expectation is familiar in another context: a non-negative discrete random variable $X$ is nothing other than a simple function, as in Definition 1.25, with $A_n = \{X = x_n\}$ and $p_n = P(A_n)$ for n = 1, . . . , N. Its integral can therefore be written as
$$\int_\Omega X\, dP = \sum_{n=1}^N x_n p_n = E(X).$$
We use the fact that the expectation and integral coincide in this simple case to motivate the general definition.

Definition 2.31
The expectation $E(X)$ of an integrable random variable $X$ defined on a probability space $(\Omega, \mathcal{F}, P)$ is its integral over Ω, that is,
$$E(X) = \int_\Omega X\, dP.$$

We can immediately deduce the following properties from Exercises 1.27, 1.28 and 1.26, respectively.

Proposition 2.32
If $X$ and $Y$ are integrable random variables on $(\Omega, \mathcal{F}, P)$ and $a, b \in \mathbb{R}$, then:
(i) $E(aX + bY) = aE(X) + bE(Y)$ (linearity);
(ii) if $X \le Y$, then $E(X) \le E(Y)$ (monotonicity);
(iii) $|E(X)| \le E(|X|)$.

For a general $X$ it may not be obvious how to calculate the expectation directly from the definition. However, the task becomes more tractable when we examine the relationship between integrals with respect to $P$ and $P_X$.

Theorem 2.33
If $X : \Omega \to \mathbb{R}$ is a random variable on $(\Omega, \mathcal{F}, P)$ and $g : \mathbb{R} \to \mathbb{R}$ is integrable under $P_X$, then
$$E(g(X)) = \int_\Omega g(X)\, dP = \int_{\mathbb{R}} g\, dP_X.$$

Proof This follows immediately from Proposition 1.37, in which we take $(\Omega, \mathcal{F}, \mu) = (\mathbb{R}, \mathcal{B}(\mathbb{R}), P_X)$ and $\varphi = X$. □

It will sometimes prove helpful to be explicit about the variable of integration, so we allow ourselves the freedom to write
$$\int_\Omega X\, dP = \int_\Omega X(\omega)\, dP(\omega)$$
or
$$\int_{\mathbb{R}} g\, dP_X = \int_{\mathbb{R}} g(x)\, dP_X(x),$$
as the need arises. As a special case, for $g(x) = x$ we obtain
$$E(X) = \int_{\mathbb{R}} x\, dP_X(x).$$

This formula enables us to compute the expectation when the distribution of $X$ is known.

In particular, if $X$ has continuous distribution with density $f_X$, then $P_X(B) = \int_B f_X\, dm$, where $m$ is the Lebesgue measure, and we can conclude, with $g$ and $X$ as in Theorem 2.33, that
$$E(g(X)) = \int_{\mathbb{R}} g(x) f_X(x)\, dm(x). \qquad (2.3)$$

Exercise 2.16 Prove (2.3), first for any simple function $g$, then using monotone convergence for any non-negative measurable $g$, and finally for any $g$ integrable under $P_X$.


In particular, whenever $g(x) = x$ is an integrable function under $P_X$, we have
$$E(X) = \int_{\mathbb{R}} x f_X(x)\, dm(x).$$

Example 2.34
If $X \sim N(0, 1)$, then $E(X) = \int_{-\infty}^\infty x f_X(x)\, dx$ by Proposition 1.39, since the density $f_X(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2}$ is a continuous function on $\mathbb{R}$. We compute the expectation
$$E(X) = \int_{-\infty}^\infty x f_X(x)\, dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^\infty x e^{-\frac{1}{2}x^2}\, dx = 0.$$
The result is 0 since the integrand $g(x) = x e^{-\frac{1}{2}x^2}$ is an odd function, that is, $g(-x) = -g(x)$, so that the integrals from $-\infty$ to 0 and from 0 to $\infty$ cancel each other.

Exercise 2.17 Let $X \sim N(\mu, \sigma^2)$ for some $\mu \in \mathbb{R}$ and $\sigma > 0$. Verify that $E(X) = \mu$.

Exercise 2.18 Let $X$ have the Cauchy distribution with density
$$f_X(x) = \frac{1}{\pi} \frac{1}{1 + x^2}.$$
Show that the expectation of $X$ is undefined.

Example 2.35
Suppose that $P_X = p\delta_{x_0} + (1 - p)P$, where $P$ has density $f$. Then $E(X) = p x_0 + (1 - p) \int_{\mathbb{R}} x f(x)\, dm(x)$. (Compare with Example 2.14.)

Exercise 2.19 Find $E(X)$ for a random variable $X$ with distribution $P_X = \frac{1}{4}\delta_0 + \frac{3}{4} P$, where $P$ is the exponential distribution with parameter λ = 1, see Example 2.10.


The expectation can be viewed as a weighted average of the values of a random variable $X$, with the corresponding probabilities acting as weights. However, this tells us nothing about how widely the values of $X$ are spread around the expectation. Averaging the (positive) distance $|X - E(X)|$ would be one possible measure of the spread, but the expectation of the squared distance $(X - E(X))^2$ proves to be much more convenient and has become the standard approach.

Definition 2.36
The variance of an integrable random variable $X$ is defined as
$$\text{Var}(X) = E[(X - E(X))^2].$$
(Note that the variance may be equal to ∞.) The square root of the variance yields the standard deviation
$$\sigma_X = \sqrt{\text{Var}(X)}.$$

When $X$ is discrete with values $x_n$ and corresponding probabilities $p_n = P(X = x_n)$ for n = 1, 2, . . . , the variance becomes
$$\text{Var}(X) = \sum_{n=1}^\infty (x_n - E(X))^2 p_n,$$
revealing its origins as a weighted sum of the squared distances between the $x_n$ and $E(X)$.

Exercise 2.20 Show that the variance of a random variable with Poisson distribution with parameter λ is equal to λ.

The linearity of expectation implies that
$$\text{Var}(X) = E((X - E(X))^2) = E(X^2 - 2XE(X) + E(X)^2) = E(X^2) - E(X)^2,$$
hence
$$\text{Var}(aX + b) = a^2\,\text{Var}(X), \qquad \sigma_{aX+b} = |a|\sigma_X.$$
By Theorem 2.33 with $g(x) = (x - E(X))^2$,
$$\text{Var}(X) = \int_{\mathbb{R}} (x - E(X))^2\, dP_X(x).$$


If $X$ has a density, we obtain
$$\text{Var}(X) = \int_{\mathbb{R}} (x - E(X))^2 f_X(x)\, dm(x).$$

Example 2.37
We saw that if $X$ has the standard normal distribution, its expectation is 0. So $\text{Var}(X) = \int_{-\infty}^\infty x^2 f_X(x)\, dx = 1$, as is easily seen using integration by parts.

Exercise 2.21 Compute the variance of X ∼ N(μ, σ2).

This shows that for a normally distributed $X$ the shape of the distribution is fully determined by the expectation and variance.

Exercise 2.22 Compute $E(X)$ and $\text{Var}(X)$ in the following cases:
(1) $X$ has the exponential distribution with density (2.1).
(2) $X$ has the log-normal distribution with density (1.7).

There are several useful inequalities involving expectation and/or variance that enable one to estimate certain probabilities. Here we present just two simple examples.

Proposition 2.38 (Markov inequality)
Let $f : \mathbb{R} \to [0,\infty)$ be an even function, that is, $f(-x) = f(x)$ for any $x \in \mathbb{R}$, and non-decreasing for $x > 0$. If $X$ is a random variable and $c > 0$, then
$$P(|X| \ge c) \le \frac{E(f(X))}{f(c)}.$$

Proof Define $A = \{|X| \ge c\}$. Since $f$ is non-decreasing, it follows that $f(|X|) \ge f(c)$ on $A$. Since $f$ is an even function, we have $f(X) = f(|X|)$, so
$$E(f(X)) = E(f(|X|)) = \int_\Omega f(|X|)\, dP \ge \int_A f(|X|)\, dP \ge f(c)P(A).$$
Dividing both sides by $f(c)$ gives the claimed inequality. □

In particular, we can apply Proposition 2.38 with $f(x) = x^2$ to obtain the following inequality.

Corollary 2.39 (Chebyshev inequality)
$$P(|X| \ge c) \le \frac{E(X^2)}{c^2}.$$

For $X$ with finite mean μ and variance $\sigma^2$ we can apply Chebyshev's inequality to $|X - \mu|$ and $c = k\sigma$ to obtain
$$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}.$$
Thus, if $X$ has small variance, it will remain close to its mean with high probability.
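A quick simulation (added illustration; Python with NumPy assumed) shows how conservative the Chebyshev bound 1/k² typically is compared with the actual tail probability for a normal sample.

    import numpy as np

    rng = np.random.default_rng(2)
    mu, sigma, k = 1.0, 0.5, 2.0
    x = rng.normal(mu, sigma, size=1_000_000)
    tail = np.mean(np.abs(x - mu) >= k * sigma)   # empirical P(|X - mu| >= k*sigma)
    print(tail, 1 / k**2)                          # roughly 0.0455 versus the bound 0.25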

2.4 Moments and characteristic functions

Definition 2.40
For any $k \in \mathbb{N}$ we define the $k$th moment of a random variable $X$ as the expectation of $X^k$, that is,
$$m_k = E(X^k),$$
and the $k$th central moment as
$$\sigma_k = E((X - E(X))^k).$$

Clearly, $m_1 = E(X)$ is just the expectation of $X$. Moreover, $\sigma_1 = 0$ and $\sigma_2 = \text{Var}(X)$.

We can ask whether a single function might suffice to identify all the moments of a random variable $X$. It turns out that the expectation of $e^{itX}$ does the job. To make sense of such an expectation we define the integral of a function $f : \Omega \to \mathbb{C}$ with values among complex numbers by means of the integral of its real and imaginary parts: if $f = \text{Re}\, f + i\,\text{Im}\, f$ and both the (real-valued) functions $\text{Re}\, f$ and $\text{Im}\, f$ are integrable, we set
$$\int_\Omega f\, dP = \int_\Omega \text{Re}\, f\, dP + i\int_\Omega \text{Im}\, f\, dP.$$
In other words, we set
$$E(f) = E(\text{Re}\, f) + iE(\text{Im}\, f).$$

Definition 2.41
Let $X : \Omega \to \mathbb{R}$ be a random variable. Then $\varphi_X : \mathbb{R} \to \mathbb{C}$ defined by
$$\varphi_X(t) = E(e^{itX}) = E(\cos(tX)) + iE(\sin(tX))$$


for all t ∈ R is called the characteristic function of X.

To compute $\varphi_X$ it is sufficient to know the probability distribution of $X$:
$$\varphi_X(t) = \int_{\mathbb{R}} e^{itx}\, dP_X(x),$$
and if $X$ has a density, this reduces to
$$\varphi_X(t) = \int_{\mathbb{R}} e^{itx} f_X(x)\, dm(x).$$
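The last formula can be evaluated numerically for a given density. The sketch below (an added illustration assuming Python with NumPy/SciPy) computes φ_X(t) for the standard normal density and compares it with the closed form e^{−t²/2} obtained in Exercise 2.25.

    import numpy as np
    from scipy.integrate import quad

    def phi_std_normal(t):
        # characteristic function as the integral of e^{itx} times the N(0,1) density
        density = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
        re, _ = quad(lambda x: np.cos(t * x) * density(x), -np.inf, np.inf)
        im, _ = quad(lambda x: np.sin(t * x) * density(x), -np.inf, np.inf)
        return complex(re, im)

    t = 1.3
    print(phi_std_normal(t), np.exp(-t**2 / 2))   # real parts agree, imaginary part is ~0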

Vice versa, it turns out (see the inversion formula, Theorem 5.41) that the probability distribution of $X$ is uniquely determined by the characteristic function $\varphi_X$.

The function $\varphi_X$ has the advantage that it always exists because the random variable $e^{itX}$ is bounded. We begin by stating the simplest properties of $\varphi_X$ and give several examples as exercises.

Exercise 2.23 Show that for any random variable $X$
(1) $\varphi_X(0) = 1$;
(2) $|\varphi_X(t)| \le 1$ for all $t \in \mathbb{R}$.

The characteristic function $\varphi_X(t)$ is continuous in $t$. In fact
$$|\varphi_X(t + h) - \varphi_X(t)| \le \int_{\mathbb{R}} |e^{ihx} - 1|\, dP_X(x),$$
so it follows that $\varphi_X(t)$ is uniformly continuous.

Exercise 2.24 Let $X$ have the Poisson distribution with parameter λ > 0. Find its characteristic function.

Exercise 2.25 Verify that if $X$ is a random variable with the standard normal distribution, then $\varphi_X(t) = e^{-\frac{1}{2}t^2}$.

Exercise 2.26 Let $Y = aX + b$, where $X, Y$ are random variables and $a, b \in \mathbb{R}$. Show that for all $t \in \mathbb{R}$
$$\varphi_Y(t) = e^{itb}\varphi_X(at).$$


Use this relation to find $\varphi_Y$ when $Y$ is normally distributed with mean μ and variance $\sigma^2$.

As hinted above, there is a close relationship between the characteristic function and the moments of a random variable.

Theorem 2.42
Let $X$ be a random variable and let $n$ be a non-negative integer such that $E(|X|^n) < \infty$. Then
$$E(X^n) = \frac{1}{i^n}\,\varphi_X^{(n)}(0).$$

Proof First observe that for any $x \in \mathbb{R}$
$$e^{ix} - 1 - ix = i\int_0^x (e^{is} - 1)\, ds.$$
Estimating the integral gives the inequality
$$|e^{ix} - 1 - ix| = \Big|\int_0^x (e^{is} - 1)\, ds\Big| \le \int_0^x |e^{is} - 1|\, ds \le 2|x|. \qquad (2.4)$$
We show by induction that
$$\varphi_X^{(n)}(t) = E((iX)^n e^{itX})$$
for every random variable $X$ such that $E(|X|^n) < \infty$. For n = 0 this is trivially satisfied: $\varphi_X^{(0)}(t) = \varphi_X(t) = E(e^{itX})$. Now suppose that the assertion has already been established for some n = 0, 1, 2, . . . , and take any random variable $X$ such that $E(|X|^{n+1}) < \infty$. It follows that
$$E(|X|^n) = E(1_{\{|X| \le 1\}} |X|^n) + E(1_{\{|X| > 1\}} |X|^n) \le 1 + E(1_{\{|X| > 1\}} |X|^{n+1}) \le 1 + E(|X|^{n+1}) < \infty.$$
By the induction hypothesis we therefore have
$$\frac{\varphi_X^{(n)}(t + h) - \varphi_X^{(n)}(t)}{h} - E((iX)^{n+1} e^{itX}) = E\Big((iX)^n e^{itX}\, \frac{e^{ihX} - 1 - ihX}{h}\Big). \qquad (2.5)$$
By (2.4) the random variables
$$Y_n(h) = (iX)^n e^{itX}\, \frac{e^{ihX} - 1 - ihX}{h}$$
are dominated by $2|X|^{n+1}$. The derivative of the exponential function gives $\lim_{h\to 0} \frac{e^{ihX} - 1}{h} = iX$, hence $\lim_{h\to 0} Y_n(h) = 0$, and by the version of


the dominated convergence theorem in Exercise 1.36 it follows that $\lim_{h\to 0} E(Y_n(h)) = 0$. This shows that
$$\varphi_X^{(n+1)}(t) = E((iX)^{n+1} e^{itX}),$$
completing the induction argument. Putting t = 0 proves the theorem. □

Exercise 2.27 Use the formula in Theorem 2.42 to obtain an expression for the variance of $X$ in terms of the characteristic function $\varphi_X$ and its derivatives, evaluated at zero.

Exercise 2.28 Suppose that $X \sim N(0, \sigma^2)$. Show that for any odd $n$ we have $E(X^n) = 0$, and for any even $n$,
$$E(X^n) = 1 \times 3 \times \cdots \times (n - 1) \times \sigma^n.$$
In the general case, when $X \sim N(\mu, \sigma^2)$, show that
$$E(X^2) = \mu^2 + \sigma^2, \qquad E(X^3) = \mu^3 + 3\mu\sigma^2, \qquad E(X^4) = \mu^4 + 6\mu^2\sigma^2 + 3\sigma^4.$$

Remark 2.43
If $X$ is a random variable with values in $\{1, 2, \dots\}$, the probability generating function $G_X(s) = E(s^X) = \sum_{n\ge 1} s^n p_n$ allows us (in principle) to reconstruct the sequence $p_n = P(X = n)$. Setting $s = e^t$ turns $G_X$ into the moment generating function $m_X(t) = E[e^{tX}]$ of $X$.

The moment generating function is not always finite-valued, not even in an open interval around the origin. However, if $X$ has a finite moment generating function $m_X$ on some interval $(-a, a)$, then we can read off the $k$th moment of $X$ directly as the value of its $k$th derivative at 0, namely $E[X^k] = m_X^{(k)}(0)$.


3

Product measure and independence

3.1 Product measure
3.2 Joint distribution
3.3 Iterated integrals
3.4 Random vectors in $\mathbb{R}^n$
3.5 Independence
3.6 Covariance
3.7 Proofs by means of d-systems

In a financial market, the price of stocks in one company may well influence those of another: if company A suffers a decline in the market share for their product, its competitor B may have an opportunity to increase their sales, and thus the shares in B may increase in price while those of A decline. On the other hand, if the overall market for a particular product contracts, we may find that the shares of two rival companies will decline simultaneously, though not necessarily at the same rate.

Modelling the relationships between the prices of different shares is therefore of particular interest. We can regard the prices as random variables $X, Y$ defined on a common probability space $(\Omega, \mathcal{F}, P)$, and endeavour to describe their joint behaviour.

In Chapter 2 the distribution of a single random variable $X$ was defined as the probability measure on $\mathbb{R}$ given by
$$P_X(B) = P(X \in B)$$
for all Borel sets $B \subset \mathbb{R}$. In the case of two random variables $X, Y$ a natural extension would be to write the joint distribution as
$$P_{X,Y}(B) = P(X \in B_1, Y \in B_2) \qquad (3.1)$$
for any $B \subset \mathbb{R}^2$ of the form $B = B_1 \times B_2$, where $B_1, B_2 \subset \mathbb{R}$ are Borel sets.


However, the family of such sets $B$ is not a σ-field, and we need to extend it further to be able to consider $P_{X,Y}$ as a probability measure. This will lead to the notion of Borel sets in $\mathbb{R}^2$.

Exercise 3.1 Show that the family of sets $B \subset \mathbb{R}^2$ of the form $B = B_1 \times B_2$, where $B_1, B_2 \subset \mathbb{R}$ are Borel sets, is not a σ-field.

In particular, for a continuous random variable $X$ with density $f_X : \mathbb{R} \to [0,\infty)$ the probability distribution $P_X$ can be expressed as
$$P_X(B) = \int_B f_X(x)\, dm(x)$$
for any Borel set $B \subset \mathbb{R}$, where $m$ is the Lebesgue measure on the real line $\mathbb{R}$. To extend this we need to introduce the notion of joint density $f_{X,Y} : \mathbb{R}^2 \to [0,\infty)$ and define Lebesgue measure $m_2$ on the plane $\mathbb{R}^2$, so that the joint probability distribution can be written as
$$P_{X,Y}(B) = \int_B f_{X,Y}(x, y)\, dm_2(x, y)$$
for any Borel set $B$ in $\mathbb{R}^2$.

3.1 Product measure

When constructing Lebesgue measure $m$ on $\mathbb{R}$, we started by taking $m(I) = b - a$ to be the length of any interval $I = (a, b)$, and extended this to a measure defined on all Borel sets in $\mathbb{R}$, that is, on the smallest σ-field containing all intervals. The idea behind the construction of Lebesgue measure $m_2$ on $\mathbb{R}^2$ is similar. For any rectangle $R = (a, b) \times (c, d)$ we take $m_2(R) = (b - a)(d - c)$ to be the surface area, and would like to extend this to a measure on the family of all Borel sets in $\mathbb{R}^2$ defined as the smallest σ-field containing all rectangles.

Product of finite measures

It is not much more effort to consider measures on arbitrary spaces. This has the advantage of wider applicability. Consider arbitrary measure spaces $(\Omega_1, \mathcal{F}_1, \mu_1)$ and $(\Omega_2, \mathcal{F}_2, \mu_2)$ with finite measures $\mu_1, \mu_2$. We want to construct a measure μ on the Cartesian product $\Omega_1 \times \Omega_2$ such that
$$\mu(A_1 \times A_2) = \mu_1(A_1)\mu_2(A_2) \qquad (3.2)$$
for any $A_1 \in \mathcal{F}_1$ and $A_2 \in \mathcal{F}_2$. The construction has several steps.

In $\Omega_1 \times \Omega_2$ a measurable rectangle is a product $A_1 \times A_2$ for which $A_1 \in \mathcal{F}_1$ and $A_2 \in \mathcal{F}_2$. We denote the family of all measurable rectangles by
$$\mathcal{R} = \{A_1 \times A_2 : A_1 \in \mathcal{F}_1, A_2 \in \mathcal{F}_2\}.$$
Then we consider the smallest σ-field containing the family $\mathcal{R}$ of measurable rectangles, which we denote by
$$\mathcal{F}_1 \otimes \mathcal{F}_2 = \bigcap\{\mathcal{F} : \mathcal{F} \text{ is a } \sigma\text{-field on } \Omega_1 \times \Omega_2 \text{ and } \mathcal{R} \subset \mathcal{F}\} \qquad (3.3)$$
and call it the product σ-field.

Exercise 3.2 Show that the product σ-field $\mathcal{F}_1 \otimes \mathcal{F}_2$ is the smallest σ-field such that the projections
$$\text{Pr}_1 : \Omega_1 \times \Omega_2 \to \Omega_1, \quad \text{Pr}_1(\omega_1, \omega_2) = \omega_1,$$
$$\text{Pr}_2 : \Omega_1 \times \Omega_2 \to \Omega_2, \quad \text{Pr}_2(\omega_1, \omega_2) = \omega_2$$
are measurable.

Definition 3.1
The family of Borel sets on the plane can be defined as
$$\mathcal{B}(\mathbb{R}^2) = \mathcal{B}(\mathbb{R}) \otimes \mathcal{B}(\mathbb{R}).$$

Exercise 3.3 Show that the smallest σ-field on $\mathbb{R}^2$ containing the family
$$\{I_1 \times I_2 : I_1, I_2 \text{ are intervals in } \mathbb{R}\}$$
is equal to $\mathcal{B}(\mathbb{R}^2)$.

Since the domain of a measure is a σ-field by definition, the construction described in (3.3) is an example of quite a general idea.


Definition 3.2
Let Ω be a non-empty set. For a family $\mathcal{A}$ of subsets of Ω we denote the smallest σ-field on Ω that contains $\mathcal{A}$ by
$$\sigma(\mathcal{A}) = \bigcap\{\mathcal{F} : \mathcal{F} \text{ is a } \sigma\text{-field on } \Omega, \mathcal{A} \subset \mathcal{F}\}.$$
We call $\sigma(\mathcal{A})$ the σ-field generated by $\mathcal{A}$.

Example 3.3
The Borel sets in $\mathbb{R}$ form the σ-field generated by the family $\mathcal{I}$ of open intervals in $\mathbb{R}$,
$$\mathcal{B}(\mathbb{R}) = \sigma(\mathcal{I}).$$

Example 3.4
The product σ-field is generated by the family $\mathcal{R}$ of measurable rectangles,
$$\mathcal{F}_1 \otimes \mathcal{F}_2 = \sigma(\mathcal{R}).$$

The next step in constructing a measure μ on $\mathcal{F}_1 \otimes \mathcal{F}_2$ that satisfies (3.2) is to define sections of a subset $A \subset \Omega_1 \times \Omega_2$. Namely, for any $\omega_2 \in \Omega_2$ we put
$$A_{\omega_2} = \{\omega_1 \in \Omega_1 : (\omega_1, \omega_2) \in A\},$$
and, similarly, for any $\omega_1 \in \Omega_1$
$$A_{\omega_1} = \{\omega_2 \in \Omega_2 : (\omega_1, \omega_2) \in A\}.$$

Exercise 3.4 Let $A \in \mathcal{F}_1 \otimes \mathcal{F}_2$. Show that $A_{\omega_2} \in \mathcal{F}_1$ and $A_{\omega_1} \in \mathcal{F}_2$ for any $\omega_1 \in \Omega_1$ and $\omega_2 \in \Omega_2$.

In particular, for a measurable rectangle $A = A_1 \times A_2$ with $A_1 \in \mathcal{F}_1$ and $A_2 \in \mathcal{F}_2$, we obtain $A_{\omega_2} = A_1$ if $\omega_2 \in A_2$ and $A_{\omega_2} = \emptyset$ otherwise. So $\omega_2 \mapsto \mu_1(A_{\omega_2}) = 1_{A_2}(\omega_2)\mu_1(A_1)$ is a simple function. Hence, from (3.2)
$$\mu(A_1 \times A_2) = \mu_1(A_1)\mu_2(A_2) = \int_{\Omega_2} \mu_1(A_{\omega_2})\, d\mu_2(\omega_2).$$


By symmetry, we can also write
$$\mu(A_1 \times A_2) = \mu_1(A_1)\mu_2(A_2) = \int_{\Omega_1} \mu_2(A_{\omega_1})\, d\mu_1(\omega_1).$$
This motivates the general formula defining μ on $\mathcal{F}_1 \otimes \mathcal{F}_2$. Namely, for any $A \in \mathcal{F}_1 \otimes \mathcal{F}_2$ we propose to write
$$\mu(A) = \int_{\Omega_1} \mu_2(A_{\omega_1})\, d\mu_1(\omega_1) = \int_{\Omega_2} \mu_1(A_{\omega_2})\, d\mu_2(\omega_2),$$
along with the conjecture that the last two integrals are well defined and equal to one another. We already know from Exercise 3.4 that $\mu_1(A_{\omega_2})$ and $\mu_2(A_{\omega_1})$ make sense since $A_{\omega_2} \in \mathcal{F}_1$ and $A_{\omega_1} \in \mathcal{F}_2$. Moreover, for the integrals to make sense we need the function $\omega_1 \mapsto \mu_2(A_{\omega_1})$ to be $\mathcal{F}_1$-measurable and $\omega_2 \mapsto \mu_1(A_{\omega_2})$ to be $\mathcal{F}_2$-measurable. Our objective is therefore to prove the following result.

Theorem 3.5
Suppose that $\mu_1$ and $\mu_2$ are finite measures defined on the σ-fields $\mathcal{F}_1$ and $\mathcal{F}_2$, respectively. Then:

(i) for any $A \in \mathcal{F}_1 \otimes \mathcal{F}_2$ the functions
$$\omega_1 \mapsto \mu_2(A_{\omega_1}), \qquad \omega_2 \mapsto \mu_1(A_{\omega_2})$$
are measurable, respectively, with respect to $\mathcal{F}_1$ and $\mathcal{F}_2$;

(ii) for any $A \in \mathcal{F}_1 \otimes \mathcal{F}_2$ the following integrals are well defined and equal to one another:
$$\int_{\Omega_1} \mu_2(A_{\omega_1})\, d\mu_1(\omega_1) = \int_{\Omega_2} \mu_1(A_{\omega_2})\, d\mu_2(\omega_2);$$

(iii) the function $\mu : \mathcal{F}_1 \otimes \mathcal{F}_2 \to [0,\infty)$ defined by
$$\mu(A) = \int_{\Omega_1} \mu_2(A_{\omega_1})\, d\mu_1(\omega_1) = \int_{\Omega_2} \mu_1(A_{\omega_2})\, d\mu_2(\omega_2) \qquad (3.4)$$
for each $A \in \mathcal{F}_1 \otimes \mathcal{F}_2$ is a measure on the σ-field $\mathcal{F}_1 \otimes \mathcal{F}_2$;

(iv) μ is the only measure on $\mathcal{F}_1 \otimes \mathcal{F}_2$ such that
$$\mu(A_1 \times A_2) = \mu_1(A_1)\mu_2(A_2)$$
for each $A_1 \in \mathcal{F}_1$ and $A_2 \in \mathcal{F}_2$.

The proof of this theorem can be found in Section 3.7. It is quite technical and can be omitted on first reading. The theorem shows that the following definition is well posed.


Definition 3.6
Let $\mu_1$ and $\mu_2$ be finite measures defined on the σ-fields $\mathcal{F}_1$ and $\mathcal{F}_2$, respectively. We call μ defined by (3.4) the product measure on $\mathcal{F}_1 \otimes \mathcal{F}_2$ and denote it by $\mu_1 \otimes \mu_2$.

Product of σ-finite measures

Observe that Theorem 3.5 and Definition 3.6 do not apply directly to the Lebesgue measure $m$ on $\mathbb{R}$ because $m(\mathbb{R}) = \infty$. However, the definition of product measure can be extended to a large class of measures with the following property, which does include Lebesgue measure.

Definition 3.7
Let $(\Omega, \mathcal{F}, \mu)$ be a measure space. We say that μ is a σ-finite measure whenever $\Omega = \bigcup_{n=1}^\infty A_n$ for some sequence of events $A_n \in \mathcal{F}$ such that $\mu(A_n) < \infty$ and $A_n \subset A_{n+1}$ for each n = 1, 2, . . . .

Example 3.8
Taking for example $A_n = [-n, n]$, we can see that Lebesgue measure $m$ is indeed σ-finite.

Definition 3.9
Let $(\Omega_1, \mathcal{F}_1, \mu_1)$ and $(\Omega_2, \mathcal{F}_2, \mu_2)$ be measure spaces with σ-finite measures $\mu_1, \mu_2$. The product measure $\mu_1 \otimes \mu_2$ is constructed as follows.

(i) Take two sequences of events $A_n \in \mathcal{F}_1$ with $\mu_1(A_n) < \infty$ and $A_n \subset A_{n+1}$, and $B_n \in \mathcal{F}_2$ with $\mu_2(B_n) < \infty$ and $B_n \subset B_{n+1}$ for n = 1, 2, . . . such that
$$\Omega_1 = \bigcup_{n=1}^\infty A_n, \qquad \Omega_2 = \bigcup_{n=1}^\infty B_n.$$

(ii) For each n = 1, 2, . . . denote by $\mu_1^{(n)}$ the restriction of $\mu_1$ to $A_n$ defined by
$$\mu_1^{(n)}(A) = \mu_1(A) \quad \text{for each } A \in \mathcal{F}_1 \text{ such that } A \subset A_n,$$
and by $\mu_2^{(n)}$ the restriction of $\mu_2$ to $B_n$, defined analogously; clearly $\mu_1^{(n)}$ and $\mu_2^{(n)}$ are finite measures.

(iii) Define $\mu_1 \otimes \mu_2$ for any $C \in \mathcal{F}_1 \otimes \mathcal{F}_2$ as
$$(\mu_1 \otimes \mu_2)(C) = \lim_{n\to\infty} (\mu_1^{(n)} \otimes \mu_2^{(n)})(C \cap (A_n \times B_n)).$$


Exercise 3.5 Show that the limit in Definition 3.9 (iii) exists and does not depend on the choice of the sequences $A_n, B_n$ in (i).

Exercise 3.6 Show that $\mu_1 \otimes \mu_2$ from Definition 3.9 (iii) is a σ-finite measure.

Example 3.10
In Example 3.8 we observed that the Lebesgue measure $m$ on $\mathbb{R}$ is σ-finite. The construction in Definition 3.9 therefore applies, and yields the product measure $m \otimes m$ defined on the Borel sets $\mathcal{B}(\mathbb{R}^2) = \mathcal{B}(\mathbb{R}) \otimes \mathcal{B}(\mathbb{R})$ in $\mathbb{R}^2$, which will be denoted by
$$m_2 = m \otimes m$$
and called the Lebesgue measure on $\mathbb{R}^2$.

Exercise 3.7 Verify that $m_2(R) = (b - a)(d - c)$ for any rectangle $R = (a, b) \times (c, d)$ in $\mathbb{R}^2$.

Example 3.11
We can extend the construction of Lebesgue measure to $\mathbb{R}^n$ for any n = 2, 3, . . . by iterating the product of measures. Thus, we put, for example,
$$m_3 = m \otimes m \otimes m$$
for the Lebesgue measure defined on the Borel sets
$$\mathcal{B}(\mathbb{R}^3) = \mathcal{B}(\mathbb{R}) \otimes \mathcal{B}(\mathbb{R}) \otimes \mathcal{B}(\mathbb{R})$$
in $\mathbb{R}^3$. Note that this triple product can be interpreted in two ways: as $m \otimes (m \otimes m)$, a measure on $\mathbb{R} \times \mathbb{R}^2$, or as $(m \otimes m) \otimes m$, a measure on $\mathbb{R}^2 \times \mathbb{R}$. For simplicity, we identify both $\mathbb{R} \times \mathbb{R}^2$ and $\mathbb{R}^2 \times \mathbb{R}$ with $\mathbb{R}^3$ and thus make no distinction between $(m \otimes m) \otimes m$ and $m \otimes (m \otimes m)$.


In a similar manner we define the Lebesgue measure
$$m_n = \underbrace{m \otimes m \otimes \cdots \otimes m}_{n}$$
on the Borel sets
$$\mathcal{B}(\mathbb{R}^n) = \underbrace{\mathcal{B}(\mathbb{R}) \otimes \mathcal{B}(\mathbb{R}) \otimes \cdots \otimes \mathcal{B}(\mathbb{R})}_{n}$$
in $\mathbb{R}^n$.

3.2 Joint distribution

A random variable on a probability space $(\Omega, \mathcal{F}, P)$ is a function $X : \Omega \to \mathbb{R}$ such that $\{X \in B\} \in \mathcal{F}$ for every Borel set $B \in \mathcal{B}(\mathbb{R})$ on the real line $\mathbb{R}$; see Definition 2.16. We extend this to the case of functions with values in $\mathbb{R}^2$.

Definition 3.12
We call $Z : \Omega \to \mathbb{R}^2$ a random vector if $\{Z \in B\} \in \mathcal{F}$ for every Borel set $B \in \mathcal{B}(\mathbb{R}^2)$ on the plane $\mathbb{R}^2$.

Exercise 3.8 Show that $X, Y : \Omega \to \mathbb{R}$ are random variables if and only if $(X, Y) : \Omega \to \mathbb{R}^2$ is a random vector.

We are now ready to define the joint distribution for a pair of random variables $X, Y : \Omega \to \mathbb{R}$. Exercise 3.8 ensures that $\{(X, Y) \in B\} \in \mathcal{F}$, so it makes sense to consider the probability $P((X, Y) \in B)$ for any $B \in \mathcal{B}(\mathbb{R}^2)$.

Definition 3.13
The joint distribution of the pair $X, Y$ is the probability measure $P_{X,Y}$ on $(\mathbb{R}^2, \mathcal{B}(\mathbb{R}^2))$ given by
$$P_{X,Y}(B) = P((X, Y) \in B)$$
for any $B \in \mathcal{B}(\mathbb{R}^2)$.

In particular, when $B = B_1 \times B_2$ for some Borel sets $B_1, B_2 \subset \mathbb{R}$, then
$$P_{X,Y}(B) = P((X, Y) \in B_1 \times B_2) = P(X \in B_1, Y \in B_2).$$


The distributions $P_X$ and $P_Y$ of the individual random variables $X$ and $Y$ can be reconstructed from the joint distribution $P_{X,Y}$. Namely, for any Borel set $B \in \mathcal{B}(\mathbb{R})$
$$P_X(B) = P(X \in B) = P(X \in B, Y \in \mathbb{R}) = P_{X,Y}(B \times \mathbb{R}),$$
$$P_Y(B) = P(Y \in B) = P(X \in \mathbb{R}, Y \in B) = P_{X,Y}(\mathbb{R} \times B).$$
We call $P_X$ and $P_Y$ the marginal distributions of $P_{X,Y}$. On the other hand, as shown in Exercise 3.9, the marginal distributions $P_X$ and $P_Y$ are by no means enough to construct the joint distribution $P_{X,Y}$.

Exercise 3.9 On the two-element probability space $\Omega = \{\omega_1, \omega_2\}$ with uniform probability consider two pairs of random variables $X_1, Y_1$ and $X_2, Y_2$ defined in the table below.

         X1    Y1    X2    Y2
  ω1    110    60   110    40
  ω2     90    40    90    60

Show that $P_{X_1} = P_{X_2}$, $P_{Y_1} = P_{Y_2}$, but $P_{X_1,Y_1} \ne P_{X_2,Y_2}$.

Definition 3.14
The joint distribution function of $X, Y$ is the function $F_{X,Y} : \mathbb{R}^2 \to [0, 1]$ defined by
$$F_{X,Y}(x, y) = P(X \le x, Y \le y).$$
In other words,
$$F_{X,Y}(x, y) = P_{X,Y}((-\infty, x] \times (-\infty, y]).$$

Exercise 3.10 Show that the joint distribution function $(x, y) \mapsto F_{X,Y}(x, y)$ is non-decreasing in each of its arguments and that for any $a, b \in \mathbb{R}$
$$\lim_{y\to\infty} F_{X,Y}(a, y) = F_X(a), \qquad \lim_{x\to\infty} F_{X,Y}(x, b) = F_Y(b).$$


Exercise 3.11 For a, b ∈ R find P(X > a, Y > b) in terms of FX, FY and FX,Y.

Definition 3.15
If

PX,Y (B) = ∫_B fX,Y (x, y) dm2(x, y)

for all B ∈ B(R2), where fX,Y : R2 → R is integrable under the Lebesgue measure m2, then fX,Y is called the joint density of X and Y, and the random variables X, Y are said to be jointly continuous.

If X, Y are jointly continuous, the joint distribution and joint density are related by

FX,Y (a, b) = ∫_{(−∞,a]×(−∞,b]} fX,Y (x, y) dm2(x, y).

Example 3.16
The bivariate normal density is given by

fX,Y (x1, x2) = 1/(2π√(1 − ρ²)) · exp( −(x1² − 2ρx1x2 + x2²)/(2(1 − ρ²)) ),    (3.5)

where ρ ∈ (−1, 1) is a fixed parameter, whose meaning will become clear in due course. To check that it is a density we need to show that

∫_{R2} fX,Y (x1, x2) dm2(x1, x2) = 1.

This will be done in Exercise 3.12.

We need techniques for calculating such integrals. We achieve this in the next section by considering product measures.

3.3 Iterated integrals

As in Section 3.1, we consider measure spaces (Ω1, F1, μ1) and (Ω2, F2, μ2). In order to integrate functions defined on Ω1 × Ω2 with respect to the


product measure we seek to exploit integration with respect to the measures μ1 and μ2 individually.

For a function f : Ω1 × Ω2 → [−∞,∞] the sections of f are defined by ω1 ↦ f (ω1, ω2) for any ω2 ∈ Ω2 and ω2 ↦ f (ω1, ω2) for any ω1 ∈ Ω1. They are functions from Ω1 and, respectively, from Ω2 to [−∞,∞].

Iterated integrals with respect to finite measures

We first consider the issue of measurability of the sections and their integrals.

Proposition 3.17
Suppose that μ1 and μ2 are finite measures. If a non-negative function f : Ω1 × Ω2 → [0,∞] is measurable with respect to F1 ⊗ F2, then:

(i) the section ω1 ↦ f (ω1, ω2) is F1-measurable for each ω2 ∈ Ω2, and ω2 ↦ f (ω1, ω2) is F2-measurable for each ω1 ∈ Ω1;

(ii) the functions

ω1 ↦ ∫_{Ω2} f (ω1, ω2) dμ2(ω2),    ω2 ↦ ∫_{Ω1} f (ω1, ω2) dμ1(ω1)

are, respectively, F1-measurable and F2-measurable.

Proof First we approximate f by simple functions

fn(ω1, ω2) = { k/2^n   if f (ω1, ω2) ∈ [k/2^n, (k + 1)/2^n), k = 0, 1, . . . , n2^n − 1,
             { n       if f (ω1, ω2) ≥ n,

which form a non-decreasing sequence such that lim_{n→∞} fn = f.

(i) The sections of simple functions are also simple functions. It is clear that the sections of fn converge to those of f, and since measurability is preserved in the limit (Exercise 1.19), the first claim of the theorem is proved.

(ii) If A ∈ F1 ⊗ F2, we know from Theorem 3.5 (i) that

ω1 ↦ ∫_{Ω2} 1A(ω1, ω2) dμ2(ω2) = ∫_{Ω2} 1_{Aω1}(ω2) dμ2(ω2) = μ2(Aω1),
ω2 ↦ ∫_{Ω1} 1A(ω1, ω2) dμ1(ω1) = ∫_{Ω1} 1_{Aω2}(ω1) dμ1(ω1) = μ1(Aω2)

are measurable functions. It follows by linearity (Exercise 1.21) that the integrals of the sections of fn are measurable functions. By monotone convergence, see Theorem 1.31, the integrals of the sections of f are limits of the integrals of the sections of fn, and are therefore also measurable. This completes the proof. □


We are ready to show that the integral over the product space can be computed as an iterated integral.

Theorem 3.18 (Fubini)
Suppose that μ1 and μ2 are finite measures. If f : Ω1 × Ω2 → [−∞,∞] is integrable under the product measure μ1 ⊗ μ2, then the sections

ω2 ↦ f (ω1, ω2),    ω1 ↦ f (ω1, ω2)

are μ1-a.e. integrable under μ2 and, respectively, μ2-a.e. integrable under μ1, and the functions

ω1 ↦ ∫_{Ω2} f (ω1, ω2) dμ2(ω2),    ω2 ↦ ∫_{Ω1} f (ω1, ω2) dμ1(ω1)

are integrable under μ1 and, respectively, under μ2. Moreover,

∫_{Ω1×Ω2} f (ω1, ω2) d(μ1 ⊗ μ2)(ω1, ω2)
    = ∫_{Ω1} ( ∫_{Ω2} f (ω1, ω2) dμ2(ω2) ) dμ1(ω1)
    = ∫_{Ω2} ( ∫_{Ω1} f (ω1, ω2) dμ1(ω1) ) dμ2(ω2).

Proof We prove the first equality. This will be done in a number of steps.

• If f = 1A is the indicator function for some A ∈ F1 ⊗ F2, then the desired equality becomes

(μ1 ⊗ μ2)(A) = ∫_{Ω1} μ2(Aω1) dμ1(ω1),

and this is satisfied by the definition of the product measure (Definition 3.6).

• If f is a non-negative simple function, then it is a linear combination of indicator functions, and linearity of the integral for simple functions (Exercise 1.11) verifies the equality in this case.

• Next, if f is a non-negative measurable function, then it can be expressed as the limit of a non-decreasing sequence of non-negative simple functions; see Proposition 1.28. Applying the monotone convergence theorem (Theorem 1.31), first to the inner integral over Ω2 and then to each side of the target equality, verifies the equality.

• If f is a non-negative integrable function, then, in addition, the integral ∫_{Ω1×Ω2} f (ω1, ω2) d(μ1 ⊗ μ2)(ω1, ω2) on the left-hand side of the equality is finite, and therefore so is the integral ∫_{Ω1} ( ∫_{Ω2} f (ω1, ω2) dμ2(ω2) ) dμ1(ω1) on the right-hand side. This means that ω1 ↦ ∫_{Ω2} f (ω1, ω2) dμ2(ω2) is integrable under μ1. This in turn means that the section ω2 ↦ f (ω1, ω2) is μ1-a.e. integrable under μ2.

• Finally, for any function f integrable under μ1 ⊗ μ2, we take the decomposition f = f⁺ − f⁻ into the positive and negative parts; see (1.9). Since f⁺ and f⁻ are non-negative integrable functions, they satisfy the equality in question, with the integrals on both sides of the equality being finite. This, in turn, gives the identity for f, along with the conclusion that ω1 ↦ ∫_{Ω2} f (ω1, ω2) dμ2(ω2) is integrable under μ1 and ω2 ↦ f (ω1, ω2) is μ1-a.e. integrable under μ2.

The proof of the second identity and the integrability of the functions ω2 ↦ ∫_{Ω1} f (ω1, ω2) dμ1(ω1) and ω1 ↦ f (ω1, ω2) is similar. □

Iterated integrals with respect to σ-finite measures

Before we can handle iterated integrals with respect to Lebesgue measure, we need to extend Fubini's theorem to σ-finite measures.

Suppose that μ1, μ2 are σ-finite measures, and let f : Ω1 × Ω2 → [−∞,∞] be an integrable function under the product measure μ1 ⊗ μ2. We proceed as follows.

• Take sequences of events An and Bn as in part (i) and μ1^(n), μ2^(n) to be the finite measures from part (ii) of Definition 3.9. Then

∫_{An×Bn} f (ω1, ω2) d(μ1^(n) ⊗ μ2^(n))(ω1, ω2)
    = ∫_{Bn} ( ∫_{An} f (ω1, ω2) dμ1^(n)(ω1) ) dμ2^(n)(ω2)
    = ∫_{An} ( ∫_{Bn} f (ω1, ω2) dμ2^(n)(ω2) ) dμ1^(n)(ω1).

The integrals in these identities can be written as

∫_{Ω1×Ω2} 1_{An×Bn}(ω1, ω2) f (ω1, ω2) d(μ1 ⊗ μ2)(ω1, ω2)
    = ∫_{Ω2} 1_{Bn}(ω2) ( ∫_{Ω1} 1_{An}(ω1) f (ω1, ω2) dμ1(ω1) ) dμ2(ω2)
    = ∫_{Ω1} 1_{An}(ω1) ( ∫_{Ω2} 1_{Bn}(ω2) f (ω1, ω2) dμ2(ω2) ) dμ1(ω1).

• Next, if f is non-negative, we can use monotone convergence, that is, Theorem 1.31, to obtain

∫_{Ω1×Ω2} f (ω1, ω2) d(μ1 ⊗ μ2)(ω1, ω2)
    = ∫_{Ω2} ( ∫_{Ω1} f (ω1, ω2) dμ1(ω1) ) dμ2(ω2)
    = ∫_{Ω1} ( ∫_{Ω2} f (ω1, ω2) dμ2(ω2) ) dμ1(ω1)

in the limit as n → ∞ because An, Bn and An × Bn for n = 1, 2, . . . are non-decreasing sequences of sets such that ⋃_{n=1}^{∞} An = Ω1, ⋃_{n=1}^{∞} Bn = Ω2 and ⋃_{n=1}^{∞} (An × Bn) = Ω1 × Ω2.

• Finally, for any function f : Ω1 × Ω2 → [−∞,∞] integrable under μ1 ⊗ μ2, we know that the latter identities hold for the positive and negative parts f⁺, f⁻. Since f = f⁺ − f⁻, we obtain the same identities for f by the linearity of integrals. Moreover, since ∫_{Ω1×Ω2} f (ω1, ω2) d(μ1 ⊗ μ2)(ω1, ω2) < ∞, we can conclude that the integrals on the right-hand side are finite, and so the functions

ω2 ↦ ∫_{Ω1} f (ω1, ω2) dμ1(ω1)    and    ω1 ↦ ∫_{Ω2} f (ω1, ω2) dμ2(ω2)

are integrable.

This extends Fubini's theorem to σ-finite measures and, in particular, to iterated integrals with respect to Lebesgue measure.

Example 3.19
We now apply these results to the joint distribution PX,Y of two random variables X, Y. If X, Y are jointly continuous with density fX,Y, then

PX,Y (B1 × B2) = ∫_{B1×B2} fX,Y (x, y) dm2(x, y)
    = ∫_{B1} ( ∫_{B2} fX,Y (x, y) dm(y) ) dm(x)

for any B1, B2 ∈ B(R). In particular, when B1 = (−∞, a] and B2 = (−∞, b] for some a, b ∈ R and the joint density is a continuous function, this becomes

FX,Y (a, b) = ∫_{−∞}^{a} ( ∫_{−∞}^{b} fX,Y (x, y) dy ) dx,

with Riemann integrals on the right-hand side, so we obtain

fX,Y (a, b) = ∂²FX,Y (a, b)/∂b∂a.

Proposition 3.20
If X, Y are jointly continuous random variables with density fX,Y, then X and Y are (individually) continuous with densities

fX(x) = ∫_R fX,Y (x, y) dm(y),    fY(y) = ∫_R fX,Y (x, y) dm(x).    (3.6)

Proof By Fubini's theorem,

PX(B) = PX,Y (B × R) = ∫_{B×R} fX,Y (x, y) dm2(x, y) = ∫_B ( ∫_R fX,Y (x, y) dm(y) ) dm(x)

for any B ∈ B(R), where x ↦ ∫_R fX,Y (x, y) dm(y) is a non-negative integrable function. This means that

fX(x) = ∫_R fX,Y (x, y) dm(y).

The proof of the second identity is similar. □

We call fX and fY given by (3.6) the marginal densities of the joint distribution of X, Y.

Exercise 3.12 Confirm that fX,Y given by (3.5) is a density, that is,

∫_{R2} fX,Y (x1, x2) dm2(x1, x2) = 1,

and that fX, fY given by (3.6) are standard normal densities.
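Although the exercise asks for an analytic verification, a quick numerical sanity check can be reassuring. The sketch below is an illustration only (it assumes NumPy and SciPy are available and truncates the integrals to a large box); it integrates the bivariate normal density (3.5) and compares a marginal with the standard normal density:

```python
import numpy as np
from scipy import integrate, stats

rho = 0.3  # any fixed parameter in (-1, 1)

def f(x1, x2):
    # bivariate normal density (3.5)
    return np.exp(-(x1**2 - 2*rho*x1*x2 + x2**2) / (2*(1 - rho**2))) / (2*np.pi*np.sqrt(1 - rho**2))

# total mass: should be very close to 1
total, _ = integrate.dblquad(lambda y, x: f(x, y), -8, 8, -8, 8)
print(total)

# marginal density of X at a point, compared with the standard normal density
x = 0.7
fx, _ = integrate.quad(lambda y: f(x, y), -8, 8)
print(fx, stats.norm.pdf(x))  # the two numbers should agree closely
```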

Exercise 3.13 Suppose X, Y have joint density fX,Y. Show that X + Y is a continuous random variable with density

fX+Y(z) = ∫_R fX,Y (x, z − x) dm(x).


Exercise 3.14 Suppose that random variables X, Y have joint density fX,Y (x, y) = e^{−(x+y)} when x > 0 and y > 0, and fX,Y (x, y) = 0 otherwise. Find the density of X/Y.

3.4 Random vectors in Rn

When defining the joint distribution of two random variables, we found it helpful to consider random vectors (Definition 3.12) as functions from Ω to R2. We can extend this to n random variables. As for R2, we can define the Borel sets in Rn by means of products of Borel sets in R, generalising the notation introduced in (3.3).

Definition 3.21
Given n = 2, 3, . . . , define the σ-field of Borel sets on Rn as

B(Rn) = B(R) ⊗ B(R) ⊗ · · · ⊗ B(R)   (n factors),

as in Example 3.11. In other words, B(Rn) = σ(Rn), the σ-field on Rn generated by the collection

Rn = {B1 × · · · × Bn : B1, . . . , Bn ∈ B(R)}.

Exercise 3.15 Show that B(Rn) = σ(In), where

In = {I1 × · · · × In : I1, . . . , In are intervals in R}.

Definition 3.22
A map X = (X1, X2, . . . , Xn) from (Ω, F, P) to Rn is called a random vector if {X ∈ B} ∈ F for every B ∈ B(Rn).

Exercise 3.16 Show that X = (X1, X2, . . . , Xn) is a random vector if and only if X1, X2, . . . , Xn are random variables.

Definition 3.23
Let X = (X1, X2, . . . , Xn) be a random vector from Ω to Rn. The joint distribution of X (equivalently, of X1, . . . , Xn) is the probability PX on (Rn, B(Rn)) defined by

PX(B) = P(X ∈ B) for each B ∈ B(Rn).

The joint distribution function of X = (X1, X2, . . . , Xn) is the function FX : Rn → [0, 1] given by

FX(x1, . . . , xn) = P(X1 ≤ x1, X2 ≤ x2, . . . , Xn ≤ xn)

for any x1, . . . , xn ∈ R.

Having made sense of Lebesgue measure mn on Rn, we can use it, just as we did for m2, to define the joint density of n random variables.

Definition 3.24
We say that a random vector X = (X1, X2, . . . , Xn) has joint density if the joint distribution PX can be written as

PX(B) = ∫_B fX(x) dmn(x) for each B ∈ B(Rn)

for some integrable function fX : Rn → [0,∞], where mn denotes Lebesgue measure on Rn. In this case the random variables X1, X2, . . . , Xn are said to be jointly continuous. The PXi are called the marginal distributions of PX.

If PX has a density fX on Rn, then the marginal distribution PXi has density on R given by an integral relative to Lebesgue measure mn−1 on Rn−1.

Indeed, using Fubini's theorem for σ-finite measures repeatedly, we have

PXi(B) = P(X ∈ Ri−1 × B × Rn−i) = ∫_{Ri−1×B×Rn−i} fX(y) dmn(y)
    = ∫_{Ri−1×B} ( ∫_{Rn−i} fX(x′, x, x′′) dmn−i(x′′) ) dmi(x′, x)
    = ∫_B ( ∫_{Rn−i} ( ∫_{Ri−1} fX(x′, x, x′′) dmi−1(x′) ) dmn−i(x′′) ) dm(x)
    = ∫_B ( ∫_{Rn−1} fX(x′, x, x′′) dmn−1(x′, x′′) ) dm(x)

for any B ∈ B(R), where for any x′ ∈ Ri−1, x ∈ R and x′′ ∈ Rn−i we identify (x′, x, x′′) with a point in Rn, (x′, x) with a point in Ri, and (x′, x′′) with a point in Rn−1. It follows that

fXi(x) = ∫_{Rn−1} fX(x′, x, x′′) dmn−1(x′, x′′).


This extends the result in Proposition 3.20.

Example 3.25
We call X = (X1, X2, . . . , Xn) a Gaussian random vector if it has joint density given for all x ∈ Rn by

fX(x) = 1/√((2π)^n det Σ) · exp( −(1/2)(x − μ)^T Σ⁻¹ (x − μ) ),    (3.7)

where μ ∈ Rn, Σ is a non-singular positive definite (that is, x^T Σ x > 0 when x ∈ Rn is any non-zero vector) symmetric n × n matrix, Σ⁻¹ is the inverse matrix of Σ, det Σ denotes the determinant of Σ, and (x − μ)^T is the transpose of the vector x − μ in Rn. We say that (3.7) is a multivariate normal density.

In particular, the bivariate normal density (3.5) from Example 3.16 fits into this pattern since it can be written as

1/(2π√(1 − ρ²)) · exp( −(x1² − 2ρx1x2 + x2²)/(2(1 − ρ²)) ) = 1/√((2π)² det Σ) · exp( −(1/2) x^T Σ⁻¹ x ),

where x = [x1, x2]^T and where

Σ = [ 1  ρ ]
    [ ρ  1 ]

is a positive definite symmetric matrix with determinant det Σ = 1 − ρ² > 0 and inverse

Σ⁻¹ = (1/(1 − ρ²)) [  1  −ρ ]
                   [ −ρ   1 ].

Exercise 3.17 Show that (3.7) is indeed a density, that is,

∫_{Rn} 1/√((2π)^n det Σ) · exp( −(1/2)(x − μ)^T Σ⁻¹ (x − μ) ) dmn(x) = 1.

3.5 Independence

One of the key concepts in probability theory is that of independence. Weconsider it in various forms: for random variables, events and σ-fields.Each time we start with just two such objects before moving to the gen-eral case.

Page 94: 0521175577_1107002494ProbabilityFina

84 Product measure and independence

Two independent random variables

We begin by examining two random variables whose joint distribution is the product of their individual distributions.

Definition 3.26
If random variables X, Y satisfy

PX,Y (B1 × B2) = PX(B1)PY(B2) (3.8)

for all choices of B1, B2 ∈ B(R), we say that X and Y are independent.

In other words, the joint distribution of two independent variables X, Y is the product measure PX,Y = PX ⊗ PY. We can conveniently express this in terms of distribution functions.

Theorem 3.27
Random variables X, Y are independent if and only if their joint distribution function FX,Y is the product of their individual distribution functions, that is,

FX,Y (x, y) = FX(x)FY (y) for any x, y ∈ R.

The necessity of this condition is immediate, but the proof of its sufficiency is somewhat technical and is given in Section 3.7.

When X and Y are jointly continuous, their independence can be expressed in terms of densities.

Proposition 3.28
If X, Y are jointly continuous with density fX,Y, then they are independent if and only if

fX,Y (x, y) = fX(x) fY(y), m2-a.e.,    (3.9)

that is, if and only if

m2({(x, y) ∈ R2 : fX,Y (x, y) ≠ fX(x) fY(y)}) = 0.

Proof Proposition 3.20 confirms that X and Y have densities fX, fY. For any B1, B2 ∈ B(R) we have

PX,Y (B1 × B2) = ∫_{B1×B2} fX,Y (x, y) dm2(x, y),

while

PX(B1)PY(B2) = ( ∫_{B1} fX(x) dm(x) ) ( ∫_{B2} fY(y) dm(y) )
    = ∫_{B1} ( ∫_{B2} fX(x) fY(y) dm(y) ) dm(x)
    = ∫_{B1×B2} fX(x) fY(y) dm2(x, y)

by Fubini's theorem. Now if (3.9) holds, it follows immediately that (3.8) does too. Conversely, if (3.8) holds, then we see that

∫_{B1×B2} fX,Y (x, y) dm2(x, y) = ∫_{B1×B2} fX(x) fY(y) dm2(x, y)

for any Borel sets B1, B2 ∈ B(R). It follows from Lemma 3.58 in Section 3.7 that

∫_B fX,Y (x, y) dm2(x, y) = ∫_B fX(x) fY(y) dm2(x, y)

for any Borel set B ∈ B(R2) because, by Theorem 1.35, the integrals on both sides of the last equality are measures when regarded as functions of B ∈ B(R2). This implies (3.9) by virtue of Exercise 1.30. □

As a by-product we obtain the following result.

Corollary 3.29
If X and Y are (individually) continuous and independent, then they are also jointly continuous, with joint density given by the product of their individual densities.

This result fails when the random variables are not independent, as the next exercise shows.

Exercise 3.18 Give an example of continuous random variables X, Y defined on the same probability space that are not jointly continuous.

Exercise 3.19 Suppose the joint density fX,Y of random variables X, Y is the bivariate normal density (3.5) with ρ = 0. (We call fX,Y the standard bivariate normal density when ρ = 0.) Show that X and Y are independent.


Exercise 3.20 Show that if X and Y are jointly continuous and independent, then their sum X + Y has density

fX+Y(z) = ∫_{−∞}^{∞} fX(x) fY(z − x) dm(x).

This density is called the convolution of fX and fY.
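As an illustration of the convolution formula (a numerical sketch rather than part of the text, assuming NumPy and SciPy), take X and Y independent with the exponential density e^{−x} on (0, ∞); the convolution should then equal the density z e^{−z} of their sum:

```python
import numpy as np
from scipy import integrate

def f_exp(x):
    # Exponential(1) density
    return np.exp(-x) * (x > 0)

def f_sum(z):
    # convolution of f_exp with itself, evaluated at z
    val, _ = integrate.quad(lambda x: f_exp(x) * f_exp(z - x), 0, max(z, 0))
    return val

for z in [0.5, 1.0, 2.0]:
    print(f_sum(z), z * np.exp(-z))  # the two columns should agree
```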

Families of independent random variables

The following definition is a natural extension of the concept of independence from 2 to n random variables.

Definition 3.30
Let PX be the joint distribution of a random vector X = (X1, X2, . . . , Xn). The random variables X1, X2, . . . , Xn are said to be independent if for every choice of Borel sets B1, B2, . . . , Bn ∈ B(R)

PX(B1 × B2 × · · · × Bn) = ∏_{i=1}^{n} PXi(Bi),

or in other words if

PX = PX1 ⊗ PX2 ⊗ · · · ⊗ PXn.

An arbitrary family X of random variables is called independent if for every finite subset {X1, X2, . . . , Xn} ⊂ X the random variables X1, X2, . . . , Xn are independent.

For n random variables their independence can again be expressed in terms of the distribution function. The proof follows just as for the case n = 2 considered above.

Theorem 3.31
X1, X2, . . . , Xn are independent if and only if the joint distribution function for the random vector X = (X1, X2, . . . , Xn) satisfies

FX(x1, x2, . . . , xn) = ∏_{i=1}^{n} FXi(xi) for any x1, x2, . . . , xn ∈ R.


The description in terms of densities follows just as for the case n = 2.

Theorem 3.32
If a random vector X = (X1, . . . , Xn) has joint density fX, then X1, . . . , Xn are independent if and only if

fX(x1, x2, . . . , xn) = ∏_{i=1}^{n} fXi(xi)   mn-a.e.    (3.10)

Exercise 3.21 Prove Theorems 3.31 and 3.32.

Exercise 3.22 Suppose that the random vector X = (X1, X2, . . . , Xn) has joint density (3.7). Show that E(Xi) = μi for each i = 1, . . . , n. Also prove that if Σ is a diagonal matrix, then X1, X2, . . . , Xn are independent.

Two independent events

Our principal interest is in random variables, but the concept of independence can be defined more widely: for A1 = {X ∈ B1} and A2 = {Y ∈ B2} we see that (3.8) becomes

P(A1 ∩ A2) = P(A1)P(A2).

We turn this into a general definition for arbitrary events A1, A2 ∈ F .

Definition 3.33
Let (Ω, F, P) be a probability space. Events A1, A2 ∈ F are said to be independent if

P(A1 ∩ A2) = P(A1)P(A2).

Example 3.34
A fair die is thrown twice. Thus Ω = {(i, j) : i, j = 1, 2, 3, 4, 5, 6} and each pair occurs with probability 1/36. Let A be the event that the first throw is odd, and B the event that the second throw is odd. Then P(A) = 1/2 = P(B), while P(A ∩ B) = 1/4. Thus A, B are independent events. However, for D = {(i, j) ∈ Ω : i + j > 6} the events A, D are not independent since P(D) = 7/12 and P(A ∩ D) = 1/4 ≠ 1/2 × 7/12.
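The probabilities in this example are easy to confirm by brute-force enumeration; the following sketch (an illustration only, using plain Python) checks both claims:

```python
from itertools import product

omega = list(product(range(1, 7), repeat=2))   # all 36 outcomes, each with probability 1/36

def prob(event):
    return sum(1 for w in omega if event(w)) / 36

A = lambda w: w[0] % 2 == 1       # first throw odd
B = lambda w: w[1] % 2 == 1       # second throw odd
D = lambda w: w[0] + w[1] > 6     # sum of the throws greater than 6

print(prob(lambda w: A(w) and B(w)) == prob(A) * prob(B))  # True: A, B independent
print(prob(lambda w: A(w) and D(w)) == prob(A) * prob(D))  # False: A, D not independent
```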


Exercise 3.23 Show that A1, A2 are independent events if and only if A1, Ω \ A2 are independent events.

Exercise 3.24 Show that A1, A2 are independent events if and only if the indicator functions 1A1, 1A2 are independent random variables.

Now suppose we know that an event A has occurred. This means that A can take the place of Ω. For any event B only the part of B lying within A matters now, so we replace B by A ∩ B in order to compute the probability of B given that A has occurred. We normalise by dividing by P(A) to define the conditional probability of B given A as

P(B | A) = P(A ∩ B) / P(A).    (3.11)

This makes sense whenever P(A) ≠ 0. It is then natural to consider the events A, B as independent if the prior occurrence of A does not influence the probability of B, that is, if

P(B | A) = P(B).

(It is equivalent to P(A | B) = P(A) when P(B) ≠ 0 in addition to P(A) ≠ 0.) In Example 3.34 this is simply the statement that the outcomes of the first throw of the die do not affect the outcome of the second. It is consistent with Definition 3.33 of independent events,

P(A ∩ B) = P(A)P(B),

with the apparent advantage that the latter also applies when P(B) = 0 or P(A) = 0.

Families of independent events

Extending the definition of independence to more than two events requires some care. It is tempting to propose that A, B, C be called independent if

P(A ∩ B ∩C) = P(A)P(B)P(C), (3.12)

but the following exercises show that this would not be satisfactory.


Exercise 3.25 Find subsets A, B, C of [0, 1] with uniform probability such that (3.12) holds, but A, B are not independent. Can you find three subsets such that each pair is independent, but (3.12) fails?

Exercise 3.26 Find another example by considering the events A, B in Example 3.34 together with a third event C chosen so that each pair of these events is independent, but (3.12) fails.

This leads us to make the following general definition.

Definition 3.35
A finite family of events A1, . . . , An ∈ F is said to be independent if

P(Ai1 ∩ Ai2 ∩ · · · ∩ Aik) = ∏_{j=1}^{k} P(Aij)

for any k = 2, . . . , n and for any 1 ≤ i1 < · · · < ik ≤ n.

An arbitrary family of events is defined to be independent if each of its finite subfamilies is independent.

It is immediate from this definition that a subcollection of a family of independent events is also independent.

Exercise 3.27 Suppose A, B, C are independent events. Show that A ∪ B and C are independent.

Exercise 3.28 Show that A1, A2, . . . , An are independent events if and only if the indicator functions 1A1, 1A2, . . . , 1An are independent random variables.

Example 3.36
Let (Ωi, Fi, Pi) be a probability space for each i = 1, 2, . . . , n, and suppose that Ω = Ω1 × Ω2 × · · · × Ωn is equipped with the product σ-field F = F1 ⊗ · · · ⊗ Fn and the product probability P = P1 ⊗ · · · ⊗ Pn. Cylinder sets are defined in Ω as Cartesian products of the form

Ci = Ω1 × · · · × Ωi−1 × Ai × Ωi+1 × · · · × Ωn

for some i = 1, 2, . . . , n and Ai ∈ Fi. By the definition of the product measure, P(Ci) = Pi(Ai). Moreover, for any i ≠ j

P(Ci ∩ Cj) = P(Ω1 × · · · × Ωi−1 × Ai × Ωi+1 × · · · × Ωj−1 × Aj × Ωj+1 × · · · × Ωn)
    = Pi(Ai)Pj(Aj)
    = P(Ci)P(Cj),

so the cylinder sets Ci, Cj are independent. By extending this argument, we can show that C1, C2, . . . , Cn are independent.

Two independent σ-fields

The defining identity (3.8) for independent random variables X, Y can be written as

P(A1 ∩ A2) = P(A1)P(A2)

for any events of the form A1 = {X ∈ B1} and A2 = {Y ∈ B2} with B1, B2 ∈ B(R). In other words, X, Y are independent if and only if for any A1 ∈ σ(X) and A2 ∈ σ(Y) the events A1, A2 are independent.

As we can see, independence of random variables X, Y can be expressed in terms of the generated σ-fields σ(X), σ(Y). This suggests that the notion of independence can be extended to arbitrary σ-fields G1, G2 ⊂ F.

Definition 3.37
We say that σ-fields G1, G2 ⊂ F are independent if for any A1 ∈ G1 and A2 ∈ G2 the events A1, A2 are independent.

We can now say that random variables X, Y are independent if and only if the σ-fields σ(X), σ(Y) are independent. As a simple application, we show the following proposition.

Proposition 3.38
Suppose that X, Y are independent random variables and U, W are random variables measurable with respect to σ(X) and, respectively, σ(Y). Then U, W are independent.


Proof Since U is measurable with respect to σ(X) and W is measurable with respect to σ(Y), we have σ(U) ⊂ σ(X) and σ(W) ⊂ σ(Y). Independence of X, Y means that the σ-fields σ(X), σ(Y) are independent, which implies immediately that the sub-σ-fields σ(U), σ(W) are independent, and so the random variables U, W themselves are independent. □

Corollary 3.39
If X, Y are independent random variables and g, h : R → R are Borel-measurable functions, then g(X), h(Y) are also independent random variables.

Exercise 3.29 Show that A, B are independent events if and only if the σ-fields {∅, A, Ω \ A, Ω} and {∅, B, Ω \ B, Ω} are independent.

Exercise 3.30 What can you say about a σ-field that is independent of itself?

Remark 3.40
Extending Definition 3.26, we can say that two random vectors

X = (X1, . . . , Xm), Y = (Y1, . . . , Yn)

with values, respectively, in Rm and Rn are independent whenever

PX,Y (B1 × B2) = PX(B1)PY(B2)

for all Borel sets B1 ∈ B(Rm) and B2 ∈ B(Rn), where by PX,Y we denote the joint distribution of the random vector (X1, . . . , Xm, Y1, . . . , Yn).

Equivalently, we can say that the random vectors X, Y are independent whenever the σ-fields σ(X), σ(Y) generated by them are independent, where by definition σ(X) consists of all events of the form {X ∈ B} with B ∈ B(Rm) and, similarly, σ(Y) consists of all events of the form {Y ∈ B} with B ∈ B(Rn).

Families of independent σ-fields

The notion of independence is readily extended to any finite number of σ-fields.


Definition 3.41
We say that σ-fields G1, G2, . . . , Gn ⊂ F are independent if for any A1 ∈ G1, A2 ∈ G2, . . . , An ∈ Gn the events A1, A2, . . . , An are independent.

Exercise 3.31 Show that random variables X1, X2, . . . , Xn are independent if and only if their generated σ-fields σ(X1), σ(X2), . . . , σ(Xn) are independent.

Exercise 3.32 Show that events A1, A2, . . . , An ∈ F are independent if and only if the σ-fields G1, G2, . . . , Gn are independent, where Gk = {∅, Ak, Ω \ Ak, Ω} for k = 1, 2, . . . , n.

Given a random variable Y on a probability space (Ω, F, P) and a σ-field G ⊂ F it is now natural to say that Y is independent of G if σ(Y) and G are independent σ-fields.

Exercise 3.33 Suppose that X1, X2, . . . , Xn, Y are independent random variables. Show that Y is independent of the σ-field σ(X) generated by the random vector X = (X1, X2, . . . , Xn). By definition, the σ-field σ(X) consists of all sets of the form {X ∈ B} such that B ∈ B(Rn).

Independence: expectation and variance

Theorem 3.42
If X, Y are independent integrable random variables, then the product XY is also integrable and

E(XY) = E(X)E(Y).

Proof First suppose that X = ∑_{i=1}^{m} ai 1_{Ai} and Y = ∑_{j=1}^{n} bj 1_{Bj} are simple functions. Then XY = ∑_{i=1}^{m} ∑_{j=1}^{n} ai bj 1_{Ai∩Bj}. We may assume without loss of generality that the ai are distinct, so Ai = {X = ai} for each i = 1, . . . , m, and that the bj are also distinct, so Bj = {Y = bj} for each j = 1, . . . , n. If X, Y are independent, then so are Ai, Bj for each i, j, and

E(XY) = ∑_{i=1}^{m} ∑_{j=1}^{n} ai bj P(Ai ∩ Bj) = ∑_{i=1}^{m} ∑_{j=1}^{n} ai bj P(Ai)P(Bj)
    = ( ∑_{i=1}^{m} ai P(Ai) ) ( ∑_{j=1}^{n} bj P(Bj) ) = E(X)E(Y).

Now define

fn(x) = { (k − 1)/2^n   for (k − 1)/2^n ≤ x < k/2^n, k = 1, 2, . . . , n2^n,
        { n             for x ≥ n

for each n = 1, 2, . . . . If X is a non-negative random variable, then Xn = fn(X) is a non-decreasing sequence and lim_{n→∞} Xn = X. If X is not necessarily non-negative, we put

Xn = fn(X⁺) − fn(X⁻).

Then Xn is a sequence of simple functions such that lim_{n→∞} Xn = X, and |Xn| is a non-decreasing sequence of simple functions such that lim_{n→∞} |Xn| = |X|, so lim_{n→∞} E(|Xn|) = E(|X|) by monotone convergence, see Theorem 1.31. Similarly, we put

Yn = fn(Y⁺) − fn(Y⁻),

which have similar properties as the Xn. It follows that |XnYn| = |Xn| |Yn| is a non-decreasing sequence of simple functions such that lim_{n→∞} |XnYn| = |XY|. Using monotone convergence once again, we have lim_{n→∞} E(|XnYn|) = E(|XY|). If X, Y are independent, then by Corollary 3.39, so are Xn, Yn and also |Xn|, |Yn|, so the result already established for independent simple random variables yields

E(XnYn) = E(Xn)E(Yn) and E(|XnYn|) = E(|Xn|)E(|Yn|)

for any n. If, in addition, X and Y are both integrable, then

E(|XY|) = lim_{n→∞} E(|XnYn|) = lim_{n→∞} E(|Xn|)E(|Yn|) = E(|X|)E(|Y|) < ∞,

which means that |XY| is integrable. Finally, because |XnYn| ≤ |XY| and lim_{n→∞} XnYn = XY, by dominated convergence, see Theorem 1.43, we can conclude that XY is integrable and

E(XY) = lim_{n→∞} E(XnYn) = lim_{n→∞} E(Xn)E(Yn) = E(X)E(Y),

completing the proof. □


Exercise 3.34 Prove the following version of Theorem 3.42 extended to the case of n random variables.
If X1, X2, . . . , Xn are independent integrable random variables, then the product ∏_{i=1}^{n} Xi is also integrable and

E( ∏_{i=1}^{n} Xi ) = ∏_{i=1}^{n} E(Xi).

Example 3.43
The converse of Theorem 3.42 is false. For a simple counterexample take X(x) = x and Y(x) = x² on [−1/2, 1/2] with Lebesgue measure. Since X and XY are both odd functions, their integrals over [−1/2, 1/2] are 0, so that E[XY] = E[X]E[Y]. However, X, Y are not independent, as we can verify by taking the inverse images of B = [−1/9, 1/9] under X and Y. We see that {X ∈ B} = [−1/9, 1/9] and {Y ∈ B} = [−1/3, 1/3], and these are not independent events: their intersection has measure 2/9, whereas the product of their measures is 2/9 × 2/3 = 4/27.
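A small simulation (a sketch only, assuming NumPy) illustrates both Theorem 3.42 and this counterexample: for independent uniforms the factorisation of the expectation holds, and for X uniform on [−1/2, 1/2] with Y = X² it still holds numerically even though X, Y are not independent:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6

# Independent case: X, Y independent uniform on [0, 1]
X = rng.random(n)
Y = rng.random(n)
print(np.mean(X * Y), np.mean(X) * np.mean(Y))   # both close to 1/4

# Counterexample: X uniform on [-1/2, 1/2], Y = X**2 (dependent but E(XY) = E(X)E(Y))
X = rng.random(n) - 0.5
Y = X**2
print(np.mean(X * Y), np.mean(X) * np.mean(Y))   # both close to 0

# The dependence shows up in events, e.g. for B = [-1/9, 1/9]:
b = 1/9
p_joint = np.mean((np.abs(X) <= b) & (Y <= b))
print(p_joint, np.mean(np.abs(X) <= b) * np.mean(Y <= b))  # about 2/9 versus 4/27
```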

The following important result illustrates the extent to which knowledge of expectations provides a sufficient condition for independence.

Theorem 3.44
Random variables X, Y are independent if and only if

E[ f (X)g(Y)] = E[ f (X)]E[g(Y)]    (3.13)

for all choices of bounded Borel measurable functions f, g : R → R.

Proof Suppose (3.13) holds, and apply it with the indicators 1B, 1C for Borel sets B, C ∈ B(R). Then (3.13) becomes simply

P(X ∈ B, Y ∈ C) = P(X ∈ B)P(Y ∈ C).

This holds for arbitrary sets B, C ∈ B(R), so X, Y are independent.

Conversely, if X and Y are independent and f, g are real Borel functions, then Corollary 3.39 tells us that f (X) and g(Y) are independent. If f, g are bounded, then f (X) and g(Y) are integrable, so by Theorem 3.42 we have (3.13). □


Recalling that for a complex-valued f = Re f + i Im f we define E( f ) = E(Re f ) + iE(Im f ), we can see that (3.13) extends to bounded complex-valued Borel measurable functions. We therefore immediately have the following way of finding the characteristic function of the sum of independent random variables.

Corollary 3.45
If X, Y are independent random variables, then

φX+Y(t) = φX(t)φY(t).

Exercise 3.35 Recall from Exercise 2.25 the characteristic function of a standard normal random variable. Use this to find the characteristic function of the linear combination aX + bY of independent standard normal random variables X, Y.

Proposition 3.46
If X, Y are independent integrable random variables, then

Var(X + Y) = Var(X) + Var(Y).

Proof The random variables V = X − E(X) and W = Y − E(Y) are integrable because X and Y are. Moreover, since X, Y are independent, so are V, W by Corollary 3.39. It follows that VW is integrable and E(VW) = E(V)E(W) = 0. Applying expectation to both sides of the equality

(V + W)² = V² + 2VW + W²,

we get

Var(X + Y) = E((V + W)²) = E(V²) + 2E(VW) + E(W²) = E(V²) + E(W²) = Var(X) + Var(Y). □

Exercise 3.36 Prove the following version of Proposition 3.46 extended to the case of n random variables.
If X1, . . . , Xn are independent integrable random variables, then

Var(X1 + X2 + · · · + Xn) = Var(X1) + Var(X2) + · · · + Var(Xn).


3.6 Covariance

Covariance and correlation can serve as tools to quantify the dependence between random variables, which in general may or may not be independent.

Definition 3.47
For integrable random variables X, Y whose product XY is also integrable we define the covariance as

Cov(X, Y) = E((X − E(X))(Y − E(Y))) = E(XY) − E(X)E(Y).

If, in addition, Var(X) ≠ 0 and Var(Y) ≠ 0, we can define the correlation coefficient of X, Y as

ρX,Y = Cov(X, Y) / (σX σY).

It is not hard to verify the following properties of covariance, which are due to the linearity of expectation:

Cov(aX, Y) = aCov(X, Y),

Cov(W + X, Y) = Cov(W, Y) + Cov(X, Y).

It is also worth noting that, in general,

Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y).

Observe further that

Cov(X, Y) = Cov(Y, X)

and

Cov(X, X) = Var(X).

Exercise 3.37 Suppose X, Y have bivariate normal distribution with density (3.5). Compute Cov(X, Y).

It follows from Definition 3.47 that independent random variables X, Y have zero covariance and zero correlation (when it exists). More generally, we say that X and Y are uncorrelated if ρX,Y = 0. Example 3.43 shows that uncorrelated random variables need not be independent. However, for jointly normally distributed random variables the two concepts coincide.


Exercise 3.38 Show that if X, Y have joint distribution with density (3.5) for some constant ρ ∈ (−1, 1), then their correlation is given by ρX,Y = ρ. Hence show that if such X, Y are uncorrelated, then they are independent.
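One way to build intuition for this exercise is to simulate from a bivariate normal distribution and compute the sample correlation; the sketch below (an illustration only, assuming NumPy) should return values close to the chosen ρ:

```python
import numpy as np

rng = np.random.default_rng(1)
rho = 0.7
cov = np.array([[1.0, rho],
                [rho, 1.0]])

sample = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=10**5)
x, y = sample[:, 0], sample[:, 1]

# Sample covariance and sample correlation should both be close to rho
print(np.cov(x, y)[0, 1])
print(np.corrcoef(x, y)[0, 1])
```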

Next, we prove an important general inequality which gives a bound for Cov(X, Y) in terms of Var(X) and Var(Y) or, equivalently, a bound for ρX,Y. Anticipating the terminology to be used extensively later, we make the following definition.

Definition 3.48
A random variable X is said to be square-integrable if X² is integrable.

Lemma 3.49 (Schwarz inequality)
If X and Y are square-integrable random variables, then XY is integrable, and

[E(XY)]² ≤ E(X²)E(Y²).

Proof Observe that for any t ∈ R we have 0 ≤ (X + tY)² = X² + 2tXY + t²Y², and these random variables are integrable since X and Y are square-integrable. As a result,

0 ≤ E((X + tY)²) = E(X²) + 2tE(XY) + t²E(Y²)

for any t ∈ R. For this quadratic expression in t to be non-negative for all t ∈ R, its discriminant must be non-positive, that is,

[2E(XY)]² − 4E(X²)E(Y²) ≤ 0,

which proves the Schwarz inequality. □

Applying Lemma 3.49 to the centred random variables X − E(X) and Y − E(Y) it is now easy to verify the following bounds for Cov(X, Y) and ρX,Y.

Corollary 3.50
The following inequalities hold:

[Cov(X, Y)]² ≤ Var(X)Var(Y),
−1 ≤ ρX,Y ≤ 1.

Exercise 3.39 Suppose that |ρX,Y| = 1. What is the relationship between X and Y?


Finally, we can quantify the dependencies between n random variables by means of the covariance matrix defined as follows.

Definition 3.51
For a random vector X = (X1, X2, . . . , Xn) consisting of integrable random variables such that the product XiXj is integrable for each i, j = 1, 2, . . . , n we define the covariance matrix to be the n × n square matrix with entries Cov(Xi, Xj) for i, j = 1, 2, . . . , n, that is, the matrix

C = [ Cov(X1, X1)  Cov(X1, X2)  · · ·  Cov(X1, Xn) ]
    [ Cov(X2, X1)  Cov(X2, X2)  · · ·  Cov(X2, Xn) ]
    [      ⋮            ⋮         ⋱         ⋮      ]
    [ Cov(Xn, X1)  Cov(Xn, X2)  · · ·  Cov(Xn, Xn) ]

Since Cov(Xi, Xj) = Cov(Xj, Xi), the covariance matrix is symmetric.

The diagonal elements are Cov(Xi, Xi) = Var(Xi). Also note that for any vector a = (a1, a2, . . . , an) ∈ Rn we have

0 ≤ Var(a1X1 + · · · + anXn) = Cov(a1X1 + · · · + anXn, a1X1 + · · · + anXn) = a^T C a,

which means that the covariance matrix is non-negative definite.
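In numerical work the covariance matrix is usually estimated from data. The sketch below (an illustration only, assuming NumPy) forms a sample covariance matrix and confirms the two properties just discussed, symmetry and non-negative definiteness:

```python
import numpy as np

rng = np.random.default_rng(2)
# Three correlated random variables observed 10,000 times (rows = observations)
Z = rng.standard_normal((10_000, 3))
A = np.array([[1.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.2, 0.3, 1.0]])
data = Z @ A.T

C = np.cov(data, rowvar=False)                    # sample covariance matrix
print(np.allclose(C, C.T))                        # symmetric
print(np.all(np.linalg.eigvalsh(C) >= -1e-12))    # all eigenvalues >= 0: non-negative definite
```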

3.7 Proofs by means of d-systems

The idea behind the proof of Theorem 3.5 is to observe that the desired properties hold for all sets in the σ-field generated by R, the class of measurable rectangles. For example, setting

D = {A ∈ σ(R) : ω1 ↦ μ2(Aω1) is F1-measurable},

we wish to show that D = σ(R) in order to prove that ω1 ↦ μ2(Aω1) is F1-measurable for every A ∈ σ(R) (and similarly for the function ω2 ↦ μ1(Aω2)). It is clear that R ⊂ D since for A1 × A2 ∈ R we have μ2(Aω1) = 1A1(ω1)μ2(A2). It would therefore suffice to prove that D is a σ-field in order to verify part (i) of the theorem.

Similarly, to ensure that μ is uniquely determined on F1 ⊗ F2 by its defining property (3.2) on R, it would suffice to prove that, given two finite measures ν1, ν2 that agree on R, the collection of all sets on which they agree is a σ-field.


Here we have two examples of a general proof technique which finds frequent use in measure theory. First we observe that a certain property holds on a given collection C of subsets, and show that the collection D of all sets that satisfy this property is a σ-field. Since the σ-field D contains C, it must contain σ(C), so the property holds on σ(C). Rather than verifying directly that D satisfies the definition of a σ-field, it is often easier to check that D meets the following requirements.

Definition 3.52
A system D of subsets of Ω is called a d-system on Ω when the following conditions are satisfied:

(i) Ω ∈ D;
(ii) if A, B ∈ D with A ⊂ B, then B \ A ∈ D;
(iii) if Ai ⊂ Ai+1 and Ai ∈ D for i = 1, 2, . . . , then ⋃_{i=1}^{∞} Ai ∈ D.

Similarly as for σ-fields, the smallest d-system on Ω that contains a family C of subsets of Ω is given by

d(C) = ⋂ {D : D is a d-system on Ω, C ⊂ D}

and called the d-system generated by C.

It is clear that the conditions defining a d-system are weaker than those for a σ-field; compare Definitions 1.10 (ii) and 3.52. Since every σ-field is a d-system, for any collection C we have d(C) ⊂ σ(C). If, for a particular collection C, we can prove the opposite inclusion, our task of checking that the desired property holds on σ(C) will be accomplished by checking the conditions defining a d-system.

Of course, we cannot expect this to be true for an arbitrary collection C. However, the following simple property of C is sufficient.

Definition 3.53
A family C is closed under intersection if A, B ∈ C implies A ∩ B ∈ C.

An immediate example of such a collection is given by the measurable rectangles.

Exercise 3.40 Show that the family R of measurable rectangles is closed under intersection.

When a family C of subsets of a given set Ω is closed under intersection, the d-system and the σ-field generated by C turn out to be the same. We prove this below.


Lemma 3.54
Suppose a family C of subsets of Ω is closed under intersection. Then d(C) is closed under intersection.

Proof Consider the family of sets

G = {A ∈ d(C) : A ∩ C ∈ d(C) for all C ∈ C}.

Since C is closed under intersection and C ⊂ d(C), we have C ⊂ G. We claim that G is a d-system. Obviously, Ω ∩ C = C ∈ C for each C ∈ C, so Ω ∈ G. If A, B ∈ G and A ⊂ B, then for any C ∈ C

(B \ A) ∩C = (B ∩C) \ (A ∩C) ∈ d(C)

since A ∩ C, B ∩ C ∈ d(C) and A ∩ C ⊂ B ∩ C. Thus B \ A ∈ G.

Finally, suppose that Ai ⊂ Ai+1 and Ai ∈ G for i = 1, 2, . . . . Then for any C ∈ C we have Ai ∩ C ⊂ Ai+1 ∩ C and Ai ∩ C ∈ d(C), so

( ⋃_{i=1}^{∞} Ai ) ∩ C = ⋃_{i=1}^{∞} (Ai ∩ C) ∈ d(C),

implying that ⋃_{i=1}^{∞} Ai ∈ G. We have shown that G is a d-system such that C ⊂ G ⊂ d(C), hence G = d(C).

Now consider the family of sets

H = {A ∈ d(C) : A ∩ B ∈ d(C) for all B ∈ d(C)}.

Because G = d(C), we know that C ⊂ H. Moreover, H is a d-system, which can be verified in a very similar way as for G. Since H is a d-system containing C, we have d(C) ⊂ H, and since H ⊂ d(C), we conclude that H = d(C), and this proves the lemma. □

Lemma 3.55
A family D of subsets of Ω is a σ-field if and only if it is a d-system closed under intersection.

Proof If D is a σ-field, it obviously is a d-system closed under intersection. Conversely, if D is a d-system closed under intersection and A, B ∈ D, then Ω, Ω \ A, Ω \ B ∈ D. Hence Ω \ (A ∪ B) = (Ω \ A) ∩ (Ω \ B) is in D. But then so is its complement A ∪ B since Ω ∈ D. Finally, if Ai ∈ D for all i = 1, 2, . . . , then the sets Bk = ⋃_{i=1}^{k} Ai belong to D by induction on what has just been proved for k = 2. The Bk increase to ⋃_{i=1}^{∞} Ai, so this union also belongs to the d-system D. We have verified that D is a σ-field. □

Together, these two lemmas imply the result we seek.


Proposition 3.56
If a family C of subsets of Ω is closed under intersection, then d(C) = σ(C).

Proof Since C is closed under intersection, it follows by Lemma 3.54 that d(C) is also closed under intersection. According to Lemma 3.55, d(C) is therefore a σ-field. Because C ⊂ d(C) ⊂ σ(C) and σ(C) is the smallest σ-field containing C we can therefore conclude that d(C) = σ(C). □

By Exercise 3.40 we have an immediate consequence for measurable rectangles.

Corollary 3.57
The family of measurable rectangles on Ω1 × Ω2 satisfies

d(R) = σ(R).

Exercise 3.41 Show that

d(I) = σ(I),

where I is the family of open intervals in R.

The final step in our preparation for the proof of Theorem 3.5 will ensure that the measure μ is uniquely defined on F1 ⊗ F2 by the requirement that for all measurable rectangles A1 × A2,

μ(A1 × A2) = μ1(A1)μ2(A2).

Again, we shall phrase the result in terms of general families of sets to enable us to use it in a variety of settings. Assume that C is a family of subsets of a non-empty set Ω.

Lemma 3.58
Suppose that C is closed under intersection. If μ and ν are measures defined on the σ-field σ(C) such that μ(A) = ν(A) for every A ∈ C and μ(Ω) = ν(Ω) < ∞, then μ(A) = ν(A) for every A ∈ σ(C).

Proof Consider the family of sets

D = {A ∈ σ(C) : μ(A) = ν(A)}.

Since the measures agree on C we know that C ⊂ D. Let us verify that D is a d-system. Since μ(Ω) = ν(Ω), it follows that Ω ∈ D. For any A, B ∈ D such that B ⊂ A we have

μ(A \ B) = μ(A) − μ(B) = ν(A) − ν(B) = ν(A \ B).


(Here it is important that μ and ν are finite measures.) Hence A \ B ∈ D. Moreover, for any non-decreasing sequence Ai ⊂ Ai+1 with Ai ∈ D for i = 1, 2, . . . we have

μ( ⋃_{i=1}^{∞} Ai ) = lim_{i→∞} μ(Ai) = lim_{i→∞} ν(Ai) = ν( ⋃_{i=1}^{∞} Ai ),

which shows that ⋃_{i=1}^{∞} Ai ∈ D. We have shown that D is a d-system such that C ⊂ D ⊂ σ(C). Hence d(C) ⊂ D ⊂ σ(C). Because C is closed under intersection, it follows by Proposition 3.56 that D = σ(C), that is, the measures μ and ν coincide on σ(C). □

To prove Theorem 3.5 we now need to apply these general results to the family R of measurable rectangles on Ω1 × Ω2.

Theorem 3.5
Suppose that μ1 and μ2 are finite measures defined on the σ-fields F1 and F2, respectively. Then:

(i) for any A ∈ F1 ⊗ F2 the functions

ω1 ↦ μ2(Aω1),    ω2 ↦ μ1(Aω2)

are measurable, respectively, with respect to F1 and F2;

(ii) for any A ∈ F1 ⊗ F2 the following integrals are well defined and equal to one another:

∫_{Ω1} μ2(Aω1) dμ1(ω1) = ∫_{Ω2} μ1(Aω2) dμ2(ω2);

(iii) the function μ : F1 ⊗ F2 → [0,∞) defined by

μ(A) = ∫_{Ω1} μ2(Aω1) dμ1(ω1) = ∫_{Ω2} μ1(Aω2) dμ2(ω2)    (3.4)

for each A ∈ F1 ⊗ F2 is a measure on the σ-field F1 ⊗ F2;

(iv) μ is the only measure on F1 ⊗ F2 such that

μ(A1 × A2) = μ1(A1)μ2(A2)

for each A1 ∈ F1 and A2 ∈ F2.

Proof (i) Define the family of sets

D = {A ∈ σ(R) : ω1 ↦ μ2(Aω1) is F1-measurable}.

In order to prove that ω1 ↦ μ2(Aω1) is F1-measurable for any A ∈ F1 ⊗ F2 = σ(R) it is enough to show that D = σ(R). Since μ2(Aω1) = 1A1(ω1)μ2(A2) for A = A1 × A2, it follows that R ⊂ D. To show that D is a d-system note first that Ω1 × Ω2 belongs to R, hence it belongs to D. If A ⊂ B and A, B ∈ D, then for any ω1 ∈ Ω1

(B \ A)ω1 = Bω1 \ Aω1 and Aω1 ⊂ Bω1,

so

μ2((B \ A)ω1) = μ2(Bω1) − μ2(Aω1),

where ω1 ↦ μ2(Aω1) and ω1 ↦ μ2(Bω1) are measurable functions. Hence ω1 ↦ μ2((B \ A)ω1) is a measurable function, so B \ A ∈ D. Finally, given a non-decreasing sequence Ai ⊂ Ai+1 with Ai ∈ D for i = 1, 2, . . . , we see that (Ai)ω1 ⊂ (Ai+1)ω1 for any ω1 ∈ Ω1, so

μ2( ( ⋃_{i=1}^{∞} Ai )ω1 ) = μ2( ⋃_{i=1}^{∞} (Ai)ω1 ) = lim_{i→∞} μ2((Ai)ω1).

By Exercise 1.19, this means that ⋃_{i=1}^{∞} Ai ∈ D. We have shown that D is a d-system containing R. Hence d(R) ⊂ D. By Corollary 3.57 it follows that D = σ(R), completing the proof. The same argument works for the function ω2 ↦ μ1(Aω2).

(ii) For any A ∈ F1 ⊗ F2 put

ν1(A) = ∫_{Ω1} μ2(Aω1) dμ1(ω1),    ν2(A) = ∫_{Ω2} μ1(Aω2) dμ2(ω2).

The integrals are well defined by part (i) of the theorem. We show that ν1 and ν2 are finite measures on F1 ⊗ F2. Let Ai ∈ F1 ⊗ F2 for i = 1, 2, . . . be a sequence of pairwise disjoint sets. Then the sections (Ai)ω1 are also pairwise disjoint for any ω1 ∈ Ω1, and

ν1( ⋃_{i=1}^{∞} Ai ) = ∫_{Ω1} μ2( ( ⋃_{i=1}^{∞} Ai )ω1 ) dμ1(ω1) = ∫_{Ω1} μ2( ⋃_{i=1}^{∞} (Ai)ω1 ) dμ1(ω1)
    = ∫_{Ω1} ∑_{i=1}^{∞} μ2((Ai)ω1) dμ1(ω1) = ∑_{i=1}^{∞} ∫_{Ω1} μ2((Ai)ω1) dμ1(ω1)
    = ∑_{i=1}^{∞} ν1(Ai)

by the monotone convergence theorem for a series (Exercise 1.24). Moreover, ν1 is a finite measure since

ν1(Ω1 × Ω2) = μ1(Ω1)μ2(Ω2) < ∞.


A similar argument applies to ν2. For measurable rectangles A1 × A2 ∈ R, where A1 ∈ F1 and A2 ∈ F2, we have

ν1(A1 × A2) = μ1(A1)μ2(A2) = ν2(A1 × A2).

By Lemma 3.58, the measures ν1 and ν2 therefore coincide on F1 ⊗ F2 = σ(R).

(iii) Since μ = ν1 = ν2, we have already proved in part (ii) that μ is a measure on F1 ⊗ F2.

(iv) Uniqueness of μ follows directly from Lemma 3.58. □

Exercise 3.42 Let X be an integrable random variable on the probability space Ω = [0, 1] with Borel sets and Lebesgue measure. Show that if

∫_{(i2^{−n}, j2^{−n}]} X dm = 0

for any n = 0, 1, . . . and for any i, j = 0, 1, . . . , 2^n with i ≤ j, then

∫_A X dm = 0

for every Borel set A ⊂ [0, 1], and deduce that X = 0, m-a.s.

The proof of Theorem 3.27 is based on a similar technique, making use of d-systems.

Theorem 3.27
Random variables X, Y are independent if and only if their joint distribution function FX,Y is the product of their individual distribution functions, that is,

FX,Y (x, y) = FX(x)FY(y) for any x, y ∈ R.

Proof Since intervals are Borel sets, the necessity is obvious. For sufficiency, the claim is that we need only check (3.8) for intervals of the form B1 = (−∞, x], B2 = (−∞, y]. We are then given that for all x, y ∈ R

P(X ≤ x, Y ≤ y) = P(X ≤ x)P(Y ≤ y).    (3.14)

Now consider the class C of all Borel sets A ∈ B(R) such that for all y ∈ R

P(X ∈ A, Y ≤ y) = P(X ∈ A)P(Y ≤ y),    (3.15)


and the class D of Borel sets B ∈ B(R) such that for all A ∈ C

P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B).    (3.16)

Our aim is to show that C and D are both equal to the Borel σ-field B(R). By our assumption (3.14), C contains the collection of all intervals (−∞, x], and since this collection is closed under intersection, we only need to check that C is a d-system. This will mean that it contains all Borel sets, and so (3.15) holds for all Borel sets. This in turn will mean that D contains all intervals (−∞, y], hence to show that it contains B(R) we again only need to make sure that D is a d-system.

We now check that D satisfies the conditions for a d-system; the proof for C is almost identical. We have R ∈ D since for all A ∈ C

P(X ∈ A, Y ∈ R) = P(X ∈ A) = P(X ∈ A)P(Y ∈ R).

If B ∈ D, then

P(X ∈ A, Y ∈ R \ B) = P(X ∈ A, Y ∈ R) − P(X ∈ A, Y ∈ B)
    = P(X ∈ A)P(Y ∈ R) − P(X ∈ A)P(Y ∈ B)
    = P(X ∈ A)P(Y ∈ R \ B).

Finally, if Bn ⊂ Bn+1 with Bn ∈ D for all n = 1, 2, . . . and ⋃_{n=1}^{∞} Bn = B, then

P(X ∈ A, Y ∈ B) = P( ⋃_{n=1}^{∞} {X ∈ A, Y ∈ Bn} )
    = lim_{n→∞} P(X ∈ A, Y ∈ Bn)
    = P(X ∈ A) lim_{n→∞} P(Y ∈ Bn)
    = P(X ∈ A)P(Y ∈ B).

Thus D is a d-system. By Proposition 3.56, D is a σ-field containing all intervals (−∞, y], and so contains all Borel sets, which proves that (3.16) holds for all pairs of Borel sets, that is, X and Y are independent. □


4

Conditional expectation

4.1 Binomial stock prices
4.2 Conditional expectation: discrete case
4.3 Conditional expectation: general case
4.4 The inner product space L2(P)
4.5 Existence of E(X | G) for integrable X
4.6 Proofs

We turn our attention to the concept of conditioning, which involves adjusting our expectations in the light of the knowledge we have gained of certain events or random variables. Building on the notion of the conditional probability defined in (3.11), we describe how knowledge of one random variable Y may cause us to review how likely the various outcomes of another random variable X are going to be. We adjust the probabilities for the values of X in the light of the information provided by the values of Y, by focusing on those scenarios for which these values of Y have occurred. This becomes especially important when we have a sequence, or even a continuous-parameter family, of random variables, and we consider how knowledge of the earlier terms will affect the later ones. We first illustrate these ideas in the simplest multi-step financial market model, since our main applications come from finance.

4.1 Binomial stock prices

Consider the binomial model of stock prices in order to illustrate the probabilistic ideas we will develop. This model, studied in detail in [DMFM], combines simplicity with flexibility. The general multi-step binomial model consists of repetitions of the single-step model.


Single-step model Suppose that the current price S(0) of some risky asset (stock) is known, and its future price S(T) at some fixed T > 0 is a random variable S(T) : Ω → [0,+∞), taking just two values:

S(T) = { S(0)(1 + U) with probability p,
       { S(0)(1 + D) with probability 1 − p,

where −1 < D < U. The return

K = (S(T) − S(0)) / S(0)

is a random variable such that

K = { U with probability p,
    { D with probability 1 − p.

As the sample space we take a two-element set Ω = {U, D} equipped with a probability P determined by a single number p ∈ (0, 1) such that P(U) = p and P(D) = 1 − p.

Multi-step model All the essential features of a general multi-step model are contained in a model with three time steps, where we take time to be 0, h, 2h, 3h = T. We simplify the notation by just specifying the number of a step, ignoring its length h. The model involves stock prices S(0), S(1), S(2), S(3) at these four time instants, where S(0) is a constant, and S(1), S(2), S(3) are random variables. The returns

Kn = (S(n) − S(n − 1)) / S(n − 1)

at each step n = 1, 2, 3 are independent random variables, and each has the same distribution as the return K in the single-step model. For the probability space we take Ω = {U, D}³, which consists of eight triples, called scenarios (or paths):

Ω = {UUU, UUD, UDU, UDD, DUU, DUD, DDU, DDD}.

As K1, K2, K3 are independent random variables, the probability of each path is the product of the probabilities of the up/down price movements along that path, namely

P(UUU) = p³,
P(DUU) = P(UDU) = P(UUD) = p²(1 − p),
P(UDD) = P(DUD) = P(DDU) = p(1 − p)²,
P(DDD) = (1 − p)³.


The emerging binomial tree is recombining, as can be seen in the next example.

Example 4.1
Let S(0) = 100, U = 0.1, D = −0.1 and p = 0.6. The corresponding binomial tree, in which each step moves up with probability 0.6 and down with probability 0.4, is

S(0)    S(1)    S(2)    S(3)
                        133.1
                121
        110             108.9
100             99
        90              89.1
                81
                        72.9

The number X of upward movements in this binomial tree is a random variable with distribution

P(X = 3) = P({UUU}) = p³,
P(X = 2) = P({DUU, UDU, UUD}) = 3p²(1 − p),
P(X = 1) = P({UDD, DUD, DDU}) = 3p(1 − p)²,
P(X = 0) = P({DDD}) = (1 − p)³,

that is, X has binomial distribution (see Example 2.2). The expected price after three steps is E(S(3)) ≈ 106.12.
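The numbers in Example 4.1 are easy to reproduce by listing all eight scenarios; the following sketch (an illustration only, not part of the text) computes E(S(3)):

```python
from itertools import product

S0, U, D, p = 100.0, 0.1, -0.1, 0.6

paths = list(product("UD", repeat=3))   # the eight scenarios

def price(path, n):
    """Stock price S(n) along a given path."""
    s = S0
    for step in path[:n]:
        s *= (1 + U) if step == "U" else (1 + D)
    return s

def prob(path):
    """Probability of a scenario."""
    pr = 1.0
    for step in path:
        pr *= p if step == "U" else 1 - p
    return pr

expected_S3 = sum(prob(path) * price(path, 3) for path in paths)
print(round(expected_S3, 2))   # approximately 106.12
```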

We use this example to analyse the changes of this expectation corresponding to the flow of information over time.


Partitions and expectation

Conditioning on the first step Suppose we know that the stock has gone up in the first step. This means that the collection of available scenarios is reduced to those beginning with U, which we denote by

ΩU = {UUU, UUD, UDU, UDD}.

This set now plays the role of the probability space, which we need to equip with a probability measure (the events will be all the subsets of ΩU). This is done by adjusting the original probabilities so that the new probability of ΩU is 1. For A ⊂ ΩU we put

PU(A) = P(A) / P(ΩU).

Of course PU(ΩU) = 1. Since A ⊂ ΩU, it follows that

PU(A) = P(A ∩ ΩU) / P(ΩU) = P(A | ΩU),

so the measure PU is the conditional probability A ↦ P(A | ΩU) considered for subsets A ⊂ ΩU.

If we know that the stock went down in the first step, we replace ΩU by the set ΩD of all paths beginning with D, and for A ⊂ ΩD we put

PD(A) = P(A |ΩD).

We have decomposed the set Ω = ΩU ∪ ΩD of all scenarios into two disjoint subsets ΩU, ΩD, which motivates the following general definition.

Definition 4.2
Let Ω be a non-empty set. A family P = {B1, B2, . . .} such that Bi ⊂ Ω for i = 1, 2, . . . is called a partition of Ω if Bi ∩ Bj = ∅ whenever i ≠ j and Ω = ⋃_{i=1}^{∞} Bi.

Note that we allow for the possibility that Bi = ∅ when i > n for some n, so the partition may be finite or countably infinite.

Example 4.3
P = {ΩU, ΩD} is a partition of Ω = {U, D}³.

For any discrete random variable X with values x1, x2, . . . ∈ R the family of all sets of the form {X = xi} for i = 1, 2, . . . is a partition of Ω. We call it the partition generated by X.


Example 4.4
In Ω = {U, D}³ the partition generated by S(1) from Example 4.1 is {ΩU, ΩD}.

Recall from Definition 3.2 that for any collection C of subsets of Ω, the σ-field generated by C is the smallest σ-field containing C.

Exercise 4.1 Show that the σ-field σ(P) generated by a partition P consists of all possible countable unions of the sets belonging to P.

Not all σ-fields are generated by partitions. For example, the σ-field of Borel sets B(R) is not generated by a partition.

Although σ-fields are more general, we will work with partitions for the present to develop better intuition in a relatively simple case.

Exercise 4.2 We call A an atom in a σ-field F if A ∈ F is non-empty and there are no disjoint non-empty sets B, C ∈ F such that A = B ∪ C. Show that if the family A of all atoms in F is a partition, then F = σ(A).

Example 4.5
Continuing Example 4.1, we compute the expectation of S(3) in the new probability space ΩU = {S(1) = 110} with probability PU. Since the paths beginning with D are excluded, S(3) takes three values on ΩU, and this expectation, which we denote by E(S(3) | ΩU) and call the conditional expectation of S(3) given ΩU, is equal to

E(S(3) | ΩU) = 133.1 × 0.6² + 108.9 × 2 × 0.6 × 0.4 + 89.1 × 0.4² ≈ 114.44.

In a similar manner we can compute the expectation of S(3) on ΩD = {S(1) = 90} with probability PD, denote it by E(S(3) | ΩD) and call it the conditional expectation of S(3) given ΩD. We obtain

E(S(3) | ΩD) = 108.9 × 0.6² + 89.1 × 2 × 0.6 × 0.4 + 72.9 × 0.4² ≈ 93.64.

These two cases can be combined by setting up a new random variable, denoted by E(S(3) | S(1)) and called the conditional expectation of S(3) given S(1):

E(S(3) | S(1)) = { E(S(3) | ΩU) ≈ 114.44 on ΩU = {S(1) = 110},
                 { E(S(3) | ΩD) ≈ 93.64  on ΩD = {S(1) = 90}.

Conditioning on the first two steps Suppose now that we know the price moves for the first two steps. There are four possibilities, which can be described by specifying a partition of Ω into four disjoint sets:

ΩUU = {UUU, UUD},  ΩUD = {UDU, UDD},  ΩDU = {DUU, DUD},  ΩDD = {DDU, DDD}.

Each of these sets can be viewed as a probability space equipped with a measure adjusted in a similar manner as before, for instance in ΩUU we have PUU(A) = P(A | ΩUU) for any A ⊂ ΩUU, and similarly for PUD, PDU and PDD. We follow the setup in Example 4.1 to illustrate how this leads to the conditional expectation of S(3) given S(2).

Example 4.6
We compute the expected value of S(3) on each of the sets ΩUU, ΩUD, ΩDU, ΩDD under the respective probability:

E(S(3) | ΩUU) = 133.1 × 0.6 + 108.9 × 0.4 ≈ 123.42,
E(S(3) | ΩUD) = 108.9 × 0.6 + 89.1 × 0.4 ≈ 100.98,
E(S(3) | ΩDU) = 108.9 × 0.6 + 89.1 × 0.4 ≈ 100.98,
E(S(3) | ΩDD) = 89.1 × 0.6 + 72.9 × 0.4 ≈ 82.62,

introducing similar notation for these expectations as in Example 4.5. We can see, in particular, that E(S(3) | ΩUD) = E(S(3) | ΩDU). Since S(2) has the same value on ΩUD and ΩDU, this allows us to employ a random variable denoted by E(S(3) | S(2)). To this end, note that the partition generated by S(2) consists of three sets

ΩUU = {S(2) = 121},  ΩUD ∪ ΩDU = {S(2) = 99},  ΩDD = {S(2) = 81},

and the random variable E(S(3) | S(2)) takes three different values

E(S(3) | S(2)) = { E(S(3) | ΩUU) ≈ 123.42 on ΩUU = {S(2) = 121},
                 { E(S(3) | ΩUD) = E(S(3) | ΩDU) ≈ 100.98 on ΩUD ∪ ΩDU = {S(2) = 99},
                 { E(S(3) | ΩDD) ≈ 82.62 on ΩDD = {S(2) = 81}.

The actual values of S (2) are irrelevant here since they do not appear in thecomputations. What matters is the partition related to these values.

4.2 Conditional expectation: discrete case

The binomial example motivates a more general definition. Recall thatfor any event B ∈ F such that P(B) > 0 and for any A ∈ F we knowfrom (3.11) that the conditional probability of A given B is

P(A | B) =P(A ∩ B)

P(B).

Exercise 4.3 Show that

PB(A) = P(A | B)

is a probability measure on B defined on the σ-field FB consisting ofall events A ∈ F such that A ⊂ B.

Definition 4.7If X is a discrete random variable on Ω with finitely many distinct valuesx1, x2, . . . , xn and B ∈ F is an event such that P(B) > 0, the conditionalexpectation of X given B, denoted by E(X | B), can be defined as the ex-pectation of X restricted to B under the probability PB.

Page 123: 0521175577_1107002494ProbabilityFina

4.2 Conditional expectation: discrete case 113

This follows the same pattern as in Examples 4.5 and 4.6, and gives

E(X | B) =1

P(B)

n∑i=1

xiP({X = xi} ∩ B)

=1

P(B)

n∑i=1

xiP(1BX = xi) =1

P(B)E(1BX).

We use this to extend the definition to any integrable random variable X.

Definition 4.8Given an integrable random variable X on Ω and an event B ∈ F withP(B) > 0, the conditional expectation of X given B is defined as

E(X | B) =1

P(B)E(1BX). (4.1)

Exercise 4.4 For a random variable X with Poisson distribution findthe conditional expectation of X given that the value of X is an oddnumber.

As noted in the binomial example, given a partition of Ω we can piecetogether the conditional expectations of X relative to the members of thepartition to obtain a random variable.

Definition 4.9Given a partition P = {B1, B2, . . .} of Ω, the random variable E(X | P) :Ω→ R such that for each i = 1, 2, . . .

E(X | P)(ω) = E(X | Bi) if ω ∈ Bi and P(Bi) > 0

is called the conditional expectation of X with respect to the partition P.Note that in this definition the random variable E(X | P) remains unde-

fined when P(Bi) = 0. Since P is a partition, this means that the functionE(X | P) is well defined P-a.s.

Applying (4.1), we can write

E(X | P) =∑i=1,2,...P(Bi )>0

1P(Bi)

E(1Bi X)1Bi . (4.2)

The above definition, applied to the partition ofΩ generated by a discreterandom variable Y , leads to the following one.

Page 124: 0521175577_1107002494ProbabilityFina

114 Conditional expectation

Definition 4.10If X is an integrable random variable and Y is a discrete random variablewith values y1, y2, . . . , then the conditional expectation E(X |Y) of X givenY is the conditional expectation of X with respect to the partition P gener-ated by Y , that is, for each i = 1, 2, . . .

E(X |Y)(ω) = E(X | {Y = yi}) if Y(ω) = yi and P(Y = yi) > 0.

Exercise 4.5 On [0, 1] equipped with its Borel subsets and Lebesguemeasure, let Z be the random variable taking just two values, −1 on[0, 1

2 ) and 1 on [ 12 , 1], and let X be the random variable defined as

X(ω) = ω for each ω ∈ [0, 1]. Compute E(X |Z).

Exercise 4.6 Suppose that Z is the same random variable on [0, 1] asin Exercise 4.5, and Y is the random variable defined as Y(ω) = 1 − ωfor each ω ∈ [0, 1]. Compute E(Y |Z).

Observe that if Y is constant on a subset of Ω, then E(X |Y) is also con-stant on that subset. The values of E(X |Y) depend only on the subsets onwhich Y is constant, not on the actual values of Y . For discrete randomvariables Y and Z generating the same partition we always have the sameconditional expectations,

E(X |Y) = E(X |Z).

Exercise 4.7 Construct an example to show that, for random vari-ables V and W defining different partitions, in general we have

E(X |V) � E(X |W).

Exercise 4.8 Let X, Y be random variables on Ω = {1, 2, . . .}equipped with the σ-field of all subsets of Ω and a probability mea-sure P such that P({n}) = 2× 3−n, X(n) = 2n and Y(n) = (−1)n for eachn = 1, 2, . . . . Compute E(X |Y).

Page 125: 0521175577_1107002494ProbabilityFina

4.2 Conditional expectation: discrete case 115

Properties of conditional expectation: discrete case

We establish some of the basic properties of the random variable E(X |Y),where X is an arbitrary integrable random variable (so that E(X) is well-defined) and Y is any discrete random variable.

First, note that conditional expectation preserves linear combinations:for any integrable random variables X1, X2 and a discrete random variableY , and for any numbers a, b ∈ R

E(aX1 + bX2 |Y) = aE(X1 |Y) + bE(X2 |Y).

This is an easy consequence of the definition of conditional expectationand the linearity of expectation. If y1, y2, . . . are the values of Y , then oneach set Bn = {Y = yn} such that P(Bn) > 0 we have

E(aX1 + bX2 |Y) = E(aX1 + bX2 | Bn) =1

P(Bn)E(1Bn (aX1 + bX2))

=1

P(Bn)(aE(1Bn X1) + bE(1Bn X2))

= aE(X1 | Bn) + bE(X2 | Bn) = aE(X1 |Y) + bE(X2 |Y).

Example 4.11To illustrate some further properties of E(X |Y), again consider stock priceevolution through a binomial tree. Starting with S (0) = 100, with returns±20% in the first step and ±10% in the second, we obtain four valuesS (2) = 132, 108, 88, 72. (You may find it helpful to draw the tree.) Sup-pose that p = 3

4 at each step. Conditioning S (2) on S (1), we observethat E(S (2) | S (1)) is constant on each of the sets ΩU = {UU,UD} andΩD = {DU,DD} in this two-step model, with values

E(S (2) |ΩU) = 132p + 108(1 − p) = 126,

E(S (2) |ΩD) = 88p + 72(1 − p) = 84.

Hence E(S (2) | S (1)) equals 126 on ΩU and 84 on ΩD. The expectation ofthis random variable is E(E(S (2) | S (1))) = 115.5. This equals E(S (2)), asyou may check.

In this example, therefore, the ‘average of the averages’ of S (2) over thesets in the partition generated by S (1) coincides with its overall average.This is true in general.

Page 126: 0521175577_1107002494ProbabilityFina

116 Conditional expectation

Proposition 4.12When X is an integrable and Y a discrete random variable, the expectationof E(X |Y) is equal to the expectation of X:

E(E(X |Y)) = E(X). (4.3)

Proof Let y1, y2 . . . be the values of Y . Writing Bn = {Y = yn}, we canassume without loss of generality that P(Bn) > 0 for all n = 1, 2, . . . . Then∑∞

n=1 1Bn = 1, and we obtain

E(E(X |Y)) =∞∑

n=1

E(X | Bn)P(Bn) =∞∑

n=1

1P(Bn)

E(1Bn X)P(Bn)

=

∞∑n=1

E(1Bn X) = E(X).

Note that dominated convergence in the form stated in Exercise 1.34 isused in the last equality. �

Exercise 4.9 Let X be an integrable random variable and Y a discreterandom variable. Verify that for any B ∈ σ(Y)

E(1BE(X |Y)) = E(1BX).

Example 4.13The two-step stock-price model in Example 4.11 provides two partitions ofΩ = {U,D}2, partition P1 defined by S (1) consisting of two sets ΩU ,ΩD,

andP2 determined by S (2) consisting of four setsΩUU ,ΩUD,ΩDU ,ΩDD (asdefined earlier). Notice that ΩU = ΩUU ∪ ΩUD and ΩD = ΩDU ∪ ΩDD. Itreflects the fact that S (2) carries more information than S (1), that is, if thevalue of S (2) becomes known, then we will also know the value of S (1).

This gives rise to the following definition.

Definition 4.14Given two partitions P1, P2, we say P2 is finer than (or refines) P1 (equiv-alently, that P1 is coarser than P2) whenever each element of P1 can berepresented as a union of sets from P2.

Page 127: 0521175577_1107002494ProbabilityFina

4.2 Conditional expectation: discrete case 117

Remark 4.15The tree constructed in Example 4.11 is not recombining, which resultsin partition P2 being finer than P1. In a recombining binomial tree, asin Example 4.1, the partition defined by S (2), which consists of the setsΩUU ,ΩUD∪ΩDU ,ΩDD, is not finer than that generated by S (1), which con-sists of the sets ΩU = ΩUU ∪ΩUD, ΩD = ΩDU ∪ΩDD. In a recombining treeS (2) has only three values, and its middle value gives us no informationabout the value of S (1).

Exercise 4.10 SupposeP1 andP2 are partitions of some setΩ. Showthat the coarsest partition which refines them both is that consisting ofall intersections A ∩ B, where A ∈ P1 and B ∈ P2.

Example 4.16We extend Example 4.11 by adding a third step with returns ±10%. Thestock price S (3) then takes the values 145.2, 118.8, 97.2, 96.8, 79.2, 64.8, asyou may confirm. (Note that there are only six values as the tree recombinesin two of its nodes.) The conditional expectation E(S (3) | S (2)) is calculatedin a similar manner as for E(S (2) | S (1)). Its sets of constancy are thoseof the partition P2 generated by S (2), that is, ΩUU ,ΩUD,ΩDU ,ΩDD. Thecorresponding values of E(S (3) | S (2)) are 138.6, 113.4, 92.4, 75.6.We nowcondition the random variable E(S (3) | S (2)) on the values of S (1):

E(E(S (3) | S (2)) |ΩU) = 138.60 × p + 113.40 × (1 − p) = 132.3,

E(E(S (3) | S (2)) |ΩD) = 92.40 × p + 75.60 × (1 − p) = 88.2,

given that p = 34 . We compare this with the constant values taken by

E(S (3) | S (1)) on these two sets:

E(S (3) |ΩU) = 145.20 × p2 + 118.80 × 2p(1 − p) + 97.20 × (1 − p)2

= 138.60 × p + 113.40 × (1 − p) = 132.3,

E(S (3) |ΩD) = 96.80 × p2 + 79.20 × 2p(1 − p) + 64.80 × (1 − p)2

= 92.40 × p + 75.60 × (1 − p) = 88.2.

Hence E(E(S (3) | S (2)) | S (1)) = E(S (3) | S (1)), which is again a particularcase of an important general result.

Page 128: 0521175577_1107002494ProbabilityFina

118 Conditional expectation

Proposition 4.17 (tower property)Let X be an integrable random variable and let Y, Z be discrete randomvariables such that Y generates a finer partition than Z. Then

E(E(X |Y) |Z) = E(X |Z). (4.4)

Proof Let y1, y2, . . . be the values of Y and z1, z2, . . . the values of Z. Wecan assume without loss of generality that the sets Bi = {Y = yi} andC j = {Z = z j} in the partitions generated, respectively, by Y and Z are suchthat P(Bi), P(C j) > 0 for each i, j = 1, 2, . . . . Because Y generates a finerpartition than Z, for any j = 1, 2, . . . we can write C j =

⋃i∈I j

Bi for someset of indices I j ⊂ {1, 2, . . .} . For any ω ∈ C j, by (4.1) and (4.2),

E(E(X |Y) |Z)(ω) =1

P(C j)E(E(X |Y)1C j )

=1

P(C j)E

⎛⎜⎜⎜⎜⎜⎜⎜⎝∑i∈I j

1P(Bi)

E(1Bi X)1Bi 1C j

⎞⎟⎟⎟⎟⎟⎟⎟⎠ .But Bi ⊂ C j for i ∈ I j, so 1Bi 1C j = 1Bi . Moreover,

∑i∈I j

1Bi = 1C j . It followsthat

1P(C j)

E

⎛⎜⎜⎜⎜⎜⎜⎜⎝∑i∈I j

1P(Bi)

E(1Bi X)1Bi

⎞⎟⎟⎟⎟⎟⎟⎟⎠ = 1P(C j)

∑i∈I j

1P(Bi)

E(1Bi X)E(1Bi )

=1

P(C j)

∑i∈I j

E(1Bi X) =1

P(C j)E

⎛⎜⎜⎜⎜⎜⎜⎜⎝∑i∈I j

1Bi X

⎞⎟⎟⎟⎟⎟⎟⎟⎠=

1P(C j)

E(1C j X) = E(X |Z)(ω).

Once again, dominated convergence in the form stated in Exercise 1.34 isused here. �

Example 4.18To motivate the next property, we return to Example 4.16 and considerE(S (1)S (3) |ΩUU). This conditional expectation boils down to summationover ω ∈ ΩUU , but for such scenarios S (1) is constant (and equal to120), so it can be taken outside the sum, which gives E(S (1)S (3) |ΩUU) =S (1)E(S (3) |ΩUU). If we repeat the same argument forΩUD,ΩDU andΩDD,we discover that E(S (1)S (3) | S (2)) = S (1)E(S (3) | S (2)). In other words,

Page 129: 0521175577_1107002494ProbabilityFina

4.3 Conditional expectation: general case 119

when conditioning on the second step, we can take out the ‘known’ valueS (1). This is again a general feature of conditioning.

Proposition 4.19 (taking out what is known)Assume that X is integrable and Y, Z are discrete random variables suchthat Y generates a finer partition than Z. In this case

E(ZX |Y) = ZE(X |Y). (4.5)

Proof Fix a set B belonging to the partition generated by Y such thatP(B) > 0, and notice that Z is constant on B, taking value z, say. Then

E(ZX | B) =1

P(B)E(1BZX) =

1P(B)

E(z1BX) =1

P(B)zE(1BX) = zE(X | B)

so the result holds on B. By gluing together the formulae obtained for eachsuch B we complete the argument. �

The intuition behind this result is that once the value of Y becomesknown, we will also know that of Z, and therefore can treat it as if it werea constant rather than a random variable, moving it out in front of the con-ditional expectation. In particular, for X ≡ 1, we have E(Z |Y) = Z.

The final property we consider here extends a familiar feature of in-dependent events, namely the fact that the conditional probability of anevent A is not sensitive to conditioning on an event B that is independentof A, that is, P(A | B) = P(A).

Exercise 4.11 Prove that E(X |Y) = E(X) if X and Y are independentrandom variables and Y is discrete.

4.3 Conditional expectation: general case

Let Y be a uniformly distributed random variable on [0, 1]. Then the eventBy = {Y = y} has probability P(By) = 0 for every y ∈ [0, 1]. In suchsituations the definition of conditional probability P(A | By) as P(A∩By)

P(By) nolonger makes sense, nor is there a partition generated by Y , and we need adifferent approach. This can be achieved by turning matters on their head,defining conditional expectation in the general case by means of certain

Page 130: 0521175577_1107002494ProbabilityFina

120 Conditional expectation

properties which follow from the special case of conditioning with respectto a discrete random variable or a partition.

Suppose Y is a discrete random variable and B is a set from the partitiongenerated by Y . From Proposition 4.19 we know that

1BE(X |Y) = E(1BX |Y).

Applying expectation to both sides and using Proposition 4.12, we get

E(1BE(X |Y)) = E(1BX ). (4.6)

These equalities are also valid for each B ∈ σ(Y), since each such B canbe expressed as a countable union of disjoint sets from the partition gener-ated by Y . We also note that the conditional expectation E(X |Y) is σ(Y)-measurable and does not depend on the actual values of Y but just on thepartition generated by Y , or equivalently on the σ-field σ(Y). We could,therefore denote E(X |Y) by E(X |σ(Y)).

These observations are very useful in the general case of an arbitraryrandom variable Y , when there may be no partition generated by Y , but wedo have the σ-field σ(Y) generated by Y . This gives rise to the followingdefinition of conditional expectation with respect to a σ-field.

Definition 4.20Let X be an integrable random variable on a probability space (Ω,F , P).The conditional expectation of X with respect to a σ-field G ⊂ F is de-fined as a random variable, denoted by E(X | G), that satisfies the followingtwo conditions:

(i) E(X | G) is G-measurable;(ii) for each B ∈ G

E(1BE(X | G)) = E(1BX).

WhenG = σ(Y) for some random variable Y on the same probability space,then we shall write E(X |Y) in place of E(X | G) and call it the conditionalexpectation of X given Y . In other words,

E(X |Y) = E(X |σ(Y)).

The first condition is a general counterpart of the condition that the con-ditional expectation should be constant on the atoms of G in the discretecase, the second extends (4.6).

At this point we have no guarantee that a random variable with the prop-erties (i), (ii) exists, nor that it is uniquely defined if it does exist. We deferthis question for the moment, and will return to it after discussing the prop-erties implied by Definition 4.20.

Page 131: 0521175577_1107002494ProbabilityFina

4.3 Conditional expectation: general case 121

Exercise 4.12 On the probability space [0, 1] with Borel sets andLebesgue measure compute E(X |Y) when X(ω) =

∣∣∣ω − 13

∣∣∣ and Y(ω) =∣∣∣ω − 12

∣∣∣ for ω ∈ [0, 1].

Properties of conditional expectation: general case

All the properties we proved for the discrete case can also be proved withDefinition 4.20. We summarise them in the following exercises and propo-sitions.

Exercise 4.13 Let X, Y be integrable random variables on (Ω,F , P)and let G be a sub-σ-field of F . Show that, P-a.s.,

(1) E((aX + bY) | G) = aE(X | G) + bE(Y | G) for any a, b ∈ R (lin-earity);

(2) E(X | G) ≥ 0 if X ≥ 0 (positivity).

Remark 4.21As in Exercise 4.13, identities and inequalities involving conditional ex-pectation (as a random variable) should be read as holding up to P-a.s inwhat follows.

Proposition 4.22 (tower property)If X is integrable andH ⊂ G, then

E(E(X | G) | H) = E(X | H).

Proof Write Y = E(X | G) and take any A ∈ H . We need to show thatE(1AY) = E(1AE(X | H)). By the definition of conditional expectation withrespect to G (A ∈ G sinceH ⊂ G), we have

E(1AY) = E(1AE(X | G)) = E(1AX).

By the definition of conditional expectation with respect toH ,

E(1AE(X | H)) = E(1AX),

which concludes the proof. �

Corollary 4.23For any integrable random variable X

E(E(X | G)) = E(X).

Page 132: 0521175577_1107002494ProbabilityFina

122 Conditional expectation

Proof Since 1Ω = 1 and Ω ∈ G, the definition of conditional expectationwith respect to G applies:

E(E(X | G)) = E(1ΩE(X | G)) = E(1ΩX) = E(X).�

Exercise 4.14 Prove the following monotone convergence theoremfor conditional expectations.If Xn for n = 1, 2, . . . is a non-decreasing sequence of integrablerandom variables such that limn→∞ Xn = X, P-a.s., then their condi-tional expectations E(Xn | G) form a non-decreasing sequence of non-negative integrable random variables such that limn→∞ E(Xn | G) =E(X | G), P-a.s.

If Z is a G-measurable random variable, we can ‘take it outside’ theconditional expectation of the product XZ; this accords with the intuitionthat Z is ‘known’ once we know G and can therefore be treated like aconstant when conditioning on G, exactly as in the discrete case.

Proposition 4.24 (taking out what is known)If both X and XZ are integrable and Z is G-measurable, then

E(ZX | G) = ZE(X | G).

Proof We may assume X ≥ 0 by linearity. Take any B ∈ G. We have toshow that

E(1BZX) = E(1BZE(X | G)).

Let Z = 1A for some A ∈ G. Then, since A ∩ B ∈ G,

E(1BZX) = E(1B1AX) = E(1A∩BX) = E(1A∩BE(X | G))

= E(1B1AE(X | G)) = E(1BZE(X | G)).

By linearity, we have E(ZX | G) = ZE(X | G) for any simple G-measurablerandom variable Z. For any G-measurable Z ≥ 0 we use Exercise 4.14and a sequence Z1, Z2, . . . of non-negative simple G-measurable functionsincreasing to Z to conclude that E(ZnX | G) increases to E(ZX | G), whileZnE(X | G) increases to ZE(X|G). Since E(ZnX | G) = ZnE(X | G) for eachn, and the limit on the left as n → ∞ is P-a.s. finite by our integrabilityassumption, we have E(ZX | G) = ZE(X | G) as required. For general Zwe can take positive and negative parts of Z and apply linearity of theconditional expectation. �

Page 133: 0521175577_1107002494ProbabilityFina

4.3 Conditional expectation: general case 123

Corollary 4.25E(Z | G) = Z if Z is integrable and G-measurable.

Proof Take X = 1 in Proposition 4.24. �

At the other extreme, independence of random variables X, Y means thatknowing one ‘tells us nothing’ about the other. Recall that random variablesX, Y are independent if and only if their generated σ-fields σ(X), σ(Y) areindependent, see Exercise 3.31. Moreover, recall that X is independent ofa σ-field G ⊂ F precisely when the σ-fields σ(X) and G are independent.In that case E(X | G) is constant, as the next result shows.

Proposition 4.26 (independence)If X is integrable and independent of G, then

E(X | G) = E(X).

Proof For any B ∈ G the random variables 1B, X are independent, so

E(1BX) = E(1B)E(X) = E(1BE(X)),

which shows that the constant random variable E(X) satisfies (4.6). SinceE(X ) is also G-measurable, it satisfies the definition of E(X | G). �

In the next two results we use the fact that for independent random vari-ables or random vectors their joint distribution is simply the product ofthe individual distributions. The first result will be used in the analysis ofthe Black–Scholes model in [BSM]; the second is a special case whichbecomes crucial for the development of Markov processes in [SCF]. Bothfollow easily from the Fubini theorem.

Theorem 4.27Let (Ω,F , P) be a probability space, and let G ⊂ F be a σ-field. Supposethat X : Ω → R is a G-measurable random variable and Y : Ω → R isa random variable independent of G. If f : R2 → R is a bounded Borelmeasurable function, then gf : R→ R defined for any x ∈ R by

gf (x) = E( f (x, Y)) =∫R

f (x, y) dPY (y)

is a bounded Borel measurable function, and we have

E( f (X, Y) | G)) = gf (X), P-a.s. (4.7)

Page 134: 0521175577_1107002494ProbabilityFina

124 Conditional expectation

Proof We know from Proposition 3.17 that gf is a Borel measurable func-tion. If follows that gf (X) is σ(X)-measurable. By the definition of condi-tional expectation it suffices to show that

E[1G f (X, Y)] = E[gf (X)1G]

for each G ∈ G.By hypothesis, σ(Y) and G are independent σ-fields. For any bounded

G-measurable random variable Z the σ-field σ(X, Z) generated by the ran-dom vector (X, Z) is contained in G, hence Y and (X, Z) are independent.This means that their joint distribution is the product measure PX,Z ⊗ PY

(see Remark 3.40). Applying Fubini’s theorem, we obtain

E( f (X, Y)Z) =∫R3

f (x, y)z d(PX,Z ⊗ PY)(x, z, y)

=

∫R2

(∫R

f (x, y)z dPY (y)

)dPX,Z(x, z)

=

∫R

gf (x)z dPX,Z(x, z)

= E(gf (X)Z).

Applying this with Z = 1G proves (4.7). �

In the special case where G = σ(X) for some random variable X : Ω →R, the theorem reduces to the following

Corollary 4.28Let (Ω,F , P) be a probability space, and suppose that X : Ω → R andY : Ω→ R are independent random variables. If f : R2 → R is a boundedBorel measurable function, then gf : R→ R defined for any x ∈ R by

gf (x) = E( f (x, Y)) =∫R

f (x, y) dPY (y)

is a bounded Borel measurable function, and we have

E( f (X, Y) |σ(X)) = gf (X).

Exercise 4.15 Extend Theorem 4.27 to the case of random vectorsX, Y with values in Rm and Rn, respectively, and a function f : Rm ×Rn → R.

Page 135: 0521175577_1107002494ProbabilityFina

4.3 Conditional expectation: general case 125

Now suppose that Z ≥ 0 is a non-negative random variable on a prob-ability space (Ω,F , P) such that E(Z) = 1. It can be used to define a newprobability measure Q such that for each A ∈ F

Q(A) = E(1AZ).

We know from Theorem 1.35 that Q is indeed a measure. It is a probabilitymeasure because Q(Ω) = E(Z) = 1.

Since we now have two probability measures P and Q, we need to dis-tinguish between the corresponding expectations by writing EP and EQ,respectively. For any B ∈ F we have

EQ(1B) = Q(B) = EP(1BZ).

By linearity this extends to

EQ(s) = EP(sZ)

for any simple function s. Approximating any non-negative random vari-able X by a non-decreasing sequence of simple functions, we obtain bymonotone convergence that

EQ(X) = EP(XZ). (4.8)

Finally, we can extend the last identity to any random variable X integrableunder Q by considering X+ and X− and using linearity once again. Thisgives a relationship between the expectation under Q and that under P. Thenext result, which will be needed in [BSM], extends this to conditionalexpectation.

Lemma 4.29 (Bayes formula)Let Z ≥ 0 be a random variable such that EP(Z) = 1 and let Q(A) =EP(1AZ) for each A ∈ F . For any integrable random variable X under Qand for any σ-field G ⊂ F

EQ(X | G)EP(Z | G) = EP(XZ | G).

Proof For any B ∈ G we apply (4.8) and the definition of conditionalexpectation to get

EP(1BEP(XZ | G)) = EP(1BXZ ) = EQ(1BX ) = EQ(1BEQ(X | G)).

Now we use (4.8) again and then the tower property and the fact that 1B

Page 136: 0521175577_1107002494ProbabilityFina

126 Conditional expectation

and EQ(X | G) are G-measurable to write the last expression as

EQ(1BEQ(X | G)) = EP(1BEQ(X | G)Z)

= EP(EP(1BEQ(X | G)Z | G))

= EP(1BEQ(X | G)EP(Z | G)).

Since EQ(X | G)EP(Z | G) is G-measurable, this proves the Bayes formula.�

Conditional density

When X is a continuous random variable with density fX and g : R → Ris a Borel measurable function such that g(X) is integrable, the expectationof g(X) can be written as

E(g(X)) =∫R

g(x) fX(x) dm(x). (4.9)

For two jointly continuous random variables X, Y we would like to writethe conditional expectation E(g(X) |Y) in a similar manner. Since the con-ditional expectation is a σ(Y)-measurable random variable, we need to ex-press it as a Borel measurable function of Y . We know that for any Borelset B ∈ B(R)

E(1B(Y)g(X)) = E(1B(Y)E(g(X) |Y)).

We can write the left-hand side in terms of the joint density fX,Y and useFubini’s theorem to transform it as follows:

E(1B(Y)g(X)) =∫R×B

g(x) fX,Y (x, y) dm2(x, y)

=

∫B

(∫R

g(x) fX,Y (x, y) dm(x)

)dm(y)

=

∫B

(∫R

g(x)fX,Y (x, y)

fY(y)dm(x)

)dPY(y). (4.10)

Dividing by the marginal density fY is all right because fY � 0, PY-a.s.,that is, for C = {y ∈ R : fY (y) = 0} we have

PY(C) =∫

CfY(y) dm(y) = 0.

The fraction appearing in (4.10) is what we are looking for to play a rolesimilar to the density fY(y) in (4.9).

Page 137: 0521175577_1107002494ProbabilityFina

4.3 Conditional expectation: general case 127

Definition 4.30We define the conditional density of X given Y as

h(x, y) =fX,Y (x, y)

fY(y)

for any x, y ∈ R such that fY(y) � 0, and put h(x, y) = 0 otherwise.

This allows us write

E(1B(Y)g(X)) = E

(1B(Y)

∫R

g(x)h(x, Y) dm(x)

).

Since h(x, Y) is a σ(Y)-measurable random variable for each x ∈ R, itfollows that

∫R

g(x)h(x, Y) dm(x) is σ(Y)-measurable. We have just provedthe following result.

Proposition 4.31If X, Y are jointly continuous random variables and g : R → R is a Borelmeasurable function such that g(X) is integrable, then

E(g(X) |Y) =∫R

g(x)h(x, Y) dm(x),

where h(x, y) for x, y ∈ R is the conditional density of X given Y.

Note that this result provides an immediate alternative proof (valid onlyfor jointly continuous random variables, of course) of Proposition 4.26:if X, Y are jointly continuous and independent, and X is integrable, thenfX,Y (x, y) = fX(x) fY (y), so h(x, y) = fX,Y (x,y)

fY (y) = fX(x), hence

E(X |Y) =∫R

xh(x, Y)dx =∫R

x fX(x) dm(x) = E(X).

Exercise 4.16 Let fX,Y (x, y) be the bivariate normal density given inExample 3.16. Find a formula for the corresponding conditional den-sity h(x, y) and use it to compute E(X |Y).

Jensen’s inequality

The next property of expectation requires some facts concerning convexfunctions. Many common inequalities have their origin in the notion ofconvexity. First we recall the definition of a convex function.

Page 138: 0521175577_1107002494ProbabilityFina

128 Conditional expectation

Definition 4.32A function φ : (a, b)→ R, where −∞ ≤ a < b ≤ ∞, is called convex if theinequality

φ(λx + (1 − λ)y) ≤ φ(x) + (1 − λ)φ(y)

holds whenever x, y ∈ (a, b) and 0 ≤ λ ≤ 1.

Such functions have right- and left-hand derivatives at each point in theopen interval (a, b). We recall some of their properties, including a proofof this well-known fact.

Suppose that x, y, z ∈ (a, b) and x < y < z. Taking λ = z−yz−x , we have

y = λx + (1 − λ)z, so the convexity of φ gives

φ(y) ≤ z − yz − x

φ(x) +y − xz − x

φ(z).

Rearranging, we get

φ(y) − φ(x)y − x

≤ φ(z) − φ(y)z − y

.

The next exercise shows that the one-sided derivatives of φ exist and arefinite.

Exercise 4.17 Show that if φ : (a, b) → R is convex and h > 0 withx − h, x + h ∈ (a, b), then

φ(x) − φ(x − h)h

≤ φ(x + h) − φ(x)h

.

Explain why the ratio 1h [φ(x + h) − φ(x)] decreases as h ↘ 0, and is

bounded below by a constant. Similarly, explain why 1h [φ(x)−φ(x−h)]

increases as h↘ 0, and is bounded above by a constant.

This exercise shows that the right- and left-derivatives

φ′+(x) = limh↘0

φ(x + h) − φ(x)h

, φ′−(x) = limh↘0

φ(x) − φ(x − h)h

(4.11)

are well defined for each x ∈ (a, b). We also obtain

φ′−(x) ≤ φ′+(x).

Moreover, for any x < y in (a, b)

φ′+(x) ≤ φ(y) − φ(x)y − x

≤ φ′−(y),

Page 139: 0521175577_1107002494ProbabilityFina

4.3 Conditional expectation: general case 129

which ensures that both one-sided derivatives are non-decreasing on (a, b).Since φ has finite one-sided derivatives at each point, it is a continuousfunction on (a, b).

Lemma 4.33Any convex function φ : (a, b) → R is the supremum of some sequence ofaffine functions Ln : R → R of the form Ln(x) = anx + bn for x ∈ R, wherean, bn ∈ R and n = 1, 2, . . . .

Proof Consider the set of rational numbers in (a, b). It is a countableset, which we can therefore write as a sequence q1, q2, . . . . For each n =1, 2, . . . we take any an ∈ [φ′−(qn), φ′+(qn)], put bn = φ(qn) − anqn and con-sider the straight line anx+bn for x ∈ R. Clearly, anqn+bn = φ(qn) for eachn = 1, 2, . . . , and it follows from the above inequalities that anx+bn ≤ φ(x)for each x ∈ (a, b) and for each n = 1, 2, . . . . As a result, for each x ∈ (a, b)

supn=1,2,...

(anx + bn) ≤ φ(x).

Now, for any x ∈ (a, b) we can take a subsequence qi1 , qi2 , . . . of the ratio-nals in (a, b) such that limn→∞ qin = x. Then, since φ is continuous,

limn→∞(ain qin + bin ) = lim

n→∞ φ(qin ) = φ(x).

It follows that for each x ∈ (a, b)

supn=1,2,...

(anx + bn) = φ(x).

This completes the proof. �

Proposition 4.34 (Jensen’s inequality)Let −∞ ≤ a < b ≤ ∞. Suppose that X : Ω → (a, b) is an integrablerandom variable on (Ω,F , P) and take a σ-field G ⊂ F . If φ : (a, b) → Ris a convex function such that the random variable φ(X) is also integrable,then

φ(E(X | G)) ≤ E(φ(X) | G).

Proof We must show that E(X | G) ∈ (a, b), P-a.s. before we can evenwrite φ(E(X | G)) on the left-hand side of the inequality. By assumption,X > a. The set B = {E(X | G) ≤ a} is G-measurable, so by the definition ofconditional expectation,

0 ≤ E(1B(X − a)) = E(1B(E(X − a | G))) = E(1B(E(X | G) − a)) ≤ 0,

implying that E(1B(X − a)) = 0. Because, X > a, it means that P(B) = 0, or

Page 140: 0521175577_1107002494ProbabilityFina

130 Conditional expectation

in other words, E(X | G) > a, P-a.s. We can show similarly that E(X | G) <b, P-a.s.

By Lemma 4.33, since φ is the supremum of a sequence of affine func-tions Ln(x) = anx + bn, we have anX + bn ≤ φ(X) for each n = 1, 2, . . . ,hence by the linearity and positivity of conditional expectation (Exercise4.13) we obtain

anE(X | G) + bn ≤ E(φ(X) | G)

for each n = 1, 2, . . . , and taking the supremum over n completes the proof.�

Applying this to the trivial σ-field G = {Ω,∅}, we have the followingcorollary.

Corollary 4.35Suppose the random variable X is integrable, φ is convex on an open inter-val containing the range of X and φ(X) is also integrable. Then

φ(E(X)) ≤ E(φ(X)).

The next special case is equally important for the applications we havein mind. It follows from Jensen’s inequality by taking φ(x) = x2.

Corollary 4.36If X2 is integrable, then (E(X | G))2 ≤ E(X2 | G).

4.4 The inner product space L2(P)

We turn to some unfinished business: establishing the existence of condi-tional expectation in the general setting of Definition 4.20. We do this firstfor square-integrable random variables, that is, random variables X suchthat E(X2) is finite.

We shall identify random variables which are equal to one another P-a.s. Given two random variables X, Y , observe that if a ∈ R and bothE(X2) and E(Y2) are finite, then E((aX)2) = a2E(X2) and E((X + Y)2) ≤2(E(X2)+E(Y2)) are also finite. This shows that the collection of such ran-dom variables is a vector space. Moreover, by the Schwarz inequality, seeLemma 3.49, |E(XY)| ≤ √

E(X2)E(Y2) is finite too. We introduce somenotation to reflect this.

Page 141: 0521175577_1107002494ProbabilityFina

4.4 The inner product space L2(P) 131

Definition 4.37We denote by

L2(P) = L2(Ω,F , P)

the vector space of all square-integrable random variables on a probabilityspace (Ω,F , P). For any X, Y ∈ L2(P) we define their inner product by

〈X, Y〉 = E(XY).

Remark 4.38In abstract texts on functional analysis it is customary to eliminate the non-uniqueness due to identifying random variables equal to one another P-a.s.,by considering the vector space of equivalence classes of elements of L2(P)under the equivalence relation

X ∼ Y if and only if X = Y , P-a.s.

We prefer to work directly with functions rather than equivalence classesfor the results we require.

We note some immediate properties of the inner product.(i) The inner product is linear in its first argument: given a, b ∈ R and

X1, X2, Y ∈ L2(P), we have (by the linearity of expectation)

〈aX1 + bX2, Y〉 = a〈X1, Y〉 + b〈X2, Y〉.(ii) The inner product is symmetric:

〈Y, X〉 = E(YX) = E(XY) = 〈X, Y〉.Hence it is also linear in its second argument.

(iii) The inner product is non-negative:

〈X, X〉 = E(X2) ≥ 0.

(iv) 〈X, X〉 = 0 if and only if X = 0, P-a.s.; this is so because 〈X, X〉 =E(X2), and E(X2) = 0 if and only if X = 0, P-a.s., see Proposi-tion 1.36.

Since we do not distinguish between random variables equal to one anotherP-a.s., we interpret this as saying that 〈X, X〉 = 0 means X = 0, and withthis proviso the last two properties together say that the inner product ispositive definite.

The inner product induces a notion of ‘length’, or norm, on vectors inL2(P).

Page 142: 0521175577_1107002494ProbabilityFina

132 Conditional expectation

Definition 4.39For any X ∈ L2(P) define the L2-norm as

||X||2 =√〈X, X〉 =

√E(X2).

This should look familiar. For x = (x1, . . . , xn) ∈ Rn the Euclidean norm

||x||2 =√√

n∑i=1

x2i

is related in the same manner to the scalar product

〈x, y〉 =n∑

i=1

xiyi.

The sum is now replaced by an integral.The L2-norm shares the following properties of the Euclidean norm.(i) For any X ∈ L2(P)

||X||2 ≥ 0,

with ‖X‖2 = 0 if and only X = 0, P-a.s.(ii) For any a ∈ R and X ∈ L2(P)

||aX||2 = |a| || X ||2.(iii) For any X, Y ∈ L2(P)

||X + Y ||2 ≤ ||X||2 + ||Y ||2 .The first two claims are obvious, while the third follows from the Schwarzinequality, using the definition of the norm:

||X + Y ||22 = E((X + Y)2) = E(X2) + 2E(XY) + E(Y2)

≤ ||X||22 + 2 ||X||2 ||Y ||2 + ||Y ||22 = (||X||2 + ||Y ||2)2.

The Schwarz inequality is key to many properties of the inner productspace L2(P). First, since a constant random variable is square-integrable,the Schwarz inequality implies that

(E(|X| ))2 = (E(|1X| )2 ≤ ||1||22 ||X||22 = E(X2),

so E(|X|) must be finite if E(X2) is. In other words, any square-integrable Xis also an integrable random variable, hence E(X) is well-defined for eachX ∈ L2(P).

Second, the Schwarz inequality implies continuity of the inner productand L2-norm.

Page 143: 0521175577_1107002494ProbabilityFina

4.4 The inner product space L2(P) 133

Definition 4.40We say that f : L2(P) → R is norm continuous if for any X ∈ L2(P) andany sequence X1, X2, . . . ∈ L2(P)

limn→∞ ||Xn − X||2 = 0 implies lim

n→∞ | f (Xn) − f (X)| → 0.

Exercise 4.18 Show that the maps X �→ 〈X, Y〉 and X �→ ||X||2 arenorm continuous functions.

For our purposes, the most important property of L2(P) is its complete-ness. The terminology is borrowed from the real line: recall that x1, x2, . . . ∈R is called a Cauchy sequence if supm,n≥k |xn − xm| → 0 as k → ∞. The keyproperty that distinguishes R from Q is that R is complete while Q is not:every Cauchy sequence x1, x2, . . . ∈ R has a limit limn→∞ xn ∈ R, but this isnot the case in Q. For example, take any Cauchy sequence r1, r2, . . . ∈ Q ofrationals with limn→∞ rn =

√2, which is not in Q.

The definition of a Cauchy sequence and the notion of completeness alsomake sense in L2(P).

Definition 4.41We say that X1, X2, . . . ∈ L2(P) is a Cauchy sequence whenever

supm,n≥k‖Xn − Xm‖2 → 0 as k → ∞.

By saying that L2(P) is complete we mean that for every Cauchy sequenceX1, X2, . . . ∈ L2(P) there is an X ∈ L2(P) such that

‖Xn − X‖2 → 0 as n→ ∞.Theorem 4.42L2(P) is complete.

The proof makes essential use of the first Fatou lemma (Lemma 1.41 (i)).We leave the details to the end of the chapter, Section 4.6.

Exercise 4.19 Show that any convergent sequence X1, X2, . . . ∈L2(P) is a Cauchy sequence, that is, show that if limn→∞ ‖Xn − X‖2 = 0for some X ∈ L2(P), then X1, X2, . . . is a Cauchy sequence.

Page 144: 0521175577_1107002494ProbabilityFina

134 Conditional expectation

Orthogonal projection and conditional expectation in L2(P)

The conditional expectation of X ∈ L2(P) with respect to aσ-fieldG ⊂ F isgiven in Definition 4.20 as a G-measurable random variable E(X | G) suchthat for all B ∈ G

E(1BE(X | G)) = E(1BX).

We denote the set of all G-measurable square-integrable random vari-ables by L2(G, P), and write L2(F , P) instead of L2(P) when there is somedanger of ambiguity. Then L2(G, P) is an example of a linear subspace ofL2(F , P), that is, a subset L2(G, P) ⊂ L2(F , P) such that X, Y ∈ L2(G, P)implies aX + bY ∈ L2(G, P) for all a, b ∈ R. The inner product and normfor any X, Y ∈ L2(G, P) coincide with those in L2(F , P) and can be denotedby the same symbols 〈X, Y〉 and ‖X‖2.

Since Theorem 4.42 applies to the family of square-integrable randomvariables on any probability space, we know that L2(G, P) is also complete.It is often useful to state this property slightly differently, using the notionof a closed set.

Definition 4.43We say that a subset C ⊂ L2(F , P) is closed whenever it has the followingproperty: for any sequence X1, X2, . . . ∈ C and X ∈ L2(F , P)

limn→∞ ‖Xn − X‖2 = 0 implies X ∈ C.

Proposition 4.44For any σ-field G ⊂ F , the family L2(G, P) of G-measurable square-inte-grable random variables is a closed subset of L2(F , P).

Proof Suppose that X1, X2, . . . ∈ L2(G, P) and limn→∞ ‖Xn − X‖2 = 0for some X ∈ L2(F , P). By Exercise 4.19, it is a Cauchy sequence inL2(F , P). Because the norms in L2(G, P) and L2(F , P) coincide, it followsthat X1, X2, . . . is a Cauchy sequence in L2(G, P). Because L2(G, P) is com-plete, there is a Y ∈ L2(G, P) such that limn→∞ ‖Xn − Y‖2 = 0. To concludethat X ∈ L2(G, P) it remains to show that X = Y , P-a.s. This is so because

0 ≤ ‖X − Y‖2 = ‖(X − Xn) − (Y − Xn)‖2 ≤ ‖X − Xn‖2 + ‖Y − Xn‖2 → 0

as n→ ∞, hence ‖X − Y‖2 = 0. �

The analogy with the geometric structure of Rn can be taken further.Using the centered random variables Xc = X − E(X), Yc = Y − E(Y), wecan write the variance of X as

Var(X) = E(X2c ) = ||Xc||2 ,

Page 145: 0521175577_1107002494ProbabilityFina

4.4 The inner product space L2(P) 135

and similarly for Y. Their covariance is given by

Cov(X, Y) = E(XcYc) = 〈Xc, Yc〉.Thus, if we define the angle θ between two random variables X, Y inL2(F , P) by setting

cos θ =〈X, Y〉||X||2 ||Y ||2

(which makes sense as long as neither X nor Y are 0, P-a.s.), we recoverthe correlation between non-constant random variables X, Y ∈ L2(F , P) asthe angle between the centred random variables Xc, Yc:

ρX,Y =〈Xc, Yc〉||Xc||2 ||Yc||2 .

In particular, X and Y are uncorrelated if and only if 〈Xc, Yc〉 = 0.Clearly, as defined above, in general we have cos θ = 0 if 〈X1, X2〉 = 0.

It seems natural to use this to define orthogonality with respect to the innerproduct.

Definition 4.45Whenever random variables X, Y ∈ L2(F , P) satisfy 〈X, Y〉 = 0, we say thatthey are orthogonal.

The next two exercises show how the geometry of the vector spaceL2(F , P) reflects Euclidean geometry, even though L2(F , P) is not nec-essarily finite-dimensional.

Exercise 4.20 Prove the following Pythagoras theorem inL2(F , P).If X, T ∈ L2(F , P) and 〈X, Y〉 = 0, then

||X + Y ||22 = ||X||22 + ||Y ||22 .

Exercise 4.21 Prove the following parallelogram law in L2(F , P).For any X, Y ∈ L2(F , P)

‖X + Y‖22 + ||X − Y ||22 = 2 ||X||22 + 2 ||Y ||22 .

Page 146: 0521175577_1107002494ProbabilityFina

136 Conditional expectation

Exercise 4.22 Show that Xn(ω) = sin nω and Ym(ω) = cos mω areorthogonal in L2[−π, π] for any m, n = 1, 2, . . . .

More generally, if X1, X2, . . . , Xn ∈ L2(F , P) are mutually orthogonal,then the linearity of the inner product yields⟨ n∑

i=1

Xi,

n∑j=1

Xj

⟩=

n∑i, j=1

〈Xi, Xj〉 =n∑

i=1

〈Xi, Xi〉.

so that ∥∥∥∥∥∥∥n∑

i=1

Xi

∥∥∥∥∥∥∥2

2

=

n∑i=1

‖Xi‖22.

(With n = 2 we recover the Pythagoras theorem.)In R3 the nearest point to (x, y, z) in the (x, y)-plane is its orthogonal

projection (x, y, 0).We can write (x, y, z) = (x, y, 0) + (0, 0, z) and note thatthe vector (0, 0, z) is orthogonal to (x, y, 0), as their scalar product is 0.

We wish to define orthogonal projections in L2(F , P) similarly, usingthe inner product 〈X, Y〉. Suppose that M is a closed linear subspace inL2(F , P); that is, M is a closed subset of L2(F , P) such that aX + bY ∈ Mfor any X, Y ∈ M and a, b ∈ R. First we introduce the nearest point in Mto an X ∈ L2(F , P). It is by definition the random variable Y ∈ M whoseexistence and uniqueness is asserted in the next theorem.

Theorem 4.46 (nearest point)Let M be a closed linear subspace in L2(F , P). For any X ∈ L2(F , P) thereis a Y ∈ M such that

||X − Y ||2 = inf {||X − Z||2 : Z ∈ M} .Such a random variable Y is unique to within equality P-a.s.

The proof is again deferred to the end of the chapter, Section 4.6.Suppose that X ∈ L2(F , P) and let Y be its the nearest point in M in the

sense of Theorem 4.46. We claim that X − Y is orthogonal to every Z ∈ M.Indeed, for any c ∈ R we have Y + cZ ∈ M, hence

||X − Y ||2 ≤ ||X − (Y + cZ)||22 = ||X − Y ||22 − 2c〈X − Y, Z〉 + c2 ||Z||22 .It follows that 2c〈X − Y, Z〉 ≤ c2 ||Z||22 for any c ∈ R. As a result, −c ‖Z‖22 ≤2〈X − Y, Z〉 ≤ c ‖Z‖22 for any c > 0, which implies that 〈X − Y, Z〉 = 0,proving that X − Y and Z are orthogonal.

Page 147: 0521175577_1107002494ProbabilityFina

4.5 Existence of E(X | G) for integrable X 137

The converse is easy to check.

Exercise 4.23 Let M be a closed linear subspace in L2(F , P). Showthat if Y ∈ M satisfies 〈X − Y, Z〉 = 0 for all Z ∈ M, then

||X − Y ||2 = inf {||X − Z||2 : Z ∈ M} .

Because of these properties, for any X ∈ L2(F , P) its nearest point in Mis also called the orthogonal projection of X onto M.

We already know that L2(G, P) is a closed linear subspace in L2(F , P).This makes it possible to relate orthogonal projection onto L2(G, P) to con-ditional expectation.

Proposition 4.47For any σ-field G ⊂ F and any X ∈ L2(F , P), the orthogonal projection ofX onto L2(G, P) is P-a.s. equal to the conditional expectation E(X | G).

Proof Let Y be the orthogonal projection of X onto L2(G, P). Since Y ∈L2(G, P), it is G-measurable. Moreover, for any B ∈ G we have 1B ∈L2(G, P), so X − Y and 1B are orthogonal,

0 = 〈1B, X − Y〉 = E(1BX) − E(1BY),

which means that

E(1BY) = E(1BX) = E(1BE(X | G)).

We have shown that Y = E(X | G), P-a.s. �

Because we have established the existence and uniqueness of the or-thogonal projection, this immediately gives the existence and uniqueness(to within equality P-a.s.) of the conditional expectation E(X | G) for anysquare-integrable random variable X and any σ-field G ⊂ F . For manyapplications in finance this will suffice.

4.5 Existence of E(X | G) for integrable X

In this section we construct E(X | G) for any integrable random variable Xand σ-field G ⊂ F . The next result is a vital stepping stone in this task.

We observed in Exercise 4.18 that, for a fixed Y ∈ L2(P), the linearmap on L2(P) given by X �→ 〈X, Y〉 is norm continuous. Remarkably, allcontinuous linear maps from L2(P) to R have this form.

Page 148: 0521175577_1107002494ProbabilityFina

138 Conditional expectation

Theorem 4.48If L : L2(P)→ R is linear and norm continuous, then there exists (uniquelyto within equality P-a.s.) a Y ∈ L2(P) such that for all X ∈ L2(P)

L(X) = 〈X, Y〉 = E(XY).

Proof Since L is linear and norm continuous,

M = {X ∈ L2(P) : L(X) = 0}is a closed linear subspace in L2(P). If L(X) = 0 for all X ∈ L2(P), then wetake Y = 0. Otherwise, there is an X ∈ L2(P) such that L(X) � 0. Let Z bethe orthogonal projection of X onto M. It follows that X � Z and X − Z isorthogonal to every random variable in M. We put

E =X − Z‖X − Z‖2

and

U = L(X)E − L(E)X.

Then L(U) = L(X)L(E) − L(E)L(X) = 0, so U ∈ M. As a result,

0 = 〈U, E〉 = L(X) − 〈X, L(E)E〉.Hence Y = L(E)E satisfies L(X) = 〈X, Y〉 for all X ∈ L2(P). This provesthe existence part.

To prove uniqueness, suppose that V ∈ L2(P) satisfies 〈X, Y〉 = 〈X,V〉for all X ∈ L2(P). Then 〈X, Y − V〉 = 0 for all X ∈ L2(P). Apply this withX = Y −V . Then 〈Y −V, Y −V〉 = E((Y −V)2) = 0, hence Y = V , P-a.s. byProposition 1.36. �

The set of all integrable random variables on a given probability space isa vector space due to the linearity of expectation. We continue to identifyX and Y if they are equal to one another P-a.s., and define a natural normon this vector space.

Definition 4.49Let (Ω,F , P) be a probability space. We denote by L1(P) = L1(Ω,F , P)the vector space consisting all integrable random variables, and define

||X||1 = E(|X|)for any X ∈ L1(P). We say that ||X||1 is the L1-norm of X.

Like the L2-norm in the previous section, the L1-norm satisfies the fol-lowing conditions.

Page 149: 0521175577_1107002494ProbabilityFina

4.5 Existence of E(X | G) for integrable X 139

(i) For any X ∈ L1(P)

‖X‖1 ≥ 0,

with ‖X‖1 = 0 if and only if X = 0, P-a.s.(ii) For any a ∈ R and X ∈ L1(P)

‖aX‖1 = |a| ‖X‖1 .(iii) For any X, Y ∈ L1(P)

‖X + Y‖1 ≤ ‖X‖1 + ‖Y‖1 .The first two properties are obvious, while the last one follows by applyingexpectation to both sides of the inequality |X + Y | ≤ |X| + |Y | .

In the same manner as for the L2-norm, we can consider Cauchysequences in the L1-norm, that is, sequences X1, X2, . . . ∈ L1(P) such that

supm,n≥k‖Xn − Xm‖1 → 0 as k → ∞,

and define completeness of L1(P) by the condition that every Cauchysequence X1, X2, . . . ∈ L1(P) should converge to some X ∈ L1(P), thatis,

‖Xn − X‖1 → 0 as n→ ∞.Theorem 4.50L1(P) is complete.

The proof is very similar to that of Theorem 4.42 and can be found inSection 4.6.

Even though L1(P) and L2(P) have some similar features such as com-pleteness, the L1-norm does not share the geometric properties of the L2-norm, as the next exercise confirms.

Exercise 4.24 Show that the parallelogram law stated in Exer-cise 4.21 fails for the L1-norm, by considering the random variablesX(ω) = ω and Y(ω) = 1 − ω defined on the probability space [0, 1]with Borel sets and Lebesgue measure. Explain why this means thatthe L1-norm is not induced by an inner product.

To compensate for the lack of an inner product in L1(P) we shall usea result that comes close to representing a particular linear map on L1(P)in a manner resembling the representation in Theorem 4.48 of any norm

Page 150: 0521175577_1107002494ProbabilityFina

140 Conditional expectation

continuous linear map on L2(P) by the inner product. To introduce thisresult, we need the following definition.

Definition 4.51Given measures μ, ν defined on the same σ-field F on Ω, we write ν � μand say that ν is absolutely continuous with respect to μ if for any A ∈ F

μ(A) = 0 implies ν(A) = 0.

Example 4.52Any random variable X with continuous distribution provides an example.In that case, for any Borel set A ∈ B(R) we have PX(A) =

∫A

fX dm, wherefX is the density of X and m is Lebesgue measure. Then PX � m, sincem(A) = 0 implies PX(A) =

∫R

1A fX dm = 0, as follows from Exercise 1.30.

Example 4.53At the other extreme we may consider Lebesgue measure m and the Diracmeasure δa for any a ∈ R, defined in Example 1.12 and restricted to theBorel sets. We have m({a}) = 0 while δa({a}) = 1, so δa is not absolutelycontinuous with respect to m. On the other hand, m(R \ {a}) = ∞ whileδa(R \ {a}) = 0, so m is not absolutely continuous with respect to δa either.

If Z ∈ L1(P) is a non-negative random variable such that E(Z) = 1, thenQ(A) =

∫A

Z dP for each A ∈ F defines a probability measure Q on thesame σ-field F as P. It follows that Q � P. The following theorem showsthat the converse is also true.

Theorem 4.54 (Radon–Nikodym)If P,Q are probability measures defined on the same σ-field F on Ω andsuch that Q � P, then there exists a random variable Z ∈ L1(P) such thatfor each A ∈ F

Q(A) =∫

AZ dP.

The proof of this theorem, based on a brilliant argument due to John vonNeumann, is given in Section 4.6.

Page 151: 0521175577_1107002494ProbabilityFina

4.5 Existence of E(X | G) for integrable X 141

Exercise 4.25 Under the assumptions of Theorem 4.54, show thatthe expectation of any random variable X ∈ L1(Q) with respect to Qcan be written as

EQ(X) = EP(XZ). (4.12)

The right-hand side of (4.12) resembles the inner product of X and Z.(We cannot write it as 〈X, Z〉 unless we know, in addition, that X, Z ∈L2(P).) This is the result which compensates for the lack of an inner prod-uct behind the L1-norm as alluded above. It enables us to establish theexistence of conditional expectation for any random variable in L1(P).

Proposition 4.55For any σ-field G ⊂ F and any random variable X ∈ L1(F , P), the condi-tional expectation E(X | G) exists and is unique to within equality P-a.s.

Proof First suppose that X is non-negative and E(X) = 1. The probabilitymeasure Q defined on the σ-field G as

Q(A) = E(1AX) for each A ∈ Gis absolutely continuous with respect to P (to be precise, with respect tothe restriction of P to the σ-field G, denoted here by the same symbol Pby a slight abuse of notation). By the Radon–Nikodym theorem, we knowthat there is a random variable Z ∈ L1(G, P) such that

Q(A) = E(1AZ) for each A ∈ G.We therefore have

E(1AX) = E(1AZ) for each A ∈ G.If X ∈ L1(F , P) is non-negative but E(X) is not necessarily equal to 1,then we can apply the above to X = X

E(X) so that E(X) = 1, obtain Z ∈L1(G, P) for X as above, and put Z = E(X)Z. This works when E(X) > 0.If E(X) = 0, we simply take Z = 0. Finally, for an arbitrary X ∈ L1(F , P)we write X = X+ − X−, where X+, X− ∈ L1(F , P) are non-negative randomvariables, obtain Z+ and Z− for X+ and, respectively, X− as above, and takeZ = Z+ − Z−.

This enables us to conclude that for any X ∈ L1(F , P) there is a randomvariable Z ∈ L1(G, P) such that

E(1AX) = E(1AZ) for each A ∈ G.

Page 152: 0521175577_1107002494ProbabilityFina

142 Conditional expectation

It follows from Definition 4.20 that Z = E(X | G), P-a.s., which provesthe existence of conditional expectation as well as its uniqueness to withinequality P-a.s. �

The Radon–Nikodym theorem has much wider application, of course. IfQ � P, we refer to Z ≥ 0 such that Q(A) =

∫A

ZdP for each A ∈ F asthe Radon–Nikodym derivative (often also referred to as the density) ofQ with respect to P, and write Z = dQ

dP . In finance, the principal applicationoccurs when the probabilities P,Q have the same collections of sets ofmeasure 0, so that Q � P and P � Q. We then write P ∼ Q and saythat P and Q are equivalent probabilities. An important application of thiscan be found, for example, in the fundamental theorem of asset pricing,asserting that the lack of arbitrage is equivalent to the existence of a riskneutral probability, see [DMFM] and [BSM].

Some elementary relationships between Radon–Nikodym derivatives ap-pear in the next exercise.

Exercise 4.26 Suppose that P,Q,R are probabilities defined on thesame σ-field F . Verify the following conditions.

(1) If Q � P, R � P and λ ∈ (0, 1), then λQ + (1 − λ)R � P and

d(λQ + (1 − λ)R)dP

= λdQdP+ (1 − λ) dR

dP.

(2) If Q � P and R � Q, then R � P and

dRdP=

dRdQ

dQdP.

(3) If P ∼ Q, then

dPdQ=

(dQdP

)−1

.

4.6 Proofs

Theorem 4.42L2(P) is complete.

Proof First, note that if X1, X2, . . . is a Cauchy sequence in L2(P), then we

Page 153: 0521175577_1107002494ProbabilityFina

4.6 Proofs 143

can find n1 such that

||Xk − Xl||2 ≤ 12

whenever k, l ≥ n1.

Next, find n2 > n1 such that

||Xk − Xl||2 ≤ 122

whenever k, l ≥ n2,

and continue in this fashion to find a sequence of natural numbers n1 <

n2 < · · · such that for each i = 1, 2, . . .

||Xk − Xl||2 ≤ 12i

whenever k, l ≥ ni.

In particular, for every i = 1, 2, . . .

E(∣∣∣Xni+1 − Xni

∣∣∣) ≤ ||Xni+1 − Xni ||2 ≤12i.

This means that, starting with a Cauchy sequence in the L2-norm, we havea subsequence Xn1 , Xn2 , . . . for which

E(∣∣∣Xni+1 − Xni

∣∣∣) ≤ 12i

for each i = 1, 2, . . . .

Since Yi =∣∣∣Xni+1 − Xni

∣∣∣ is a non-negative F -measurable function on Ω,the monotone convergence theorem, applied to the partial sums

∑ni=1 Yi,

ensures that

E

⎛⎜⎜⎜⎜⎜⎝ ∞∑i=1

Yi

⎞⎟⎟⎟⎟⎟⎠ = ∞∑i=1

E(Yi) ≤ 1.

This means that P-a.s. the series∑∞

i=1

∣∣∣Xni+1 − Xni

∣∣∣ converges in R, henceP-a.s. the series

∑∞i=1(Xni+1 − Xni ) converges absolutely, and so, P-a.s., it

converges in R. We put

X = Xn1 +

∞∑i=1

(Xni+1 − Xni ) = limi→∞ Xni

on the subset of Ω on which∑∞

i=1(Xni+1 − Xni ) converges, and X = 0 on thesubset of P-measure 0 on which it possibly does not converge.

Finally, we must show that Xn also converges to X in L2-norm. First notethat

|Xk − X|2 = limi→∞

∣∣∣Xk − Xni

∣∣∣2 = lim infi→∞

∣∣∣Xk − Xni

∣∣∣2 .

Page 154: 0521175577_1107002494ProbabilityFina

144 Conditional expectation

So we can apply Fatou’s lemma to obtain

||Xk − X||22 = E(lim inf

i→∞∣∣∣Xk − Xni

∣∣∣2) ≤ lim infi→∞ E

(∣∣∣Xk − Xni

∣∣∣2)= lim inf

i→∞∣∣∣∣∣∣Xk − Xni

∣∣∣∣∣∣22→ 0 as k → ∞,

where the last step employs the fact that X1, X2, . . . is a Cauchy sequencein the L2-norm. �

Theorem 4.46 (nearest point)Let M be a closed linear subspace in L2(F , P). For any X ∈ L2(F , P) thereis a Y ∈ M such that

||X − Y ||2 = inf {||X − Z||2 : Z ∈ M} .Such a random variable Y is unique to within equality P-a.s.

Proof Let

δ = inf{||X − Z||2 : Z ∈ M}.There is a sequence Y1, Y2, . . . ∈ M such that δ ≤ ||X − Yk||2 < δ + 1

k foreach k = 1, 2, . . . . We will show that the Yk form a Cauchy sequence inthe L2-norm and then use completeness and the fact that M is closed toobtain Y ∈ M as a limit of the sequence Yn.

The parallelogram law (Exercise 4.21), applied to Yn − X and Ym − X forany m, n = 1, 2, . . . , provides that

||Yn + Ym − 2X||22 + ||Yn − Ym||22 = 2 ||Yn − X||22 + 2 ||Ym − X||22 .Now, ||Yn − X||22 → δ2 and ||Ym − X||22 → δ2 as m, n → ∞. Moreover,||Yn + Ym − 2X||22 → 4δ2 as m, n→ ∞ because 1

2 (Yn + Ym) ∈ M and

2δ ≤ ||Yn + Ym − 2X||2 ≤ ||Yn − X||2 + ||Yn − X||2 ≤ 2δ +1n+

1m.

This means that ||Yn − Ym||22 → 0 as m, n → ∞, showing that Y1, Y2, . . . isa Cauchy sequence. By completeness, the sequence converges in the L2-norm to a random variable Y ∈ L2(F , P) and, since M is closed, Y ∈ M.Finally, the continuity of the L2-norm shows that ||X − Yk||2 → ||X − Y ||2 ask → ∞, and this means that ||X − Y ||2 = δ.

To see that Y is unique, take any W ∈ M such that ||X −W ||2 = δ. Usingthe parallelogram law with Y − X and W − X we then have

||Y +W − 2X||22 + ||Y −W ||22 = 2 ||Y − X||22 + 2 ||W − X||22 = 4δ2,

while, since 12 (Y + W) ∈ M, it follows that ||Y +W − 2X||22 ≥ 4δ2, so

||Y −W ||22 = 0 and therefore Y = W, P-a.s. �

Page 155: 0521175577_1107002494ProbabilityFina

4.6 Proofs 145

Theorem 4.50L1(P) is complete.

Proof The argument in the proof of Theorem 4.42, which shows thatL2(P) is complete, can be repeated in the case of L1(P), with the L1-norminstead of the L2-norm and with the squares dropped in the final para-graph. �

Theorem 4.54 (Radon–Nikodym)If P,Q are probability measures defined on the same σ-field F on Ω andsuch that Q � P, then there exists a random variable Z ∈ L1(P) such thatfor each A ∈ F

Q(A) =∫

AZ dP.

Proof Consider a third probability measure defined for each A ∈ F as

R(A) =12

Q(A) +12

P(A).

By the Schwarz inequality, see Lemma 3.49, for any X ∈ L2(R)∣∣∣EQ(X)∣∣∣ ≤ EQ(|X|) ≤ 2ER(|X|) = 2ER(1 |X|)≤ 2

√ER(12)ER(|X|2) = 2

√ER(|X|2) = 2 ‖X‖2,R ,

where ‖X‖2,R denotes the norm in L2(R). This means that L : L2(R) → Rdefined as L(X) = 1

2EQ(X) for each X ∈ L2(R) is a norm continuous linearmap on L2(R). Therefore, by Theorem 4.48, there is a U ∈ L2(R) such that

12EQ(X) = ER(XU) for each X ∈ L2(R).

Since R = 12 Q + 1

2 P, this can be written as

EQ(X(1 − U)) = EP(XU) for each X ∈ L2(R). (4.13)

Applying this to X = 1A for any A ∈ F gives 12 Q(A) = ER(1AU), and since

0 ≤ 12 Q(A) ≤ R(A), we have

0 ≤ ER(1AU) ≤ R(A).

Because this holds for any A ∈ F , it follows that 0 ≤ U ≤ 1, R-a.s. Thisin turn implies that 0 ≤ U ≤ 1, P-a.s. and therefore also Q-a.s. Moreover,taking X = 1{U=1} in (4.13), we get

0 = EQ(1{U=1}(1 − U)) = EP(1{U=1}U) = P(U = 1),

Page 156: 0521175577_1107002494ProbabilityFina

146 Conditional expectation

and since Q � P, we also have Q(U = 1) = 0. This means that 0 ≤ U < 1,P-a.s. and Q-a.s.

We put Yn = 1 + U + U2 + · · · + Un. For any A ∈ F , taking X = 1AYn,which belongs to L2(R) because U is bounded R-a.s. and therefore Yn isbounded R-a.s., we get from (4.13) that

EQ(1A(1 − Un+1)) = EQ(1AYn(1 − U)) = EP(1AYnU) = EP(1A(Yn+1 − 1)).

Since 0 ≤ U < 1, Q-a.s., it follows that 1−Un+1 is a Q-a.s. non-decreasingsequence with limit 1. Moreover, since 0 ≤ U < 1, P-a.s., it follows thatYn+1 − 1 is a P-a.s. non-decreasing sequence, whose limit we denote by Z.By monotone convergence, Theorem 1.31, this gives

Q(A) = EQ(1A) = EP(1AZ),

completing the proof. �

Page 157: 0521175577_1107002494ProbabilityFina

5

Sequences of random variables

5.1 Sequences in L2(P)5.2 Modes of convergence for random variables5.3 Sequences of i.i.d. random variables5.4 Convergence in distribution5.5 Characteristic functions and inversion formula5.6 Limit theorems for weak convergence5.7 Central Limit Theorem

Although financial markets can support only finitely many trades, finite se-quences of random variables are hardly sufficient for modelling financialreality. For instance, to model frequent trading we might consider the bi-nomial model with a large but finite number of short steps. However, itwould be rather restrictive to place an arbitrary lower bound on the steplength. We prefer to consider infinite sequences of random variables (andin due course families of random variables indexed by a continuous timeparameter, as in [BSM]). In doing so we need to be aware that convergencequestions for random variables are more complex than for a sequence ofnumbers.

5.1 Sequences in L2(P)

Continuing a theme developed in Chapter 4, we study sequences of square-integrable random variables. The properties of the inner product allow usto construct families of mutually orthogonal random variables, which canplay a similar role as an orthogonal basis in a finite-dimensional vectorspace. Then we move our attention to approximating square-integrable ran-

147

Page 158: 0521175577_1107002494ProbabilityFina

148 Sequences of random variables

dom variables on [0, 1] by sequences of continuous functions, a useful re-sult because of the familiar properties of continuous functions.

Orthonormal sequences

Recall from Definition 4.45 that X, Y ∈ L2(P) are called orthogonal if〈X, Y〉 = E(XY) = 0. This leads naturally to the notion of an orthonor-mal set, that is, a subset of L2(P) whose members are pairwise orthogonaland each has L2-norm 1. A natural question arises how to approximatean arbitrary element of L2(P) by linear combinations of the elements of agiven finite orthonormal set.

Proposition 5.1Given Y ∈ L2(P) and a finite orthonormal set {X1, X2, . . . , Xn} in L2(P), thenorm ||Y−∑n

i=1 aiXi||2 attains its minimum when ai = 〈Y, Xi〉 for i = 1, . . . , n.

Proof By definition, linearity and symmetry of the inner product, andsince the Xi are orthonormal,∥∥∥∥∥∥∥Y −

n∑i=1

aiXi

∥∥∥∥∥∥∥2

2

=

⟨Y −

n∑i=1

aiXi, Y −n∑

j=1

ajX j

= ||Y ||22 − 2n∑

i=1

ai〈Xi, Y〉 +n∑

i=1

n∑j=1

aia j〈Xi, Xj〉

= ||Y ||22 − 2n∑

i=1

ai〈Xi, Y〉 +n∑

i=1

a2i

= ||Y ||22 +n∑

i=1

[a2i − 2ai〈Xi, Y〉].

Note that for each i

[ai − 〈Xi, Y〉]2 = a2i − 2ai〈Xi, Y〉 + 〈Xi, Y〉2

so that in each term of the sum on the right we can replace a2i − 2ai〈Xi, Y〉

by [ai − 〈Xi, Y〉]2 − 〈Xi, Y〉2. In other words∥∥∥∥∥∥∥Y −n∑

i=1

aiXi

∥∥∥∥∥∥∥2

2

= ||Y ||22 −n∑

i=1

〈Xi, Y〉2 +n∑

i=1

[ai − 〈Xi, Y〉]2, (5.1)

and the right-hand side attains its minimum if and only if ai = 〈Xi, Y〉 foreach i. �

Page 159: 0521175577_1107002494ProbabilityFina

5.1 Sequences in L2(P) 149

This choice of coefficients leads to a very useful inequality when welet n → ∞ and consider an orthonormal sequence, that is, a sequenceX1, X2, . . . ∈ L2(P) of random variables with ||Xi||2 = 1 for all i = 1, 2, . . .and with 〈Xi, Xj〉 = 0 for all i, j = 1, 2, . . . such that i � j.

Corollary 5.2 (Bessel inequality)Given Y ∈ L2(P) and an orthonormal sequence X1, X2, . . . ∈ L2(P), wehave

∞∑i=1

〈Xi, Y〉2 ≤ ||Y ||22 . (5.2)

Equality holds precisely when∑n

i=1〈Xi, Y〉Xi converges to Y in L2-norm,i.e. when ||Y −∑n

i=1〈Xi, Y〉Xi||2 → 0 as n→ ∞.

Proof Take X1, . . . , Xn from the given orthonormal sequence. Putting ai =

〈Xi, Y〉 in (5.1), we can see that

0 ≤∥∥∥∥∥∥∥Y −

n∑i=1

〈Xi, Y〉Xi

∥∥∥∥∥∥∥2

2

= ||Y ||22 −n∑

i=1

〈Xi, Y〉2.

Thus ||Y ||22 is an upper bound for the increasing sequence of partial sums∑ni=1〈Xi, Y〉2, hence also for its limit

∑∞i=1〈Xi, Y〉2.

The identity ∥∥∥∥∥∥∥Y −n∑

i=1

〈Xi, Y〉Xi

∥∥∥∥∥∥∥2

2

= ||Y ||22 −n∑

i=1

〈Xi, Y〉2

holds for each n, and so, if the partial sums∑n

i=1〈Xi, Y〉Xi converge to Y inL2-norm, then

0 = limn→∞

∥∥∥∥∥∥∥Y −n∑

i=1

〈Xi, Y〉Xi

∥∥∥∥∥∥∥2

2

= ||Y ||22 − limn→∞

n∑i=1

〈Xi, Y〉2, (5.3)

which shows that∑∞

i=1〈Xi, Y〉2 = ||Y ||22 .Conversely, if we have equality in (5.2), then the right-hand side of (5.3)

is 0, and since the left-hand side is also 0, it means that∑n

i=1〈Xi, Y〉Xi con-verges to Y in the L2-norm as n→ ∞. �

In the Euclidean space Rn, the standard orthonormal basis e1, e2, . . . en

provides the representation x =∑n

i=1 xiei with 〈x, ei〉 = xi, for each x =(x1, x2, . . . , xn) ∈ Rn. The basis is maximal in the sense that we cannot addfurther non-zero vectors to it and retain an orthonormal set. This idea canbe used to provide an analogue for a basis in L2(P).

Page 160: 0521175577_1107002494ProbabilityFina

150 Sequences of random variables

Definition 5.3We say that D ⊂ L2(P) is a complete orthonormal set whenever

〈X, Y〉 ={

1 if X = Y , P-a.s.0 otherwise

for any X, Y ∈ D, and 〈X, Z〉 = 0 for all X ∈ D implies that Z = 0, P-a.s.

The case when there is a countable complete orthonormal set {E1, E2, . . .}is of particular interest. We then say that E1, E2, . . . is a complete or-thonormal sequence or orthonormal basis.

Given a sequence a1, a2, . . . ∈ R such that∑∞

i=1 a2i converges, the partial

sums Yn =∑n

i=1 aiEi satisfy ‖Yn − Ym‖22 =∑n

i=m+1 a2i (by Pythagoras), and

this becomes arbitrarily small as m, n → ∞. So Y1, Y2, . . . is a Cauchy se-quence in the L2-norm, and therefore by Theorem 4.42 there is a Y ∈ L2(P)such that ∥∥∥∥∥∥∥Y −

n∑i=1

aiEi

∥∥∥∥∥∥∥2

→ 0 as n→ ∞. (5.4)

We define the sum of the infinite series∑∞

i=1 aiEi in L2-norm as∑∞

i=1 aiEi =

Y (to within equality P-a.s.) whenever (5.4) holds (see also Remark 5.10).In particular, when ai = 〈Y, Ei〉, the Bessel inequality (5.2) ensures that∑∞

i=1 ai2 ≤ ||Y ||22 < ∞. This yields a representation of Y analogous to that

for a basis in a finite-dimensional vector space.

Theorem 5.4Given a complete orthonormal sequence E1, E2, . . . ∈ L2(P), every Y ∈L2(P) satisfies

Y =∞∑

i=1

〈Y, Ei〉Ei. (5.5)

This is known as the Fourier representation of Y . The 〈Y, Ei〉 are calledthe Fourier coefficients of Y relative to the complete orthonormal se-quence E1, E2, . . . .

Remark 5.5The classical representation of functions by their Fourier series uses (5.5)in the case of Ω = [−π.π] with Borel sets and uniform probability P =1

2πm[−π,π], where m[−π,π] is the restriction of Lebesgue measure to [−π, π],and with the sequence of functions

E0(t) =1√2π, E2n−1(t) =

cos nt√π, E2n(t) =

sin nt√π

Page 161: 0521175577_1107002494ProbabilityFina

5.1 Sequences in L2(P) 151

for each t ∈ [−π, π] and n = 1, 2, . . . . We are not going to prove the com-pleteness of this well-known sequence, but focus instead on an examplewhich has direct applications in stochastic calculus.

Proposition 5.6 (Parseval identity)An orthonormal sequence E1, E2, . . . ∈ L2(P) is complete if and only if foreach Y ∈ L2(P)

‖Y‖22 =∞∑

n=1

〈Y, Ei〉2. (5.6)

Proof If Y ∈ L2(P) and E1, E2, . . . ∈ L2(P) is a complete orthonormalsequence, then (5.5) holds, so

‖Y‖22 =⟨ ∞∑

i=1

〈Y, Ei〉Ei,

∞∑i=1

〈Y, Ei〉Ei

⟩=

∞∑i=1

〈Y, Ei〉2〈Ei, Ei〉 =∞∑

i=1

〈Y, Ei〉2

since 〈Ei, E j〉 = 1 if i = j, and 0 otherwise. Conversely, if 〈Z, Ei〉 = 0 foreach i = 1, 2, . . . , then (5.6) implies that ‖Z‖22 = 0, hence Z = 0, P-a.s., soE1, E2, . . . is a complete orthonormal sequence. �

Exercise 5.1 Show that if E1, E2, . . . ∈ L2(P) is a complete orthonor-mal sequence, then for any X, Y ∈ L2(P)

〈X, Y〉 =∞∑

i=0

〈X, Ei〉〈Y, Ei〉.

Example 5.7Let Ω = [0, 1] with Borel sets and Lebesgue measure. A complete or-thonormal sequence is given by the Haar functions:

H0 = 1,

Hn = 2j2 1( 2k

2 j+1 ,2k+12 j+1

] − 2j2 1( 2k+1

2 j+1 ,2k+22 j+1

] for n = 1, 2, . . . ,

where j = 0, 1, . . . and k = 0, 1, . . . , 2 j − 1 are such that n = 2 j + k. TheHaar functions are useful, for example in the construction of the Wienerprocess, see [SCF].

The Haar functions form a complete orthonormal sequence. The calcu-lations showing that these functions are orthogonal to one another and each

Page 162: 0521175577_1107002494ProbabilityFina

152 Sequences of random variables

has L2-norm 1 are left as Exercise 5.2 below. We show that this sequenceis complete. Suppose that a square-integrable random variable X on [0, 1]is orthogonal to every member of the sequence of Haar functions. We needto show that X is zero m-a.s.

To do this, we first show by induction on j = 0, 1, . . . that

I2 j+k =

∫(

k2 j ,

k+12 j

] X dm = 0 for each k = 0, 1, . . . , 2 j − 1. (5.7)

Since X is orthogonal to H0,

0 = 〈X,H0〉 =∫

(0,1]X dm = I1,

so (5.7) is true for j = 0. Now suppose that (5.7) holds for some j =0, 1, . . . . Then for any k = 0, 1, . . . , 2 j − 1

0 = I2 j+k =

∫(

k2 j ,

k+12 j

] X dm

=

∫(

2k2 j+1 ,

2k+12 j+1

] X dm +∫(

2k+12 j+1 ,

2k+22 j+1

] X dm = I2 j+1+2k + I2 j+1+2k+1

and

0 = 〈X,H2 j+k〉 = 2j2

∫(

2k2 j+1 ,

2k+12 j+1

] X dm − 2j2

∫(

2k+12 j+1 ,

2k+22 j+1

] X dm

= 2j2 (I2 j+1+2k − I2 j+1+2k+1) .

It follows that

I2 j+1+2k = I2 j+1+2k+1 = 0 for each k = 0, 1, . . . , 2 j − 1,

completing the induction argument. As a result,∫(

k2 j ,

l2 j

] X dm = 0

for each j = 0, 1, . . . and each k, l = 0, 1, . . . , 2 j such that k ≤ l. By Ex-ercise 3.42 we can conclude that X = 0, m-a.s. This shows that the Haarfunctions form a complete orthonormal sequence.

Page 163: 0521175577_1107002494ProbabilityFina

5.1 Sequences in L2(P) 153

Exercise 5.2 Verify that the Haar functions Hn for n = 0, 1, . . . forman orthonormal sequence, as claimed in Example 5.7.

Approximation by continuous functions

From the construction of Lebesgue integral we know that functions that areintegrable, and therefore also square-integrable, can be approximated by asequence of simple functions. Here we consider the special case of square-integrable functions on Ω = [0, 1], equipped with Borel sets and Lebesguemeasure, and show that they can be also be approximated by a sequence ofcontinuous functions.

Lemma 5.8For every square-integrable function f on Ω = [0, 1] (with Borel sets andLebesgue measure) there is a sequence of continuous functions fn on [0.1]approximating f in the L2-norm, that is,

‖ f − fn‖2 → 0 as n→ ∞.Proof It suffices to show that for every square-integrable function f on[0, 1] and for every ε > 0 there is a continuous function g defined on [0, 1]such that ‖ f − g‖2 ≤ ε.

First, take f (x) = 1(a,b)(x) for any x ∈ R, with a, b ∈ R such that a < b.For any ε > 0 put

g(x) =x − a + εε

1[a−ε,a](x) + 1(a,b)(x) +b − x + εε

1[b,b+ε](x)

for each x ∈ R, which defines a continuous function. We denote the restric-tions of f and g to [0, 1] by the same symbols f , g. Then

‖ f − g‖22 =∫

[0,1]( f − g)2 dm ≤

∫R

( f − g)2 dm

=

∫[a−ε,a]

g2 dm +∫

[b,b+ε]g2 dm

=

∫ a

a−ε

( x − a + εε

)2

dx +∫ b+ε

b

(b − x + εε

)2

dx =23ε < ε.

Next take f = 1A for a Borel set A ⊂ [0, 1], and let ε > 0. By Defini-tion 1.18, there is a countable family of open intervals J1, J2, . . . such that

Page 164: 0521175577_1107002494ProbabilityFina

154 Sequences of random variables

A ⊂ I =⋃∞

k=1 Jk and

m(A) ≤ m(I) ≤ m(A) +(ε

3

)2

,

hence

‖1A − 1I‖22 ≤ m(I \ A) = m(I) − m(A) ≤(ε

3

)2

.

We can take the Jk to be pairwise disjoint (otherwise any overlapping openintervals in this family can be joined together to form a new countablefamily of pairwise disjoint open intervals J′k such that I =

⋃∞k=1 J′k). Let

IK =⋃K

k=1 Jk. Because the series∑∞

k=1 m(Jk) = m(I) < ∞ converges, thereis a K such that∥∥∥1I − 1IK

∥∥∥2

2= m(I \ IK) =

∞∑k=K+1

m(Jk) ≤(ε

3

)2

.

We already know that for each k = 1, 2, . . . ,K there is a non-negative con-tinuous function gk such that∥∥∥1Jk − gk

∥∥∥2≤ ε

3K.

Putting g = g1 + · · · + gK , we have∥∥∥1IK − g∥∥∥

2≤ ∥∥∥1J1 − g1

∥∥∥2+ · · · + ∥∥∥1JK − gK

∥∥∥2≤ ε

3.

It follows that

‖ f − g‖2 =∥∥∥(1A − 1I) +

(1I − 1IK

)+(1IK − g

)∥∥∥2

≤ ‖1A − 1I‖2 +∥∥∥1I − 1IK

∥∥∥2+∥∥∥1IK − g

∥∥∥2≤ ε.

Next take any non-negative square-integrable function f on [0, 1]. ByProposition 1.28 there is a non-decreasing sequence of non-negative simplefunctions sn such that limn→∞ sn = f . It follows that limn→∞ ( f − sn)2 = 0and 0 ≤ f − sn ≤ f , so by dominated convergence, see Theorem 1.43, wehave

‖ f − sn‖22 =∫

[0,1]( f − sn)2 dm→ 0 as n→ ∞.

This shows that for any ε > 0 there is a non-negative simple function ssuch that

‖ f − s‖2 ≤ε

2.

Writing the simple function as s =∑N

n=1 an1An for some an ≥ 0 and some

Page 165: 0521175577_1107002494ProbabilityFina

5.1 Sequences in L2(P) 155

Borel sets An ⊂ [0, 1], we know that for each n = 1, 2, . . . ,N there is anon-negative continuous function gn such that∥∥∥1An − gn

∥∥∥2≤ ε

2Nan.

Putting g = g1 + · · · + gN , we get

‖s − g‖2 =∥∥∥a1

(1A1 − g1

)+ · · · + aN

(1AN − gN

)∥∥∥2

≤ a1

∥∥∥(1A1 − g1)∥∥∥

2+ · · · + aN

∥∥∥(1AN − gN)∥∥∥

2

≤ a1ε

2Na1+ · · · + aN

ε

2NaN=ε

2.

It follows that

‖ f − g‖2 ≤ ‖ f − s‖2 + ‖s − g‖2 ≤ε

2+ε

2= ε.

Finally, for an arbitrary square-integrable function f on [0, 1], we canwrite it as f = f +− f −, where f +, f − are non-negative and square-integrable.Then for any ε > 0 there are non-negative continuous functions g+, g− suchthat ∥∥∥ f + − g+

∥∥∥ ≤ ε2,

∥∥∥ f − − g−∥∥∥ ≤ ε

2,

and for g = g+ − g− we have

‖ f − g‖2 ≤∥∥∥ f + − g+

∥∥∥2+∥∥∥ f − − g−

∥∥∥2≤ ε

2+ε

2= ε,

completing the proof. �

The following exercise, which can be solved by applying Lemma 5.8, isused when constructing stochastic integrals in [SCF].

Exercise 5.3 Let a < 0 < 1 < b and let f be a Borel measurablefunction on [a, b] such that

∫[a,b]

f 2 dm < ∞. Show that

limh→0

∫[0,1]

( f (x) − f (x + h))2 dm(x) = 0.

Hint. Approximate f by continuous functions in L2-norm and use thefact that every continuous function on a closed interval is uniformlycontinuous.

Page 166: 0521175577_1107002494ProbabilityFina

156 Sequences of random variables

5.2 Modes of convergence for random variables

The partial sums of the Fourier representation of X ∈ L2(P) provide anexample of a sequence converging to X in L2-norm. We now explore howthe notion of convergence familiar from Euclidean space Rn may be gen-eralised to define several distinct modes of convergence for sequences ofrandom variables defined on the same probability space. We will describerelationships between four distinct modes of convergence for random vari-ables:

(i) convergence in L2-norm;(ii) convergence in L1-norm;

(iii) convergence P-almost surely;(iv) convergence in probability.By way of contrast, for a sequence x(1), x(2), . . . ∈ Rn the idea of con-

vergence is quite unambiguous: x(k) = (x(k)1 , x

(k)2 , . . . , x

(k)n ) converges to x =

(x1, x2, . . . , xn) as k → ∞ if and only if x(k)i → xi for each i = 1, . . . , n; in

other words, we have convergence for each coordinate. Convergence in Rn

can also be captured in terms of the norm ‖x‖2 =√∑n

i=1 x2i or the norm

‖x‖1 = ∑ni=1 |x| defined for each x ∈ Rn.

Exercise 5.4 Show that the following conditions are equivalent:(1) x(k)

i → xi as k → ∞ for each i = 1, . . . , n;

(2)∥∥∥x(k) − x

∥∥∥2=

√∑ni=1(x(k)

i − xi)2 → 0 as k → ∞;

(3)∥∥∥x(k) − x

∥∥∥1=∑n

i=1

∣∣∣x(k)i − xi

∣∣∣→ 0 as k → ∞.

Since any x = (x1, . . . , xn) ∈ Rn can be regarded as a function from{1, 2, . . . , n} to R, assigning xi to each i = 1, . . . , n, the analogy betweenthe above norms defined on Rn and the L2-norm and L1-norm consideredin Chapter 4 is apparent. We now define convergence in these norms forsequences of random variables on a probability space (Ω,F , P).

Definition 5.9

We say that Xn converges to X in L2-norm and write XnL2

→ X if

||Xn − X||22 = E((Xn − X)2

)→ 0

as n→ ∞.

Page 167: 0521175577_1107002494ProbabilityFina

5.2 Modes of convergence for random variables 157

Remark 5.10In Section 5.1 the sum of a series

∑∞i=1 aiEi with ai ∈ E and Ei ∈ L2(P)

for i = 1, 2, . . . was defined as a random variable Y ∈ L2(P) such that (5.4)holds. It terms of convergence in L2-norm this simply means that

n∑i=1

aiEiL2

→∞∑

i=1

aiEi.

Definition 5.11

We say that Xn converges to X in L1-norm and write XnL1

→ X if

||Xn − X||1 = E (|Xn − X|)→ 0

as n→ ∞.

Recall that for any Z ∈ L2(P) the Schwarz inequality (Lemma 3.49)implies [E(|1Z|)]2 ≤ E(Z2) since E(1) = 1. Hence we have the followinginequality between the two norms:

‖Z‖1 ≤ ‖Z‖2 . (5.8)

This means that if ‖Xn − X‖2 → 0, then also ‖Xn − X‖1 → 0 as n → ∞, soconvergence in L2-norm implies convergence in L1-norm. The converse isfalse in general, as the next example shows.

Example 5.12

For each n = 1, 2, . . . let Xn = n1An , where P(An) = 1n√

n. Then Xn

L1

→ 0

since E(|Xn|) = nP(An) = 1√n→ 0. But Xn does not converge to 0 in L2-

norm since E(X2n) = n2P(An) =

√n→ ∞ as n→ ∞.

In further contrast to the situation in Rn, convergence in either of thesenorms is not the same as ‘coordinatewise’ convergence. For a sequence ofrandom variables Xn the natural analogue of coordinatewise convergenceto a random variable X is pointwise convergence, where X = limn→∞ Xn

means that, for every ω ∈ Ω, the real numbers Xn(ω) converge to the realnumber X(ω). Recall, however, that random variables are identified if theyare equal to one another P-a.s. This leads to the following definition.

Definition 5.13We say that Xn converges to X almost surely and write Xn

a.s.→ X if thereis an A ∈ F with P(A) = 0 such that Xn(ω) → X(ω) as n → ∞ for allω ∈ Ω \ A.

Page 168: 0521175577_1107002494ProbabilityFina

158 Sequences of random variables

Example 5.14Let Ω = [0, 1] with Borel sets and Lebesgue measure. The sequenceXn(ω) = ωn converges to X(ω) = 0 for all ω ∈ [0, 1), but not for ω = 1.Since the singleton {1} has Lebesgue measure 0, it follows that Xn

a.s.→ 0.

One obvious question is whether convergence in L1-norm or L2-normimplies convergence almost surely. The following counterexample showsthat we cannot always expect this.

Example 5.15On Ω = [0, 1] with Borel sets and Lebesgue measure, construct X1 = 1[0,1],

X2 = 1[0, 12 ), X3 = 1[ 12 ,1), X4 = 1[0, 14 ), X5 = 1[ 1

4 ,12 ), X6 = 1[ 1

2 ,34 ), X7 = 1[ 3

4 ,1),

X8 = 1[0, 18 ), and so on. We have XnL1

→ 0 and XnL2

→ 0 because ‖Xn‖1 ≤‖Xn‖2 → 0 since the lengths of the intervals tend to 0. However, for eachω ∈ [0, 1) there are infinitely many n such that Xn(ω) = 1, so Xn(ω) doesnot converge to 0 for any ω in [0, 1).

The next example shows that convergence almost surely does not neces-sarily imply convergence in either norm.

Example 5.16A sequence converging almost surely but failing to converge in L1-norm isbuilt on [0, 1] with Borel sets and Lebesgue measure by setting Xn(ω) =n1[0, 1n ](ω). Clearly Xn(ω)→ 0 for each ω ∈ (0, 1] since then Xn(ω) = 0 if n

is large enough. Hence Xna.s.→ 0. On the other hand, E(Xn) = 1 for all n, so

Xn fails to converge to 0 in L1-norm and therefore also in L2-norm.

We can, however, make progress by imposing additional conditions. Forexample, the dominated convergence theorem (Theorem 1.43) can now bestated as follows.

Theorem 5.17If there is a random variable Y ∈ L1(P) such that |Xn| ≤ Y for all n, and

Xna.s.→ X, then X ∈ L1(P) and Xn

L1

→ X.

Page 169: 0521175577_1107002494ProbabilityFina

5.2 Modes of convergence for random variables 159

We can characterise convergence almost surely by considering the set ofall ω ∈ Ω where, for infinitely many n, the values Xn(ω) and X(ω) differ bymore than some given ε > 0. To make the notion that some phenomenonoccurs infinitely often more precise, observe that Exercise 1.33 suggestsan analogue for events of the lim inf of a sequence of real numbers. Thismotivates the following terminology.

Definition 5.18Given a sequence A1, A2, . . . ∈ F in some σ-field F , define

lim infn→∞ An =

∞⋃n=1

∞⋂m=n

Am,

lim supn→∞

An =

∞⋂n=1

∞⋃m=n

Am.

Note that lim infn→∞ An and lim supn→∞ An both belong to F .

We say that ω is in An infinitely often if ω ∈ lim supn→∞ An. For anysuch ω there are infinitely many n such that ω ∈ An. Similarly, we say thatω is in An eventually if ω ∈ lim infn→∞ An. For any such ω we can find aninteger k such that ω ∈ An for all n ≥ k.

Our main interest is in lim supn→∞ An. The next exercise applies the sec-ond Fatou lemma (Lemma 1.41 (ii)) to a sequence of indicator functions ofsets (compare with Exercise 1.33).

Exercise 5.5 Show that for any sequence of events A1, A2, . . . ∈ FP(lim sup

n→∞An) ≥ lim sup

n→∞P(An).

Applying de Morgan’s laws twice, we obtain

Ω \(lim sup

n→∞An

)= Ω \

⎛⎜⎜⎜⎜⎜⎝ ∞⋂n=1

∞⋃m=n

Am

⎞⎟⎟⎟⎟⎟⎠=

∞⋃n=1

∞⋂m=n

(Ω \ Am) = lim infn→∞

(Ω \ An) . (5.9)

The following characterisation of convergence almost surely is a simpleapplication of the definitions and (5.9).

Page 170: 0521175577_1107002494ProbabilityFina

160 Sequences of random variables

Proposition 5.19Given an ε > 0 and random variables X1, X2, . . . and X, write An,ε =

{|Xn − X| > ε} for each n = 1, 2, . . . . Then the following conditions areequivalent:

(i) Xna.s.→ X;

(ii) P(lim supn→∞ An,ε

)= 0 for every ε > 0.

Proof Suppose (i) holds. Write Yn = |Xn − X| . For each ω ∈ Ω the state-ment Yn(ω) → 0 means that for every ε > 0 we can find n = n(ε, ω)such that |Yk(ω)| ≤ ε for every k ≥ n. If Yn

a.s.→ 0, then such n can befound for each ω from a set of probability 1. Hence for any fixed ε > 0we have P

(⋃∞n=1

⋂∞k=n Bk,ε

)= 1, where Bk,ε = {|Yk| ≤ ε}. By definition,⋃∞

n=1

⋂∞k=n Bk,ε = lim infn→∞ Bn,ε. But Bk,ε = Ω \ Ak,ε, so by (5.9) we have

Ω \ (lim supn→∞ An,ε) = lim infn→∞ Bn,ε. Since P(lim infn→∞ Bn,ε) = 1, wehave P(lim supn→∞ An,ε) = 0. As ε > 0 was arbitrary, (ii) is proved.

That (ii) implies (i) follows immediately by reversing the above steps.�

In many applications, convergence almost surely of a given sequence ofrandom variables is difficult to verify. However, observe that for fixed ε > 0and with Ak,ε as in Proposition 5.19, the sets Cn,ε =

⋃∞k=n Ak,ε decrease as n

increases. So if Xna.s.→ X, then for all ε > 0,

limn→∞ P

⎛⎜⎜⎜⎜⎜⎝ ∞⋃k=n

Ak,ε

⎞⎟⎟⎟⎟⎟⎠ = limn→∞ P(Cn,ε) = P

⎛⎜⎜⎜⎜⎜⎝ ∞⋂n=1

Cn,ε

⎞⎟⎟⎟⎟⎟⎠ = P

(lim sup

n→∞An,ε

)= 0.

Replacing⋃∞

k=n Ak,ε by the smaller set An,ε provides the following weakermode of convergence, which is often easier to verify in practice.

Definition 5.20We say that a sequence of random variables X1, X2, . . . converges to X in

probability, and write XnP→ X, if for each ε > 0

P(|Xn − X| > ε)→ 0 as n→ ∞.

Proposition 5.21

If Xna.s.→ X, then Xn

P→ X.

Proof It is evident that convergence in probability is weaker than conver-gence almost surely since An,ε ⊂ ⋃∞k=n Ak,ε. �

Page 171: 0521175577_1107002494ProbabilityFina

5.2 Modes of convergence for random variables 161

Example 5.15 shows that it is strictly weaker because the Xn satisfy

P(|Xn| > 0) → 0 (hence XnP→ 0), but they fail to converge to 0 almost

surely.Comparison of convergence in probability with convergence in L2-norm

or L1-norm is established in the next proposition.

Proposition 5.22

If XnL1

→ X or XnL2

→ X, then XnP→ X.

Proof Write Yn = |Xn−X|. If E(Yn)→ 0, then εP(Yn ≥ ε) ≤ E(Yn)→ 0 as

n→ ∞ for each fixed ε > 0. So convergence in L1-norm implies that XnP→

X. Because convergence in L2-norm implies convergence in L1-norm, ittherefore also implies convergence in probability. �

The converse of Proposition 5.22 is false, in general. The sequence ofrandom variables defined in Example 5.16 converges to 0 in probabilitysince P(Xn > 0) = 1

n , but we know that it does not converge to 0 in L1-norm or in L2-norm.

Borel–Cantelli lemmas

The following provides a simple method of checking when lim supn→∞ An

is a set of P measure 0.

Lemma 5.23 (first Borel–Cantelli lemma)If∑∞

n=1 P(An) < ∞, then

P

(lim sup

n→∞An

)= 0.

Proof First note that lim supn→∞ An ⊂ ⋃∞n=k An, hence for all k

P

(lim sup

n→∞An

)≤ P

⎛⎜⎜⎜⎜⎜⎝ ∞⋃n=k

An

⎞⎟⎟⎟⎟⎟⎠ .By subadditivity we have P

(⋃∞n=k An

) ≤ ∑∞n=k P(An) → 0 as k → ∞ since∑∞

n=1 P(An) < ∞. This completes the proof. �

Our first application of this lemma gives a partial converse of Proposi-tion 5.21.

Theorem 5.24If Xn

P→ X, then there is a subsequence Xkn such that Xkn

a.s.→ X.

Page 172: 0521175577_1107002494ProbabilityFina

162 Sequences of random variables

Proof We build a sequence An of sets encapsulating the undesirable be-haviour of Xn, which from the point of view of convergence occurs when|Xn − X| > a for some real a. First take a = 1. Since convergence in proba-bility is given, P (|Xn − X| > 1)→ 0 provides k1 such that for all n ≥ k1

P (|Xn − X| > 1) ≤ 1.

Next for a = 12 we find k2 > k1 such that for all n ≥ k2

P

(|Xn − X| > 1

2

)≤ 1

4.

We continue this process, obtaining an increasing sequence of integers kn

such that

P

(∣∣∣Xkn − X∣∣∣ > 1

n

)≤ 1

n2.

Now put

An =

{∣∣∣Xkn − X∣∣∣ > 1

n

}.

The series∑∞

n=1 P(An) converges, being dominated by∑∞

n=11n2 < ∞. So the

first Borel–Cantelli lemma yields that A = lim supn→∞ An has probability

zero. By Proposition 5.19 this means that Xkn

a.s.→ X almost surely, since forany given ε > 0 we can always find n > 1

ε. �

It is natural to ask what the counterpart to the first Borel–Cantelli lemmashould be when the series

∑∞n=1 P(An) diverges. The result we now derive

lies a little deeper, and requires the An to be independent events, whereasthe first Borel–Cantelli lemma holds for any sequence of events, withoutrestriction. Nonetheless, the two results together give us a typical 0–1 law,which says that, for a sequence of independent random variables. the prob-ability of ‘tail events’ (those that involve infinitely many events in the se-quence) is either 0 or 1, but never inbetween.

Lemma 5.25 (second Borel–Cantelli lemma)If A1, A2, . . . is a sequence of independent events and

∑∞n=1 P(An) = ∞, then

P(lim supn→∞ An

)= 1.

Proof To prove that P(⋂∞

k=1

⋃∞n=k An

)= 1 note that the events

⋃∞n=k An

decrease as k increases, hence

limk→∞

P

⎛⎜⎜⎜⎜⎜⎝ ∞⋃n=k

An

⎞⎟⎟⎟⎟⎟⎠ = P

⎛⎜⎜⎜⎜⎜⎝ ∞⋂k=1

∞⋃n=k

An

⎞⎟⎟⎟⎟⎟⎠ .

Page 173: 0521175577_1107002494ProbabilityFina

5.2 Modes of convergence for random variables 163

Thus it is sufficient to show that P(⋃∞

n=k An)= 1 for each k = 1, 2, . . . .

Now consider⋂m

n=k(Ω \ An) for a fixed m > k. By de Morgan’s laws wehave Ω \ (

⋃mn=k An) =

⋂mn=k(Ω \ An). The events Ω \ A1,Ω \ A2, . . . are also

independent, so for k = 1, 2, . . .

P

⎛⎜⎜⎜⎜⎜⎝ m⋂n=k

(Ω \ An)

⎞⎟⎟⎟⎟⎟⎠ = m∏n=k

P(Ω \ An) =m∏

n=k

[1 − P(An)].

For any x ≥ 0 we know that 1 − x ≤ e−x (consider the derivative of e−x +

x − 1), so thatm∏

n=k

[1 − P(An)] ≤m∏

n=k

e−P(An) = e−∑m

n=k P(An).

Now recall that we assume that the series∑∞

n=1 P(An) diverges. Hence forany fixed k the partial sums

∑mn=k P(An) diverge to ∞ as m → ∞. Thus as

m→ ∞ the right-hand side of the inequality becomes arbitrarily small.This proves that

1 − P

⎛⎜⎜⎜⎜⎜⎝ m⋃n=k

An

⎞⎟⎟⎟⎟⎟⎠ = P

⎛⎜⎜⎜⎜⎜⎝ m⋂n=k

(Ω \ An)

⎞⎟⎟⎟⎟⎟⎠→ 0 as m→ ∞.

Finally, write Bm =⋃m

n=k An, which is an increasing sequence and its unionis⋃∞

m=1 Bm =⋃∞

n=k An. Hence

P

⎛⎜⎜⎜⎜⎜⎝ ∞⋃n=k

An

⎞⎟⎟⎟⎟⎟⎠ = limm→∞ P(Bm) = 1.

Example 5.26The independence requirement limits applications of the second Borel–Cantelli lemma, but it cannot be dropped. Consider A ∈ F with P(A) ∈(0, 1), and let An = A for all n = 1, 2, . . . . Then

∞∑n=1

P(An) = ∞, but

P(lim supn→∞ An

)= P(A) < 1.

Uniform integrability

We have shown that convergence in probability is strictly the weakest of thefour modes of convergence we have studied. To study situations where the

Page 174: 0521175577_1107002494ProbabilityFina

164 Sequences of random variables

implications can be reversed we consider sequences of random variableswith an additional property. The next exercise motivates this property.

Exercise 5.6 Let X be a random variable defined on a probabilityspace (Ω,F , P). Prove that X ∈ L1(P) if and only if for any givenε > 0 we can find a K > 0 such that

∫{|X|>K} |X| dP < ε.

We extend this condition from single random variables to families ofrandom variables in L1(P).

Definition 5.27S ⊂ L1(P) is a uniformly integrable family of random variables if forevery ε > 0 there is a K > 0 such that

∫{|X|>K} |X| dP < ε for each X ∈ S.

A uniformly integrable family S of random variables is bounded in L1-norm since, taking ε = 1 in the definition, we can find K > 0 such that forall X ∈ S

‖X‖1 =∫{|X|≤K}

|X| dP +∫{|X|>K}

|X| dP ≤ K + 1.

The sequence Xn = n1[0, 1n ] discussed in Example 5.16 is not uniformly inte-

grable. For any K > 0 and n > K we have∫{|Xn |>K} |Xn| dP = nP([0, 1

n ]) = 1.On the other hand, ||Xn||1 = 1 for all n, so the sequence X1, X2, . . . isbounded in L1-norm. This shows that boundedness in the L1-norm doesnot imply uniform integrability of a family of random variables. However,the stronger condition of boundedness in L2-norm is sufficient.

Proposition 5.28If a family S of random variables is bounded in L2-norm, then it is uni-formly integrable.

Proof Given y ≥ K > 0, we have y ≤ y2

K . Use this with y = |X(ω)| forevery ω such that |X(ω)| > K. Then∫

{|X|>K}|X| dP <

1K

∫{|X|>K}

|X|2 dP.

If there is C > 0 such that ||X||2 ≤ C for all X ∈ S, we have∫{|X|>K} |X|2 dP ≤

||X||22 ≤ C2 for all X ∈ S, so the right-hand side above can be made smallerthan any given ε > 0 by taking K > C2

ε. �

Page 175: 0521175577_1107002494ProbabilityFina

5.2 Modes of convergence for random variables 165

The next exercise exhibits a uniformly integrable sequence in L1(P)(compare this with the dominated convergence theorem).

Exercise 5.7 Show that if X1, X2, . . . is a sequence of random vari-ables dominated by an integrable random variable Y > 0 (that is,|Xn| ≤ Y , P-a.s. for all n), then the sequence is uniformly integrable.

A particularly useful uniformly integrable family in L1(P) is the follow-ing.

Example 5.29Suppose that X ∈ L1(P) and that

S = {E(X | G) : G ⊂ F is a σ-field}.By Jensen’s inequality with φ(x) = |x|, we have |E(X | G)| ≤ E(|X| | G), so

KP(|E(X | G)| > K) ≤∫{|E(X | G)|>K}

|E(X | G)| dP

≤∫{|E(X | G)|>K}

E(|X| | G) dP

=

∫{|E(X | G)|>K}

|X| dP ≤ ‖X‖1since {|E(X|G)| > K} ∈ G. It follows that

P(|E(X | G)| > K) ≤ ‖X‖1K.

Moreover, to prove that S is uniformly integrable we only need to showthat for every ε > 0 there is a K > 0 such that

∫{|E(X | G)|>K} |X| dP < ε for

each σ-field G ⊂ F . Suppose that this is not the case. Then there wouldexist an ε > 0 such that for each n = 1, 2, . . . one could find a σ-fieldGn ⊂ F such that

∫An|X| dP > ε, where An = {|E(X | Gn)| > 2n}. We know

that P(An) ≤ 2−n ‖X‖1, so∑∞

n=1 P(An) < ∞. By the first Borel–Cantellilemma, P

(⋂∞n=1

⋃∞m=n An

)= 0. As a result, by the dominated convergence

theorem,

0 < ε <∫

An

|X| dP ≤∫Ω

|X| 1⋃∞m=n An dP→

∫Ω

|X| 1⋂∞n=1

⋃∞m=n An dP = 0,

which is a contradiction.

Page 176: 0521175577_1107002494ProbabilityFina

166 Sequences of random variables

In particular, this shows that for an integrable random variable X anda sequence of σ-fields F1,F2, . . . ⊂ F , the sequence Xn = E(X | Fn) isuniformly integrable.

For uniformly integrable sequences, convergence in probability impliesconvergence in L1-norm.

Theorem 5.30Let Y1, Y2, . . . be a uniformly integrable sequence such that Yn

P→ 0. Then‖Yn‖1 → 0 as n→ ∞.Proof As the sequence is uniformly integrable, given ε > 0 we can findK > ε3 such that

∫{|Yn |>K} |Yn| dP < ε3 . Also, limn→∞ P(|Yn| > α) = 0 for every

α > 0 since YnP→ 0. So we can find N > 0 with P(|Yn| > ε3 ) < ε

3K if n ≥ N.For any n = 1, 2, . . . put

An = {|Yn| > K}, Bn = {K ≥ |Yn| > ε3 }, Cn = {|Yn| ≤ ε3 }.Then

‖Yn‖1 =∫

An

|Yn| dP +∫

Bn

|Yn| dP +∫

Cn

|Yn| dP

≤ ε3+ KP

(|Yn| > ε3

)+ε

3P(|Yn| ≤ ε3

)≤ ε

for each n ≥ N. Hence ‖Yn‖1 → 0 as n→ ∞. �

We note an immediate consequence.

Corollary 5.31

If X1, X2, . . . is a uniformly integrable sequence and Xna.s.→ X, then Xn

L1

→ X.

Example 5.32Suppose that X is square-integrable andF1 ⊂ F2 ⊂ · · · ⊂ F is an increasingsequence of σ-fields contained in F . Suppose further that Xn

a.s.→ X, where

Xn = E(X | Fn) for each n = 1, 2, . . . . Then XnL1

→ X.To see this, apply Jensen’s inequality with φ(x) = x2, which shows that

Page 177: 0521175577_1107002494ProbabilityFina

5.3 Sequences of i.i.d. random variables 167

for each n = 1, 2, . . .

X2n = (E(X|Fn))2 ≤ E(X2|Fn), P-a.s.,

so that

E(X2n) ≤ E(E(X2|Fn)) = E(X2).

Hence ||Xn||2 ≤ ||X||2 for all n = 1, 2, . . . , so that the sequence X1, X2, . . . isbounded in L2-norm, hence it is uniformly integrable by Proposition 5.28.Corollary 5.31 now proves our claim.

5.3 Sequences of i.i.d. random variables

The limit behaviour of independent sequences is of particular interest whenall the random variables in the sequence share the same distribution.

Definition 5.33A sequence X1, X2, . . . of random variables on a probability space (Ω,F , P)is identically distributed if FXn (x) = FX1 (x) (that is, P(Xn ≤ x) = P(X1 ≤x)) for all n = 1, 2, . . . and all x ∈ R. If, in addition, the random variables areindependent, we call it a sequence of independent identically distributed(i.i.d.) random variables.

Consider the arithmetic averages 1n

∑ni=1 Xi for a sequence of i.i.d. ran-

dom variables as n tends to ∞. Since the Xi share the same distribution,their expectations are the same, as are their variances. Convergence of theseaverages in L2-norm, and hence in probability, follows from the basic prop-erties of expectation and variance.

Theorem 5.34 (weak law of large numbers)Let X1, X2, . . . be a sequence of i.i.d. random variables with finite expecta-

tion m and variance σ2. Then 1n

∑ni=1 Xi

L2

→ m, and hence 1n

∑ni=1 Xi

P→ m asn→ ∞.

Proof Let S n = X1 + · · · + Xn for each n = 1, 2, . . . . First note thatE(S n) = nm for each n = 1, 2, . . . by the linearity of expectation. HenceE(

S n

n

)= m and, by the properties of variance for the sum of independent

Page 178: 0521175577_1107002494ProbabilityFina

168 Sequences of random variables

random variables,

E

((S n

n− m

)2)= Var

(S n

n

)=

1n2

Var(S n)

=1n2

n∑i=1

Var(Xi) =1n2

n∑i=1

σ2 =1nσ2 → 0

as n → ∞. This means that 1n S n

L2

→ m. Convergence in probability followsfrom Proposition 5.22. �

The law of large numbers provides a mathematical statement of our intu-ition that the average value over a large number of independent realizationsof a random variable X is likely to be close to its expectation E(X).

Remark 5.35The weak law of large numbers can be strengthened considerably. Ac-cording to Kolmogorov’s strong law of large numbers, for a sequenceX1, X2, . . . of i.i.d. random variables the averages 1

n

∑ni=1 Xi converge almost

surely to m if the Xn are integrable. We shall not prove this here,1 but willfocus instead on the Central Limit Theorem.

Constructing an i.i.d. sequence with given distribution

Our aim is to construct i.i.d. sequences of random variables with a givendistribution. In applications of probability theory one often pays little atten-tion to the probability space on which such random variables are defined.The main interest is in the distribution of the random variables. In part, thisalso applies in financial modelling, but here the knowledge of a particularrealisation of the sample space may in fact be useful for computer simula-tions. From this point of view the choice of Ω = [0, 1] with Borel sets andLebesgue measure plays a special role since random sampling in this spaceis provided in many standard computer packages.

The simplest case, developed with binomial tree applications in mind, isa sequence of i.i.d. random variables Xn, each taking just two values, 1 or0 with equal probabilities. The good news is such a such a sequence canbe built with Ω = [0, 1] as the domain, so that Xn : [0, 1] → R for eachn = 1, 2, . . . . To this end, for each n = 1, 2, . . . and for each ω ∈ [0, 1] we

1 For details see M. Capinski and E. Kopp, Measure, Integral and Probability, 2ndedition, Springer-Verlag 2004.

Page 179: 0521175577_1107002494ProbabilityFina

5.3 Sequences of i.i.d. random variables 169

put

Xn(ω) =

⎧⎪⎨⎪⎩ 1 if ω ∈[0, 1

2n

)∪[

22n ,

32n

)∪ · · · ∪

[2n−2

2n ,2n−1

2n

),

0 otherwise.

It is routine to check that these random variables are independent and havethe desired distribution.

In the construction of the Wiener process in [SCF] we need a sequenceof i.i.d. random variables uniformly distributed on [0, 1] and defined onΩ = [0, 1] with Borel sets and Lebesgue measure. Such a sequence can beobtained as follows.

(i) Set up an infinite matrix of independent random variables Xi j on[0, 1] so that m(Xi j = 0) = m(Xi j = 1) = 1

2 by relabelling the se-quence Xn constructed above in the following manner:

X11 X12 X13 X14 · · ·X21 X22 X23 · · ·X31 X32 · · ·X41 · · ·· · ·

=

X1 X2 X6 X7 ↓X3 X5 ↙ ↗ ↙X4 ↙ ↗ .

↓ ↗ .

↗ .

(ii) Define

Zi =

∞∑j=1

Xi j

2 j

for each i = 1, 2, . . . . The series is convergent for each ω since

0 ≤∞∑j=1

Xi j

2 j≤∞∑j=1

12 j= 1.

It turns out that Zi is uniformly distributed on [0, 1], that is, FZi (x) = xfor each x ∈ [0, 1]. Indeed, for any n the sequence Xi1, . . . , Xin is equal toa specific n-element sequence of 0s and 1s with probability 1

2n , so the sum∑nj=1

Xi j

2 j is equal to k2n with probability 1

2n for each k = 0, 1, . . . , 2n−1. Givenany x ∈ [0, 1], there are [2nx] + 1 numbers of the form k

2n in the interval[0, x] (here [a] denotes the integer part of a), so

m

⎛⎜⎜⎜⎜⎜⎜⎝ n∑j=1

Xi j

2 j≤ x

⎞⎟⎟⎟⎟⎟⎟⎠ = [2nx] + 12n

→ x as n→ ∞.

Since An ={∑n

j=1Xi j

2 j ≤ x}

is a decreasing sequence of sets with⋂∞

n=1 An =

Page 180: 0521175577_1107002494ProbabilityFina

170 Sequences of random variables

{Zi ≤ x}, we therefore have

FZi (x) = m (Zi ≤ x) = limn→∞m(An) = x.

Moreover, the Zi are independent. We verify that Z1, Z2 are independent.Once this is done, routine induction will extend this to the finite collectionZ1, Z2, . . . , ZN for any fixed N, which is all that we need. Note that Zn1 =∑n

j=1X1 j

2 j and Zn2 =∑n

j=1X2 j

2 j are independent because so are the randomvariables X11, . . . , X1n, X21, . . . , X2m. This implies that for any A1, A2 ∈ B(R)

m (Zn1 ∈ A1, Zn2 ∈ A2) = m (Zn1 ∈ A1) m (Zn2 ∈ A2) .

We can write the measure of each of these sets as an integral of the indicatorfunction of that set. Now since Zn1 → Z1 and Zn2 → Z2 almost surely, wehave 1{Zn1∈A1} → 1{Z1∈A1} and 1{Zn2∈A2} → 1{Z2∈A2} almost surely as n → ∞. Itfollows by dominated convergence (all indicator functions being boundedby 1) that

m (Z1 ∈ A1, Z2 ∈ A2) = limn→∞m (Zn1 ∈ A1, Zn2 ∈ A2)

= limn→∞m (Zn1 ∈ A1) m (Zn2 ∈ A2)

= m (Z1 ∈ A1) m (Z2 ∈ A2)

for any A1, A2 ∈ B(R), proving that Z1, Z2 are independent.

Remark 5.36It is possible to associate a sequence of i.i.d. random variables Z1, Z2, . . .

defined on [0, 1] with an i.i.d. sequence Y1, Y2, . . . defined on the spaceΩ = [0, 1]N consisting of all functions ω : N → [0, 1] so that the Yn havethe same distribution as the Zn. To this end we consider Z = (Z1, Z2, . . .) asa function Z : [0, 1] → [0, 1]N, and equip [0, 1]N with the σ-field F con-sisting of all sets A ⊂ [0, 1]N such that {Z ∈ A} ∈ B(R) and with probabilitymeasure P : F → [0, 1] such that P(A) = m(Z ∈ A). Then Yn(ω) = ω(n),mapping each ω ∈ [0, 1]N into ω(n), defines a sequence of i.i.d. randomvariables on [0, 1]N such that PYn = PZn for each n ∈ N.

5.4 Convergence in distribution

Let X1, X2, . . . and X be random variables. We introduce a further notion ofconvergence concerned with their distributions.

Page 181: 0521175577_1107002494ProbabilityFina

5.4 Convergence in distribution 171

Definition 5.37A sequence of random variables X1, X2, . . . is said to converge in distribu-tion (or in law) to a random variable X, written Xn =⇒ X, if

limn→∞ FXn (x) = FX(x)

at every continuity point x of FX , that is, at every x ∈ R such that

limy→x

FX(y) = FX(x).

Let us compare this mode of convergence with those developed ear-lier. In fact, convergence in distribution is the weakest (that is, easiest toachieve) convergence notion we have so far encountered.

Theorem 5.38If Xn

P→ X, then Xn =⇒ X.

Proof For any ε > 0

{X ≤ x − ε} ⊂ {Xn ≤ x} ∪ {|Xn − X| > ε} ,{Xn ≤ x} ⊂ {X ≤ x + ε} ∪ {|Xn − X| > ε} ,

so

FX(x − ε) ≤ FXn (x) + P (|Xn − X| > ε) ,FXn (x) ≤ FX(x + ε) + P (|Xn − X| > ε) .

If XnP→ X, then P (|Xn − X| > ε)→ 0 as n→ ∞. It follows that

FX(x − ε) ≤ lim infn→∞ FXn (x) ≤ lim sup

n→∞FXn (x) ≤ FX(x + ε).

Suppose that x ∈ R is a continuity point of FX . Then FX(x − ε) → FX(x)and FX(x + ε)→ FX(x) as ε↘ 0. As a result,

limn→∞ FXn (x) = FX(x),

proving that Xn =⇒ X. �

The converse of Theorem 5.38 is not true. In fact, convergence in prob-ability does not even make sense if Xn and X are defined on different prob-ability spaces, which is possible since the definition of convergence in dis-tribution makes no direct reference to the underlying probability space.

Page 182: 0521175577_1107002494ProbabilityFina

172 Sequences of random variables

Example 5.39Although limn→∞ P (|Xn − X| > ε) = 0 in general makes no sense unless Xn

and X are defined on the same probability space, we can arrive at a converseof Theorem 5.38 in a very special (indeed, trivial) case. Suppose that X isconstant, that is, X(ω) = c for all ω ∈ Ω. Its distribution function is

Fc(x) =

{0 for x < c,1 for x ≥ c.

Now we can show that if Xn =⇒ c and all the Xn are defined on the same

probability space, then XnP→ c.

To see this, fix ε > 0. We have

P (|Xn − c| > ε) ≤ P(Xn ≤ c − ε) + P(Xn > c + ε)

= FXn (c − ε) + 1 − FXn (c + ε)

→ Fc(c − ε) + 1 − Fc(c + ε) = 0

as n→ ∞.

Exercise 5.8 Suppose that |Xn − Yn| P→ 0 and Yn =⇒ Y . Show thatXn =⇒ Y .

Exercise 5.9 Show that if Xn =⇒ X, then −Xn =⇒ −X.

Exercise 5.10 Show that if Xn =⇒ X and YnP→ c, where c is a

constant, then Xn + Yn =⇒ X + c.

Since convergence in distribution is the weakest notion of convergencethat we have defined, we may hope for convergence theorems that tell usmuch more about the distribution of the limit random variable than hith-erto. The most important limit theorem in probability theory, the CentralLimit Theorem (CLT), does this for sequences of independent random vari-ables and highlights the importance of the normal distribution. This is veryfortunate, since for normally distributed random variables we have a very

Page 183: 0521175577_1107002494ProbabilityFina

5.4 Convergence in distribution 173

simple test for independence: they are independent if and only if they areuncorrelated, see Exercise 3.38.

Example 5.40We will be concerned solely with the CLT for i.i.d. sequences, althoughmuch more general results are known. The classical example of conver-gence in distribution describes how the distributions of a sequence of bino-mial random variables, suitably normalised, will approximate the standardnormal distribution.

We phrase this in terms of tossing a fair coin arbitrarily many times.After n tosses there are 2n possible outcomes, consisting of all possible n-tuples of H and T, where H stands ‘heads’ and T for ‘tails’. We denotethe set of all such outcomes by Ωn. We assume that at each toss H andT are equally likely and that successive tosses are independent. By thiswe mean that the random variables X1, . . . , Xn defined on n-tuples ω =(ω1, ω2, . . . , ωn) in Ωn by setting

Xi(ω) =

{1 if ωi = H,0 if ωi = T,

for i = 1, 2, . . . , n

are independent. Let Pn denote the counting measure on all subsets of Ωn,

that is, Pn(A) = |A|2n , where |A| denotes the number of n-tuples belonging toA ⊂ Ωn. The sum S n =

∑ni=1 Xi (which counts the number of ‘heads’ in

n tosses) has the binomial distribution with parameters n and p = 12 ; see

Example 2.2.We have E(Xi) = 1

2 and Var(Xi) = 14 for all i = 1, 2, . . . , n, which implies

that the proportion of ‘heads’ 1n S n has expectation 1

2 and variance 14n . The

weak law of large numbers (see Theorem 5.34) implies that 1n S n converges

to 12 in probability, i.e. for each ε > 0

limn→∞ Pn

(∣∣∣∣∣S n

n− 1

2

∣∣∣∣∣ ≤ ε)= 1.

In other words, given ε > 0, the fraction of n-tuples for which the propor-tion of ‘heads’ in n tosses of a fair coin differs from 1

2 by at most ε increaseswith n, reaching 1 in the limit as n → ∞. This supports our belief that forlarge n, a sequence of n tosses of a fair coin will, in most instances, yieldapproximately n

2 ‘heads’.However, this leaves open the question of the limiting distribution of the

number of ‘heads’. The answer will be given by the simplest (and oldest)

Page 184: 0521175577_1107002494ProbabilityFina

174 Sequences of random variables

form of the CLT, known as the de Moivre–Laplace theorem, see Corol-lary 5.53 below.

5.5 Characteristic functions and inversion formula

We revisit characteristic functions, which were introduced in Section 2.4,as these provide the key to finding limit distributions. We begin with aresult showing that the distribution of a random variable is determined byits characteristic function.

Theorem 5.41 (inversion formula)If the distribution function FX of a random variable X is continuous ata, b ∈ R, then

FX(b) − FX(a) = limT→∞

12π

∫[−T,T ]

e−ita − e−itb

itφX(t) dm(t). (5.10)

Random variables X and Y have the same distribution if and only if theyhave the same characteristic function.

Proof For any a ≤ b

12π

∫[−T,T ]

e−ita − e−itb

itφX(t) dt

=1

∫[−T,T ]

e−ita − e−itb

it

(∫R

eitx dPX(x)

)dm(t).

Since ∣∣∣∣∣∣e−ita − eitb

iteitx

∣∣∣∣∣∣ =∣∣∣∣∣∣∫ b

aeitxdx

∣∣∣∣∣∣ ≤ b − a

is integrable over R × [−T, T ] with respect to the product measure PX ⊗m,Fubini’s theorem gives

12π

∫[−T,T ]

e−ita − e−itb

itφX(t) dt =

12π

∫R

(∫[−T,T ]

e−ita − e−itb

iteitx dm(t)

)dPX(x)

=1

∫R

I(x, T ) dPX(x),

Page 185: 0521175577_1107002494ProbabilityFina

5.5 Characteristic functions and inversion formula 175

where

I(t, x) =1

∫[−T,T ]

e−ita − e−itb

iteitx dm(t)

=1

∫ T

−T

sin t(x − a) − sin t(x − b)t

dt

+1

∫ T

−T

cos t(x − a) − cos t(x − b)it

dt.

The last integral is equal to 0 because the integrand is an odd function.Substituting y = t(x − a) and z = t(x − b), we obtain

I(x, T ) =1

∫ T (x−a)

−T (x−a)

sin yy

dy − 12π

∫ T (x−b)

−T (x−b)

sin zz

dz.

It is shown in Exercise 5.11 below that∫ s

r

sin yy

dy→ π

as s→ ∞ and r → −∞. Thus,

limT→∞ I(x, T ) =

{0 if x < a or x > b,1 if a < x < b.

By dominated convergence, see Exercise 1.36, we have

limT→∞

12π

∫[−T,T ]

e−ita − eitb

itφX(t) dm(t) = lim

T→∞

∫R

I(x, T ) dPX(x)

=

∫R

1(a,b)(x) dPX(x)

= PX((a, b))

= FX(b) − FX(a),

if a, b are continuity points of FX , so that PX({a}) = PX({b}) = 0.Finally, we show that (5.10) implies uniqueness. Suppose that X and Y

have the same characteristic function, φX = φY . If a, b ∈ R are continuitypoints of FX , we have by (5.10)

FX(b) − FX(a) = FY(b) − FY(a).

It follows that the continuity points of FX and of FY coincide, and by lettinga → −∞ we obtain FX(b) = FY(b) at all continuity points b of FX and FY .By right-continuity and the fact that the points where continuity fails forman at most countable set, we obtain FX = FY . �

Page 186: 0521175577_1107002494ProbabilityFina

176 Sequences of random variables

Exercise 5.11 Show that

limT→∞

∫ T

0

sin xx

dx =π

2.

Exercise 5.12 Show that if∫R|φX(t)|dt < ∞, then X has a density

given by

fX(x) =1

∫R

e−itxφX(t) dm(t).

Exercise 5.13 Suppose that X is an integer-valued random variable.Show that for each integer n

P({X = n}) = 12π

∫ 2π

0e−itnφX(t)dt.

5.6 Limit theorems for weak convergence

It is useful to consider convergence in distribution in terms of probabil-ity measures defined on the σ-field B(R) of Borel sets. We called suchmeasures probability distributions in Definition 2.1. A probability distri-bution P uniquely determines a distribution function F : R→[0, 1] byF(x) = P((−∞, x]) for each x ∈ R. Conversely, if two distribution func-tions agree, then the corresponding probability measures agree on the col-lection of all intervals of the form (−∞, x], where x ∈ R, and this col-lection is closed under intersection and generates the σ-field B(R), so byLemma 3.58 these measures agree on B(R). Thus there is a one-to-one cor-respondence between distribution functions and probability distributions.

Definition 5.42Given probability measures Pn and P defined on B(R), we say that Pn con-verge weakly to P and write Pn =⇒ P if

limn→∞ Pn((−∞, x]) = P((−∞, x])

Page 187: 0521175577_1107002494ProbabilityFina

5.6 Limit theorems for weak convergence 177

for each x ∈ R such that P({x}) = 0.

Observe that if Pn and P are the probability distributions of some randomvariables Xn and X, then Pn =⇒ P is equivalent to Xn =⇒ X. That is, in thiscase weak convergence is the same as convergence in distribution.

Theorem 5.43 (Skorohod representation)Suppose that Pn and P are probability measures defined on B(R) such thatPn =⇒ P. Then there are random variables Xn and X on the probabilityspace Ω = (0, 1) (with Borel sets and Lebesgue measure) such that PXn =

Pn, PX = P and limn→∞ Xn(ω) = X(ω) for each ω ∈ (0, 1).

Proof Let F(x) = P((−∞, x]) and Fn(x) = Pn((−∞, x]) for each x ∈ R.We put

Y(ω) = inf{x ∈ R : ω ≤ F(x)},Yn(ω) = inf{x ∈ R : ω ≤ Fn(x)}

for each ω ∈ (0, 1). It follows that

m({ω ∈ (0, 1) : Y(ω) ≤ x}) = m({ω ∈ (0, 1) : ω ≤ F(x)}) = F(x),

so F is the distribution function of Y . Moreover, Fn is the distribution func-tion of Yn by a similar argument.

Now take any ω ∈ (0, 1) and any ε > 0, η > 0. Let x, y be continuitypoints of F such that

Y(ω) − ε < x < Y(ω) < y < Y(ω + η) + ε.

Then

F(x) < ω < ω + η ≤ F(y).

Since limn→∞ Fn(x) = F(x) and limn→∞ Fn(y) = F(y), we have Fn(x) <ω < Fn(y), so

Y(ω) − ε < x < Yn(ω) ≤ y < Y(ω + η) + ε

for any sufficiently large n. It follows that limn→∞ Yn(ω) = Y(ω) wheneverY is continuous at ω.

We put Xn(ω) = Yn(ω) and X(ω) = Y(ω) at any continuity point ωof Y and Xn(ω) = X(ω) = 0 at any discontinuity point ω of Y . Thenlimn→∞ Xn(ω) = X(ω) for every ω ∈ (0, 1). The distributions of Xn and Xare the same as those of Yn and Y , respectively, since the Xs differ fromthe corresponding Ys only on the set of discontinuity points of the non-decreasing function Y , which is at most countable and hence of Lebesguemeasure 0. �

Page 188: 0521175577_1107002494ProbabilityFina

178 Sequences of random variables

Corollary 5.44If PXn converges weakly to PX, then the characteristic functions of Xn and Xsatisfy limn→∞ φXn (t) = φX(t) for each t.

Proof Take the Skorohod representation Yn, Y of the measures PXn , PX .Pointwise convergence of Yn to Y implies that φYn (t) = E(eitYn ) → φY (t) =E(eitY ) as n→ ∞ by the dominated convergence theorem. But the distribu-tions of Xn, X are the same as those of Yn, Y , so the characteristic functionsare the same. �

The following result has many varied applications in analysis and prob-ability.

Theorem 5.45 (Helly selection)Let F1, F2, . . . be a sequence of distribution functions of probability mea-sures. Then there exists a subsequence Fn1 , Fn2 , . . . and a non-decreasingright-continuous function F such that limk→∞ Fnk (x) = F(x) at each conti-nuity point x of F.

Proof Let q1, q2, . . . be a sequence consisting of all rational numbers.Because the distribution functions have values in [0, 1], there is a subse-quence n1

1, n12, . . . of the sequence 1, 2, . . . such that the limit

limk→∞ Fn1k(q1) = G(q1) exists. Moreover, there is a subsequence n2

1, n22, . . .

of the sequence n11, n

12, . . . such that limk→∞ Fn2

k(q2) = G(q2) exists, and so

on. Taking nk = nkk for k = 1, 2, . . . , we then have

limk→∞

Fnk (qi) = G(qi)

for every i = 1, 2, . . . . The functions G and

F(x) = inf{G(q) : x < q, q ∈ Q}are non-decreasing since so are the Fn. For each x ∈ R and ε > 0 thereis a q ∈ Q such that x < q and G(q) < F(x) + ε. If x ≤ y < q, thenF(y) ≤ G(q) < F(x) + ε. Hence F is right-continuous.

If F is continuous at x, we take y < x such that F(x) − ε < F(y). Wealso take q, r ∈ Q such that y < q < x < r and G(r) < F(x) + ε. SinceFn(q) ≤ Fn(x) ≤ Fn(r), it follows that

F(x) − ε < F(y) ≤ G(q) = limk→∞

Fnk (q) ≤ lim infk→∞

Fnk (x)

≤ lim supk→∞

Fnk (x) ≤ limk→∞

Fnk (r) = G(r) < F(x) + ε.

Because this holds for any ε > 0, we can conclude that limk→∞ Fnk (x) =F(x). �

Page 189: 0521175577_1107002494ProbabilityFina

5.6 Limit theorems for weak convergence 179

Example 5.46The F in Helly’s theorem does not need to be a distribution function. Forinstance, if Fn = 1[n,∞), then limn→∞ Fn(x) = 0 for each x ∈ R.

In view of this example, we introduce a condition which ensures that aprobability distribution is obtained in the limit.

Definition 5.47A sequence of probability measures P1, P2, . . . defined on B(R) is said tobe tight if for each ε > 0 there exists a finite interval [−a, a] such thatPn([−a, a]) > 1 − ε for all n = 1, 2, . . . .

Example 5.48If Pn = δn is the unit mass at n = 1, 2, . . . , then P1, P2, . . . is not a tightsequence.

Theorem 5.49 (Prokhorov)If a sequence P1, P2, . . . of probability measures on B(R) is tight, then ithas a subsequence converging weakly to a probability measure P on B(R).

Proof Let Fn(x) = Pn((−∞, x]) for each x ∈ R. By Helly’s theorem,there is a subsequence Fnk converging to a non-decreasing right-continuousfunction F at each continuity point of F.

We claim that limy→∞ F(y) = 1. Take any ε > 0. Tightness ensures thatthere is an a > 0 such that P([−a, a]) > 1−ε. Then for any continuity pointy of F such that y > a we have

Fn(y) = Pn((−∞, y]) > 1 − ε for all n = 1, 2, . . . .

Hence, F(y) = limk→∞ Fnk (y) ≥ 1 − ε. Because 1 ≥ F(y) > 1 − ε for eachε > 0, this proves that limy→∞ F(y) = 1. It follows that F is a distributionfunction, and the corresponding probability measure P on B(R) satisfiesPn =⇒ P. �

Page 190: 0521175577_1107002494ProbabilityFina

180 Sequences of random variables

5.7 Central Limit Theorem

Characteristic functions provide a powerful means of studying the distribu-tions of sums of independent random variables. Because of the followingimportant theorem, characteristic functions can be used to study limit dis-tributions.

Theorem 5.50 (continuity theorem)Let X1, X2, . . . and X be random variables such that φXn (t)→ φX(t) for eacht ∈ R. Then PXn =⇒ PX.

Proof First we show that the sequence PX1 , PX2 , . . . is tight. For any a > 0

PXn ([−2/a, 2/a]) = 1 − PXn ({x ∈ R : |x| > 2/a})≥ 1 − 2

∫{x∈R:|x|>2/a}

(1 − 1

a |x|)

dPXn (x)

≥ 1 − 2∫R

(1 − sin (ax)

ax

)dPXn (x)

= 2∫R

sin (ax)ax

dPXn (x) − 1.

Using Fubini’s theorem, we get∫R

sin (ax)ax

dPXn (x) =1

2a

∫R

(∫[−a,a]

eitx dm(t)

)dPXn (x)

=1

2a

∫[−a,a]

(∫R

eitx dPXn (x)

)dm(t)

=1

2a

∫[−a,a]φXn (t) dm(t).

Since φX is continuous at 0 and φX(0) = 1, for any ε > 0 there is an a > 0such that ∣∣∣∣∣∣ 1

2a

∫[−a,a]φX(t) dm(t) − 1

∣∣∣∣∣∣ ≤ ε.Furthermore, since φXn (t) converges to φX(t) for each t, the dominated con-vergence theorem (Theorem 1.43) implies that there exists an integer Nsuch that ∣∣∣∣∣∣ 1

2a

∫[−a,a]φXn (t) dm(t) − 1

∣∣∣∣∣∣ ≤ 2ε

Page 191: 0521175577_1107002494ProbabilityFina

5.7 Central Limit Theorem 181

for all n ≥ N. It follows that there is an a > 0 such that

PXn ([−2/a, 2/a]) ≥ 1a

∫[−a,a]φXn (t) dm(t) − 1 ≥ 1 − 4ε

for each n ≥ N. We can ensure by taking a smaller a that this inequalityholds for each n, which proves that the sequence PX1 , PX2 , . . . is tight.

Now suppose that PXn does not converge weakly to PX . It means thatFXn (x) does not converge to FX(x) at some continuity point x ∈ R of FX .It follows that there exist an η > 0 and a subsequence n1, n2, . . . of thesequence 1, 2, . . . such that∣∣∣FXnk

(x) − FX(x)∣∣∣ > η for all k = 1, 2, . . . . (5.11)

The subsequence PXn1, PXn2

, . . . is tight because PX1 , PX2 , . . . is tight. Ac-cording to Prokhorov’s theorem, there is a subsequence m1,m2, . . . of thesequence n1, n2, . . . such that PXmk

converges weakly to the probability dis-tribution PY of some random variable Y . By Corollary 5.44, φXmk

(t) →φY(t). On the other hand, φXmk

(t) → φX(t) for each t ∈ R, so this impliesφY = φX . By Theorem 5.41, PX and PY must coincide. This shows thatPXmk

=⇒ PX , contradicting (5.11). �

We conclude with a famous version of the Central Limit Theorem (CLT).We will concentrate on i.i.d. sequences, rather than seek to find the mostgeneral results. First we need some elementary inequalities.

Lemma 5.51The following inequalities hold.

(i) For any complex numbers z,w such that |z| ≤ 1 and |w| ≤ 1 and forany n = 1, 2, . . .

|zn − wn| ≤ n |z − w| .(ii) For any x ∈ R ∣∣∣eix − 1 − ix

∣∣∣ ≤ 12|x|2 .

(iii) For any x ∈ R∣∣∣∣∣∣eix − 1 − ix − (ix)2

2

∣∣∣∣∣∣ ≤ min

(|x|2 , 1

6|x|3

).

Proof (i) Since

zn − wn =(zn−1 + zn−2w + · · · + zwn−2 + wn−1

)(z − w) ,

Page 192: 0521175577_1107002494ProbabilityFina

182 Sequences of random variables

it follows that

|zn − wn| ≤(|z|n−1 + |z|n−2 |w| + · · · + |z| |w|n−2 + |w|n−1

)|z − w|

≤ n |z − w| .(ii) We have

eix − 1 − ix =∫ x

0(s − x)eisds,

Estimating the integral gives∣∣∣eix − 1 − ix∣∣∣ = ∣∣∣∣∣

∫ x

0(s − x)eisds

∣∣∣∣∣ ≤ 12|x|2 .

(iii) We have

eix − 1 − ix − (ix)2

2=

12i

∫ x

0(s − x)2eisds.

Estimating the integral gives∣∣∣∣∣∣eix − 1 − ix − (ix)2

2

∣∣∣∣∣∣ =∣∣∣∣∣ 12i

∫ x

0(s − x)2eisds

∣∣∣∣∣ ≤ 16|x|3 .

Moreover, from (ii)∣∣∣∣∣∣eix − 1 − ix − (ix)2

2

∣∣∣∣∣∣ ≤ ∣∣∣eix − 1 − ix∣∣∣ + ∣∣∣∣∣∣ (ix)2

2

∣∣∣∣∣∣ ≤ 12

∣∣∣x2∣∣∣ + 1

2

∣∣∣x2∣∣∣ = |x|2 ,

which completes the proof. �

Take a sequence of i.i.d. random variables X1, X2, . . . with finite meanm = E(X1) and variance σ2 = Var(X1). Let S n = X1 + · · · + Xn and write

Tn =S n − mn

σ√

n.

All Tn have expectation 0 and variance 1.

Theorem 5.52 (Central Limit Theorem)Let Xn be independent identically distributed random variables with finiteexpectation and variance. Then Tn =⇒ T, where T has the standard nor-mal distribution N(0, 1).

Proof Replacing Xk by Xk−mσ

shows that there is no loss of generality inassuming m = E(Xk) = 0 and σ2 = Var(Xk) = 1. Let φ denote the character-istic function of Xk (the same for each k = 1, 2, . . .). By Lemma 5.51 (iii),


for any t ∈ R
\[
\begin{aligned}
\left|\varphi(t) - \Big(1 - \frac{t^2}{2}\Big)\right|
&= \left|E\Big(e^{itX_1} - 1 - itX_1 - \frac{(itX_1)^2}{2}\Big)\right| \\
&\le E\left(\Big|e^{itX_1} - 1 - itX_1 - \frac{(itX_1)^2}{2}\Big|\right) \\
&\le E\left(|tX_1|^2\,\mathbf{1}_{\{|X_1|^3 > |t|^{-1/2}\}}\right) + \frac{1}{6}\,E\left(|tX_1|^3\,\mathbf{1}_{\{|X_1|^3 \le |t|^{-1/2}\}}\right) \\
&\le t^2 E\left(|X_1|^2\,\mathbf{1}_{\{|X_1|^3 > |t|^{-1/2}\}}\right) + \frac{1}{6}\,|t|^{5/2}.
\end{aligned}
\tag{5.12}
\]

Moreover, using Lemma 5.51 (ii) with x = it²/2, we find that for any t ∈ R
\[
\left|e^{-\frac{t^2}{2}} - \Big(1 - \frac{t^2}{2}\Big)\right|
= \left|e^{i\frac{it^2}{2}} - 1 - i\,\frac{it^2}{2}\right|
\le \frac{1}{2}\left|\frac{it^2}{2}\right|^2
= \frac{|t|^4}{8}.
\tag{5.13}
\]

Since
\[
\varphi_{T_n}(t) = \varphi^n\Big(\frac{t}{\sqrt{n}}\Big),
\]

by Lemma 5.51 (i),
\[
\begin{aligned}
\left|\varphi_{T_n}(t) - e^{-\frac{t^2}{2}}\right|
&= \left|\varphi^n\Big(\frac{t}{\sqrt{n}}\Big) - \Big(e^{-\frac{t^2}{2n}}\Big)^{n}\right| \\
&\le n\left|\varphi\Big(\frac{t}{\sqrt{n}}\Big) - e^{-\frac{t^2}{2n}}\right| \\
&\le n\left|\varphi\Big(\frac{t}{\sqrt{n}}\Big) - \Big(1 - \frac{t^2}{2n}\Big)\right| + n\left|e^{-\frac{t^2}{2n}} - \Big(1 - \frac{t^2}{2n}\Big)\right|,
\end{aligned}
\]
and from (5.12), (5.13) we get

\[
\begin{aligned}
n\left|\varphi\Big(\frac{t}{\sqrt{n}}\Big) - \Big(1 - \frac{t^2}{2n}\Big)\right|
&\le n\,\frac{t^2}{n}\,E\left(|X_1|^2\,\mathbf{1}_{\{|X_1|^3 > |t|^{-1/2}n^{1/4}\}}\right) + \frac{n}{6}\Big(\frac{|t|}{\sqrt{n}}\Big)^{5/2} \\
&= t^2 E\left(|X_1|^2\,\mathbf{1}_{\{|X_1|^3 > |t|^{-1/2}n^{1/4}\}}\right) + \frac{1}{6}\,\frac{|t|^{5/2}}{n^{1/4}}
\;\to\; 0 \quad \text{as } n \to \infty
\end{aligned}
\]
(the first term tends to 0 by the dominated convergence theorem, since E(|X_1|²) < ∞ and the indicators converge to 0 pointwise as n → ∞), and
\[
n\left|e^{-\frac{t^2}{2n}} - \Big(1 - \frac{t^2}{2n}\Big)\right| \le \frac{n}{8}\Big(\frac{t}{\sqrt{n}}\Big)^{4} = \frac{t^4}{8n} \to 0 \quad \text{as } n \to \infty.
\]

This shows that lim_{n→∞} φ_{T_n}(t) = e^{−t²/2} for each t ∈ R. By the continuity theorem this means that T_n ⟹ T, where T has the standard normal distribution N(0, 1). □
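To see the theorem at work numerically, here is a minimal Monte Carlo sketch in Python (the uniform distribution on [0, 1], the sample sizes and the number of trials are arbitrary illustrative choices):

```python
import random
from math import sqrt

def simulate_Tn(n, trials=50_000):
    """Sample T_n = (S_n - m*n)/(sigma*sqrt(n)) for i.i.d. X_i uniform on [0, 1]."""
    m, sigma = 0.5, sqrt(1 / 12)          # mean and standard deviation of U(0, 1)
    samples = []
    for _ in range(trials):
        s_n = sum(random.random() for _ in range(n))
        samples.append((s_n - m * n) / (sigma * sqrt(n)))
    return samples

# The empirical value of P(T_n <= 1) should approach Phi(1) = 0.8413... as n grows.
for n in (1, 5, 30):
    t = simulate_Tn(n)
    print(n, sum(1 for x in t if x <= 1.0) / len(t))
```

Already for moderate n the empirical frequencies settle near the standard normal value, in line with T_n ⟹ T.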

The following result, which justifies our claims for the limiting behaviour of binomial distributions in Example 1.23, can be deduced from the Central Limit Theorem.

Corollary 5.53 (de Moivre–Laplace theorem)
Let X_1, X_2, ... be i.i.d. random variables with P(X_n = 1) = P(X_n = 0) = 1/2 for each n = 1, 2, .... Then for each a < b
\[
P\left(a < \frac{S_n - n/2}{\sqrt{n}/2} < b\right) \to \frac{1}{\sqrt{2\pi}}\int_a^b e^{-\frac{1}{2}x^2}\,dx \quad \text{as } n \to \infty.
\]

Proof  Note that the expectation and variance of X_n are
\[
E(X_n) = \frac{1}{2}, \qquad \operatorname{Var}(X_n) = \frac{1}{4},
\]
and apply the Central Limit Theorem, observing that
\[
T_n = \frac{S_n - n/2}{\sqrt{n}/2}.
\]
□

Exercise 5.14  Use the de Moivre–Laplace theorem to estimate the probability that the number of ‘heads’ obtained in n = 10 000 tosses of a fair coin lies in (a, b) = (4900, 5100).
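For comparison, a minimal sketch of the normal-approximation calculation: with n = 10 000 the standardized endpoints are (4900 − 5000)/50 = −2 and (5100 − 5000)/50 = 2, so the de Moivre–Laplace theorem gives approximately Φ(2) − Φ(−2).

```python
from math import erf, sqrt

def std_normal_cdf(x):
    """Distribution function of N(0, 1)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

n, a, b = 10_000, 4_900, 5_100
mean, sd = n / 2, sqrt(n) / 2                     # E(S_n) = n/2, sd(S_n) = sqrt(n)/2

# de Moivre-Laplace: P(a < S_n < b) is approximately Phi((b-mean)/sd) - Phi((a-mean)/sd)
approx = std_normal_cdf((b - mean) / sd) - std_normal_cdf((a - mean) / sd)
print(f"approximate probability: {approx:.4f}")   # about 0.9545
```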

Example 5.54
In Example 1.6 stock prices were modelled by n = 20 equally likely additive up/down jumps of 0.50 from an initial price 10. This gives the price after n such jumps as
\[
Y_n = 10 + 0.5\sum_{i=1}^{n}(2X_i - 1) = 10 + S_n - \frac{n}{2} = 10 + \frac{\sqrt{n}}{2}\,T_n,
\]
where X_1, X_2, ... are i.i.d. random variables with P(X_n = 1) = P(X_n = 0) = 1/2.

Since E(X_n) = 1/2 and Var(X_n) = 1/4, by the CLT we have T_n ⟹ T, where T has the standard normal distribution N(0, 1). This means that the distribution of Y_n can be approximated by the normal distribution N(μ, σ²) with μ = 10 and σ = √n/2 = 2.236 when n = 20, as in Example 1.23.
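Since Y_20 = 10 + S_20 − 10 = S_20 has the binomial distribution with 20 trials and success probability 1/2, the quality of this approximation can be checked directly. A minimal sketch (the interval 8 ≤ Y_20 ≤ 12 and the continuity correction of 1/2 are illustrative choices):

```python
from math import comb, erf, sqrt

def std_normal_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

n, lo, hi = 20, 8, 12
mu, sigma = 10.0, sqrt(n) / 2                     # parameters of the approximating normal

# Exact: Y_20 = S_20 takes the value k with probability C(20, k)/2^20.
exact = sum(comb(n, k) for k in range(lo, hi + 1)) / 2 ** n

# Normal approximation with a continuity correction of 1/2.
approx = std_normal_cdf((hi + 0.5 - mu) / sigma) - std_normal_cdf((lo - 0.5 - mu) / sigma)

print(f"exact: {exact:.4f}, normal approximation: {approx:.4f}")
```

Both values come out close to 0.74, so the approximation is already quite accurate for n = 20.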

Example 5.55
Consider a sequence of i.i.d. random variables K_1, K_2, ... with distribution
\[
P(K_n = u) = P(K_n = d) = \frac{1}{2} \quad \text{for } n = 1, 2, \dots,
\]
where −1 < d < u. In a binomial model with multiplicative jumps the stock prices at time step n are given by
\[
S(n) = S(0)(1 + K_1) \times \dots \times (1 + K_n)
\]
with S(0) > 0 being the initial stock price (the spot price); see Example 1.7, where
\[
u = 0.05, \quad d = -0.05, \quad S(0) = 10, \quad n = 20. \tag{5.14}
\]
We want to understand the limiting distribution of S(n) for large n. To this end, take
\[
X_n = \ln(1 + K_n)
\]
for n = 1, 2, ..., which form an i.i.d. sequence of random variables. This gives
\[
S(n) = S(0)\,e^{\sum_{i=1}^{n} X_i}.
\]
Suppose that
\[
E(X_n) = \frac{m}{n}, \qquad \operatorname{Var}(X_n) = \frac{\sigma^2}{n}
\]
for some parameters m and σ > 0 (which can be expressed in terms of u, d). Then, according to the CLT, for
\[
T_n = \frac{\sum_{i=1}^{n} X_i - m}{\sigma}
\]
we have T_n ⟹ T, where T has the standard normal distribution N(0, 1). As a result, ∑_{i=1}^{n} X_i ⟹ X, where X has the normal distribution N(m, σ²) with mean m and variance σ², and this implies that
\[
S(n) \Longrightarrow S,
\]
where ln S has the normal distribution N(μ, σ²) with μ = ln S(0) + m.

In other words, the distribution of S is log-normal with parameters μ, σ; see Example 1.24. The numerical values μ = 2.2776 and σ = 0.2238 in that example have been computed from u, d, S(0), n in (5.14).
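These parameters can be recovered directly from the data in (5.14); a minimal sketch of the computation:

```python
from math import log, sqrt

u, d, S0, n = 0.05, -0.05, 10.0, 20      # the data of (5.14)

# X = ln(1 + K) takes the values ln(1 + u) and ln(1 + d), each with probability 1/2.
x_up, x_down = log(1 + u), log(1 + d)
mean_step = (x_up + x_down) / 2
var_step = ((x_up - mean_step) ** 2 + (x_down - mean_step) ** 2) / 2

m = n * mean_step                        # mean of the sum of the X_i
sigma = sqrt(n * var_step)               # standard deviation of the sum
mu = log(S0) + m                         # location parameter of the log-normal limit

print(f"mu = {mu:.4f}, sigma = {sigma:.4f}")   # approximately 2.2776 and 0.2238
```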


Index

additivity, 2
antiderivative, 15
atom, 110
binomial tree model, 56
contingent claim, 46
convergence
    almost surely, 157
    in distribution, 171
    in L1-norm, 157
    in L2-norm, 156
    in probability, 160
    weak, 176
convolution, 86
correlation coefficient, 96
countable additivity, 6, 7
covariance, 96
covariance matrix, 98
d-system, 99
density
    bivariate normal, 75, 85
    conditional, 127
    Gaussian, 16
    joint, 75, 82
    log-normal, 17
    marginal, 80
    multivariate normal, 83
    normal, 16
    of a random variable, 50
    probability, 42
derivative security, 46
distribution
    binomial, 40
    continuous, 42
    discrete, 41
    exponential, 43
    function, 40, 49
    geometric, 50
    joint, 73, 82
    log-normal, 43, 52
    marginal, 74, 82
    negative binomial, 50
    normal, 42, 50
    of a random variable, 48
    Poisson, 41
    probability, 40
    standard normal, 51
event, 6
    independence of, 87, 89
expectation, 57
    conditional, 112, 113, 120
Fourier coefficients, 150
Fourier representation, 150
function
    Borel measurable, 28
    characteristic, 63
    convex, 128
    Haar, 151
    indicator, 18
    integrable, 22, 24
    Lebesgue integrable, 28
    measurable, 20
    norm continuous, 133
    simple, 18
inequality
    Bessel, 149
    Chebyshev, 62
    Jensen, 129
    Markov, 61
    Schwarz, 97
inner product, 131
integral, 21, 24
    Lebesgue, 28
    Riemann, 14
inverse image, 19
L1-norm, 138
L2-norm, 132
law of large numbers, 167
lemma
    Borel–Cantelli, first, 161
    Borel–Cantelli, second, 162
    Fatou, 30
measure, 10
    absolutely continuous, 140
    change of, 26
    counting, 11
    Dirac, 9
    finite, 10
    Lebesgue, on R, 12
    Lebesgue, on R^2, 72
    Lebesgue, on R^n, 73
    outer, 33
    probability, 7
    product, 71
    restriction of, 71
    σ-finite, 71
    space, 10
    tight sequence of, 179
    unit mass, 9
moment, 62
    central, 62
moment generating function, 65
Monte Carlo simulation, 53
nearest point, 136
option
    arithmetic Asian call, 56
    bottom straddle, 48
    bull spread, 48
    butterfly spread, 48
    call, 47
    European, 46
    path-dependent, 56
    put, 47
    strangle, 48
orthogonal projection, 137
orthonormal basis, 150
parallelogram law, 135
Parseval identity, 151
partition, 109
probability
    binomial, 3, 9
    conditional, 88
    density, 42
    distribution, 40
    equivalent, 142
    geometric, 6
    measure, 7
    Poisson, 5
    space, 6
    uniform, 3, 13
probability generating function, 65
Radon–Nikodym derivative, 142
random variable, 47
    continuous, 49
    discrete, 49
    i.i.d., 167
    independence of, 84, 86
    jointly continuous, 75, 82
    orthogonal, 135
    square-integrable, 97
    uncorrelated, 96
    uniformly integrable family of, 164
random vector, 73, 81
    Gaussian, 83
    independence of, 91
set
    Borel, 11, 68, 81
    Cantor, 13
    closed, 134
    complete orthonormal, 150
    orthonormal, 148
σ-field, 6
    generated by a family of sets, 69
    generated by a random variable, 47
    independence of, 90, 92
    product of, 68
space
    L1, 138
    L2, 131
    measure, 10
    probability, 6
    sample, 2
standard deviation, 60
theorem
    Central Limit Theorem (CLT), 182
    continuity, 180
    de Moivre–Laplace, 184
    dominated convergence, 31
    Fubini, 77
    Helly selection, 178
    monotone convergence, 22
    Prokhorov, 179
    Pythagoras, 135
    Radon–Nikodym, 140, 145
    Skorohod representation, 177
tower property, 118
translation invariance, 13, 29
variance, 60