
1. Introduction - READ THIS FIRST

These documents are the lecture notes that I write for myself before class, supplemented with any particularly relevant points that come up during discussions over the course of the term. They are not intended to replace the textbook, they often do not offer complete proofs, and they are not carefully edited. Finally, the schedule of future lectures should be viewed as provisional - it will almost certainly change over the course of the term.

All of this is to say: I appreciate corrections, suggestions and other comments on these notes, but please do not use them as your primary guide to the course material.


2. Lecture 1: January 8

2.1. Course Overview.

• Textbook: We will continue to use Probability and Measure (3rd Edition) by Billingsley as our primary book. When we study martingales, I plan to refer heavily to Williams' book Probability with Martingales, which I find easier to digest. The book is a small, cheap softcover, and of course seems to be available online. As always, it can be helpful to refer to several books. I list a bunch of others on the course website.

• Office Hours: Since this class is small, it is not clear to me that "regular" office hours are useful. We will discuss related arrangements in the first class.

• Homework: There will be five short homework sets, the first two of which are currently posted. Due dates are on my website (I don't repeat them here for fear of making a typo). Let me know if any of these are inconvenient, as there is some flexibility now (but there will be much less during the term).

• Midterm: There will be an in-class midterm on February 28. It will cover Sections 22 and 25-27 of Billingsley, as well as Sections 9.1-9.8 of Williams' book.

• Material: The core content is Sections 22, 25-27, and 32-35 of Billingsley. Time permitting, I also plan to cover Section 30 and some of the alternative proofs in Williams' book on martingales; I also hope to cover some important results on martingales that are not covered in Billingsley (this may include concentration results for martingales, the relationship between martingales and complex/harmonic analysis, and applications of martingales to combinatorics/branching processes/statistics/etc.). If we go really fast, we will consider an advanced topic related to the course (provisionally, this will be Stein's method for proving CLTs for dependent random variables).

• The Plan: We will go through all of the above material. I'll generally write down at least a detailed sketch of the main proofs, though I will occasionally relegate some proofs to home reading if I think the details are not interesting. Having said that, I think it is often quite difficult to digest a long proof "in real time" during a lecture, especially if it is very "slick." As such, I will try to do one of the following for the major proofs:

– If I find a long proof has a few easily-isolated "big ideas," I'll try to use them individually before the main proof. Ideally this will result in weaker versions of the main result, though occasionally I might not be able to find them. As an example, I'll do that for the strong law of large numbers today and tomorrow.

– Some long proofs are harder to break down. In that case, I'll try to present a bare-bones proof as quickly as possible, then focus on applications. As an example, Theorem 26.2 of Billingsley is a very important result that we will use quite often. However, the proof of Theorem 26.2 (including the sequence of very explicit calculations that it requires) is quite lengthy, and the ideas in the proof are not used heavily in the rest of the course. Thus I will probably run through the proof quite quickly and try to focus on applications.

In most of the course, I will stick pretty closely to Billingsley (including the main examples). However, I am always happy to try to add extra worked examples in class, and personally often find them more helpful. Please let me know if you find a problem you like that you think we should go over, or if there is a concept that you feel would be easier to understand with some applications. I'll try to let everybody know that we will discuss this, and add it to the next lecture.

2.2. Review. Some things I don’t review in class but will use:

(1) Basic measure theory (σ-algebras, completions, definition of integrals, etc., etc.), including the "standard machines" from elementary measure theory: approximating events and functions by simple events and functions, and using monotone sets.

(2) Independence, construction of random variables, main inequalities (Markov, Chebyshev, Jensen, etc.).

(3) Borel-Cantelli lemmas.

In class, we review (or say for the first time) the following topics:

(1) Definition of L^p spaces. Hölder's inequalities.

(2) Some basic types of convergence. Relationships between them, and "counterexamples to nonrelationships" (e.g. an example showing that convergence in probability or L^p convergence does not imply almost-sure convergence).

I don't include all of this in the notes, as it is all in Billingsley or referred to in the sections of Williams that we will cover. I do encourage you to take notes on this material, especially the material on L^p spaces.
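For reference, here is the standard counterexample behind the "counterexamples to nonrelationships" point above (a sketch in my own wording, not taken from the notes): on Ω = [0, 1] with Lebesgue measure, list the dyadic intervals [0,1], [0,1/2], [1/2,1], [0,1/4], [1/4,1/2], . . . as I_1, I_2, I_3, . . . and set X_n = 1_{I_n}. Then E[|X_n|^p] = |I_n| → 0, so X_n → 0 in probability and in every L^p; but every ω ∈ [0, 1] satisfies X_n(ω) = 1 for infinitely many n, so X_n does not converge almost surely.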

2.3. Failing to Prove the Strong Law of Large Numbers. The strong law of large numbers, and more generally sums of random variables, is the first major topic of the course. Recall:

Theorem 1 (Strong Law of Large Numbers). Let X_1, X_2, . . . be an i.i.d. sequence with E[|X_1|] < ∞. Let S_n = (1/n) ∑_{i=1}^n X_i. Then

P[ lim_{n→∞} S_n = E[X_1] ] = 1.

The obvious approach is to check that |S_n − E[X_1]| > ε has extremely small probability for all large n, then use the Borel-Cantelli lemma. Let's do that in an easy special case:

Lemma 2.1 (SLLN for Bounded Random Variables). In the same setting as above, assume there exist −∞ < a < b < ∞ so that X_1 ∈ [a, b]. Then

P[ lim_{n→∞} S_n = E[X_1] ] = 1.

Proof. Fix ε > 0. By Hoeffding's inequality,

P[ |S_n − E[X_1]| > ε ] ≤ 2 exp( −2nε² / (b−a)² ).

Thus,

∑_{n∈N} P[ |S_n − E[X_1]| > ε ] ≤ ∑_{n∈N} 2 exp( −2nε² / (b−a)² ) < ∞.

By the Borel-Cantelli lemma, this implies

P[ |S_n − E[X_1]| > ε infinitely often ] = 0.

Thus,

P[ E[X_1] − ε ≤ liminf_{n→∞} S_n ≤ limsup_{n→∞} S_n ≤ E[X_1] + ε ] = 1.

Since ε > 0 is arbitrary, the result follows.

We would like to extend this to more general random variables, but Hoeffding's inequality only applies for X_1 bounded. There is a standard way to fix this sort of technical difficulty: the truncation trick. Informally, the idea is to consider a sequence of random variables Y_1, Y_2, . . . of interest and a sequence of "rare bad events" A_1, A_2, . . .. We then define

Z_i = Y_i if A_i^c holds, and Z_i = 0 otherwise.

As long as ∑_i P[A_i] < ∞, we have

P[ Z_i ≠ Y_i infinitely often ] = 0

by Borel-Cantelli. In this case, the "obvious" thing to do is truncate at a level for which X_1, X_2, . . . never get "too big." One version of this result is:

Lemma 2.2 (SLLN with Truncation). As above, but this time assume only that there exist 0 < C_1 < ∞ and 4 < C_2 < ∞ so that

P[ |X_1| > α ] ≤ C_1 α^{−C_2}

for all α sufficiently large. Then

P[ lim_{n→∞} S_n = E[X_1] ] = 1.

Proof. Define δ = 3/4 − 1/C_2 and define A_n = { max_{1≤i≤n} X_i ≥ n^{1−δ} }. By the union bound,

∑_{n∈N} P[A_n] ≤ ∑_{n∈N} n P[ X_1 > n^{1−δ} ]                    (2.1)
             ≤ C_0 + ∑_{n∈N} n · C_1 n^{−C_2(1−δ)}
             ≤ C_0 + C_1 ∑_{n∈N} n^{−C_2/4} < ∞,

where C_0 < ∞ is a constant and the last line follows from the assumption C_2 > 4. Next, set

S′_n = S_n if A_n^c holds, and S′_n = 0 otherwise.

It is straightforward to check (for example, using the "consistency of Cesàro summation" lemma given in the Appendix of Billingsley) that

lim_{n→∞} E[S′_n] = E[X_1].

Defining γ_n = |E[S′_n] − E[X_1]| and applying the same Hoeffding bound as in the previous proof,

P[ |S′_n − E[X_1]| > ε + γ_n ] ≤ 2 e^{−ε² n^{2δ−1}},

so

∑_{n∈N} P[ |S′_n − E[X_1]| > ε + γ_n ] < ∞,


so by Borel-Cantelli

P[ lim_{n→∞} S′_n = E[X_1] ] = 1.

By (2.1),

P[ S_n ≠ S′_n i.o. ] = 0.

Combining these last two equations completes the proof.
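The following is a minimal numerical sketch (not part of the original notes) illustrating Lemma 2.2: it draws i.i.d. variables with a polynomial tail of exponent 5 (so the condition C_2 > 4 holds), truncates at the level n^{1−δ} used in the proof, and checks that the running averages of the original and truncated sequences both settle near E[X_1] = 0. The symmetric Pareto-type distribution and all constants below are illustrative choices, not quantities from the notes.

    import numpy as np

    rng = np.random.default_rng(0)

    # Symmetric heavy-tailed variable: random sign times a Lomax/Pareto-II draw
    # with tail exponent C2 = 5, so P[|X| > a] ~ a^(-5) and E[X] = 0 by symmetry.
    C2 = 5.0
    delta = 3.0 / 4.0 - 1.0 / C2          # the delta used in the proof of Lemma 2.2

    n_max = 200_000
    signs = rng.choice([-1.0, 1.0], size=n_max)
    x = signs * rng.pareto(C2, size=n_max)

    # Running averages of the original sequence.
    running_mean = np.cumsum(x) / np.arange(1, n_max + 1)

    # Truncated sequence: zero out X_i on the "bad event" |X_i| >= n^(1 - delta),
    # using the final n for simplicity (a crude stand-in for the proof's A_n).
    threshold = n_max ** (1.0 - delta)
    x_trunc = np.where(np.abs(x) < threshold, x, 0.0)
    running_mean_trunc = np.cumsum(x_trunc) / np.arange(1, n_max + 1)

    for n in (1_000, 10_000, 100_000, 200_000):
        print(n, running_mean[n - 1], running_mean_trunc[n - 1])
    # Both columns should drift toward E[X_1] = 0 as n grows.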

This is pretty good! We've proved the strong law of large numbers under the very weak tail assumption

P[ |X_1| > α ] ≤ C_1 α^{−C_2}

for some C_2 > 4. It is straightforward to check that this is equivalent to the assumption

E[ |X_1|^{4+δ} ] < ∞

for some δ > 0. So, we get quite a lot, but not everything. Reading through the proof, it should be clear that we can "do better" using a similar approach. This leads to the natural question: can we prove the whole SLLN using the truncation method? On the one hand, it is certainly true that the above theorem is not optimal. One can get a better SLLN by being more careful with the truncation method (homework problem: improve the proof!). On the other hand, we should notice that we actually prove a little more than we intend when using the truncation method:

Lemma 2.3 (Triangular SLLN via Truncation). For each n ∈ N, let {X_{ni}}_{i∈{1,2,...,n}} be i.i.d. Assume there exist 0 < C_1 < ∞ and 4 < C_2 < ∞ so that, for each n,

P[ |X_{n1}| > α ] ≤ C_1 α^{−C_2}

for all α sufficiently large. Define

S_n = (1/n) ∑_{j=1}^n X_{nj}.

Then

P[ lim_{n→∞} S_n = E[X_{11}] ] = 1.

The proof is identical to the previous "truncation method" proof. Note that this theorem is quite a bit stronger than the previous one, because we allow arbitrary dependence between S_1, S_2, . . . - they can be (but don't have to be) based on the same underlying sequence of random variables. Although this isn't "really math," I assert that any similar proof of the SLLN will also have this strengthening.

This matters because, it turns out, strengthening the full SLLN in this way actually makes it false:

Example 2. Let {X_{ij}} be i.i.d. with PDF

f_X(x) ∝ 1/(1 + |x|³),


and let S_n = (1/n) ∑_{i=1}^n X_{ni}. It is clear that E[X_{11}] = 0 by symmetry, and we can see

E[ |X_{11}| ] ≲ ∫_0^∞ 1/(1 + x²) dx < ∞.

Indeed, slightly higher moments exist:

E[ |X_{11}|^{1.5} ] ≲ ∫_0^∞ 1/(1 + x^{1.5}) dx < ∞.

This quickly implies (via Markov's inequality) that S_n → E[X_{11}] in probability. For fixed C > 0, we note that

P[ X_{n1} > Cn ] ≈ ∫_{Cn}^∞ 1/(1 + x³) dx ≈ n^{−2}.

Thus,

P[ |S_n| > ε ] ≥ ∑_{i=1}^n P[ {|S_n| > ε} ∩ {|X_{ni}| > 4εn} ] − ∑_{1≤i≠j≤n} P[ {|S_n| > ε} ∩ {|X_{ni}| > 4εn} ∩ {|X_{nj}| > 4εn} ]
             ≥ n P[ {|S_n| > ε} ∩ {|X_{nn}| > 4εn} ] − n² P[ {|X_{n1}| > 4εn} ∩ {|X_{n2}| > 4εn} ]
             ≥ n P[ {|S_{n−1}| < ε} ∩ {|X_{nn}| > 4εn} ] − O(n^{−2})
             ≥ n P[ |S_{n−1}| < ε ] · P[ |X_{nn}| > 4εn ] − O(n^{−2})
             ≳ 1/n.

Since S_1, S_2, . . . are independent, by the second Borel-Cantelli lemma we have

P[ |S_n| > ε infinitely often ] = 1.

In particular,

P[ lim_{n→∞} S_n = E[X_{11}] ] = 0.

We have learned two things:

(1) The above "truncation" proof always leads to versions of the SLLN that work even if S_1, S_2, . . . are independent.

(2) The full SLLN is not true if S1, S2, . . . are allowed to be independent!

Thus, to prove the SLLN, we need to take advantage of the fact that S_1, S_2, . . . have moderately strong dependence. This is fairly subtle, which is one of the reasons the full proof of the SLLN is so much harder than the above proofs of weaker versions.

2.4. Summary And Main Ideas.

(1) Review of L^p spaces as metric spaces, and L² as a Hilbert space.
(2) Truncation trick for SLLN.
(3) Decomposition of rare events in the main calculation of Example 2.


3. Lecture 2: January 10

Today, we prove the SLLN and then prove some basic results that will be used to discuss other types of convergence.

(1) We will go over the proof of Theorem 22.1 of Billingsley (SLLN), as well as the immediate corollary.

(2) We will define the σ-algebra of "tail events" and go over the proof of Theorem 22.4 of Billingsley (Kolmogorov's 0-1 law).

(3) We will define the σ-algebra of "exchangeable events" and go over the proof of the Hewitt-Savage 0-1 law.

(4) We will consider some simple applications of these results, including a comparison of the two 0-1 laws.

Notes that are not in Billingsley:

Definition 3.1. Say that f : R^n → R is symmetric if

f(x_1, . . . , x_n) = f(x_{π(1)}, . . . , x_{π(n)})

for all permutations π ∈ S_n.

Theorem 3 (Hewitt-Savage 0-1 Law). Let {X_n}_{n∈N} be a sequence of i.i.d. random variables. Define the exchangeable σ-algebra by

E_n = σ( {X_{n+1}, X_{n+2}, . . .} ∪ { f(X_1, . . . , X_n) : f is Borel and symmetric } ),
E = ∩_n E_n.

Then E is trivial (that is, P[A] ∈ {0, 1} for all A ∈ E).

Remark 3.2. Note that E_1 ⊃ E_2 ⊃ . . ., since we get only symmetrized information about more and more random variables. Prove this for yourself if it is not clear!

Remark 3.3. Note that the exchangeable σ-algebra always contains the tail σ-algebra. Thus, the Hewitt-Savage 0-1 law is better than the Kolmogorov 0-1 law when it applies.

Proof. Fix A ∈ E. Note that, since E ⊂ σ( ∪_n σ(X_1, . . . , X_n) ), there exists a sequence {A_n}_{n∈N} with the properties:

(1) A_n ∈ σ(X_1, . . . , X_n), and so in particular A_n = { (X_1, . . . , X_n) ∈ B_n } for some measurable set B_n.

(2) lim_{n→∞} P[A Δ A_n] = 0.

Define A′_n = { (X_{n+1}, . . . , X_{2n}) ∈ B_n }, and define the permutation π_n ∈ S_{2n} by:

π_n(i) = n + i,   1 ≤ i ≤ n,
π_n(i) = i − n,   n + 1 ≤ i ≤ 2n.

When π_n acts on coordinates in the obvious way, we have:

π_n(A_n) = A′_n,   π_n(A′_n) = A_n,   π_n(A) = A,


where the last line is true because A ∈ E is exchangeable. Thus,

P[A′_n Δ A] = P[A_n Δ A] → 0

as n goes to infinity, which implies

P[A′_n ∩ A_n] → P[A].

On the other hand, since {X_n}_{n∈N} is i.i.d.,

P[A_n ∩ A′_n] = P[A_n] P[A′_n] → P[A]².

Combining these two equalities, we conclude

P[A] = P[A]²,

which implies

P[A] ∈ {0, 1}.

We can apply this:

Theorem 4. Let X_1, X_2, . . . be i.i.d. and let S_n = ∑_{i=1}^n X_i. Then one of the following must occur:

(1) S_n ≡ 0.
(2) lim_{n→∞} S_n = ∞.
(3) lim_{n→∞} S_n = −∞.
(4) liminf_{n→∞} S_n = −∞ and limsup_{n→∞} S_n = ∞.

Proof. Sketch: Note that, for all x ∈ [−∞, ∞], the events {liminf_{n→∞} S_n ≤ x} and {limsup_{n→∞} S_n ≤ x} are in the exchangeable σ-algebra. Thus, by Hewitt-Savage, these events all have probability 0 or 1, and so the two random variables liminf_{n→∞} S_n and limsup_{n→∞} S_n are almost-surely constant (this last assertion is worth proving; the approach used in Exercise 2.1 of Billingsley will work here as well). By independence, it is straightforward to check that they must take on values in {−∞, 0, ∞}, and the theorem follows immediately.

Remark 3.4. This calculation is hard to do directly from e.g. the SLLN or the CLT. In particular, it applies even if E[X_1] does not exist!

3.1. Summary And Main Ideas.

(1) Statement of SLLN and 0-1 laws.
(2) "Lacunary sequence" trick for SLLN proof. Recall this had two parts:

(a) If X_n → X in probability, any sufficiently sparse subsequence X_{u_n} converges almost surely.

(b) If a sparse subsequence converges almost surely, this "almost sure" convergence can be transferred to the original sequence if the terms have sufficiently strong dependence.

(3) Using 0-1 laws.



4. Lecture 3: January 15

We will discuss maximal inequalities, and apply them to prove the convergence of random series.

(1) We proved Theorems 22.4 and 22.5 of Billingsley (maximal inequalities), and also the "reflection principle" for random walks.

(2) We discussed how to apply these results to prove Theorems 22.6-22.8 of Billingsley (convergence of random series).

(3) We do not cover the material related to random Fourier series in any depth.

We had two results from outside of Billingsley: the reflection principle and an application of the three-series theorem to AR processes. The reflection principle was an "easy" version of the maximal inequalities:

Example 5. Let {X_i}_{i∈N} be an i.i.d. sequence with

P[X_1 = 1] = P[X_1 = −1] = 1/2.

Define S_n = ∑_{i=1}^n X_i, and τ_k = inf{ n : S_n = k }. The reflection principle says:

P[ S_n = a − c, τ_a ≤ n ] = P[ S_n = a + c ]

for all a, c, n ∈ N. The proof is essentially a picture, as shown in class.

The reflection principle almost immediately implies the following maximal principle:

P[ τ_a ≤ n ] = P[ max_{1≤k≤n} S_k ≥ a ] = P[ S_n ∉ [−a, a−1] ].
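As a quick sanity check on the reflection identity (not in the original notes), here is a small Monte Carlo sketch; the values of n, a, and c are arbitrary illustrative choices (with a − c and a + c of the same parity as n).

    import numpy as np

    rng = np.random.default_rng(1)

    n, a, c = 20, 4, 2            # illustrative choices
    reps = 200_000

    steps = rng.choice([-1, 1], size=(reps, n))
    paths = np.cumsum(steps, axis=1)                 # S_1, ..., S_n for each replicate

    hit_a = (paths >= a).any(axis=1)                 # tau_a <= n (walk reaches level a)
    lhs = np.mean((paths[:, -1] == a - c) & hit_a)   # P[S_n = a - c, tau_a <= n]
    rhs = np.mean(paths[:, -1] == a + c)             # P[S_n = a + c]

    print(lhs, rhs)   # the two estimates should agree up to Monte Carlo error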

We applied the final maximal inequality (not the reflection principle or Kolmogorov's inequality) to the following situation:

Example 6 (AR processes). Fix φ ∈ [−1, 1] and a distribution π. Let Z_1, Z_2, . . . be i.i.d. with distribution π. For an initial point X_0, define the random sequence {X_n} by the recurrence:

X_{n+1} = φ X_n + Z_n.

We want to know: does there exist a measure µ with the property that, for X_0 ∼ µ, we have X_n ∼ µ for all n? In other words, does this recurrence have a stationary measure?

There is a natural guess here: we could define the measure µ as the law of the infinite series

X = ∑_{n=0}^∞ φ^n Z_n.

It is straightforward to check that, if this series converges almost surely, then its law is stationary. To check that it really does converge, we can use the Kolmogorov three-series theorem.
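Here is a minimal numerical sketch (not from the notes) of this guess, using φ = 0.5 and standard Gaussian noise as illustrative choices: it compares the long-run behaviour of the recurrence with the truncated series ∑ φ^n Z_n. With Gaussian noise the stationary law happens to be N(0, 1/(1 − φ²)), which gives a reference value.

    import numpy as np

    rng = np.random.default_rng(2)

    phi = 0.5                      # illustrative choice with |phi| < 1
    reps, n_steps = 100_000, 50

    # Run the recurrence X_{k+1} = phi * X_k + Z_k from X_0 = 0 for many steps;
    # after a burn-in, the value should be (approximately) a draw from the stationary law.
    x = np.zeros(reps)
    for _ in range(n_steps):
        x = phi * x + rng.standard_normal(reps)

    # The candidate stationary variable: the (truncated) series sum_n phi^n Z_n.
    z = rng.standard_normal((reps, n_steps))
    series = (phi ** np.arange(n_steps)) @ z.T     # shape (reps,)

    # Compare variances against the known stationary variance 1 / (1 - phi^2).
    print(np.var(x), np.var(series), 1.0 / (1.0 - phi ** 2))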

4.1. Summary And Main Ideas.

(1) The "usual" approach to maximal inequalities is to partition the event { ∑_{i=1}^n X_i > Cα } according to the first exceedance time, i.e. into the events {τ_α = r} for r = 1, . . . , n, where τ_α = inf{ j : ∑_{i=1}^j X_i > α }. In some of our proofs, we could choose C = 1. In others, we needed C > 1 in order to "give some room." Recall that we chose C > 1 when we wanted to use the next important trick:


(2) In the maximal inequalities, and also many examples related to sharpness, we wanted to understand the probability of the event ∪_{n∈I} { ∑_{i=1}^n X_i > α } over some interval I. That is, we wanted to know if at least one of the partial sums exceeds some large value. Our usual trick was to note that

{ |S_n| > α } ∪ { |S_{n+1}| > α } ⊃ { |X_{n+1}| > 2.1α }.

That is: if X_{n+1} is large, then at least one of S_n, S_{n+1} is large. This is useful because {X_n} is often i.i.d. or otherwise "nice," while {S_n} has strong dependence. We will use this trick frequently in the remainder of the course.

(3) In class, we proved the "reflection principle" and an associated maximal inequality for simple random walk. This looked very specialized, but we will see throughout this course that it is in fact very general. The key idea, which you should revisit once you have seen more about martingales and Brownian motion, is that a very broad class of stochastic processes can be well-approximated by (suitably warped) simple random walks. This lets you transfer facts about simple random walks, including these maximal inequalities, to a more general setting.


5. Lecture 4: January 17

We finish our discussion of almost sure convergence for averages, and begin discussing weak convergence.

(1) We complete our discussion of Kolmogorov's three-series theorem.

(2) We state Kronecker's lemma and give some consequences for sums of not-necessarily-i.i.d. random variables. Although we don't do this in class, I note that Kronecker's lemma is used in the "traditional" proof of the SLLN. See e.g. Amir Dembo's notes, available online, for this traditional argument. Dembo's notes also include some interesting examples.

(3) We begin building towards "distributional" convergence results, most importantly the central limit theorem. We begin by reviewing some important calculations and abstract distributional convergence results:

(a) We will go over the proof of Theorem 22.2 of Billingsley (near-uniqueness of moment-generating functions).

(b) We begin Section 25 of Billingsley, reviewing notions of convergence and some examples. We get up to the proof of Skorohod's embedding.

This class was largely taken from Billingsley. I emphasize one example from class, which was an explicit example constructed as in the proof of Skorohod's theorem:

Example 7. Let {Y_i}_{i∈N} be an independent sequence with

P[Y_i = 1] = i^{−1},   P[Y_i = 0] = 1 − i^{−1}.

Let Y_∞ ≡ 0 and let µ_i = L(Y_i). We note that µ_i converges weakly to µ_∞, but (by the second Borel-Cantelli lemma) Y_i does not converge almost surely to Y_∞.

How does this relate to Skorohod's lemma? Our proof of that lemma gave the construction as follows. Sample w ∼ Unif([0, 1]), and define

Y′_i(w) = 1,  w < i^{−1},
Y′_i(w) = 0,  w ≥ i^{−1}.

Then Y′_i =_D Y_i, but Y′_i converges almost surely to Y_∞.

5.1. Summary And Main Ideas.

(1) We completed our discussion of the three-series theorem. You should be able to recognize when this result is useful, and apply it to moderately difficult examples.

(2) We gave a sketch of the proof of the central limit theorem. It is worth remembering some shorthand for the proof of the central limit theorem.

(3) We saw quite a few abstract convergence results from Section 25. You should be very comfortable using all of them, and should remember their proofs (which are not too long). I don't have much to add here about "tricks," as these are all fairly fundamental. Personally, I always found it hard to remember to use Skorohod's construction (which lets you go from weak convergence to almost sure convergence of an associated sequence of random variables).


6. Lecture 5: January 22

We continue discussing weak convergence. This will include the "Fundamental Theorem" section of Billingsley (starting with Theorem 25.7) and applications. We will finish Chapter 25 of Billingsley, including discussion of tightness and uniform integrability.

In class, we did Problem 25.17 of Billingsley, which gives a very general sufficient condition for a sequence of measures to be tight. I think this is quite a worthwhile result, since the theorem is very flexible and it makes checking tightness fairly simple in many cases. There is also a very closely related theorem that lets you check if a sequence is uniformly integrable. I strongly suggest that you try the following exercises when studying:

(1) Check that you are comfortable doing "easy" proofs of tightness (such as the following example) fairly quickly. Personally, the condition given in Problem 25.17 of Billingsley seems like the simplest "general" condition to use.

(2) Prove an analogue to Problem 25.17 of Billingsley that gives a statement for uniform integrability, and use it to prove an analogue to the following example.

(3) Prove a result related to the converse of Problem 25.17 of Billingsley: if there exists a tight sequence, then there exists a function satisfying the conditions of Problem 25.17.

An immediate application of Problem 25.17 is:

Lemma 6.1. Let {X_n} be any collection of random variables with sup_n E[|X_n|] < ∞. Then the collections {L(X_n)} and {L(n^{−1} ∑_{i=1}^n X_i)} are both tight.
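A one-line verification (a sketch in my wording, using Markov's inequality rather than the Problem 25.17 condition directly):

sup_n P[ |X_n| > K ] ≤ sup_n E[|X_n|] / K → 0 as K → ∞,

and E[ |n^{−1} ∑_{i=1}^n X_i| ] ≤ sup_i E[|X_i|], so the same bound applies to the averages.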

We also did Problem 25.20 of Billingsley. It is very similar to many examples you saw in the first half of the course when studying convergence theorems for integrals. The content of the example is: the analogue of the dominated convergence theorem for uniformly integrable random variables is not sharp, for essentially the same reason that the dominated convergence theorem itself is not sharp. It is worth thinking for a moment about this connection so that you don't waste a lot of time "reinventing the wheel."

Example 8. Let {X_n}_{n∈N} satisfy P[X_n = n] = 1/(n log(n)), P[X_n = 0] = 1 − P[X_n = n]. Then

lim_{α→∞} sup_n E[ |X_n| 1_{|X_n|>α} ] ≤ lim_{α→∞} sup_{n : n>α} 1/log(n) ≤ lim_{α→∞} 1/log(α) = 0.

Thus, {X_n} is uniformly integrable. On the other hand, if Z dominates {X_n} in the sense

P[ |Z| > s ] ≥ sup_n P[ |X_n| > s ],

we would have P[ |Z| ≥ n ] ≥ 1/(n log(n)) for all n. By the usual integration-by-parts formula, this implies that E[|Z|] = ∞.

6.1. Summary And Main Ideas.

(1) We finished the "fundamental" weak convergence results. As with the previous day, you should know these results and their proofs extremely well; there will certainly be at least one question on weak convergence conditions on the midterm and/or final exam. As emphasized during the proofs, the main idea is often to use Skorohod's embedding theorem to replace weak convergence with actual convergence of random variables, then invoke limit theorems from your introduction to measure theory.

(2) You should always remember the tightness and uniform integrability conditions. It is worthwhile to remember that they are closely related to each other.

(3) In class, we pointed out that you could use uniform integrability to prove a version of the dominated convergence theorem that is strictly better (for probability measures). It is worth remembering how that worked, and that we have a better version of DCT available.


7. Lecture 6: January 24

We will begin Chapter 26 of Billingsley, covering roughly the first half. In class, we also stated Polya's criterion (Exercise 26.3 of Billingsley) and used it to do Exercise 26.4 of Billingsley. We also used Polya's criterion to prove that the "α-stable" distributions exist (this last was just an illustrative application; we will not really study these distributions in class).

7.1. Summary And Main Ideas. I think class was a little confusing this day, and I also think Billingsley is a little confusing. We started talking about characteristic functions (CHFs) because we want to use them to prove the central limit theorem. However, we seem to have proved a lot of side results as well. It is worth spending a moment to think about what sorts of things we prove about CHFs and why we care:

(1) Taylor Series Estimates: We can write

φ(s) = E[e^{isX}] = E[ ∑_n (isX)^n / n! ].

It is tempting to swap the sum and expectation, and otherwise start formally manipulating this formula (e.g. by taking derivatives in order to get a formula for the moments of X in terms of the derivatives of φ). Unfortunately we can't always do that (the most obvious problem is that X may not have enough moments - but as you might guess from e.g. Problem 6 of HW 1, that is not a precise diagnostic). The first chunk of the chapter was about justifying these formal manipulations by getting estimates on the error |e^{ix} − ∑_{k=0}^n (ix)^k / k!| of the partial sums. This is the most important bit from the point of view of proving the CLT.

(2) Basic Properties: We pointed out that CHFs always exist, are always uniformly continuous, etc. Good to have these (and their short proofs) in mind.

(3) Smoothness and Convergence to 0: We proved the Riemann-Lebesgue theorem, which says: if X has a density, then φ_X(s) → 0 as |s| → ∞. This is the first result we proved that says, roughly: decay of the tails of φ_X is closely related to the smoothness of X. The proof was short and, by itself, I think quite unilluminating.

(4) Drawing CHFs: Polya's criterion lets you "just draw" a function and check that it is the characteristic function of something, without knowing what it is a characteristic function of. This is incredibly useful for thinking about characteristic functions, though that may not be so obvious at first. Roughly speaking, if you have a guess about CHFs, you can just start drawing CHFs that follow Polya's criterion and see if you run into problems (see the next point). Note that you're already used to doing this informal reasoning when you think about CDFs.

(5) Bad Heuristics and Counterexamples: We have a lot of nice results relating CHFs, moments, and distributions. It is very easy to end up with a sort of fuzzy view that knowledge of the full CHF, the derivatives of the CHF in a neighbourhood of 0, moments, and distributions are all sort of equivalent. This is not quite right, and Polya's criterion lets us see this pretty easily: we can clearly draw two CHFs that are equal in a large neighbourhood of 0 but are not equal generally.
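For reference, the standard estimate behind item (1) above (this display is added here for convenience; it is the usual bound from Billingsley's Section 26) is

| e^{ix} − ∑_{k=0}^n (ix)^k / k! | ≤ min( |x|^{n+1}/(n+1)!,  2|x|^n/n! ),

which is what lets you trade a moment assumption on X for a quantitative Taylor expansion of φ.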


8. Lecture 7: January 29

We will finish our discussion of Chapter 26 of Billingsley. We also do Problem 26.10 of Billingsley and discuss many, many consequences. The main points of this discussion were:

(1) We found a formula for all derivatives of a density in terms of its characteristic function, under certain conditions.

(2) We established dominated-convergence-like conditions for derivatives of a density to converge. These generalize the result for the first derivative given in Problem 26.10 of Billingsley.

(3) We did not discuss any converse of the result, though in fact there are many. This is a vast field; see e.g. Katznelson's An Introduction to Harmonic Analysis for more information.

8.1. Summary And Main Ideas. Lots of important results today:

(1) You must know the uniqueness and convergence theorems! You don't really have to remember the details of the proofs.

(2) You should understand the inversion formula, and that (formally) manipulating it by e.g. taking derivatives gives interesting new results; the most interesting of these were formulas for the moments of random variables in terms of derivatives of their characteristic functions, and formulas for the densities of random variables in terms of integrals of their characteristic functions. You should also understand easy ways to prove that these formal manipulations work, by e.g. using the dominated convergence theorem. We proved many results along these lines, and you should expect one to appear on the midterm and/or final.
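For quick reference, the standard statements behind (2) (recorded here in my wording) are: if φ is integrable, then X has a bounded continuous density given by the inversion formula

f(x) = (1/2π) ∫_R e^{−isx} φ(s) ds,

and if E[|X|^k] < ∞, then φ is k times differentiable with φ^{(k)}(0) = i^k E[X^k].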


9. Lecture 8: January 31

We will begin Chapter 27 of Billingsley, covering everything up to and including the proof of the CLT and the statement of Lyapunov's condition. We also gave several applications of the CLT and the results we've shown so far, including:

(1) Using Lemma 1 in Chapter 27 of Billingsley, together with the continuity theorem for characteristic functions, to prove the Weak Law of Large Numbers.

(2) Using the usual central limit theorem (both directly and indirectly) to prove two stability results for Gaussians. In both, assume X, Y are i.i.d. with E[X] = 0, E[X²] = 1. Assume either:

(a) (X + Y)/√2 =_D X, or

(b) (X − Y) and (X + Y) are independent.

Then X is a standard Gaussian. The first result was an immediate application of the CLT (a short argument for (a) is sketched after this list). The second we proved via manipulation of characteristic functions in order to relate s^{−1}φ_X(s) and s^{−2}φ_X(s) at "large" values of s to their values at s near 0.

(3) We proved a central limit theorem for the “record problem” from homework.
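Here is the short argument for item (2a), recorded as a sketch in my wording (this is the standard CLT argument): if (X + Y)/√2 =_D X for i.i.d. copies, then iterating the identity over independent copies X_1, . . . , X_{2^n} gives

2^{−n/2} ∑_{i=1}^{2^n} X_i =_D X   for every n.

By the central limit theorem, the left-hand side converges weakly to N(0, 1) as n → ∞, while its distribution never changes; hence X ∼ N(0, 1).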

9.1. Summary And Main Ideas. Remember the statement of the CLT, the Lyapunov and Lindeberg conditions, and the heuristic "we expect a CLT if a sum is made of many small, independent contributions."

The rest of the examples we did in class were largely for practice. This is a major result, and you should expect at least one application of the CLT to appear on the midterm and/or final.


10. Lecture 9: February 5

We will finish our discussion of Chapter 27 of Billingsley, and also do Problem 27.14 of Billingsley. Note that:

(1) We do not cover anything after the proof of the CLT for dependent random variables, and

(2) The CLT for dependent random variables is too technical to appear on an exam without many hints/partial results.

The main lessons from today are important examples (e.g. CLTs for Markov chains and self-normalized sums) and the "blocking" trick: you can prove a CLT by splitting a sum into a bunch of "large" blocks (which are nearly independent, and thus converge via the usual CLT) separated by "small" blocks (which depend strongly on the big blocks, but which are shown to be negligible via moment bounds). The "small" and "large" blocks can be combined via Slutsky's theorem.

Remark 10.1. To check that you understand the proof of the CLT for Markov chains, you could try to prove an analogous CLT for Markov random fields on e.g. Z². The proof should be very similar in spirit, with a slightly more complicated definition of the "large" and "small" blocks.

Markov chains are an important class of stochastic processes that satisfy Billingsley's mixing condition:

Example 9 (Discrete Markov Chain). We call an n by n matrix K a transition kernel if it satisfies:

K[i, j] ≥ 0,   ∑_{m=1}^n K[i, m] = 1

for all i, j. We call a sequence of random variables {X_s}_{s∈N} a Markov chain with kernel K if it has the following distribution:

P[ X_{s+1} = x | X_1, . . . , X_s ] = K[X_s, x].

Quick exercise: Check that this equation, combined with a starting point X_1, uniquely determines the entire distribution of {X_s}_{s∈N}.

Assume that K satisfies the minorization condition

min_{1≤a≤n} K^s[a, b] ≥ δ > 0

for some s ∈ N, b ∈ {1, 2, . . . , n}, and δ > 0. It is then straightforward to check that

|K^{ms+1}[a_1, c] − K^{ms+1}[a_2, c]| ≤ (1 − min_{1≤a′_1, a′_2≤n} min( K^s[a′_1, b], K^s[a′_2, b] ))^m ≤ (1 − δ)^m.

Thus, we see that Markov chains that satisfy a minorization condition satisfy the conditions for our CLT.
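A small numerical sketch (not from the notes; the 3-state kernel is an arbitrary illustration and it uses s = 1): it computes δ from the minorization condition and checks that the row-to-row differences of K^m decay at least as fast as (1 − δ)^m.

    import numpy as np

    # An arbitrary 3-state transition kernel (rows sum to 1).
    K = np.array([
        [0.5, 0.3, 0.2],
        [0.2, 0.6, 0.2],
        [0.3, 0.3, 0.4],
    ])

    # Minorization with s = 1: delta = max over columns b of (min over rows a of K[a, b]).
    delta = K.min(axis=0).max()

    Km = np.eye(3)
    for m in range(1, 11):
        Km = Km @ K                                   # K^m
        # Largest entrywise difference between any two rows of K^m.
        diff = max(np.abs(Km[a1] - Km[a2]).max()
                   for a1 in range(3) for a2 in range(3))
        print(m, diff, (1 - delta) ** m)              # diff should stay below (1 - delta)^m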

After proving the CLT for dependent random variables in the main text of Billingsley, we will go through Exercise 27.14 of Billingsley:


Example 10. Let {X_n}_{n∈N} be i.i.d. with mean 0 and variance 1. Let S_n = ∑_{i=1}^n X_i. For s > 0, let v_s be a random variable taking values in N (possibly dependent on {X_n}). Assume that there exists a sequence {a_s} ⊂ N so that

a_s → ∞,   v_s/a_s →_D 1.

Then

S_{v_s}/√v_s →_D N(0, 1),   S_{v_s}/√a_s →_D N(0, 1).

Remark 10.2. This result is used frequently in statistics. Recall: the usual "confidence interval" in statistics looks like

( n^{−1}S_n − z_{α/2} σ,  n^{−1}S_n + z_{α/2} σ )

if you "know" the variance of the underlying sampling distribution. Since you never know this in practice, you always end up replacing σ with an estimator, such as the sample standard deviation. The present theorem justifies this replacement.
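A minimal illustration of the plug-in step (not in the notes): here the σ in the display is interpreted as the standard deviation of n^{−1}S_n, and it is replaced by the sample standard deviation divided by √n. The exponential sample and the 95% level are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(3)

    n = 500
    x = rng.exponential(scale=2.0, size=n)       # illustrative data with unknown mean 2.0

    z = 1.959963984540054                        # z_{alpha/2} for a 95% interval
    mean = x.mean()
    se = x.std(ddof=1) / np.sqrt(n)              # plug-in estimate of the std. dev. of the mean

    print((mean - z * se, mean + z * se))        # approximate 95% confidence interval for E[X]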

Most of this proof consists of step-by-step replacement of the random v_s terms by deterministic a_s terms, using Slutsky's theorem. I follow the order suggested by Billingsley in the exercise, but in fact this doesn't matter too much.

We begin by assuming the second is true and showing that it implies the first. Indeed, note that

S_{v_s}/√v_s = S_{v_s}/√a_s − (S_{v_s}/√a_s)(1 − √(a_s/v_s)).

If S_{v_s}/√a_s converges in distribution to any random variable Z, then, since (1 − √(a_s/v_s)) converges to 0 in probability, we must have that (S_{v_s}/√a_s)(1 − √(a_s/v_s)) converges to 0 in probability as well by Slutsky's theorem. But applying Slutsky's theorem again, this implies that S_{v_s}/√v_s also converges to Z in distribution. Thus, it is enough to study the sequence S_{v_s}/√a_s.

Repeating essentially the same argument, we have

S_{v_s}/√a_s = S_{a_s}/√a_s + (S_{v_s} − S_{a_s})/√a_s.

Since S_{a_s}/√a_s converges to a Gaussian via the usual CLT, by Slutsky's theorem it is thus sufficient to check that (S_{v_s} − S_{a_s})/√a_s converges to 0 in probability.

Finally, to check that (S_{v_s} − S_{a_s})/√a_s converges to 0 in probability, note that

P[ |S_{v_s} − S_{a_s}| > ε√a_s ] ≤ P[ |v_s − a_s| ≥ ε³a_s ] + P[ max_{k : |k−a_s| ≤ ε³a_s} |S_k − S_{a_s}| ≥ ε√a_s ]
                               ≤ P[ |v_s − a_s| ≥ ε³a_s ] + 1/(ε²a_s),

where the last line is via Kolmogorov's inequality. Since v_s/a_s → 1 in probability, the first term goes to 0. Since a_s → ∞, the second term also goes to 0.

10.1. Summary And Main Ideas. The arguments presented today were too complicated for most of us to be able to memorize effectively. However, the following are good rough ideas to have in the back of your head:

(1) The easiest way to prove a CLT (or SLLN, or other limit theorem) for weakly-dependent random variables is to use the blocking trick. Informally, you can view the blocking trick as being similar in spirit to the truncation trick: in the usual truncation trick, you set to 0 the part of the random variable that is too big; in the blocking trick, you set to 0 the part of the random variable that has dependence.

(2) We discussed discrete Markov chains, which are a nice class of dependent random variables that appear in many places. It is worth remembering roughly how they work.


11. Lecture 10: February 7

We will take a detour and cover Sections 9.1-9.8 of the book Probability with Martingales by David Williams. This gives a careful definition of the conditional expectation, using a very different proof strategy than that given in Chapter 33 of Billingsley. We won't talk about the differences much in class, as it won't make too much sense until you have seen both strategies, but here is a rough guide for future reference:

(1) The construction in Williams is based on some very important ideas:

(a) L²(Ω, F, P) is a complete Hilbert space, and if G ⊂ F then L²(Ω, G, P) ⊂ L²(Ω, F, P). Furthermore, L²(Ω, G, P) is also a complete Hilbert space.

(b) For those Hilbert spaces, conditional expectation is just taking the "usual" projection operator. Because Hilbert spaces are so nice, projections behave basically the way you expect them to in linear algebra class (a concrete statement is sketched after this list).

(c) L²(Ω, F, P) is dense in L¹(Ω, F, P), so we can just extend the definition from L² to L¹ for free - even though projections no longer make sense!

This is a very, very common strategy in functional analysis and related areas: do the calculation you want to do in the nicest possible space, then use something like continuity to get it to work everywhere.

(2) The construction of conditional expectation in Billingsley follows from a very important theorem, Radon-Nikodym, which tells you that a certain abstract "derivative" for measures exists. Unfortunately the proof of Radon-Nikodym seems to be quite a bit harder than Williams' direct construction.

(3) As an aside: the Radon-Nikodym theorem itself can be proved using the martingale methods we develop in this course. Time permitting we will go over this later in the course. Because of this, one never needs the difficult proof of Radon-Nikodym that is found in Billingsley.
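To make (1b) concrete, here is the statement being used (a sketch in my wording, following Williams' Sections 9.4-9.5): for X ∈ L²(Ω, F, P), the conditional expectation E[X | G] is the orthogonal projection of X onto the closed subspace L²(Ω, G, P). Equivalently, it is the (a.s. unique) G-measurable Y with

E[(X − Y)²] = min_{W ∈ L²(Ω,G,P)} E[(X − W)²],   equivalently   E[(X − Y)Z] = 0 for all Z ∈ L²(Ω, G, P),

and taking Z = 1_G recovers the defining property ∫_G Y dP = ∫_G X dP for all G ∈ G. For X ∈ L¹ one approximates by L² variables (e.g. truncations) and passes to the limit, as in (1c).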

In class, we will go over the relevant section of Williams' book. In addition, we will do Exercise 34.14 of Billingsley.

11.1. Summary And Main Ideas. This is an extremely important chunk of material to know, and there will certainly be questions about it on both the midterm and the final exam. The highlights are:

(1) How to construct the conditional expectation, including a good understanding of the relevant material about Hilbert spaces.

(2) Various properties of conditional expectation (e.g. the tower property, conditional Markov's inequality, etc., etc.) and how to prove them (mostly by using the "simple function" approach from Probability I).

(3) Relating the conditional expectation constructed by projections to the more familiar formulas about conditional expectation from earlier probability classes.

(4) Remembering that the conditional expectation is now a random variable, with all of the usual difficulties this entails (it is only defined up to "almost sure" equality, etc.).

We will go over much of this again when we get to the construction in Billingsley.

Note: I suggest skimming Chapter 31 of Billingsley before next class, as it will be a useful guide to intuition.


12. Lecture 11: February 12

We will go over Chapter 32 of Billingsley. We also did Exercise 4.1.8 of Dembo's notes, which shows that for absolutely continuous measures determined by a product formula

µ(∏_{i=1}^n A_i) = ∏_{i=1}^n µ_i(A_i),   ν(∏_{i=1}^n A_i) = ∏_{i=1}^n ν_i(A_i)

for all "product" sets of the form A = ∏_{i=1}^n A_i, the "obvious" formula

dµ/dν = ∏_{i=1}^n dµ_i/dν_i

for the Radon-Nikodym derivative is true.

12.1. Summary And Main Ideas. This chapter is not as fundamental as the preceding or following chapters. However, it gives some good practice for a few techniques we care about.

(1) You should know the main result (the Radon-Nikodym theorem) and the existence of the Lebesgue and Hahn decompositions (of countable set functions and their state spaces respectively).

(2) We got some practice relating the formal definition of a Radon-Nikodym derivative to the formulas that we expect to hold, based on earlier probability classes. The work of relating new formal definitions to old formulas is very important in this class (both for Radon-Nikodym derivatives and for conditional expectations), and so this practice might be the most important part of this chapter.

(3) I won't ask you to prove any of the results about countable set functions. However, they are decent review for earlier probability classes, as the proofs are fairly similar.


13. Lecture 12: February 14

Reminder: No class next week!

We will start Chapter 33 of Billingsley, which re-introduces conditional probability. Of course, we already have a formula for conditional probability thanks to reading Williams earlier. Nonetheless, Billingsley has a different (and very popular) perspective. More importantly for most of us, he has a very large number of examples.

I also very briefly discuss a third approach to conditional probabilities, known as disintegration of measure. I will not ask test questions about this, and the theorem is quite a bit harder than the weaker versions in Williams and Billingsley. However, it is an important tool and you are allowed to use it on exams as long as you cite it properly. I include one introductory reference on the course website, also available at:

http://www.stat.yale.edu/~jtc5/papers/ConditioningAsDisintegration.pdf

Related to this, we quickly went over Example 33.1 of Billingsley.

13.1. Summary And Main Ideas. We get lots more practice writing down conditional probabilities, and exploring their properties and how they relate to the definitions you've seen in earlier classes.


14. Lecture 13: February 26

Today will be devoted to midterm review. We will spend most of the time going over the midterm from the last time this course was taught, as our midterm will be fairly similar. Some rough information about the midterm:

(1) The midterm will cover Sections 22 and 25-27 of Billingsley, as well as Sections 9.1-9.8 of Williams' book.

(2) The midterm has 5 questions, and there are 5 "parts" of the course:

(a) Strong convergence (Section 22). Main ideas and results were the strong law of large numbers, the "truncation trick," the "maximal inequalities," and Kolmogorov's three-series theorem.

(b) Weak convergence (Section 25). Main ideas and results were general "easy" conditions for weak convergence (e.g. Slutsky's theorem), Skorohod's embedding theorem, the Portmanteau theorem, and conditions and applications related to tightness/uniform integrability.

(c) Characteristic Functions (Section 26). The most important result is the continuity theorem: a sequence of characteristic functions converges to a characteristic function if and only if the associated sequence of measures converges weakly. Some other important ideas were the formula for moments in terms of derivatives of the characteristic functions (including the proofs), various bounds related to Taylor expansions, and finally various results (including some exercises) relating the decay rate of |φ| to the smoothness of the associated measure (including proofs of these results, and formulas for the density of µ in terms of its characteristic function).

(d) Central Limit Theorem (Section 27). The most important results are the Lindeberg and Lyapunov CLT, and knowing how to apply them. We spent some time finding consequences and related results (e.g. the "stability" of the Gaussian distribution; CLTs for dependent random variables), but none of these will be on the midterm.

(e) Conditional Distributions (Williams' book). The most important part of this section is the (somewhat complicated) definition of a conditional probability. You should also know how to prove its basic properties (e.g. the tower property; various conditional versions of classical results), the details of how we proved its existence (first viewing conditioning as a projection operator in L², and then using a limiting operator to extend this definition), and how the new definition of conditional probability relates to the "usual" definition.

Thus, you should expect approximately one question per section.

(3) Roughly speaking, I classify the questions as follows:

(a) There will be three questions that are small modifications of results we showed in class or in homework. To give you an idea of what I mean by a "small modification," here are a few results that feel like fair game to me:

(i) In class, we proved that a family of random variables is tight if they are stochastically dominated by a single random variable. You should be able to prove an analogous condition for a sequence of random variables to be uniformly integrable. You should also know examples showing that these conditions are not themselves sharp.


(ii) We did Exercise 26.10 in Billingsley in class. You should be able to prove the analogous results for higher-order derivatives.

(iii) In class, we proved Kolmogorov's maximal inequality. If you read the proof carefully, you will see that in fact you don't need to assume full independence to get a similar result. I might ask for a proof of this analogous inequality in a case where one does not have exact independence as in the usual statement of Kolmogorov's inequality, but the proof does apply with essentially no changes.

(iv) After proving the continuity theorem for characteristic functions, Billingsley has a corollary of the following form: if you add an assumption (in this case: the assumption that the associated sequence of measures is tight), you can often remove another assumption (in the same example: the assumption that the limit of the characteristic functions is in fact a characteristic function itself) without changing the main conclusion or substantially changing the proof. You should understand the proofs of our main results well enough that you can do this type of "assumption exchange." I will not be too mean about this: any question of this form will not involve huge changes to the proof. Assumptions such as tightness are often good candidates for this type of question.

(b) Two questions will involve computations. In the midterm these will be quite short (the exam might have a longer computation), in one of the following categories:

(i) In class, we used the "truncation trick" several times, including e.g. to prove a SLLN. This is a fundamental technique, often used in conjunction with the Borel-Cantelli lemmas or Slutsky's theorem (to deal with the difference between the original random variable and its truncation). You should be able to use it appropriately to prove limit theorems for specific examples. This could include e.g. proving a SLLN for sequences that don't quite have finite expectation (e.g. due to rare but large outliers) or proving a CLT for sequences that don't quite satisfy Lindeberg's condition (again e.g. due to rare but large outliers). If you can't construct these sorts of examples for practice, please talk to me and we'll go through some together.

(ii) In class, we used the "first hitting time" trick to prove various maximal inequalities. This is also a basic technique that you should know how to use.

(iii) More generally, you should be able to verify the conditions of our main theorems (SLLN, Kolmogorov's three-series theorem, CLT, continuity theorem, etc.) for specific examples and recognize when doing so is appropriate.

(4) We will solve a few problems in the rest of the class, including the midterm from the last time this course was offered. I won't share the midterm on my website, but please email me if you would like to see a copy ahead of time.


15. Lecture 14: February 28

Midterm today.


16. Lecture 15-17: March 5, 7, 12

We completed Chapters 33 and 34 of Billingsley in this time. This is a fairly large change from the original draft of the schedule, where each chapter was given quite a bit of time. The main reason for the change was that we had already covered many of the most important constructions and theorems when studying Sections 9.1-9.8 of Williams. Although we gave less time than originally planned to Chapters 33 and 34, I want to emphasize that conditional expectations and conditional probabilities are still quite important to this course - we effectively spent 5-6 classes on the material, and you should expect 1-2 questions on the final exam.

I don't give detailed notes on Chapters 33 and 34 of Billingsley, since our coverage was not in the same order. Instead, I highlight the facts we should remember and the types of problems we should be able to do:

(1) We should know the basic definition of conditional expectation and probability, and their constructions (both via projection and via Radon-Nikodym derivatives).

(2) We should definitely know how to prove simple facts about conditional expectations and probabilities directly from the definitions. As an example, in class we proved the conditional integration-by-parts formula:

E[X | G] = ∫_0^∞ P[X > s | G] ds

for nonnegative random variables X ≥ 0. This involved a few lines of calculation, but essentially we just used the definition of conditional probability and carried it along through the usual proof of the integration-by-parts formula.

(3) We should know the long list of properties of conditional expectation (this is the list that includes theorems such as the tower property; conditional dominated convergence theorem; linearity of conditional expectations; monotonicity of conditional expectations; etc.), and how to prove them (most of these proofs were either direct from the definition, or used the "standard machine" of showing that the result holds for indicator functions of sets and then using linearity + the monotone convergence theorem).

(4) One of the main theorems in Billingsley is the existence of conditional distributions (in Chapter 33) and the fact that conditional distributions can be used to construct conditional expectations in a coherent way (in Chapter 34). You should be very familiar with the statements of these facts, and the proof of the latter theorem (you don't need to memorize the proof of the former, which is a bit too messy for an exam problem).

(5) In this part of the course, we put conditional probabilities on a rigorous footing. One of the most important things you need to know is that our new definitions always "agree" with the old definitions, when the old definitions make sense. You should definitely know:

(a) that the old formula for conditional probability/expectation gives a valid conditional distribution/expectation using the new definition, and

(b) how to prove this fact, and

(c) that this does not mean you can always marginalize the way you used to in an introductory probability course (see the Borel-Kolmogorov paradox covered in class; you must understand that this "paradox" really does present a problem for the undergrad-probability viewpoint and that it is not in conflict with our new definitions/theorems).

(6) We spent some time discussing sufficient statistics, Rao-Blackwellization, and related applications. Although you should be able to follow these calculations, you don't need to remember the specific applications for the final exam - there won't be any questions directly about e.g. sufficient σ-fields.


17. Lecture 18: March 14

We will start Chapter 35 of Billingsley.

General Note on Martingale Coverage: Billingsley gives a fairly terse introduction to martingales. In class, we'll likely add some extra material, including some worked examples and also the theory of "integrating to the limit" for martingales (the latter is one of the big reasons to study martingales). The examples are nice practice, but you will not need to memorize them for the final exam. You will also not need to memorize the proofs of the additional theoretical results, though you are free to use them as long as you quote them accurately before doing so. Most of the additional material will be from Dembo or Resnick, though isolated examples will be drawn from other books I reference on the website.

In addition to covering the first few pages of Billingsley, we did the following exercises and examples:

(1) Showed that for any stopping times θ, τ, min(θ, τ) and θ + τ are stopping times.

(2) Briefly mentioned the vertex-exposure martingales, and more generally how martingales are used in combinatorics.

(3) Briefly mentioned the "subharmonic" submartingale, and referenced the relationship between martingales and complex analysis.

(4) Before proving Theorem 35.2, we took a detour to prove a number of related results. Most of this is covered in pages 183-186 of Amir Dembo's notes. The most useful result/example here is likely the transformation of a martingale by pre-visible sequences (a short statement is recorded after this list); this trick is used throughout Billingsley in a number of special cases but is not explicitly called out.
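For reference, the pre-visible transformation mentioned in (4) is the following standard fact (stated here in my wording): if (X_n, F_n) is a martingale and H_n is bounded and F_{n−1}-measurable ("pre-visible") for each n, then the martingale transform

(H · X)_n = ∑_{k=1}^n H_k (X_k − X_{k−1})

is again a martingale, since E[(H·X)_{n+1} − (H·X)_n | F_n] = H_{n+1} E[X_{n+1} − X_n | F_n] = 0.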

17.1. Summary And Main Ideas. You should know:

(1) Quite a few definitions!

(2) How to do basic calculations related to the many σ-algebras that appear here (and next lecture).

(3) How to use the trick of transforming a martingale by pre-visible sequences.


18. Lecture 19: March 19

We will continue Chapter 35 of Billingsley, likely getting (roughly) to the martingale analogue of Kolmogorov's maximal inequality. Besides content in Billingsley, we

(1) Finish our excursion into Dembo's notes on stopping times, and complete several of the exercises.

(2) Prove Azuma's inequality, using the proof in the excellent survey article Concentration inequalities and martingale inequalities: a survey by Fan Chung and Linyuan Lu (the statement is recorded after this list). We also applied this to prove a simple concentration bound for the "balls-in-boxes" problem.
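For reference, the statement behind (2), recorded here in my wording (the balls-in-boxes constants below are an illustrative application, not a quote from class): if (M_n, F_n) is a martingale with bounded increments |M_k − M_{k−1}| ≤ c_k for all k, then for every t > 0,

P[ |M_n − M_0| ≥ t ] ≤ 2 exp( −t² / (2 ∑_{k=1}^n c_k²) ).

For example, if N balls are dropped independently and uniformly into boxes and Z is the number of empty boxes, then the Doob martingale M_k = E[Z | positions of the first k balls] has increments bounded by 1, so P[ |Z − E[Z]| ≥ t ] ≤ 2 exp(−t²/(2N)).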

18.1. Summary And Main Ideas. You should know:

(1) Even more definitions!

(2) Most of the proofs we discussed today are worth remembering (you are not responsible for the proof of Azuma's inequality, however).


19. Lecture 20: March 21

We continued reading Chapter 35 of Billingsley, getting to the martingale convergence theorem of Billingsley. We also took an extended break to study the branching process, as an example of a family of simple martingales with very nontrivial and nonobvious limits.

19.1. Summary And Main Ideas. You should know:

(1) Even more definitions!

(2) The proof of the martingale convergence theorem.

(3) You do not need to know the detailed calculations related to the branching process. However, I suggest that you remember what branching processes are and how to get a martingale out of the branching process (the standard example is recorded below) - one of the main skills in using martingales is figuring out how to turn a stochastic process of interest into a martingale, and there are a few standard tricks for doing this in simple situations.
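The standard example behind (3), recorded here as a sketch in my wording: let Z_n be a branching process, Z_{n+1} = ∑_{i=1}^{Z_n} ξ_{n,i} with the offspring variables ξ_{n,i} i.i.d. with mean m ∈ (0, ∞), and let F_n = σ(Z_0, . . . , Z_n). Then

E[ Z_{n+1} | F_n ] = m Z_n,

so M_n = Z_n / m^n satisfies E[M_{n+1} | F_n] = M_n; that is, M_n is a nonnegative martingale (and therefore converges almost surely by the martingale convergence theorem).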


20. Lecture 21: March 26

Raluca Balan taught a class on the existence/construction of stochastic processes.


21. Lecture 22: April 4

We continued reading Chapter 35 of Billingsley. Among other results, we:

(1) Checked that E[Z | F_n] is always uniformly integrable if Z is integrable (this was an application of conditional Markov's inequality; a sketch is given after this list).

(2) Used the above lemma to prove a version of the martingale convergence theorem in which we could integrate to the limit.

(3) Did Exercise 35.5 of Billingsley. This is not on the exam, but is a good exam-level practice problem.

(4) Introduced reverse martingales and proved the reverse martingale convergence theorem. Recall the proof is nearly identical to the proof of the original martingale convergence theorem, but the hypotheses change: you no longer require sup_n E[|X_n|] < ∞, but you now require that {X_n} is actually a martingale (not just a submartingale).

(5) As an aside, we proved an anti-concentration inequality for martingales (this was not in Billingsley). The proof was similar to, but more complicated than, the proof of Chebyshev's inequality and it made crucial use of the optional stopping theorem.
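A sketch of the argument behind (1), in my wording: write Y_n = E[Z | F_n]. For a > 0, since {E[|Z| | F_n] > a} ∈ F_n and |Y_n| ≤ E[|Z| | F_n],

E[ |Y_n| 1_{|Y_n| > a} ] ≤ E[ E[|Z| | F_n] 1_{E[|Z| | F_n] > a} ] = E[ |Z| 1_{E[|Z| | F_n] > a} ].

By (conditional) Markov's inequality, P[ E[|Z| | F_n] > a ] ≤ E[|Z|]/a uniformly in n, and since |Z| is a single integrable random variable, E[|Z| 1_A] is small whenever P[A] is small. Hence sup_n E[ |Y_n| 1_{|Y_n| > a} ] → 0 as a → ∞, which is uniform integrability.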

21.1. Summary And Main Ideas. As mentioned in class, this material is all after the last tested material, and none of it is needed for the final exam. Having said that, the reverse martingale construction is quite handy and might be an alternative way to do a question, and Exercise 35.5 is good practice.

When reviewing older material, you might want to consider martingale proofs of some of our earlier theorems. In particular,

(1) We already proved an extension to Kolmogorov's maximal inequality in class. Note that this proof does not reduce to our previous proof, even in the special case that {X_n} is i.i.d. However, the proof is "similar" in that it also involves a first-hitting time.

(2) In class, I mentioned that Kolmogorov's 0-1 law could be proved (and improved) using martingale arguments, but we did not actually do this. You might want to try this yourself. The key idea is to consider the reverse-martingale associated with the sequence E[1_A | F_n], then use the reverse-martingale convergence theorem. See Theorem 1.10 of

http://stat.columbia.edu/~porbanz/teaching/G6106S15/NotesG6106S15_04May15.pdf

for some details.

(3) See Section 8 of

http://galton.uchicago.edu/~lalley/Courses/385/Martingales.pdf

for further translations, including an extremely short proof of the strong law of large numbers.

None of this is mandatory - I mention it only because some people find it easier to remember the (usually shorter) martingale proofs, especially when they have spent the previous month studying martingales. Some of the martingale proofs (especially of the SLLN) are also generally considered much easier than the elementary proofs, once you know the theorems and definitions that they use.


22. Lecture 23: April 9

We will be doing mostly exam review today. Basic details:

(1) The exam will be 180 minutes. You may not bring your notes or books. You may bring a calculator, though I don't expect it to be useful.

(2) The exam says the following at the front:

There are 10 questions, each worth 10 marks. Attempt any combination of 7 problems. The exam will be graded out of 70 marks. The passing mark for the comprehensive examination is 45 marks. The course grade in MAT 5171/STAT 5708 is treated separately from the PASS or FAIL decision for the comprehensive examination. The latter can only be communicated by the Director of the Institute.

Explain clearly each step in your proofs. Quote carefully any theorems or results that you may be using.

Generally, the questions on the final exam will be similar to the midterm. The main difference is that the final exam does include at least one request for a definition, and also does include at least one question that asks you to prove a "baby version" of a main result we proved in class.

The exam will cover the material that we covered in class (Chapters 21, 22, 25, 26, 27, 32, 33, 34 and 35 of Billingsley; Sections 9.1-9.8 of Williams; a few specific exercises from other books). I personally classify the questions on the exam as follows:

(1) Asking you to state a definition. For example, "State the definition of "uniform integrability"."

(2) Asking you to re-prove an important theorem. For example, "Let {µ_n}_{n∈N} and µ be probability measures. Assume that µ_n(f) → µ(f) for all bounded, continuous functions f. Prove that µ_n converges weakly to µ." (this is the Portmanteau theorem).

Some theorems from class are too complicated to re-do entirely. This includes the CLT and other results about characteristic functions, the existence of Radon-Nikodym derivatives, and so on. In these cases I might give you many relevant lemmas and ask you to fill in a few important steps, or ask you to prove the theorem in a special case. For example, "Let X_n ∼ Unif({−1, +1}) be a sequence of i.i.d. random variables. Prove that (1/√n) ∑_{i=1}^n X_i converges weakly to a Gaussian" or "Prove the Radon-Nikodym theorem in the special case that the probability space is the interval [0, 1]."

(3) A small variant of a question from homework or an exercise we did in class. "Small variant" here should be interpreted as in the midterm (e.g. using the Fourier inversion formula to get a formula for 3rd derivatives, when we only did 2nd in class). For a more recent example, in class we showed that if τ, σ were stopping times, so are max(τ, σ) and τ + σ. I might ask you to prove that min(τ, σ) is a stopping time.

(4) Asking for a new but small calculation, along with an application of an existing theorem. For example, "Let {Y_n} ⊂ [−C, C] be a sequence of independent and uniformly bounded random variables. (i) Find a sequence A_n such that (∑_{i=1}^n Y_i²) − A_n is a martingale. (ii) Either prove that this martingale has an integrable limit, or find a counterexample (and prove that it is a counterexample)."


You should certainly expect at least one "counterexample"-type question. For example, "The Portmanteau theorem says, if µ_n(f) → µ(f) for all bounded, continuous f, then µ_n → µ weakly. Find an example for which µ_n(f) converges for all f that are bounded, continuous and have compact support, but µ_n does not converge weakly to a probability measure µ."

(5) There is exactly one "difficult computation" question that has not appeared in class or on the homework. Fortunately there is a fairly short proof, so it has the advantage of being quick to write down if you can find the trick.

The midterm review section has a quick discussion of the main results up to the midterm, as well as a list of appropriate-looking questions. Here is a similar list for what we've done since then:

(1) Chapter 32 (Radon-Nikodym derivative): This was a short chapter focusing on a single important theorem with a fairly non-obvious proof. You should know the definition and the precise statement of the result very well. I do not expect you to be able to prove the theorem from scratch, but you might e.g. be provided with the important lemmas and asked to do the final steps (there is one key idea, the construction of the family G of sub-densities, in the proof in Billingsley).

Note: Although I didn't emphasize it when we were covering this chapter in class, the Radon-Nikodym derivative does depend on the underlying σ-algebra F, and the derivative itself can be viewed as an F-measurable random variable. These ideas became important for several examples that appeared when studying martingales.

(2) Chapter 33 (conditional probability): This was a long and important chapter with many examples. You should definitely know all of the main definitions, both constructions of the conditional probability, and how to prove the results in our long list of conditional analogues to previous theorems (e.g. conditional monotone convergence theorem, conditional Markov's inequality, etc.). You should definitely also know how to prove that conditional probabilities behave the expected way, and should be familiar with how conditional distributions work. This last deserves some attention, as it was a frequent source of errors in the most recent homework. In particular, even though we have a theorem that conditional distributions exist, we do not have general theorems saying that conditional densities always exist, nor do we have a theorem saying that conditional densities are given by the formulas you learned in introductory classes; indeed both these things are false.

(3) Chapter 34 (conditional expectation): This chapter is quite similar to Chapter 33. Perhaps the only really-new idea is the theorem stating that you can use conditional distributions to define a coherent family of conditional expectations.

For Chapters 33 and 34 together: given the emphasis in class and homework, you should certainly expect at least one question on proving conditional analogues to previous theorems and at least one question on computing a formula for a conditional distribution or expectation (and proving it works).

(4) Chapter 35 (martingales): This was a long chapter, and we spent a long time on it. Expect several questions, likely including (i) providing the basic definitions and checking that certain specific examples satisfy them, (ii) re-proving important results (e.g. the maximal inequality, upcrossing lemma or convergence theorem), (iii) doing a calculation with martingales.


All of the ten questions have several parts, usually grouped around some application/theme. I classify the main topics as:

• 3 questions on convergence of sums of independent random variables.
• 1 question on weak convergence.
• 2 questions on conditional probability or expectations.
• 4 questions on martingales.

In particular, there is a slightly heavier weighting on the second half of the course.

Obviously there is overlap - questions about convergence of sums may also involve using results about weak convergence, and questions about martingales may require a good understanding of conditional probability. Note that I don't list any questions focused on characteristic functions or the Radon-Nikodym derivative, but I can assure you that at least one of those topics will appear quite heavily as a part of at least one question that is focused on something else!

