Chapter 2

Weak Convergence

Chapter 1 discussed limits of sequences of constants, either scalar-valued or vector-valued. Chapters 2 and 3 extend this notion by defining what it means for a sequence of random variables to have a limit. As it turns out, there is more than one sensible way to do this.

Chapters 2 and 4 (and, to a lesser extent, Chapter 3) lay the theoretical groundwork for nearly all of the statistical topics that will follow. While the material in Chapter 2 is essential, readers may wish to skip Chapter 3 on a first reading. Furthermore, as is common throughout the book, some of the proofs here have been relegated to the exercises, meaning that readers may study these proofs as much or as little as they desire.

2.1 Modes of Convergence

Whereas the limit of a constant sequence is unequivocally expressed by Definition 1.30, in the case of random variables there are several ways to define the convergence of a sequence. This section discusses three such definitions, or modes, of convergence; Section 3.1 presents a fourth.

2.1.1 Convergence in Probability

What does it mean for the sequence X1, X2, . . . of random variables to converge to, say, the random variable X? In what case should one write Xn → X? We begin by considering a definition of convergence that requires that Xn and X be defined on the same sample space. For this form of convergence, called convergence in probability, the absolute difference |Xn − X|, itself a random variable, should be arbitrarily close to zero with probability arbitrarily close to one. More precisely, we make the following definition.

Definition 2.1 Let Xn and X be defined on the same probability space. We say that Xn converges in probability to X, written Xn P→ X, if for any ε > 0,

P(|Xn − X| < ε) → 1 as n → ∞. (2.1)

It is very common that the X in Definition 2.1 is a constant, say X ≡ c. In such cases, we simply write Xn P→ c. There is no harm in replacing X by c in Definition 2.1 because any constant may be defined as a random variable on any sample space. In the most common statistical usage of convergence to a constant, the random variable Xn is some estimator of a particular parameter, say g(θ):

Definition 2.2 If Xn P→ g(θ), Xn is said to be consistent (or weakly consistent) for g(θ).

As the name suggests, weak consistency is implied by a concept called “strong consistency,” a topic to be explored in Chapter 3. Note that “consistency,” used without the word “strong” or “weak,” generally refers to weak consistency.

Example 2.3 Suppose that Y1, Y2, . . . are independent and identically distributed uniform(0, θ) random variables, where θ is an unknown positive constant. For n ≥ 1, let Xn be defined as the largest value among Y1 through Yn: That is, Xn def= max_{1≤i≤n} Yi. Then we may show that Xn is a consistent estimator of θ as follows:

By Definition 2.1, we wish to show that for an arbitrary ε > 0, P(|Xn − θ| < ε) → 1 as n → ∞. In this particular case, we can evaluate P(|Xn − θ| < ε) directly by noting that Xn cannot possibly be larger than θ, so that

P(|Xn − θ| < ε) = P(Xn > θ − ε) = 1 − P(Xn ≤ θ − ε).

Now the maximum Xn is less than a constant if and only if each of the random variables Y1, . . . , Yn is less than that constant. Therefore, by independence,

P(Xn ≤ θ − ε) = [P(Y1 ≤ θ − ε)]^n = (1 − ε/θ)^n as long as ε < θ.

(If ε ≥ θ, the above probability is simply zero.) Since any constant c in the open interval (0, 1) has the property that c^n → 0 as n → ∞, we conclude that P(Xn ≤ θ − ε) → 0 as desired.
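A quick numerical check of this calculation is easy to run; the following is a sketch assuming NumPy is available, with θ = 2, ε = 0.05, and the sample sizes chosen arbitrarily.

    import numpy as np

    rng = np.random.default_rng(0)
    theta, eps = 2.0, 0.05
    for n in [10, 100, 1000, 10000]:
        # Monte Carlo estimate of P(|Xn - theta| < eps), where Xn = max of n uniforms
        xn = rng.uniform(0, theta, size=(5000, n)).max(axis=1)
        empirical = np.mean(np.abs(xn - theta) < eps)
        exact = 1 - (1 - eps / theta) ** n   # the formula derived above (valid since eps < theta)
        print(n, round(empirical, 3), round(exact, 3))

Both columns increase toward one as n grows, in line with Definition 2.1.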

There are probabilistic analogues of the o and O notations of Section 1.3 that are frequently used in the literature but that apply to random variable sequences instead of constant sequences.


Definition 2.4 We write Xn = oP(Yn) if Xn/Yn P→ 0.

Definition 2.5 We write Xn = OP(Yn) if for every ε > 0, there exist M and N such that

P(|Xn/Yn| < M) > 1 − ε for all n > N.

In particular, it is common to write Xn = oP(1) to denote a sequence of random variables that converges to zero in probability, or Xn = OP(1) to state that Xn is bounded in probability (see Exercise 2.7 for a definition of bounded in probability).

Example 2.6 In Example 2.3, we showed that if Y1, Y2, . . . are independent and identically distributed uniform(0, θ) random variables, then

max_{1≤i≤n} Yi P→ θ as n → ∞.

Equivalently, we may say that

max_{1≤i≤n} Yi = θ + oP(1) as n → ∞. (2.2)

It is also correct to write

max_{1≤i≤n} Yi = θ + OP(1) as n → ∞, (2.3)

though statement (2.3) is less informative than statement (2.2). On the other hand, we will see in Example 6.1 that statement (2.3) may be sharpened considerably, and made even more informative than statement (2.2), by writing

max_{1≤i≤n} Yi = θ + OP(1/n) as n → ∞.

2.1.2 Convergence in Distribution

As the name suggests, convergence in distribution has to do with convergence of the distribution functions of random variables. Given a random variable X, the distribution function of X is the function

F (x) = P (X ≤ x). (2.4)

Any distribution function F(x) is nondecreasing and right-continuous, and it has limits lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1. Conversely, any function F(x) with these properties is a distribution function for some random variable.


It is not enough to define convergence in distribution as simple pointwise convergence of a sequence of distribution functions; there are technical reasons that such a simplistic definition fails to capture any useful concept of convergence of random variables. These reasons are illustrated by the following two examples.

Example 2.7 Let Xn be normally distributed with mean 0 and variance n. Then the distribution function of Xn is Φ(x/√n), where Φ(z) denotes the standard normal distribution function. Because Φ(0) = 1/2, we see that for any fixed point x, the distribution function of Xn converges to 1/2 at that point as n → ∞. But the function that is constant at 1/2 is not a distribution function. Therefore, this example shows that not all convergent sequences of distribution functions have limits that are distribution functions.
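A numerical illustration of this limit, sketched under the assumption that SciPy is installed (the evaluation points and the values of n are arbitrary):

    from scipy.stats import norm

    # Distribution function of Xn ~ N(0, n), evaluated at a few fixed points x
    for x in [-3.0, 0.0, 3.0]:
        print([round(norm.cdf(x / n ** 0.5), 4) for n in [1, 100, 10000, 1000000]])
    # Each row drifts toward 1/2, and the constant function 1/2 is not a distribution function.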

Example 2.8 By any sensible definition of convergence, 1/n should converge to 0. But consider the distribution functions Fn(x) = I{x ≥ 1/n} and F(x) = I{x ≥ 0} corresponding to the constant random variables 1/n and 0. We do not have pointwise convergence of Fn(x) to F(x), since Fn(0) = 0 for all n but F(0) = 1. Note, however, that Fn(x) → F(x) is true for all x ≠ 0. Not coincidentally, the point x = 0 is the only point at which the function F(x) is not continuous.

To write a sensible definition of convergence in distribution, Example 2.7 demonstrates that we should require that the limit of distribution functions be a distribution function, while Example 2.8 indicates that we should exclude points where F(x) is not continuous. We therefore arrive at the following definition:

Definition 2.9 Suppose that X has distribution function F(x) and that Xn has distribution function Fn(x) for each n. Then we say Xn converges in distribution to X, written Xn d→ X, if Fn(x) → F(x) as n → ∞ for all x at which F(x) is continuous. Convergence in distribution is sometimes called convergence in law and written Xn L→ X.

Example 2.10 The Central Limit Theorem for i.i.d. sequences: Let X1, . . . , Xn be independent and identically distributed (i.i.d.) with mean µ and finite variance σ². Then by a result that will be covered in Chapter 4 (but which is perhaps already known to the reader),

√n ((1/n) ∑_{i=1}^n Xi − µ) d→ N(0, σ²), (2.5)

where N(0, σ²) denotes a normal random variable with mean 0 and variance σ². Note that the expressions on either side of the d→ symbol may sometimes be random variables, sometimes distribution functions or other notations indicating certain distributions. The meaning is always clear even if the notation is sometimes inconsistent.

Note that the distribution on the right-hand side of (2.5) does not depend on n. Indeed, as a general rule, the right side of an arrow should never depend on n when the arrow conveys the idea “as n → ∞.” By Definition 2.5, the right side of (2.5) is OP(1). We may therefore write (after dividing through by √n and adding µ)

(1/n) ∑_{i=1}^n Xi = µ + OP(1/√n) as n → ∞,

which is less specific than expression (2.5) but expresses at a glance the rate of convergence of the sample mean to µ.
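The OP(1/√n) statement can be seen in a short simulation; here is a sketch assuming NumPy, with exponential(1) data (so µ = σ² = 1) as an arbitrary choice:

    import numpy as np

    rng = np.random.default_rng(1)
    mu = 1.0   # mean of the exponential(1) distribution
    for n in [100, 1000, 10000]:
        xbar = rng.exponential(1.0, size=(2000, n)).mean(axis=1)
        # sqrt(n) * (sample mean - mu) has roughly unit standard deviation for every n,
        # so the sample mean itself deviates from mu by O_P(1/sqrt(n)).
        print(n, round(np.std(np.sqrt(n) * (xbar - mu)), 3))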

Could Xn d→ X imply Xn P→ X? Not in general: Since convergence in distribution only involves distribution functions, Xn d→ X is possible even if Xn and X are not defined on the same sample space. However, we now prove that convergence in probability does imply convergence in distribution.

Theorem 2.11 If Xn P→ X, then Xn d→ X.

Proof: Let Fn(x) and F(x) denote the distribution functions of Xn and X, respectively. Assume that Xn P→ X. We need to show that Fn(c) → F(c), where c is any point of continuity of F(x).

Choose any ε > 0. Note that whenever Xn ≤ c, it must be true that either X ≤ c + ε or |Xn − X| > ε. This implies that

Fn(c) ≤ F(c + ε) + P(|Xn − X| > ε).

Similarly, whenever X ≤ c − ε, either Xn ≤ c or |Xn − X| > ε, implying

F(c − ε) ≤ Fn(c) + P(|Xn − X| > ε).

We conclude that for arbitrary n and ε > 0,

F(c − ε) − P(|Xn − X| > ε) ≤ Fn(c) ≤ F(c + ε) + P(|Xn − X| > ε). (2.6)

Taking both the lim inf_n and the lim sup_n of the above inequality, we conclude (since Xn P→ X) that

F(c − ε) ≤ lim inf_n Fn(c) ≤ lim sup_n Fn(c) ≤ F(c + ε)


for all ε. Since c is a continuity point of F(x), letting ε → 0 implies

F(c) = lim inf_n Fn(c) = lim sup_n Fn(c),

so we know Fn(c) → F (c) and the theorem is proved.

We remarked earlier that Xn d→ X could not possibly imply Xn P→ X because the latter expression requires that Xn and X be defined on the same sample space for every n. However, a constant a may be considered to be a random variable defined on any sample space; thus, it is reasonable to ask whether Xn d→ a implies Xn P→ a. The answer is yes:

Theorem 2.12 Xn d→ a if and only if Xn P→ a.

Proof: We only need to prove that Xn d→ a implies Xn P→ a, since the other direction is a special case of Theorem 2.11. If F(x) is the distribution function of the constant random variable a, then a is the only point of discontinuity of F(x). Therefore, for ε > 0, Xn d→ a implies that Fn(a − ε) → F(a − ε) = 0 and Fn(a + ε) → F(a + ε) = 1 as n → ∞. Therefore,

P(−ε < Xn − a ≤ ε) = Fn(a + ε) − Fn(a − ε) → 1,

which means Xn P→ a.

2.1.3 Convergence in (quadratic) mean

Definition 2.13 Let k be a positive constant. We say that Xn converges in kth mean to X, written Xn k→ X, if

E |Xn − X|^k → 0 as n → ∞. (2.7)

Two specific cases of Definition 2.13 deserve special mention. When k = 1, we normally omit mention of the k and simply refer to the condition E |Xn − X| → 0 as convergence in mean. Note that convergence in mean is not equivalent to E Xn → E X: For one thing, E Xn → E X is possible without any regard to the joint distribution of Xn and X, whereas E |Xn − X| → 0 clearly requires that Xn − X be a well-defined random variable.

Even more important than k = 1 is the special case k = 2:

Definition 2.14 We say that Xn converges in quadratic mean to X, written Xn qm→ X, if

E |Xn − X|² → 0 as n → ∞.


Convergence in quadratic mean is important for two reasons: First, it is often quite easy to check: In Exercise 2.1, you are asked to prove that Xn qm→ c if and only if E Xn → c and Var Xn → 0 for some constant c. Second, quadratic mean convergence is stronger than convergence in probability, which means that weak consistency of an estimator may be established by checking that it converges in quadratic mean. This latter property is a corollary of the following result:

Theorem 2.15 Let k > 0 be fixed. Then Xn k→ X implies Xn P→ X.

Proof: The proof relies on Markov’s inequality (1.22), which states that

P(|Xn − X| ≥ ε) ≤ (1/ε^k) E |Xn − X|^k (2.8)

for an arbitrary ε > 0. If Xn k→ X, then by definition the right-hand side of inequality (2.8) goes to zero as n → ∞. But this implies that P(|Xn − X| ≥ ε) also goes to zero as n → ∞. Therefore, Xn P→ X by definition.

Example 2.16 Any unbiased estimator is consistent if its variance goes to zero. This fact follows directly from Exercise 2.1 and Theorem 2.15: If the estimator δn is unbiased for the parameter θ, then by definition E δn = θ, which is stronger than the statement E δn → θ. If we also have Var δn → 0, then by Exercise 2.1 we know δn qm→ θ. This implies δn P→ θ.
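The computation underlying Example 2.16 (and Exercise 2.1) is the standard decomposition of mean squared error into variance plus squared bias; a sketch of the one-line argument in LaTeX:

    \begin{align*}
    \operatorname{E}|\delta_n - \theta|^2
      &= \operatorname{E}\bigl[(\delta_n - \operatorname{E}\delta_n) + (\operatorname{E}\delta_n - \theta)\bigr]^2 \\
      &= \operatorname{Var}\delta_n + (\operatorname{E}\delta_n - \theta)^2 \longrightarrow 0
    \end{align*}
    whenever $\operatorname{Var}\delta_n \to 0$ and $\operatorname{E}\delta_n \to \theta$
    (the cross term vanishes because $\operatorname{E}(\delta_n - \operatorname{E}\delta_n) = 0$).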

Exercises for Section 2.1

Exercise 2.1 Prove that for a constant c, Xn qm→ c if and only if E Xn → c and Var Xn → 0.

Exercise 2.2 The converse of Theorem 2.15 is not true. Construct an example in which Xn P→ 0 but E Xn = 1 for all n (by Exercise 2.1, if E Xn = 1, then Xn cannot converge in quadratic mean to 0).

Hint: The mean of a random variable may be strongly influenced by a large value that occurs with small probability (and if this probability goes to zero, then the mean can be influenced in this way without destroying convergence in probability).

Exercise 2.3 Prove or disprove this statement: If there exists M such that P(|Xn| < M) = 1 for all n, then Xn P→ c implies Xn qm→ c.


Exercise 2.4 Prove that if 0 < k < ℓ and E |X|^ℓ < ∞, then E |X|^k < 1 + E |X|^ℓ.

Hint: Use the fact that |X|^k ≤ 1 + |X|^k I{|X| > 1}.

Exercise 2.5 Prove that if α > 1, then

(E |X|)^α ≤ E |X|^α.

Hint: Use Hölder’s inequality (1.26) with p = 1/α.

Exercise 2.6 (a) Prove that if 0 < k < ℓ, then Xn ℓ→ X implies Xn k→ X.

Hint: Use Exercise 2.5.

(b) Prove by example that the conclusion of part (a) is not true in general if 0 < ℓ < k.

Exercise 2.7 This exercise deals with bounded in probability sequences.

Definition 2.17 We say that Xn is bounded in probability if Xn = OP(1), i.e., if for every ε > 0, there exist M and N such that P(|Xn| < M) > 1 − ε for n > N.

(a) Prove that if Xn d→ X for some random variable X, then Xn is bounded in probability.

Hint: You may use the fact that any interval of real numbers must contain a point of continuity of F(x). Also, recall that F(x) → 1 as x → ∞.

(b) Prove that if Xn is bounded in probability and Yn P→ 0, then XnYn P→ 0.

2.2 Consistent estimates of the mean

For a sequence of random vectors X1, X2, . . ., we denote the nth sample mean by

X̄n def= (1/n) ∑_{i=1}^n Xi.

We begin with a formal statement of the weak law of large numbers for an independent and identically distributed sequence. Later in this section, we discuss some cases in which the sequence of random vectors is not independent and identically distributed.


2.2.1 The weak law of large numbers

Theorem 2.18 Weak Law of Large Numbers (univariate version): Suppose that X1, X2, . . . are independent and identically distributed and have finite mean µ. Then X̄n P→ µ.

The proof of Theorem 2.18 in its full generality is somewhat beyond the scope of this chapter, though it may be proved using the tools in Section 4.1. However, by tightening the assumptions a bit, a proof can be made simple. For example, if the Xn are assumed to have finite variance (not a terribly restrictive assumption), the weak law may be proved in a single line: Chebyshev’s inequality (1.23) implies that

P(|X̄n − µ| ≥ ε) ≤ Var X̄n / ε² = Var X1 / (nε²) → 0,

so X̄n P→ µ follows by definition. (This is a special case of Theorem 2.15.)

Example 2.19 If X ∼ binomial(n, p), then X/n P→ p. Although we could prove this fact directly using the definition of convergence in probability, it follows immediately from the Weak Law of Large Numbers due to the fact that X may be expressed as the sum of n independent and identically distributed Bernoulli random variables, each with mean p.
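The probability in Example 2.19 can also be evaluated directly from the binomial distribution function; the following sketch assumes SciPy, with p = 0.3 and ε = 0.01 chosen arbitrarily.

    from scipy.stats import binom

    p, eps = 0.3, 0.01
    for n in [100, 1000, 10000, 100000]:
        # P(|X/n - p| >= eps), up to integer rounding at the two cutoffs
        tail = binom.cdf(n * (p - eps), n, p) + binom.sf(n * (p + eps), n, p)
        print(n, round(tail, 4))
    # The tail probability shrinks toward zero, as the weak law requires.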

2.2.2 Independent but not identically distributed variables

Let us now generalize the conditions of the previous section: Suppose that X1, X2, . . . are independent but not necessarily identically distributed, such that E Xi = µ and Var Xi = σ²_i. When is X̄n consistent for µ?

Let us explore sufficient conditions for a slightly stronger result, namely X̄n qm→ µ. Since E X̄n = µ for all n, Exercise 2.1 implies that X̄n qm→ µ if and only if Var X̄n → 0 (note, however, that Var X̄n → 0 is not necessary for consistency; see Exercise 2.8). Since

Var X̄n = (1/n²) ∑_{i=1}^n σ²_i, (2.9)

we conclude that X̄n P→ µ if ∑_{i=1}^n σ²_i = o(n²).

On the other hand, if instead of X̄n we consider the BLUE (best linear unbiased estimator)

δn = (∑_{i=1}^n Xi/σ²_i) / (∑_{j=1}^n 1/σ²_j), (2.10)

then as before E δn = µ, but

Var δn = 1 / (∑_{j=1}^n 1/σ²_j). (2.11)

Since n Var X̄n and n Var δn are, respectively, the arithmetic and harmonic means of the σ²_i, the statement that Var δn ≤ Var X̄n (with equality only if σ²_1 = σ²_2 = · · ·) is a restatement of the harmonic-arithmetic mean inequality. It is even possible that Var X̄n → ∞ while Var δn → 0: As an example, take σ²_i = i log(i) for i > 1.
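A numerical illustration of this last claim, sketched with NumPy; σ²_1 is set to 1 (an arbitrary choice, since i log i vanishes at i = 1):

    import numpy as np

    for n in [10, 1000, 100000, 1000000]:
        i = np.arange(1, n + 1, dtype=float)
        sigma2 = np.where(i > 1, i * np.log(i), 1.0)   # sigma_i^2 = i log i, with sigma_1^2 = 1
        var_xbar = sigma2.sum() / n ** 2               # equation (2.9)
        var_delta = 1.0 / (1.0 / sigma2).sum()         # equation (2.11)
        print(n, round(var_xbar, 3), round(var_delta, 3))
    # The variance of the sample mean grows without bound; the variance of the BLUE creeps toward zero.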

Example 2.20 Consider the case of simple linear regression,

Yi = β0 + β1 zi + εi,

where the zi are known covariates and the εi are independent and identically distributed with mean 0 and finite variance σ². If we define

wi = (zi − z̄) / ∑_{j=1}^n (zj − z̄)²  and  vi = 1/n − z̄ wi,

then the least squares estimators of β0 and β1 are

β̂0n = ∑_{i=1}^n vi Yi  and  β̂1n = ∑_{i=1}^n wi Yi,

respectively. Since E Yi = β0 + β1 zi, we have

E β̂0n = β0 + β1 z̄ − β0 z̄ ∑_{i=1}^n wi − β1 z̄ ∑_{i=1}^n wi zi

and

E β̂1n = β0 ∑_{i=1}^n wi + β1 ∑_{i=1}^n wi zi.

Note that ∑_{i=1}^n wi = 0 and ∑_{i=1}^n wi zi = 1. Therefore, E β̂0n = β0 and E β̂1n = β1, which is to say that β̂0n and β̂1n are unbiased. Therefore, by Exercise 2.1 and Theorem 2.15, a sufficient condition for the consistency of β̂0n and β̂1n is that their variances tend to zero as n → ∞. Since Var Yi = σ², we obtain

Var β̂0n = σ² ∑_{i=1}^n v²_i  and  Var β̂1n = σ² ∑_{i=1}^n w²_i.

With ∑_{i=1}^n w²_i = {∑_{j=1}^n (zj − z̄)²}^{−1}, these expressions simplify to

Var β̂0n = σ²/n + σ² z̄² / ∑_{j=1}^n (zj − z̄)²  and  Var β̂1n = σ² / ∑_{j=1}^n (zj − z̄)².

Therefore, β̂0n and β̂1n are consistent if

z̄² / ∑_{j=1}^n (zj − z̄)²  and  1 / ∑_{j=1}^n (zj − z̄)²,

respectively, tend to zero.
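As a concrete check, one can evaluate the two variance expressions for a specific covariate sequence; a sketch assuming NumPy, with the arbitrary choices zi = i and σ² = 1:

    import numpy as np

    sigma2 = 1.0
    for n in [10, 100, 1000, 10000]:
        z = np.arange(1, n + 1, dtype=float)        # covariates z_i = i
        ssz = np.sum((z - z.mean()) ** 2)           # sum_j (z_j - zbar)^2
        var_b0 = sigma2 / n + sigma2 * z.mean() ** 2 / ssz
        var_b1 = sigma2 / ssz
        print(n, round(var_b0, 5), round(var_b1, 10))
    # Both variances tend to zero for z_i = i, so both estimators are consistent in this case.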

2.2.3 Identically distributed but not independent variables

Suppose that X1, X2, . . . satisfy E Xi = µ but they are not independent. Then E X̄n = µ and so X̄n is consistent for µ if Var X̄n → 0. In this case,

Var X̄n = (1/n²) ∑_{i=1}^n ∑_{j=1}^n Cov(Xi, Xj). (2.12)

For dependent sequences, it is common to generalize “identically distributed” to the stronger concept of stationarity:

Definition 2.21 The sequence X1, X2, . . . is said to be stationary if the joint distribution of (Xi, . . . , Xi+k) does not depend on i for any i > 0 and any k ≥ 0.

Note that a stationary sequence is identically distributed, and for independent sequences these two concepts are the same.

To simplify Equation (2.12), note that for a stationary sequence, Cov(Xi, Xj) depends only on the “gap” j − i. For example, stationarity implies Cov(X1, X4) = Cov(X2, X5) = Cov(X5, X8) = · · ·. Therefore, Equation (2.12) becomes

Var X̄n = σ²/n + (2/n²) ∑_{k=1}^{n−1} (n − k) Cov(X1, X1+k) (2.13)

for a stationary sequence.

Lemma 2.22 Expression (2.13) tends to 0 as n → ∞ if σ² < ∞ and Cov(X1, X1+k) → 0 as k → ∞.


Proof: It is clear that σ²/n → 0 if σ² < ∞. Assuming that Cov(X1, X1+k) → 0, select ε > 0 and note that if N is chosen so that |Cov(X1, X1+k)| < ε/4 for all k > N, we have

|(2/n²) ∑_{k=1}^{n−1} (n − k) Cov(X1, X1+k)| ≤ (2/n) ∑_{k=1}^{N} |Cov(X1, X1+k)| + (2/n) ∑_{k=N+1}^{n−1} |Cov(X1, X1+k)|.

Note that the second term on the right is strictly less than ε/2, and the first term is a constant divided by n, which may be made smaller than ε/2 by choosing n large enough.

Definition 2.23 For a fixed nonnegative integer m, the sequence X1, X2, . . . is called m-dependent if the random vectors (X1, . . . , Xi) and (Xj, Xj+1, . . .) are independent whenever j − i > m.

Any independent and identically distributed sequence is 0-dependent and stationary. Also, any stationary m-dependent sequence trivially satisfies Cov(X1, X1+k) → 0 as k → ∞, so by Lemma 2.22, X̄n is consistent for any stationary m-dependent sequence with finite variance.
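To see Lemma 2.22 and this remark in action, expression (2.13) can be compared with a simulation for a simple stationary 1-dependent sequence; a sketch assuming NumPy, with the arbitrary construction Xi = (Yi + Yi+1)/2 for i.i.d. standard normal Yi (so σ² = 1/2 and Cov(X1, X2) = 1/4):

    import numpy as np

    rng = np.random.default_rng(2)
    for n in [10, 100, 1000]:
        # Expression (2.13) with sigma^2 = 1/2, Cov(X_1, X_2) = 1/4, and zero covariance beyond lag 1
        theory = 0.5 / n + (2 / n ** 2) * (n - 1) * 0.25
        # Empirical variance of the sample mean over many replications of the sequence
        y = rng.standard_normal(size=(5000, n + 1))
        x = 0.5 * (y[:, :-1] + y[:, 1:])            # X_i = (Y_i + Y_{i+1}) / 2
        print(n, round(theory, 5), round(x.mean(axis=1).var(), 5))
    # Both columns shrink toward zero, so the sample mean is consistent for E X_i = 0.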

Exercises for Section 2.2

Exercise 2.8 Give an example of an independent sequence X1, X2, . . . with E Xi = µ such that X̄n P→ µ but Var X̄n does not converge to 0.

Exercise 2.9 Suppose X1, X2, . . . are independent and identically distributed with mean µ and finite variance σ². Let Yi = X̄i = (∑_{j=1}^i Xj)/i.

(a) Prove that Ȳn = (∑_{i=1}^n Yi)/n is a consistent estimator of µ.

(b) Compute the relative efficiency e_{Ȳn, X̄n} of Ȳn to X̄n, defined as Var(X̄n)/Var(Ȳn), for n ∈ {5, 10, 20, 50, 100, ∞} and report the results in a table. Note that n = ∞ in the table is shorthand for the limit (of the efficiency) as n → ∞.

Exercise 2.10 Let Y1, Y2, . . . be independent and identically distributed with mean µ and variance σ² < ∞. Let

X1 = Y1, X2 = (Y2 + Y3)/2, X3 = (Y4 + Y5 + Y6)/3, etc.

Define δn as in Equation (2.10).

(a) Show that δn and X̄n are both consistent estimators of µ.

(b) Calculate the relative efficiency e_{X̄n, δn} of X̄n to δn, defined as Var(δn)/Var(X̄n), for n = 5, 10, 20, 50, 100, and ∞ and report the results in a table.


(c) Using Example 1.22, give a simple expression asymptotically equivalent to e_{X̄n, δn}. Report its values in your table for comparison. How good is the approximation for small n?

2.3 Convergence of Transformed Sequences

Many statistical estimators of interest may be written as functions of simpler statistics whose convergence properties are known. Therefore, results that describe the behavior of the transformed sequences have central importance for the study of statistical large-sample theory. Here, we survey a few of these results. Some are unsurprising to anyone familiar with sequences of real numbers, like the fact that continuous functions preserve various modes of convergence. Others, such as Slutsky’s theorem or the Cramer-Wold theorem, have no analogues for non-random sequences.

We begin with some univariate results, but to appreciate the richness of the theory it is necessary to consider the more general multivariate case. For this reason, part of this section is spent on extending the earlier univariate definitions of the chapter to the multivariate case.

2.3.1 Continuous transformations: The univariate case

Just as they do for sequences of real numbers, continuous functions preserve convergence of sequences of random variables. We state this result formally for both convergence in probability and convergence in distribution.

Theorem 2.24 Suppose that f(x) is a continuous function.

(a) If Xn P→ X, then f(Xn) P→ f(X).

(b) If Xn d→ X, then f(Xn) d→ f(X).

Proving Theorem 2.24 is quite difficult; each of the statements of the theorem will require an additional result for its proof. For convergence in probability, this additional result must wait until the next chapter (Theorem 3.9), when we show convergence in probability to be equivalent to another condition involving almost sure convergence. To prove Theorem 2.24(b), we introduce a condition equivalent to convergence in distribution:

Theorem 2.25 Xn d→ X if and only if E g(Xn) → E g(X) for all bounded and continuous real-valued functions g(x).


Theorem 2.25 is proved in Exercises 2.11 and 2.12. Once this theorem is established, Theorem 2.24(b) follows easily:

Proof of Theorem 2.24(b) Let g(x) be any bounded and continuous function. By Theorem 2.25, it suffices to show that E [g ◦ f(Xn)] → E [g ◦ f(X)]. Since f(x) is continuous, g ◦ f(x) is bounded and continuous. Therefore, another use of Theorem 2.25 proves that E [g ◦ f(Xn)] → E [g ◦ f(X)] as desired.

2.3.2 Multivariate Extensions

We now extend our notions of convergence to the multivariate case. Several earlier results from this chapter, such as the weak law of large numbers and the results on continuous functions, are generalized to this case.

The multivariate definitions of convergence in probability and convergence in kth mean are straightforward and require no additional development:

Definition 2.26 Xn converges in probability to X (written Xn P→ X) if for any ε > 0,

P(‖Xn − X‖ < ε) → 1 as n → ∞.

Definition 2.27 Xn converges in kth mean to X (written Xn k→ X) if

E ‖Xn − X‖^k → 0 as n → ∞.

As a special case, we have convergence in quadratic mean, written Xn qm→ X, when

E [(Xn − X)^T (Xn − X)] → 0 as n → ∞.

Example 2.28 The weak law of large numbers: As in the univariate case, we define the nth sample mean of a random vector sequence X1, X2, . . . to be

X̄n def= (1/n) ∑_{i=1}^n Xi.

Suppose that X1, X2, . . . are independent and identically distributed and have finite mean µ. Then X̄n P→ µ. Thus, the multivariate weak law is identical to the univariate weak law (and its proof using characteristic functions is also identical to the univariate case; see Section 4.1).

It is also straightforward to extend convergence in distribution to random vectors, though in order to do this we need the multivariate analogue of Equation (2.4), the distribution function. To this end, let

F(x) def= P(X ≤ x),

where X is a random vector in R^k and X ≤ x means that Xi ≤ xi for all 1 ≤ i ≤ k.

Definition 2.29 Xn converges in distribution to X (written Xn d→ X) if for any point c at which F(x) is continuous,

Fn(c) → F(c) as n → ∞.

There is one rather subtle way in which the multivariate situation is not quite the same as the univariate situation. In the univariate case, it is very easy to characterize the points of continuity of F(x): The distribution function of the random variable X is continuous at x if and only if P(X = x) = 0. However, this simple characterization no longer holds true for random vectors; a point x may be a point of discontinuity yet still satisfy P(X = x) = 0. The task in Exercise 2.14 is to produce an example of this phenomenon.

As a final generalization, we extend Theorem 2.24 to the multivariate case.

Theorem 2.30 Suppose that f : S → R^ℓ is a continuous function defined on some subset S ⊂ R^k, Xn is a k-component random vector, and P(X ∈ S) = 1.

(a) If Xn P→ X, then f(Xn) P→ f(X).

(b) If Xn d→ X, then f(Xn) d→ f(X).

The proof of Theorem 2.30 is basically the same as in the univariate case, although there are some subtleties related to using multivariate distribution functions in part (b). For proving part (b), one can use Theorem 2.31, the multivariate version of Theorem 2.25:

Theorem 2.31 Xn d→ X if and only if E g(Xn) → E g(X) for all bounded and continuous real-valued functions g : R^k → R.

Proving Theorem 2.31 involves a few subtleties related to the use of multivariate distribution functions, but the essential idea is exactly the same as in the univariate case (see Theorem 2.25 and Exercises 2.11 and 2.12).

2.3.3 The Cramer-Wold Theorem

Suppose that X1, X2, . . . is a sequence of random k-vectors. By Theorem 2.30, we see immediately that

Xn d→ X implies a^T Xn d→ a^T X for any a ∈ R^k. (2.14)


It is not clear, however, whether the converse of statement (2.14) is true. Such a converse would be useful because it would give a means for proving multivariate convergence in distribution using only univariate methods. We will see in the next subsection that multivariate convergence in distribution does not follow from the mere fact that each of the components converges in distribution (if you cannot stand the suspense, see Example 2.33). Yet the converse of statement (2.14) is much stronger than the statement that each component converges in distribution. Could it be true that requiring all linear combinations to converge in distribution is strong enough to guarantee multivariate convergence? The answer is yes:

Theorem 2.32 Cramer-Wold Theorem: Xn d→ X if and only if a^T Xn d→ a^T X for all a ∈ R^k.

Using the machinery of characteristic functions, to be presented in Section 4.1, the proof of the Cramer-Wold Theorem is immediate; see Exercise 4.3.

2.3.4 Slutsky’s Theorem

Related to Theorem 2.24 is the question of when we may “stack” random variables to make random vectors while preserving convergence. It is here that we encounter perhaps the biggest surprise of this section: Convergence in distribution is not preserved by “stacking”.

To understand what we mean by “stacking,” consider that by the definition of convergence in probability,

Xn P→ X and Yn P→ Y implies that (Xn, Yn) P→ (X, Y). (2.15)

Thus, two convergent-in-probability sequences Xn and Yn may be “stacked” to make a vector, and this vector must still converge in probability to the vector of stacked limits. Note that the converse of (2.15) is true by Theorem 2.24 because the function f(x, y) = x is a continuous function from R² to R. By induction, we can therefore stack or unstack arbitrarily many random variables or vectors without disturbing convergence in probability. Combining this fact with Theorem 2.24 yields a useful result; see Exercise 2.18.

However, statement (2.15) does not remain true in general if P→ is replaced by d→ throughout the statement. Consider the following example.

Example 2.33 Take Xn and Yn to be independent standard normal random variables for all n. These distributions do not depend on n at all, and we may write Xn d→ Z and Yn d→ Z, where Z ∼ N(0, 1). But it is certainly not true that

(Xn, Yn) d→ (Z, Z), (2.16)


since the distribution on the left is bivariate normal with correlation 0, while the distribution on the right is bivariate normal with correlation 1.

Expression (2.16) is untrue precisely because the marginal distributions do not uniquely determine the joint distribution. However, there are certain special cases in which the marginals do determine the joint distribution. For instance, if random variables are independent, then their marginal distributions uniquely determine their joint distribution. Indeed, we can say that if Xn → X and Yn → Y, where Xn is independent of Yn and X is independent of Y, then statement (2.15) remains true when P→ is replaced by d→ (see Exercise 2.19). As a special case, the constant c, when viewed as a random variable, is automatically independent of any other random variable. Since Yn d→ c is equivalent to Yn P→ c by Theorem 2.12, it must be true that

Xn d→ X and Yn P→ c implies that (Xn, Yn) d→ (X, c) (2.17)

if Xn is independent of Yn for every n. The content of a powerful theorem called Slutsky’s Theorem is that statement (2.17) remains true even if the Xn and Yn are not independent. Although the preceding discussion involves stacking only random (univariate) variables, we present Slutsky’s theorem in a more general version involving stacking random vectors.

Theorem 2.34 Slutsky’s Theorem: For any random vectors Xn and Yn and any constant c, if Xn d→ X and Yn P→ c, then (Xn, Yn) d→ (X, c).

A proof of Theorem 2.34 is outlined in Exercise 2.20.

Putting several of the preceding results together yields the following corollary.

Corollary 2.35 If X is a k-vector such that Xn d→ X, and Ynj P→ cj for 1 ≤ j ≤ m, then

f(Xn, Yn) d→ f(X, c)

for any continuous function f : S ⊂ R^{k+m} → R^ℓ.

It is very common practice in statistics to use Corollary 2.35 to obtain a result, then state that the result follows “by Slutsky’s Theorem”. In fact, there is not a unanimously held idea in the literature about what precisely “Slutsky’s Theorem” refers to; some consider the Corollary itself, or particular cases of the Corollary, to be Slutsky’s Theorem. These minor differences are unimportant; the common thread in any application of “Slutsky’s Theorem,” however it is defined, is some combination of one sequence that converges in distribution with one or more sequences that converge in probability to constants.

Example 2.36 Normality of the t-statistic: Let X1, . . . , Xn be independent and identically distributed with mean µ and finite positive variance σ². By Example 2.10, √n(X̄n − µ) d→ N(0, σ²). Suppose that σ̂²n is any consistent estimator of σ²; that is, σ̂²n P→ σ². (For instance, we might take σ̂²n to be the usual unbiased sample variance estimator, whose asymptotic properties will be studied later.) If Z denotes a standard normal random variable, Theorem 2.34 implies

(√n(X̄n − µ), σ̂²n) d→ (σZ, σ²). (2.18)

Therefore, since f(a, b) = a/√b is a continuous function for b > 0,

√n(X̄n − µ) / √σ̂²n d→ Z. (2.19)

It is common practice to skip step (2.18), attributing equation (2.19) directly to “Slutsky’s Theorem”.
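A brief simulation of this example, sketched with NumPy; exponential(1) data and the listed sample sizes are arbitrary choices:

    import numpy as np

    rng = np.random.default_rng(3)
    mu = 1.0                                        # mean of the exponential(1) distribution
    for n in [20, 200, 2000]:
        x = rng.exponential(1.0, size=(10000, n))
        t = np.sqrt(n) * (x.mean(axis=1) - mu) / x.std(axis=1, ddof=1)
        # If (2.19) holds, roughly 95% of the t statistics should land in (-1.96, 1.96)
        print(n, round(np.mean(np.abs(t) < 1.96), 3))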

Exercises for Section 2.3

Exercise 2.11 Here we prove part of Theorem 2.25: If Xn d→ X and g(x) is a bounded, continuous function on R, then E g(Xn) → E g(X). (This half of Theorem 2.25 is sometimes called the univariate Helly-Bray Theorem.)

Let Fn(x) and F(x) denote the distribution functions of Xn and X, as usual. For ε > 0, take b < c to be constant real numbers such that F(b) < ε and F(c) > 1 − ε. First, we note that since g(x) is continuous, it must be uniformly continuous on [b, c]: That is, for any ε > 0 there exists δ > 0 such that |g(x) − g(y)| < ε whenever |x − y| < δ. This fact, along with the boundedness of g(x), ensures that there exists a finite set of real numbers b = a0 < a1 < · · · < am = c such that:

• Each ai is a continuity point of F(x).

• F(a0) < ε and F(am) > 1 − ε.

• For 1 ≤ i ≤ m, |g(x) − g(ai)| < ε for all x ∈ [ai−1, ai].

(a) Define

h(x) = g(ai) if ai−1 < x ≤ ai for some 1 ≤ i ≤ m, and h(x) = 0 otherwise.


Prove that there exists N such that |E h(Xn)− E h(X)| < ε for all n > N .

Hint: Let M be such that |g(x)| < M for all x ∈ R. Use the fact that for any random variable Y,

E h(Y) = ∑_{i=1}^m g(ai) P(ai−1 < Y ≤ ai).

Also, please note that we may not write E h(Xn) − E h(X) as E [h(Xn) − h(X)] because it is not necessarily the case that Xn and X are defined on the same sample space.

(b) Prove that E g(Xn) → E g(X).

Hint: Use the fact that

|E g(Xn) − E g(X)| ≤ |E g(Xn) − E h(Xn)| + |E h(Xn) − E h(X)| + |E h(X) − E g(X)|.

Exercise 2.12 Prove the other half of Theorem 2.25:

(a) Let a be any continuity point of F(x). Let ε > 0 be arbitrary. Show that there exists δ > 0 such that F(a − δ) > F(a) − ε and F(a + δ) < F(a) + ε.

(b) Show how to define continuous functions g1 : R → [0, 1] and g2 : R → [0, 1] such that for all x ≤ a, g1(x) = g2(x − δ) = 0 and for all x > a, g1(x + δ) = g2(x) = 1. Use these functions to bound the difference between Fn(a) and F(a) in such a way that this difference must tend to 0.

Exercise 2.13 To illustrate a situation that can arise in the multivariate setting that cannot arise in the univariate setting, construct an example of a sequence (Xn, Yn), a joint distribution (X, Y), and a connected subset S ⊂ R² such that

(i) (Xn, Yn) d→ (X, Y);

(ii) every point of R² is a continuity point of the distribution function of (X, Y);

(iii) P[(Xn, Yn) ∈ S] does not converge to P[(X, Y) ∈ S].

Hint: Condition (ii) may be satisfied even if the distribution of (X, Y) is concentrated on a line.

Exercise 2.14 If X is a univariate random variable with distribution function F(x), then F(x) is continuous at c if and only if P(X = c) = 0. Prove by counterexample that this is not true if variables X and c are replaced by vectors X and c.


Exercise 2.15 Suppose that X and Y are standard normal random variables with Corr(X, Y) = ρ. Construct a computer program that simulates the distribution function Fρ(x, y) of the joint distribution of X and Y. For a given (x, y), the program should generate at least 50,000 random realizations from the distribution of (X, Y), then report the proportion for which (X, Y) ≤ (x, y). (If you wish, you can also report a confidence interval for the true value.) Use your function to approximate F0.5(1, 1), F0.25(−1, −1), and F0.75(0, 0). As a check of your program, you can try it on F0(x, y), whose true values are not hard to calculate directly assuming your software has the ability to evaluate the standard normal distribution function.

Hint: To generate a bivariate normal random vector (X, Y) with covariance matrix

[1 ρ; ρ 1],

start with independent standard normal U and V, then take X = U and Y = ρU + √(1 − ρ²) V.
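A minimal sketch of the program the exercise asks for, assuming NumPy and following the hint above; the function name and default settings are arbitrary:

    import numpy as np

    def F_rho(x, y, rho, reps=50000, seed=0):
        """Monte Carlo estimate of P(X <= x, Y <= y) for standard normals with Corr(X, Y) = rho."""
        rng = np.random.default_rng(seed)
        u = rng.standard_normal(reps)
        v = rng.standard_normal(reps)
        X, Y = u, rho * u + np.sqrt(1 - rho ** 2) * v   # construction from the hint
        return np.mean((X <= x) & (Y <= y))

    print(F_rho(1, 1, 0.5), F_rho(-1, -1, 0.25), F_rho(0, 0, 0.75))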

Exercise 2.16 Adapt the method of proof in Exercise 2.11 to the multivariate case, proving half of Theorem 2.31: If Xn d→ X, then E g(Xn) → E g(X) for any bounded, continuous g : R^k → R.

Hint: Instead of intervals (ai−1, ai] as in Exercise 2.11, use small regions {x : a_{i,j−1} < xi ≤ a_{i,j} for all i} of R^k. Make sure these regions are chosen so that their boundaries contain only continuity points of F(x).

Exercise 2.17 Construct a counterexample to show that Theorem 2.34 may not be strengthened by changing Yn P→ c to Yn P→ Y.

Exercise 2.18 (a) Prove that if f : R^k → R^ℓ is continuous and Xnj P→ Xj for all 1 ≤ j ≤ k, then f(Xn) P→ f(X).

(b) Taking f(a, b) = a + b for simplicity, construct an example demonstrating that part (a) is not true if P→ is replaced by d→.

Exercise 2.19 Prove that if Xn is independent of Yn for each n and X is independent of Y, then

Xn d→ X and Yn d→ Y implies that (Xn, Yn) d→ (X, Y).

Hint: Be careful to deal with points of discontinuity: If Xn and Yn are independent, what characterizes a point of discontinuity of the joint distribution?


Exercise 2.20 Prove Slutsky’s Theorem, Theorem 2.34, using the following approach:

(a) Prove the following lemma:

Lemma 2.37 Let Vn and Wn be k-dimensional random vectors on the same sample space. If Vn d→ V and Wn P→ 0, then Vn + Wn d→ V.

Hint: For arbitrary ε > 0, let ε denote the k-vector all of whose entries are ε. Show that for any a ∈ R^k,

P(Vn ≤ a − ε) + P(‖Wn‖ < ‖ε‖) − 1 ≤ P(Vn + Wn ≤ a) ≤ P(Vn ≤ a + ε) + P(‖Wn‖ ≥ ‖ε‖).

If N is such that P(‖Wn‖ ≥ ‖ε‖) < ε for all n > N, rewrite the above inequalities for n > N. If a is a continuity point of F(v), what happens as ε → 0?

(b) Show how to prove Theorem 2.34 using the lemma in part (a).
