Class Notes of Stat 8112
1 Bayes estimators
Here are three methods of estimating parameters:
(1) MLE; (2) Moment Method; (3) Bayes Method.
An example of Bayes argument: LetX ∼ F (x|θ), θ ∈ H©.We want to estimate g(θ) ∈ R1.
Suppose t(X) is an estimator and look at
MSEθ(t) = Eθ(t(X) − g(θ))2.
The problem is MSEθ(t) depends on θ. So minimizing one point may costs at other points.
Bayes idea is to average MSEθ(t) over θ and then minimize over t’s. Thus we pretend to
have a distribution for θ, say, π, and look at
H(t) = E(t(X) − g(θ))2
where E now refers to the joint distribution of X and θ, that is
E(t(X) − g(θ))2 =
∫(t(x) − g(θ))2F (dx|θ)π(dθ). (1.1)
Next pick t(X) to minimize H(t). The minimizer is called the Bayes estimator.
LEMMA 1.1 Suppose Z and W are real random variables defined on the same probability
space and H be the set of functions from R1 to R
1. Then
minh∈H
E(Z − h(W ))2 = E(Z − E(Z|W ))2.
That is, the minimizer above is E(Z|W ).
Proof. Note that
E(Z − h(W ))2 = E(Z − E(Z|W ) + E(Z|W ) − h(W ))2
= E(Z − E(Z|W ))2 + E(E(Z|W ) − h(W ))2
+2E(Z − E(Z|W ))(E(Z|W ) − h(W )).
Conditioning on W, we have that the cross term is zero. Thus
E(Z − h(W ))2 = E(Z − E(Z|W ))2 + E(E(Z|W ) − h(W ))2.
1
Now E(Z − E(Z|W ))2 is fixed. The second term E(E(Z|W ) − h(W ))2 depends on h. So
choose h(W ) = E(Z|W ) to minimize E(Z − h(W ))2.
Now choose W = X, t = h and Z = g(θ). Then
COROLLARY 1.1 The Bayes estimator in (1.1) is g(θ) = E(g(θ)|X), which is the poste-
rior mean.
At a later occasion, we call E(g(θ)|X) the posterior mean of g(θ). Similarly, we have the
following multivariate analogue of the above corollary. The proof is in the same spirit of
Lemma 1.1. We omit it.
THEOREM 1 Let X ∈ Rm and X ∼ F (·|θ), θ ∈ H©. Let g(θ) = (g1(θ), · · · , gk(θ)) ∈ R
k.
The MSE between an estimator t(X) and g(θ) is
E‖t(X) − g(θ)‖2 =
∫‖t(x) − g(θ)‖2F (dx|θ)π(dθ).
Then Bayes estimator g(θ) = E(g(θ)|X) := (E(g1(θ)|X), · · · , E(gk(θ)|X)).
Definition. The distribution of θ (we thought there is such) , π(θ), is called a prior
distribution of θ. The conditional distribution of θ given X is called the posterior distribu-
tion.
Example. Let X1, · · · ,Xn be i.i.d. Ber(p) and p ∼ Beta(α, β). We know that the
density function of p is
f(p) =Γ(α+ β)
Γ(α)Γ(β)pα−1(1 − p)β−1
and E(p) = α/(α + β). The joint distribution of X1, · · · ,Xn and p is
f(x1, · · · , xn, p) = f(x1, · · · , xn|p) · π(p) = px(1 − p)n−x · C(α, β)pα−1(1 − p)β−1
= C(α, β)px+α−1(1 − p)n+β−x−1
where x =∑
i xi. The marginal density of (X1, · · · ,Xn) is
f(x1, · · · , xn) = C(α, β)
∫ 1
0px+α−1(1 − p)n+β−x−1 dp = C(α, β) ·B(x+ α, n + β − x).
Therefore,
f(p|x1, · · · , xn) =f(x1, · · · , xn, p)
f(x1, · · · , xn)= D(x, α, β) px+α−1(1 − p)n+β−x−1.
2
This says that p|x ∼ Beta(x+ α, n+ β − x). Thus the Bayes estimator is
p = E(p|X) =X + α
n+ α+ β,
where X =∑n
i=1Xi. One can easily check that
E(p) =np+ α
n+ α+ β.
So p is biased unless α = β = 0, which is not allowed in the prior distribution.
The next lemma is useful
LEMMA 1.2 The posterior distribution depends only on sufficient statistics. Let t(X) be
a sufficient statistic. Then E(g(θ)|X) = E(g(θ)|t(X)).
Proof. The joint density function of X and θ is p(x|θ)π(θ) where π(θ) is the prior distri-
bution of θ. Let t(X) be a sufficient statistic. The by the factorization theorem
p(x|θ) = q(t(x)|θ)h(x).
So the joint distribution function is q(t(x)|θ)h(x)π(θ). It follows that the conditional distri-
bution of θ given X is
p(θ|x) =q(t(x)|θ)h(x)π(θ)∫q(t(x)|ξ)h(x)π(ξ) dξ
=q(t(x)|θ)π(θ)∫q(t(x)|ξ)π(ξ) dξ
which is a function of t(x) and θ.
By the above conclusion, E(g(θ)|X) = l(t(X)) for some function l(·). Take conditional
expectation for both sides with respect to the algebra σ(t(X)). Note that σ(t(X)) ⊂ σ(X).
By the Tower theorem, l(t(X)) = E(g(θ)|t(X)). This proves the second conclusion.
Example. Let X1, · · · ,Xn be iid from N(µ, 1) and µ ∼ π(µ) = N(µ0, τ0). We know
that X(∼ N(µ, 1/n)) is a sufficient statistic. The Bayes estimator is µ = E(µ|X). We need
to calculate the joint distribution of (µ, X)T first.
It is not difficult to see that (µ, X)T is bivariate normal. We know that Eµ = µ0, V ar(µ) =
τ0, E(X) = E(E(X |µ)) = E(µ) = µ0. Now
V ar(X) = E(X − µ+ µ− µ0)2 = E((1/n) + (µ− µ0)
2) =1
n+ τ0;
Cov(X, µ) = E(E((X − µ0)(µ− µ0)|µ)) = E(µ− µ0)2 = τ0.
Thus,(X
µ
)∼ N
((µ0
µ0
),
((1/n) + τ0 τ0
τ0 τ0
)).
3
Recall the following fact: if
(Y1
Y2
)∼ N
((µ1
µ2
),
(Σ11 Σ12
Σ21 Σ22
)),
then
Y1|Y2 ∼ N(µ1 + Σ12Σ
−122 (Y2 − µ2), Σ11·2
)
where Σ11·2 = Σ11 − Σ12Σ−122 Σ21. It follows that
µ|X ∼ N
(µ0 +
τ0(1/n) + τ0
(X − µ0), τ0 −τ20
(1/n) + τ0
).
Thus the Bayes estimator is
µ = µ0 +τ0
(1/n) + τ0(X − µ0) =
τ0(1/n) + τ0
X +µ0
1 + nτ0.
One can see that µ→ µ0 as τ0 → 0 and µ→ X as τ0 → ∞.
Example. Let X ∼ Np(θ, Ip) and π(θ) ∼ Np(0, τIp), τ > 0. The Bayes estimator is the
posterior mean E(θ|X). We claim that
(X
θ
)∼ N
((0
0
),
((1 + τ)Ip τIp
τIp τIp
)).
Indeed, by the prior we know that Eθ = 0 and Cov(θ)=τIp. By conditioning argument one
can verify that E(Xθ′) = τIp and E(XX ′) = (1 + τ)Ip. So by conditional distribution of
normal random variables,
θ|X ∼ N
(τ
1 + τX,
τ
1 + τIp
).
So the Bayes estimator is t0(X) = τX/(1 + τ).
The usual MVUE is t1(X) = X. It follows that
MSEt1(θ) = E‖X − θ‖2 = p
which doesn’t depend on θ. Now
MSEt0(θ) = Eθ‖τ
1 + τX − θ‖2 = Eθ‖
τ
1 + τ(X − θ) +
−1
1 + τθ‖2
=
(τ
1 + τ
)2
p+1
(1 + τ)2‖θ‖2 (1.2)
4
which goes to p as τ → ∞. What happens to the prior when τ → ∞? It “converges
to” the Lebesgue measure, or “uniform distribution over the real line” which generates
no information (recall the information of X ∼ Np(0, (1 + τ)Ip) is p/(1 + τ) and θi =
uniform over [−τ, τ ] with θi, 1 ≤ i ≤ p i.i.d. is p/(2τ2). Both go to zero and the later is
faster than the former). This explains why the MSE of the former and the limiting MSE
are identical.
Now let
M = inft
supθEθ‖t(X) − θ‖2
called the Minimax risk and any estimator t∗ with
M = supθEθ‖t∗(X) − θ‖2
is called a Minimax estimator.
THEOREM 2 For any p ≥ 1, t0(X) = X is a minimax.
Proof. First,
M = inft
supθEθ‖t(X) − θ‖2 ≤ sup
θEθ‖X − θ‖2 = p.
Second
M = inft
supθEθ‖t(X) − θ‖2 ≥ inf
t
∫Eθ‖t(X) − θ‖2 πτ (dθ) ≥
(τ
1 + τ
)2
p
for any τ > 0 where πτ (θ) = N(0, τIp), where the last step is from (1.2). Then M ≥ p by
letting p→ ∞. The above says that M = supθ Eθ‖X − θ‖2 = p.
Question. Does there exist another estimator t1(X) such that
Eθ‖t1(X) − θ‖2 ≤ Eθ‖X − θ‖2
for all θ, and the strict inequality holds for some θ? If so, we say t0(X) = X is inadmissible
(because it can be beaten for any θ by some estimator t, and strictly beaten at some θ).
Here is the answer:
(i) For p = 1, there is not. This is proved by Blyth in 1951.
(ii) For p = 2, there is not. This result was shown by Stein in 1961.
(iii) When p ≥ 3, Stein (1956) shows there is such estimator, which is called James-Stein
estimator.
Recall the density function of N(µ, σ2) is φ(x) = (√
2πσ)−1 exp(−(x− µ)2/(2σ2)).
5
LEMMA 1.3 (Stein’s lemma). Let Y ∼ N(µ, σ2) and g(y) be a function such that g(b) −g(a) =
∫ ba g
′(y) dy for any a and b and some function g′(y). If E|g′(Y )| <∞. Then
Eg(Y )(Y − µ) = σ2Eg′(Y ).
Proof. Let φ(y) = (1/√
2πσ) exp(−(y − µ)2/(2σ2)), the density of N(µ, σ2). Then φ′(y) =
−φ(y)(y − µ)/σ2. Thus, φ(y) =∫∞y
z−µσ2 φ(z) dz = −
∫ y−∞
z−µσ2 φ(z) dz. We then have that
Eg′(Y ) =
∫ ∞
−∞g′(y)φ(y) dy
=
∫ ∞
0g′(y)
∫ ∞
y
z − µ
σ2φ(z) dz
dy −
∫ 0
−∞g′(y)
∫ y
−∞
z − µ
σ2φ(z) dz
dy
=
∫ ∞
0
z − µ
σ2φ(z)
∫ z
0g′(y) dy
dz −
∫ 0
−∞
z − µ
σ2φ(z)
∫ 0
zg′(y) dy
dz
=
(∫ ∞
0+
∫ 0
−∞
)z − µ
σ2φ(z)(g(z) − g(0)) dz
=1
σ2
∫ ∞
−∞(z − µ)g(z)φ(z) dz =
1
σ2E(Y − µ)g(Y ).
The Fubini’s theorem is used in the third step; the mean of a centered normal random
variable is zero is used in the fifth step.
Remark 1. Suppose two functions g(x) and h(x) are given such that g(b) − g(a) =∫ ba h(x) dx for any a < b. This does not mean that g′(x) = h(x) for all x. Actually, in this
case, g(x) is differentiable almost everywhere and g′(x) = h(x) a.s. under Lebesgue measure.
For example, let h(x) = 1 if x is an irrational number, and h(x) = 0 if x is rational. Then
g(x) = x =∫ x0 h(t) dt for any x ∈ R. We know in this case that g′(x) = h(x) a.s. The
following are true from real analysis:
Fundamental Theorem of Calculus. If f(x) is absolutely continuous on [a, b], then f(x)
is differentiable almost everywhere and
f(x) − f(a) =
∫ x
af ′(t) dt, x ∈ [a, b].
Another fact. Let f(x) be differentiable everywhere on [a, b] and f ′(x) is integrable over
[a, b]. Then
f(x) − f(a) =
∫ x
af ′(t) dt x ∈ [a, b].
Remark 2. What drives the Stein’s lemma is the formula of integration by parts:
6
suppose g(x) is differentiable then
Eg(Y )(Y − µ) =
∫
R
g(y)(y − µ)φ(y) dy = −σ2
∫ ∞
−∞g(y)φ′(y) dy
= σ2 limy→−∞
g(y)φ(y) − σ2 limy→+∞
g(y)φ(y) + σ2
∫ ∞
−∞g′(y)φ(y) dy.
It is reasonable to assume the two limits are zero. The last term is exactly σ2Eg′(Y ).
THEOREM 3 Let X ∼ Np(θ, Ip) for some p ≥ 3. Define
δc(X) =
(1 − c
p− 2
‖X‖2
)X.
Then
E‖δc(X) − θ‖2 = p− (p− 2)2E
[c(2 − c)
‖X‖2
].
Proof. Let gi(x) = c(p − 2)xi/‖x‖2 and g(x) = (g1(x), · · · , gp(x)) for x = (x1, x2, · · · , xp).
Then g(x) = c(p − 2)x/‖x‖2. It follows that
E‖δc(X) − θ‖2 = E‖(X − θ) − g(X)‖2 = E‖X − θ‖2 + E‖g(X)‖2 − 2E〈X − θ, g(X)〉
= p+ E‖g(X)‖2 − 2
p∑
i=1
E(Xi − θi)gi(X).
Since X1, · · · ,Xp are independent, by conditioning and Stein’s lemma,
E(Xi − θi)gi(X) = E [E(Xi − θi)gi(X)|(Xj , j 6= i) ] = E
[∂gi
∂xi(X)
].
It is easy to calculate that
∂gi
∂xi(x) = c(p− 2)
∑pk=1 x
2k − 2x2
i
(∑p
k=1 x2k)
2= c(p− 2)
‖x‖2 − 2x2i
‖x‖4.
Thus,
p∑
i=1
E(Xi − θi)gi(X) = c(p − 2)E
p∑
i=1
‖X‖2 − 2X2i
‖X‖4= c(p− 2)2E
[1
‖X‖2
].
On the other hand, E‖g(X)‖2 = c2(p − 2)2E(1/‖X‖2). Combine all above together, the
conclusion follows.
We have to show that E‖X‖−2 <∞ for p ≥ 3. Indeed,
E
[1
‖X‖2
]≤ 1 +E
[1
‖X‖2
]I(‖X‖ ≤ 1) ≤ 1 +
∫· · ·∫
‖x‖≤1
1
‖x‖2dx1 · · · dxp
7
since the density of a normal distribution is bounded by one. By the polar-transformation,
the last integral is equal to
∫ 1
0dr
∫ π
0dθ1 · · ·
∫ π
0dθp−1
rp−1J(θ1, · · · , θp)
r2dr ≤ Cπp−1
∫ 1
0rp−3 dr <∞
if the integer p ≥ 3, where J(θ1, · · · , θp) is a multivariate polynomial of sin θi and cos θi, i =
1, 2, · · · , p, bounded by a constant C.
When 0 ≤ c ≤ 2, the term c(2 − c) ≥ 0 and attains the maximum at c = 1. It follows
that
COROLLARY 1.2 The estimator δc(X) dominates X provided 0 < c < 2 and p ≥ 3.
When c = 1, the estimator δ1(X) is called the James-Stein estimator.
COROLLARY 1.3 The James-Stein estimator δ1 dominates all estimators δc with c 6= 1.
2 Likelihood Ratio Test
We consider test H0 : θ ∈ H©0 vs Ha : θ ∈ H©a, where H©0 and H©a are disjoint. The
probability density or probability mass function of a random observation is f(x|θ). The
likelihood ratio test statistic is
λ(x) =supH0
f(x|θ)supH0∪Ha
f(x|θ) .
When H0 is true, the value of λ(x) tends to be large. So the rejection region is
λ(x) ≤ c.
Example. The t-test as a likelihood ratio test.
Let X1, · · · ,Xn be i.i.d. N(µ, σ2) with both parameters unknown. Test H0 : µ = µ0 vs
Ha : µ 6= µ0. The joint density function of Xi’s is
f(x1, · · · , xn|µ, σ) =
(1√2π
)n
(σ2)−n/2 exp
(− 1
2σ2
n∑
i=1
(xi − µ)2
). (2.1)
This one has maximum at (X, V ) where
V =1
n
n∑
i=1
(Xi − X)2.
8
The maximum is
maxµ,σ
f(x1, · · · , xn|µ, σ) = CV −n/2 exp
(− 1
2V
∑(Xi − X)2
)= CV −n/2e−n/2
Where C = (2π)−n/2. When µ = µ0, f(x1, · · · , xn|µ, σ) has maximum at
V0 =1
n
n∑
i=1
(Xi − µ0)2.
And the corresponding maximum value is
maxσ
f(x1, · · · , xn|µ0, σ) = CV−n/20 exp
(− 1
2V0
∑(Xi − µ0)
2
)= CV
−n/20 e−n/2.
So the likelihood ratio is
Λ =
(V0
V
)−n/2
=
(V + (X − µ0)
2
V
)−n/2
.
So the rejection region is
Λ ≤ a =
∣∣∣∣X − µ0
S
∣∣∣∣ ≥ b
for some constant b, where S = (1/(n− 1))∑n
i=1(Xi − X)2. This yields exactly a student t
test.
Example. Let X1, · · · ,Xn be a random sample from
f(x|θ) =
e−(x−θ), if x ≥ θ;
0, otherwise.(2.2)
The joint density function is
f(x|θ) =
e−
P
xi+nθ, if θ ≤ x(1);
0, otherwise.
Consider the test H0 : θ ≤ θ0 versus Ha : θ > θ0. It is easy to see
maxθ∈R
f(x|θ) = e−P
xi+nx(1) ,
which is achieved at θ = x(1). Now consider maxθ≤θ0 f(x|θ). If θ0 ≥ x(1), the two maxima
are identical; If θ0 < x(1), it is achieved at θ = θ0 that yields the maximum value
maxθ≤θ0
f(x|θ) =
e−
P
xi+nx(1) , if θ0 ≥ x(1);
e−P
xi+nθ0, if θ0 < x(1).
9
This gives the likelihood ratio
λ(x) =
1, if x(1) ≤ θ0;
en(θ0−x(1)), if x(1) > θ0.
The rejection region is λ(x) ≤ a which is equivalent to x(1) ≥ b for some b.
Again, the general form of a hypothesis test is H0 : θ ∈ H©0 vs Ha : θ ∈ H©a.
THEOREM 4 Suppose X has pdf or pmf f(x|θ) and T (X) is sufficient for θ. Let λ(x) and
λ∗(y) be the LRT statistics obtained from X and Y = T (X). Then λ(x) = λ∗(T (x)) for any
x.
Proof. Suppose the pdf or pmf of T (X) is g(t|θ). To be simple, suppose X is discrete, that
is, Pθ(T (X) = t) = g(t|θ). Then
f(x|θ) = Pθ(X = x, T (X) = T (x))
= Pθ(T (X) = T (x))Pθ(X = x|T (X) = T (x))
= g(T (x)|θ)h(x). (2.3)
by the definition of sufficiency, where h(x) is a function depending only on x (not θ).
Evidently,
λ(x) =supθ∈H0
f(x|θ)supθ∈H0∪Ha
f(x|θ) and λ∗(t) =supθ∈H0
g(t|θ)supθ∈H0∪Ha
g(t|θ) . (2.4)
By (2.3) and (2.4)
λ(x) =supθ∈H0
f(T (x)|θ)supθ∈H0∪Ha
f(T (x)|θ) = λ∗(T (x)).
This theorem says that the simplification of λ(x) is eventually relevant to x through a
sufficient statistic for the parameter.
3 Evaluating tests
Consider the test H0 : θ ∈ H©0 versus Ha : θ ∈ H©a, where H©0 and H©a are disjoint. When
making decisions, there are two types of mistakes:
Type I error: Reject H0 when it is true.
Type II error: Do not reject H0 when it is false.
10
Not reject H0 reject H0
H0 correct type I error
H1 Type II error correct
Suppose R denotes the rejection region for the test. The chances of making the two
types of mistakes are Pθ(X ∈ R) for θ ∈ H©0 and Pθ(X ∈ Rc), θ ∈ H©a. So
Pθ(X ∈ R) =
probability of a Type I error, if θ ∈ H©0;
1 − probability of a Type II error, if θ ∈ H©a.
DEFINITION 3.1 The power function of the test above with the rejection region R is a
function of θ defined by β(θ) = Pθ(X ∈ R).
DEFINITION 3.2 For α ∈ [0, 1], a test with power function β(θ) is a size α test if
supθ∈ H©0β(θ) = α.
DEFINITION 3.3 For α ∈ [0, 1], a test with power function β(θ) is a level α test if
supθ∈ H©0β(θ) ≤ α.
There are some differences between the above two definitions when dealing with complicated
models.
Example. Let X1, · · · ,Xn be from N(µ, σ2) with µ unknown but known σ. Look at the
test H0 : µ ≤ µ0 vs Ha : µ > µ0. It is easy to check that by likelihood rato test, the rejection
region is
X − µ0
σ/√n
> c
.
So the power function is
β(µ) = Pθ
(X − µ0
σ/√n> c
)= Pθ
(X − µ
σ/√n> c+
µ0 − µ
σ/√n
)
= 1 − Φ
(c+
µ0 − µ
σ/√n
).
It is easy to see that
limµ→∞
β(µ) = 1 limµ→−∞
β(µ) = 0 and β(µ0) = a if P (Z > c) = a,
where Z ∼ N(0, 1).
11
Example. In general, a size α LRT is constructed by choosing c such that supθ∈ H©0Pθ(λ(X)
≤ c) = α. Let X1, · · · ,Xn be a random sample from N(θ, 1). Test H0 : θ = θ0 vsHa : θ 6= θ0.
Here one can easily see that
λ(x) = exp(−n
2(x− θ0)
2)
So λ(X) ≤ c = |X−θ0| ≥ d. For α given, d can be chosen such that Pθ0(|Z| > d√n) = α.
For the example in (2.2), the test is H0 : θ ≤ θ0 vs θ > θ0 and the rejection region
λ(X) ≤ c = X(1) > d. For size α, first choose d such that
α = Pθ0(X(1) > d) = e−n(d−θ0).
Thus, d = θ0 − (log α)/n. Now
supθ∈ H©0
Pθ(X(1) > d) = supθ≤θ0
Pθ(X(1) > d) = supθ≤θ0
e−n(d−θ) = e−n(d−θ0) = α.
4 Most Powerful Tests
Consider a hypothesis test H0 : θ ∈ H©0 vs H1 : θ ∈ H©a. There are a lot of decision rules.
We plan to select the best one. Since we are not able to reduce the two types of errors at
same time (the sum of chances of two errors and two correct decisions is one). We fix the
the chance of the first type of error and maximize the powers among all decision rules.
DEFINITION 4.1 Let C be a class of tests for testing H0 : θ ∈ H©0 vs H1 : θ ∈ H©a. A test
in class C, with power function β(θ), is a uniformly most powerful (UMP) class C test if
β(θ) ≥ β′(θ) for every θ ∈ H©a and every β′(θ) that is a power function of a test in class C.
In this section we only consider C as the set of tests such that the level α is fixed, i.e.,
β(θ); supθ∈ H©0β(θ) ≤ α.
The following gives a way to obtain UMP tests.
THEOREM 5 (Neyman-Pearson Lemma). Let X be a sample with pdf or pmf f(x|θ).Consider testing H0 : θ = θ0 vs H1 : θ = θ1, using a test with rejection region R that
satisfies
x ∈ R if f(x|θ1) > kf(x|θ0) and x ∈ Rc if f(x|θ1) < kf(x|θ0) (4.1)
for some k ≥ 0, and
α = Pθ0(X ∈ R). (4.2)
12
Then
(i) (Sufficiency). Any test satisfies (4.1) and (4.2) is a UMP level α test.
(ii) (Necessity). Suppose there exists a test satisfying (4.1) and (4.2) with k > 0. Then
every UMP level α test is a size α test (satisfies (4.2)); Every UMP level α test satisfies
(4.1) except on a set A satisfying Pθ(X ∈ A) = 0 for θ = θ0 and θ1.
Proof. We will prove the theorem only for the case that f(x|θ) is continuous.
Since H©0 consists of only one point, the level α test and size α test are equivalent.
(i) Let R satisfy (4.1) and (4.2) and φ(x) = I(x ∈ R) with power function β(θ). Now take
another rejection region R′ with β′(θ) = Pθ(X ∈ R′) and β′(θ0) ≤ α. Set φ′(x) = I(x ∈ R′).
It suffices to show that
β(θ1) ≥ β′(θ1). (4.3)
Actually, (φ(x)− φ′(x))(f(x|θ1)− kf(x|θ0)) ≥ 0 (check it based on if f(x|θ1) > kf(x|θ0) or
not). Then
0 ≤∫
(φ(x) − φ′(x))(f(x|θ1) − kf(x|θ0)) dx = β(θ1) − β′(θ1) − k(β(θ0) − β′(θ0)). (4.4)
Thus the assertion (4.3) follows from (4.2) and that k ≥ 0.
(ii) Suppose the test satisfying (4.1) and (4.2) has power function β(θ). By (i) the test is
UMP. For the second UMP level α test, say, it has power function β′(θ). Then β(θ1) = β′(θ1).
Since k > 0, (4.4) implies β′(θ0) ≥ β(θ0) = α. So β′(θ0) = α.
Given a UMP level α test with power function β′(θ). By the proved (ii), β(θ1) = β′(θ1)
and β(θ0) = β′(θ0) = α. Then (4.4) implies that the expectation in there is zero. Being
always nonnegative, (φ(x) − φ′(x))(f(x|θ1) − kf(x|θ0)) = 0 except on a set A of Lebesgue
measure zero, which leads to (4.1) (by considering f(x|θ1) > kf(x|θ0) or not). Since X has
density, Pθ(X ∈ A) = 0 for θ = θ0 and θ = θ1.
Example. Let X ∼ Bin(2, θ). Consider H0 : θ = 1/2 vs Ha : θ = 3/4. Note that
f(x|θ1) =
(2
x
)(3
4
)x(1
4
)2−x
> k
(2
x
)(1
2
)x(1
2
)2−x
= kf(x|θ0)
is equivalent to 3x > 4k.(i) The case k ≥ 9/4 and the case k < 1/4 correspond to R = ∅ and Ω, respectively.
The according UMP level α = 0 and α = 1.
(ii) If 1/4 ≤ k < 3/4, then R = 1, 2, and α = P (Bin(2, 1/2) = 1 or 2) = 3/4.
(iii) If 3/4 ≤ k < 9/4, then R = 2. The level α = P (Bin(2, 1/2) = 2) = 1/4.
13
This example says that if we firmly want a size α test, then there are only two such α’s:
α = 1/4 and α = 3/4.
Example. Let X1, · · · ,Xn be a random sample from N(θ, σ2) with σ known. Look
at the test H0 : θ = θ0 vs H1 : θ = θ1, where θ0 > θ1. The density function f(x|θ, σ) =
(1/√
2π σ) exp(−(x− θ)2/(2σ2)). Now f(x|θ1) > kf(θ0) is equivalent to that R = X < c.Set α = Pθ0(X < c). One can check easily that c = θ0 + (σzα/
√n). So the UMP rejection
region is
R =
X < θ0 +
σ√nzα
.
LEMMA 4.1 Let X be a random variable. Let also f(x) and g(x) be non-decreasing real
functions. Then
E(f(X)g(X)) ≥ Ef(X) ·Eg(X)
provided all the above three expectations are finite.
Proof. Let the µ be the distribution of X. Noting that (f(x) − f(y))(g(x) − g(y)) ≥ 0 for
any x and y. Then
0 ≤∫∫
(f(x) − f(y))(g(x) − g(y))µ(dx)µ(dy)
= 2
∫f(x)g(x)µ(dx) − 2
∫f(x)g(y)µ(dx)µ(dy)
= 2[E(f(X)g(X)) − Ef(X) ·Eg(X)].
The desired result follows.
Let f(x|θ), θ ∈ R be a pmf or pdf with common support, that is, the set x : f(x|θ) > 0is identical for every θ. This family of distributions is said to have monotone likelihood ratio
(MLR) with respect to a statistic T (x) if
fθ2(x)
fθ1(x)on the support is a non-decreasing function in T (x) for any θ2 > θ1. (4.5)
LEMMA 4.2 Let X be a random vector with X ∼ f(x|θ), θ ∈ R, where f(x|θ) is a pmf
or pdf with a common support. Suppose f(x|θ) has MLR with a statistic T (x). Let
φ(x) =
1 when T (x) > C,
γ when T (x) = C,
0 when T (x) < C
(4.6)
14
for some C ∈ R, where γ ∈ [0, 1]. Then
Eθ1φ(X) ≤ Eθ2φ(X)
for any θ1 < θ2.
Proof. Thanks to MLT, for any θ1 < θ2 there exists a nondecreasing function g(t) such
that f(x|θ2)/f(x|θ1) = g(T (x)), where g(t) is a nondecreasing function. Thus
Eθ2φ(X) =
∫φ(x)
f(x|θ2)f(x|θ1)
f(x|θ1) dx = Eθ1 (g(T (X)) · h(T (X))) ,
where
h(x) =
1 when x > C,
γ when x = C,
0 when x < C
Since both g(x) and h(x) are non-decreasing, by the positive correlation inequality,
Eθ1 (g(T (X)) · h(T (X))) ≥ Eθ1g(T (X)) ·Eθ1h(T (X)) = Eθ1φ(X)
since Eθ1g(T (X)) = 1.
In (4.6), if T (X) > C reject H0; if T (X) < C, don’t reject H0; if T (X) = C, with
chance γ reject H0 and with chance 1 − γ don’t reject H0. The last case is equivalent to
flipping a coin with chance γ for head and 1 − γ for tail. If the head occurs, reject H0;
otherwise, don’t reject H0. This is a randomization. The purpose is to make the test UMP.
The randomization occurs only when T (X) is discrete.
THEOREM 6 Consider testing H0 : θ ≤ θ0 vs H1 : θ > θ0. Let X ∼ f(x|θ), θ ∈ R, where
f(x|θ) is a pmf or pdf with a common support. Also let T be a sufficient statistic for θ. If
f(x|θ) has MLR w.r.t. T (x), then φ(X) in (4.6) is a UMP level α test, where α = Eθ0φ(X).
Proof. Let β(θ) = Eθφ(X). Then by the previous lemma, supθ≤θ0β(θ) = Eθ0φ(X) = α.
We need to show β(θ1) ≥ β′(θ1) for any θ1 > θ0 and β′ such that supθ≤θ0β′(θ) ≤ α.
Consider test H0 : θ = θ0 vs Ha : θ = θ1. We know that
α = Pθ0(X ∈ R) = Eθ0φ(X) and β′(θ1) ≤ α. (4.7)
Again, we use the same notation as in Lemma 4.2, f(x|θ1)/f(x|θ0) = g(T (x)), where
g(t) is a non-decreasing function in t. Take k = g(C) in the Neyman-Pearson Lemma 5.
15
If f(x|θ1)/f(x|θ0) > k, since g(t) is non-decreasing, then T (X) > C, which is the same
as saying to reject H0 by the functional form of φ(x) as in (4.6). If f(x|θ1)/f(x|θ0) < k
then T (X) < C, that is, not to reject H0. By Neyman-Pearson lemma and (4.7), φ(x)
corresponds to a UMP level-α test. Thus, β(θ1) ≥ β′(θ1).
16
5 Probability convergence
Let (Ω,F , P ) be a probability space. Let Also X, Xn; n ≥ 1 be a sequence of ran-
dom variables defined on this probability space. We say Xn → X in probability or Xn is
consistent with X if
P (|Xn −X| ≥ ǫ) → 0 as n→ ∞
for any ǫ > 0.
Example. Let Xn; n ≥ 1 be a sequence of independent random variables with mean
zero and variance one. Set Sn =∑n
i=1Xi, n ≥ 1. Then Sn/n → 0 in probability as n→ ∞.
In fact,
P
(∣∣∣∣Sn
n
∣∣∣∣ ≥ ǫ
)≤ E(Sn)2
nǫ2=
1
nǫ2→ 0
by Chebyshev’s inequality.
Recall X and Xn, n ≥ 1 are real measurable functions defined on (Ω,F). We say Xn
converges to X almost surely if
P (ω ∈ Ω : limn→∞
Xn(ω) = X(ω)) = 1.
For a sequence of events En; n ≥ 1 in F . Define
En; i.o. =
∞⋂
n=1
⋃
k≥n
Ek.
Here “i.o.” means “infinitely often”. If ω ∈ En; i.o., then there are infinitely many n
such that ω ∈ En. Conversely, if ω /∈ En; i.o., or ω ∈ En; i.o.c, then only finitely many
En contain ω.
In proving almost convergence, the following so-called Borel-Cantelli lemma is very
powerful.
LEMMA 5.1 (Borel-Cantelli Lemma). Let En; n ≥ 1 be a sequence of events in F . If
∞∑
n=1
P (En) <∞ (5.1)
then P (En occurs only finite times) = P (En, i.o.c) = 1.
Proof. Recall P (En) = E(IEn). Let X(ω) =∑∞
n=1 IEn(ω). Then
EX = E
∞∑
i=1
IEn =
∞∑
i=1
P (En) <∞.
17
Thus, X <∞ a.s.
LEMMA 5.2 (Second Borel-Cantelli Lemma). Let En; n ≥ 1 be a sequence of independent
events in F . If
∞∑
n=1
P (En) = ∞ (5.2)
then P (En, i.o.) = 1.
Proof. First,
En, i.o.c =⋃
n
⋂
k≥n
Eck.
With pk denoting P (Ek), we have that
P
⋂
k≥n
Eck
=
∏
k≥n
(1 − pk) ≤ exp
−
∑
k≥n
pk
= 0
as n→ ∞, where we used the inequality that 1 − x ≤ e−x for any x ∈ R. Thus,
P (En, i.o.c) ≤∑
n≥1
P
⋂
k≥n
Eck
= 0.
As an application, we have that
Example. Let Xn; n ≥ 1 be a sequence of i.i.d. random variables with mean µ and
finite fourth moment. Then
Sn
n→ µ a.s.
as n→ ∞.
Proof. W.L.O.G., assume µ = 0. Let ǫ > 0. Then by Chebyshev’s inequality
P
(∣∣∣∣Sn
n
∣∣∣∣ ≥ ǫ
)≤ E(Sn)4
n4ǫ4.
Now, by independence,
E(S4n) = E
∑
k
X4k +
(4
2
) ∑
1≤i6=j≤n
X2i X
2j +
(4
2
) ∑
1≤i6=j=n
XiX3j
= nE(X41 ) + 6n(n − 1)E(X2
1 ).
18
Thus, P(∣∣Sn
n
∣∣ ≥ ǫ)
= O(n−2). Consequently,∑
n≥1 P(∣∣Sn
n
∣∣ ≥ ǫ)< ∞ for any ǫ > 0. This
means that, with probability one, |Sn/n| ≤ ǫ for sufficiently large n. Therefore,
lim supn
|Sn|n
≤ ǫ a.s.
for any ǫ > 0. Let ǫ ↓ 0. The desired conclusion follows.
By using a truncation method, we have the following Kolmogorov’s Strong Law of Large
Numbers.
THEOREM 7 Let Xn; n ≥ 1 be a sequence of i.i.d. random variables with mean µ.
Then
Sn
n→ µ a.s.
as n→ ∞.
There are many variations of the above strong law of large numbers. Among them,
Etemadi showed that Theorem 7 holds if Xn; n ≥ 1 is a sequence of pairwise independent
and identically distributed random variables with mean µ.
The next proposition tells the relationship between convergence in probability and con-
vergence almost surely.
PROPOSITION 5.1 Let Xn and X be random variables taking values in Rk. If Xn →
X a.s. as n → ∞, then Xn converges to X in probability; second, Xn converges to X in
probability iff for any subsequence of Xn there exists a further subsequence, say, Xnk; k ≥
1 such that Xnk→ X a.s. as k → ∞.
Proof. The joint almost sure convergence and joint convergence in probability are equiv-
alent to the corresponding pointwise convergences. So W.L.O.G., assume Xn and X are
one-dimensional. Given ǫ > 0. The almost sure convergence implies that
P (|Xn −X| ≥ ǫ occurs only for finitely many n) = 1.
This says that P (|Xn −X| ≥ ǫ, i.o.) = 0. But
limnP (|Xn −X| ≥ ǫ) ≤ lim
nP
⋃
k≥n
|Xk −X| ≥ ǫ
= P (|Xn −X| ≥ ǫ, i.o.) = 0
19
where we use the fact that
⋃
k≥n
|Xk −X| ≥ ǫ ↑ |Xn −X| ≥ ǫ, i.o..
Now suppose Xn → X in probability. W.L.O.G., we assume the subsequence is Xn
itself. Then for any ǫ > 0, there exists nǫ > 0 such that supn≥nǫP (|Xn −X| ≥ ǫ) < ǫ. We
then have a subsequence nk such that P (|Xnk−X| ≥ 2−k) < 2−k for each k ≥ 1. By the
Borel-Cantelli lemma
P(|Xnk
−X| < 2−k for sufficiently large k)
= 1
which implies that Xnk→ X a.s. as k → ∞.
Now suppose Xn doesn’t converge to X in probability. That is, there exists ǫ0 > 0 and
subsequence Xkl; l ≥ 1 such that
P (|Xkl−X| ≥ ǫ0) ≥ ǫ0
for all l ≥ 1. Therefore any subsequence of Xklcan not converge to X in probability, hence
it can not converge to X almost surely.
Example Let Xi, i ≥ 1 be a sequence of i.i.d. random variables with mean zero and
variance one. Let Sn = X1 + · · ·+Xn, n ≥ 1. Then Sn/√
2n log log n→ 0 in pobability but
it does not converge to zero almost surely. Actually lim supn→∞ Sn/√
2 log log n = 1 a.s.
PROPOSITION 5.2 Let Xk; k ≥ 1 be a sequence of i.i.d. random variables with X1 ∼N(0, 1). Let Wn = max1≤i≤nXi. Then
limn→∞
Wn√log n
=√
2 a.s.
Proof. Recall
x√2π (1 + x2)
e−x2/2 ≤ P (X1 ≥ x) ≤ 1√2π x
e−x2/2 (5.3)
for any x > 0. Given ǫ > 0, let bn = (√
2 + ǫ)√
log n. Then
P (Wn ≥ bn) ≤ nP (X1 ≥ bn) ≤ n
bne−b2n/2 ≤ 1
nǫ2/2
for n ≥ 3. Let nk = [k4/ǫ2 ] + 1 for k ≥ 1. Then,
P (Wnk≥ bnk
) ≤ 1
k2
20
for large k. By the Borel-Cantelli lemma, P (Wnk≥ bnk
, i.o.) = 0. This implies that
lim supk
Wnk
bnk
≤√
2 + ǫ a.s.
For any n ≥ n1 there exists k such that nk ≤ n < nk+1. Note that nk+1/nk → 1 as n→ ∞and Wnk
≤Wn ≤Wnk+1we have that
lim supn
Wn√log n
≤√
2 + ǫ a.s.
Let ǫ ↓ 0. We obtain that
lim supn
Wn√log n
≤√
2 a.s. (5.4)
This proves the upper bound. Now we turn to prove the lower bound. Choose ǫ ∈ (0,√
2).
Set cn = (√
2 − ǫ)√
log n. Then
P (Wn ≤ cn) = P (X1 ≤ cn)n = (1 − P (X1 > cn))n ≤ e−nP (X1>cn).
By (5.3),
nP (X1 > cn) ≥ ncn√2π (1 + c2n)
e−c2n/2 ∼ C√2π log n
n1−(√
2−ǫ)2/2
as n is sufficiently large, where C is a constant. Since 1 − (√
2 − ǫ)2/2 > 0,
∑
n≥2
P (Wn ≤ cn) <∞.
By the Borel-Cantelli again, P (Wn ≤ cn, i.o.) = 0. This implies that
lim infn
Wn√log n
≥√
2 − ǫ a.s.
Let ǫ ↓ 0. We obtain
lim infn
Wn√log n
≥√
2 a.s.
This together with (5.4) implies the desired result.
The above results also hold for random variables taking abstract values. Now let’s
introduce such random variables.
21
A metric space is a set D equipped with a metric. A metric is a map d : D×D → [0,∞)
with the properties
(i) d(x, y) = 0 if and only if x = y;
(ii) d(x, y) = d(y, x);
(iii) d(x, z) ≤ d(x, y) + d(y, z)
for any x, y and z. A semi-metric satisfies (ii) and (iii) , but not necessarily (i). An open
ball is a set of the form y : d(x, y) < r. A subset of a metric space is open if and only if it
is the union of open balls; it is closed if and only if the complement is open. A sequence xn
converges to x if and only if d(xn, x) → 0, and denoted by xn → x. The closure A of a set
A is the set of all points that are the limit of a sequence in A; it is the smallest closed set
containing A. The interior A is the collection of all points x such that x ∈ G ⊂ A for some
open set G; it is the largest open set contained in A. A function f : D → E between metric
spaces is continuous at a point x if and only if f(xn) → f(x) for every sequence xn → x;
it is continuous at every x if and only if f−1(G) is also open for every open set G ⊂ E. A
subset of a metric space is dense if and only if its closure is the whole space. A metric space
is separable if and only if it has a countable dense subset. A subset K of a metric space is
compact if and inly if it is closed and every sequence in K has a convergent subsequence. A
set K is totally bounded if and only if for every ǫ > 0 it can be covered by finitely many balls
of radius ǫ > 0. The space is complete if and only if any Cauchy sequence has a limit. A
subset of a complete metric space is compact if and only if it is totally bounded and closed.
A norm space D is a vector space equipped with a norm. A norm is a map ‖ · ‖ : D →[0,∞) such that for every x, y in D, and α ∈ R,
(i) ‖x+ y‖ ≤ ‖x‖ + ‖y‖;(ii) ‖αx‖ = |α|‖x‖;(iii) ‖x‖ = 0 if and only if x = 0.
Definition. The Borel σ-field on a metric space D is the smallest σ-field that contains
the open sets. A function taking values in a metric space is called Borel-measurable if it is
measurable relative to the Borel σ-field. A Borel-measurable map X : Ω → D defined on a
probability space (Ω,F , P ) is referred to as a random element with values in D.
The following lemma is obvious.
LEMMA 5.3 A continuous map between two metric spaces is Borel-measurable.
22
Example. Let C[a, b] = f(x); f is a continuous function on [a, b]. Define
‖f‖ = supa≤x≤b
|f(x)|.
This is a complete normed linear space which is also called a Banach space. This space is
separable. Any separable Banach space is isomorphic to a subset of C[a, b].
Remark. Proposition 5.1 still holds when the random variables taking values in a Polish
space.
Let X,Xn;n ≥ 1 be a sequence of random variables. We say Xn converges to X
weakly or in distribution if
P (Xn ≤ x) → P (X ≤ x)
as n→ ∞ for every continuous point x, that is, P (X ≤ x) = P (X < x).
Remark. The set of discontinuous points of any random variable is countable.
LEMMA 5.4 Let Xn and X be random variables. The following are equivalent:
(i) Xn converges weakly to X;
(ii) Ef(Xn) → Ef(X) for any bounded continuous function f(x) defined on R;
(iii) liminfnP (Xn ∈ G) ≥ P (X ∈ G) for any open set G;
(iv) limsupnP (Xn ∈ F ) ≤ P (X ∈ F ) for any closed set F ;
Remark. The above lemma is actually true for any finite dimensional space. Further, it
is true for random variables taking values in separable, complete metric spaces, which are
also called Polish spaces.
Proof. (i) =⇒ (ii). Given ǫ > 0, choose K > 0 such that K and −K are continuous
point of all Xn’s and X and P (−K ≤ X ≤ K) ≥ 1 − ǫ. By (i)
limnP (|Xn| ≤ K) = lim
nP (Xn ≤ K) − lim
nP (Xn ≤ −K) = P (X ≤ K) − P (X ≤ −K)
≥ 1 − ǫ
In other words
limnP (|Xn| > K) ≤ ǫ. (5.5)
Since f(x) is continuous on [−K,K], we can cut the interval into, say, m subintervals
[ai, ai+1], 1 ≤ i ≤ m such that each ai is a continuous point of X and supai≤x≤ai+1|f(x) −
f(ai)| ≤ ǫ. Set
g(x) =
m∑
i=1
f(ai)I(ai ≤ x < ai+1).
23
It is easy to see that
|f(x) − g(x)| ≤ ǫ for |x| ≤ K and |f(x) − g(x)| ≤ ‖f‖ for |x| > K,
where ‖f‖ = supx∈R |f(x)|. It follows that
E|f(Y ) − Eg(Y )| ≤ ǫ+ ‖f‖ · P (|Y | > K)
for any random variable Y. Apply this to Xn and X we obtain from (5.5) that
lim supn
|Ef(Xn) − Eg(Xn)| ≤ (1 + ‖f‖)ǫ and |Ef(X) − Eg(X)| ≤ (1 + ‖f‖)ǫ.
Easily Eg(Xn) → Eg(X). Therefore by triangle inequality,
lim supn
|Ef(Xn) − Ef(X)| ≤ 2(1 + ‖f‖)ǫ.
Then (ii) follows by letting ǫ ↓ 0.
(ii) =⇒ (iii). Define fm(x) = (md(x,Gc)) ∧ 1 for m ≥ 1. Since G is open, fm(x) is a
continuous function bounded by one and fm(x) ↑ 1G(x) for each x as m → ∞. It follows
that
lim infn
P (Xn ∈ G) ≥ lim infn
Efm(Xn) → Efm(X)
for each m. (iii) follows by letting m → ∞.
(iii) and (iv) are equivalent because F c is open.
(iv) =⇒ (i). Let x be a continuous point. Then by (iii) and (iv)
lim supn
P (Xn ≤ x) ≤ P (X ≤ x) = P (X < x) and
lim infn
P (Xn ≤ x) ≥ lim infn
P (Xn < x) ≥ P (X < x).
Then (i) follows.
Example. Let Xn be uniformly distributed over 1/n, 2/n, · · · , n/n. Prove that Xn con-
verges weakly to U [0, 1].
Actually, for any bounded continuous function f(x),
Ef(Xn) =1
n
n∑
i=1
f
(i
n
)→∫ 1
0f(t) dt = Ef(Y )
where Y ∼ U [0, 1].
24
Example. Let Xi, i ≥ 1 be i.i.d. random variables. Define
µn =1
n
n∑
i=1
δXi through µn(A) =1
n
n∑
i=1
I(Xi ∈ A) =#i : Xi ∈ A
n
for any set A. Then, for any measurable function f(x), by the strong law of large numbers,
∫f(x)µn( dx) =
1
n
n∑
i=1
f(Xi) → Ef(X1) =
∫f(x)µ( dx)
where µ = L(X1). This concludes that µn → µ weakly as n→ ∞.
COROLLARY 5.1 Suppose Xn converges to X weakly and g(x) is a continuous map de-
fined on R1 (not necessarily one-dimensional). Then g(Xn) converges to g(X) weakly.
The following two results are called Delta methods.
THEOREM 8 Let Yn be a sequence of random variables that satisfies√n(Yn − θ) =⇒
N(0, σ2). Let g be a real-valued function such that g′(θ) 6= 0. Then
√n(g(Yn) − g(θ)) =⇒ N(0, σ2(g′(θ)2)).
Proof. By Taylor’s expansion g(x) = g(θ) + g′(θ)(x− θ) + a(x− θ) for some real function
a(x) such that a(t)/t → 0 as t→ 0. Then,
√n(g(Yn) − g(θ)) = g′(θ)
√n(Yn − θ) +
√n · a(Yn − θ).
We only need to show that√n · a(Yn − θ) → 0 in probability as n→ ∞ (latter we will see
this easily by Slusky’s lemma. But why now?) Indeed, for any m ≥ 1 there exists a δ > 0
such that |a(x)| ≤ |x|/m for all |x| ≤ δ. Thus,
P (|√n · a(Yn − θ)| ≥ b) ≤ P (|Yn − θ| > δ) + P (|√n(Yn − θ)| ≥ mb)
→ P (|N(0, σ2)| ≥ mb)
as n→ ∞. Now let m ↑ ∞. we obtain that√n · a(Yn − θ) → 0 in probability.
Similarly, one can prove that
THEOREM 9 Let Yn be a sequence of random variables that satisfies√n(Yn − θ) =⇒
N(0, σ2). Let g be a real-valued function such that g′′(x) exists around x = θ with g′(θ) = 0
and g′′(θ) 6= 0. Then
n(g(Yn) − g(θ)) =⇒ 1
2σ2g′′(θ)χ2
1.
25
Proof. By Taylor’s expansion again,
n(g(Yn) − g(θ)) =1
2σ2g′′(θ)
[n ·(Yn − θ
σ
)2]
+ o(n|Yn − θ|2).
This leads to the desired conclusion by the same arguments as in Theorem 8.
PROPOSITION 5.3 Let X, Xn;n ≥ 1 be a sequence of random variables on Rk. Look
at the following three statements:
(i) Xn converges to X almost surely;
(ii)Xn converges to X in probability;
(iii) Xn converges to X weakly.
Then (i) =⇒ (ii) =⇒ (iii).
Proof. The implication “(i) =⇒ (ii)” is proved earlier. Now we prove that (ii) =⇒ (iii).
For any closed set F let
F ǫ = x : infy∈F
d(x, y) ≤ ǫ.
Then F ǫ is a closed set. When X takes values in R, the metric d(x, y) = |x− y|. It is easy
to see that
Xn ∈ F ⊂ X ∈ F 1/k ∪ |Xn −X| ≥ 1/k
for any integer k ≥ 1. Therefore
P (Xn ∈ F ) ≤ P (X ∈ F 1/k) + P (|Xn −X| ≥ 1/k).
Let n→ ∞ we have that
lim supn
P (Xn ∈ F ) ≤ P (X ∈ F 1/k)
Note that F 1/k ↓ F as k → ∞. (iii) follows by letting k → ∞.
Any random vector X taking values in Rk is tight: For every ǫ > 0 there exists a
number M > 0 such that P (‖X‖ > M) < ǫ. A set of random vectors Xα; α ∈ A is called
uniformly tight if there exists a constant M > 0 such that
supα∈A
P (‖Xα‖ > M) < ǫ.
26
A random vector X taking values in a polish space (D, d) is also tight: For every ǫ > 0 there
exists a compact set K ⊂ D such that P (X /∈ K) < ǫ. This will be proved later.
Remark. It is easy to see that a finite number of random variables is uniformly tight if every
individual is tight. Also, if Xα; α ∈ A and Xα; α ∈ B are uniformly tight, respectively,
then Xα, α ∈ A∪B is also tight. Further, if Xn =⇒ X, then X,Xn; n ≥ 1 is uniformly
tight. Actually, for any ǫ > 0, choose M > 0 such that P (‖X‖ ≥ M) ≥ 1 − ǫ. Since the
set x ∈ Rk, ‖x‖ > M is an open set, lim infn→∞ P (‖Xn‖ > M) ≥ P (‖X‖ > M) ≥ 1 − ǫ.
The tightness of X,Xn follows.
LEMMA 5.5 Let µ be a probability measure on (D,B), where (D, d) is a Polish space and
B is the Borel σ-algebra. Then µ is tight.
Proof. Since (D, d) is Polish, it is separable, there exists a countable set x1, x2, · · · ⊂ D
such that it is dense in D.
Given ǫ > 0. For any integer k ≥ 1, since ∪n≥1B(xn, 1/k) = D, there exists nk such
that µ(∪nkn=1B(xn, 1/k)) > 1 − ǫ/2k. Set Dk = ∪nk
n=1B(xn, 1/k) and M = ∩k≥1Dk. Then M
is totally bounded (M is always covered by a finite number of balls with any prescribed
radius). Also
µ(M c) ≤∞∑
k=1
µ(Dck) ≤
∞∑
k=1
ǫ
2k= ǫ.
Let K be the closure of M. Since D is complete, M ⊂ K ⊂ D and K is compact. Also by
the above fact, µ(Kc) < ǫ.
LEMMA 5.6 Let Xn; n ≥ 1 be a sequence of random variables. Let Fn be the cdf of Xn.
Then there exists a subsequence Fnj with the property that Fnj (x) → F (x) at each continuity
point of x of a possibly defective distribution function F. Further, if Xn is tight, then F (x)
is a cumulative distribution function.
Proof. Put all rational numbers q1, q2, · · · in a sequence. Because the sequence Fn(q1)
is contained in the interval [0, 1], it has a converging subsequence. Call the indexing sub-
sequence n1j∞j=1 and the limit G(q1). Next, extract a further subsequence n2
j ⊂ n1j
along which Fn(q2) converges to a limit G(q2), a further subsequence n3j ⊂ n2
j along
which Fn(q3) converges to a limit G(q3), · · · , and so forth. The tail of the diagonal sequence
nj := njj belongs to every sequence ni
j. Hence Fnj (qi) → G(qi) for every i = 1, 2, · · · . Since
each Fn is nondecreasing, G(q) ≤ G(q′) if q ≤ q′. Define
F (x) = infq>x
G(q).
27
It is not difficult to verify that F (x) is decreasing and right-continuous at every point x. It
is also easy to show that Fnj (x) → F (x) at each continuity point of x of F by monotonicity
of F (x).
Now assume Xn is tight. For given ǫ ∈ (0, 1), there exists a rational number M > 0
such that P (−M < Xn ≤ M) ≥ 1 − ǫ. Equivalently, Fn(M) − Fn(−M) ≥ 1 − ǫ for all n.
Passing the subsequence nj to ∞. We have that G(M) − G(−M) ≥ 1 − ǫ. By definition,
F (M) − F (−M) ≥ 1 − ǫ. Therefore F (x) is a cdf.
THEOREM 10 (Almost-sure representations). If Xn =⇒ X0 then there exist random
variables Yn on ([0, 1], B[0, 1]) such that L(Yn) = L(Xn) for each n ≥ 0 and Yn → Y0
almost surely.
Proof. Let Fn(x) be the cdf of Xn. Let U be a random variable uniformly distributed on
([0, 1],B[0, 1]). Define Yn = F−1n (U) where the inverse here is the generalized inverse of a
non-decreasing function defined by
F−1n (t) = infx : Fn(x) ≥ t
for 0 ≤ t ≤ 1. One can verify that F−1n (t) → F−1
0 (t) for every t such that F0(x) is continuous
at t. We know that the set of the discontinuous points is countable. Therefore the Lebesgue
measure is zero. This shows that Yn = F−1n (U) → F−1
0 (U) = Y0 almost surely.
It is easy to check that L(Xn) = L(Yn) for all n = 0, 1, 2, · · · .
LEMMA 5.7 (Slusky). Let an, bn and Xn are sequences of random variables such
that an → a in probability and bn → b in probability and Xn =⇒ X for constants a and b
and some random variable X. Then anXn + bn =⇒ aX + b as n→ ∞.
Proof. We will show that (anXn +bn)−(aXn +b) → 0 in probability. If this is the case,
then the desired conclusion follows from the weak convergence of Xn (why?). Given ǫ > 0.
Since Xn is convergent, it is tight. So there exists M > 0 such that P (|Xn| ≥ M) ≤ ǫ.
Moreover, by the given condition,
P(|an − a| ≥ ǫ
2M
)< ǫ and P (|bn − b| ≥ ǫ) < ǫ
as n is sufficiently large. Note that
|(anXn + bn) − (aXn + b)| ≤ |an − a||Xn| + |bn − b|.
28
Thus,
P (|(anXn + bn) − (aXn + b)| ≥ 2ǫ)
≤ P(|an − a| ≥ ǫ
2M
)+ P (|Xn| ≥M) + P (||bn − b| ≥ ǫ) < 3ǫ
as n is sufficiently large.
Remark. All the above statements of weak convergence of random variables Xn can be
replaced by their distributions µn := L(Xn). So “Xn =⇒ X” is equivalent to “µn =⇒ µ”.
THEOREM 11 (Levy’s continuity theorem). Let µn;n = 0, 1, 2, · · · be a sequence of prob-
ability measures on Rk with characteristic function φn. We have that
(i) If µn =⇒ µ0 then φn(t) → φ0(t) for every t ∈ Rk;
(ii) If φn(t) converges pointwise to a limit φ0(t) that is continuous at 0, then the as-
sociated sequence of distributions µn is tight and converges weakly to the measure µ0 with
characteristic function φ0.
Remark. The condition that “φ0(t) is continuous at 0” is essential. Let Xn ∼N(0, n), n ≥ 1. So P (Xn ≤ x) → 1/2 for any x, which means that Xn doesn’t converges
weakly. Note that φn(t) = e−nt2/2 → 0 as n→ ∞ and φn(0) = 1. So φ0(t) is not continuous.
If one ignores the continuity condition at zero, the conclusion that Xn converges weakly
may wrongly follow.
Proof of Theorem 11. (i) is trivial.
(ii) Because marginal tightness implies joint tightness. W.L.O.G., assume that µn is a
probability measure on R and generated by Xn. Note that for every x and δ > 0,
1|δx| > 2 ≤ 2
(1 − sin δx
δx
)=
1
δ
∫ δ
−δ(1 − cos tx) dt.
Replace x by Xn, take expectations, and use Fubini’s theorem to obtain that
P
(|Xn| >
2
δ
)≤ 1
δ
∫ δ
−δ(1 − EeitXn) dt =
1
δ
∫ δ
−δ(1 − φn(t)) dt.
By the given condition,
lim supn
P
(|Xn| >
2
δ
)≤ 1
δ
∫ δ
−δ(1 − φ0(t)) dt.
The right hand side above goes to zero as δ ↓ 0. So tightness follows.
29
To prove the weak convergence, we have to show that∫f(x) dun →
∫f(x) dµ for every
bounded continuous function f(x). It is equivalent to that for any subsequence, there is a
further subsequence, say nk, such that∫f(x) dunk
→∫f(x) dµ0.
For any subsequence, by Lemma 5.6, the Helly selection principle, there is a further
subsequence, say µnk, such that it converges to µ0 weakly, by the earlier proof, the c.f. of
µ0 is φ0. We know that a c.f. uniquely determines a distribution. This says that every limit
is identical. So∫f(x) dunk
→∫f(x) dµ0.
THEOREM 12 (Central Limit Theorem). Let X1,X2, ·,Xn be a sequence of i.i.d. random
variables with mean zero and variance one. Let Xn = (X1 + · · ·+Xn)/n. Then√nXn =⇒
N(0, 1).
Proof. Let φ(t) be the c.f. of X1. Since EX1 = 0 and EX21 = 1 we have that φ′(0) =
iEX1 = 0 and φ′′(0) = i2EX21 = −1. By Taylor’s expansion
φ
(t√n
)= 1 − t2
2n+ o
(1
n
)
as n is sufficiently large. Hence
Eeit√
nXn = φ
(t√n
)n
=
(1 − t2
2n+ o
(1
n
))n
→ e−t2/2
as n→ ∞. By Levy’s continuity theorem, the desired conclusion follows.
To deal with multivariate random variables, we have the next tool.
THEOREM 13 (Cramer-Wold device). Let Xn and X be random variables taking values
in Rk. Then
Xn =⇒ X if and only if tTXn =⇒ tTX for all t ∈ Rk.
Proof. By Levy’s continuity Theorem 11, Xn =⇒ X if and only if E exp(itTXn) →E exp(itTX) for each t ∈ R
k, which is equivalent to E exp(iu(tTXn)) → E exp(iu(tTX)) for
any real number u. This is same as saying that the c.f. of tTXn converges to that of tTX
for each t ∈ Rk. It is equivalent to that tTXn =⇒ tTX for all t ∈ R
k.
As an application, we have the multivariate analogue of the one-dimensional CLT.
THEOREM 14 Let X1,X2, · · · be a sequence of i.i.d. random vectors in Rk with mean
vector µ = EX1 and covariance matrix Σ = E(X1 − µ)(X1 − µ)T . Then
1√n
n∑
i=1
(Xi − µ) =√n(Yn − µ) =⇒ Nk(0,Σ).
30
Proof. By the Cramer-Wold device, we need to show that
1√n
n∑
i=1
(tTXi − tTµ) =⇒ N(0, tT Σt)
for ant t ∈ Rk. Note that tTXi − tTµ, i = 1, 2, · · · are i.i.d. real random variables. By
the one-dimensional CLT,
1√n
n∑
i=1
(tTXi − tTµ) =⇒ N(0, Var(tTX1)).
This leads to the desired conclusion since Var(tTX1) = tT Σt.
We state the following theorem without giving the proof. The spirit of the proof is
similar to that of Theorem 12. It is called Lindeberg-Feller Theorem.
THEOREM 15 Let kn; n ≥ 1 be a sequence of positive integers and kn → +∞ as n→ ∞.
For each n let Yn,1, · · · , Yn,kn be independent random vectors with finite variances such that
kn∑
i=1
Cov (Yn,i) → Σ and
kn∑
i=1
E‖Yn,i‖2I‖Yn,i‖ > ǫ → 0
for every ǫ > 0. Then the sequence Σkni=1(Yn,i − EYn,i) =⇒ N(0,Σ).
The Central Limit Theorems hold not only for independent random variables, it is also
true for some other dependent random variables of special structures.
Let Y1, Y2, · · · be a sequence of random variables. A sequence of σ-algebra Fn satisfies
that Fn ⊃ σ(Y1, · · · , Yn) and is non-decreasing Fn ⊂ Fn+1 for all n ≥ 0, where F0 = ∅,Ω.When E(Yn+1|Fn) = 0 for any n ≥ 0, we say Yn; n ≥ 1 are martingale differences.
THEOREM 16 For each n ≥ 1, let Xni, 1 ≤ i ≤ kn be a sequence of martingale differ-
ences relative to nested σ-algebras Fni, 1 ≤ i ≤ kn, that is, Fni ⊂ Fnj for i < j, such
that
1. maxi |Xni|; n ≥ 1 is uniformly integrable;
2. E (maxi |Xni|) → 0;
3.∑
iX2ni → 1 in probability.
Then Sn =∑
iXni → N(0, 1) in distribution.
Example. Let Xi; i ≥ 1 be i.i.d. with mean zero and variance one. Then one can
verify that the required conditions in Theorem 16 hold for Xni = Xi/√n for i = 1, 2, · · · , n.
31
Therefore we recollect the classical CLT that∑n
i=1Xi/√n =⇒ N(0, 1). Actually,
P (maxi
|Xni| ≥ ǫ) ≤ nP (|X1| ≥√nǫ) ≤ n
nǫ2E|X1|2I|X1| ≥
√nǫ → 0 and
E(maxi
|Xni|)2 ≤ E
∑ni=1X
2i
n= 1
This shows that maxi |Xni| → 0 in probability and maxi |Xni|, n ≥ 1 is uniformly inte-
grable. So condition 2 holds.
LEMMA 5.8 Let Un; n ≥ 1 and Tn; n ≥ 1 be two sequences of random variables such
that
1. Un → a in probability, where a is a constant;
2. Tn is uniformly integrable;
3. UnTn is uniformly integrable;
4. ETn → 1.
Then E(TnUn) → a.
Proof. Write TnUn = Tn(Un − a) + aTn. Then E(TnUn) = ETn(Un − a) + aETn. Evi-
dently, Tn(Un−a) is uniformly integrable by the given condition. Since Tn is uniformly
integrable, it is not difficult to verify that Tn(Un − a) → 0 in probability. Then the result
follows.
By the Taylor expansion, we have the following equality
eix = (1 + ix)e−x2/2+r(x) (5.6)
for some function r(x) = O(x3) as x→ ∞. Look at the set-up in Theorem 16, we have that
eitSn = TnUn, where
Tn =∏
j
(1 + itXnj) and Un = exp(−(t2/2)∑
j
X2nj +
∑
j
r(tXnj)). (5.7)
We next obtain a general result on CLT by using Lemma 5.8.
THEOREM 17 Let Xnj , 1 ≤ i ≤ kn, n ≥ 1 be a triangular array of random variables.
Suppose
1. ETn → 1;
2. Tn is uniformly integrable;
32
3.∑
j X2nj → 1 in probability;
4. E(maxj |Xnj |) → 0.
Then Sn =∑
j Xnj converges to N(0, 1) in distribution.
Proof. Since |TnUn| = |eitSn | = 1, we know that TnUn is uniformly integrable. By Lemma
5.8, it is enough to show that Un → e−t2/2 in probability. Review (5.7) and condition 3, it
suffices to show that∑
j r(tXnj) → 0 in probability. The assertion r(x) = O(x3) say that
there exists A > 0 and δ > 0 such that |r(x)| ≤ A|x3| as |x| < δ. Then
P (|∑
j
r(tXnj)| > ǫ) ≤ P (maxj
|Xnj | ≥ δ) + P (|∑
j
r(tXnj)| > ǫ, maxj
|Xnj | < δ)
≤ δ−1E(maxj
|Xnj |) + P (maxj
|Xnj | ·∑
j
|Xnj |2 > ǫ) → 0
as n→ ∞ by the given conditions.
Proof of Theorem 16. Define Zn1 = Xn1 and
Znj = XnjI(
j−1∑
k=1
X2nk ≤ 2), 2 ≤ j ≤ kn.
It is easy to check that Znj is still a martingale relative to the original σ-algebra. Now
define J = infj ≥ 1;∑
1≤k≤j X2nk > 2 ∧ kn. Then
P (Xnk 6= Znk for some k ≤ kn) = P (J ≤ kn − 1)
≤ P (∑
1≤k≤kn
X2nk > 2) → 0
as n→ ∞. Therefore, P (Sn 6=∑knj=1 Znj) → 0 as n→ ∞. To prove the result, we only need
to show
kn∑
j=1
Znj =⇒ N(0, 1). (5.8)
We now apply Theorem 17 to prove this. Replacing Xnj by Znj in (5.7), we have new Tn and
Un. Let’s literally verify the four conditions in Theorem 17. By a martingale property and
iteration, ETn = 1. So condition 1 holds. Now maxj |Znj | ≤ maxj |Xnj | → 0 in probability
as n → ∞. So condition 4 holds. It is also easy to check that∑
j Z2nj → 1 in probability.
Then it remains to show condition 2.
33
By definition, Tn =∏kn
j=1(1 + itZnj) =∏
1≤j≤J(1 + itZnj). Thus,
|Tn| =J−1∏
k=1
(1 + t2X2nk)
1/2|1 + tXnJ |
≤ exp
(t2
2
J−1∑
k=1
X2nk
)(1 + |t| · |XnJ |)
≤ et2(1 + |t| · max
j|Xnj |)
which is uniform integrable by condition 1 in Theorem 16.
The next is an extreme limit theorem, which is different than the CLT. The limiting
distribution is called an extreme distribution.
THEOREM 18 Let Xi; 1 ≤ i ≤ n be i.i.d. N(0, 1) random variables. Let Wn = max1≤i≤nXi.
Then
P
(Wn ≤
√2 log n− (log2 n) + x
2√
2 log n
)→ e
− 12√
πex/2
for any x ∈ R, where log2 n = log(log n).
Proof. Let tn be the right hand side of “≤” in the probability above. Then
P (Wn ≤ tn) = P (X1 ≤ tn)n = (1 − P (X1 > tn))n.
Since (1−xn)n ∼ e−a as n→ ∞ if xn ∼ a/n. To prove the theorem, it suffices to show that
P (X1 > tn) ∼ 1
2√π n
ex/2. (5.9)
Actually, we know that
P (X1 > x) ∼ 1√2π x
e−x2/2
as x→ +∞. It is easy to calculate that
1√2π tn
∼ 1
2√π log n
andt2n2
∼ log n− log(√
log n) − x
2
as n→ ∞. This leads to (5.9).
34
As an application of the CLT of i.i.d. random variables, we will be able to derive the
classical χ2-test as follows.
Let’s consider a multinomial distribution with n trails, k classes with parameter p =
(p1, · · · , pk). Roll a die twenty times. Let Xi be the number of the occurrences of “i dots”,
1 ≤ i ≤ 6. Then (X1,X2, · · · ,X6) follows a multinomial distribution with “success rate”
p = (p1, · · · , p6). How to test the die is fair or not? From an introductory course, we know
that, the null hypothesis is H0 : p1 = · · · = p6 = 1/6. The test statistic is
χ2 =6∑
i=1
(Xi − npi)2
npi.
We use the fact that χ2 is roughly χ2(5) as n is large. We will prove this next.
In general, Xn = (Xn,1, · · · ,Xn,k) follows a multinomial distribution with n trails and
“success rate” p = (p1, · · · , pk). Of course,∑k
i=1Xn,i = n. To be more precise,
P (Xn,1 = xn,1, · · · ,Xn,k = xn,k) =n!
xn,1! · · · xn,k!p
xn,1
1 · · · pxn,k
k ,
where∑k
i=1 xn,i = n. Now we prove the following theorem,
THEOREM 19 As n→ ∞,
χ2n =
k∑
i=1
(Xi − npi)2
npi=⇒ χ2(k − 1).
We need a lemma.
LEMMA 5.9 Let Y ∼ Nk(0,Σ). Then
‖Y ‖2 d=
k∑
i=1
λiZ2i ,
where λ1, · · · , λk are eigenvalues of Σ and Z1, · · · , Zk are i.i.d. N(0, 1).
Proof. Decompose Σ = Odiag(λ1, · · · , λk)OT for some orthogonal matrix O. Then the k
coordinates of OY are independent with mean zero and variances λi’s. So ||Y ‖2 = ‖OY ‖2
is equal to∑k
i=1 λiZ2i , where Zi’s are i.i.d. N(0, 1).
Another fact we will use is that AXn =⇒ AX for any k×k matrix A if Xn =⇒ X, where
Xn and X are Rk-valued random vectors. This can be shown easily by the Cramer-Wold
device.
35
Proof of the Theorem. Let Y1, · · · , Yn be i.i.d. with a multinomial distribution with 1
trails and “success rate” p = (p1, · · · , pk). Then Xn = Y1+· · ·Yn. Each Yi is a k-dimensional
vector. By CLT,
Xn − EXn√n
=⇒ Nk(0,Cov(Y1)),
where it is easy to see that EXn = np and
Σ1 := Cov(Y1) =
p1(1 − p1) −p1p2 · · · −p1pk
−p1p2 p2(1 − p2) · · · −p2pk
· · · · · · · · · · · ·−p1pk −p2pk · · · pk(1 − pk)
.
Set D = diag(1/√p1, · · · , 1/√pk). Then
DΣ1DT == I −
√p1
√p2
...√pk
√p1
√p2
...√pk
T
.
Since∑k
i=1 pi = 1, the above matrix is symmetric and idempotent or a projection. Also,
trace of Σ1 is k−1. So Σ1 has eigenvalue 0 with one multiplicity and 1 with k−1 multiplicity.
Therefore, by the lemma, we have that
D(Xn − np)√n
=⇒ N(0,DΣ1DT ).
By the lemma, since g(x) = ‖x‖2 is a continuous function on Rk,
k∑
i=1
(Xi − npi)2
npi=
∥∥∥∥D(Xn − np)√
n
∥∥∥∥2
=⇒k−1∑
i=1
Z2i
d= χ2(k − 1).
6 Central Limit Theorems for Stationary Random Variables
and Markov Chains
Central limit theorems hold not only for independent random variables but also for depen-
dent random variables. In the next we will study the Central Limit Theorems for α-mixing
random variables and and martingale (a special class of dependent random variables).
Let X1,X2, · · · be a sequence of random variables. For each n ≥ 1, define αn by
α(n) := sup |P (A ∩B) − P (A)P (B)| (6.10)
36
where the supremum is taken over all A ∈ σ(X1, · · · ,Xk), B ∈ σ(Xk+n,Xk+n+1, · · · ), and
k ≥ 1. Obviously, α(n) is decreasing. When α(n) → 0, then Xk and Xk+n are approximately
independent. In this case the sequence Xn is said to be α-mixing. If the distribution of
the random vector (Xn,Xn+1, · · · ,Xn+j) does not depend on n, the sequence is said to be
stationary.
The sequence is called m-dependent if (X1, · · · ,Xk) and (Xk+n, · · · ,Xk+n+l) are inde-
pendent for any n > m, k ≥ 1 and l ≥ 1. In this case the sequence is α-mixing with αn = 0
for n > m. An independent sequence is 0-dependent.
For stationary process, we have the following the Central Limit Theorems.
THEOREM 20 Suppose that X1,X2, · · · is stationary with EX1 = 0 and |X1| ≤ A for
some constant A > 0. Let Sn = X1 + · · · +Xn. If∑∞
i=1 α(n) <∞, then
V ar(Sn)
n→ σ2 = E(X2
1 ) + 2
∞∑
k=1
E(X1Xk+1) (6.11)
where the series converges absolutely. If σ > 0, then Sn/√n =⇒ N(0, σ2).
THEOREM 21 Suppose that X1,X2, · · · is stationary with EX1 = 0 and E|X1|2+δ < ∞for some constant δ > 0. Let Sn = X1 + · · · +Xn. If
∑∞i=1 α(n)δ/(2+δ) <∞, then
V ar(Sn)
n→ σ2 = E(X2
1 ) + 2
∞∑
k=1
E(X1Xk+1)
where the series converges absolutely. If σ > 0, then Sn/√n =⇒ N(0, σ2).
Easily we have the following corollary.
COROLLARY 6.1 Let X1,X2, · · · be m-dependent random variables with the same distri-
bution. If EX1 = 0 and EX2+δ1 < ∞ for some δ > 0. Then Sn/
√n → N(0, σ2), where
σ2 = EX21 + 2
∑m+1i=1 E(X1Xi).
Remark. If we argue by using the truncation method as in the proof of Theorem 21, we
can show that the above corollary still holds only under EX21 <∞.
LEMMA 6.1 If Y ∈ σ(X1, · · · ,Xk) and Z ∈ σ(Xk+n,Xk+n+1, · · · ,Xk+m) for n ≤ m ≤+∞. Assume |Y | ≤ C and |Z| ≤ D. Then
|E(Y Z) − (EY )EZ| ≤ 4CDα(n).
37
Proof. W.l.o.g, assume that C = D = 1. Now since expectations can be approximated by
finite sums, so w.l.o.g, we further assume that
Y =m∑
i=1
yiAi and Z =n∑
j=1
zjBj
where Ai’s and Bj ’s are respectively partitions of the sample space. Then
|E(Y Z) − (EY )EZ| = |∑
i
yi
∑
j
zj (P (Ai ∩Bj) − P (Ai)P (Bj)) |
≤∑
i
|∑
j
zj (P (Ai ∩Bj) − P (Ai)P (Bj)) |.
Let ξi = 1 or −1 depends on whether or not zj(P (Ai ∩Bj)−P (Ai)P (Bj) ≥ 0 or not. Then
the last sum is equal to∑
i
∑j ξizj(P (Ai ∩ Bj) − P (Ai)P (Bj)) which is also identical to
∑j zj
∑j ξi(P (Ai ∩ Bj) − P (Ai)P (Bj)). By the same argument, there are ηi equal to 1 or
−1 such that the last quantity is bounded by∑
j
∑j ξiηj(P (Ai ∩ Bj) − P (Ai)P (Bj)). Let
E1 be the union of Ai such that ξi = 1 and E2 = Ec1; Let F1 and F2 be defined similarly for
Bj according to the sign of ηj’s. Then
|E(Y Z) − (EY )EZ| ≤∑
1≤i,j≤2
|P (Ei ∩ Fj) − P (Ei)P (Fj)| ≤ 4α(n)
LEMMA 6.2 If Y ∈ σ(X1, · · · ,Xk) and Z ∈ σ(Xk+n,Xk+n+1, · · · ). Assume for some
δ > 0, E(Y 2+δ) ≤ C and E(Z2+δ) ≤ D. Then
|E(Y Z) − (EY )EZ| ≤ 5(1 + C +D)α(n)δ/(2+δ).
Proof. Let Y0 = Y I(|Y | ≤ a) and Y1 = Y I(|Y | > a); and Z0 = ZI(|Z| ≤ a) and
Z1 = ZI(|Z| > a). Then
|E(Y Z) − (EY )EZ| ≤∑
1≤i,j≤2
|E(YiZj) − (EYi)EZj |.
By the previous lemma, the term corresponding to i = j = 0 is bounded by 4a2α(n). For
i = j = 1, |E(Y1Z1) − (EY1)EZ1| = |Cov(Y1, Z1)| which is bounded by√E(Y 2
1 )E(Z21 ) ≤√
CD/aδ since E(Y 21 ) ≤ C/aδ. Now
|E(Y0Z1) − (EY0)EZ1| ≤ E(|Y0 −EY0| · |Z1 − EZ1|) ≤ 2a · 2E|Z1|
which is less than 4D/aδ . The term for i = 1, j = 0 is bounded by 4C/aδ . In summary
|E(Y Z) − (EY )EZ| ≤ 4a2α(n) +
√CD + 4C + 4D
a2≤ 4a2α(n) +
4.5(C +D)
aδ.
38
Now take a = α(n)−1/(2+δ). The conclusion follows.
Proof of Theorem 20. By Lemma 6.2, |EX1X1+k| ≤ 4A2α(k), so the series (6.11) con-
verges absolutely. Moreover, let ρk = E(X1X1+k). Then ES2n = nEX2
1+2∑
1≤i<j≤nE(XiXj) =
nρ0 + 2∑n−1
k=1(n− k)ρk. Then it is to see that
E(S2n)
n→ ρ0 + 2
∞∑
k=1
ρk. (6.12)
We claim that
E(S4n) = n3δn (6.13)
for a sequence of constants δn → 0. Actually, E(S4n) ≤∑1≤i,j,k,l≤n |E(XiXjXkXl)|. Among
the terms in the sum there are n terms of X4i ; at most n2 of X2
i X2j and X3
i Xj respectively
for i 6= j. So the contribution of these terms is bounded by Cn2 for an universal constant
C. There are only two type of terms left: X2i XjXk and XiXjXkXl where these indices are
all different. We will deal with the second type of terms only. The first one can be done
similarly and more easily. Note that
∑
1≤i,j,k,l≤n
|E(XiXjXkXl)| ≤ 4!n∑
1≤i+j+k≤n
|E(X1X1+iX1+i+jX1+i+j+k)| (6.14)
where the three indices in the last sum are not zero. By Lemma 6.2, the term inside is
bounded by 4A4αi. By the same argument, it is also bounded by 4A4αk. Therefore, the
sum in (6.14) is bounded by
4 · 4!A4n2∑
2≤i+k≤n
minαi, αk ≤ 2Kn2∑
2≤i+k≤n, i≤k
minαi, αk
≤ 2Kn2n∑
k=1
kαk = nδn (6.15)
for a sequence of constants δn such that 0 < δn → 0. This is because
n∑
k=1
kαk =∑
k≤√n
kαk +∑
√n<k≤n
kαk
≤ √n
∞∑
k=1
αk + n∑
k≥√n
αk = o(n).
39
Therefore (6.13) follows. From this, we know that ES4n ≤ nδn ≤ n3(δn + (1/n)) ≤ n3δ′n
where δ′n := supk≥nδk + (1/k); k ≥ n is decreasing to 0 and is bounded below by 1/n.
Thus, w.l.o.g., assume δn has this property.
Define pn = [√n log(1/δ[
√n])] and qn = [n/pn] and rn = [n/(pn + qn)]. Then
√n≪ pn ≤ √
n log n andp2
nδpn
n≤ δpn log2
(1
δ[√
n]
)≤ δpn log2
(1
δpn
)→ 0 (6.16)
as n→ ∞. Now for finite n, break X1,X2, · · · ,Xn into some blocks as follows:
X1, · · · ,Xpn | Xpn+1, · · · ,Xpn+qn | Xpn+qn+1, · · · ,X2pn+qn | · · · ,Xn
For i = 1, 2, · · · , rn, set
Un,i = X(i−1)(pn+qn)+1 +X(i−1)(pn+qn)+2 + · · · +Xi(pn+qn);
Vn,i = Xipn+(i−1)qn+1 +Xipn+(i−1)qn+2 + · · · +Xi(pn+qn);
Wn = Xrn(pn+qn)+1 +Xrn(pn+qn)+1 + · · · +Xn,
and S′n =
∑rni=1 Un,i and S′′
n =∑rn
i=1 Vn,i. In what follows, we will show that S′′n and Wn are
negligible. Now, by (6.12) and its reasoning
V ar(S′′n) = rnV ar(Vn,1) + 2
rn∑
k=2
(rn − k)E(Vn,1Vn,k)
≤ Crnqn + 2rn
rn∑
k=2
|E(Vn,1Vn,k)|.
By assumption, |Vn,i| ≤ Aqn for any i ≥ 1. By Lemma 6.1, |E(Vn,1Vn,k)| ≤ 4A2q2nα(k−1)pn.
Thus, by monotoncity,
rn∑
k=2
|E(Vn,1Vn,k)| ≤ 4A2q2n
rn∑
k=2
α(k−1)pn≤ 4A2q2n
rn∑
k=2
1
pn
(k−1)pn∑
j=(k−2)pn+1
αj
≤ 4A2q2npn
∞∑
j=1
αj .
Since qn/pn → 0, the above two inequalities imply that
V ar(S′′n)
n→ 0 (6.17)
as n→ ∞. This says that S′′n/
√n→ 0 in probability. By Chebyshev’s inequality and (6.12)
we see that Wn/√n→ 0 in probability. Thus, to prove the theorem, it suffices to show that
S′n/
√n =⇒ N(0, σ2). This will be done through the following two steps
Sn√n
=⇒ N(0, σ2) and EeitS′n/
√n − EeitSn/
√n → 0 (6.18)
40
for any t ∈ R, where Sn =∑rn
i=1 U′n,i and U ′
n,i; ı = 1, 2, · · · , rn are i.i.d. with the same
distribution as that of Un,1. First, by (6.12), V ar(Sn) = rnV ar(Un,1) ∼ rnpnσ2. It follows
that V ar(Sn)/n→ σ2. Also, EU ′n,i = 0. By (6.13) and (6.16),
1
V ar(Sn)2
rn∑
i=1
E(U ′n,i)
4 ∼ rnn2σ4
E(U4n,1) =
rnp3nδpn
n2σ4∼ σ−4 p
2nδpn
n→ 0
as n → ∞. By the Lyapounov central limit theorem, we know the first assertion in (6.18)
holds. Last, by Lemma 6.12,
|EeitS′n/
√n − (EeitUn,1/
√n)E exp(it
rn∑
j=2
Un,j/√n)| ≤ 16α(qn).
Then repeat this inequality for E exp(it∑rn
j=2 Un,j/√n) and so on, we obtain that
|EeitS′n/
√n − EeitSn/
√n| = |EeitS′
n/√
n −rn∏
j=1
EeitU′n,j/
√n| ≤ 16α(qn)rn.
Since α(n) is decreasing and∑
n≥1 α(n) < ∞, we have nα(n)/2 ≤ ∑[n/2]≤i≤n α(i) → 0.
Thus the above bound is o(rn/qn) = o(n/(pnqn)) → 0 as n → ∞. The second statement in
(6.18) follows. The proof is complete.
Proof of Theorem 21. By Lemma 6.2, |EX1Xk| ≤ 5(1 + 2E|X1|2+δ)α(n)2/(2+δ),
which is summable by the assumption. Thus the series EX21 +2
∑∞i=2E(X1Xk) is absolutely
convergent. By the exact same argument as in proving (6.11), we obtain that V ar(Sn)/n →σ2.
Now for a given constant A > 0 and all i = 1, 2, · · · , define
Yi = XiI|Xi| ≤ A − E(XiI|Xi| ≤ A) and Zi = XiI|Xi| > A − E(XiI|Xi| > A),
and S′n =
∑ni=1 Yi and S′′
n =∑n
i=1 Zi. Obviously, Sn = S′n + S′′
n. By Theorem 20,
Sn√n
=⇒ N(0, σ(A)2) (6.19)
as n → ∞ for each fixed A > 0, where σ(A)2 := EY 21 + 2
∑∞i=2EY1Yk. By the summable
assumption on α(n), it is easy to check that both the sum in the definition of σ(A)2 and
σ(A)2 := EZ21 + 2
∑∞i=2E(Z1Zk) are absolutely convergent, and
σ(A)2 → σ2 and σ(A)2 → 0 (6.20)
41
as A→ +∞. Now
lim supn→∞
P (|S′′
n|√n
≥ ǫ) ≤ lim supn→∞
V ar(S′′n)
nǫ2→ σ(A)
ǫ2. (6.21)
by using the same argument as in (6.2). Since P (Sn/√n ∈ F ) ≤ P (S′
n/√n ∈ F ǫ) +
P (|S′′n|/
√n ≥ ǫ) for any closed set F ⊂ R and any ǫ > 0, where F ǫ is the closed ǫ-
neighborhood of F. Thus, from (6.19) and (6.21)
lim supn→∞
P (Sn√n∈ F ) ≤ P (N(0, σ(A)2) ∈ F ǫ) + lim sup
n→∞P (
|S′′n|√n
≥ ǫ)
≤ P
(N(0, σ2) ∈ σ
σ(A)F ǫ
)+σ(A)
ǫ2
for any A > 0 and ǫ > 0. First let A→ +∞, and then let ǫ→ 0+ to obtain from (6.20) that
lim supn→∞
P (Sn√n∈ F ) ≤ P (N(0, σ2) ∈ F ).
The proof is complete.
Define by α(n) the supremum of
|E (f(X1, · · · ,Xk)g(Xk+n, · · · ,Xk+m)) −Ef(X1, · · · ,Xk) ·Eg(Xk+n, · · · ,Xk+m)| (6.22)
over all k ≥ 1,m ≥ 0 and measurable functions f(x) defined on Rm and g(x) defined on
Rm−n+1 with |f(x)| ≤ 1 and |g(x)| ≤ 1 for all x ∈ R.
LEMMA 6.3 The following are true:
(i) α(n) = supk≥1
supm≥0
supA∈σ(X1,··· ,Xk), B∈σ(Xk+n,··· ,Xk+n+m)
|P (A ∩B) − P (A)P (B)|.
(ii) For all n ≥ 1, we have that α(n) ≤ α(n) ≤ 4α(n).
Proof. (i) For two sets A and B, let A∆B = (A\B)∪ (B\A), which is their symmetric
difference. Then IA∆B = |IA − IB |. Let Zi; i ≥ 1 be a sequence of random variables. We
claim that
σ(Z1, Z2, · · · ) = B ∈ σ(Z1, Z2, · · · ); for any ǫ > 0, there exists l ≥ 1 and (6.23)
C ∈ σ(Z1, · · · , Zl) such that P (B∆C) < ǫ. (6.24)
If this is true, then for any ǫ > 0, there exists l < ∞ and C ∈ σ(Z1, · · · , Zl) such that
|P (B) − P (C)| < ǫ. Thus (ii) follows.
42
Now we prove this claim. First, note that σ(Z1, Z2, · · · ) = σZi ∈ Ei; Ei ∈ B(R); i ≥1. Denote by F the right hand side of (6.23). Then F contains all Zi ∈ Ei; Ei ∈B(R); i ≥ 1. We only need to show that the right hand side is a σ-algebra.
It is easy to see that (i) Ω ∈ F ; (ii) Bc ∈ F if B ∈ F since A∆B = Ac∆Bc. (iii) If Bi ∈ Ffor i ≥ 1, there exist mi < ∞ and Ci ∈ σ(Z1, · · · , Zmi) such that P (Bi∆Ci) < ǫ/2i for all
i ≥ 1. Evidently, ∪ni=1Bi ↑ ∪∞
i=1Bi as n → ∞. Therefore, there exists n0 < ∞ such that
|P (∪∞i=1Bi)−P (∪n0
i=1Bi)| ≤ ǫ/2. It is easy to check that (∪n0i=1Bi)∆(∪n0
i=1Ci) ⊂ ∪n0i=1(Bi∆Ci).
Write B = ∪∞i=1Bi and B = ∪n0
i=1Bi and C = ∪n0i=1Ci. Note B∆C ⊂ (B\B)∪ (B∆C). These
facts show that P (B∆C) < ǫ and C ∈ σ(Z1, Z2, · · · , Zl) for some l < ∞. Thus F is a
σ-algebra.
(ii) Take f and g to be indicator functions, from (i) we get α(n) ≤ α(n). By Lemma
6.1, we have that α(n) ≤ 4α(n).
Let Xi; i ≥ 1 be a sequence of random variables. We say it is a Markov chain if
P (Xk+1 ∈ A|X1, · · · ,Xk) = P (Xk+1 ∈ A|Xk) for any Borel set A and any k ≥ 1. A Markov
chain has the property that given the present observations, the past and future
observations are (conditionally) independent. This is literally stated in Proposition
10.3.
Let’s look at the strong mixing coefficient α(n). Recall the definition of α(n). IfX1,X2, · · ·is a Markov chain, for simplicity, we write f and g, respectively, for f(X1, · · · ,Xk) and
g(Xk+n, · · · ,Xk+m). Let f = E(f |Xk+1)) and g = E(g|Xk+n−1). Then by Proposition 10.3,
E(fg) = E(f g), and Ef = Ef and Eg = Eg. Thus,
E(fg) − (Ef)Eg = E(f g) − (Ef)Eg.
The point is that f is Xk+1 measurable, and g is Xk+n−1 measurable. Therfore
α(n) ≤ sup|f |≤1 |g|≤1
|E(f(Xk+1)g(Xk+n−1) − (Ef(Xk+1))Eg(Xk+n−1)| (6.25)
≤ 4 supA, B
|P (Xk+1 ∈ A,Xk+n−1 ∈ B) − P (Xk+1 ∈ A)P (Xk+n−1 ∈ B)| (6.26)
by Lemma 6.1 by taking k there to be k + 1 and k + n to be k + n+ 1 and Xi = 0 for all
positive integer i excluding k + 1 and k + n+ 1.
If the Markov process is stationary, then
α(n+ 1) ≤ 4 supA, B
|P (X1 ∈ A,Xn ∈ B) − P (X1 ∈ A)P (Xn ∈ B)|. (6.27)
43
Let Pn(x,B) represent the probability ofXn+1 ∈ B given X1 = x, and π be the marginal
probability distribution of X1. Then
P (Xn+1 ∈ B, X1 ∈ A) =
∫
APn(x,B)π( dx)
From this,
α(n+ 2) ≤ 4 · supA,B
|∫
A(Pn(x,B) − P (Xn ∈ B))π( dx)|.
For two probability measures µ and ν, define its total variation distance ‖µ−ν‖ = supA |µ(A)−ν(A)|. Then
α(n + 2) ≤ 4
∫‖Pn(x, ·) − π‖π( dx), n ≥ 2.
In practice, for example, in the theory of Monto Carlo Markov Chains, we have some
conditions to guarantee the convergence rate of ‖Pn(x, ·) − π‖. In other words, we have a
function M(x) ≥ 0 and γ(n) such that
‖Pn(x, ·) − π‖ ≤M(x)γ(n)
for all x and n. Here are some examples:
If a chain is so-called polynomially ergodic of order m, then γ(n) = n−m;
If a chain is geometrically ergodic, then γ(n) = tn for some t ∈ (0, 1);
If a chain is uniformly geometrically ergodic, then γ(n) = tn for some t ∈ (0, 1) and
supxM(x) <∞.
These three conditions, combining with the Central Limit Theorems of stationary ran-
dom variables stated in Theorems 20 and 21, imply the following CLT for Markov chains
directly.
Let EπM =∫M(x)π( dx).
THEOREM 22 Suppose X1,X2, · · · is a stationary Markov chain with L(X1) = π and
EπM <∞. If the chain is Geometrically ergodic, in particular, if it is uniformly geometri-
cally ergodic, then
Sn − nEX1√n
=⇒ N(0, σ2) (6.28)
where σ2 = E(X21 ) + 2
∑i=2 ∞E(X1Xk).
44
THEOREM 23 Suppose X1,X2, · · · is a stationary Markov chain with L(X1) = π and
EπM <∞.
(i) If |X1| ≤ C for some constant C > 0, and the chain is polynomially ergodic of order
m > 1, then (6.28) holds.
(ii) If the chain is polynomially ergodic of order m, and E|X1|2+δ <∞ for some δ > 0
satisfying mδ > 2 + δ, then (6.28) holds.
Remark. In practice, given an initial distribution λ, and a transitional kernel P (x, A),
here is the way to generate a Markov chain: randomly draw X0 = x under distribution λ,
randomly draw X1 under distribution P (X0, ·) and Xn+1 under P (Xn, ·) for any n ≥ 1.
Through this a Markov chain X1,X2, · · · with initial distribution λ and transitional kernel
P (x, ·) has been generated. Let π be the stationary distribution of this chain, that is,
π(·) =
∫P (x, ·)π( dx).
Theorem 17.1.6 (on p. 420 from “Markov Chains and Stochastic Stability” by S.P. Meyn
and R.L. Tweedie) says that a Markov CLT holds for a certain initial distribution λ0, it
then holds for any initial distribution λ. In particular, if Theorem 22 or Theorem 23 (λ = π
in this case) holds, then the CLT holds for the Markov chain of the same transitional prob-
ability kernel P (x, ·) and stationary distribution π for any initial distribution λ.
7 U-Statistics
Let X1, · · · ,Xn be a sequence of i.i.d. random variables. Sometimes we are interested in
estimating θ = Eh(X1, · · · ,Xm) for 1 ≤ m ≤ n. To make discussion simple, we assume
h(x1, · · · , xm) is a symmetric function. If X(1) ≤ X(2) ≤ · · · ≤ X(n), the order statistic, is
sufficient and complete, then
Un =1(nm
)∑
1≤i1<···<im≤n
h(Xi1 , · · · ,Xim) (7.1)
is an unbiased estimator of θ and is also a function of X(1), · · · ,X(n), and hence is uniformly
minimu variance unbiased estimator (UMVUE). Tosee Un is a function of X(1), · · · ,X(n),
one can rewrite it as
Un =1(nm
)∑
1≤i1<···<im≤n
h(X(i1), · · · ,X(im)). (7.2)
We can see that Un is a conditional expectation of h(x1, · · · ,Xm). In fact, the following is
true.
45
PROPOSITION 7.1 The equality Un = E(h(X1, · · · ,Xm)|X(1), · · · ,X(n)) holds for all 1 ≤m ≤ n.
Proof. One can see that
E(h(X1, · · · ,Xm)|X(1), · · · ,X(n)) = E(h(Xi1 , · · · ,Xim)|X(1), · · · ,X(n))
for any 1 ≤ i1 < · · · < im ≤ n. Summing both sidees above over all such indices and then
dividing by(nm
), we obtain
E(h(X1, · · · ,Xm)|X(1), · · · ,X(n)) =1(nm
)∑
1≤i1<···<im≤n
h(X(i1), · · · ,X(im)) = Un
by (7.2)
Example Let θ = E(X21 ). We know E(X2
1 ) = (1/2)E(X1 −X2)2. Taking h(x1, x2) =
(1/2)(x1 − x2)2, then the corresponding U -statistic is
Un =2
n(n− 1)· 1
2
∑
1≤i<j≤n
(Xi −Xj)2 =
1
n− 1
n∑
I=1
(Xi − X)2 = S2,
which is the sample variance.
Proof. One-sample Wilcoxon statistic θ = P (X1 +X2 ≤ 0). The statistic is
Un =2
n(n− 1)
∑
1≤i<j≤n
I(Xi +Xj ≤ 0)
which is U -statistic.
Example. Gini’s mean difference is to measure the concentration of the distribution,
the parameter is θ = E|X1 −X2|. The statistic is
Un =2
n(n− 1)
∑
1≤i<j≤n
|Xi −Xj |.
This is again a U -statistic.
Now we calculate the variance of U -statistics. Assume E[h(X1, · · · ,Xm)2] <∞. Set
hk(x1, · · · , xk) = E (h(X1, · · · ,Xm)|X1 = x1, · · · ,Xk = xk)
= Eh(x1, · · · , xk,Xk+1, · · · ,Xm). (7.3)
Evidently, hm(x1, · · · , xm) = h(x1, · · · , xm).
46
THEOREM 24 (Hoeffding’s Theorem) For a U -statistic given by (7.1) with E[h(X1, · · · ,Xm)2] <
∞,
V ar(Un) =1(nm
)m∑
k=1
(m
k
)(n−m
m− k
)ζk
where ζk = V ar(hk(X1, · · · ,Xk)).
Proof. Consider two sets i1, · · · , im and j1, · · · , jm ofm distinct integers from 1, 2, · · · , nwith exactly k integers in common. Then the total number of distinct choices is
(nm
)(mk
)(n−mm−k
).
Now
V ar(Un) =1(
nm
)2∑
Cov(h(Xi1 , · · · ,Xim), h(Xj1 , · · · ,Xjm)),
where the sum is over all indices 1 ≤ i1 < · · · < im ≤ n and 1 ≤ j1 < · · · < jm ≤ n.
If i1, · · · , im and j1, · · · , jm have no integers in common, the covariance is zero by
independence. Now we classfisy the sum over the number of integers in common. Then
V ar(Un) =1(nm
)2∑m
k=1
(n
m
)(m
k
)(n−m
m− k
)·
Cov(h(X1, · · · ,Xm), h(X1, · · · ,Xk,Xm+1, · · · ,X2m−k))
=1(nm
)m∑
k=1
(m
k
)(n−m
m− k
)· V ar(ζk)
where the fact (from conditioning) that the covariance above equal to V ar(ζk) is used in
the last step.
LEMMA 7.1 Under the condition of Theorem 24,
(i) 0 = ζ0 ≤ ζ1 ≤ · · · ≤ ζm = V ar(h);
(ii) For fixed m and some k ≥ 1 such that ζ0 = · · · = ζk−1 = 0 and ζk > 0, we have
V ar(Un) =k!(m
k
)2ζk
nk+O
(1
nk+1
)
as n→ ∞.
Proof. (i) For any 1 ≤ k ≤ m− 1,
hk(X1, · · · ,Xk) = E (h(X1, · · · ,Xn)|X1, · · · ,Xk)
= E (hk+1(X1, · · · ,Xk+1)|X1, · · · ,Xk) .
47
Then by Jensen’s inequality, E(h2k) ≤ E(h2
k+1). Note that Ehk = Ehk+1. Then V ar(hk) ≤V ar(hk+1). So (i) follows.
(ii) By Theorem 24,
V ar(Un) =1(nm
)(m
k
)(n−m
m− k
)ζk +
1(nm
)m∑
i=k+1
(m
i
)(n−m
m− i
)ζi.
Since m is fixed,(nm
)∼ nm/m! and
(n−mm−i
)≤ nm−i for any k + 1 ≤ i ≤ m. Thus the last
term above is bounded by O(n−(k+1)) as n→ ∞. Now
1(nm
)(m
k
)(n−m
m− k
)ζk
=m!(m
k
)ζk
n(n− 1) · · · (n−m+ 1)· (n−m)!
(m− k)!(n − 2m+ k)!
=k!(m
k
)2ζk
nk· n · n · · ·nn(n− 1) · · · (n− k + 1)
· (n −m) · · · (n− 2m+ k + 1)
(n− k) · · · (n−m+ 1)
=k!(m
k
)2ζk
nk·
k−1∏
i=1
(1 − i
n
)−1
·2m−k−1∏
i=m
(1 − i
n
)·
m−1∏
i=k
(1 − i
n
)−1
.
Note that since m is fixed, each of the above product is equal to 1+O(1/n). The conclusion
then follows.
Let X1, · · · ,Xn be a sample, and Tn be a statistic based on this sample. The projection
of Tn on some random variables Y1, · · · , Yp is defined by
T ′n = ETn +
p∑
i=1
(E(Tn|Yi) −ETn).
Suppose Tn is a symmetric function ofX1, · · · ,Xn. Set ψ(Xi) = E(Tn|Xi) for i = 1, 2, · · · , n.Then ψ(X1), · · · , ψ(Xn) are i.i.d. random variables with mean ETn. If V ar(Tn) <∞, then
1√nV ar(ψ(X1))
n∑
i=1
(ψ(Xi) − ETn) → N(0, 1) (7.4)
in distribution as n→ ∞. Now Let T ′n be the projection of Tn on X1, · · · ,Xn. Then
Tn − T ′n = (Tn − ETn) −
n∑
i=1
(ψ(Xi) − ETn). (7.5)
We will next show that Tn − T ′n is negligible comparing the order appeared in (7.4). Then
the CLT holds for Tn by (7.4) and thre Slusky lemma.
48
LEMMA 7.2 Let Tn be a symmetric statistic with V ar(Tn) < ∞ for every n, and T ′n be
the projection of Tn on X1, · · · ,Xn. Then ET ′n = ETn and
E(Tn − T ′n)2 = V ar(Tn) − V ar(T ′
n).
Proof. Since ETn = ET ′n,
E(Tn − T ′n)2 = V ar(Tn) + V ar(T ′
n) − 2Cov(Tn, T′n).
First, V ar(T ′n) = nV ar(E(Tn|X1)) by independence. Second, by (7.5)
Cov(Tn, T′n)
= V ar(Tn) + nV ar(E(Tn|X1)) − 2n∑
i=1
Cov(Tn − ETn)(ψ(Xi) − ETn). (7.6)
Now the i-th covariance above is equal to E(TnE(Tn|Xi))−E(E(Tn|Xi))2 = V ar(E(Tn|X1)).
We already see V ar(T ′n) = nV ar(E(Tn|X1)). So the proof is complete by the two facts and
(7.6).
THEOREM 25 Let Un be the statistic given by (7.1) with E[h(X1, · · · ,Xm)]2 <∞.
(i) If ζ1 > 0, then√n(Un − ETn) → N(0,m2ζ1) in distribution.
(ii) If ζ1 = 0 and ζ2 > 0, then
n(Un − EUn) → m(m− 1)
2
∞∑
j=1
λj(χ21j − 1),
where χ21j ’s are i.i.d. χ2(1)-random variables, and λj ’s are constants satisfying
∑∞j=1 λ
2j =
ζ2.
Proof. We will only prove (i) here. The proof of (ii) is very technical, it is omitted. Let
U ′n be the projection of Un on X1, · · · ,Xn. Then
U ′n = Eh+
1(nm
)∑
1≤i1<···<im≤n
E(h(Xi1 , · · · ,Xik)|Xi) − Eh.
Observe that
1(nm
)(∑
1≤i1<···<im≤n
E(h(Xi1 , · · · ,Xik)|X1) − Eh)
=1(nm
)∑
2≤i2<···<im≤n
(h1(X1) − Eh)
=
(n−1m−1
)(nm
) (h1(X1) − Eh) =m
n(h1(X1) − Eh).
49
Thus
U ′n = Eh+
m
n
n∑
i=1
(h1(Xi) − Eh).
This says that V ar(U ′n) = (m2/n)ζ1, where ζ1 = V ar(h1(X1)). By lemma 7.1, V ar(Un) =
(m2/n)ζ1+O(n−2). From Lemma 7.2, V ar(Un−U ′n) = O(n−2) as n→ ∞. By the Chebyshev
inequality,√n(Un − U ′
n) → 0 in probability. The result follows from (7.4) and the Slusky
lemma by noting V art(ψ(X1)) = V ar(h1(X1)) = ζ1.
50
8 Empirical Processes
Let X1,X2, · · · be a random sample, where Xi takes values in a metric space M. Define a
probability measure µn such that
µn(A) =1
n
n∑
i=1
I(Xi ∈ A), A ⊂M.
The measure µn is called a probability measure. If X1 takes real values, that is, M = R,
set
Fn(t) =1
n
n∑
i=1
I(Xi ≤ t), t ∈ R, A = (−∞, t].
The random process Fn(t), t ∈ R is called an empirical process. By the LLN and CLT,
Fn(t) → F (t) a.s. and√n(Fn(t) − F (t)) =⇒ N(0, F (t)(1 − F (t))),
where F (t) is the cdf of X1. Among many interesting questions, here we are interested in
the uniform convergence of the above almost sure and weak convergences. Specifically, we
want to answer the following two questions:
1) Does ∆n := supt
|Fn(t) − F (t)| → 0 a.s.?
2) Fix n ≥ 1, regard Fn(t); t ∈ R as one point in a big space, does the CLT hold?
The answers are yes for both questions. We first answer the first question.
THEOREM 26 (Glivenko-Cantelli) Let X1,X2, · · · be a sequence of i.i.d. random vari-
ables with cdf F (t). Then
∆n = supt
|Fn(t) − F (t)| → 0 a.s.
as n→ ∞.
Remark. When F (t) is continuous and strictly increasing in the support of X1, the above
the distribution of ∆n is independent of the distribution of X1. First, F (Xi)’s are i.i.d.
U [0, 1]-distributed random variables. Second,
∆n = supt
| 1n
∑
i
I(Xi ≤ t) − F (t)| = supt
| 1n
∑
i
I(F (Xi) ≤ F (t)) − F (t)|
= sup0≤y≤1
| 1n
∑
i
I(Yi ≤ y) − y|,
51
where Yi = F (Xi); 1 ≤ i ≤ n are i.i.d. random variables with uniform distribution over
[0, 1].
Proof. Set
Fn(t−) =1
n
n∑
i=1
I(Xi < t) and F (t−) = P (X1 < t), t ∈ R.
By the strong law of large numbers, Fn(t) → F (t) a.s. and Fn(t−) → F (t−) a.s. as n→ ∞.
Given ǫ > 0, choose −∞ = t0 < t1 < · · · < tk = ∞ such that F (ti−)−F (ti−1) < ǫ for every
i. Now, for ti−1 ≤ t < ti,
Fn(t) − F (t) ≤ Fn(ti−) − F (ti−) + ǫ
Fn(t) − F (t) ≥ Fn(ti) − F (ti) − ǫ.
This says that
∆n ≤ max1≤i≤k
|Fn(ti) − F (ti)|, |Fn(ti−) − F (ti−)| + ǫ→ ǫ a.s.
as n→ ∞. The conclusion follows by letting ǫ ↓ 0.
Given (t1, t2, · · · , tk) ∈ Rk, it is easy to check that
√n(Fn(t1) − F (t1), · · · , Fn(tk) −
F (tk)) =⇒ Nk(0, Σ) where
Σ = (σij) and σij = F (ti ∧ tj) − F (ti)F (tj), 1 ≤ i, j ≤ n. (8.1)
Let D[−∞,∞] be the set of all functions defined on (−∞,∞) such that they are right-
continuous and the left limits exist everywhere. Equipped with a so-called Skorohod metric,
it becomes a Polish space. The random vector√n(Fn − F ), viewed as an element in
D[−∞,∞], converges weakly to a continuous Gaussian process ξ(t), where ξ(±∞) = 0 and
the covariance structure is given in (8.1). This is usually called a Brownian bridge. It
has same distribution of Gλ F, where Gλ is the limiting point when F is the cumulative
distribution function of the uniform distribution over [0, 1].
THEOREM 27 (Donsker). If X1,X2, · · · are i.i.d. random variables with distribution
function F, then the sequence of empirical processes√n(Fn − F ) converges in distribution
in the space D[−∞,∞] to a random element GF , whose marginal distributions are zero-
mean with covariance function (8.1).
Later we will see that
√n‖Fn − F‖∞ =
√n sup
t|Fn(t) − F (t)| =⇒ sup
t|GF (t)|,
52
where
P
(sup
t|GF (t)| ≥ x
)= 2
∞∑
j=1
(−1)j+1e−2j2x2, x ≥ 0.
Also, the DKW (Dvoretsky, Kiefer, and Wolfowitz) inequality says that
P (√n‖Fn − F‖∞ > x) ≤ 2e−2x2
, x > 0.
Let X1,X2, · · · be i.i.d. random variables taking values in (X ,A) and L(X1) = P. We
write µf =∫f(x)µ( dx). So
Pnf =1
n
n∑
i=1
f(Xi) and Pf =
∫
Xf(x)P ( dx).
By the Glivenko-Cantelli theorem
supx∈R
|Pn((−∞, x]) − P ((−∞, x])| → 0 a.s.
Also, if P has no point mass, then
supany measurable set A
|Pn(A) − P (A)| = 1
because Pn(A) = 1 and P (A) = 0 for A = X1,X2, · · · ,Xn. We are searching for F such
that
supf∈F
|Pnf − Pf | → 0 a.s. (8.2)
The previous two examples say (8.2) holds for F = I(−∞, x], x ∈ R but doesn’t hold for
the power set 2R. The class F is called P -Glivenko-Cantelli if (8.2) holds.
Define Gn =√n(Pn −P ). Given k measurable functions f1, · · · , fk such that Pf2
i <∞for all i, one can also check by the multivariate Central Limit Theorem (CLT) that
(Gnf1, · · · ,Gnfk) =⇒ Nk(0,Σ)
where Σ = (σij)1≤i,j≤n and σij = P (fifj) − (Pfi)Pfj . This tells us that Gnf, f ∈ Fsatisfies CLT in R
k. If F has infinite many members, how we define weak convergence? We
first define a space similar to Rk. Set
l∞(F) = bounded function z : F → R
with norm ‖z‖∞ = supF∈F |z(F )|. Then (l∞(F), ‖ · ‖∞) is a Banach space; it is separable
if F is countable. When F has finite elements, (l∞(F), ‖ · ‖∞) = (Rk, ‖ · ‖∞).
So, Gn : F → R is a random element and can be viewed as an element in Banach space
l∞(F). We say F is P -Donsker if Gn =⇒ G, where G is a tight element in l∞(F).
53
LEMMA 8.1 (Berstein’s inequality). Let X1, · · · ,Xn be independent random variables
mean zero, |Xj | ≤ K for some constant K and all j, and σ2j = EX2
j > 0. Let Sn =∑n
i=1Xi
and s2n =∑n
j=1 σ2j . Then
P (|Sn| ≥ x) ≤ 2e−x2/2(s2n+Kx), x > 0.
Proof. It suffices to show that
P (Sn ≥ x) ≤ e−x2/2(s2n+Kx), x > 0. (8.3)
First, for any λ > 0,
P (Sn > x) ≤ e−λxEeλSn = e−λxΠni=1Ee
λXi .
Now
EeλXj = 1 +
∞∑
i=2
λi
i!EXi
j ≤ 1 +σ2
jλ2
2
∞∑
i=2
(λK)i−2 = 1 +σ2
jλ2
2(1 − λK)≤ exp
(σ2
jλ2
2(1 − λK)
)
if λK < 1. Thus
P (Sn > x) ≤ exp
(−λx+
λ2s2n2(1 − λK)
)
Now (8.3) follows by choosing λ = x/(s2n +Kx).
Recall Gnf =∑n
i=1(f(Xi) − Pf)/√n. The random variable Yi := (f(Xi) − Pf)/
√n
here corresponds to Xi in the above lemma. Note that V ar(∑n
i=1 Yi) = V ar(f(X1)) ≤ Pf2
and ‖Yi‖∞ ≤ 2‖f‖∞/√n. Apply the above lemma to
∑ni=1 Yi, we obtain
COROLLARY 8.1 For any bounded, measurable function f,
P (|Gnf | ≥ x) ≤ 2 exp
(−1
4
x2
Pf2 + x‖f‖∞/√n
)
for any x > 0.
Notice that
P (|Gnf | ≥ x) ≤ 2e−Cx when x is large and
P (|Gnf | ≥ x) ≤ 2e−Cx2when x is small. (8.4)
Let’s estimate Emax1≤i≤m |Yi| provided m is large and P (|Yi| ≥ x) ≤ e−x for all i and
x > 0. Then E|Yi|k ≤ k! for k = 1, 2, · · · . The immediate estimate is
E max1≤i≤m
|Yi| ≤m∑
i=1
E|Yi| ≤ m.
54
By Holder’s inequality,
E max1≤i≤m
|Yi| ≤ (E max1≤i≤m
|Yi|2)1/2 ≤ (
m∑
i=1
E|Yi|2)1/2 ≤√
2m.
Let ψ(x) = ex − 1, x ≥ 0. Following this logic, by Jensen’s inequality
ψ(E max1≤i≤m
|Yi|/2) ≤ m · max1≤i≤m
Eψ(|Yi|/2) ≤ 2m.
Take inverse ψ−1 for both sides. We obtain that
E max1≤i≤m
|Yi| ≤ 2 log(2m).
This together with (8.4) leads to the following lemma.
LEMMA 8.2 For any finite class F of bounded, measurable, square-integrable functions
with |F| elements, there is an universal constant C > 0 such that
C · Emaxf
|Gnf | ≤ maxf
‖f‖∞log(1 + |F|)√
n+ max
f‖f‖P,2
√log(1 + |F|) .
where ‖f‖P,2 = (Pf2)1/2.
Proof. Define a = 24‖f‖∞/√n and b = 24Pf2. Define Af = (Gnf)I|Gnf | > b/a and
Bf = (Gnf)I|Gnf | ≤ b/a, then Gnf = Af +Bf . It follows that
Emaxf
|Gnf | ≤ Emaxf
|Af | + Emaxf
|Bf |. (8.5)
For x ≥ b/a and x ≤ b/a the exponent in the Bernstein inequality is bounded above by
−3x/a and −3x2/b, respectively. By Bernstein’s inequality,
P (|Af | ≥ x) ≤ P (|Gnf | ≥ x ∨ b/a) ≤ 2 exp
(−3x
a
),
P (|Bf | ≥ x) = P (b/a ≥ |Gnf | ≥ x) ≤ P (|Gnf | ≥ x ∧ b/a) ≤ 2 exp
(−3x2
b
)
for all x ≥ 0. Let ψp(x) = exp(xp) − 1, x ≥ 0, p ≥ 1.
Eψ1
( |Af |a
)= E
∫ |Af |/a
0ex dx =
∫ ∞
0P (|Af | > ax)ex dx ≤ 1.
By a similar argument we find that Eψ2(|Bf |/√b) ≤ 1. Because ψp(·) is convex for all p ≥ 1,
by Jensen’s inequality
ψ1
(Emax
f
|Af |a
)≤ Eψ1
(maxf |Af |
a
)≤ E
∑
f
ψ1
( |Af |a
)≤ |F|.
55
Taking inverse function, we have the first term on the right hand side. Similarly we have
the second term. The conclusion follows from (8.5).
We actually use the following lemma above
LEMMA 8.3 Let X be a real valued random variable and f : [0,∞) → R be differentiable
with∫ s0 |f ′(t)| dt <∞ for any s > 0 and
∫∞0 P (X > t)|f ′(t)| dt <∞. Then
Ef(X) =
∫ ∞
0P (X ≥ t)f ′(t) dt + f(0).
Proof. First, since∫ t0 |f ′(t)| dt <∞ for any t > 0, we have
f(X) − f(0) =
∫ X
0f ′(t) dt =
∫ ∞
0I(X ≥ t)f ′(t) dt.
Then, by Fubini’s theorem,
Ef(X) = E
∫ ∞
0I(X ≥ t)f ′(t) dt + f(0) =
∫ ∞
0P (X ≥ t)f ′(t) dt + f(0).
Remark. When f(x) is differentiable, f ′(x) is not necessarily Lebesgue integrable even
though it is Riemann-integrable. The following is an example: Let
f(x) =
x2 cos(1/x2), if x 6= 0;
0, otherwise.
Then
f ′(x) =
2x cos(1/x2) + (2/x) sin(1/x2), if x 6= 0;
0, otherwise
is not Lebesgue-integrable on [0, 1], but obviously Riemann-integrable. We only need to
show g(x) := (2/x) sin(1/x2) is not integrable. Suppose it is,∫ 1
0
1
x
∣∣∣∣sin(
1
x2
)∣∣∣∣ dx =1
2
∫ 1
0
| sin t|t
dt = +∞,
yields a contradiction.
Now we describe the size of F . For any f ∈ F , define
‖f‖P,r = (P |f |r)1/r.
Let l and u be two functions, the bracket [l, u] is set of all functions f such that l ≤ f ≤ u.
An ǫ-bracket in Lr(P ) is a bracket [l, u] such that ‖u − l‖P,r < ǫ. The bracketing number
N[ ](ǫ,F , Lr(P )) is the minimum numbers of ǫ-brackets needed to cover F . The entropy with
bracketing is the logarithm of the bracketing number.
56
8.1 Outer Measures and Expectations
Recall X is a random variable from (Ω,G) → (R,B(R)) if X−1(B) ∈ G for any set B ∈ B(R).
For an arbitrary map X the inverse may not be in G), particularly for small G). For
example, when G = ∅,Ω, many maps are not random variables. In empirical processes,
we will deal with Z =: supt∈T Xt for some index set T. If T is big, then Z may not be
measurable. It does not make sense to study expectations and probabilities related to such
a random variable. But we have another way to get around this.
Definition Let X be an arbitrary map from (Ω,G, P ) → (R,B(R)). Define
E∗X = infEY ; Y ≥ X and Y is a measurable map : (Ω,G) → (R,B(R));P ∗(X ∈ A) = infP (X ∈ B); X ∈ B ∈ G, B ⊃ A, A,B ∈ B(R).
One can easily show that E∗X, as an infimum, can be achieved, i.e., there exists a random
variable X∗ : (Ω,F , P ) → (R, B(R)) such that EX∗ = E∗X. Further, X∗ is P -almost surely
unique, i.e., if there exist two such random variables X∗1 and X∗
2 , then P (X∗1 = X∗
2 ) = 1.
We call X∗ the measurable cover function. Obviously
(X1 +X2)∗ ≤ X∗
1 +X∗2 and X∗
1 ≤ X∗2 if X1 ≤ X2.
One can define E∗ and P∗ similarly.
Let (M,d) be a metric space. A sequence of arbitrary maps Xn : (Ωn,Gn) → (M,d)
converges in distribution to a random vector X if
E∗f(Xn) → Ef(X)
for any bounded, continuous function f defined on (M,d). We still have an analogue of the
Portmanteau theorem.
THEOREM 28 The following are equivalent:
(i) E∗f(Xn) → Ef(X) for every bounded, continuous function f defined on (M,d);
(ii) E∗f(Xn) → Ef(X) for every bounded, Lipschitz function f defined on (M,d), that
is, there is a constant C > 0 such that |f(x) − f(y)| ≤ Cd(x, y) for any x, y ∈M ;
(iii) lim infn P∗(Xn ∈ G) ≥ P (X ∈ G) for any open set G;
(iv) lim supn P∗(Xn ∈ F ) ≤ P (X ∈ F ) for any closed set F ;
(v) lim supn P∗(Xn ∈ H) = P (X ∈ H) for any set H such that P (X ∈ ∂H) = 0, where
∂H is the boundary of H.
57
Let
J[ ](δ,F , L2(P )) =
∫ δ
0
√logN[ ](ǫ,F , L2(P )) dǫ.
The proof of the following thorem is easy. It is omitted.
THEOREM 29 Every class F of measurable functions such that N[ ](ǫ,F , L1(P )) <∞ for
every ǫ > 0 is P -Glivenko-Cantelli.
THEOREM 30 Every class F of measurable functions with J[ ](1,F , L2(P )) < ∞ is P -
Donsker.
To prove the theorem, we need some preparation.
THEOREM 31 A sequence of arbitrary maps Xn = (Xn,t, t ∈ T ) : (Ωn, Gn) → l∞(T )
converges weakly to a tight random element if and only if both of the following conditions
hold:
(i) The sequence (Xn,t1 , · · · ,Xn,tk) converges in distribution in Rk for every finite set
of points t1, · · · , tk in T ;
(ii) for every ǫ, η > 0 there exists a partition of T into finitely many sets T1, · · · , Tk
such that
lim supn→∞
P ∗(
supi
sups,t∈Ti
|Xn,s −Xn,t| ≥ ǫ
)≤ η.
Proof. We only prove sufficiency.
Step 1: A preparation. For each integer m ≥ 1, let Tm1 , · · · , Tm
kmbe a partition of T such
that
lim supn→∞
P ∗(
supj
sups,t∈T m
j
|Xn,s −Xn,t| ≥1
2m
)≤ 1
2m. (8.1)
Since the supremum above becomes smaller when a partition becomes more refined, w.l.o.g.,
assume the partitions are successive refinements as m increases. Define a semi-metric
ρm(s, t) =
0, if s, t belong to the same partitioning set Tmj for some j;
1, otherwise.
Easily, by the nesting of the partitions, ρ1 ≤ ρ2 ≤ · · · . Define
ρ(s, t) =
∞∑
m=1
ρm(s, t)
2mfor s, t ∈ T.
58
Obviously, ρ(s, t) ≤ ∑∞k=m+1 2−k ≤ 2−m when s, t ∈ Tm
j for some j. So (T, ρ) is totally
bounded. Let T0 be the countable ρ-dense subset constructed by choosing an arbitrary
point tmj from every Tmj .
Step 2: Construct the limit of Xn. For two finite subsets of T, S = (s1, · · · , sp) and
U = (s1, · · · , sp, , · · · , sq), by assumption (i), there are two probability measures µp on Rp
and µq on Rq such that
(Xn,s1, · · · ,Xn,sp) =⇒ µp; (Xn,s1, · · · ,Xn,sp , · · · ,Xn,sq) =⇒ µq
µq(A× Rq−p) = µp(A) for any Borel set A ⊂ R
p.
By the Kolmogorov consistency theorem there exists a stochastic process Xt; t ∈ T0 on
some probability space such that (Xn,t1 , · · · ,Xn,tk) =⇒ (Xt1 , · · · ,Xtk) for every finite set
of points t1, · · · , tk in T0. It follows that, for a finite set S ⊂ T,
supi
sups,t∈T m
is,t∈S
|Xn,s −Xn,t| → supi
sups,t∈T m
is,t∈S
|Xs −Xt|
as n→ ∞ for fixed m ≥ 1. By the Portmanteau theorem and (8.1)
P
sup
isup
s,t∈T mi
s,t∈S
|Xs −Xt| >1
2m
≤ lim sup
n→∞P ∗
sup
isup
s,t∈T mi
s,t∈S
|Xn,s −Xn,t| ≥1
2m
≤ 1
2m
for each m ≥ 1. By Monotone Convergence Theorem, since T is countable, let S ↑ T, we
obtain that
P
sup
isup
s,t∈T mi
s,t∈T0
|Xs −Xt| >1
2m
≤ 1
2m.
If ρ(s, t) < 2−m, then ρm(s, t) < 1 (otherwise, ρm(s, t) = 1 implies ρ(s, t) ≥ 2−m). This
says that s, t ∈ Tmj for the same j. So the above inequality implies that
P
sup
ρ(s,t)<2−m
s,t∈T0
|Xs −Xt| >1
2m
≤ 1
2m
for each m ≥ 1. By the Borel-Cantelli lemma, when m is large, with probability one
supρ(s,t)<2−m
s,t∈T0
|Xs −Xt| ≤1
2m
59
as m is large enough. This says that, with probability one, Xt, as a function of t is uniformly
continuous on T0. Extending this from T0 to T, we obtain a ρ-continuous process Xt, t ∈ T.
We then have that with probability one
|Xs −Xt| ≤1
2mfor any s, t ∈ T with ρ(s, t) <
1
2m(8.2)
for sufficiently large m.
Step 3: Show Xn =⇒ X. Define πm : T → T as the map that maps every point in
the partitioning set Tmi onto the point tmi ∈ Tm
i . For any bounded, Lipschitz continuous
function f : l∞(T ) → R,
limn→∞
|E∗f(Xn) − Ef(X)| ≤ limn→∞
|E∗f(Xn) − E∗f(Xn πm)|+ lim
n→∞|E∗f(Xn πm) − Ef(X πm)| + E|f(X πm) − f(X)|.
Xn πm and X πm are essentially finite dimensional random vectors, by assumption (i),
the first one converges to the second in distribution. So the middle term above is zero.
Now, if s, t ∈ Tmi , ρ(s, t) ≤ 2−m. It follows that
‖X πm −X‖∞ = supi
supt∈T m
i
|Xt −Xtmi| < 1
2m−1. (8.3)
as m is large enough. Thus, E|f(X πm) − f(X)| ≤ ‖f‖∞2−m+1. Since f is Lipschitz,
|E∗f(Xn) − E∗f(Xn πm))| ≤ ‖f‖∞2−m + P(‖Xn −Xn πm‖ ≥ 2−m
)
By the same argument as in (8.3), using (8.1) to obtain
limn→∞
|E∗f(Xn) − Ef(X)| ≤ ‖f‖∞2−m + 2−m + 2−m+1‖f‖∞.
Letting m → ∞, we have that E∗f(Xn) → Ef(X) as n→ ∞.
Let F be a class of measurable functions f : X → R. Review
a(δ) =δ√
logN[ ](δ,F , L2(P )),
J[ ](δ,F , L2(P )) =
∫ δ
0
√logN[ ](ǫ,F , L2(P )) dǫ.
LEMMA 8.4 Suppose there is δ > 0 and a measurable function F > 0 such that P ∗f2 < δ2
and |f | ≤ F for every f ∈ F . Then there exists a constant C > 0 such that
C ·E∗P ‖Gn‖F ≤ J[ ](δ,F , L2(P )) +
√nP ∗FF >
√na(δ).
60
Proof. Step 1: Truncation. Recall Gnf = (1/√n)∑n
i=1(f(Xi) − Pf). If |f | ≤ g, then
|Gnf | ≤1√n
n∑
i=1
(g(Xi) + Pg).
It follows that
E∗ ∥∥GnfIF >√na(δ)
∥∥F ≤ 2
√nPFF >
√na(δ).
We will bound E∗ ‖GnfIF ≤ √na(δ)‖F next. The bracketing number of the class of
functions fIF >√na(δ) if f ranges over F are smaller than the bracketing number of
the class F . To simplify notation, we assume, w.l.o.g., |f | ≤ √na(δ) for every f ∈ F .
Step 2: Discretization of the integral. Choose an integer q0 such that 4δ ≤ 2−q0 ≤ 8δ.
We claim there exists a nested sequence of F-partitions Fqi; 1 ≤ i ≤ Nq, indexed by the
integers q ≥ q0, into Nq disjoint subsets and measurable functions ∆qi ≤ 2F such that
C∑
q≥q0
1
2q
√logNq ≤
∫ δ
0
√logN[ ](ǫ,F , L2(P )) dǫ, (8.4)
supf,g∈Fqi
|f − g| ≤ ∆qi, and P∆2qi ≤
1
22q(8.5)
for an universal constant C > 0. For convenience, write N(ǫ) = N[ ](ǫ,F , L2(P )). Thus,
∫ δ
0
√logN(ǫ) dǫ ≥
∫ 1/2q0+3
0
√logN(ǫ) dǫ =
∞∑
q=q0+3
∫ 1/2q
1/2q+1
√logN(ǫ) dǫ
≥∞∑
q=q0+3
1
2q+1
√logN
(1
2q−3
).
Re-indexing the sum, we have that
1
16
∞∑
q=q0
1
2q
√logNq ≤
∫ δ
0
√logN(ǫ) dǫ,
where Nq = N(2−q). By definition of Nq, there exists a partition Fqi = [lqi, uqi]; 1 ≤ i ≤Nq, where lqi and uqi are functions such that E(uqi − lqi)
2 < 2−q. Set ∆qi = uqi − lqi. Then
∆qi ≤ 2F and supf,g∈Fqi|f − g| ≤ ∆qi, and P∆2
qi ≤ 2−2q.
We can also assume, w.l.o.g., that the partitions Fqi; 1 ≤ i ≤ Nq are successive
refined as q increases. Actually, at q-th level, we can make intersections of elements from
Fki; 1 ≤ i ≤ Nk as k goes from q0 to q : ∩qk=q0
Fki. The total possible number of
such intersections is no more than Nq := Nq0Nq0+1 · · ·Nq. Since the current Fqi’s become
61
smaller, all requirements on Fqi still hold obviously except possibly (8.4). Now we verify
(8.4). Actually, noticing√
log Nq ≤∑qk=q0
√logNk. Then
∑
q≥q0
1
2q
√log Nq ≤
∑
q≥q0
q∑
k=q0
1
2q
√logNk =
∞∑
k=q0
∑
q≥k
1
2q
√logNk = 2
∞∑
k=q0
1
2k
√logNk.
So (8.4) still holds when replacing Nq by Nq.
Step 3: Chaining-a skill. For each fixed q ≥ q0, choose a fixed element fqi from each
partition set Fqi, and set
πqf = fqi, ∆qf = ∆qi, if f ∈ Fqi. (8.6)
Thus, πqf and ∆qf runs through a set of Nq functions if f run through F . Without loss of
generality, we can assume
∆qf ≤ ∆q−1f for any f ∈ F and q ≥ q0 + 1. (8.7)
Actually, let ∆qi be the measurable cover function of supf,g∈Fqi|f−g|. Then (8.5) also holds.
Let Fq−1 j be a partition set at the q−1 level such that Fqi ⊂ Fq−1 j . Then supf,g∈Fqi|f−g| ≤
supf,g∈Fq−1 j|f − g|. Thus, ∆qi ≤ ∆q−1 j. The assertion (8.7) follows by replacing ∆qi with
∆qi.
By (8.6), P (πqf −f)2 ≤ maxi P∆2qi < 2−2q. Thus, P
∑q(πqf −f)2 =
∑q P (πqf −f)2 <
∞. So∑
q(πqf − f)2 <∞ a.s. This implies
πqf → f a.s. on P (8.8)
as q → ∞ for any f ∈ F . Define
aq = 2−q/√
logNq+1 ,
τ = τ(n, f) = infq ≥ q0 : ∆qf >√naq.
The value of τ is thought to be +∞ if the above set is empty. This is the first time ∆qf >√naq. By construction, 2a(δ) = 2δ(logN[ ](δ,F , L2(P )))−1/2 ≤ aq0 since the denominator is
decreasing in δ. By Step 1, we know that |∆qf | ≤ 2√na(δ) ≤ √
naq0. This says that τ > q0.
We claim that
f − πq0f =
∞∑
q0+1
(f − πqf)Iτ = q +
∞∑
q0+1
(πqf − πq−1f)Iτ ≥ q, a.s on P. (8.9)
In fact, write f − πq0f = (f − πq1f) +∑q1
q0+1(πqf − πq−1f) for q1 > q0. Now,
62
(i) If τ = ∞, the right hand side above is identical to limq→∞ πqf −πq0f = f −πq0f a.s.
by (8.8);
(ii) If τ = q1 <∞, the right hand is equal to (f − πq1f) +∑q1
q0+1(πqf − πq−1f).
Step 4. Bound terms in the chain. Apply Gn to the both sides of (8.9). For |f | ≤ g,
note that |Gnf | ≤ |Gng| + 2√nPg. By (8.6), |f − πqf | ≤ ∆qf. One obtains that
E∗‖∞∑
q0+1
Gn
f︷ ︸︸ ︷(f − πqf)Iτ = q ‖F
≤∞∑
q0+1
E∗‖Gn(∆qfIτ = q︸ ︷︷ ︸g
)‖F + 2√n
∞∑
q0+1
‖P (∆qfIτ = q)‖F . (8.10)
Now, by (8.7), ∆qfIτ = q ≤ ∆q−1fIτ = q ≤ √naq−1. Moreover, P (∆qfIτ = q)2 ≤
2−2q. By Lemma 8.2, the middle term above is bounded by
∞∑
q0+1
(aq−1 logNq + 2−q√
logNq).
By Holder’s inequality, P (∆qfIτ = q) ≤ (P (∆qf)2)1/2P (τ = q)1/2 ≤ 2−qP (∆qf >√naq)
1/2 ≤ 2−q(√naq)
−1(P (∆qf)2)1/2 ≤ 2−2q(√naq)
−1. So the last term in (8.10) is
bounded by 2 · 2−2q/aq. In summary,
E∗
∥∥∥∥∥∥
∞∑
q0+1
Gn(f − πqf)Iτ = q
∥∥∥∥∥∥F
≤ C
∞∑
q0+1
2−q√
logNq (8.11)
for some universal constant C > 0.
Second, there are at mostNq functions πqf−πq−1f and two values the indicator functions
I(τ ≥ q) takes. Because the partitions are nested, the function |πqf − πq−1f |Iτ ≥ q ≤∆q−1fIτ ≥ q ≤ √
naq−1. The L2(P )-norm of πqf −πq−1f is bounded by 2−q+1. Applying
Lemma 8.2 again to obtain
E∗
∥∥∥∥∥∥
∞∑
q0+1
Gn(πqf − πq−1f)Iτ ≥ q
∥∥∥∥∥∥F
≤∞∑
q0+1
(aq−1 logNq + 2−q√
logNq)
≤ C
∞∑
q0+1
2−q√
logNq (8.12)
for some universal constant C > 0.
63
At last, we consider πq0f. Because |πq0f | ≤ F ≤ a(δ)√n ≤ √
naq0 and P (πq0f)2 ≤ δ2
by assumption, another application of Lemma 8.2 leads to
E∗‖Gnπq0f‖F ≤ aq0 logNq0 + δ√
logNq0.
In view of the choice of q0, this is no more than the bound in (8.12). All above inequalities
together with (8.4) yield the desired result.
COROLLARY 8.2 For any class F of measurable functions with envelope function F, there
exists an universal constant C such that
E∗P ‖Gn‖F ≤ C · J[ ](‖F‖P,2,F , L2(P )).
Proof. Since F has a single bracket [−F, F ], we have that N[ ](δ,F , L2(P )) = 1 for δ =
2‖F‖P,2. Review the definition in Lemma 8.4. Choose δ = ‖F‖P,2. It follows that
a(δ) =‖F‖P,2√
logN[ ](‖F‖P,2,F , L2(P )).
Now√nP ∗FI(F >
√na(δ)) ≤ ‖F‖2
P,2/a(δ) = ‖F‖P,2
√logN[ ](‖F‖P,2,F , L2(P )) by Markov’s
inequality, which is bounded by J[ ](‖F‖P,2,F , L2(P )) since the integrand is non-decreasing
and hence∫ ‖F‖P,2
0
√logN[ ](ǫ,F , L2(P )) dǫ ≥ ‖F‖P,2
√logN[ ](‖F‖P,2,F , L2(P )).
Proof of Theorem 30. We will use Theorem 31 to prove this theorem. The part (i) is
easily satisfied. Now we verify (ii).
Note there is no cover on F and we don’t know if Pf2 < δ2. Let G = f − g, f, g ∈ F.With a given set of ǫ-brackets [li, ui] over F we can construct 2ǫ-brackets over G by taking
differences [li − uj, ui − lj ] of upper and lower bounds. Therefore, the bracketing number
N[ ](2ǫ,G, L2(P )) are bounded by the square of the bracketing number N[ ](ǫ,F , L2(P )).
Easily, ‖(ui − lj) − (li − uj)‖ ≤ 2ǫ. Hence
N[ ](2ǫ,G, L2(P )) ≤ N[ ](ǫ,F , L2(P ))2.
This says that
J[](ǫ,G, L2(P )) <∞. (8.13)
64
For a given, small δ > 0, by the definition of N[ ](δ,F , L2(P )), choose a minimal number
of brackets of size δ that cover F , and use them to form a partition of F = ∪iFi. The subsets
G consisting of differences f − g of functions f and g belonging to the same partitioning
set consists of functions of L2(P )-norm smaller than δ. Hence, by Lemma 8.4, there exists
a finite number a(δ) := a(δ)F < a(2δ)G and a universal constant C such that
C · E∗ supi
supf,g∈Fi
|Gn(f − g)| = CE∗ suph∈H
|Gnh|
≤ J[](δ,H, L2(P )) + 2√nPFI(F > a(δ)
√n)
≤ J[ ](δ,G, L2(P )) + 2√nPFI(F > a(δ)
√n).
since H := f − g; f, g ∈ Fi, for all i ⊂ G, where the envelope function F can be taken
equal to the supremum of the absolute values of the upper and lower bounds of finitely
many brackets that cover F , for instance a minimal set of brackets of size 1. This F is
square integrable. The second term above is bounded by a(δ)−1PF 2I(F > a(δ)√n) → 0 as
n→ ∞ for fixed δ. First let n→ ∞, then let δ ↓ 0 the left hand side goes to 0. So part (ii)
in (31) is valid by using Markov’s inequality.
Example. Let F = ft = I(−∞, t], t ∈ R. Then
‖ft − fs‖2,P = (F (t) − F (s))1/2 for s < t.
Cut [0, 1] into many pieces with length less than ǫ2. Since F (t) is increasing, right continuous
and of left limits, there exist a partition, say, −∞ = t0 < t1 < · · · < tk = ∞ with
k = [1/ǫ2] + 1 such that
F (ti−) − F (ti−1) < ǫ2 for i = 1, 2, · · · , k.
Since ‖I(−∞, ti)−I(−∞, ti−1]‖P,2 = (F (ti−)−F (ti−1))1/2 < ǫ.Namely, (I(−∞, ti−1], I(−∞, ti))
is an ǫ-bracket for i = 1, 2, · · · , k. It follows that
N[ ](ǫ,F , L2(P )) ≤ 2
ǫ2for ǫ ∈ (0, 1).
But∫ 10
√log(1/ǫ) dǫ <∞. The previous classical Glivenko-Cantelli’s Lemma 26 and Donsker’s
Theorem 27 follow from Theorems 29 and 30.
Since ‖f‖ = supt |f(t)| is the norm in l∞, it is continuous. By the Delta method, we
have
supt∈R
√n|Fn(t) − F (t)| → sup
t∈R
G(t)
65
weakly, i.e., in distribution, as n → ∞, where G(t) is the Gaussian process we mentioned
before.
Example. Let F = fθ; θ ∈ Θ be a collection of measurable functions with Θ ⊂ Rd
bounded. Suppose Pmr <∞ and |fθ1(x)−fθ2(x)| ≤ m(x)‖θ1−θ2‖ for any θ1 and θ2. Then
N[ ](ǫ‖m‖P,r, F , Lr(P )) ≤ K
(diam(Θ)
ǫ
)d
(8.14)
for any 0 < ǫ < diam(Θ). As long as ‖θ1−θ2‖ < ǫ, we have that l(x) := fθ1(x)−m(x)ǫ(x) ≤fθ2(x) ≤ fθ1(x) + m(x)ǫ(x) =: u(x). Also, ‖u − l‖P,r = ǫ‖m‖P,r. Naturally, [l, u] is an
ǫ‖m‖P,r-bracket. It suffices to calculate the minimal number of balls of radius ǫ need to
cover Θ.
Note that Θ ⊂ Rd is in a cube with side length no bigger than diam(Θ). We can cover
Θ with fewer than (2diam(Θ)/ǫ)d cubes of size ǫ. The circumscribed balls have radius a
multiple of ǫ and also cover Θ. The intersection of these balls with Θ cover Θ. So the claim
is true. This says that F is a P -Donsker class.
Example (Sobolev classes). For k ≥ 1, let
F =
f : [0, 1] → R; ‖f‖∞ ≤ 1 and
∫ 1
0(f (k)(x))2 dx ≤ 1
.
Then there exists an universal constant K such that
logN[ ](ǫ,F , ‖ · ‖∞) ≤ K
(1
ǫ
)1/k
Since ‖f‖P,2 ≤ ‖f‖∞ for any P, it is easy to check that N[ ](ǫ,F , L2(P )) ≤ N[ ](ǫ,F , ‖ · ‖∞)
for any P. So F is P -Donsker class for any P.
Example (Bounded Variation). Let
F = f : R → [−1, 1] of bounded variation 1.
Any function of bounded variation is the difference of two monotone increasing functions.
Then for any r ≥ 1 and probability measure P,
logN[ ](ǫ,F , Lr(P )) ≤ K
(1
ǫ
).
Therefore F is P -Donsker for every P.
66
9 Consistency and Asymptotic Normality of Maximum Like-
lihood Estimators
9.1 Consistency
Let X1,X2, · · · ,Xn be a random sample from a population distribution with pdf or pmf
f(x|θ), where θ is an unknown parameter. The MLE θ under certain conditions on f(x|θ)will be consistent and satisfy Central Limit Theorems. But for some cases those will not
be true. Let’s see a good example and a pathological example.
Example. Let X1,X2, · · · ,Xn be i.i.d. from Exp(θ) with pdf
f(x|θ) =
θe−θx, if x > 0;
0, otherwise.
So EX1 = 1/θ and V ar(X1) = 1/θ2. It is easy to check that θ = 1/X. By the CLT,
√n
(X − 1
θ
)=⇒ N(0, θ−2)
as n → ∞. Let g(x) = 1/x, x > 0. Then g(EX1) = θ and g′(EX1) = −θ2. By the Delta
method,
√n(θ − θ) =⇒ N(0, g′(µ)2σ2) = N(0, θ2)
as n→ ∞. Of course, θ → θ in probability.
Example. Let X1,X2, · · · ,Xn be i.i.d. from U [0, θ], where θ is unknown. The MLE
estimator θ = maxXi. First, it is easy to check that θ → θ in probability. But θ doesn’t
follow the CLT. In fact,
P
(n(θ − θ)
θ≤ x
)→
1 − e−x, if x > 0;
0, otherwise
as n→ ∞.
For some cases even the consistency, i.e., θ → θ in probability, doesn’t hold. Next, we
will study some sufficient conditions for consistency. Later we provide sufficient conditions
for CLT.
Let X1, · · · ,Xn be a random sample from a density pθ with reference measure µ, that
is, Pθ(X1 ∈ A) =∫A pθ(x)µ( dx), where θ ∈ Θ. The maximum likelihood estimator θn
maximizes the function h(θ) :=∑
log pθ(Xi) over Θ, or equivalently, the function
Mn(θ) =1
n
n∑
i=1
logpθ
pθ0
(Xi),
67
where θ0 is the true parameter. Under suitable conditions, by the weak law of large numbers,
Mn(θ) →M(θ) := Eθ0 logpθ
pθ0
(X1) =
∫
R
pθ0(x) logpθ
pθ0
(x)µ( dx) (9.1)
in probability as n → ∞. The number −M(θ) is called the Kullback-Leibler divergence of
pθ and pθ0. Let Y = pθ0(Z)/pθ(Z), where Z ∼ pθ. Then EθY = 1 and
M(θ) = −Eθ(Y log Y ) ≤ −(EθY ) log(EθY ) = 0
by Jensen’s inequality. Of course M(θ0) = 0, that is, θ0 attains the maximum of M(θ).
Obviously, M(θ0) = Mn(θ0) = 0. The following gives the consistency of θn with θ0.
THEOREM 32 Suppose supθ∈Θ |Mn(θ) −M(θ)| P→ 0 and supθ:d(θ,θ0)≥ǫ M(θ) < M(θ0) for
any ǫ > 0. Then θn → θ0 in probability.
Proof. For any ǫ > 0, we need to show that
P (d(θn, θ0) ≥ ǫ) → 0 (9.2)
as n→ ∞. By the condition, there exists δ > 0 such that
supθ:d(θ,θ0)≥ǫ
M(θ) < M(θ0) − δ.
Thus, if d(θn, θ0) ≥ ǫ, then M(θn) < M(θ0) − δ. Note that
M(θn) −M(θ0) = Mn(θn) −Mn(θ0) + (Mn(θ0) −M(θ0)) + (M(θn) −Mn(θn))
≥ −2 supθ∈Θ
|Mn(θ) −M(θ)|
since Mn(θn) ≥Mn(θ0). Consequently,
P (d(θn, θ0) ≥ ǫ) ≤ P (M(θn) −M(θ0) < −δ) = P (supθ∈Θ
|Mn(θ) −M(θ)| ≥ δ/2) → 0.
by the first assumption.
Let X1, · · · ,Xn be a random sample from a density pθ with reference measure µ, that
is, Pθ(X1 ∈ A) =∫A pθ(x)µ( dx), where θ ∈ Θ. The maximum likelihood estimator θn
maximizes the function∑
log pθ(Xi) over Θ, or equivalently, the function
Mn(θ) :=1
n
n∑
i=1
logpθ
pθ0
(Xi),
68
where θ0 is the true parameter. This is usually practiced by solving the equation ∂Mn(θ)/∂θ =
0 for θ. Let
ψθ(x) =∂ log(pθ(x)/pθ0(x))
∂θ.
Then the maximum likelihood estimator θn is the zeroes of the function
Ψn(θ) :=1
n
n∑
i=1
ψθ(Xi). (9.3)
Though we still use the notation Ψn in the following theorem, it is not necessarily the one
above. It can be any function of properties stated below.
THEOREM 33 Let Θ be a subset of the real line and let Ψn be random functions and Ψ a
fixed function of θ such that Ψn(θ) → Ψ(θ) in probability for every θ. Assume that Ψn(θ) is
continuous and has exactly one zero θn or is nondecreasing satisfying Ψn(θn) = oP (1). Let
θ0 be a point such that Ψ(θ0 − ǫ) < 0 < Ψ(θ0 + ǫ) for every ǫ > 0. Then θnP→ θ0.
Proof. Case 1. Suppose Ψn(θ) is continuous and has exactly one zero θn. Then
P (Ψn(θ0 − ǫ) < 0,Ψn(θ0 + ǫ) > 0) ≤ P (θ0 − ǫ ≤ θn ≤ θ0 + ǫ).
Since Ψn(θ0 ± ǫ) → Ψ(θ0 ± ǫ) in probability, and Ψ(θ0 − ǫ) < 0 < Ψ(θ0 + ǫ), the left hand
side goes to one.
Case 2. Suppose Ψn(θ) is nondecreasing satisfying Ψn(θn) = oP (1). Then
P (|θn − θ0| ≥ ǫ) ≤ P (θn > θ0 + ǫ) + P (θn < θ0 − ǫ)
≤ P (Ψn(θn) ≥ Ψn(θ0 + ǫ)) + P (Ψn(θn) ≤ Ψn(θ0 − ǫ)).
Now, Ψn(θ0 ± ǫ) → Ψ(θ0 ± ǫ) in probability. This together with Ψn(θn) = oP (1) shows that
P (|θn − θ0| ≥ ǫ) → 0 as n→ ∞.
Example. Let X1, · · · ,Xn be a random sample from Exp(θ), that is, the density
function is pθ(x) = θ−1 exp(−x/θ)I(x ≥ 0) for θ > 0. We know that, under the true model,
the MLE θn = 1/Xn → θ0 in probability. Let’s verify that this conclusion indeed can be
deduced from Theorem 33.
Actually, ψθ(x) = (log pθ(x))′ = −θ−1 + xθ−2 for x ≥ 0. Thus Ψn(θ) = −θ−1 + Xnθ
−2,
which goes to Ψ(θ) = Eθ0(−θ−1 + X1θ−2) = θ−2(θ0 − θ), that is positive or negative
depending on if θ is samller or bigger than θ0. Also, θn = 1/Xn. Applying Theorem 33 to
−Ψn and −Ψ, we obtain the consisitency result.
69
9.2 Asymptotic Normality
Now we study the central limit theorems for the MLE.
Now we illustrate the idea of showing the normality of MLE. Recalling (9.3). Do the
Taylor’s expansion for ψθ(Xi).
ψθn(Xi)
.= ψθ0(Xi) + (θn − θ0)ψθ0(Xi) +
1
2(θn − θ0)
2ψθ0(Xi)
We will use the following notation:
Pf =
∫f(x)µ(dx) for any real function f(x) and Pn =
1
n
n∑
i=1
δXi .
Then Pnf = (1/n)∑n
i=1 f(Xi). Thus,
0 = Ψn(θn).= Pnψθ0 + (θn − θ0)Pnψθ0 +
1
2(θn − θ0)
2Pnψθ0 .
Reorganize it in the following form
√n(θn − θ0)
.=
−√n(Pnψθ0)
Pnψθ0 + 12(θn − θ0)Pnψθ0
.
Recall the Fisher information
I(θ0) = Eθ0
(∂ log pθ(X1)
∂θ|θ=θ0
)2
= Eθ0(ψθ0(X1))2 = −Eθ0
(∂2 log pθ(X1)
∂2θ|θ=θ0
)
= −Eθ0(ψθ0(X1)).
By CLT and LLN,√n(Pnψθ0) =⇒ N(0, I(θ0)), Pnψθ0 → −I(θ0) and Pnψθ0 → pθ0ψθ0(X1)
in probability. This illustrates that
√n(θn − θ0) =⇒ N(0, I(θ0)
−1)
as n→ ∞. The next two theorems will make these steps rigorous.
Let g(θ) = Eθ0(mθ(X1)) =∫mθ(x)pθ0(x)µ( dx). We need the following condition:
g(θ) = g(θ0) +1
2(θ − θ0)
TVθ0(θ − θ0) + o(‖θ − θ0‖2), (9.1)
where
Vθ0 =
(Eθ0
(∂mθ
∂θi∂θj|θ=θ0
))
1≤i,j≤d
, θ = (θ1, · · · , θd) ∈ Rd.
70
THEOREM 34 For each θ in an open subset of Euclidean space let x 7→ mθ(x) be a mea-
surable function such that θ 7→ mθ(x) is differentiable at θ0 for P -almost every x with
derivative mθ0(x) and such that , for every θ1 and θ2 in a neighborhood of θ0 and a mea-
surable function n(x) with En(X1)2 <∞
|mθ1(x) −mθ2(x)| ≤ n(x)‖θ1 − θ2‖.
Furthermore, assume the map θ 7→ Emθ(X1) has expression as in (9.1). If Pnmθn≥
supθ Pnmθ − oP (n−1) and θnP→ θ0, then
√n(θn − θ0) = −V −1
θ0
1√n
n∑
i=1
mθ0(Xi) + oP (1).
In particular,√n(θn − θ0) =⇒ N(0, V −1
θ0E(mθ0m
Tθ0
)V −1θ0
) as n→ ∞.
A statistical model (pθ, θ ∈ Θ) is called differentiable in quadratic mean if there exists a
measurable vector-valued function lθ0 such that, as θ → θ0,
∫ [√pθ −
√pθ0 −
1
2(θ − θ0)
T lθ0
√pθ0
]2
du = o(‖θ − θ0‖2).
THEOREM 35 Suppose that the model (Pθ : θ ∈ Θ) is differentiable in quadratic mean at
an inner point θ0 of Θ ⊂ Rk. Furthermore, suppose that there exists a measurable function
l(x) with Eθ0l2(X1) <∞ such that, for every θ1 and θ2 in a neighborhood of θ0,
| log pθ1(x) − log pθ2(x)| ≤ l(x)‖θ1 − θ2‖.
If the Fisher information matrix Iθ0 is non-singular and θn is consistent, then
√n(θn − θ0) = I−1
θ0
1√n
n∑
i=1
lθ0(Xi) + oP (1)
In particular,√n(θn − θ0) =⇒ N(0, I−1
θ0) as n→ ∞, where
I(θ0) = −(E∂2 log pθ(X1)
∂θi∂θj
)
1≤i,j≤k
.
We need some preparations in proving the above theorems.
Given functions x → mθ(x), θ ∈ Rd, we need conditions that ensure that, for a given
sequence rn → ∞ and any sequence hn = O∗P (1),
Gn
(rn(mθ0+hn/rn
−mθ0) − hTn mθ0
)P→ 0. (9.2)
71
LEMMA 9.1 For each θ in an open subset of Euclidean space let mθ(x), as a function
of x, be measurable for each θ; and as a function of θ, is differentiable for almost every x
(w.r.t. P ) with derivative mθ0(x) and such that for every θ1 and θ2 in a neighborhood of θ0
and a measurable function m such that pm2 <∞,
‖mθ1(x) −mθ2(x)‖ ≤ m(x)‖θ1 − θ2‖.
Then (9.2) holds for every random sequence hn that is bounded in probability.
Proof. Because hn is bounded in probability, to show (9.2), it is enough to show, w.l.o.g.
sup|θ|≤1
|Gn(rn(mθ0+θ/rn−mθ0) − θT mθ0)|
P→ 0
as n→ ∞. Define
Fn = fθ := rn(mθ0+θ/rn−mθ0) − θT mθ0, |θ| ≤ 1.
Then
|fθ1(x) − fθ2(x)| ≤ 2mθ0(x)‖θ1 − θ2‖
for any θ1 and θ2 in the unit ball in Rd by the Lipschitz condition. Further, set
Hn = sup|θ|≤1
|rn(mθ0+θ/rn−mθ0) − θT mθ0|.
Then Hn is a cover and Hn → 0 as n → ∞ by the definition of partial derivative. By
Bounded Convergence Theorem, δn := (PH2n)1/2 → 0. Thus, by Corollary 8.2 and Example
8.14,
E∗P‖Gn‖Fn ≤ C · J[ ](‖H‖P,2,Fn, L2(P )) ≤ C
∫ δn
0
√log(Kǫ−d) dǫ→ 0
as n→ ∞. The desired conclusion follows.
We need some preparation before proving Theorem 34.
Let Pn be the empirical distribution of a random sample of size n from a distribution P,
and, for every θ in a metric space (Θ, d), mθ(x) be a measurable function. Let θn (nearly)
maximize the criterion function Pnmθ. The number θ0 is the truth, that is, the maximizer
of mθ over θ ∈ Θ. Recall Gn =√n(Pn − P ).
THEOREM 36 (Rate of Convergence) Assume that for fixed constants C and α > β, for
every n and every sufficiently small δ > 0,
supd(θ,θ0)>δ
P (mθ −mθ0) ≤ −Cδα,
E∗ supd(θ,θ0)<δ
|Gn(mθ −mθ0)| ≤ Cδβ.
72
If the sequence θn satisfies Pnmθn≥ Pnmθ0 −OP (nα/(2β−2α)) and converges in outer prob-
ability to θ0, then n1/(2α−2β)d(θn, θ0) = O∗P (1).
Proof. Set rn = n1/(2α−2β) and Pnmθn≥ Pnmθ0 −Rn with 0 ≤ Rn = OP (nα/(2β−2α)).
Partition (0,∞) by Sj,n = θ : 2j−1 < rnd(θ, θ0) ≤ 2j for all integers j. If rnd(θn, θ0) ≥ 2M
for a given M, then θn is in one of the shells Sj,n with j ≥M. In that case, supθ∈Sj,n(Pnmθ−
Pnmθ0) ≥ −Rn. It follows that
P ∗(rnd(θn, θ0) ≥ 2M ) ≤∑
j≥M,2j≤ǫrn
P ∗(
supθ∈Sj,n
(Pnmθ − Pnmθ0) ≥ −K
rαn
)(9.3)
+P ∗(2d(θn, θ0) ≥ ǫ) + P (rαnRn ≥ K).
The middle term on right goes to zero; the last term can be arbitrarily small by choosing
large K. We only need to show the sum is arbitrarily small when M is large enough as
n→ ∞. By the given condition
supθ∈Sj,n
P (mθ −mθ0) ≤ −C 2j−1α
rαn
.
For M such that (1/2)C2(M−1)α ≥ K, by the fact that sups(fs + gs) ≤ sups fs + sups gs,
P ∗(
supθ∈Sj,n
(Pnmθ − Pnmθ0) ≥ −K
rαn
)≤ P ∗
(sup
θ∈Si,n
Gn(mθ −mθ0) ≥ C√n
2(j−1)α
2rαn
).
Therefore, the sum in (9.3) is bounded by
∑
j≥M,2j≤ǫrn
P ∗(
supθ∈Si,n
|Gn(mθ −mθ0)| ≥ C√n
2(j−1)α
2rαn
)≤∑
j≥M
(2j/rn)β2rαn√
n2(j−1)α
by Markov’s inequality and the definition of rn. The right hand side goes to zero as M → ∞.
COROLLARY 9.1 For each θ in an open subset of Euclidean space, mθ(x) is a measurable
function such that, there exists m(x) with Pm2 <∞ satisfying
|mθ1(x) −mθ2(x)| ≤ m(x)‖θ1 − θ2‖.
Furthermore, assume
Pmθ = Pmθ0 +1
2(θ − θ0)
TVθ0(θ − θ0) + o(‖θ − θ0‖2) (9.4)
with Vθ0 nonsingular. If Pnmθn≥ Pnmθ0−OP (n−1) and θn
P→ θ0, then√n(θn−θ0) = OP (1).
73
Proof. (9.4) implies the first condition of Theorem 36 holds with α = 2 since Vθ0 is
nonsingular. Now we use Corollary 8.2 to the class of functions F = mθ−mθ0; ‖θ−θ0‖ < δto see the second condition is valid with β = 1. This class has envelope function F = δm,
while
E∗ sup‖θ−θ0‖<δ
|Gn(mθ −mθ0)| ≤ C ·∫ δ‖m‖P,2
0
√log[ ](ǫ,F , L2(P )) dǫ
≤∫ δ‖m‖P,2
0
√√√√log
(K
(δ
ǫ
)d)dǫ = C1δ
by (8.14) with Diam(Θ) = const · δd, where C1 depends on ‖m‖P,2.
Proof of Theorem 34. By Lemma 9.1,
Gn
(rn(mθ0+hn/rn
−mθ0) − hTn mθ0
)P→ 0. (9.5)
Expand P (rn(mθ0+hn/rn−mθ0) by condition 9.1, we obtain
nPn(mθ0+hn/√
n −mθ0) =1
2hT
nVθ0 hn + hTnGnmθ0 + oP (1).
By Corollary 9.1,√n(θn − θ) is bounded in probability (same as tightness). Take hn =
√n(θn − θ) and hn = −V −1
θ0Gnmθ0 , we then obtain the Taylor’s expansions of Pnmθn
and
Pnmθ0−V −1θ0
Gnmθ0as follows
nPn(mθn−mθ0) =
1
2hT
nVθ0hn + hTnGnmθ0 + oP (1),
nPn(mθ0−V −1θ0
Gnmθ0/√
n −mθ0) = −1
2Gnm
Tθ0V −1
θ0Gnmθ0 + oP (1),
where the second one is obtained through a bit algebra. By the definition of θn, the left
hand side of the first equation is greater than that of the second one. So are he right hand
sides. Take the difference and make it to a complete square, we then have
1
2(hn + V −1
θ0Gnmθ0)
TVθ0(hn + V −1θ0
Gnmθ0) + oP (1) ≥ 0.
We know Vθ0 is strictly negative-definite, the quadratic form must converge to zero in
probability. This is also the case ‖hn + V −1θ0
Gnmθ0‖, that is, hn = −V −1θ0
Gnmθ0 + oP (1).
74
10 Appendix
Let A be a collection of subsets of Ω and B is generated by A, that is, B = σ(A). Let P be
a probability measure on (Ω,B).
LEMMA 10.1 Suppose A has the following property: (i) Ω ∈ A, (ii)Ac ∈ A if A ∈ A,and (iii) ∪m
i=1Ai ∈ A if Ai ∈ A for all 1 ≤ i ≤ m. Then, for any B ∈ B and ǫ > 0, there
exists A ∈ A such that P (B∆A) < ǫ.
Proof. Let B′ be the set of B ∈ B satisfying the conclusion. Obviously, A ⊂ B′ ⊂ B. It is
enough to verify that B′ is a σ-algebra.
It is easy to see that (i) Ω ∈ B′; (ii) Bc ∈ B′ if B ∈ B′ since A∆B = Ac∆Bc. (iii) If
Bi ∈ B′ for i ≥ 1, there exist Ai ∈ A such that P (Bi∆Ai) < ǫ/2i for all i ≥ 1. Evidently,
∪ni=1Bi ↑ ∪∞
i=1Bi as n → ∞. Therefore, there exists n0 < ∞ such that |P (∪∞i=1Bi) −
P (∪n0i=1Bi)| ≤ ǫ/2. It is easy to check that (∪n0
i=1Bi)∆(∪n0i=1Ai) ⊂ ∪n0
i=1(Bi∆Ai). Write
B = ∪∞i=1Bi and B = ∪n0
i=1Bi and A = ∪n0i=1Ai. Then A ∈ A. Note B∆A ⊂ (B\B)∪ (B∆A).
The above facts show that P (B∆A) < ǫ. Thus B′ is a σ-algebra.
LEMMA 10.2 Let X1,X2, · · · ,Xm, m ≥ 1 be random variables defined on (Ω,F ,P). Let
f(x1, · · · , xm) be a real measurable function with E|f(X1, · · · ,Xm)|p < ∞ for some p ≥ 1.
Then there exists fn(X1, · · · ,Xm); n ≥ 1 such that
(i) fn(X1, · · · ,Xm) → f(X1, · · · ,Xm) a.s.
(ii) fn(X1, · · · ,Xm) → f(X1, · · · ,Xm) in Lp(Ω,F ,P).
(iii) For each n ≥ 1, fn(X1, · · · ,Xm) =∑km
i=1 ci gi1(X1) · · · gim(Xm) for some km < ∞,
constants ci, and gij(Xj) = IAi,j (Xj) for some sets Ai,j ∈ B(R), and all 1 ≤ i ≤ km and
1 ≤ j ≤ m.
Proof. To save notation, we write f = f(x1, · · · , xm). Since E|fI(|f | ≤ C) − f |p → 0
as C → ∞, choose Ck such that E|fI(|f | ≤ Ck) − f |p ≤ 1/k2 for k = 1, 2, · · · . We will
show that there exists function gk = gk(x1, · · · , xk) of the form in (iii) such that E|fI(|f | ≤Ck) − gk|p ≤ 1/k2 for all k ≥ 1. Thus, E|f − gk|p < 2/k2 for all k ≥ 1. The assertion (ii)
then follows. Also, it implies E(∑
k≥1 |f − gk|p) < ∞, then∑
k≥1 |f − gk|p < ∞ a.s. we
ontain (i). Therefore, to prove this lemma, we assume w.l.o.g. that f is bounded and we
need to show that there exist fn = fn(x1, · · · , xm); n ≥ 1 of the form in (iii) such that
E|f − fn|p ≤ 1
n2(10.6)
for all n ≥ 1.
75
Since f is bouded, for any n ≥ 1, there exists hn such that
supx∈Rm
|f(x) − hn(x)| < 1
2n2, (10.7)
where hn is a simple function, i.e., hn(x) =∑kn
i=1 ciIx ∈ Bi for some kn < ∞, constants
ci’s and Bi ∈ B(Rn).
Now set X = (X1, · · · ,Xm) ∈ Rm and µ be the probability measure of X under proba-
bility P. Let A be the set of all finite unions of sets in A1 := ∏mi=1Ai ∈ B(Rn); Ai ∈ B(R).
By the construction of B(Rn), we know that B(Rn) = σ(A). It is not difficult to verify that
A satisfies the conditions in Lemma 10.1. Thus there exist Ei ∈ A such that
∫|IBi(x) − IEi(x)|p dµ = µ(Bi∆Ei) <
1
(2cn2)pand Ei =
ki⋃
j=1
m∏
l=1
Ai,j,l
where c = 1 +∑kn
i=1 |ci| and Ai,j,l ∈ B(R) for all i, j and l. Now, since ‖ · ‖p is a norm, we
have
‖hn(X) −kn∑
i=1
ciIEi(X)‖p ≤kn∑
i=1
|ci| · ‖IEi(X) − IBi(X)‖p ≤ 1
2n2. (10.8)
Observe that ν(E) := IE(X) is a measure on (Rn,B(Rn)). Note that the intersection of any
finite product sets∏m
l=1Ai,j,l is still in A1. By the inclusion-exclusion formula, IEi(X) is a
finite linear combination of IF (X) where F ∈ A1. Thus, fn(X) :=∑kn
i=1 ciIEi(X) is of the
form in (iii). Now, by (10.7) and (10.8)
‖f(X) − fn(X)‖p ≤ ‖f(X) − hn(X)‖p + ‖hn(X) − fn(X)‖p ≤ 1
n2.
Thus, (10.6) follows.
LEMMA 10.3 Let ξ1, · · · , ξk be random variables, and φ(x), x ∈ Rn be a measurable func-
tion with E|φ(ξ1, · · · , ξk)| <∞. Let F and G be two σ-algebras. Suppose
E(f1(ξ1) · · · fk(ξk)|F) = E(f1(ξ1) · · · fk(ξk)|G) a.s. (10.9)
holds for all bounded, measurable functions fi : R → R, 1 ≤ i ≤ n. Then E(φ(ξ1, · · · , ξk)|F) =
E(φ(ξ1, · · · , ξk)|G) almost surely.
Proof. From (10.9), we know that the desired conclusion holds for φ that is a finite linear
combination of f1 · · · fk’s. By Lemma 10.2, there exist such functions φn; n ≥ 1, such that
76
E|φn(ξ1, · · · , ξk) − φ(ξ1, · · · , ξk)| → 0 as n→ ∞. Therefore,
E|E(φn(ξ1, · · · , ξk)|F) − E(φ(ξ1, · · · , ξk)|F)|≤ E|E(φn(ξ1, · · · , ξk) − φ(ξ1, · · · , ξk)| → 0.
We know that E(φn(ξ1, · · · , ξk)|F) = E(φn(ξ1, · · · , ξk)|G) for all n ≥ 1, the conclusion
follows since the L1-limit is unique up to a set of measure zero.
PROPOSITION 10.1 Let X1,X2, · · · be a Markov chain. Let 1 ≤ l ≤ m ≤ n− 1. Assume
φ(x), x ∈ Rn−m is a measurable function such that E|φ(Xm+1, · · · ,Xn)| <∞. Then
E (φ(Xm+1, · · · ,Xn)|Xm, · · · ,Xl) = E (φ(Xm+1, · · · ,Xn)|Xm) .
Proof. Let p = n − m. By Lemma 10.3, we only need to prove the lemma for φ(x) =
φ1(x1) · · · φ(xp) with x = (x1, · · · , xp) and φi’s being bounded functions. We do this by
induction.
If n = m+1, by the Markov property, Y := E(φ1(Xm+1)|Xm, · · · ,X1) = E(φ1(Xm+1)|Xm).
Review the property that E(E(Z|F)|G) = E(E(Z|G)|F) = E(Z|F) if F ⊂ G. We know that
E (φ1(Xm+1)|Xm, · · · ,Xl) = E(Y |Xm, · · · ,Xl) = E(φ1(Xm+1)|Xm)
since Y = E(φ1(Xm+1)|Xm) is σ(Xm, · · · ,Xl)-measurable.
Now assume n ≥ m+ 2. Using the same argument to obtain,
Eφ1(Xm+1) · · · φp+1(Xn+1)|Xm, · · · ,Xl)= Eφ1(Xm+1) · · · φp−1(Xn−1)φp(Xn)|Xm, · · · ,Xl= Eφ1(Xm+1) · · · φp−1(Xn−1)φp(Xn)|Xm
where φp(Xn) = φp(Xn) · E(φp+1(Xn+1)|Xn), and the last step is by induction. Since
φ1(Xm+1) · · · φp−1(Xn−1)φp(Xn) is equal to E(φ1(Xm+1) · · · φp+1(Xn+1)|Xm, · · · ,X1). The
conclusion follows.
The following property says that a Markov chain has a reverse property.
PROPOSITION 10.2 Let X1,X2, · · · be a Markov chain. Let 1 ≤ m < n. Assume
φ(x), x ∈ Rm is a measurable function such that E|φ(X1, · · · ,Xm)| <∞. Then
E(φ(X1, · · · ,Xm)|Xm+1, · · · ,Xn) = E(φ(X1, · · · ,Xm)|Xm+1).
77
Proof. By lemma 10.3, we assume, without loss of generality, φ(x) is a bounded function.
From the definition of conditional expectation, to prove the proposition, it suffices to show
that
Eφ(X1, · · · ,Xm)g(Xm+1, · · · ,Xn)= EE(φ(X1, · · · ,Xm)|Xm+1) · g(Xm+1, · · · ,Xn) (10.10)
for any bounded, measurable function g(x), x ∈ Rn. By Lemma 10.3 again, we only need
to prove (10.10) for g(x) = g1(x1) · · · gp(xp) where p = n−m. Now
EE(φ(X1, · · · ,Xm)|Xm+1) · g1(Xm+1) · · · gp(Xn)= EZ · g2(Xm+2) · · · gp(Xn)
where Z = E(φ(X1, · · · ,Xm)g1(Xm+1)|Xm+1). Use the fact E(E(ξ|F) · η) = E(ξ ·E(η|F))
for any ξ, η and σ-algebra F to obtain that
EZ · g2(Xm+2) · · · gp(Xn) = Eφ(X1, · · · ,Xm)g1(Xm+1) ·E(η|Xm+1)
where η = g2(Xm+2) · · · gp(Xn). By Proposition 10.1, E(η|Xm+1) = E(η|Xm+1, · · · ,X1).
Therefore the above is identical to
EE(φ(X1, · · · ,Xm)g1(Xm+1) · η|Xm+1, · · · ,X1)= Eφ(X1, · · · ,Xm)g(Xm+1, · · · ,Xn)
since g(x) = g1(x1) · · · gp(xp).
The following proposition says that a Markov chain has the property: given the present,
the past and future are independent.
PROPOSITION 10.3 Let 1 ≤ k < l ≤ m be such that l − k ≥ 2. Let φ(x), x ∈ Rk
and ψ(x), x ∈ Rm−l+1 be two measurable functions satisfying E(φ(X1, · · · ,Xl)
2) < ∞ and
E(ψ(Xl, · · · ,Xm)2) <∞. Then
Eφ(X1, · · · ,Xk) · ψ(Xl, · · · ,Xm)|Xk+1,Xl−1= Eφ(X1, · · · ,Xk)|Xk+1,Xl−1 ·Eψ(Xl, · · · ,Xm)|Xk+1,Xl−1.
Proof. Fo saving notation, we write φ = φ(X1, · · · ,Xk) and ψ = ψ(Xl, · · · ,Xm). Then
E(φψ|Xk+1,Xl−1) = EE(φψ|X1, · · ·Xl−1) |Xk+1,Xl−1= Eφ ·E(ψ|Xl−1) |Xk+1,Xl−1
78
by Proposition 10.1 and that φ is σ(X1, · · · ,Xl−1) measurable. The last term is also equal
to E(ψ|Xl−1) ·Eφ|Xk+1,Xl−1. The conclusion follows since E(ψ|Xl−1) = E(ψ|Xk, Xl−1).
79