10 Fundamental Theorems for Econometrics
Thomas S. Robinson (https://ts-robinson.com)
2020-09-30 (v0.1)
Preface
A list of 10 econometric theorems was circulated on Twitter citing what Jeffrey Wooldridge claims you just need to apply repeatedly in order to do econometrics. As a political scientist with applied statistics training, this list caught my attention because it contains many of the theorems I see used in (methods) papers, but which I typically glaze over for lack of understanding. The complete list (slightly paraphrased) is:
1. Law of Iterated Expectations, Law of Total Variance
2. Linearity of Expectations, Variance of a Sum
3. Jensen's Inequality, Chebyshev's Inequality
4. Linear Projection and its Properties
5. Weak Law of Large Numbers, Central Limit Theorem
6. Slutsky's Theorem, Continuous Convergence Theorem, Asymptotic Equivalence Lemma
7. Big Op, Little op, and the algebra of them
8. Delta Method
9. Frisch-Waugh Partialling Out
10. For PD matrices $A$ and $B$, $A - B$ is PSD if and only if $B^{-1} - A^{-1}$ is PSD.
As an exercise in improving my own knowledge of these fundamentals, I decided to work through each theorem, using various lecture notes found online, and excellent textbooks like Aronow & Miller's (2019) Foundations of Agnostic Statistics, Angrist and Pischke's (2008) Mostly Harmless Econometrics, and Wasserman's (2004) All of Statistics.
I found that, for a list of important theorems, there were few consistent sources that contained explanations and proofs of each item. Often, textbooks had excellent descriptive intuitions but would hold back on offering full, annotated proofs. Or full proofs were offered without explaining the wider significance of the theorems. Some of the concepts, moreover, had different definitions depending on the field or source of the proof (like Slutsky's Theorem)!
This resource is an attempt to collate my writing on these theorems (the intuitions, proofs, and examples) into a single document. I have taken some liberties in doing so, for instance combining Wooldridge's first two points into a single chapter on 'Expectation Theorems', and I often omit continuous proofs where discrete proofs are similar and easier to follow. That said, I have tried to be reasonably exhaustive in my proofs so that they are accessible to those (like me) without a formal statistics background.
The inspiration for this project was Jeffrey Wooldridge's list, an academic whose work I admire greatly. This document, however, is in no way endorsed by or associated with him. Most of the applied examples (and invisible corrections to my maths) stem from discussions with Andy Eggers and Musashi Harukawa. There will inevitably still be some errors, omissions, and confusing passages. I would be more than grateful to receive any feedback at [email protected] or via the GitHub repo for this project.
Prerequisites
I worked through these proofs learning the bits of maths I needed as I went along. For those who want to consult Google a little less than I had to, the following should ease you into the more formal aspects of this document:
• A simple working knowledge of probability theory
• The basics of expectation notation, but you don't need to know any expectation rules (I cover the important ones in Chapter 1).
• A basic understanding of linear algebra (i.e. how you multiply matrices, what transposition is, and what the identity matrix looks like). More complicated aspects like eigenvalues and Gaussian elimination make fleeting appearances, particularly in Chapter 9, but these are not crucial.
• Where relevant, I provide coded examples in R. I've kept my use of packages to a minimum so the code should be reasonably easy to read/port to other programming languages.
Version notes
v0.1
This is the first complete draft, and some sections are likely to be changed in future versions. For instance, in Chapter 9 I would like to provide a more comprehensive overview of quadratic forms in linear algebra, how we derive gradients, and hence the shape of PD matrices. Again, any suggestions on ways to improve/add to this resource are very much welcome!
10 Fundamental Theorems for Econometrics by Thomas Samuel Robinson is licensed under CC BY-NC-SA 4.0.
Chapter 1
Expectation Theorems
This chapter sets out some of the basic theorems that can be derived from the definition of expectations, as highlighted by Wooldridge. I have combined his first two points into a single overview of expectation maths. The theorems themselves are not as immediately relevant to applied research as some of the later theorems on Wooldridge's list. However, they often form the fundamental basis upon which future proofs are conducted.
1.1 Law of Iterated Expectations
The Law of Iterated Expectations (LIE) states that:
$$\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X|Y]] \quad (1.1)$$
In plain English, the expected value of $X$ is equal to the expectation over the conditional expectation of $X$ given $Y$. More simply, the mean of $X$ is equal to a weighted mean of conditional means.
Aronow & Miller (2019) note that LIE is 'one of the most important theorems', because being able to express unconditional expectation functions in terms of conditional expectations allows you to hold some parameters fixed, making calculations more tractable.
1.1.1 Proof of LIE
First, we can express the expectation over conditional expectations as a weighted sum over all possible values of $Y$, and similarly express the conditional expectations using summation too:
$$
\begin{aligned}
\mathbb{E}[\mathbb{E}[X|Y]] &= \sum_y \mathbb{E}[X|Y=y]P(Y=y) && (1.2)\\
&= \sum_y \sum_x x P(X=x|Y=y)P(Y=y) && (1.3)\\
&= \sum_y \sum_x x P(Y=y|X=x)P(X=x), && (1.4)
\end{aligned}
$$
Note that the final line follows due to Bayes' Rule.¹ And so:
$$
\begin{aligned}
\ldots &= \sum_y \sum_x x P(X=x)P(Y=y|X=x) && (1.5)\\
&= \sum_x x P(X=x) \sum_y P(Y=y|X=x) && (1.6)\\
&= \sum_x x P(X=x) && (1.7)\\
&= \mathbb{E}[X] \quad \blacksquare && (1.8)
\end{aligned}
$$
The last steps of the proof are reasonably simple. Equation 1.5 is a trivial rearrangement of terms. The second line follows since $y$ does not appear in $xP(X=x)$, so we can move the summation over $Y$ inside the summation over $X$. The final line follows from the fact that the conditional probabilities $P(Y=y|X=x)$ sum to 1 (by simple probability theory).
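To make LIE concrete, here is a quick simulation sketch (the book's own examples use R; this one is written in Python with NumPy, and all values are illustrative). The conditional mean of X is constructed to depend on a discrete Y, and the weighted average of the conditional means recovers the unconditional mean:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000

y = rng.integers(0, 3, size=n)                  # discrete Y taking values 0, 1, 2
x = rng.normal(loc=2.0 * y, scale=1.0, size=n)  # E[X | Y = y] = 2y by construction

lhs = x.mean()  # E[X]

# E[E[X|Y]]: conditional means of X within each Y-group, weighted by P(Y = y)
cond_means = np.array([x[y == k].mean() for k in range(3)])
weights = np.array([(y == k).mean() for k in range(3)])
rhs = (cond_means * weights).sum()

print(lhs, rhs)  # identical up to floating-point error
```

Because the weights are the empirical frequencies of Y, the two sides agree exactly in the sample, not just approximately.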
1.2 Law of Total Variance
The Law of Total Variance (LTV) states the following:
$$\mathrm{var}[Y] = \mathbb{E}[\mathrm{var}[Y|X]] + \mathrm{var}(\mathbb{E}[Y|X]) \quad (1.9)$$
1.2.1 Proof of LTV
LTV can be proved almost immediately using LIE and the definition of variance:
$$
\begin{aligned}
\mathrm{var}(Y) &= \mathbb{E}[Y^2] - \mathbb{E}[Y]^2 && (1.10)\\
&= \mathbb{E}[\mathbb{E}[Y^2|X]] - \mathbb{E}[\mathbb{E}[Y|X]]^2 && (1.11)\\
&= \mathbb{E}[\mathrm{var}[Y|X] + \mathbb{E}[Y|X]^2] - \mathbb{E}[\mathbb{E}[Y|X]]^2 && (1.12)\\
&= \mathbb{E}[\mathrm{var}[Y|X]] + (\mathbb{E}[\mathbb{E}[Y|X]^2] - \mathbb{E}[\mathbb{E}[Y|X]]^2) && (1.13)\\
&= \mathbb{E}[\mathrm{var}[Y|X]] + \mathrm{var}(\mathbb{E}[Y|X]) \quad \blacksquare && (1.14)
\end{aligned}
$$
The second line applies LIE to both $Y^2$ and $Y$ separately. Then we apply the definition of variance to $\mathbb{E}[Y^2|X]$, i.e. $\mathbb{E}[Y^2|X] = \mathrm{var}[Y|X] + \mathbb{E}[Y|X]^2$, and subsequently decompose this term (since $\mathbb{E}[A+B] = \mathbb{E}[A] + \mathbb{E}[B]$). The final line applies the definition of variance to the random variable $\mathbb{E}[Y|X]$.
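The decomposition can also be checked numerically. In this sketch (Python/NumPy; the distributions are invented for illustration), Y's mean and spread both vary with a discrete X, and the within-group and between-group components sum to the total variance:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

x = rng.integers(0, 4, size=n)                      # discrete X in {0, 1, 2, 3}
y = rng.normal(loc=x, scale=1.0 + 0.5 * x, size=n)  # mean and spread vary with X

total_var = y.var()  # var(Y), using the population (ddof=0) convention throughout

p = np.array([(x == k).mean() for k in range(4)])          # P(X = k)
group_vars = np.array([y[x == k].var() for k in range(4)])
group_means = np.array([y[x == k].mean() for k in range(4)])

e_cond_var = (p * group_vars).sum()                         # E[var(Y|X)]
var_cond_mean = (p * (group_means - y.mean()) ** 2).sum()   # var(E[Y|X])

print(total_var, e_cond_var + var_cond_mean)  # equal up to floating-point error
```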
1.3 Linearity of Expectations
The Linearity of Expectations (LOE) simply states that:
$$\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y], \quad (1.15)$$
where $a$ and $b$ are real numbers, and $X$ and $Y$ are random variables.

¹Bayes' Rule states $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$. Therefore:

$$P(X=x|Y=y) \times P(Y=y) = \frac{P(Y=y|X=x)P(X=x)}{P(Y=y)}P(Y=y) = P(Y=y|X=x)P(X=x).$$
1.3.1 Proof of LOE
$$
\begin{aligned}
\mathbb{E}[aX + bY] &= \sum_x \sum_y (ax + by) P(X=x, Y=y) && (1.16)\\
&= \sum_x \sum_y ax P(X=x, Y=y) + \sum_x \sum_y by P(X=x, Y=y) && (1.17)\\
&= a \sum_x x \sum_y P(X=x, Y=y) + b \sum_y y \sum_x P(X=x, Y=y) && (1.18)
\end{aligned}
$$
The first line simply expands the expectation into summation form, i.e. the expectation is the sum of $ax + by$ for each (discrete) value of $X$ and $Y$, weighted by their joint probability. We then expand out these terms. Since summations are commutative, we can rearrange the order of the summations for each of the two parts in the final line, and shift the real numbers and random variables outside the various operators.
Now note that $\sum_i P(I=i, J=j) \equiv P(J=j)$ by probability theory. Therefore:
$$\ldots = a \sum_x x P(X=x) + b \sum_y y P(Y=y) \quad (1.20)$$
The two terms within summations are just the weighted averages of $X$ and $Y$ respectively, i.e. the expectations of $X$ and $Y$, so:
$$\ldots = a\mathbb{E}[X] + b\mathbb{E}[Y] \quad \blacksquare \quad (1.21)$$
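A minimal numerical check of LOE (a Python/NumPy sketch with arbitrary constants): linearity holds exactly for sample means, and no independence assumption is needed:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(2.0, 1.0, size=n)
y = rng.exponential(3.0, size=n)   # independence is NOT required for linearity
a, b = 2.5, -4.0

lhs = (a * x + b * y).mean()       # E[aX + bY]
rhs = a * x.mean() + b * y.mean()  # aE[X] + bE[Y]

print(lhs, rhs)  # identical up to floating-point error
```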
1.4 Variance of a Sum
There are two versions of the Variance of a Sum (VOS) law:
• $\mathrm{var}(X + Y) = \mathrm{var}(X) + \mathrm{var}(Y)$, when $X$ and $Y$ are independent
• $\mathrm{var}(X + Y) = \mathrm{var}(X) + \mathrm{var}(Y) + 2\mathrm{Cov}(X, Y)$, when $X$ and $Y$ are correlated
1.4.1 Proof of VoS: $X, Y$ are independent
$$
\begin{aligned}
\mathrm{var}(X + Y) &= \mathbb{E}[(X + Y)^2] - (\mathbb{E}[X + Y])^2 && (1.23)\\
&= \mathbb{E}[X^2 + 2XY + Y^2] - (\mathbb{E}[X] + \mathbb{E}[Y])^2 && (1.24)
\end{aligned}
$$
The first line of the proof is simply the definition of variance. In the second line, we expand the equation in the first term and, using LOE, decompose the second term. We can expand this equation further, continuing to use LOE:
$$
\begin{aligned}
\ldots &= \mathbb{E}[X^2] + \mathbb{E}[2XY] + \mathbb{E}[Y^2] - (\mathbb{E}[X]^2 + 2\mathbb{E}[X]\mathbb{E}[Y] + \mathbb{E}[Y]^2) && (1.25)\\
&= \mathbb{E}[X^2] + \mathbb{E}[Y^2] - \mathbb{E}[X]^2 - \mathbb{E}[Y]^2 && (1.26)\\
&= \mathrm{var}[X] + \mathrm{var}[Y] \quad \blacksquare && (1.27)
\end{aligned}
$$
since $\mathbb{E}[A]\mathbb{E}[B] = \mathbb{E}[AB]$ when $A$ and $B$ are independent.
1.4.2 Proof of VoS: $X, Y$ are dependent
As before, we can expand out the variance of a sum into its expected values:
$$\mathrm{var}(X + Y) = \mathbb{E}[X^2] + \mathbb{E}[2XY] + \mathbb{E}[Y^2] - (\mathbb{E}[X]^2 + 2\mathbb{E}[X]\mathbb{E}[Y] + \mathbb{E}[Y]^2). \quad (1.28)$$
Since $X$ and $Y$ are assumed to be dependent, the non-squared terms do not necessarily cancel each other out anymore. Instead, we can rearrange as follows:
$$
\begin{aligned}
\mathrm{var}(X + Y) &= \mathrm{var}(X) + \mathrm{var}(Y) + \mathbb{E}[2XY] - 2\mathbb{E}[X]\mathbb{E}[Y] && (1.29)\\
&= \mathrm{var}(X) + \mathrm{var}(Y) + 2(\mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]), && (1.30)
\end{aligned}
$$
and note that $\mathbb{E}[AB] - \mathbb{E}[A]\mathbb{E}[B] = \mathrm{Cov}(A, B)$:
$$\ldots = \mathrm{var}(X) + \mathrm{var}(Y) + 2\mathrm{Cov}(X, Y) \quad \blacksquare \quad (1.31)$$
Two further points are worth noting. First, the independent version of the proof is just a special case of the dependent version. When $X$ and $Y$ are independent, the covariance between the two random variables is zero, and therefore the variance of the sum is just equal to the sum of the variances.
Second, nothing in the above proofs relies on there being just two random variables. In fact, $\mathrm{var}(\sum_{i}^{n} X_i) = \sum_{i}^{n} \mathrm{var}(X_i)$ when all variables are independent of each other, and equal to $\sum_{i}^{n} \mathrm{var}(X_i) + 2\sum_{1 \le i < j \le n} \mathrm{Cov}(X_i, X_j)$ otherwise. This can be proved by induction using the above proofs, but intuitively: we can replace, for example, $Y$ with $Z = (Z_1 + Z_2)$ and iteratively apply the above proof, first to $X + Z$ and then subsequently expanding $\mathrm{var}(Z)$ as $\mathrm{var}(Z_1 + Z_2)$.
Chapter 2
Inequalities involving expectations
This chapter discusses and proves two inequalities that Wooldridge highlights: Jensen's and Chebyshev's. Both involve expectations (and the theorems derived in the previous chapter).
2.1 Jensenโs Inequality
Jensen's Inequality is a statement about the relative size of the expectation of a function compared with the function over that expectation (with respect to some random variable). To understand the mechanics, I first define convex functions and then walk through the logic behind the inequality itself.
2.1.1 Convex functions
A function $f$ is convex (in two dimensions) if every straight line connecting two points on the graph of $f$ lies on or above the graph. More formally, $f$ is convex if, for all $x_1, x_2 \in \mathbb{R}$ and all $t \in [0, 1]$:

$$f(tx_1 + (1-t)x_2) \le tf(x_1) + (1-t)f(x_2).$$
Here, $t$ is a weighting parameter that allows us to range over the full interval between the points $x_1$ and $x_2$.
Note also that concave functions are defined as the opposite of convex functions, i.e. a function $h$ is concave if and only if $-h$ is convex.
2.1.2 The Inequality
Jensenโs Inequality (JI) states that, for a convex function ๐ and random variable ๐:
$$\mathbb{E}[f(X)] \ge f(\mathbb{E}[X])$$
This inequality is exceptionally general: it holds for any convex function. Moreover, given that concave functions are defined as negative convex functions, it is easy to see that JI also implies that if $h$ is a concave function, $h(\mathbb{E}[X]) \ge \mathbb{E}[h(X)]$.¹
Interestingly, note the similarity between this inequality and the definition of variance in terms of expectations:
¹Since $-h(x)$ is convex, $\mathbb{E}[-h(X)] \ge -h(\mathbb{E}[X])$ by JI. Hence, $h(\mathbb{E}[X]) - \mathbb{E}[h(X)] \ge 0$ and so $h(\mathbb{E}[X]) \ge \mathbb{E}[h(X)]$.
$$\mathrm{var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2,$$
and since $\mathrm{var}(X)$ is always non-negative:
$$\mathbb{E}[X^2] - (\mathbb{E}[X])^2 \ge 0 \implies \mathbb{E}[X^2] \ge (\mathbb{E}[X])^2.$$
We can therefore define $f(X) = X^2$ (a convex function), and see that variance itself is an instance of Jensen's Inequality.
2.1.3 Proof
Assume $f(X)$ is a convex function, and $L(X) = a + bX$ is a linear function tangential to $f(X)$ at the point $\mathbb{E}[X]$. Hence, since $f$ is convex and $L$ is tangential to $f$, we know by definition that:
$$f(x) \ge L(x), \quad \forall x \in X. \quad (2.1)$$
So, therefore:
$$
\begin{aligned}
\mathbb{E}[f(X)] &\ge \mathbb{E}[L(X)] && (2.2)\\
&= \mathbb{E}[a + bX] && (2.3)\\
&= a + b\mathbb{E}[X] && (2.4)\\
&= L(\mathbb{E}[X]) && (2.5)\\
&= f(\mathbb{E}[X]) \quad \blacksquare && (2.6)
\end{aligned}
$$
The majority of this proof is straightforward. If one function is always greater than or equal to another function, then the unconditional expectation of the first function must be at least as big as that of the second. The interior lines of the proof follow from the definition of $L$, the linearity of expectations, and another application of the definition of $L$ respectively.
The final line then follows because, by the definition of the straight line $L$, we know that $L$ is tangential to $f$ at $\mathbb{E}[X]$, and so $L(\mathbb{E}[X]) = f(\mathbb{E}[X])$.²
2.1.4 Application
In Chapter 2 of Agnostic Statistics (2019), the authors note (almost in passing) that the standard error of the mean is not unbiased, i.e. that $\mathbb{E}[\hat{\sigma}] \neq \sigma$, even though it is consistent, i.e. that $\hat{\sigma} \xrightarrow{p} \sigma$. The bias of the mean's standard error is somewhat interesting (if not surprising), given how frequently we deploy the standard error (and, in a more general sense, it highlights how important asymptotics are not just for the estimation of parameters, but also for those parameters' uncertainty). The proof of why $\hat{\sigma}$ is biased also, conveniently for this chapter, uses Jensen's Inequality.
The standard error of the mean is denoted as

$$\sigma = \sqrt{V(\bar{X})},$$

where $V(\bar{X}) = V(X)/N$.

²Based on lecture notes by Larry Wasserman.
Our best estimate of this quantity, $\hat{\sigma} = \sqrt{\hat{V}(\bar{X})}$, is simply the square root of the sample variance estimator. We know that the variance estimator itself is an unbiased and consistent estimator of the sampling variance (see Agnostic Statistics, Theorem 2.1.9).
The bias in the estimate of the sample mean's standard error originates from the square root function. Note that the square root is a strictly concave function. This means we can make two claims about the estimator. First, as with any concave function, we can use the inverse version of Jensen's Inequality, i.e. that $\mathbb{E}[f(X)] \le f(\mathbb{E}[X])$. Second, since the square root is strictly concave, we can replace the weak 'less than or equal to' operator with the strict 'less than' inequality. Hence, the proof is reasonably easy:
$$
\begin{aligned}
\mathbb{E}[\hat{\sigma}] = \mathbb{E}\left[\sqrt{\hat{V}(\bar{X})}\right] &< \sqrt{\mathbb{E}[\hat{V}(\bar{X})]} && \text{(by Jensen's Inequality)}\\
&= \sqrt{V(\bar{X})} && \text{(since the sampling variance estimator is unbiased)}\\
&= \sigma. \quad \blacksquare
\end{aligned}
$$
The first line follows by first taking the expectation of the sample mean's standard error estimator, and then applying the noted variant of Jensen's Inequality. Then, since we know that the estimator of the sampling variance is unbiased, we can replace the expectation with the true sampling variance, and note finally that the square root of the true sampling variance is, by definition, the true standard error of the sample mean. Hence, we see that our estimator of the sample mean's standard error is strictly less than the true value and is therefore biased.
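This bias is easy to see by simulation. The sketch below (Python/NumPy; the sample size and repetition count are arbitrary) draws many samples of size n from a standard normal: the variance estimator averages out to the truth, while the square-rooted standard error estimator sits strictly below it, exactly as Jensen's Inequality predicts:

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 10, 200_000
true_se = 1.0 / np.sqrt(n)   # X ~ N(0, 1), so the true SE of the mean is 1/sqrt(n)

samples = rng.normal(size=(reps, n))
var_hat = samples.var(axis=1, ddof=1) / n   # unbiased estimator of V(X-bar)
se_hat = np.sqrt(var_hat)                   # its square root: biased, per Jensen

print(var_hat.mean(), true_se**2)  # approximately equal: the variance estimator is unbiased
print(se_hat.mean(), true_se)      # strictly below true_se: the SE estimator is biased down
```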
2.2 Chebyshevโs Inequality
The other inequality Wooldridge highlights is Chebyshev's Inequality. This inequality states that, for any probability distribution with a finite mean and variance, no more than a specific proportion of that distribution lies more than a set distance from the mean.
More formally, if $\mu = \mathbb{E}[X]$ and $\sigma^2 = \mathrm{var}(X)$, then:
$$P(|Z| \ge k) \le \frac{1}{k^2}, \quad (2.7)$$
where $Z = (X - \mu)/\sigma$ (Wasserman, 2004, p. 64) and $k$ indicates the number of standard deviations.
2.2.1 Proof
First, let us define the variance ($\sigma^2$) as:

$$\sigma^2 = \mathbb{E}[(X - \mu)^2]. \quad (2.8)$$
By expectation theory, we know that we can express any unconditional expectation as the weighted sum of its conditional components, i.e. $\mathbb{E}[A] = \sum_i \mathbb{E}[A|B_i]P(B_i)$, where $\sum_i P(B_i) = 1$. Hence:
$$\ldots = \mathbb{E}[(X-\mu)^2 \mid k\sigma \le |X-\mu|]\,P(k\sigma \le |X-\mu|) + \mathbb{E}[(X-\mu)^2 \mid k\sigma > |X-\mu|]\,P(k\sigma > |X-\mu|) \quad (2.9)$$
Since any probability is bounded between 0 and 1, and variance must be greater than or equal to zero, the second term must be non-negative. If we remove this term, therefore, the right-hand side is necessarily either the same size or smaller. Therefore we can alter the equality to the following inequality:
$$\sigma^2 \ge \mathbb{E}[(X-\mu)^2 \mid k\sigma \le |X-\mu|]\,P(k\sigma \le |X-\mu|) \quad (2.10)$$
This then simplifies:
$$
\begin{aligned}
\sigma^2 &\ge (k\sigma)^2 P(k\sigma \le |X-\mu|)\\
&= k^2\sigma^2 P(k\sigma \le |X-\mu|)\\
\frac{1}{k^2} &\ge P(|Z| \ge k) \quad \blacksquare
\end{aligned}
$$
Conditional on $k\sigma \le |X-\mu|$, we have $(k\sigma)^2 \le (X-\mu)^2$, and therefore $\mathbb{E}[(k\sigma)^2] \le \mathbb{E}[(X-\mu)^2]$ under that condition. The last step then simply rearranges the terms within the probability function.³
2.2.2 Applications
Wasserman (2004) notes that this inequality is useful when we want to know the probable bounds of an unknown quantity, and where direct computation would be difficult. It can also be used to prove the Weak Law of Large Numbers (point 5 in Wooldridge's list!), which I demonstrate here.
It is worth noting, however, that the inequality is really powerful: it guarantees that a certain amount of a probability distribution is within a certain region, irrespective of the shape of that distribution (so long as we can estimate the mean and variance)!
For some well-defined distributions, this theorem is weaker than what we know by dint of their form. For example, we know that for a normal distribution, approximately 95 percent of values lie within 2 standard deviations of the mean. Chebyshev's Inequality only guarantees that 75 percent of values lie within two standard deviations of the mean (since $P(|Z| \ge 2) \le \frac{1}{2^2} = 0.25$). Crucially, however, even if we didn't know whether a given distribution was normal, so long as it is a well-behaved probability distribution (i.e. the unrestricted integral sums to 1) we can guarantee that 75 percent of its mass will lie within two standard deviations of the mean.
³$k\sigma \le |X-\mu| \equiv k \le |X-\mu|/\sigma \equiv |Z| \ge k$, since $\sigma$ is strictly positive.
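The distribution-free character of the bound is easy to demonstrate. The sketch below (Python/NumPy; the two distributions are arbitrary choices) standardises draws from a symmetric and a skewed distribution and counts the mass at least two standard deviations from the mean; both counts respect the 1/4 bound:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000

# Two very differently shaped distributions: one symmetric, one skewed.
draws = {"normal": rng.normal(size=n), "exponential": rng.exponential(size=n)}

results = {}
for name, x in draws.items():
    z = (x - x.mean()) / x.std()             # standardise: Z = (X - mu) / sigma
    results[name] = (np.abs(z) >= 2).mean()  # mass at least 2 sd from the mean

print(results)  # both values fall (well) below the Chebyshev bound of 1/4
```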
Chapter 3
Linear Projection
This chapter provides a basic introduction to projection using both linear algebra and geometric demonstrations. I discuss the derivation of the orthogonal projection, its general properties as an 'operator', and explore its relationship with ordinary least squares (OLS) regression. I defer a discussion of linear projections' applications until the penultimate chapter on the Frisch-Waugh Theorem, where projection matrices feature heavily in the proof.
3.1 Projection
Formally, a projection $P$ is a linear function on a vector space such that when it is applied to itself you get the same result, i.e. $P^2 = P$.¹
This definition is slightly intractable, but the intuition is reasonably simple. Consider a vector $v$ in two dimensions. $v$ is a finite straight line pointing in a given direction. Suppose there is some point $x$ not on this straight line but in the same two-dimensional space. The projection of $x$, i.e. $Px$, is a function that returns the point 'closest' to $x$ along the vector line $v$. Call this point $\hat{x}$. In most contexts, 'closest' refers to Euclidean distance, i.e. $\sqrt{\sum_i (x_i - \hat{x}_i)^2}$, where $i$ ranges over the dimensions of the vector space (in this case two dimensions).² Figure 3.1 depicts this logic visually. The green dashed line shows the orthogonal projection, and the red dashed lines indicate other potential (non-orthogonal) projections that are further away in Euclidean space from $x$ than $\hat{x}$.
In short, projection is a way of simplifying some n-dimensional space, compressing information onto a (hyper-)plane. This is useful especially in social science settings, where the complexity of the phenomena we study means exact prediction is impossible. Instead, we often want to construct models that compress busy and variable data into simpler, parsimonious explanations. Projection is the statistical method of achieving this: it takes the full space and simplifies it with respect to a certain number of dimensions.
While the above is (reasonably) intuitive, it is worth spelling out the maths behind projection, not least because it helps demonstrate the connection between linear projection and linear regression.
To begin, we can take some point in n-dimensional space, $x$, and the vector line $v$ along which we want to project $x$. The goal is the following:
¹Since $P$ is (in the finite case) a square matrix, a projection matrix is an idempotent matrix; I discuss this property in more detail later on in this note.

²Euclidean distance has convenient properties, including that the closest distance between a point and a vector line is orthogonal to the vector line itself.
Figure 3.1: Orthogonal projection of a point onto a vector line.
$$\arg\min_c \sqrt{\sum_i (\hat{x}_i - x_i)^2} = \arg\min_c \sum_i (\hat{x}_i - x_i)^2 = \arg\min_c \sum_i (cv_i - x_i)^2$$
This rearrangement follows since the square root is a monotonic transformation, such that the optimal choice of $c$ is the same across both argmins. Since any potential $\hat{x}$ along the line drawn by $v$ is some scalar multiple of that line ($cv$), we can express the function to be minimised with respect to $c$, and then differentiate:
$$
\begin{aligned}
\frac{d}{dc} \sum_i (cv_i - x_i)^2 &= \sum_i 2v_i(cv_i - x_i)\\
&= 2\left(\sum_i cv_i^2 - \sum_i v_i x_i\right)\\
&= 2(cv'v - v'x) \to 0
\end{aligned}
$$
Here we differentiate the equation and rearrange terms. The final step simply converts the summation notation into matrix multiplication. Solving:
$$
\begin{aligned}
2(cv'v - v'x) &= 0\\
cv'v - v'x &= 0\\
cv'v &= v'x\\
c &= (v'v)^{-1}v'x.
\end{aligned}
$$
From here, note that $\hat{x}$, the projection of $x$ onto the vector line, is $vc = v(v'v)^{-1}v'x$. Hence, we can define the projection matrix of $x$ onto $v$ as:

$$P_v = v(v'v)^{-1}v'.$$
In plain English: for any point in some space, the orthogonal projection of that point onto some subspace is the point on a vector line that minimises the Euclidean distance between itself and the original point. A visual demonstration of this point is shown and discussed in Figure ?? below.
Note also that this projection matrix has a clear analogue to the linear algebraic expression of linear regression. The vector of coefficients in a linear regression, $\beta$, can be expressed as $(X'X)^{-1}X'y$. And we know that multiplying this vector by the matrix of predictors $X$ results in the vector of predicted values, $\hat{y}$. Now we have $\hat{y} = X(X'X)^{-1}X'y \equiv P_X y$. Clearly, therefore, linear projection and linear regression are closely related, and I return to this point below.
3.2 Properties of the projection matrix
The projection matrix $P$ has several interesting properties. First, and most simply, the projection matrix is square. Since $v$ is of some arbitrary dimensions $n \times k$, its transpose is of dimensions $k \times n$. By linear algebra, the shape of the full matrix $v(v'v)^{-1}v'$ is therefore $n \times n$, i.e. square.
Projection matrices are also symmetric, i.e. $P = P'$. To prove symmetry, note that transposing both sides of the projection matrix definition:
$$
\begin{aligned}
P' &= (v(v'v)^{-1}v')' && (3.1)\\
&= v(v'v)^{-1}v' && (3.2)\\
&= P, && (3.3)
\end{aligned}
$$
since $(AB)' = B'A'$ and $(A^{-1})' = (A')^{-1}$.

Projection matrices are also idempotent:
$$
\begin{aligned}
PP &= v(v'v)^{-1}v'v(v'v)^{-1}v' && (3.4)\\
&= v(v'v)^{-1}v' && (3.5)\\
&= P, && (3.6)
\end{aligned}
$$
since $A^{-1}A = I$ and $BI = B$.
Since projection matrices are idempotent, projecting a point already on the vector line will just return that same point. This is fairly intuitive: the closest point on the vector line to a point already on the vector line is that same point.
Finally, we can see that the line connecting any point to its projection is orthogonal to the vector line. Two vectors $a$ and $b$ are orthogonal if $a \cdot b = 0$. Starting from the first-order condition above (i.e. minimising the Euclidean distance with respect to $c$):
$$
\begin{aligned}
2(cv'v - v'x) &= 0\\
v'cv - v'x &= 0\\
v'(cv - x) &= 0\\
v'(\hat{x} - x) &= 0,
\end{aligned}
$$
hence the line connecting the original point $x$ to its projection $\hat{x}$ is orthogonal to the vector line.
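These properties can be confirmed numerically. The sketch below (Python/NumPy; the matrix $v$ and point $x$ are arbitrary) builds $P_v = v(v'v)^{-1}v'$ and checks symmetry, idempotency, and the orthogonality of the projection error:

```python
import numpy as np

# A concrete v: an n x k matrix (n = 3 observations, k = 2 columns).
v = np.array([[3.0, 1.0],
              [1.0, 1.0],
              [1.0, 1.0]])

P = v @ np.linalg.inv(v.T @ v) @ v.T   # P_v = v (v'v)^{-1} v'

x = np.array([2.0, 3.0, 2.0])          # an arbitrary point to project
x_hat = P @ x

print(np.allclose(P, P.T))                # symmetric
print(np.allclose(P @ P, P))              # idempotent
print(np.allclose(P @ x_hat, x_hat))      # projecting a projected point changes nothing
print(np.allclose(v.T @ (x_hat - x), 0))  # the error is orthogonal to col(v)
```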
The projection matrix is very useful in other fundamental theorems in econometrics, like the Frisch-Waugh-Lovell Theorem discussed in Chapter 8.
3.3 Linear regression
Given a vector of interest, how do we capture as much information from it as possible using a set of predictors? Projection matrices essentially simplify the dimensionality of some space by casting points onto a lower-dimensional plane. Think of it like capturing the shadow of an object on the ground. There is far more detail in the actual object itself, but we roughly know its position, shape, and scale from the shadow that's cast on the 2D plane of the ground.
Note also this is actually quite similar to how we think about regression. Loosely, when we regress $Y$ on $X$, we are trying to characterise how the components (or predictors) within $X$ characterise or relate to $Y$. Of course, regression is also imperfect (after all, the optimisation goal is to minimise the errors of our predictions). So, regression also seems to capture some lower-dimensional approximation of an outcome.
In fact, linear projection and linear regression are very closely related. In this final section, I outline how these two statistical concepts relate to each other, both algebraically and geometrically.
Suppose we have a vector of outcomes $y$, and some n-dimensional matrix $X$ of predictors. We write the linear regression model as:
$$y = X\beta + \epsilon, \quad (3.7)$$
where $\beta$ is a vector of coefficients, and $\epsilon$ is the difference between the prediction and the observed value in $y$. The goal of linear regression is to minimise the sum of the squared residuals:
$$\arg\min_\beta \sum \epsilon^2 = \arg\min_\beta (y - X\beta)'(y - X\beta)$$
Differentiating with respect to $\beta$ and solving:
$$
\begin{aligned}
\frac{d}{d\beta}(y - X\beta)'(y - X\beta) &= -2X'(y - X\beta)\\
&= 2X'X\beta - 2X'y \to 0\\
X'X\beta &= X'y\\
(X'X)^{-1}X'X\beta &= (X'X)^{-1}X'y\\
\beta &= (X'X)^{-1}X'y.
\end{aligned}
$$
To get our prediction of $y$, i.e. $\hat{y}$, we simply multiply the coefficient vector by the matrix $X$:
$$\hat{y} = X(X'X)^{-1}X'y.$$
Note how the OLS derivation of $\hat{y}$ is very similar to $P_v = v(v'v)^{-1}v'$, the orthogonal projection matrix. The two differ only in that $\hat{y}$ includes the original outcome vector $y$ in its expression. But note that $P_X y = X(X'X)^{-1}X'y = \hat{y}$! Hence the predicted values from a linear regression simply are an orthogonal projection of $y$ onto the space defined by $X$.
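The equivalence is easy to verify directly. In the sketch below (Python/NumPy, with simulated data), the fitted values from the OLS normal equations coincide with $P_X y$:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # constant + one predictor
y = 1.0 + 2.0 * X[:, 1] + rng.normal(size=n)

beta = np.linalg.solve(X.T @ X, X.T @ y)   # OLS: (X'X)^{-1} X'y
y_hat_ols = X @ beta

P = X @ np.linalg.inv(X.T @ X) @ X.T       # projection matrix onto col(X)
y_hat_proj = P @ y

print(np.allclose(y_hat_ols, y_hat_proj))  # True: OLS fitted values are P_X y
```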
3.3.1 Geometric interpretation
It should be clear now that linear projection and linear regression are connected, but it is probably less clear why this holds. To understand what's going on, let's depict the problem geometrically.³
To appreciate what's going on, we first need to invert how we typically think about observations, variables and datapoints. Consider a bivariate regression problem with three observations. Our data will include three variables: a constant (c, a vector of 1's), a predictor (X), and an outcome variable (Y). As a matrix, this might look something like the following:
Y  X  c
2  3  1
3  1  1
2  1  1
Typically we would represent the relationship geometrically by treating the variables as dimensions, such that every datapoint is an observation (and we would typically ignore the constant column since all its values are the same).
An alternative way to represent this data is to treat each observation (i.e. row) as a dimension and then represent each variable as a vector. What does that actually mean? Well, consider the column $Y = (2, 3, 2)$. This vector essentially gives us the coordinates for a point in three-dimensional space: $Y_1 = 2, Y_2 = 3, Y_3 = 2$. Drawing a straight line from the origin (0,0,0) to this point gives us a vector line for the outcome. While visually this might seem strange, from the perspective of our data it's not unusual to refer to each variable as a column vector, and that's precisely because it is a quantity with a magnitude and direction (as determined by its position in $n$ dimensions).
Our predictors are the vectors $X$ and $c$ (note the vector $X$ is now slightly more interesting because it is a diagonal line through the three-dimensional space). We can extend either vector line by multiplying it by a constant, e.g. $2X = (6, 2, 2)$. With a single vector, we can only move forwards or backwards along a line. But if we combine two vectors together, we can actually reach lots of points in space. Imagine placing the vector $c$ at the end of $X$. The total path now reaches a new point that is not intersected by either $X$ or $c$. In fact, if we multiply $X$ and $c$ by some scalars (numbers), we can snake our way across a whole array of different points in three-dimensional space. Figure 3.2 demonstrates some of these combinations in the two-dimensional space created by $X$ and $c$.
The comprehensive set of all possible points covered by linear combinations of $X$ and $c$ is called the span or column space. In fact, with the specific set-up of this example (3 observations, two predictors), the span of our predictors is a flat plane. Imagine taking a flat bit of card and aligning one corner with the origin, then angling the surface so that the end points of the vectors $X$ and $c$ are both resting on the card's surface. Keeping that alignment, any point on the surface of the card is reachable by some combination of $X$ and $c$. Algebraically we can refer to this surface as $\mathrm{col}(X, c)$, and it generalises beyond two predictors (although this is much harder to visualise).
Crucially, in our reduced example of three-dimensional space, there are points in space not reachable by combining these two vectors (any point above or below the piece of card). We know, for instance, that the vector line $y$ lies off this plane. The goal therefore is to find a vector on the column space of $(X, c)$ that gets as close to our off-plane vector $y$ as possible. Figure 3.3 depicts this set-up visually: each dimension is an observation, each column in the matrix is represented as a vector, and the column space of $(X, c)$ is the shaded grey plane. The vector $y$ lies off this plane.
From our discussion in Section 3.1, we know that the 'best' vector is the orthogonal projection from the column space to the vector $y$. This is the shortest possible distance between the flat plane and the observed
³This final section borrows heavily from Ben Lambert's explanation of projection and a demonstration using R by Andy Eggers.
Figure 3.2: Potential combinations of two vectors.
outcome, and is just $\hat{y}$. Moreover, since $\hat{y}$ lies on the column space, we know we only need to combine some scaled amounts of $X$ and $c$ to define the vector $\hat{y}$, i.e. $\beta_1 X + \beta_0 c$. Figure 3.4 shows this geometrically. And in fact, the scalar coefficients $\beta_1, \beta_0$ in this case are just the regression coefficients derived from OLS. Why? Because we know that the orthogonal projection of $y$ onto the column space minimises the error between our prediction $\hat{y}$ and the observed outcome vector $y$. This is the same as the minimisation problem that OLS solves, as outlined at the beginning of this section!
Consider any other vector on the column space and the distance between itself and $y$. Each non-orthogonal vector would be longer, and hence have a larger predictive error, than $\hat{y}$'s. For example, Figure 3.5 plots two alternative vectors on $\mathrm{col}(X, c)$ alongside $\hat{y}$. Clearly, $e < e' < e''$, and this is true of any other vector on the column space too.
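The three-observation example can also be worked through numerically. The sketch below (Python/NumPy) stacks $X = (3, 1, 1)$ and $c = (1, 1, 1)$ as columns, solves for the scalars that define $\hat{y}$, and confirms the error vector is orthogonal to the column space:

```python
import numpy as np

y = np.array([2.0, 3.0, 2.0])                    # the outcome vector Y = (2, 3, 2)
X = np.column_stack([np.array([3.0, 1.0, 1.0]),  # the predictor X
                     np.ones(3)])                # the constant c

beta = np.linalg.solve(X.T @ X, X.T @ y)  # scalars (beta_1, beta_0) combining X and c
y_hat = X @ beta                          # the point on col(X, c) closest to y
resid = y - y_hat

print(beta)         # the OLS coefficients
print(X.T @ resid)  # ~ zero: the error vector is orthogonal to col(X, c)
```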
Hence, linear projection and linear regression can be seen (both algebraically and geometrically) to be solving the same problem: minimising the (squared) distance between an observed vector $y$ and a prediction vector $\hat{y}$. This demonstration generalises to many dimensions (observations), though of course it becomes much harder to intuit the geometry of highly-dimensional data. And similarly, with more observations we could also extend the number of predictors, such that $X$ is not a single column vector but a matrix of predictor variables (i.e. multivariate regression). Again, visualising what the column space of this matrix would look like geometrically becomes harder.
To summarise, this section has demonstrated two features. First, that linear regression simply is an orthogonal projection. We saw this algebraically by noting that the derivation of OLS coefficients, and subsequently the predicted values from a linear regression, is identical to $P_X y$ (where $P_X$ is a projection matrix). Second, and geometrically, we intuited why this is the case: namely, that projecting onto a lower-dimensional column space involves finding the linear combination of predictors that minimises the Euclidean distance to $y$, i.e. $\hat{y}$. The scalars we use to do so are simply the regression coefficients we would generate using OLS regression.
Chapter 4
Weak Law of Large Numbers and Central Limit Theorem
This chapter focuses on two fundamental theorems that form the basis of our inferences from samples to populations. The Weak Law of Large Numbers (WLLN) provides the basis for generalisation from a sample mean to the population mean. The Central Limit Theorem (CLT) provides the basis for quantifying our uncertainty over this parameter. In both cases, I discuss the theorem itself and provide an annotated proof. Finally, I discuss how the two theorems complement each other.
4.1 Weak Law of Large Numbers
4.1.1 Theorem in Plain English
Suppose we have a random variable $X$. From $X$, we can generate a sequence of random variables $X_1, X_2, ..., X_n$ that are independent and identically distributed (i.i.d.) draws of $X$. Assuming $n$ is finite, we can perform calculations on this sequence of random numbers. For example, we can calculate the mean of the sequence $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. This value is the sample mean: from a much wider population, we have drawn a finite sequence of observations, and calculated the average across them. How do we know that this sample parameter is meaningful with respect to the population, and therefore that we can make inferences from it?
WLLN states that the mean of a sequence of i.i.d. random variables converges in probability to the expected value of the random variable as the length of that sequence tends to infinity. By 'converging in probability', we mean that the probability that the difference between the mean of the sample and the expected value of the random variable exceeds any fixed amount tends to zero.
In short, WLLN guarantees that with a large enough sample size the sample mean should approximately match the true population parameter. Clearly, this is a powerful theorem for any statistical exercise: given we are (always) constrained by a finite sample, WLLN ensures that we can infer from the data something meaningful about the population. For example, from a large enough sample of voters we can estimate the average support for a candidate or party.
More formally, we can state WLLN as follows:
$$\bar{X}_n \xrightarrow{p} E[X], \quad (4.1)$$

where $\xrightarrow{p}$ denotes 'converging in probability'.
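To see WLLN in action numerically, here is a minimal Python sketch of my own (the book's own examples use R; the Uniform(0, 1) choice is an arbitrary assumption for illustration): the sample mean of Uniform(0, 1) draws, whose expected value is 0.5, settles towards 0.5 as $n$ grows.

```python
import random
import statistics

random.seed(42)

# Sample means of Uniform(0, 1) draws; E[X] = 0.5.
sample_means = {}
for n in [10, 1000, 100000]:
    draws = [random.random() for _ in range(n)]
    sample_means[n] = statistics.mean(draws)

for n, m in sample_means.items():
    print(n, round(abs(m - 0.5), 4))
```

The absolute deviation from 0.5 shrinks as the sequence lengthens, which is exactly the convergence (4.1) describes.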
4.1.2 Proof
To prove WLLN, we use Chebyshev's Inequality (CI). More specifically, we first have to prove Chebyshev's Inequality of the Sample Mean (CISM), and then use CISM to prove WLLN. The following steps are based on the proof provided in Aronow and Miller (2019).
Proof of Chebyshev's Inequality of the Sample Mean. Chebyshev's Inequality for the Sample Mean (CISM) states that:

$$P(|\bar{X}_n - E[X]| \ge k) \le \frac{var(X)}{k^2 n}, \quad (4.2)$$

where $\bar{X}_n$ is the sample mean of a sequence of $n$ independent draws from a random variable $X$. Recall that CI states that $P(|(X - \mu)/\sigma| \ge k) \le \frac{1}{k^2}$. To help prove CISM, we can rearrange the left hand side of the inequality by multiplying both sides of the inequality within the probability function by $\sigma$, such that:

$$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}. \quad (4.3)$$

Then, finally, let us define $k' = \frac{k}{\sigma}$, where $\sigma$ here denotes the standard deviation of $\bar{X}$. Hence:

$$
\begin{aligned}
P(|\bar{X} - E[X]| \ge k) &= P(|\bar{X} - E[X]| \ge k'\sigma) & (4.4) \\
&\le \frac{1}{k'^2} & (4.5) \\
&= \frac{\sigma^2}{k^2} & (4.6) \\
&= \frac{var(\bar{X})}{k^2} & (4.7) \\
&= \frac{var(X)}{k^2 n} \quad \square & (4.8)
\end{aligned}
$$
This proof is reasonably straightforward. Using our definition of $k'$ allows us to rearrange the probability within CISM to match the form of the Chebyshev Inequality stated above, which then allows us to infer the bounds of the probability. We then replace $k'$ with $\frac{k}{\sigma}$, expand and simplify. The move made between the penultimate and final line relies on the fact that the variance of the sample mean is equal to the variance of the random variable divided by the sample size ($n$).¹
Applying CISM to the WLLN proof. Given that all probabilities are non-negative, and given CISM, we can now write:

$$0 \le P(|\bar{X}_n - E[X]| \ge k) \le \frac{var(X)}{k^2 n}. \quad (4.9)$$
Note that for the first and third terms of this multiple inequality, as $n$ approaches infinity both terms approach 0. In the case of the constant zero, this is trivial. In the final term, note that $var(X)$ denotes the inherent variance of the random variable, and is therefore constant as $n$ increases. Therefore, as the denominator increases, the term converges to zero.
Since the middle term is sandwiched between these two limits, by definition we know that this term must also converge to zero.² Therefore:
¹See Aronow and Miller 2019, p. 98.
²To see why this is the case, given the limits of the first and third terms, Equation (4.9) is of the form $0 \le A \le 0$ as $n \to \infty$. The only value of $A$ that satisfies this inequality is 0.
$$\lim_{n\to\infty} P(|\bar{X}_n - E[X]| \ge k) = 0 \quad \square \quad (4.10)$$
Hence, WLLN is proved: for any value of $k$, the probability that the difference between the sample mean and the expected value is greater than or equal to $k$ converges to zero. Since $k$'s value is arbitrary, it can be set to something infinitesimally small, such that the sample mean and expected value converge in value.
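As a rough numerical check (a Python sketch of my own; the Uniform(0, 1) variable, $n$, $k$, and repetition count are all assumptions chosen for illustration), we can compare the empirical probability that the sample mean deviates from $E[X]$ by at least $k$ against the CISM bound $\frac{var(X)}{k^2 n}$:

```python
import random

random.seed(1)

n, k, reps = 100, 0.1, 2000
var_x = 1 / 12                  # variance of a Uniform(0, 1) variable
bound = var_x / (k ** 2 * n)    # the CISM upper bound

# Empirical probability that |xbar_n - E[X]| >= k across repeated samples.
exceed = 0
for _ in range(reps):
    xbar = sum(random.random() for _ in range(n)) / n
    if abs(xbar - 0.5) >= k:
        exceed += 1
empirical = exceed / reps

print(empirical, bound)
```

The empirical exceedance probability sits comfortably below the Chebyshev-style bound, as the inequality requires (the bound is loose by design, since it holds for any distribution with finite variance).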
4.2 Central Limit Theorem
WLLN applies to the value of the statistic itself (the mean value). Given a single, $n$-length sequence drawn from a random variable, we know that the mean of this sequence will converge on the expected value of the random variable. But often, we want to think about what happens when we (hypothetically) calculate the mean across multiple sequences, i.e. expectations under repeat sampling.
The Central Limit Theorem (CLT) is closely related to the WLLN. Like WLLN, it relies on asymptotic properties of random variables as the sample size increases. CLT, however, lets us make informative claims about the distribution of the sample mean around the true population parameter.
4.2.1 Theorem in Plain English
CLT states that as the sample size increases, the distribution of sample means converges to a normal distribution. That is, so long as the underlying distribution has a finite variance (bye bye Cauchy!), then irrespective of the underlying distribution of $X$ the distribution of sample means will be a normal distribution!
In fact, there are multiple types of CLT that apply in a variety of different contexts: cases including Bernoulli random variables (de Moivre-Laplace), where random variables are independent but do not need to be identically distributed (Lyapunov), and where random variables are vectors in $\mathbb{R}^k$ space (multivariate CLT).
In what follows, I will discuss a weaker, more basic case of CLT where we assume random variables are scalar, independent, and identically distributed (i.e. drawn from the same unknown distribution function). In particular, this section proves that the standardised difference between the sample mean and population mean for i.i.d. random variables converges in distribution to the standard normal distribution $N(0, 1)$. This variant of the CLT is called the Lindeberg-Levy CLT, and can be stated as:

$$\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} N(0, 1), \quad (4.11)$$

where $\xrightarrow{d}$ denotes 'converging in distribution'.
In general, the CLT is useful because proving that the sample mean is normally distributed allows us to quantify the uncertainty around our parameter estimate. Normal distributions have convenient properties that allow us to calculate the area under any portion of the curve, given just the mean and standard deviation. We already know by WLLN that the sample mean will (with a sufficiently large sample) approximate the population mean, so we know that the distribution is also centred around the true population mean. By CLT, the dispersion around that point is therefore normal, and quantifying the probable bounds of the point estimate (under the assumption of repeat sampling) requires only an estimate of the variance.
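The claim is easy to check by simulation. Below is a Python sketch of my own (not the book's code, which uses R; the Exponential(1) variable and sample sizes are assumptions for illustration): even for a heavily skewed Exponential(1) variable ($\mu = 1$, $\sigma = 1$), the standardised sample mean $\sqrt{n}(\bar{X}_n - \mu)/\sigma$ has mean near 0 and standard deviation near 1 across repeated samples.

```python
import math
import random
import statistics

random.seed(7)

# Standardised sample means of a skewed Exponential(1) variable (mu = 1, sigma = 1).
n, reps = 200, 5000
z = []
for _ in range(reps):
    total = sum(random.expovariate(1.0) for _ in range(n))
    z.append(math.sqrt(n) * (total / n - 1.0))

print(round(statistics.mean(z), 3), round(statistics.stdev(z), 3))
```

Despite the skew of the underlying exponential, the distribution of standardised means is close to $N(0, 1)$, which is precisely what (4.11) predicts.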
4.2.2 Primer: Characteristic Functions
CLT is harder (and lengthier) to prove than other proofs we've encountered so far: it relies on showing that the sample mean converges in distribution to a known mathematical form that uniquely and fully describes
the normal distribution. To do so, we use the idea of a characteristic function, which simply denotes a function that completely defines a probability distribution.
For example, and we will use this later on, we know that the characteristic function of the normal distribution is $e^{i\mu t - \frac{\sigma^2 t^2}{2}}$. A standard normal distribution (where $\mu = 0, \sigma^2 = 1$) therefore simplifies to $e^{-\frac{t^2}{2}}$.
More generally, we know that for any scalar random variable $X$, the characteristic function of $X$ is defined as:

$$\phi_X(t) = E[e^{itX}], \quad (4.12)$$
where $t \in \mathbb{R}$ and $i$ is the imaginary unit. Proving why this is the case is beyond the purview of this section, so unfortunately I will just take it at face value.
We can expand $e^{itX}$ as an infinite sum, using a Taylor Series, since $e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + ...$. Hence:

$$\phi_X(t) = E\left[1 + itX + \frac{(itX)^2}{2!} + \frac{(itX)^3}{3!} + ...\right], \quad (4.13)$$
Note that $i^2 = -1$, and since the later terms tend to zero faster than the second order term we can summarise them as $o(t^2)$ (they are no larger than of order $t^2$). Therefore we can rewrite this expression as:
$$\phi_X(t) = E\left[1 + itX - \frac{t^2}{2}X^2 + o(t^2)\right]. \quad (4.14)$$
In the case of continuous random variables, the expected value can be expressed as the integral across all space of the expression multiplied by the probability density, such that:
$$\phi_X(t) = \int_{-\infty}^{\infty} \left[1 + itX - \frac{t^2}{2}X^2 + o(t^2)\right] f_X \, dX, \quad (4.15)$$
and this can be simplified to:
$$\phi_X(t) = 1 + itE[X] - \frac{t^2}{2}E[X^2] + o(t^2), \quad (4.16)$$
since $1 \times f_X = f_X$ and the total area under a probability density necessarily sums to 1; $\int X f_X \, dX$ is the definition of the expected value of $X$, and so by similar logic $\int X^2 f_X \, dX = E[X^2]$.

In Ben Lambert's video introducing the CLT proof, he notes that if we assume $X$ has mean 0 and variance 1, the characteristic function of that distribution has some nice properties, namely that it simplifies to:
$$\phi_X(t) = 1 - \frac{t^2}{2} + o(t^2), \quad (4.17)$$
since $E[X] = 0$ cancels the second term, and $E[X^2] \equiv E[(X - 0)^2] = E[(X - \mu)^2] = var(X) = 1$.
One final piece of characteristic function math that will help finalise the CLT proof is to note that if we define some random variable $Q_n = \sum_{i=1}^{n} R_i$, where all $R_i$ are i.i.d., then the characteristic function of $Q_n$ can be expressed as $\phi_{Q_n}(t) = [\phi_R(t)]^n$. Again, I will not prove this property here.
4.2.3 Proof of CLT
This proof is based in part on Ben Lambert's excellent YouTube series, as well as Lemons et al. (2002).
Given the above discussion of characteristic functions, let us assume a sequence of independent and identically distributed (i.i.d.) random variables $X_1, X_2, ..., X_n$, each with mean $\mu$ and finite³ variance $\sigma^2$. The sum of these random variables has mean $n\mu$ (since each random variable has the same mean) and variance equivalent to $n\sigma^2$ (because the random variables are i.i.d., we know that $var(A + B) = var(A) + var(B)$).

Now let's consider the standardised difference between the actual sum of the random variables and the mean. Standardisation simply means dividing a parameter estimate by its standard deviation. In particular, we can consider the following standardised random variable:
$$Z_n = \frac{\sum_{i=1}^{n}(X_i - \mu)}{\sigma\sqrt{n}}, \quad (4.18)$$
where $Z_n$, in words, is the standardised difference between the sum of i.i.d. random variables and the expected value of the sequence. Note that we use the known variance in the denominator.
We can simplify this further:
$$Z_n = \sum_{i=1}^{n} \frac{1}{\sqrt{n}} Y_i, \quad (4.19)$$
where we define a new random variable $Y_i = \frac{X_i - \mu}{\sigma}$.
$Y_i$ has some convenient properties. First, since each random variable $X_i$ in our sample has mean $\mu$, we know that $E[Y_i] = 0$, since $E[X_i] = \mu$ and therefore $\mu - \mu = 0$. Note that this holds irrespective of the distribution and value of $E[X_i]$.

The variance of $Y_i$ is also recoverable. First note three basic features of variance: if $c$ is a constant and $X$ is a random variable, then $var(c) = 0$; $var(cX) = c^2 var(X)$; and, for the difference between a random variable and a constant, $var(X - c) = var(X) - var(c)$. Therefore:
$$
\begin{aligned}
var\left(\frac{1}{\sigma}(X_i - \mu)\right) &= \frac{1}{\sigma^2} var(X_i - \mu) & (4.20) \\
var(X_i - \mu) &= var(X_i) - var(\mu) & (4.21) \\
&= var(X_i). & (4.22)
\end{aligned}
$$
Hence:
$$var(Y_i) = \frac{var(X_i)}{\sigma^2} = 1, \quad (4.23)$$
since $var(X_i) = \sigma^2$.
At this stage, the proof is tantalisingly close. While we have not yet fully characterised the distribution of $Z_n$ or even $Y_i$, the fact that $Y_i$ has unit variance and a mean of zero suggests we are on the right track to proving that this does asymptotically tend in distribution to the standard normal. In fact, recall from the primer on characteristic functions that, as Lambert notes, for any random variable with unit variance and mean of 0, $\phi_Y(t) = 1 - \frac{t^2}{2} + o(t^2)$. Hence, we can now say that:
³Hence why distributions without a finite variance, like the Cauchy, are not covered by CLT.
$$\phi_{Y_i}(t) = 1 - \frac{t^2}{2} + o(t^2). \quad (4.24)$$
Now let us return to $Z_n = \sum_{i=1}^{n} \frac{1}{\sqrt{n}} Y_i$ and, using the final bit of characteristic function math in the primer, we can express the characteristic function of $Z_n$ as:
$$\phi_{Z_n}(t) = \left[\phi_Y\left(\frac{t}{\sqrt{n}}\right)\right]^n, \quad (4.25)$$
since $Y_i$ is divided by the square root of the sample size. Given our previously stated expression of the characteristic function of $Y_i$:
$$\phi_{Z_n}(t) = \left[1 - \frac{t^2}{2n} + o(t^2)\right]^n. \quad (4.26)$$
We can now consider what happens as $n \to \infty$. By definition, we know that the $o(t^2)$ term converges to zero faster than the other terms, so we can safely ignore it. As a result, and noting that $e^x = \lim_{n\to\infty}\left(1 + \frac{x}{n}\right)^n$:
$$\lim_{n\to\infty} \phi_{Z_n}(t) = e^{-\frac{t^2}{2}}. \quad (4.27)$$
This expression shows that as $n$ tends to infinity, the characteristic function of $Z_n$ is that of the standard normal distribution (as noted in the characteristic function primer). Therefore:
$$
\begin{aligned}
\lim_{n\to\infty} Z_n &= N(0, 1) & (4.28) \\
\lim_{n\to\infty} \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} &= N(0, 1). \quad \square & (4.29)
\end{aligned}
$$
The last line here simply follows from the definition of $Z_n$.
4.2.4 Generalising CLT
From here, it is possible to intuit the more general CLT: that the distribution of sampling means is normally distributed around the true mean $\mu$ with variance $\frac{\sigma^2}{n}$. Note this is only a pseudo-proof because, as Lambert notes, multiplying through by $n$ is complicated by the limit operator with respect to $n$. However, it is useful to see how these two CLTs are closely related.
First, we can rearrange the limit expression using known features of the normal distribution:
$$
\begin{aligned}
\lim_{n\to\infty} Z_n &\xrightarrow{d} N(0, 1) & (4.30) \\
\lim_{n\to\infty} \frac{\sum_{i=1}^{n}(X_i) - n\mu}{\sqrt{n\sigma^2}} &\xrightarrow{d} N(0, 1) & (4.31) \\
\lim_{n\to\infty} \sum_{i=1}^{n}(X_i) - n\mu &\xrightarrow{d} N(0, n\sigma^2) & (4.32) \\
\lim_{n\to\infty} \sum_{i=1}^{n}(X_i) &\xrightarrow{d} N(n\mu, n\sigma^2), & (4.33)
\end{aligned}
$$
since $cN(a, b) = N(ca, c^2 b)$, and $N(a, b) + c = N(a + c, b)$.

At this penultimate step, we know that the sum of i.i.d. random variables is normally distributed. To see that the sample mean is also normally distributed, we simply divide through by $n$:
$$\lim_{n\to\infty} \bar{X} = \frac{1}{n}\sum_{i=1}^{n}(X_i) \xrightarrow{d} N\left(\mu, \frac{\sigma^2}{n}\right). \quad (4.34)$$
4.2.5 Limitation of CLT (and the importance of WLLN)
Before ending, it is worth noting that CLT is a claim with respect to repeat sampling from a population (holding $n$ constant each time). It is not, therefore, a claim that holds with respect to any particular sample draw. We may actually estimate a mean value that, while probable, lies away from the true population parameter (by definition, since the sample means are normally distributed, there is some dispersion). Constructing uncertainty estimates using CLT on this estimate alone does not guarantee that we are in fact capturing either the true variance or the true parameter.
That being said, with a sufficiently high N, we know that WLLN guarantees (assuming i.i.d. observations) that our estimate converges on the population mean. WLLN's asymptotics rely only on a sufficiently large sample size for a single sample. Hence, both WLLN and CLT are crucial for valid inference from sampled data. WLLN leads us to expect that our parameter estimate will in fact be centred approximately near the true parameter. By itself, CLT can only say that across multiple samples from the population the distribution of sample means is centred on the true parameter. With WLLN in action, however, CLT allows us to make inferential claims about the uncertainty of this converged parameter.
Chapter 5
Slutskyโs Theorem
5.1 Theorem in plain English
Slutsky's Theorem allows us to make claims about the convergence of random variables. It states that a random variable converging in distribution to some distribution $X$, when multiplied by a variable converging in probability to some constant $a$, converges in distribution to $a \times X$. Similarly, if you add the two random variables, they converge in distribution to $X$ plus $a$. More formally, the theorem states that if $X_n \xrightarrow{d} X$ and $A_n \xrightarrow{p} a$, where $a$ is a constant, then:
1. $X_n + A_n \xrightarrow{d} X + a$
2. $A_n X_n \xrightarrow{d} aX$
Note that if $A_n$ (or some second variable $B_n \xrightarrow{p} b$) does not converge in probability to a constant, and instead converges towards some distribution, then Slutsky's Theorem does not hold. More trivially, if all variables converge in probability to constants (including $X_n \xrightarrow{p} x$), then $A_n X_n + B_n \xrightarrow{p} ax + b$.
5.2 Coded demonstration
This theorem is reasonably intuitive. Suppose that the random variable $X_n$ converges in distribution to a standard normal distribution $N(0, 1)$. For part 2) of the Theorem, note that when we multiply a standard normal by a constant we 'stretch' the distribution (assuming $|a| > 1$, else we 'compress' it). Recall from the discussion of the standard normal in Chapter 4 that $aN(0, 1) = N(0, a^2)$. As $n$ approaches infinity, therefore, by definition $A_n \xrightarrow{p} a$, and so the degree to which the standard normal is stretched will converge to that constant too. To demonstrate this feature visually, consider the following simulation:

library(ggplot2)
set.seed(89)
N <- c(10, 20, 500)

results <- data.frame(n = as.factor(levels(N)),
                      X_n = as.numeric(),
                      A_n = as.numeric(),
                      aX = as.numeric())

for (n in N) {
  X_n <- rnorm(n)
  A_n <- 2 + exp(-n)
  aX <- A_n * X_n

  results <- rbind(results, cbind(n, X_n, A_n, aX))
}

ggplot(results, aes(x = aX)) +
  facet_wrap(n ~ ., ncol = 3, labeller = "label_both") +
  geom_density() +
  labs(y = "p(aX)")
[Figure: density plots of aX, faceted by n = 10, 20, 500; x-axis aX, y-axis p(aX).]
Here we have defined two random variables: X_n is a standard normal, and A_n converges in value to 2. Varying the value of n, I take $n$ draws from a standard normal distribution and calculate the value of the converging constant $A_n$. I then generate the product of these two variables. The figure plots the resulting distribution aX. We can see that as n increases, the distribution becomes increasingly normal, remains centred around 0, and the variance approaches 4 (since 95% of the curve is approximately bounded between $0 \pm 2 \times \sqrt{var(aX)} = 0 \pm 2 \times 2 = 0 \pm 4$).
Similarly, if we add the constant $a$ to a standard normal distribution, the effect is to shift the distribution in its entirety (since a constant has no variance, it does not 'stretch' the distribution). As $A_n$ converges in probability, therefore, the shift converges on the constant $a$. Again, we can demonstrate this result in R:

library(ggplot2)
set.seed(89)
N <- c(10, 20, 500)

results <- data.frame(n = as.factor(levels(N)),
                      X_n = as.numeric(),
                      A_n = as.numeric(),
                      a_plus_X = as.numeric())

for (n in N) {
  X_n <- rnorm(n)
  A_n <- 2 + exp(-n)
  a_plus_X <- A_n + X_n

  results <- rbind(results, cbind(n, X_n, A_n, a_plus_X))
}

ggplot(results, aes(x = a_plus_X)) +
  facet_wrap(n ~ ., ncol = 3, labeller = "label_both") +
  geom_density() +
  geom_vline(xintercept = 2, linetype = "dashed") +
  labs(y = "p(a+X)", x = "a+X")
[Figure: density plots of a+X, faceted by n = 10, 20, 500, with a dashed vertical line at a+X = 2; y-axis p(a+X).]
As n becomes larger, the resulting distribution becomes approximately normal, with variance of 1 and a mean value centred around $0 + a = 2$.
Slutsky's Theorem is so useful precisely because it allows us to combine multiple random variables with known asymptotics and retain this knowledge, i.e. we know what the resultant distribution will converge to assuming $n \to \infty$.
5.3 Proof of Slutskyโs Theorem
Despite the intuitive appeal of Slutskyโs Theorem, the proof is less straightforward. It relies on the continuousmapping theorem (CMT), which in turns rests on several other theorems such as the Portmanteau Theorem.To avoid the rabbit hole of proving all necessary antecedent theorems, I simply introduce and state thecontinuous mapping theorem (CMT) here, and then show how this can be used to prove Slutskyโs Theorem.
5.3.1 CMT
The continuous mapping theorem states that if there is some random variable such that $X_n \xrightarrow{d} X$, then $g(X_n) \xrightarrow{d} g(X)$, so long as $g$ is a continuous function. In approximate terms (which are adequate for our purpose), a continuous function is one in which, for a given domain, the function can be represented as a single unbroken curve (or hyperplane in many dimensions). For example, consider the graph of $f(x) = x^{-1}$. For the domain $D^+: \mathbb{R} > 0$, this function is continuous. But for the domain $D: \mathbb{R}$, the function is discontinuous because the function is undefined when $x = 0$.
In short, CMT states that a continuous function preserves the asymptotic limits of a random variable. More broadly (and again, I do not prove this here), CMT entails that $g(X_n, Y_n, ..., Z_n) \xrightarrow{d} g(X, Y, ..., Z)$ if all $X_n, Y_n$, etc. converge in distribution to $X, Y$, ... respectively.
5.3.2 Proof using CMT
How does this help prove Slutsky's Theorem? We know by the definitions in Slutsky's Theorem that $X_n \xrightarrow{d} X$ and, by a similar logic, we know that $A_n \xrightarrow{d} a$ (since $A_n \xrightarrow{p} a$, and converging in probability entails converging in distribution). So we can note that the joint vector $(X_n, A_n) \xrightarrow{d} (X, a)$. By CMT, therefore, $g(X_n, A_n) \xrightarrow{d} g(X, a)$. Hence, any continuous function $g$ will preserve the limits of the respective distributions.
Given this result, it is sufficient to note that both addition and multiplication are continuous functions. Again, I do not show this here, but the continuity of addition and multiplication (both scalar and vector) can be proved mathematically. For an intuitive explanation, think about the diagonal line $y = x$: any multiplication of that line is still a single, uninterrupted line ($y = ax$) assuming $a$ is a constant. Similarly, adding a constant to the function of a line also yields an uninterrupted line (e.g. $y = x + a$).
Hence, CMT guarantees both parts 1 and 2 of the Theorem. $\square$
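A toy numerical illustration of CMT (my own Python sketch, not from the book; $g = \sqrt{\cdot}$ and the Uniform(1, 3) variable are arbitrary assumptions): the sample mean of Uniform(1, 3) draws converges in probability to 2, so the continuous function applied to it converges to $\sqrt{2}$.

```python
import math
import random

random.seed(5)

# xbar converges in probability to E[Uniform(1, 3)] = 2, so for the continuous
# function g = sqrt, g(xbar) converges to g(2) = sqrt(2).
def g(x):
    return math.sqrt(x)

n = 200000
xbar = sum(random.uniform(1, 3) for _ in range(n)) / n

print(round(g(xbar), 4), round(math.sqrt(2), 4))
```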
5.4 Applications
Slutsky's Theorem is a workhorse theorem that allows researchers to make claims about the limiting distributions of multiple random variables. Rather than being used directly in applied settings, it typically underpins the modelling strategies used in applied research. For example, Aronow and Samii (2016) consider the problem of weighting multiple regression when the data sample is unrepresentative of the population of interest. In their proofs, they apply Slutsky's Theorem at two different points to prove that their weighted regression estimates converge in probability on the weighted expectation of individual treatment effects, and subsequently, that the same coefficient converges in probability to the true average treatment effect in the population.
5.4.1 Proving the consistency of sample variance, and the normality of the t-statistic
In the remainder of this chapter, I consider applications of both the Continuous Mapping Theorem and Slutsky's Theorem in fundamental statistical proofs. I first show how CMT can be used to prove the consistency of the sample variance of a random variable, and subsequently how, in combination with Slutsky's Theorem, this helps prove the normality of a t-statistic. These examples are developed from David Hunter's notes on asymptotic theory that accompany his Penn State course in large-sample theory.
5.4.1.1 Consistency of the sample variance estimator
First, let us define the sample variance ($S_n^2$) of a sequence of i.i.d. random variables drawn from a distribution $X$ with $E[X] = \mu$ and $var(X) = \sigma^2$ as:

$$S_n^2 = \frac{1}{n - 1}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2.$$
We can show that the sample variance formula above is a consistent estimator of the true variance $\sigma^2$. That is, as the sequence of i.i.d. random variables $X_1, X_2, ...$ increases in length, the sample variance estimator of that sequence converges in probability to the true variance $\sigma^2$.
We can prove this by redefining $S_n^2$ as follows:
$$S_n^2 = \frac{n}{n - 1}\left[\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2 - (\bar{X}_n - \mu)^2\right],$$
which simplifies algebraically to the conventional definition of $S_n^2$ as first introduced.
From here, we can note using WLLN that $(\bar{X}_n - \mu) \xrightarrow{p} 0$, and hence that $(\bar{X}_n - \mu)^2 \xrightarrow{p} 0$. Note that this term converges in probability to a constant. Moreover, $\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2 \xrightarrow{p} E[(X_i - \mu)^2] = var(X) = \sigma^2$, by definition.
Now let us define an arbitrary continuous function $g(A_n, B_n)$. We know by CMT that if $A_n \xrightarrow{p} A$ and $B_n \xrightarrow{p} B$, then $g(A_n, B_n) \xrightarrow{p} g(A, B)$. And hence, using the implications above, we know for any continuous function $g$ that $g\left(\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2, \; (\bar{X}_n - \mu)\right) \xrightarrow{p} g(\sigma^2, 0)$.
Since subtraction is a continuous function, we therefore know that:
$$\left[\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2 - (\bar{X}_n - \mu)^2\right] \xrightarrow{p} [\sigma^2 - 0] = \sigma^2.$$
Separately, we can intuitively see that $\frac{n}{n-1} \xrightarrow{p} 1$. Hence, by applying CMT again to this converging variable multiplied by the converging limit of the above (since multiplication is a continuous function), we can see that:
$$S_n^2 \xrightarrow{p} 1 \times \sigma^2 = \sigma^2. \quad \square$$
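This consistency result can be checked by simulation. Below is a Python sketch of my own (the true distribution $N(3, 4)$ and sample sizes are arbitrary choices for illustration); Python's `statistics.variance` uses the same $(n - 1)$-denominator estimator $S_n^2$ defined above.

```python
import random
import statistics

random.seed(13)

# statistics.variance uses the (n - 1)-denominator estimator S^2.
est = {}
for n in [50, 50000]:
    draws = [random.gauss(3.0, 2.0) for _ in range(n)]
    est[n] = statistics.variance(draws)

print({k: round(v, 3) for k, v in est.items()})
```

The large-sample estimate sits close to the true variance of 4, while the small-sample estimate can wander further away.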
5.4.1.2 Normality of the t-statistic
Let's define a t-statistic as:

$$t_n = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sqrt{\hat{\sigma}^2}}.$$
By the Central Limit Theorem (CLT, Chapter 4), we know that for a random variable $X$ with mean $\mu$ and variance $\sigma^2$:

$$\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2).$$
We also know from the proof above that if $\hat{\sigma}^2 = S_n^2$ then $\hat{\sigma}^2 \xrightarrow{p} \sigma^2$, a constant. Given this, we can also note that $\frac{1}{\hat{\sigma}^2} \xrightarrow{p} \frac{1}{\sigma^2}$.
Hence, by Slutsky's Theorem:

$$
\begin{aligned}
\sqrt{n}(\bar{X}_n - \mu) \times \frac{1}{\sqrt{\hat{\sigma}^2}} &\xrightarrow{d} N(0, \sigma^2) \times \frac{1}{\sqrt{\sigma^2}} & (5.1) \\
&= \sigma N(0, 1) \times \frac{1}{\sigma} & (5.2) \\
&= N(0, 1) \quad \square & (5.3)
\end{aligned}
$$
One brief aspect of this proof is noteworthy: since Slutsky's Theorem rests on the CMT, the application of Slutsky's Theorem requires that the function of the variables $g$ (in this case multiplication) is continuous and defined over the specified domain. Note that $\frac{1}{0}$ is undefined, and therefore the above proof only holds when we assume $\sigma^2 > 0$. Hence why, in many statistics textbooks and discussions of model asymptotics, authors note that they must assume a positive, non-zero variance.
Chapter 6
Big Op and little op
6.1 Stochastic order notation
'Big Op' (big oh-pee), or in algebraic terms $O_p$, is a shorthand means of characterising the convergence in probability of a set of random variables. It directly builds on the same sorts of convergence ideas that were discussed in Chapters 4 and 5.
Big Op means that some given random variable is stochastically bounded. If we have some random variable $X_n$ and some constant $a_n$ (where $n$ indexes both sets), then

$$X_n = O_p(a_n)$$

is the same as saying that

$$P\left(\left|\frac{X_n}{a_n}\right| > \delta\right) < \epsilon, \; \forall n > N.$$

$\delta$ and $N$ here are just finite numbers, and $\epsilon$ is some arbitrary (small) number. In plain English, $O_p$ means that for a large enough $n$ there is some number ($\delta$) such that the probability that the random variable $\frac{X_n}{a_n}$ is larger than that number is essentially zero. It is 'bounded in probability' (van der Vaart, 1998, Section 2.2).
'Little op' (little oh-pee), or $o_p$, refers to convergence in probability towards zero. $X_n = o_p(1)$ is the same as saying

$$\lim_{n\to\infty} P(|X_n| \ge \epsilon) = 0, \; \forall \epsilon > 0.$$
By definition of the notation, if $X_n = o_p(a_n)$ then

$$\frac{X_n}{a_n} = o_p(1).$$
In turn, we can therefore express $X_n = o_p(a_n)$ as

$$\lim_{n\to\infty} P\left(\left|\frac{X_n}{a_n}\right| \ge \epsilon\right) = 0, \; \forall \epsilon > 0.$$
In other words, $X_n = o_p(a_n)$ if and only if $\frac{X_n}{a_n} \xrightarrow{p} 0$.
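A Python sketch (my own illustration; the Uniform(0, 1) variable and scaling powers are assumptions) makes the orders concrete: the deviation $\bar{X}_n - 0.5$ of a Uniform(0, 1) sample mean is $O_p(1/\sqrt{n})$, so rescaling it by $\sqrt{n}$ keeps its spread roughly constant, while rescaling by the slower-growing $n^{0.25}$ still sends it towards zero.

```python
import random
import statistics

random.seed(19)

def mean_deviation_sd(n, reps=300):
    """Spread (sd) of xbar_n - mu over repeated samples of Uniform(0, 1) draws."""
    devs = []
    for _ in range(reps):
        xbar = sum(random.random() for _ in range(n)) / n
        devs.append(xbar - 0.5)
    return statistics.stdev(devs)

spread = {}
for n in [100, 10000]:
    sd = mean_deviation_sd(n)
    # Scale the raw deviation's spread by n^0.25 and by n^0.5.
    spread[n] = (sd * n ** 0.25, sd * n ** 0.5)

for n, (quarter, half) in spread.items():
    print(n, round(quarter, 3), round(half, 3))
```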
6.1.1 Relationship of big-O and little-o
$O_p$ and $o_p$ may seem quite similar, and that's because they are! Another way to express $X_n = O_p(a_n)$ is

$$\forall \epsilon \;\; \exists N_\epsilon, \delta_\epsilon \;\; s.t. \;\; \forall n > N_\epsilon, \; P\left(\left|\frac{X_n}{a_n}\right| \ge \delta_\epsilon\right) \le \epsilon.$$
This restatement makes it clear that the values of $\delta$ and $N$ are to be found with respect to $\epsilon$. That is, we only have to find one value of $N$ and $\delta$ for each epsilon, and these can differ across $\epsilon$'s.

Using the same notation, $X_n = o_p(a_n)$ can be expressed as

$$\forall \epsilon, \delta \;\; \exists N_{\epsilon,\delta} \;\; s.t. \;\; \forall n > N_{\epsilon,\delta}, \; P\left(\left|\frac{X_n}{a_n}\right| \ge \delta\right) \le \epsilon.$$
$o_p$ is therefore a more general statement, ranging over all values of $\epsilon$ and $\delta$, and hence any combination of those two values. In other words, for any given pair of values for $\epsilon$ and $\delta$ there must be some $N$ that satisfies the above inequality (assuming $X_n = o_p(a_n)$). Note also, therefore, that $o_p(a_n)$ entails $O_p(a_n)$, but that the inverse is not true. If for all $\epsilon$ and $\delta$ there is some $N_{\epsilon,\delta}$ that satisfies the inequality, then it must be the case that for all $\epsilon$ there exists some $\delta$ such that the inequality also holds. But just because for some $\delta_\epsilon$ the inequality holds, this does not mean that it will hold for all $\delta$.
6.2 Notational shorthand and โarithmeticโ properties
Expressions like $X_n = o_p\left(\frac{1}{\sqrt{n}}\right)$ do not contain literal identities. Big and little o are merely shorthand ways of expressing how some random variable converges (either to a bound or zero). Suppose for instance that we know $X_n = o_p\left(\frac{1}{n}\right)$. We also therefore know that $X_n = o_p\left(\frac{1}{n^{0.5}}\right)$. Analogously, think about an object accelerating at a rate of at least $10\,ms^{-2}$: that object is also accelerating at a rate of at least $5\,ms^{-2}$. But it's not the case that $o_p\left(\frac{1}{n}\right) = o_p\left(\frac{1}{\sqrt{n}}\right)$: a car accelerating at least as fast as $5\,ms^{-2}$ is not necessarily accelerating at least as fast as $10\,ms^{-2}$.
Hence, when we use stochastic order notation we should be careful to think of it as implying something, rather than making the claim that some random variable or expression involving random variables equals some stochastic order.
That being said, we can note some simple implications of combining $O_p$ and/or $o_p$ terms, including:
- $o_p(1) + o_p(1) = o_p(1)$: this is straightforward: two terms that both converge to zero at the same rate collectively converge to zero at that rate. Note this is actually just an application of the Continuous Mapping Theorem, since if $X_n = o_p(1)$ and $Y_n = o_p(1)$, then $X_n \xrightarrow{p} 0$ and $Y_n \xrightarrow{p} 0$; the addition of these two terms is a continuous mapping function, and therefore $X_n + Y_n \xrightarrow{p} 0$, $\therefore X_n + Y_n = o_p(1)$.
- $O_p(1) + o_p(1) = O_p(1)$: a term that is bounded in probability ($O_p(1)$) plus a term converging in probability to zero is bounded in probability.
- $O_p(1)o_p(1) = o_p(1)$: a term that is bounded in probability multiplied by a term that converges in probability to zero itself converges to zero.
- $o_p(R) = R \times o_p(1)$: again this is easy to see, since if we suppose $X_n = o_p(R)$, then $X_n/R = o_p(1)$, and so $X_n = R \, o_p(1)$.
Further rules, and intuitive explanations for their validity, can be found in Section 2.2 of van der Vaart (1998). The last rule above, however, is worth dwelling on briefly since it makes clear why we use different rate terms ($R$) in the little-o operator. Consider two rates $R^{(1)} = \frac{1}{\sqrt{n}}$, $R^{(2)} = \frac{1}{\sqrt[3]{n}}$, and some random variable $X_n \xrightarrow{p} 0$, that is $X_n = o_p(1)$. Given the final rule (and remembering the equals signs should not be read literally), if $X_n^{(1)} = o_p(R^{(1)})$, then

$$X_n^{(1)} = \frac{1}{\sqrt{n}} \times X_n,$$

and if $X_n^{(2)} = o_p(R^{(2)})$, then

$$X_n^{(2)} = \frac{1}{\sqrt[3]{n}} \times X_n.$$

For each value of $X_n$, as $n$ approaches infinity, $X_n^{(1)}$ is smaller than $X_n^{(2)}$. In other words, $X_n^{(2)}$ will converge in probability towards zero more slowly. This implication of the notation, again, underlines that the rate term inside the operator carries real information about the speed of convergence.
6.3 Why is this useful?¹
A simple (trivial) example of this notation is to consider a sequence of random variables $X_n$ with known $E[X_n] = \mu$. We can therefore decompose $\bar{X}_n = \mu + o_p(1)$, since we know by the Weak Law of Large Numbers that $\bar{X}_n \xrightarrow{p} \mu$. This is useful because, without having to introduce explicit limits into our equations, we know that with a sufficiently large $n$, the second term of our decomposition converges to zero, and therefore we can (in a hand-wavey fashion) ignore it.
Let's consider a more meaningful example. Suppose now that $X_n \sim N(0, n)$. Using known features of normal distributions, we can rearrange this to

$$\frac{X_n}{\sqrt{n}} \sim N(0, 1).$$

There exists some $\delta$ such that the probability that a value from $N(0, 1)$ exceeds $\delta$ is less than $\epsilon > 0$, and therefore

$$X_n = O_p(\sqrt{n}).$$
$X_n$ is also little-op of $n$, since

$$\frac{X_n}{n} \sim N\left(0, \frac{n}{n^2}\right) \sim N\left(0, \frac{1}{n}\right).$$

And so we just need to prove the righthand side above is $o_p(1)$. To do so, note that:

$$P\left(\left|N\left(0, \frac{1}{n}\right)\right| > \epsilon\right) = P\left(\frac{1}{\sqrt{n}}|N(0, 1)| > \epsilon\right) = P\left(|N(0, 1)| > \sqrt{n}\epsilon\right) \xrightarrow{n\to\infty} 0.$$

The last step follows since $\sqrt{n}\epsilon \to \infty$, and so the probability that the standard normal exceeds this threshold decreases to zero. Hence $X_n = o_p(n)$.

¹The first two examples in this section are adapted from Ashesh Rambachan's Asymptotics Review lecture slides, from Harvard Math Camp, Econometrics 2018.
That is,

$$\lim_{n\to\infty} P\left(\left|N\left(0, \frac{1}{n}\right)\right| \ge \epsilon\right) = 0,$$

for all $\epsilon > 0$, so the ratio $\frac{X_n}{n}$ is $o_p(1)$ and therefore

$$X_n = o_p(n).$$
The big-O, little-o notation captures the complexity of the equation or, equivalently, the rate at which it converges. One way to read $X_n = o_p(a_n)$ is that, for any multiple of $a_n$, $X_n$ converges in probability to zero at the rate determined by $a_n$. So, for example, $o_p(a_n^2)$ converges faster than $o_p(a_n)$, since for some random variable $X_n$, $\frac{X_n}{a_n^2} < \frac{X_n}{a_n}$ when $a_n > 1$.
When we want to work out the asymptotic limits of a more complicated equation, where multiple terms are affected by the number of observations, if we have a term that converges in probability to zero at a faster rate than the others then we can safely ignore that term.
6.4 Worked Example: Consistency of mean estimators
A parameter estimate is "consistent" if it converges in probability to the true parameter as the number of observations increases. More formally, a parameter estimate $\hat{\theta}$ is consistent if

$$P(|\hat{\theta} - \theta| \geq \epsilon) \xrightarrow{n \to \infty} 0,$$

where $\theta$ is the true parameter.
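To make this concrete, here is a quick Monte Carlo sketch (in Python, rather than the R used elsewhere in this book) estimating this exceedance probability for the sample mean of standard normal draws at increasing sample sizes; all values are arbitrary choices for illustration:

```python
import random

random.seed(0)

# Estimate P(|sample mean - mu| >= eps) for draws from a standard normal
# (true mu = 0), at increasing sample sizes. Consistency of the sample
# mean implies this probability should shrink towards zero as n grows.
def exceed_prob(n, eps=0.2, reps=2000):
    count = 0
    for _ in range(reps):
        xbar = sum(random.gauss(0, 1) for _ in range(n)) / n
        if abs(xbar) >= eps:
            count += 1
    return count / reps

probs = [exceed_prob(n) for n in (10, 100, 1000)]
print(probs)  # should be decreasing towards zero
```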
One question we can ask is how fast our consistent parameter estimate converges on the true parameter value. This is an "applied" methods problem to the extent that, as researchers seeking to make an inference about the true parameter, and confronted with potentially many ways of estimating it, we want to choose an efficient estimator, i.e. one that gets to the truth quickest!
Let's suppose we want to estimate the population mean of $X$, i.e. $\mu$. Suppose further we have two potential estimators: the sample mean $\frac{1}{N}\sum_{i=1}^{N} X_i$ and the median $X_{(N+1)/2}$, where $N = 2m + 1$ (we'll assume an odd number of observations for ease of calculation) and $X$ is an ordered sequence from smallest to largest.
We know by the Central Limit Theorem that the sample mean

$$\bar{X}_N \sim \mathcal{N}\left(\mu, \frac{\sigma^2}{N}\right),$$

and note that I use $\mathcal{N}$ to denote the normal distribution function, to avoid confusion with the total number of observations $N$.
Withholding the proof, the large-sample distribution of the median estimator can be expressed approximately as

$$\text{Med}(X_1, X_2, ..., X_N) \sim \mathcal{N}\left(\mu, \frac{\pi\sigma^2}{2N}\right).$$

(See this Wolfram MathWorld post for more information about the exact CLT distribution of sample medians.)
How do these estimators perform in practice? Let's first check this via Monte Carlo, by simulating draws of a standard normal distribution with various sizes of N and plotting the resulting distribution of the two estimators:

```r
library(tidyverse)
library(ccaPP) # This pkg includes a fast algorithm for the median

# Compute sample mean and median 1000 times, using N draws from std. normal
rep_sample <- function(N) {
  sample_means <- c()
  sample_medians <- c()
  for (s in 1:1000) {
    sample <- rnorm(N)
    sample_means[s] <- mean(sample)
    sample_medians[s] <- fastMedian(sample)
  }
  return(data.frame(N = N, Mean = sample_means, Median = sample_medians))
}

set.seed(89)
Ns <- c(5, seq(50, 250, by = 50)) # A series of sample sizes

# Apply function and collect results, then pivot dataset to make plotting easier
sim_results <- do.call("rbind", lapply(Ns, FUN = function(x) rep_sample(x))) %>%
  pivot_longer(-N, names_to = "Estimator", values_to = "estimate")

ggplot(sim_results, aes(x = estimate, color = Estimator, fill = Estimator)) +
  facet_wrap(~N, ncol = 2, scales = "free_y", labeller = "label_both") +
  geom_density(alpha = 0.5) +
  labs(x = "Value", y = "Density") +
  theme(legend.position = "bottom")
```
[Figure 6.1 appears here: density plots of the Mean and Median estimators, faceted by sample size N (5, 50, 100, 150, 200, 250), with Value on the x-axis and Density on the y-axis.]
Here we can see that for both the mean and median sample estimators, the distribution of parameter estimates is normally distributed around the true mean ($\mu = 0$). The variance of the sample mean distribution, however, shrinks faster than that of the sample median estimator. In other words, the sample mean is more "efficient" (in fact, it is the most efficient estimator). Efficiency here captures what we noted mathematically above: that the rate of convergence on the true parameter (i.e. the rate at which the estimation error converges on zero) is faster for the sample mean than the median.
Note that both estimators are therefore unbiased (they are centred on $\mu$), normally distributed, and consistent (the sampling distributions shrink towards the true parameter as N increases), but that their variances shrink at slightly different rates.
We can quantify this using little-o notation and the large-sample behaviour of these estimators. First, we can define the estimation errors of the mean and median respectively as

$$e_{\text{Mean}} = \bar{X} - \mu = \mathcal{N}\left(\mu, \frac{\sigma^2}{N}\right) - \mathcal{N}(\mu, 0) = \mathcal{N}\left(0, \frac{\sigma^2}{N}\right).$$

Similarly,

$$e_{\text{Med.}} = \mathcal{N}\left(\mu, \frac{\pi\sigma^2}{2N}\right) - \mathcal{N}(\mu, 0) = \mathcal{N}\left(0, \frac{\pi\sigma^2}{2N}\right).$$
With both the mean and median expressions, we can see that the error of the estimators is centered around zero (i.e. they are unbiased), and that the dispersion of the error around zero decreases as $N$ increases. Given earlier discussions in this chapter, we can rearrange both to find their rates of convergence.
For the sample mean:

$$e_{\text{Mean}} = \frac{1}{\sqrt{N}}\mathcal{N}(0, \sigma^2)$$

$$\frac{e_{\text{Mean}}}{N^{-0.5}} = \mathcal{N}(0, \sigma^2).$$

We know that for a normal distribution there will be some $M_\epsilon, \epsilon$ such that $P(|\mathcal{N}(0, \sigma^2)| \geq M_\epsilon) < \epsilon$, and hence:

$$e_{\text{Mean}} = O_p\left(\frac{1}{\sqrt{N}}\right).$$
Similarly, for the sample median:

$$e_{\text{Med.}} = \mathcal{N}\left(0, \frac{\pi\sigma^2}{2N}\right) = \left(\frac{\pi}{2N}\right)^{0.5}\mathcal{N}(0, \sigma^2)$$

$$e_{\text{Med.}} \Big/ \left(\frac{\pi}{2N}\right)^{0.5} = \mathcal{N}(0, \sigma^2)$$

$$e_{\text{Med.}} = O_p\left(\left[\frac{\pi}{2N}\right]^{0.5}\right) = O_p\left(\frac{\sqrt{\pi}}{\sqrt{2N}}\right).$$
Now we can see that the big-op of the sample median's estimation error is "slower" (read: larger) than the big-op of the sample mean, meaning that the sample mean converges on the true parameter with fewer observations than the sample median.
Another, easy way to see the intuition behind this point is to note that at intermediary steps in the above rearrangements:

$$e_{\text{Mean}} = \frac{1}{\sqrt{N}}\mathcal{N}(0, \sigma^2)$$

$$e_{\text{Med.}} = \frac{\sqrt{\pi}}{\sqrt{2N}}\mathcal{N}(0, \sigma^2),$$
and so, for any sized sample, the estimating error of the median is larger than that of the mean. To visualise this, we can plot the estimation error as a function of $N$ using the rates derived above:

```r
N <- seq(0.01, 100, by = 0.01)
mean_convergence <- 1/sqrt(N)
median_convergence <- sqrt(pi)/sqrt(2*N)

plot_df <- data.frame(N, Mean = mean_convergence, Median = median_convergence) %>%
  pivot_longer(-N, names_to = "Estimator", values_to = "Rate")

ggplot(plot_df, aes(x = N, y = Rate, color = Estimator)) +
  geom_line() +
  ylim(0, 1) +
  theme(legend.position = "bottom")
```
Figure 6.1: Simulated distribution of sample mean and median estimators for different sized samples.

[Plot appears here: the convergence rates of the two estimators against N, with Rate on the y-axis (0 to 1) and N on the x-axis (0 to 100), legend Estimator: Mean, Median.]
Note that the median rate line is always above the mean line for all $N$ (though not by much); it therefore has a slower convergence.
Chapter 7
Delta Method
7.1 Delta Method in Plain English
The Delta Method (DM) states that we can approximate the asymptotic behaviour of functions of a random variable, if the random variable is itself asymptotically normal. In practice, this theorem tells us that even if we do not know the expected value and variance of the function $g(X)$ we can still approximate them reasonably. Note that by the Central Limit Theorem we know that several important random variables and estimators are asymptotically normal, including the sample mean. We can therefore approximate the mean and variance of some transformation of the sample mean using its variance.
More specifically, suppose that we have some sequence of random variables $X_n$, such that as $n \to \infty$,

$$X_n \sim N\left(\mu, \frac{\sigma^2}{n}\right).$$

We can rearrange this statement to capture that the difference between the random variable and some constant $\mu$ converges to a normal distribution around zero, with a variance determined by the number of observations:

$$(X_n - \mu) \sim N\left(0, \frac{\sigma^2}{n}\right).$$
Further rearrangement yields

$$(X_n - \mu) \sim \frac{\sigma}{\sqrt{n}}N(0, 1)$$

$$\frac{\sqrt{n}(X_n - \mu)}{\sigma} \sim N(0, 1),$$

by first moving the finite variance and $n$ terms outside of the normal distribution, and then dividing through.
Given this, if $g$ is some smooth function (i.e. there are no discontinuous jumps in values) then the Delta Method states that:

$$\frac{\sqrt{n}\left(g(X_n) - g(\mu)\right)}{|g'(\mu)|\sigma} \to N(0, 1),$$

where $g'$ is the first derivative of $g$. (There are clear parallels here to how we expressed estimator consistency.) Rearranging again, we can see that
$$g(X_n) \sim N\left(g(\mu), \frac{g'(\mu)^2\sigma^2}{n}\right).$$
Note that the statement above is an approximation because

$$g(X_n) = g(\mu) + g'(\mu)(X_n - \mu) + g''(\mu)\frac{(X_n - \mu)^2}{2!} + ...,$$

i.e. an infinite sum. The Delta Method avoids the infinite regress by ignoring higher order terms (Liu, 2012). I return to this point below in the proof.
DM also generalizes to multidimensional functions, where instead of converging on the standard normal the random variable must converge in distribution to a multivariate normal, and the derivative of $g$ is replaced with the gradient of $g$ (the vector of all its partial derivatives):

$$\nabla g = \begin{bmatrix} \frac{\partial g}{\partial x_1} \\ \frac{\partial g}{\partial x_2} \\ \vdots \\ \frac{\partial g}{\partial x_n} \end{bmatrix}$$

For the sake of simplicity I do not prove this result here, and instead focus on the univariate case.
7.2 Proof
Before offering a full proof, we need to know a little bit about Taylor Series and Taylor's Theorem. I briefly outline these concepts here, then show how this expansion helps to prove DM.
7.2.1 Taylor's Series and Theorem
Suppose we have some continuous function $f$ that is infinitely differentiable. By that, we mean some function that is continuous over a domain, and for which there is always some further derivative of the function. Consider the case $f(x) = e^{2x}$:

$$f'(x) = 2e^{2x}$$
$$f''(x) = 4e^{2x}$$
$$f'''(x) = 8e^{2x}$$
$$f''''(x) = 16e^{2x}$$
$$...$$

For any integer $k$, the $k$th derivative of $f(x)$ is defined. An interesting non-infinitely differentiable function would be $f(x) = |x|$ where $-\infty < x < \infty$. Here note that when $x > 0$, the first-order derivative is 1 (the function is equivalent to $x$), and similarly at $x < 0$ the first-order derivative is -1 (the function is equivalent to $-x$). When $x = 0$, however, the first derivative is undefined: the first derivative jumps discontinuously.
The Taylor Series for an infinitely differentiable function at a given point $x = a$ is an expansion of that function in terms of an infinite sum:

$$f(x) = f(a) + f'(a)(x - a) + \frac{f''(a)}{2!}(x - a)^2 + \frac{f'''(a)}{3!}(x - a)^3 + ...$$
Taylor Series are useful because they allow us to approximate a function at a lower polynomial order, using Taylor's Theorem. This Theorem loosely states that, for a given point $x = a$, we can approximate a continuous and $k$-times differentiable function to the $k$th order using the Taylor Series up to the $k$th derivative. In other words, if we have some continuous differentiable function $f(x)$, its first-order approximation (i.e. its linear approximation) at point $a$ is defined as

$$f(a) + f'(a)(x - a).$$
To make this more concrete, consider the function $f(x) = e^x$. The Taylor Series expansion of $f$ at point $x = 0$ is

$$f(x) = f(0) + f'(0)(x - 0) + \frac{f''(0)}{2!}(x - 0)^2 + \frac{f'''(0)}{3!}(x - 0)^3 + ...$$

So, up to the first order, Taylor's Theorem states that

$$f(x) \approx f(0) + f'(0)(x - 0) = 1 + x,$$

which is the line tangent to $e^x$ at $x = 0$. If we considered up to the second order (the quadratic approximation) our fit would be better, and even more so if we included the third, fourth, and fifth orders and so on, up until the $\infty$th order, at which point the Taylor approximation is the function precisely.
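As a quick numerical illustration (a Python sketch, not from the original text), the error of the first-order approximation $1 + x$ shrinks as $x$ approaches the expansion point:

```python
import math

# Error of the first-order Taylor approximation of e^x at a = 0,
# i.e. f(x) ~ 1 + x; the error shrinks as x approaches 0.
errors = []
for x in (1.0, 0.5, 0.1, 0.01):
    errors.append(math.exp(x) - (1 + x))
    print(x, errors[-1])
```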
7.2.2 Proof of Delta Method
Given Taylor's Theorem, we know that so long as $g$ is continuous and differentiable up to the $k$th derivative, where $k \geq 2$, then at the point $\mu$:

$$g(X_n) \approx g(\mu) + g'(\mu)(X_n - \mu).$$
Subtracting $g(\mu)$ from both sides, we have:

$$g(X_n) - g(\mu) \approx g'(\mu)(X_n - \mu).$$
We know by the CLT and our assumptions regarding $X_n$ that $(X_n - \mu) \xrightarrow{d} N(0, \frac{\sigma^2}{n})$. Therefore we can rewrite the above as:

$$g(X_n) - g(\mu) \approx g'(\mu)N\left(0, \frac{\sigma^2}{n}\right).$$
Hence, by the properties of normal distributions (multiplying by a constant, adding a constant):

$$g(X_n) \sim N\left(g(\mu), \frac{g'(\mu)^2\sigma^2}{n}\right). \qquad \blacksquare$$
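As a sanity check on the result above, here is a small Monte Carlo sketch (in Python rather than the R used elsewhere in this book) comparing the Delta Method variance for $g(\bar{X}) = e^{\bar{X}}$ against a simulated variance; the parameter values are arbitrary:

```python
import math
import random

random.seed(42)

# Delta Method check for g(x) = exp(x): if Xbar ~ N(mu, sigma^2 / n),
# then g(Xbar) is approximately N(g(mu), g'(mu)^2 * sigma^2 / n),
# where g'(mu) = exp(mu).
mu, sigma, n, reps = 1.0, 2.0, 200, 5000

g_vals = []
for _ in range(reps):
    xbar = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    g_vals.append(math.exp(xbar))

g_mean = sum(g_vals) / reps
mc_var = sum((v - g_mean) ** 2 for v in g_vals) / (reps - 1)  # simulated variance
dm_var = (math.exp(mu) ** 2) * sigma ** 2 / n                 # Delta Method variance

print(mc_var, dm_var)  # the two should be close
```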
7.3 Applied example
Bowler et al. (2006) use the DM to provide confidence intervals for predicted probabilities generated from a logistic regression. Their study involves surveying politicians' attitudes toward electoral rule changes. They estimate a logistic model of the support for change on various features of the politicians, including whether they won under existing electoral rules or not. To understand how winning under existing rules affects attitudes, they then generate the predicted probabilities for losers and winners separately.
Generating predicted probabilities from a logistic regression involves a non-linear transformation of an asymptotically normal parameter (the logistic coefficient), and therefore we must take account of this transformation when estimating the variance of the predicted probability.
To generate the predicted probability we use the equation

$$\hat{p} = \frac{e^{\hat{\alpha} + \hat{\beta}_1 x_1 + ... + \hat{\beta}_k x_k}}{1 + e^{\hat{\alpha} + \hat{\beta}_1 x_1 + ... + \hat{\beta}_k x_k}},$$

where $\hat{p}$ is the predicted probability. Estimating the variance around the predicted probability is therefore quite difficult: it involves multiple estimators and non-linear transformations. But we do know that, assuming i.i.d. observations and correct functional form, the estimating error of the logistic equation is asymptotically multivariate normal on the origin. And so the authors can use DM to calculate 95 percent confidence intervals. In general, the delta method is a useful way of estimating standard errors and confidence intervals when using (but not limited to) logistic regression and other models involving non-linear transformations of model parameters.
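To see the mechanics, here is a hedged sketch of the delta-method standard error for a predicted probability, using the gradient $\partial\hat{p}/\partial\beta_j = \hat{p}(1 - \hat{p})x_j$. It is written in Python, and the coefficients, covariance matrix, and covariate profile are made-up illustrative numbers, not values from Bowler et al. (2006):

```python
import math

# Hypothetical logistic regression estimates (intercept plus one covariate)
beta = [0.5, -1.2]            # [alpha, beta1] -- illustrative only
vcov = [[0.04, -0.01],        # Var(alpha),        Cov(alpha, beta1)
        [-0.01, 0.09]]        # Cov(beta1, alpha), Var(beta1)
x = [1.0, 1.0]                # covariate profile at which to predict

xb = sum(b * xi for b, xi in zip(beta, x))
p = math.exp(xb) / (1 + math.exp(xb))  # predicted probability

# Gradient of p with respect to the coefficients: dp/dbeta_j = p(1 - p)x_j
grad = [p * (1 - p) * xi for xi in x]

# Delta method: Var(p_hat) ~= grad' V grad
var_p = sum(grad[i] * vcov[i][j] * grad[j]
            for i in range(2) for j in range(2))
se_p = math.sqrt(var_p)

ci = (p - 1.96 * se_p, p + 1.96 * se_p)  # approximate 95% confidence interval
print(p, se_p, ci)
```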
7.4 Alternative strategies
The appeal of the delta method is that it gives an analytic approximation of a function's distribution, using the asymptotic properties of some more basic (model) parameter. But there are alternative methods of approximating these distributions (and thus standard errors) that do not rely on deriving the order conditions of that function.
One obvious alternative is the bootstrap. For a given transformation of a random variable, calculate the output of the function $B$ times, using samples of the same size as the original sample but drawn with replacement, and take either the standard deviation or the $q$ and $1 - q$ percentiles of the resultant parameter distribution. This method does not require the user to calculate the derivative of a function. It is a non-parametric alternative that simply approximates the distribution itself, rather than approximating the parameters of a parametric distribution.
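The procedure just described can be sketched in a few lines (Python here, rather than the book's R; the data and the transformation $g(\bar{X}) = e^{\bar{X}}$ are arbitrary choices for illustration):

```python
import math
import random

random.seed(1)

# A sketch of the bootstrap: approximate the standard error and percentile
# interval of g(sample mean) = exp(mean) without deriving anything.
data = [random.gauss(1.0, 2.0) for _ in range(200)]  # illustrative sample
B = 2000  # number of bootstrap resamples

boot_stats = []
for _ in range(B):
    # Resample with replacement, same size as the original sample
    resample = [random.choice(data) for _ in data]
    boot_stats.append(math.exp(sum(resample) / len(resample)))

mean_b = sum(boot_stats) / B
boot_se = math.sqrt(sum((s - mean_b) ** 2 for s in boot_stats) / (B - 1))

# Percentile interval: the q and 1 - q percentiles of the bootstrap distribution
sorted_stats = sorted(boot_stats)
ci = (sorted_stats[int(0.025 * B)], sorted_stats[int(0.975 * B)])
print(boot_se, ci)
```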
The bootstrap is computationally more intensive (requiring $B$ separate samples and calculations, etc.) but, on the other hand, is less technical to calculate. Moreover, the Delta Method's approximation is limited analytically by the number of terms considered in the Taylor Series expansion. While the first-order Taylor Theorem may be reasonable, it may be imprecise. To improve the precision one has to find the second, third, fourth, etc. order terms (which may be analytically difficult). With bootstrapping, however, you can improve precision simply by taking more samples (increasing $B$) (King et al., 2000).
Given the ease with which we can acquire and deploy computational resources now, perhaps the delta method is no longer as useful in applied research. But the proof and asymptotic implications remain statistically interesting and worth knowing.
Chapter 8
Frisch-Waugh-Lovell Theorem
8.1 Theorem in plain English
The Frisch-Waugh-Lovell Theorem (FWL; after the initial proof by Frisch and Waugh (1933), and later generalisation by Lovell (1963)) states that:

Any predictor's regression coefficient in a multivariate model is equivalent to the regression coefficient estimated from a bivariate model in which the residualised outcome is regressed on the residualised component of the predictor; where the residuals are taken from models regressing the outcome and the predictor on all other predictors in the multivariate regression (separately).
More formally, assume we have a multivariate regression model with $k$ predictors:

$$y = \beta_1 x_1 + ... + \beta_k x_k + \epsilon. \qquad (8.1)$$

FWL states that every $\beta_j$ in Equation 8.1 is equal to $\beta_j^*$, and the residual $\epsilon = \epsilon^*$, in:

$$r_y = \beta_j^* r_{x_j} + \epsilon^*, \qquad (8.2)$$
where:

$$r_y = y - \sum_{k \neq j} \beta_k^y x_k, \qquad r_{x_j} = x_j - \sum_{k \neq j} \beta_k^{x_j} x_k, \qquad (8.3)$$

and where $\beta_k^y$ and $\beta_k^{x_j}$ are the regression coefficients from two separate regression models, of the outcome (omitting $x_j$) and of $x_j$ respectively.
In other words, FWL states that each predictor's coefficient in a multivariate regression explains that variance of $y$ not explained by both the other $k-1$ predictors' relationship with the outcome and their relationship with that predictor, i.e. the independent effect of $x_j$.
8.2 Proof
8.2.1 Primer: Projection matrices

We need two important types of projection matrices to understand the linear algebra proof of FWL. First, the prediction matrix that was introduced in Chapter 4:

$$P = X(X'X)^{-1}X'. \qquad (8.4)$$
Recall that this matrix, when applied to an outcome vector ($y$), produces a set of predicted values ($\hat{y}$). Reverse engineering this, note that $\hat{y} = X\hat{\beta} = X(X'X)^{-1}X'y = Py$.
Since $Py$ produces the predicted values from a regression on $X$, we can define its complement, the residual maker:

$$M = I - X(X'X)^{-1}X', \qquad (8.5)$$

since $My = y - X(X'X)^{-1}X'y \equiv y - Py \equiv y - X\hat{\beta} \equiv \hat{\epsilon}$, the residuals from regressing $y$ on $X$.
Given these definitions, note that M and P are complementary:

$$\begin{aligned}
y &= \hat{y} + \hat{\epsilon} \equiv Py + My \\
Iy &= Py + My \\
Iy &= (P + M)y \\
I &= P + M.
\end{aligned} \qquad (8.6)$$
With these projection matrices, we can express the FWL claim (which we need to prove) as:

$$\begin{aligned}
y &= X_1\hat{\beta}_1 + X_2\hat{\beta}_2 + \hat{\epsilon} \\
M_1 y &= M_1 X_2\hat{\beta}_2 + \hat{\epsilon}.
\end{aligned} \qquad (8.7)$$
8.2.2 FWL Proof
Let us assume, as in Equation 8.7, that:

$$y = X_1\hat{\beta}_1 + X_2\hat{\beta}_2 + \hat{\epsilon}. \qquad (8.8)$$
First, we can multiply both sides by the residual maker of $X_1$:

$$M_1 y = M_1 X_1\hat{\beta}_1 + M_1 X_2\hat{\beta}_2 + M_1\hat{\epsilon}, \qquad (8.9)$$
(Sources: the projection-matrix primer is based on lecture notes from the University of Oslo's "Econometrics - Modelling and Systems Estimation" course, author attribution unclear, and on Davidson et al. (2004); this proof is adapted from York University, Canada's wiki for statistical consulting.)

which first simplifies to:
$$M_1 y = M_1 X_2\hat{\beta}_2 + M_1\hat{\epsilon}, \qquad (8.10)$$
because $M_1X_1\hat{\beta}_1 \equiv (M_1X_1)\hat{\beta}_1 \equiv 0\hat{\beta}_1 = 0$. In plain English: by definition, all the variance in $X_1$ is explained by $X_1$, and therefore a regression of $X_1$ on itself leaves no part unexplained, so $M_1X_1$ is zero.
Second, we can simplify this equation further because, by the properties of OLS regression, $X_1$ and $\hat{\epsilon}$ are orthogonal. Therefore the residual of the residuals are the residuals! Hence:

$$M_1 y = M_1 X_2\hat{\beta}_2 + \hat{\epsilon}. \qquad \blacksquare$$
8.2.3 Interesting features/extensions
A couple of interesting features come out of the linear algebra proof:
• FWL also holds for bivariate regression when you first residualise Y and X on an $n \times 1$ vector of 1's (i.e. the constant), which is like demeaning the outcome and predictor before regressing the two.
• $X_1$ and $X_2$ can technically be sets of mutually exclusive predictors, i.e. $X_1$ is an $n \times k$ matrix $\{x_1, ..., x_k\}$ and $X_2$ is an $n \times m$ matrix $\{x_{k+1}, ..., x_{k+m}\}$, where $\beta_1$ is a corresponding vector of regression coefficients $\beta_1 = \{\gamma_1, ..., \gamma_k\}$, and likewise $\beta_2 = \{\delta_1, ..., \delta_m\}$, such that:

$$y = X_1\beta_1 + X_2\beta_2 = x_1\gamma_1 + ... + x_k\gamma_k + x_{k+1}\delta_1 + ... + x_{k+m}\delta_m.$$
Hence the FWL theorem is exceptionally general, applying not only to arbitrarily long coefficient vectors, but also enabling you to back out estimates from any partitioning of the full regression model.
8.3 Coded example
```r
set.seed(89)

## Generate random data
df <- data.frame(y = rnorm(1000, 2, 1.5),
                 x1 = rnorm(1000, 1, 0.3),
                 x2 = rnorm(1000, 1, 4))

## Partial regressions

# Residual of y regressed on x1
y_res <- lm(y ~ x1, df)$residuals

# Residual of x2 regressed on x1
x_res <- lm(x2 ~ x1, df)$residuals

resids <- data.frame(y_res, x_res)

## Compare the beta values for x2
```
(Algebraically, $M_1X_1 = (I - X_1(X_1'X_1)^{-1}X_1')X_1 = X_1 - X_1(X_1'X_1)^{-1}X_1'X_1 = X_1 - X_1I = X_1 - X_1 = 0$.)
```r
# Multivariate regression:
summary(lm(y ~ x1 + x2, df))
```

```
## Call:
## lm(formula = y ~ x1 + x2, data = df)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -4.451 -1.001 -0.039  1.072  5.320
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2.33629    0.16427  14.222   <2e-16 ***
## x1          -0.31093    0.15933  -1.952   0.0513 .
## x2           0.02023    0.01270   1.593   0.1116
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.535 on 997 degrees of freedom
## Multiple R-squared: 0.006252, Adjusted R-squared: 0.004258
## F-statistic: 3.136 on 2 and 997 DF, p-value: 0.04388
```

```r
# Partial regression
summary(lm(y_res ~ x_res, resids))
```

```
## Call:
## lm(formula = y_res ~ x_res, data = resids)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -4.451 -1.001 -0.039  1.072  5.320
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.228e-17  4.850e-02   0.000    1.000
## x_res       2.023e-02  1.270e-02   1.593    0.111
##
## Residual standard error: 1.534 on 998 degrees of freedom
## Multiple R-squared: 0.002538, Adjusted R-squared: 0.001538
## F-statistic: 2.539 on 1 and 998 DF, p-value: 0.1114
```
Note: this isn't an exact demonstration because there is a degrees-of-freedom error. The (correct) multivariate regression degrees of freedom is calculated as $n - 3$, since there are three variables. In the partial regression the degrees of freedom is $n - 2$. This latter calculation does not take into account the additional loss of freedom as a result of partialling out $X_1$.
8.4 Application: Sensitivity analysis
Cinelli and Hazlett (2020) develop a series of tools for researchers to conduct sensitivity analyses on regression models, using an extension of the omitted variable bias framework. To do so, they use FWL to motivate this bias. Suppose that the full regression model is specified as:
$$Y = \hat{\tau}D + X\hat{\beta} + \hat{\gamma}Z + \hat{\epsilon}_{\text{full}}, \qquad (8.11)$$

where $\hat{\tau}$, $\hat{\beta}$, and $\hat{\gamma}$ are estimated regression coefficients, $D$ is the treatment variable, $X$ are observed covariates, and $Z$ are unobserved covariates. Since $Z$ is unobserved, researchers instead estimate:

$$Y = \hat{\tau}_{\text{Obs.}}D + X\hat{\beta}_{\text{Obs.}} + \hat{\epsilon}_{\text{Obs.}} \qquad (8.12)$$
By FWL, we know that $\hat{\tau}_{\text{Obs.}}$ is equivalent to the coefficient from regressing the residualised outcome (with respect to X) on the residualised component of D (again with respect to X). Call these two residuals $Y^r$ and $D^r$.
And recall that the regression model for the final stage of the partial regressions is bivariate ($Y^r \sim D^r$). Conveniently, a bivariate regression coefficient can be expressed in terms of the covariance between the left-hand and right-hand side variables:

$$\hat{\tau}_{\text{Obs.}} = \frac{cov(D^r, Y^r)}{var(D^r)}. \qquad (8.13)$$
Note that, given the full regression model in Equation 8.11, the partial outcome $Y^r$ is actually composed of the elements $\hat{\tau}D^r + \hat{\gamma}Z^r$, and so:

$$\hat{\tau}_{\text{Obs.}} = \frac{cov(D^r, \hat{\tau}D^r + \hat{\gamma}Z^r)}{var(D^r)}. \qquad (8.14)$$
Next, we can expand the covariance using the expectation rule that $cov(A, B + C) = cov(A, B) + cov(A, C)$, and since $\hat{\tau}$ and $\hat{\gamma}$ are scalar, we can move them outside the covariance functions:

$$\hat{\tau}_{\text{Obs.}} = \frac{\hat{\tau}cov(D^r, D^r) + \hat{\gamma}cov(D^r, Z^r)}{var(D^r)}. \qquad (8.15)$$
Since $cov(A, A) = var(A)$, it follows that:

$$\hat{\tau}_{\text{Obs.}} = \frac{\hat{\tau}var(D^r) + \hat{\gamma}cov(D^r, Z^r)}{var(D^r)} \equiv \hat{\tau} + \hat{\gamma}\frac{cov(D^r, Z^r)}{var(D^r)} \equiv \hat{\tau} + \hat{\gamma}\hat{\delta}. \qquad (8.16)$$
Frisch-Waugh is so useful because it simplifies a multivariate equation into a bivariate one. While computationally this makes zero difference (unlike in the days of hand computation), here it allows us to use a convenient expression of the bivariate coefficient to show and quantify the bias when you run a regression in the presence of an unobserved confounder. Moreover, note that in Equation 8.14 we implicitly use FWL again, since we know that the non-stochastic aspects of Y not explained by X are the residualised components of the full Equation 8.11.
8.4.1 Regressing the partialled-out X on the full Y
In Mostly Harmless Econometrics (MHE; Angrist and Pischke (2009)), the authors note that you also get an identical coefficient to the full regression if you regress the residualised predictor on the non-residualised $Y$. We can use the OVB framework above to explain this case.
(As a reminder of the bivariate result used above: if $y = \alpha + \beta x + \epsilon$, then by least squares $\hat{\beta} = \frac{cov(x, y)}{var(x)}$ and $\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$.)

Let's take the full regression model as:
$$Y = \beta_1 X_1 + \beta_2 X_2 + \epsilon. \qquad (8.17)$$

MHE states that:

$$Y = \beta_1 M_2 X_1 + \epsilon. \qquad (8.18)$$
Note that this is just FWL, except we have not also residualised $Y$. Our aim is to check whether there is any bias in the estimated coefficient from this second equation. As before, since this is a bivariate regression, we can express the coefficient as:
$$\begin{aligned}
\beta_1 &= \frac{cov(M_2X_1, Y)}{var(M_2X_1)} \\
&= \frac{cov(M_2X_1, \beta_1X_1 + \beta_2X_2)}{var(M_2X_1)} \\
&= \beta_1\frac{cov(M_2X_1, X_1)}{var(M_2X_1)} + \beta_2\frac{cov(M_2X_1, X_2)}{var(M_2X_1)} \\
&= \beta_1 + \beta_2 \times 0 \\
&= \beta_1.
\end{aligned} \qquad (8.19)$$
This follows from two features. First, $cov(M_2X_1, X_1) = var(M_2X_1)$. Second, it is clear that $cov(M_2X_1, X_2) = 0$, because $M_2X_1$ is $X_1$ stripped of any variance associated with $X_2$ and so, by definition, they do not covary. Therefore, we can recover the unbiased regression coefficient using an adapted version of FWL where we do not residualise Y, as stated in MHE.
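We can verify this numerically. The sketch below uses Python with NumPy rather than the R used elsewhere in this book, and the simulated model (coefficients, distributions, seed) is an arbitrary choice for illustration:

```python
import numpy as np

rng = np.random.default_rng(89)

# Numerical check: regressing the non-residualised Y on the residualised
# X1 recovers the same X1 coefficient as the full multivariate regression.
n = 1000
x1 = rng.normal(1, 0.3, n)
x2 = rng.normal(1, 4, n)
y = 2 + 0.5 * x1 - 0.3 * x2 + rng.normal(0, 1.5, n)

X = np.column_stack([np.ones(n), x1, x2])
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]  # full OLS: [const, x1, x2]

# Residualise x1 on x2 (and a constant), leaving y untouched
Z = np.column_stack([np.ones(n), x2])
x1_res = x1 - Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]

# Bivariate coefficient of the full y on the residualised x1
beta_mhe = np.cov(x1_res, y)[0, 1] / np.var(x1_res, ddof=1)

print(beta_full[1], beta_mhe)  # identical up to floating-point error
```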
Chapter 9
Positive Definite Matrices
9.1 Terminology
An $n \times n$ symmetric matrix $M$ is positive definite (PD) if and only if $x'Mx > 0$ for all non-zero $x \in \mathbb{R}^n$. For example, take the $3 \times 3$ identity matrix, and a column vector of non-zero real numbers $[a, b, c]'$:

$$\begin{bmatrix} a & b & c \end{bmatrix}\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}\begin{bmatrix} a \\ b \\ c \end{bmatrix} = \begin{bmatrix} a & b & c \end{bmatrix}\begin{bmatrix} a \\ b \\ c \end{bmatrix} = a^2 + b^2 + c^2.$$
Since by definition $a^2$, $b^2$, and $c^2$ are all greater than zero (even if $a$, $b$, or $c$ are negative), their sum is also positive.

A matrix is positive semi-definite (PSD) if and only if $x'Mx \geq 0$ for all non-zero $x \in \mathbb{R}^n$. Note that PSD differs from PD in that the transformation of the matrix is no longer strictly positive.
One known feature of matrices (that will be useful later in this chapter) is that if a matrix is symmetric and idempotent then it will be positive semi-definite. Take some non-zero vector $x$, and a symmetric, idempotent matrix $A$. By idempotency we know that $x'Ax = x'AAx$. By symmetry we know that $A' = A$, and therefore:

$$x'Ax = x'AAx = x'A'Ax = (Ax)'Ax \geq 0,$$

and hence PSD. (This short proof is taken from an online discussion.)

9.1.1 Positivity

Both PD and PSD are concerned with positivity. For scalar values like -2, 5, and 89, positivity simply refers to their sign, and we can tell immediately whether the numbers are positive or not. Some functions are also (strictly) positive. Think about $f(x) = x^2 + 1$: for all $x \in \mathbb{R}$, $f(x) \geq 1 > 0$. PD and PSD extend this notion of positivity to matrices, which is useful when we consider multidimensional optimisation problems or the combination of matrices.
While for abstract matrices like the identity matrix it is easy to verify PD and PSD properties, for more complicated matrices we often require other, more complicated methods. For example, we know that a symmetric matrix is PSD if and only if all its eigenvalues are non-negative. An eigenvalue $\lambda$ is a scalar such that, for a matrix $A$ and non-zero $n \times 1$ vector $v$, $A \cdot v = \lambda \cdot v$. While I do not explore this further in this chapter, there are methods available for recovering these values from the preceding equation. Further discussion of the full properties of PD and PSD matrices can be found online as well as in print (e.g. Horn and Johnson, 2013, Chapter 7).
9.2 $A - B$ is PSD iff $B^{-1} - A^{-1}$ is PSD
Wooldridge's list of 10 theorems does not actually include a general claim about the importance of P(S)D matrices. Instead, he lists a very specific feature of two PD matrices. In plain English, this theorem states that, assuming $A$ and $B$ are both positive definite, $A - B$ is positive semi-definite if and only if the inverse of $B$ minus the inverse of $A$ is positive semi-definite.
Before we prove this theorem, it's worth noting a few points that are immediately intuitive from its statement. Note that the theorem moves from PD matrices to PSD matrices. This is because we are subtracting one matrix from another. While we know A and B are both PD, if they are both equal then $x'(A - B)x$ will equal zero. For example, if $A = B = I_2 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$, then $A - B = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}$. Hence, $x'(A - B)x = 0$ and therefore $A - B$ is PSD, but not PD.
Also note that this theorem only applies to a certain class of matrices, namely those we know to be PD. This hints at the sort of applied relevance this theorem may have. For instance, we know that variance is a strictly positive quantity.
The actual applied relevance of this theorem is not particularly obvious, at least from the claim alone. In his post, Wooldridge notes that he repeatedly uses this fact "to show the asymptotic efficiency of various estimators." In his Introductory Econometrics textbook (2013), for instance, Wooldridge makes use of the properties of PSD matrices in proving that the Gauss-Markov (GM) assumptions ensure that OLS is the best, linear, unbiased estimator (BLUE). And, more generally, PD and PSD matrices are very helpful in optimisation problems (of relevance to machine learning too). Neither appears to be a direct application of this specific, bidirectional theorem. In the remainder of this chapter, therefore, I prove the theorem itself for completeness. I then broaden the discussion to explore how PSD properties are used in Wooldridge's BLUE proof, as well as discuss the more general role of PD matrices in optimisation problems.
9.2.1 Proof
The proof of Wooldridge's actual claim is straightforward. In fact, given the symmetry of the proof, we only need to prove one direction (i.e. if $A - B$ is PSD, then $B^{-1} - A^{-1}$ is PSD).
Let's assume, therefore, that $A - B$ is PSD. Hence:

$$\begin{aligned}
x'(A - B)x &\geq 0 \\
x'Ax - x'Bx &\geq 0 \\
x'Ax &\geq x'Bx \\
Ax &\geq Bx \\
A &\geq B.
\end{aligned}$$
Next, we can invert our two matrices while maintaining the inequality:

$$\begin{aligned}
A^{-1}AB^{-1} &\geq A^{-1}BB^{-1} \\
IB^{-1} &\geq A^{-1}I \\
B^{-1} &\geq A^{-1}.
\end{aligned}$$
Finally, we can just re-multiply both sides of the inequality by our arbitrary non-zero vector $x$:

$$\begin{aligned}
x'B^{-1} &\geq x'A^{-1} \\
x'B^{-1}x &\geq x'A^{-1}x \\
x'B^{-1}x - x'A^{-1}x &\geq 0 \\
x'(B^{-1} - A^{-1})x &\geq 0.
\end{aligned}$$
Proving the opposite direction (if $B^{-1} - A^{-1}$ is PSD then $A - B$ is PSD) simply involves replacing $A$ with $B^{-1}$ and $B$ with $A^{-1}$, and vice versa, throughout the proof, since $(A^{-1})^{-1} = A$. $\blacksquare$
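As a quick numerical illustration of the theorem (a Python/NumPy sketch with arbitrarily chosen PD matrices, not part of the original text), we can check positive semi-definiteness via eigenvalues:

```python
import numpy as np

# Check the theorem numerically with arbitrary PD matrices A and B:
# A - B is PSD exactly when B^-1 - A^-1 is PSD.
def is_psd(M, tol=1e-10):
    # A symmetric matrix is PSD iff all its eigenvalues are non-negative
    return bool(np.all(np.linalg.eigvalsh(M) >= -tol))

A = np.array([[3.0, 1.0], [1.0, 2.0]])  # PD (eigenvalues > 0)
B = np.eye(2)                           # PD (the identity)

lhs = is_psd(A - B)
rhs = is_psd(np.linalg.inv(B) - np.linalg.inv(A))
print(lhs, rhs)  # the two sides of the iff should agree
```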
9.3 Applications
9.3.1 OLS as the best linear unbiased estimator (BLUE)
First, let's introduce the four Gauss-Markov assumptions. I only state these briefly, in the interest of space, spending a little more time explaining the rank of a matrix. Collectively, these assumptions guarantee that the linear regression estimates $\hat{\beta}$ are BLUE (the best linear unbiased estimator of $\beta$).
1. The true model is linear such that $y = X\beta + u$, where $y$ is an $n \times 1$ vector, $X$ is an $n \times (k+1)$ matrix, and $u$ is an unobserved $n \times 1$ vector.
2. The rank of $X$ is $(k+1)$ (full rank), i.e. there are no linear dependencies among the variables in $X$. To understand what the rank of a matrix denotes, consider the following $3 \times 3$ matrix:

$$M_1 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 2 & 0 & 0 \end{bmatrix}$$

Note that the third row of $M_1$ is just two times the first row. They are therefore entirely linearly dependent, and so not separable. The number of independent rows (the rank of the matrix) is therefore 2. One way to think about this geometrically, as in Chapter 3, is to plot each row as a vector. The third vector would completely overlap the first, and so in terms of direction we would not be able to discern between them. In terms of the span of these two rows, moreover, there is no point that one can get to using a combination of both that one could not get to by scaling either one of them.
A slightly more complicated rank-deficient (i.e. not full rank) matrix would be:

$$M_2 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 2 & 1 & 0 \end{bmatrix}$$

Here note that the third row is not a scalar multiple of either other row. But it is a linear combination of the other two: if rows 1, 2, and 3 are represented by $a$, $b$, and $c$ respectively, then $c = 2a + b$. Again, geometrically, there is no point that the third row vector can take us to which cannot be achieved using only the first two rows.
An example of a matrix with full rank, i.e. no linear dependencies, would be:

$$M_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 2 & 0 & 1 \end{bmatrix}$$
It is easy to verify in R that $M_1$ and $M_2$ are rank-deficient, whereas $M_3$ is of full rank:

```r
M1 <- matrix(c(1,0,2,0,1,0,0,0,0), ncol = 3)
M1_rank <- qr(M1)$rank

M2 <- matrix(c(1,0,2,0,1,1,0,0,0), ncol = 3)
M2_rank <- qr(M2)$rank

M3 <- matrix(c(1,0,2,0,1,0,0,0,1), ncol = 3)
M3_rank <- qr(M3)$rank

print(paste("M1 rank:", M1_rank))
## [1] "M1 rank: 2"
print(paste("M2 rank:", M2_rank))
## [1] "M2 rank: 2"
print(paste("M3 rank:", M3_rank))
## [1] "M3 rank: 3"
```
3. $\mathbb{E}[u|X] = 0$, i.e. that the model has zero conditional mean or, in other words, our average error is zero.

4. $\text{Var}(u_i|X) = \sigma^2$ and $\text{Cov}(u_i, u_j|X) = 0$ for all $i \neq j$, or equivalently $\text{Var}(u|X) = \sigma^2 I_n$. This matrix has diagonal elements all equal to $\sigma^2$ and all off-diagonal elements equal to zero.
BLUE states that the OLS coefficient vector $\hat{\beta}$ is the best, or lowest-variance, linear unbiased estimator of the true $\beta$. Wooldridge (2013) has a nice proof of this claim (p. 812). Here I unpack his proof in slightly more detail, noting specifically how PD matrices are used.
To begin our proof of BLUE, let us denote any other linear estimator as $\tilde{\beta} = A'y$, where $A$ is some $n \times (k+1)$ matrix consisting of functions of $X$.
We know by definition that $y = X\beta + u$ and therefore that:
$$\tilde{\beta} = A'(X\beta + u) = A'X\beta + A'u.$$
The conditional expectation of $\tilde{\beta}$ can be expressed as:
$$E(\tilde{\beta}|X) = A'X\beta + E(A'u|X),$$
and since $A$ is a function of $X$, we can move it outside the expectation:
$$E(\tilde{\beta}|X) = A'X\beta + A'E(u|X).$$
By GM assumption no. 3, we know that $E(u|X) = 0$, therefore:
$$E(\tilde{\beta}|X) = A'X\beta.$$
Since we are only comparing $\hat{\beta}$ against other unbiased estimators, we know the conditional mean of any other estimator must equal the true parameter, and therefore that
$$A'X\beta = \beta.$$
The only way that this is true is if $A'X = I$. Hence, we can rewrite our estimator as
$$\tilde{\beta} = \beta + A'u.$$
The variance of our estimator $\tilde{\beta}$ then becomes
$$\begin{aligned} \text{Var}(\tilde{\beta}|X) &= E[(\tilde{\beta} - \beta)(\tilde{\beta} - \beta)'|X] \\ &= E[(A'u)(A'u)'|X] \\ &= A'E[uu'|X]A \\ &= A'[\text{Var}(u|X)]A \\ &= \sigma^2 A'A, \end{aligned}$$
since by GM assumption no. 4 the variance of the errors is a constant scalar $\sigma^2$.
Hence:
$$\text{Var}(\tilde{\beta}|X) - \text{Var}(\hat{\beta}|X) = \sigma^2 A'A - \sigma^2(X'X)^{-1} = \sigma^2[A'A - (X'X)^{-1}].$$

We know that $A'X = I$, and so we can manipulate this expression further:

$$\begin{aligned} \text{Var}(\tilde{\beta}|X) - \text{Var}(\hat{\beta}|X) &= \sigma^2[A'A - (X'X)^{-1}] \\ &= \sigma^2[A'A - A'X(X'X)^{-1}X'A] \\ &= \sigma^2 A'[A - X(X'X)^{-1}X'A] \\ &= \sigma^2 A'[I - X(X'X)^{-1}X']A \\ &= \sigma^2 A'MA. \end{aligned}$$
Note that we encountered $M$ in the previous chapter. It is the residual maker, and has the known property of being both symmetric and idempotent. Recall from the first section that any symmetric, idempotent matrix is positive semi-definite, and so we know that
$$\text{Var}(\tilde{\beta}|X) - \text{Var}(\hat{\beta}|X) \geq 0,$$
and thus that the regression estimator $\hat{\beta}$ is more efficient (hence better) than any other unbiased, linear estimator of $\beta$. $\blacksquare$

Note that $\hat{\beta}$ and $\tilde{\beta}$ are both $(k+1) \times 1$ vectors. As Wooldridge notes at the end of the proof, for any $(k+1) \times 1$ vector $a$, we can calculate the scalar $a'\beta$. Think of $a$ as the row vector of the $i$th observation from $X$. Then $a'\beta = a_0\beta_0 + a_1\beta_1 + ... + a_k\beta_k = y_i$. Both $a'\hat{\beta}$ and $a'\tilde{\beta}$ are unbiased estimators of $a'\beta$. Note as an extension of the proof above that
$$\text{Var}(a'\tilde{\beta}|X) - \text{Var}(a'\hat{\beta}|X) = a'[\text{Var}(\tilde{\beta}|X) - \text{Var}(\hat{\beta}|X)]a.$$
We know that $\text{Var}(\tilde{\beta}|X) - \text{Var}(\hat{\beta}|X)$ is PSD, and hence by definition that:
$$a'[\text{Var}(\tilde{\beta}|X) - \text{Var}(\hat{\beta}|X)]a \geq 0,$$
and hence, for any observation in $X$ (call it $x_i$), and more broadly any linear combination of $\hat{\beta}$, if the GM assumptions hold the estimate $\hat{y}_i = x_i\hat{\beta}$ has the smallest variance of any possible linear, unbiased estimator.
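The proof above can also be checked numerically. The sketch below is illustrative only: the design matrix, error variance, and weighting matrix $W$ are all arbitrary choices. It constructs a weighted least squares estimator as the "other" linear unbiased estimator $\tilde{\beta} = A'y$, and confirms that the difference in conditional variances is PSD:

```r
set.seed(42)
n <- 50
X <- cbind(1, rnorm(n), rnorm(n))  # n x (k+1) design matrix, k = 2
sigma2 <- 2                        # assumed error variance sigma^2

# OLS variance: sigma^2 (X'X)^{-1}
V_ols <- sigma2 * solve(t(X) %*% X)

# Alternative linear estimator: weighted least squares with arbitrary
# positive weights, so beta_tilde = A'y with A' = (X'WX)^{-1} X'W
W <- diag(runif(n, 0.5, 2))
A_t <- solve(t(X) %*% W %*% X) %*% t(X) %*% W

# Unbiasedness requires A'X = I
max(abs(A_t %*% X - diag(3)))  # ~0

# Variance of the alternative estimator: sigma^2 A'A
V_alt <- sigma2 * A_t %*% t(A_t)

# The difference Var(beta_tilde|X) - Var(beta_hat|X) should be PSD:
# all eigenvalues non-negative (up to numerical error)
eigen(V_alt - V_ols, symmetric = TRUE)$values
```

Any other positive weighting matrix would do: whatever $W$ we pick, the smallest eigenvalue of the difference never drops below zero.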
9.3.2 Optimisation problems
Optimisation problems, in essence, are about tweaking some parameter(s) until an objective function is as good as it can be. The objective function summarises some aspect of the model given a potential solution. For example, in OLS, our objective function is defined as $\sum_i (y_i - \hat{y}_i)^2$, the sum of squared errors. Typically, "as good as it can be" stands for "is minimised" or "is maximised." For example, with OLS we seek to minimise the sum of the squared error terms. In a slight extension of this idea, many machine learning models aim to minimise the prediction error on a "hold-out" sample of observations, i.e. observations not used to select the model parameters. The objective loss function may be the sum of squares, or it could be the mean squared error, or some more convoluted criterion.
By "tweaking" we mean that the parameter values of the model are adjusted in the hope of generating an even smaller (bigger) value from our objective function. For example, in least absolute shrinkage and selection operator (LASSO) regression, the goal is to minimise both the squared prediction error (as in OLS) and the total size of the coefficient vector. More formally, we can write this objective function as:
$$(y - X\beta)^2 + \lambda||\beta||_1,$$
where $\lambda$ is some scalar, and $||\beta||_1$ is the sum of the absolute size of the coefficients, i.e. $\sum_j |\beta_j|$.

There are two ways to potentially alter the value of the LASSO loss function: we can change the values within the vector $\beta$ or adjust the value of $\lambda$. In fact, iterating through values of $\lambda$, we can solve the squared-error part of the loss function, and then choose from our many different values of $\lambda$ the one that results in the smallest (read: minimised) objective function.
With infinitely many values of $\lambda$, we can perfectly identify the optimal model. But we are often constrained into considering only a subset of possible cases. If we are too coarse in terms of which $\lambda$ values to consider, we may miss out on substantial optimisation.
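A coarse grid search of this kind can be sketched in R. Everything here is a toy illustration: the data are simulated, the grid of $\lambda$ values is arbitrary, and `optim` stands in for a proper LASSO solver when minimising over $\beta$ at each $\lambda$:

```r
set.seed(1)
n <- 100
X <- cbind(rnorm(n), rnorm(n))
y <- 2 * X[, 1] + rnorm(n)  # only the first predictor matters

# LASSO objective: squared error plus lambda times the L1 norm of beta
lasso_loss <- function(beta, lambda) {
  sum((y - X %*% beta)^2) + lambda * sum(abs(beta))
}

# For each lambda on a coarse grid, minimise the loss over beta
lambdas <- c(0, 1, 10, 100)
fits <- lapply(lambdas, function(l) optim(c(0, 0), lasso_loss, lambda = l))

# Coefficients shrink towards zero as lambda grows
sapply(fits, function(f) round(f$par, 2))
```

In practice one would pick $\lambda$ by hold-out or cross-validated prediction error rather than the in-sample penalised loss, but the grid-search logic is the same.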
This problem is not just present in LASSO regression. Any non-parametric model (particularly those common in machine learning) is going to face similar optimisation problems. Fortunately, there are clever ways to reduce the computational intensity of these optimisation problems. Rather than iterating through a range of values (an "exhaustive grid search") we can instead use our current loss value to adjust our next choice of value for $\lambda$ (or whatever other parameter we are optimising over). This sequential method helps us narrow in on the optimal parameter values without necessarily having to consider many parameter combinations far from the minima.
Of course, the natural question is how do we know how to adjust the scalar $\lambda$, given our existing value? Should it be increased or decreased? One very useful algorithm is gradient descent (GD), which I will focus on in the remainder of this section. Briefly, the basics of GD are:
1. Take a (random) starting solution to your model.
2. Calculate the gradient (i.e. the $k$-length vector of derivatives) of the loss at that point.
3. If the gradient is positive (negative), decrease (increase) your parameter by the gradient value.
4. Repeat steps 2-3 until you converge on a stable solution.
Consider a quadratic curve in two dimensions, as in Figure 9.1. If the gradient at a given point is positive, then we know we are on the righthand slope. To move closer to the minimum point of the curve we want to go left, so we move in the negative direction. If the gradient is negative, we are on the lefthand slope and want to move in the positive direction. After every shift I can recalculate the gradient and keep adjusting.
Crucially, these movements are dictated by the absolute size of the gradient. Hence, as I approach the minimum point of the curve, the gradient, and therefore the movements, will be smaller. In Figure 9.1, we see that each iteration involves not only a move towards the global minimum, but also that the movements get smaller with each iteration.
Figure 9.1: Gradient descent procedure in two dimensions.
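The steps above can be sketched in R for a one-parameter quadratic loss. The loss function, starting value, number of iterations, and step size are all arbitrary choices for illustration:

```r
# Minimise the quadratic loss f(x) = (x - 3)^2 by gradient descent.
# Its derivative is f'(x) = 2(x - 3); the minimum is at x = 3.
grad <- function(x) 2 * (x - 3)

x <- 10     # arbitrary starting solution (step 1)
eta <- 0.1  # step size scaling each move

for (i in 1:100) {
  g <- grad(x)      # gradient at the current point (step 2)
  x <- x - eta * g  # move against the sign of the gradient (step 3)
}

x  # close to 3, the global minimum
```

Because the step is proportional to the gradient, the moves shrink automatically as $x$ approaches the minimum, exactly as described above.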
PD matrices are like the parabola above. Geometrically, they are bowl-shaped and are guaranteed to have a global minimum.2 Consider rolling a ball on the inside surface of this bowl. It would run up and down the edges (losing height each time) before eventually resting on the bottom of the bowl, i.e. converging on the global minimum. Our algorithm is therefore bound to find the global minimum, and this is obviously a very useful property from an optimisation perspective.
If a matrix is PSD, on the other hand, we are not guaranteed to converge on a global minimum. PSD matrices can have "saddle points" where the slope is zero in all directions but which are neither (local) minima nor maxima in all dimensions. Geometrically, for example, PSD matrices can look like hyperbolic paraboloids (shaped like a Pringles crisp). While there is a point on the surface that is flat in all dimensions, it may be a minimum in one dimension but a maximum in another.
PSD matrices prove more difficult to optimise because we are not guaranteed to converge on that point. At a point just away from the saddle point, we may actually want to move in the opposite direction to the gradient, depending on the axis. In other words, the individual elements of the gradient vector point in different directions. Again, imagine dropping a ball onto the surface of a hyperbolic paraboloid. The ball is likely to pass the saddle point then run off one of the sides: gravity is pulling it down into a minimum in one dimension, but away from a maximum in another. PSD matrices therefore prove trickier to optimise, and can even mean we do not converge on a minimum loss value. Basic algorithms like gradient descent are therefore less likely to be effective optimisers.
2See these UPenn lecture notes for more details.
9.3.3 Recap
In this final section, we have covered two applications of positive (semi-)definiteness: the proof of OLS as BLUE, and the ease of optimisation when a matrix is PD. There is clearly far more that can be discussed with respect to P(S)D matrices, and this chapter links or cites various resources that can be used to go further.
Bibliography
Angrist, J. D. and Pischke, J.-S. (2008). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press.
Aronow, P. and Miller, B. (2019). Foundations of Agnostic Statistics. Cambridge University Press.
Aronow, P. M. and Samii, C. (2016). Does regression produce representative estimates of causal effects? American Journal of Political Science, 60(1):250–267.

Bowler, S., Donovan, T., and Karp, J. A. (2006). Why politicians like electoral institutions: Self-interest, values, or ideology? The Journal of Politics, 68(2):434–446.

Cinelli, C. and Hazlett, C. (2020). Making sense of sensitivity: Extending omitted variable bias. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(1):39–67.

Davidson, R., MacKinnon, J. G., et al. (2004). Econometric Theory and Methods, volume 5. Oxford University Press, New York.

Frisch, R. and Waugh, F. V. (1933). Partial time regressions as compared with individual trends. Econometrica: Journal of the Econometric Society, pages 387–401.

Horn, R. A. and Johnson, C. R. (2013). Matrix Analysis. Cambridge University Press, New York, NY, 2nd edition.

King, G., Tomz, M., and Wittenberg, J. (2000). Making the most of statistical analyses: Improving interpretation and presentation. American Journal of Political Science, 44:341–355.

Lemons, D., Langevin, P., and Gythiel, A. (2002). An Introduction to Stochastic Processes in Physics. Johns Hopkins Paperback. Johns Hopkins University Press.

Liu, X. (2012). Appendix A: The Delta Method, pages 405–406. John Wiley & Sons, Ltd.

Lovell, M. C. (1963). Seasonal adjustment of economic time series and multiple regression analysis. Journal of the American Statistical Association, 58(304):993–1010.

van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.

Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer Texts in Statistics. Springer New York, New York, NY.

Wooldridge, J. M. (2013). Introductory Econometrics: A Modern Approach. South-Western, Mason, OH, 5th edition.