TRANSCRIPT
Review Course Statistics: Probability Theory - Statistical Inference - Matrix Algebra
Prof. Dr. Christian Conrad
Heidelberg University
Winter term 2012/13
Christian Conrad (Heidelberg University) Winter term 2012/13 1 / 88
Review Course Statistics
Christian Conrad
Email: [email protected]
Tue 09.10. – Thu 11.11.12, 09.00 – 12.00 and 14.00 – 16.00, HEU I
Slides: http://elearning2.uni-heidelberg.de/ ⇒ 10_MScE1C: Ökonometrie (WS 2012/13), Password: "econometrics12_13"
Econometrics
Lecture: Christian Conrad
Tue, 9.00-12.00, Bergheimer Str. 58, Hörsaal
Office hours: Mon 11.00-12.00, Bergheimer Str. 58, 01.019a
Email: [email protected]
Tutorial: Matthias Hartmann
Mon, 14.00-16.00; Theory: Bergheimer Str. 58, 00.010; STATA: Bergheimer Str. 58, 99.005-6
Wed, 14.00-16.00; Theory: Grabengasse 3-5, NUni HS 10; STATA: Bergheimer Str. 58, 99.005-6
Lecture notes, problem sets, . . . : http://elearning2.uni-heidelberg.de/ ⇒ 10_MScE1C: Ökonometrie (WS 2012/13)
Review Course Statistics
Contents
1. Review of Probability Theory
2. Review of Statistics
3. Matrix Algebra
Review Course Statistics
Literature
Stock, J. H. and M. W. Watson, Introduction to Econometrics, 3rd edition, Pearson, 2012.
Review of Probability Theory: Chapter 2 and Chapter 17.2, Appendix 17.1
Review of Statistics: Chapter 3 and Chapter 17.2
Matrix Algebra: Appendix 18.1
1 Review of Probability Theory
1.1 Random variables and probability distributions
1.2 Expected values, mean, and variance
1.3 Two random variables
1.4 Random sampling and the distribution of the sample average
1.5 Large-sample approximations to sampling distributions
1.1 Random variables and probability distributions
Random experiment
Our starting point is a random experiment with
mutually exclusive potential outcomes ω and
the set of all possible outcomes Ω, called the sample space.
An event A is a collection of outcomes and, hence, a subset of Ω.
The probability P(A) of an event is the proportion of the time the event occurs in the long run.
Example: Tossing a die once
Describe the sample space and the events A: "the outcome is an odd number" and B: "the outcome is an even number". What is P(A)?
Random variables
A random variable X is a numerical summary of a random outcome, i.e. the random variable assigns a real number X(ω) = x to each outcome ω ∈ Ω. x is called the realization.
X can be either discrete or continuous:
a discrete random variable takes on only a discrete set of values, like 0, 1, 2, . . .
a continuous random variable takes on a continuum of possible values
Cumulative distribution function (cdf)
The cumulative distribution function F is the probability that the random variable is less than or equal to a particular value x:
FX(x) = P(X ≤ x) = P({ω | X(ω) ≤ x})
Properties of the cdf:
1. FX is nondecreasing in x.
2. FX is right-continuous, that means lim (x→x0, x>x0) FX(x) = FX(x0).
3. lim (x→−∞) FX(x) = 0 and lim (x→∞) FX(x) = 1.
4. P(a < X ≤ b) = FX(b) − FX(a).
Probability distribution
The probability distribution of a discrete random variable is the list of all possible values, x1, x2, . . ., of the random variable and the probability that each value will occur. It takes the form
P(X = xi) = pi, i = 1, 2, . . .
where 0 ≤ pi ≤ 1 and Σi pi = 1, so that
FX(xi) = P(X ≤ xi) = Σ(xt ≤ xi) P(X = xt)
is a step function.
Example: A random variable is Bernoulli distributed if the outcome is binary with
X = 1 with probability p, and X = 0 with probability 1 − p.
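The Bernoulli definitions above can be sketched numerically; the following is an illustrative snippet (p = 0.3 is an arbitrary choice, not from the slides):

```python
# Bernoulli distribution in code: pmf and step-function cdf (illustrative p).
p = 0.3
pmf = {0: 1 - p, 1: p}          # P(X = 0) = 1 - p, P(X = 1) = p

def cdf(x):
    """Step-function cdf: F(x) = sum of the pmf over all values <= x."""
    return sum(prob for value, prob in pmf.items() if value <= x)

mean = sum(x * prob for x, prob in pmf.items())               # E[X] = p
var = sum((x - mean) ** 2 * prob for x, prob in pmf.items())  # Var[X] = p(1-p)
print(mean, var, cdf(0.5))
```

Note how the cdf jumps from 1 − p to 1 at x = 1, exactly the step function described above.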
Probability density function (pdf)
For a continuous random variable the probabilities are represented by the probability density function (pdf) fX(x), such that the area under the pdf between any two points a and b (where a < b) is the probability that the random variable falls between these two points:
P(a < X ≤ b) = ∫[a,b] fX(x) dx
and
FX(x) = P(X ≤ x) = ∫(−∞,x] fX(u) du
A function fX(x) is a pdf if and only if fX(x) ≥ 0 for all x and ∫(−∞,∞) fX(x) dx = 1.
Example: A continuous random variable X with density
fX(x) = 1/(√(2π) σX) · exp( −(x − µX)² / (2σ²X) )
with parameters µX and σX > 0 is said to be normally distributed. We use the notation:
X ∼ N(µX, σ²X)
If µX = 0 and σ²X = 1 the random variable is said to be standard normally distributed. In this case we denote the pdf and cdf by φ(x) and Φ(x).
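As a quick numerical companion (not part of the slides), φ and Φ can be evaluated with the standard library alone, using the identity Φ(x) = (1 + erf(x/√2))/2:

```python
import math

def phi(x):
    """Standard normal pdf: (1/sqrt(2*pi)) * exp(-x^2/2)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):
    """Standard normal cdf via Phi(x) = (1 + erf(x/sqrt(2))) / 2."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def normal_pdf(x, mu, sigma):
    """General N(mu, sigma^2) density, reusing the standard normal pdf."""
    return phi((x - mu) / sigma) / sigma

print(phi(0.0), Phi(0.0), Phi(1.96))
```

By symmetry, Φ(0) = 0.5, and Φ(1.96) ≈ 0.975, the value used for two-sided 5% tests later in the course.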
1.2 Expected values, mean, and variance
Expected value
The expected value (or mean) of a random variable X, denoted by µX = E(X), is given by
E[X] = Σi xi P(X = xi)
if X is discrete, and
E[X] = ∫(−∞,∞) x · fX(x) dx
if X is continuous. It is the “long-run average value of the random variable over many repeated trials”.
Variance
The variance of a random variable X, denoted by σ²X = Var(X), is given by
Var[X] = E[(X − µX)²] = Σi (xi − µX)² P(X = xi)
if X is discrete, and
Var[X] = ∫(−∞,∞) (x − µX)² · fX(x) dx
if X is continuous. It is a measure of the dispersion or the “spread” of a probability distribution. The square root of the variance is called the standard deviation and denoted by σX.
The variance can be written as: Var[X] = E[X²] − (E[X])².
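A worked numerical check of these formulas, using exact rational arithmetic on a fair die (our own illustration, not from the slides):

```python
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]
p = Fraction(1, 6)  # each face of a fair die has probability 1/6

EX = sum(x * p for x in outcomes)                      # E[X] = 7/2
EX2 = sum(x ** 2 * p for x in outcomes)                # E[X^2] = 91/6
var_direct = sum((x - EX) ** 2 * p for x in outcomes)  # definition of Var[X]
var_shortcut = EX2 - EX ** 2                           # shortcut formula
print(EX, var_direct, var_shortcut)
```

Both routes give Var[X] = 35/12, confirming Var[X] = E[X²] − (E[X])² on this example.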
Similarly, for any function g, we define the expectation E[g(X)] as
E[g(X)] = Σi g(xi) P(X = xi)
if X is discrete and
E[g(X)] = ∫(−∞,∞) g(x) fX(x) dx
if X is continuous.
Jensen’s Inequality: If g(X) is a convex function, then
g(E[X]) ≤ E[g(X)].
In particular (Expectation Inequality):
|E[X]| ≤ E[|X|]
Cauchy-Schwarz Inequality:
|E[XY]| ≤ √(E[X²] E[Y²])
(for the proof see Appendix 17.2)
Higher order moments:
For r = 1, 2, . . ., we define the r-th moment of X as
E(X^r)
and the r-th central moment of X as
E[(X − E(X))^r].
Remark: If E[X^r] < ∞, then all the raw moments of order less than r also exist.
Skewness:
Skewness = E[(X − µX)³] / σ³X
The skewness describes how much a distribution deviates from symmetry. For a symmetric distribution: Skewness = 0. The distribution has a long right (left) tail if Skewness > 0 (Skewness < 0).
Kurtosis:
Kurtosis = E[(X − µX)⁴] / σ⁴X
The kurtosis of a distribution is a measure of how much mass is in its tails. The greater the kurtosis of a distribution, the more likely are outliers. The kurtosis of a normally distributed random variable is 3. A distribution with kurtosis exceeding 3 is called leptokurtic or heavy-tailed.
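These two moments can be computed exactly for the fair-die example (our own illustration, not from the slides). The die is symmetric, so its skewness is zero, and its kurtosis is below the normal benchmark of 3:

```python
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]
p = Fraction(1, 6)

mu = sum(x * p for x in outcomes)                # 7/2
m2 = sum((x - mu) ** 2 * p for x in outcomes)    # variance sigma^2 = 35/12
m3 = sum((x - mu) ** 3 * p for x in outcomes)    # third central moment
m4 = sum((x - mu) ** 4 * p for x in outcomes)    # fourth central moment

skewness = float(m3) / float(m2) ** 1.5          # 0.0: symmetric distribution
kurtosis = m4 / m2 ** 2                          # exactly 303/175, about 1.73
print(skewness, kurtosis)
```

A kurtosis of 303/175 ≈ 1.73 < 3 means the die's distribution has thinner tails than the normal.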
Example: Calculate the mean and variance of a Bernoulli distributed random variable.
Example (E2.1): Let Y denote the number of “heads” that occur when two coins are tossed.
1. Derive the probability distribution of Y.
2. Derive the cumulative probability distribution of Y.
3. Derive the mean and variance of Y.
Example: Consider the discrete random variable X and the function g(X) = a + bX with a, b ∈ R. Derive E[g(X)] and Var[g(X)].
Example (E2.8): The random variable Y has a mean of 1 and a variance of 4. Let Z = (1/2)(Y − 1). Show that µZ = 0 and σ²Z = 1.
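A brute-force numerical check of E2.1 and E2.8 (the exercises ask for a derivation by hand; this sketch only confirms the answers):

```python
from fractions import Fraction
from itertools import product

half = Fraction(1, 2)

# E2.1: Y = number of heads in two fair coin tosses; each outcome has prob 1/4.
pmf = {}
for toss in product([0, 1], repeat=2):
    y = sum(toss)
    pmf[y] = pmf.get(y, Fraction(0)) + half * half

EY = sum(y * prob for y, prob in pmf.items())
VarY = sum((y - EY) ** 2 * prob for y, prob in pmf.items())
print(pmf, EY, VarY)  # P(Y=0)=1/4, P(Y=1)=1/2, P(Y=2)=1/4; E[Y]=1, Var[Y]=1/2

# E2.8: Z = (1/2)(Y' - 1) with E[Y'] = 1 and Var[Y'] = 4 gives
# mu_Z = (1/2)(1 - 1) = 0 and sigma^2_Z = (1/2)^2 * 4 = 1.
mu_Z = half * (1 - 1)
var_Z = half ** 2 * 4
```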
Example (E17.5): Suppose that W is a random variable with E[W⁴] < ∞. Show that E[W²] < ∞. [Hint: Calculate the variance of W².]
Example (E2.21): X is a random variable with moments E[X], E[X²], E[X³] and so forth.
1. Show E[(X − µX)³] = E[X³] − 3 E[X²] E[X] + 2(E[X])³.
2. Show E[(X − µX)⁴] = E[X⁴] − 4 E[X] E[X³] + 6 (E[X])² E[X²] − 3(E[X])⁴.
1.3 Two random variables
Joint and marginal distributions
The joint cdf of the random variables X and Y is given by
FX,Y(x, y) = P(X ≤ x, Y ≤ y).
The joint pdf fX,Y(x, y) of X and Y is given by
fX,Y(x, y) = P(X = x, Y = y)
if X and Y are discrete, and by
fX,Y(x, y) = ∂²/∂x∂y FX,Y(x, y)
if X and Y are continuous.
Suppose X and Y are discrete with outcomes x1, x2, . . . , xl and y1, y2, . . . , yk. Then the marginal probability distribution of Y is given by
P(Y = yj) = Σ(i=1..l) P(X = xi, Y = yj) for j = 1, . . . , k.
If X and Y are continuous the marginal density function of Y is given by
fY(y) = ∫(−∞,∞) fX,Y(x, y) dx.
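Marginalization in the discrete case is just summing the joint pmf over the other variable. A sketch with a small made-up joint distribution (the probabilities are illustrative only, not from the slides):

```python
from fractions import Fraction as F

# joint[(x, y)] = P(X = x, Y = y), an arbitrary valid joint pmf
joint = {(0, 0): F(3, 10), (0, 1): F(1, 5),
         (1, 0): F(1, 10), (1, 1): F(2, 5)}

def marginal(joint, axis):
    """Sum the joint pmf over the other coordinate (axis 0 -> P(X), 1 -> P(Y))."""
    out = {}
    for pair, prob in joint.items():
        out[pair[axis]] = out.get(pair[axis], F(0)) + prob
    return out

pX = marginal(joint, 0)   # P(X=0) = 1/2, P(X=1) = 1/2
pY = marginal(joint, 1)   # P(Y=0) = 2/5, P(Y=1) = 3/5
print(pX, pY)
```

Each marginal again sums to 1, as any probability distribution must.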
Conditional distributions/density
The conditional distribution/density of Y given X = x is given by
P(Y = y | X = x) = P(X = x, Y = y) / P(X = x)
if X and Y are discrete, and
fY|X=x(y) = fX,Y(x, y) / fX(x)
if X and Y are continuous.
Conditional expectation
The conditional expectation of Y given X = x is given by
E(Y | X = x) = Σ(j=1..k) yj P(Y = yj | X = x)
if X and Y are discrete, and
E(Y | X = x) = ∫(−∞,∞) y · fY|X=x(y) dy
if X and Y are continuous.
Conditional variance
Var[Y | X = x] = E[(Y − E(Y | X = x))² | X = x]
The law of iterated expectations
simple law of iterated expectations:
E(Y) = E(E(Y|X))
extended law of iterated expectations:
E(Y|X) = E(E(Y|X, Z)|X)
in general: let x and w be random vectors with x = f(w) for some function f:
E[Y|x] = E[E[Y|w]|x]
finally:
E(g(X)Y|X) = g(X)E(Y|X)
(for details see Wooldridge, Appendix 2A)
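The simple law of iterated expectations can be verified numerically on a small made-up joint pmf (the table is illustrative, not from the slides):

```python
from fractions import Fraction as F

# joint[(x, y)] = P(X = x, Y = y), an arbitrary valid joint pmf
joint = {(0, 0): F(3, 10), (0, 1): F(1, 5),
         (1, 0): F(1, 10), (1, 1): F(2, 5)}

pX = {}
for (x, y), prob in joint.items():
    pX[x] = pX.get(x, F(0)) + prob          # marginal of X

def cond_mean_Y(x):
    """E[Y | X = x] = sum_y y * P(X=x, Y=y) / P(X=x)."""
    return sum(y * prob for (xx, y), prob in joint.items() if xx == x) / pX[x]

EY_direct = sum(y * prob for (x, y), prob in joint.items())
EY_iterated = sum(cond_mean_Y(x) * px for x, px in pX.items())
print(EY_direct, EY_iterated)  # both equal 3/5
```

Averaging the conditional means E[Y|X = x] with weights P(X = x) reproduces the unconditional mean exactly, as E(Y) = E(E(Y|X)) requires.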
Independence
Two random variables are independent if knowing the value of one of the variables provides no information about the other. X and Y are independently distributed if, for all values x and y,
P(Y = y | X = x) = P(Y = y)
if X and Y are discrete, and
fY|X=x(y) = fY(y)
if X and Y are continuous. Alternatively, we can say that X and Y are independently distributed if the joint distribution equals the product of the marginal distributions, i.e.
P(X = x, Y = y) = P(X = x) P(Y = y)
fX,Y(x, y) = fX(x) fY(y)
Covariance and correlation
The covariance between X and Y is
σXY = Cov(X, Y) = E[(X − µX)(Y − µY)].
The correlation between X and Y is
ρXY = Corr(X, Y) = σXY / (σX σY).
Properties of Covariance and Correlation
−1 ≤ Corr(X, Y) ≤ 1 is a measure of linear dependence, free of units of measurement.
Cov(a + bX, c + dY) = bd Cov(X, Y)
Cov(X, Y) = E[XY] − E[X]E[Y]
X, Y are statistically independent ⇒ Cov(X, Y) = Corr(X, Y) = 0.
Cov(X, Y), Corr(X, Y) ≠ 0 ⇒ X, Y are statistically dependent.
Cov(X, Y) = Corr(X, Y) = 0 ⇏ X, Y are statistically independent.
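Two of these identities hold exactly for sample moments as well, which makes them easy to check in code (an illustrative sketch; the constants a, b, c, d and the data-generating process are arbitrary choices):

```python
import random

random.seed(0)
n = 200
X = [random.gauss(0, 1) for _ in range(n)]
Y = [0.5 * x + random.gauss(0, 1) for x in X]  # Y linearly dependent on X

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

# Identity 1: Cov(X, Y) = E[XY] - E[X]E[Y]
lhs = cov(X, Y)
rhs = mean([x * y for x, y in zip(X, Y)]) - mean(X) * mean(Y)

# Identity 2: Cov(a + bX, c + dY) = b*d*Cov(X, Y), here with b = 2, d = -3
scaled = cov([1 + 2 * x for x in X], [4 - 3 * y for y in Y])
print(lhs, rhs, scaled, -6 * lhs)
```

Both identities hold up to floating-point rounding, regardless of the sample drawn.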
Correlation and conditional mean
If E[Y|X] = µY, i.e. the conditional mean of Y does not depend on X, then Cov(X, Y) = 0 and Corr(X, Y) = 0.
However, Cov(X, Y) = 0 does not imply that the conditional mean of Y does not depend on X.
(see Example E2.23)
More on expectation, variance, and covariance
Consider the random variables X, Y and Z.
E(aX + bY) = aE(X) + bE(Y)
Var(aX + bY) = a2Var(X) + b2Var(Y) + 2abCov(X, Y)
If X and Y are independent, then
Var(X + Y) = Var(X) + Var(Y)
since Cov(X, Y) = 0. Finally,
Cov(a + bX + cY, Z) = bσXZ + cσYZ
(for Proofs see Appendix 2.1)
Example E2.6:
The following table gives the joint probability distribution between employment status and college graduation among those either employed or looking for work (unemployed) in the working-age U.S. population for 2008.
1. Compute E[Y].
2. The unemployment rate is the fraction of the labor force that is unemployed. Show that the unemployment rate is given by 1 − E[Y].
3. Calculate E[Y|X = 1] and E[Y|X = 0].
4. Calculate the unemployment rate for (i) college graduates and (ii) non-college graduates.
5. A randomly selected member of this population reports being unemployed. What is the probability that this worker is a college graduate? A non-college graduate?
6. Are educational achievement and employment status independent? Explain.
Example E2.19: Consider two random variables X and Y. Suppose that Y takes on k values y1, . . . , yk and that X takes on l values x1, . . . , xl.
1. Show that P(Y = yj) = Σ(i=1..l) P(Y = yj | X = xi) P(X = xi). [Hint: Use the definition of P(Y = yj | X = xi).]
2. Use your answer to 1. to verify the equation E[Y] = Σ(i=1..l) E[Y | X = xi] P(X = xi).
3. Suppose that X and Y are independent. Show that σXY = 0 and Corr(X, Y) = 0.
Example E2.20: Consider three random variables X, Y and Z. Suppose that Y takes on k values y1, . . . , yk, that X takes on l values x1, . . . , xl and that Z takes on m values z1, . . . , zm. The joint probability distribution of X, Y, Z is P(X = x, Y = y, Z = z), and the conditional probability distribution of Y given X and Z is
P(Y = y | X = x, Z = z) = P(X = x, Y = y, Z = z) / P(X = x, Z = z).
1. Explain how the marginal probability that Y = y can be calculated from the joint probability distribution. [Hint: This is a generalization of the equation P(Y = y) = Σ(i=1..l) P(X = xi, Y = y).]
2. Show that E[Y] = E[E[Y | X, Z]]. [Hint: This is a generalization of the equations E[Y] = Σ(i=1..l) E[Y | X = xi] P(X = xi) and E[Y] = E[E[Y | X]].]
Example E2.23: This exercise provides an example of a pair of random variables X and Y for which the conditional mean of Y given X depends on X but Corr(X, Y) = 0. Let X and Z be two independently distributed standard normal random variables, and let Y = X² + Z.
1. Show that E[Y | X] = X².
2. Show that µY = 1.
3. Show that E[XY] = 0. [Hint: Use the fact that the odd moments of a standard normal random variable are all zero.]
4. Show that Cov(X, Y) = 0 and thus Corr(X, Y) = 0.
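A seeded simulation sketch of E2.23 (the exercise asks for an analytical proof; the simulation only illustrates it). With Y = X² + Z, the sample correlation should be near zero even though E[Y|X] = X² clearly depends on X:

```python
import math
import random

random.seed(42)
n = 100_000
X = [random.gauss(0, 1) for _ in range(n)]
Y = [x * x + random.gauss(0, 1) for x in X]  # Y = X^2 + Z

mx = sum(X) / n
my = sum(Y) / n
cov = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / n
sx = math.sqrt(sum((x - mx) ** 2 for x in X) / n)
sy = math.sqrt(sum((y - my) ** 2 for y in Y) / n)
corr = cov / (sx * sy)
print(my, corr)  # my near 1 (since E[X^2] = 1), corr near 0
```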
Example E2.26: Suppose that Y1, Y2, . . . , Yn are random variables with a common mean µY, a common variance σ²Y, and the same correlation ρ (so that the correlation between Yi and Yj is equal to ρ for all pairs i and j where i ≠ j).
1. Show that Cov(Yi, Yj) = ρσ²Y for i ≠ j.
2. Suppose that n = 2. Show that E[Ȳ] = µY and Var[Ȳ] = (1/2)σ²Y + (1/2)ρσ²Y.
3. For n ≥ 2, show that E[Ȳ] = µY and Var[Ȳ] = (1/n)σ²Y + ((n−1)/n)ρσ²Y.
4. When n is very large, show that Var[Ȳ] ≈ ρσ²Y.
1.4 Random sampling and the distribution of the sample average
Random sampling
n objects, denoted by X1, X2, . . . , Xn, are randomly drawn from a population. That is, the Xi are random variables, independently and identically distributed (i.i.d.).
The sampling distribution of the sample average
The sample average is defined as X̄n = (1/n) Σ(i=1..n) Xi and is a random variable itself, i.e. it has a pdf called the sampling distribution. The mean and the variance of X̄n are given by
E[X̄n] = µX
and
Var[X̄n] = (1/n)σ²X.
If each Xi is normally distributed, i.e. Xi ∼ N(µX, σ²X), then X̄n ∼ N(µX, σ²X/n), since the sum of normally distributed random variables is again normally distributed.
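A seeded simulation sketch of the sampling distribution (µ, σ, n and the number of replications are illustrative choices): drawing many samples and averaging each one shows that the variance of the sample average is close to σ²/n:

```python
import random

random.seed(1)
mu, sigma, n, reps = 2.0, 3.0, 25, 20_000

means = []
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    means.append(sum(sample) / n)  # one realization of the sample average

grand_mean = sum(means) / reps
var_of_mean = sum((m - grand_mean) ** 2 for m in means) / reps
print(grand_mean, var_of_mean, sigma ** 2 / n)  # var_of_mean near 9/25 = 0.36
```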
1.5 Large-sample approximations to sampling distributions
We have seen that we can derive the exact sampling distribution of X̄n if each of the Xi is normally distributed. Since this result holds for any value of n, the sampling distribution is called the finite-sample distribution of X̄n.
In practice, we often do not know the distribution of the Xi. Nevertheless, in this situation we will be able to make statements about the asymptotic distribution of X̄n. That is, we will provide an approximation to the sampling distribution which becomes exact in the limit, i.e. for n → ∞.
Convergence in probability and the law of large numbers
Convergence in Probability
Let z1, z2, . . . , zn, . . . be a sequence of random variables. The sequence zn is said to converge in probability to a constant c if, for any ε > 0,
lim (n→∞) P(|zn − c| ≥ ε) = 0.
That is, the probability that zn is in the range c − ε to c + ε tends to 1 as n → ∞. Notation:
zn →p c or plim zn = c
A sequence of random vectors (matrices) converges in probability if each element converges in probability.
Useful result: if zn →p c and yn →p d, then
zn + yn →p c + d and zn yn →p cd
Law of Large Numbers (LLN)
If X1, . . . , Xn are independently and identically distributed (i.i.d.) with E[Xi] = µX and Var[Xi] < ∞, then
X̄n = (1/n) Σ(i=1..n) Xi →p µX.
The LLN says that the sample average X̄n converges in probability to the population mean.
We can prove the LLN by using Chebyshev’s inequality (see Appendix 17.2): if Y is a random variable and c is any constant, then
P(|Y − c| ≥ ε) ≤ E[(Y − c)²] / ε²
for any positive constant ε.
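The LLN is easy to watch in action. A seeded sketch (Bernoulli(0.25) draws are an arbitrary illustration): the running average drifts toward the population mean 0.25 as n grows:

```python
import random

random.seed(7)
p = 0.25
draws = [1 if random.random() < p else 0 for _ in range(100_000)]

def running_average(k):
    """Sample average of the first k draws."""
    return sum(draws[:k]) / k

for k in (10, 1_000, 100_000):
    print(k, running_average(k))
```

Individual draws remain 0 or 1 forever; only the average settles down.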
Convergence in distribution and the central limit theorem
Convergence in Distribution
Let Sn be a sequence of random variables and Fn the cdf of Sn. We say that Sn converges in distribution to a random variable S if the cdf Fn of Sn converges to the cdf F of S at every continuity point of F. We call F the asymptotic distribution of Sn.
Notation:
Sn →d S or simply Sn →d F
Central Limit Theorem (CLT)
Let Xi be i.i.d. with E[Xi] = µX and Var[Xi] = σ²X < ∞. Then,
√n (X̄n − µX)/σX →d N(0, 1).
The central limit theorem states that the distribution of the standardized sample average becomes arbitrarily well approximated by the standard normal distribution as n → ∞.
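A seeded CLT sketch (the Bernoulli(0.1) population and the sample sizes are arbitrary illustrations): standardized sample averages of a very non-normal distribution have roughly mean 0, variance 1, and about 95% of them land within ±1.96:

```python
import math
import random

random.seed(3)
p, n, reps = 0.1, 200, 10_000
mu, sigma = p, math.sqrt(p * (1 - p))  # Bernoulli mean and std. deviation

zs = []
for _ in range(reps):
    xbar = sum(1 if random.random() < p else 0 for _ in range(n)) / n
    zs.append(math.sqrt(n) * (xbar - mu) / sigma)  # standardized average

z_mean = sum(zs) / reps
z_var = sum((z - z_mean) ** 2 for z in zs) / reps
frac_within_196 = sum(abs(z) <= 1.96 for z in zs) / reps
print(z_mean, z_var, frac_within_196)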
Slutsky’s Theorem
If zn →p c and Sn →d S, then
1. zn + Sn →d c + S
2. zn Sn →d cS
3. Sn/zn →d S/c if c ≠ 0
Continuous Mapping Theorem
If g is a continuous function, then
1. if zn →p c then g(zn) →p g(c)
2. if Sn →d S then g(Sn) →d g(S)
2 Review of Statistics
2.1 Estimation of the population mean
2.2 Hypothesis tests concerning the population mean
2.3 Confidence intervals for the population mean
Statistical tools help us answer questions about unknown characteristics of distributions in populations of interest. E.g., what is the mean of the distribution of earnings of recent college graduates?
The key insight of statistics is that one can learn about a population distribution by selecting a random sample from that population. E.g., rather than survey the entire U.S. population, we might survey, say, 1000 members of the population, selected by random sampling.
Most of the interesting questions in economics involve relationships between two or more variables or comparisons between different populations. E.g., is there a gap between the mean earnings for male and female recent college graduates?
Three types of statistical methods are used in econometrics:
estimation: computing a “best guess” numerical value for an unknown characteristic of a population distribution.
hypothesis testing: formulating a specific hypothesis about the population, then using sample evidence to decide whether it is true.
confidence intervals: using a set of data to estimate an interval or range for an unknown population characteristic.
2.1 Estimation of the population mean
Estimator and estimate
Let X1, . . . , Xn be a sequence of i.i.d. random variables and ϑ an unknown characteristic of the distribution of Xi.
A function ϑ̂n = ϑ̂(X1, . . . , Xn) is called an estimator of ϑ.
The realized value ϑ̂(x1, . . . , xn) of an estimator ϑ̂(X1, . . . , Xn) is called the estimate of ϑ based on the sample x1, . . . , xn.
While ϑ̂(X1, . . . , Xn) is a random variable, ϑ̂(x1, . . . , xn) is a nonrandom number.
Properties of an estimator
An estimator ϑ̂n of an unknown parameter ϑ is
unbiased if
E(ϑ̂n) = ϑ
asymptotically unbiased if
lim (n→∞) E(ϑ̂n) = ϑ
consistent if ϑ̂n converges in probability to ϑ, that is, for all ε > 0 we have
lim (n→∞) P(|ϑ̂n − ϑ| ≥ ε) = 0
Let ϑ̃n be another estimator of ϑ and suppose that both ϑ̂n and ϑ̃n are unbiased. Then, ϑ̂n is said to be more efficient than ϑ̃n if Var[ϑ̂n] < Var[ϑ̃n].
Mean square error
The mean square error (MSE) of an estimator is defined as
MSE(ϑ̂n) = E[(ϑ̂n − ϑ)²] = (E[ϑ̂n] − ϑ)² + Var[ϑ̂n].
Hence, for unbiased estimators: MSE(ϑ̂n) = Var(ϑ̂n).
If the estimator is asymptotically unbiased and its variance goes to zero, i.e.
lim (n→∞) E[ϑ̂n] − ϑ = 0 and lim (n→∞) Var[ϑ̂n] = 0,
ϑ̂n is said to converge in mean square to ϑ, which implies that ϑ̂n is consistent.¹ Why?
¹A sequence of random variables zn converges in mean square to a constant c if
lim (n→∞) E[(zn − c)²] = 0
Convergence in mean square implies convergence in probability.
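The decomposition MSE = bias² + variance can be checked by Monte Carlo (a seeded sketch; the deliberately biased estimator X̄ + 1/n and all constants are illustrative choices):

```python
import random

random.seed(5)
theta, sigma, n, reps = 0.0, 1.0, 10, 50_000  # true mean, std. dev., sizes

est = []
for _ in range(reps):
    xbar = sum(random.gauss(theta, sigma) for _ in range(n)) / n
    est.append(xbar + 1 / n)       # estimator biased by exactly 1/n

m = sum(est) / reps
bias = m - theta                                 # close to 1/n = 0.1
var = sum((e - m) ** 2 for e in est) / reps      # close to sigma^2/n = 0.1
mse = sum((e - theta) ** 2 for e in est) / reps
print(bias, var, mse, bias ** 2 + var)
```

With these sample-moment definitions the identity mse = bias² + var holds exactly, not just approximately, which mirrors the population decomposition above.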
Properties of X̄
Example: What are the properties of the following three estimators of µX = E(X)?
a) ϑ̂a = (1/n) Σ(i=1..n) Xi
b) ϑ̂b = X1
c) ϑ̂c = ϑ̂a + 1/n
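A seeded Monte Carlo sketch of the three candidates (the exercise asks for an analytical answer; µ, σ, n are illustrative): a) is unbiased with variance σ²/n, b) is unbiased but has variance σ² and so is not consistent, c) carries a bias of exactly 1/n:

```python
import random

random.seed(11)
mu, sigma, n, reps = 1.0, 2.0, 20, 20_000

a_vals, b_vals, c_vals = [], [], []
for _ in range(reps):
    x = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(x) / n
    a_vals.append(xbar)          # sample mean
    b_vals.append(x[0])          # first observation only
    c_vals.append(xbar + 1 / n)  # sample mean shifted by 1/n

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((e - m) ** 2 for e in v) / len(v)

print(mean(a_vals), mean(b_vals), mean(c_vals))  # near 1.0, 1.0, 1.05
print(var(a_vals), var(b_vals))                  # near 0.2 vs. near 4.0
```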
Efficiency of X̄
Let µ̂X be an estimator of µX that is a linear function of X1, . . . , Xn, that is, µ̂X = Σ(i=1..n) ai Xi, where a1, . . . , an are nonrandom constants. If µ̂X is unbiased, then Var(X̄) < Var(µ̂X) unless µ̂X = X̄. Thus X̄ is the Best Linear Unbiased Estimator (BLUE); that is, X̄ is the most efficient estimator of µX among all unbiased estimators that are linear in the Xi’s. (Proof?)
Another motivation: for which choice of m is
Σ(i=1..n) (Xi − m)²
minimized, i.e. what is the best predictor of Xi in a mean square error sense? Again, this is X̄! (see Appendix 3.2)
Estimating the variance and covariance
The variance σ²X can be estimated by the sample variance:
s²X = (1/(n − 1)) Σ(i=1..n) (Xi − X̄)²
The sample variance is an unbiased estimator of the population variance (this is to be shown in E3.18). The sample variance is also consistent (see Appendix 3.3). Is sX = √(s²X) a consistent estimator of σX?
Similarly, we can show that the sample covariance
sXY = (1/(n − 1)) Σ(i=1..n) (Xi − X̄)(Yi − Ȳ)
is unbiased and consistent (see E3.20).
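The role of the n − 1 divisor can be seen in a seeded simulation sketch (σ² = 4 and n = 5 are illustrative): dividing by n − 1 gives an approximately unbiased variance estimate, while dividing by n is biased downward by the factor (n − 1)/n:

```python
import random

random.seed(2)
sigma2, n, reps = 4.0, 5, 100_000

s2_unbiased, s2_biased = 0.0, 0.0
for _ in range(reps):
    x = [random.gauss(0, 2) for _ in range(n)]
    xbar = sum(x) / n
    ss = sum((xi - xbar) ** 2 for xi in x)  # sum of squared deviations
    s2_unbiased += ss / (n - 1)
    s2_biased += ss / n
s2_unbiased /= reps
s2_biased /= reps
print(s2_unbiased, s2_biased)  # near 4.0 vs. near 3.2 = 4 * (n-1)/n
```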
Example E3.2: Let Y be a Bernoulli random variable with success probability P(Y = 1) = p, and let Y1, . . . , Yn be i.i.d. draws from this distribution. Let p̂ be the fraction of successes (1s) in this sample.
1. Show that p̂ = Ȳ.
2. Show that p̂ is an unbiased estimator of p.
3. Show that Var(p̂) = p(1 − p)/n.
Example E3.18: This exercise shows that the sample variance is an unbiased estimator of the population variance when Y1, . . . , Yn are i.i.d. with mean µY and variance σ²Y.
1. Use the equation Var(aX + bY) = a²σ²X + 2abσXY + b²σ²Y to show that E[(Yi − Ȳ)²] = Var(Yi) − 2Cov(Yi, Ȳ) + Var(Ȳ).
2. Use the equation Cov(a + bX + cV, Y) = bσXY + cσVY to show that Cov(Ȳ, Yi) = σ²Y/n.
3. Use the results in 1. and 2. to show that E[s²Y] = σ²Y.
Example E3.19:
1. Ȳ is an unbiased estimator of µY. Is Ȳ² an unbiased estimator of µ²Y?
2. Ȳ is a consistent estimator of µY. Is Ȳ² a consistent estimator of µ²Y?
2.2 Hypothesis tests concerning the population mean
We need to introduce some more distributions
Gauss statistic
Consider the sequence X1, . . . , Xn of random variables with Xi i.i.d. ∼ N(µX, σ²X). X̄n = (1/n) Σ(i=1..n) Xi is an estimator of µX with
X̄n ∼ N(µX, σ²X/n) and Z = (X̄n − µX)/(σX/√n) ∼ N(0, 1).
Z is called the Gauss statistic.
χ²-distribution
Consider the sequence X1, . . . , Xn of random variables with Xi i.i.d. ∼ N(0, 1). Then
Y = Σ(i=1..n) X²i ∼ χ²(n).
We say that Y is χ²-distributed with n degrees of freedom. What are the mean and variance of Y?
Later on we will use that (n − 1)s²X/σ²X ∼ χ²(n − 1). (To get some intuition see E2.24)
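A seeded simulation sketch of the question on the slide (n = 5 is an illustrative choice): a χ²(n) variable, built as a sum of n squared standard normals, has mean n and variance 2n:

```python
import random

random.seed(13)
n, reps = 5, 100_000

ys = []
for _ in range(reps):
    ys.append(sum(random.gauss(0, 1) ** 2 for _ in range(n)))  # chi^2(n) draw

m = sum(ys) / reps
v = sum((y - m) ** 2 for y in ys) / reps
print(m, v)  # near n = 5 and near 2n = 10
```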
t-distribution
Consider the random variables X ∼ N(0, 1) and Y ∼ χ²(n), with X and Y independent. Then the distribution of the random variable T with
T = X / √(Y/n)
is called the t-distribution with n degrees of freedom.
Null and alternative hypothesis
The starting point of statistical hypothesis testing is specifying the hypothesis to be tested, called the null hypothesis. Hypothesis testing entails using data to compare the null hypothesis to a second hypothesis, called the alternative hypothesis, that holds if the null does not.
We are interested in testing the hypothesis that the population mean µX takes on a specific value, µX,0:
H0 : µX = µX,0
The most general alternative hypothesis is
H1 : µX ≠ µX,0.
Because under the alternative µX can be either less than or greater than µX,0, it is called a two-sided alternative.
The problem facing the statistician is to use the evidence in a randomly selected sample of data to decide whether to “accept” the null hypothesis or to reject it in favor of the alternative hypothesis.
In any given sample, the sample average X̄ will rarely be exactly equal to the hypothesized value µX,0. Differences between X̄ and µX,0 can arise because the true mean in fact does not equal µX,0 (the null hypothesis is false) or because the true mean equals µX,0 (the null hypothesis is true) but X̄ differs from µX,0 because of random sampling.
When we undertake a statistical test, we can make two types of mistakes: we can incorrectly reject the null hypothesis when it is true (type I error), or we can fail to reject the null hypothesis when it is false (type II error).
How to proceed:
Specify the null and alternative hypothesis
Prespecify the probability of making a type I error, i.e. the significancelevel α:
α = P(rejecting H0|H0 is true).
Typically, we choose α = 0.05.
Derive a test statistic T = T(X1, . . . ,Xn) and its distribution under H0.
Determine a certain critical value such that the null hypothesis will berejected, if the test statistic exceeds this value. The set of values of thetest statistic for which the test rejects the null hypothesis is the rejectionregion (R), and the values of the test statistic for which it does not rejectthe null hypothesis is the acceptance region (A).
The null hypothesis is rejected if tact = T(x1, . . . , xn) ∈ R.
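As a minimal sketch, the whole procedure for a two-sided test with known σX can be written in a few lines of Python (function and variable names are illustrative, not from the course):

```python
import math

def two_sided_z_test(sample, mu_0, sigma_x, z_crit=1.96):
    """Test H0: mu_X = mu_0 against H1: mu_X != mu_0 when sigma_X is known.
    Rejects H0 iff |t_act| > z_crit (1.96 corresponds to alpha = 0.05)."""
    n = len(sample)
    xbar = sum(sample) / n
    # Test statistic T = sqrt(n) * (xbar - mu_0) / sigma_X
    t_act = math.sqrt(n) * (xbar - mu_0) / sigma_x
    return t_act, abs(t_act) > z_crit
```

The function returns both the realized test statistic tact and the reject/do-not-reject decision.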
The t-Statistic (two-sided test)
X1, ...,Xn are i.i.d. N (µX , σ2X).
Case 1: µX unknown, σ2X known.
1 H0 : µX = µX,0 against H1 : µX ≠ µX,0
2 e.g. α = 5%
3 If H0 is true, then
T = √n (X − µX,0)/σX ∼ N(0, 1)
4 Determine the critical value:
α = P(|T| > z1−α/2)
5 Reject H0 ⇔ |tact| > z1−α/2. (For α = 0.05 the critical value is z0.975 = 1.96.)
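The critical value z0.975 = 1.96 can be checked numerically by inverting the standard normal CDF Φ, for instance by bisection (a sketch; any root finder would do):

```python
import math

def phi(z):
    # Standard normal CDF, computed from the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def z_quantile(p, lo=-10.0, hi=10.0, tol=1e-10):
    """Find z with phi(z) = p by bisection (phi is strictly increasing)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

`z_quantile(0.975)` returns approximately 1.96, the two-sided 5% critical value.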
Case 2: µX unknown, σ2X unknown.
What happens if we replace σX by sX in the test statistic?
√n (X − µX,0)/sX = [√n (X − µX,0)/σX] / √[((n − 1)s²X/σ²X)/(n − 1)] ∼ t(n − 1)

since the numerator is N(0, 1), the denominator is the square root of a χ²(n − 1) random variable divided by its degrees of freedom, and X and s²X are independently distributed.
Now: X1, . . . , Xn are i.i.d., E[X⁴i] < ∞, but the distribution is unknown
Show that in this case:
T = √n (X − µX,0)/sX →d N(0, 1)
Thus, as long as the sample size is large, the distribution of the test statistic iswell approximated by the standard normal distribution.
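A small Monte Carlo simulation illustrates this approximation (a sketch with arbitrary choices of distribution, sample size, number of replications, and seed):

```python
import math
import random

random.seed(42)

def t_stat(sample, mu_0):
    """T = sqrt(n) * (xbar - mu_0) / s_X with the sample standard deviation."""
    n = len(sample)
    xbar = sum(sample) / n
    s2 = sum((x - xbar) ** 2 for x in sample) / (n - 1)
    return math.sqrt(n) * (xbar - mu_0) / math.sqrt(s2)

# Draw from a skewed, clearly non-normal distribution (exponential, mean 1)
# and record how often |T| > 1.96 when H0 is true.
n, reps = 500, 2000
rejections = sum(
    abs(t_stat([random.expovariate(1.0) for _ in range(n)], 1.0)) > 1.96
    for _ in range(reps)
)
rejection_rate = rejections / reps  # should be close to alpha = 0.05
```

Despite the non-normal population, the empirical rejection rate under the null is close to the nominal 5%.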
The p-Value
The p-value is the probability of drawing a statistic at least as adverse to thenull hypothesis as the one you actually computed in your sample, assumingthe null hypothesis is correct:
p-value = P(|(X − µX,0)/SE(X)| > |(Xact − µX,0)/SE(X)|) = P(|T| > |tact|) = 2(1 − Φ(|tact|))

with SE(X) = sX/√n being the standard error of X.
For a prespecified α, reject the null hypothesis if p < α, otherwise do notreject.
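In code, the two-sided p-value is a one-liner once Φ is available (a sketch using Python's math.erf):

```python
import math

def p_value_two_sided(t_act):
    """p = 2 * (1 - Phi(|t_act|)), Phi being the standard normal CDF."""
    phi = 0.5 * (1.0 + math.erf(abs(t_act) / math.sqrt(2.0)))
    return 2.0 * (1.0 - phi)
```

`p_value_two_sided(1.96)` is approximately 0.05, so for α = 0.05 the rejection rule p < α agrees with |tact| > 1.96.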
Sometimes we are interested in a one-sided alternative hypothesis, which can be written as
H1 : µX > µX,0
or
H1 : µX < µX,0.
In this case the p-value is given by
p-value = 1 − Φ(tact) for H1 : µX > µX,0, and p-value = Φ(tact) for H1 : µX < µX,0.
The N(0, 1) critical value for a one-sided test with α = 0.05 is 1.64 and −1.64, respectively.
2 Review of Statistics
2.3 Confidence intervals for the population mean
Because of random sampling error, it is impossible to learn the exact value of the population mean of X using only the information in the sample. However, it is possible to use data from a random sample to construct a set of values that contains the true population mean µX with a certain prespecified probability. Such a set is called a confidence interval and the probability 1 − α is called the confidence level.
E.g., for α = 0.05 the 95% confidence interval for the mean corresponds to the set of values for which the null hypothesis cannot be rejected:
[X − 1.96SE(X);X + 1.96SE(X)]
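A minimal sketch of this construction (1.96 is the α = 0.05 critical value; names are illustrative):

```python
import math

def confidence_interval_95(sample):
    """Return [xbar - 1.96*SE, xbar + 1.96*SE] with SE = s_X / sqrt(n)."""
    n = len(sample)
    xbar = sum(sample) / n
    s2 = sum((x - xbar) ** 2 for x in sample) / (n - 1)
    se = math.sqrt(s2 / n)
    return xbar - 1.96 * se, xbar + 1.96 * se
```

The interval is centered at the sample average and widens with the standard error.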
Example:
Assume that the body height X of the participants of a statistics lecture is normally distributed with σX = 10. The average height in a random sample of size n = 25 equals X = 183 cm. Test the hypothesis H0 : µX = 190 against the alternative H1 : µX ≠ 190 at the 5% significance level. What is the p-value of the test?
Now, assume that the standard deviation is unknown. However, you know that sX = 10 cm.
Finally, drop the assumption that X is normally distributed. Now, n = 50 and sX = 10 cm. How would you proceed?
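For the first part of the example (σX known), the numbers can be checked directly (a sketch; Φ is computed from math.erf):

```python
import math

n, xbar, sigma_x, mu_0 = 25, 183.0, 10.0, 190.0
t_act = math.sqrt(n) * (xbar - mu_0) / sigma_x      # = 5 * (-7) / 10 = -3.5
reject = abs(t_act) > 1.96                          # |t_act| = 3.5 > 1.96: reject H0
phi = 0.5 * (1.0 + math.erf(abs(t_act) / math.sqrt(2.0)))
p_value = 2.0 * (1.0 - phi)                         # well below 0.05
```

So H0 : µX = 190 is rejected at the 5% level, with a p-value far below the significance level.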
Example E2.24: Suppose Yi is distributed i.i.d. N(0, σ²Y) for i = 1, 2, . . . , n.
1 Show that E[Y²i /σ²Y] = 1.
2 Show that W = (1/σ²Y) ∑ni=1 Y²i is distributed χ²(n).
3 Show that E[W] = n. [Hint: Use your answer to 1.]
4 Show that
V = Y1 / √(∑ni=2 Y²i /(n − 1))
is distributed t(n − 1).
3 Matrix Algebra
3 Matrix Algebra
3.1 Basic principles
3.2 Multivariate statistics
3 Matrix Algebra
3.1 Basic principles
Basic principles
A matrix A is an n × K rectangular array of numbers, written as
A = ( a1,1 a1,2 · · · a1,K )
    ( a2,1 a2,2 · · · a2,K )
    (  ...   ...  . . .  ... )
    ( an,1 an,2 · · · an,K )
= (ai,j)i=1,...,n, j=1,...,K.
The transpose of a matrix, denoted A′, is obtained by flipping the matrix on its diagonal: the (i, j) element of A′ is aj,i, so A′ is a K × n matrix. Thus
A′ = (aj,i)i=1,...,K, j=1,...,n.
Example:
A = ( 1  2 3 )
    ( 0 −6 7 )

A′ = ( 1  0 )
     ( 2 −6 )
     ( 3  7 )
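With matrices stored as lists of rows, the transpose is a short function (a sketch reproducing the example above):

```python
def transpose(a):
    """Flip a matrix on its diagonal: entry (i, j) of the result is a[j][i]."""
    return [[a[i][j] for i in range(len(a))] for j in range(len(a[0]))]
```

Applying it twice returns the original matrix, since (A′)′ = A.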
Special matrices
A matrix A is
square if n = K.
symmetric if A = A′ which requires ai,j = aj,i.
diagonal if the off-diagonal elements are all zero, so that ai,j = 0 if i ≠ j.
upper (lower) triangular if all elements below (above) the diagonal equal zero.
An important diagonal matrix is the identity matrix, which has ones on the diagonal. The k × k identity matrix is denoted as
Ik = ( 1 0 · · · 0 )
     ( 0 1 · · · 0 )
     ( ... ... . . . ... )
     ( 0 0 · · · 1 ).
Basic operations
Matrix addition
A = (ai,j)i=1,...,n,j=1,...,K,B = (bi,j)i=1,...,n,j=1,...,K.
C = A + B = (ci,j)i=1,...,n,j=1,...,K = (ai,j + bi,j)i=1,...,n,j=1,...,K.
Example:
( 1 3 1 )   ( 0 0 5 )   ( 1+0 3+0 1+5 )   ( 1 3 6 )
( 1 0 0 ) + ( 7 5 0 ) = ( 1+7 0+5 0+0 ) = ( 8 5 0 )
Scalar multiplication
A = (ai,j)i=1,...,n,j=1,...,K, λ ∈ R.
λ · A = (λ · ai,j)i=1,...,n,j=1,...,K.
Example:
4 · ( 1  2 3 ) = ( 4   8 12 )
    ( 0 −6 7 )   ( 0 −24 28 )
Matrix multiplication
A = (ai,j)i=1,...,l,j=1,...,n,B = (bi,j)i=1,...,n,j=1,...,K.
C = A · B = (ci,j)i=1,...,l, j=1,...,K with ci,j = ∑nk=1 ai,k · bk,j
Example:
( 1  0 2 )   ( 3 1 )   ( 5 1 )
( −1 3 1 ) × ( 2 1 ) = ( 4 2 ).
             ( 1 0 )
with
5 = 1 · 3 + 0 · 2 + 2 · 1;
1 = 1 · 1 + 0 · 1 + 2 · 0;
4 = −1 · 3 + 3 · 2 + 1 · 1;
2 = −1 · 1 + 3 · 1 + 1 · 0.
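The definition translates directly into code (a sketch with matrices as lists of rows):

```python
def matmul(a, b):
    """Product of an l x n and an n x K matrix:
    c[i][j] = sum over k of a[i][k] * b[k][j]."""
    l, n, K = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(K)]
            for i in range(l)]
```

Running it on the matrices above reproduces the example's result.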
Properties
Matrix addition
i) A + B = B + A
ii) (A + B) + C = A + (B + C)
Matrix multiplication
i) (A · B) · C = A · (B · C)
ii) A · B ≠ B · A in general
Example:
( 1 2 ) · ( 0 1 ) = ( 0 1 ),   ( 0 1 ) · ( 1 2 ) = ( 3 4 ).
( 3 4 )   ( 0 0 )   ( 0 3 )    ( 0 0 )   ( 3 4 )   ( 0 0 )
iii) A · (B + C) = A · B + A · C and (B + C) · A = B · A + C · A
iv) Multiplication with identity matrix for any n × K matrix M:
M · IK = In · M = M.
v) A is idempotent if A · A = A.
vi) Transpose of a product: (A · B · C)′ = C′ · B′ · A′
Quadratic form
Consider a symmetric matrix A ∈ RK×K and a vector x ∈ RK×1. The expression
x′Ax = ∑Ki=1 ∑Kj=1 xi ai,j xj
is called a quadratic form.
Implications:
A is positive definite if x′Ax > 0 for all x ≠ 0.
A is negative definite if x′Ax < 0 for all x ≠ 0.
Problem: Show that the matrix
A = (  2 −1  0 )
    ( −1  2 −1 )
    (  0 −1  2 )
is positive definite.
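Evaluating x′Ax numerically makes the problem concrete (a sketch; a full proof would check, e.g., that all leading principal minors 2, 3, and 4 are positive):

```python
def quad_form(a, x):
    """x'Ax = sum over i and j of x_i * a_ij * x_j."""
    K = len(x)
    return sum(x[i] * a[i][j] * x[j] for i in range(K) for j in range(K))

# The matrix from the problem; x'Ax should be positive for every nonzero x.
A = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]
```

Spot-checking a few nonzero vectors gives strictly positive values, as the claim requires.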
Rank and inverse of a matrix
The rank of the n × K matrix (K ≤ n)
A = (a1, ..., aK)
is the number of linearly independent columns aj and is written as rank(A). A has full rank if rank(A) = K.
Properties:
A square k × k matrix A is said to be nonsingular if it has full rank, i.e., rank(A) = k. This means that there is no k × 1 vector c ≠ 0 such that Ac = 0.
If a square k × k matrix A is nonsingular, then there exists a unique k × k matrix A−1, called the inverse of A, which satisfies
AA−1 = A−1A = Ik.
If A is positive or negative definite, then A is nonsingular.
Problem: Compute the inverse of the matrix
A = [ 8 2 ]
    [ 2 1 ].
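For the 2 × 2 case the inverse has a closed form, A⁻¹ = (1/det A) [[d, −b], [−c, a]] for A = [[a, b], [c, d]], which is easy to sketch in code:

```python
def inverse_2x2(m):
    """Closed-form inverse of a 2x2 matrix [[a, b], [c, d]] via the adjugate."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]
```

For the problem's matrix, det A = 8 · 1 − 2 · 2 = 4, so the inverse has entries that are quarters of the adjugate's entries.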
Trace of a matrix
The trace of a k × k square matrix A is defined to be the sum of the elementson the main diagonal, i.e.,
tr(A) = ∑ki=1 ai,i
Properties for square matrices A and B and real λ are:
tr(λA) = λtr(A);
tr(A′) = tr(A);
tr(A + B) = tr(A) + tr(B);
tr(Ik) = k;
If A is an n × K matrix and B is a K × n matrix, then
tr(AB) = tr(BA).
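The last property, tr(AB) = tr(BA) for non-square A and B, can be checked with the matrices from the multiplication example (a sketch):

```python
def trace(a):
    """Sum of the main-diagonal elements of a square matrix."""
    return sum(a[i][i] for i in range(len(a)))

def matmul(a, b):
    """c[i][j] = sum over k of a[i][k] * b[k][j]."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

A = [[1, 0, 2], [-1, 3, 1]]   # 2 x 3
B = [[3, 1], [2, 1], [1, 0]]  # 3 x 2
```

AB is 2 × 2 and BA is 3 × 3, yet both products have the same trace.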