TRANSCRIPT
Review Course Statistics: Probability Theory - Statistical Inference - Matrix Algebra
Prof. Dr. Christian Conrad
Heidelberg University
Winter term 2012/13
Christian Conrad (Heidelberg University) Winter term 2012/13 1 / 88
Review Course Statistics
Christian Conrad
Email: [email protected]
Tue 09.10. – Thu 11.11.12, 09.00 – 12.00 and 14.00 – 16.00, HEU I
Slides: http://elearning2.uni-heidelberg.de/ ⇒ 10_MScE1C: Ökonometrie (WS 2012/13), Password: "econometrics12_13"
Econometrics
Lecture: Christian Conrad
Tue, 9.00-12.00, Bergheimer Str. 58, Hörsaal
Office hours: Mon 11.00-12.00, Bergheimer Str. 58, 01.019a
Email: [email protected]
Tutorial: Matthias Hartmann
Mon, 14.00-16.00; Theory: Bergheimer Str. 58, 00.010; STATA: Bergheimer Str. 58, 99.005-6
Wed, 14.00-16.00; Theory: Grabengasse 3-5, NUni HS 10; STATA: Bergheimer Str. 58, 99.005-6
Lecture notes, problem sets, . . . : http://elearning2.uni-heidelberg.de/ ⇒ 10_MScE1C: Ökonometrie (WS 2012/13)
Review Course Statistics
Contents
1. Review of Probability Theory
2. Review of Statistics
3. Matrix Algebra
Review Course Statistics
Literature
Stock, J. H. and M. W. Watson, Introduction to Econometrics, 3rd edition, Pearson, 2012.
Review of Probability Theory: Chapter 2 and Chapter 17.2, Appendix 17.1
Review of Statistics: Chapter 3 and Chapter 17.2
Matrix Algebra: Appendix 18.1
1 Review of Probability Theory
1.1 Random variables and probability distributions
1.2 Expected values, mean, and variance
1.3 Two random variables
1.4 Random sampling and the distribution of the sample average
1.5 Large-sample approximations to sampling distributions
1.1 Random variables and probability distributions
Random experiment
Our starting point is a random experiment with
mutually exclusive potential outcomes ω and
the set of all possible outcomes Ω, called the sample space.
An event A is a collection of outcomes and, hence, a subset of Ω.
The probability P(A) of an event is the proportion of the time the event occurs in the long run.
Example: Tossing a die once
Describe the sample space and the events A: "the outcome is an odd number" and B: "the outcome is an even number". What is P(A)?
Random variables
A random variable X is a numerical summary of a random outcome, i.e. the random variable assigns a real number X(ω) = x to each outcome ω ∈ Ω. x is called the realization.
X can be either discrete or continuous:
a discrete random variable takes on only a discrete set of values, like 0, 1, 2, . . .
a continuous random variable takes on a continuum of possible values
Cumulative distribution function (cdf)
The cumulative distribution function F is the probability that the random variable is less than or equal to a particular value x:
FX(x) = P(X ≤ x) = P({ω | X(ω) ≤ x})
Properties of the cdf:
1. FX is nondecreasing in x.
2. FX is right-continuous, that means lim (x→x0, x>x0) FX(x) = FX(x0).
3. lim (x→−∞) FX(x) = 0 and lim (x→∞) FX(x) = 1.
4. P(a < X ≤ b) = FX(b) − FX(a).
Probability distribution
The probability distribution of a discrete random variable is the list of all possible values, x1, x2, . . ., of the random variable and the probability that each value will occur. It takes the form
P(X = xi) = pi, i = 1, 2, . . .
where 0 ≤ pi ≤ 1 and Σi pi = 1, so that
FX(xi) = P(X ≤ xi) = Σ(xt ≤ xi) P(X = xt)
is a step function.
Example: A random variable is Bernoulli distributed if the outcome is binary with
X = 1 with probability p, and X = 0 with probability 1 − p.
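The Bernoulli definitions above can be sketched numerically; the following is an illustrative snippet (p = 0.3 is an arbitrary choice, not from the slides):

```python
# Bernoulli distribution in code: pmf and step-function cdf (illustrative p).
p = 0.3
pmf = {0: 1 - p, 1: p}          # P(X = 0) = 1 - p, P(X = 1) = p

def cdf(x):
    """Step-function cdf: F(x) = sum of the pmf over all values <= x."""
    return sum(prob for value, prob in pmf.items() if value <= x)

mean = sum(x * prob for x, prob in pmf.items())               # E[X] = p
var = sum((x - mean) ** 2 * prob for x, prob in pmf.items())  # Var[X] = p(1-p)
print(mean, var, cdf(0.5))
```

Note how the cdf jumps from 1 − p to 1 at x = 1, exactly the step function described above.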
Probability density function (pdf)
For a continuous random variable the probabilities are represented by the probability density function (pdf) fX(x), such that the area under the pdf between any two points a and b (where a < b) is the probability that the random variable falls between these two points:
P(a < X ≤ b) = ∫[a,b] fX(x) dx
and
FX(x) = P(X ≤ x) = ∫(−∞,x] fX(u) du
A function fX(x) is a pdf if and only if fX(x) ≥ 0 for all x and ∫(−∞,∞) fX(x) dx = 1.
Example: A continuous random variable X with density
fX(x) = 1/(√(2π) σX) · exp( −(x − µX)² / (2σ²X) )
with parameters µX and σX > 0 is said to be normally distributed. We use the notation:
X ∼ N(µX, σ²X)
If µX = 0 and σ²X = 1 the random variable is said to be standard normally distributed. In this case we denote the pdf and cdf by φ(x) and Φ(x).
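As a quick numerical companion (not part of the slides), φ and Φ can be evaluated with the standard library alone, using the identity Φ(x) = (1 + erf(x/√2))/2:

```python
import math

def phi(x):
    """Standard normal pdf: (1/sqrt(2*pi)) * exp(-x^2/2)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):
    """Standard normal cdf via Phi(x) = (1 + erf(x/sqrt(2))) / 2."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def normal_pdf(x, mu, sigma):
    """General N(mu, sigma^2) density, reusing the standard normal pdf."""
    return phi((x - mu) / sigma) / sigma

print(phi(0.0), Phi(0.0), Phi(1.96))
```

By symmetry, Φ(0) = 0.5, and Φ(1.96) ≈ 0.975, the value used for two-sided 5% tests later in the course.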
1.2 Expected values, mean, and variance
Expected value
The expected value (or mean) of a random variable X, denoted by µX = E(X), is given by
E[X] = Σi xi P(X = xi)
if X is discrete, and
E[X] = ∫(−∞,∞) x · fX(x) dx
if X is continuous. It is the “long-run average value of the random variable over many repeated trials”.
Variance
The variance of a random variable X, denoted by σ²X = Var(X), is given by
Var[X] = E[(X − µX)²] = Σi (xi − µX)² P(X = xi)
if X is discrete, and
Var[X] = ∫(−∞,∞) (x − µX)² · fX(x) dx
if X is continuous. It is a measure of the dispersion or the “spread” of a probability distribution. The square root of the variance is called the standard deviation and denoted by σX.
The variance can be written as: Var[X] = E[X²] − (E[X])².
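A worked numerical check of these formulas, using exact rational arithmetic on a fair die (our own illustration, not from the slides):

```python
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]
p = Fraction(1, 6)  # each face of a fair die has probability 1/6

EX = sum(x * p for x in outcomes)                      # E[X] = 7/2
EX2 = sum(x ** 2 * p for x in outcomes)                # E[X^2] = 91/6
var_direct = sum((x - EX) ** 2 * p for x in outcomes)  # definition of Var[X]
var_shortcut = EX2 - EX ** 2                           # shortcut formula
print(EX, var_direct, var_shortcut)
```

Both routes give Var[X] = 35/12, confirming Var[X] = E[X²] − (E[X])² on this example.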
Similarly, for any function g, we define the expectation E[g(X)] as
E[g(X)] = Σi g(xi) P(X = xi)
if X is discrete and
E[g(X)] = ∫(−∞,∞) g(x) fX(x) dx
if X is continuous.
Jensen’s Inequality: If g(X) is a convex function, then
g(E[X]) ≤ E[g(X)].
In particular (Expectation Inequality):
|E[X]| ≤ E[|X|]
Cauchy-Schwarz Inequality:
|E[XY]| ≤ √(E[X²] E[Y²])
(for the proof see Appendix 17.2)
Higher order moments:
For r = 1, 2, . . ., we define the r-th moment of X as
E(X^r)
and the r-th central moment of X as
E[(X − E(X))^r].
Remark: If E[X^r] < ∞, then all the raw moments of order less than r also exist.
Skewness:
Skewness = E[(X − µX)³] / σ³X
The skewness describes how much a distribution deviates from symmetry. For a symmetric distribution: Skewness = 0. The distribution has a long right (left) tail if Skewness > 0 (Skewness < 0).
Kurtosis:
Kurtosis = E[(X − µX)⁴] / σ⁴X
The kurtosis of a distribution is a measure of how much mass is in its tails. The greater the kurtosis of a distribution, the more likely are outliers. The kurtosis of a normally distributed random variable is 3. A distribution with kurtosis exceeding 3 is called leptokurtic or heavy-tailed.
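These two moments can be computed exactly for the fair-die example (our own illustration, not from the slides). The die is symmetric, so its skewness is zero, and its kurtosis is below the normal benchmark of 3:

```python
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]
p = Fraction(1, 6)

mu = sum(x * p for x in outcomes)                # 7/2
m2 = sum((x - mu) ** 2 * p for x in outcomes)    # variance sigma^2 = 35/12
m3 = sum((x - mu) ** 3 * p for x in outcomes)    # third central moment
m4 = sum((x - mu) ** 4 * p for x in outcomes)    # fourth central moment

skewness = float(m3) / float(m2) ** 1.5          # 0.0: symmetric distribution
kurtosis = m4 / m2 ** 2                          # exactly 303/175, about 1.73
print(skewness, kurtosis)
```

A kurtosis of 303/175 ≈ 1.73 < 3 means the die's distribution has thinner tails than the normal.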
Example: Calculate the mean and variance of a Bernoulli distributed random variable.
Example (E2.1): Let Y denote the number of “heads” that occur when two coins are tossed.
1. Derive the probability distribution of Y.
2. Derive the cumulative probability distribution of Y.
3. Derive the mean and variance of Y.
Example: Consider the discrete random variable X and the function g(X) = a + bX with a, b ∈ R. Derive E[g(X)] and Var[g(X)].
Example (E2.8): The random variable Y has a mean of 1 and a variance of 4. Let Z = (1/2)(Y − 1). Show that µZ = 0 and σ²Z = 1.
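A brute-force numerical check of E2.1 and E2.8 (the exercises ask for a derivation by hand; this sketch only confirms the answers):

```python
from fractions import Fraction
from itertools import product

half = Fraction(1, 2)

# E2.1: Y = number of heads in two fair coin tosses; each outcome has prob 1/4.
pmf = {}
for toss in product([0, 1], repeat=2):
    y = sum(toss)
    pmf[y] = pmf.get(y, Fraction(0)) + half * half

EY = sum(y * prob for y, prob in pmf.items())
VarY = sum((y - EY) ** 2 * prob for y, prob in pmf.items())
print(pmf, EY, VarY)  # P(Y=0)=1/4, P(Y=1)=1/2, P(Y=2)=1/4; E[Y]=1, Var[Y]=1/2

# E2.8: Z = (1/2)(Y' - 1) with E[Y'] = 1 and Var[Y'] = 4 gives
# mu_Z = (1/2)(1 - 1) = 0 and sigma^2_Z = (1/2)^2 * 4 = 1.
mu_Z = half * (1 - 1)
var_Z = half ** 2 * 4
```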
Example (E17.5): Suppose that W is a random variable with E[W⁴] < ∞. Show that E[W²] < ∞. [Hint: Calculate the variance of W².]
Example (E2.21): X is a random variable with moments E[X], E[X²], E[X³] and so forth.
1. Show E[(X − µX)³] = E[X³] − 3 E[X²] E[X] + 2(E[X])³.
2. Show E[(X − µX)⁴] = E[X⁴] − 4 E[X] E[X³] + 6 (E[X])² E[X²] − 3(E[X])⁴.
1.3 Two random variables
Joint and marginal distributions
The joint cdf of the random variables X and Y is given by
FX,Y(x, y) = P(X ≤ x, Y ≤ y).
The joint pdf fX,Y(x, y) of X and Y is given by
fX,Y(x, y) = P(X = x, Y = y)
if X and Y are discrete, and by
fX,Y(x, y) = ∂²/∂x∂y FX,Y(x, y)
if X and Y are continuous.
Suppose X and Y are discrete with outcomes x1, x2, . . . , xl and y1, y2, . . . , yk. Then the marginal probability distribution of Y is given by
P(Y = yj) = Σ(i=1..l) P(X = xi, Y = yj) for j = 1, . . . , k.
If X and Y are continuous the marginal density function of Y is given by
fY(y) = ∫(−∞,∞) fX,Y(x, y) dx.
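Marginalization in the discrete case is just summing the joint pmf over the other variable. A sketch with a small made-up joint distribution (the probabilities are illustrative only, not from the slides):

```python
from fractions import Fraction as F

# joint[(x, y)] = P(X = x, Y = y), an arbitrary valid joint pmf
joint = {(0, 0): F(3, 10), (0, 1): F(1, 5),
         (1, 0): F(1, 10), (1, 1): F(2, 5)}

def marginal(joint, axis):
    """Sum the joint pmf over the other coordinate (axis 0 -> P(X), 1 -> P(Y))."""
    out = {}
    for pair, prob in joint.items():
        out[pair[axis]] = out.get(pair[axis], F(0)) + prob
    return out

pX = marginal(joint, 0)   # P(X=0) = 1/2, P(X=1) = 1/2
pY = marginal(joint, 1)   # P(Y=0) = 2/5, P(Y=1) = 3/5
print(pX, pY)
```

Each marginal again sums to 1, as any probability distribution must.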
Conditional distributions/density
The conditional distribution/density of Y given X = x is given by
P(Y = y | X = x) = P(X = x, Y = y) / P(X = x)
if X and Y are discrete, and
fY|X=x(y) = fX,Y(x, y) / fX(x)
if X and Y are continuous.
Conditional expectation
The conditional expectation of Y given X = x is given by
E(Y | X = x) = Σ(j=1..k) yj P(Y = yj | X = x)
if X and Y are discrete, and
E(Y | X = x) = ∫(−∞,∞) y · fY|X=x(y) dy
if X and Y are continuous.
Conditional variance
Var[Y | X = x] = E[(Y − E(Y | X = x))² | X = x]
The law of iterated expectations
simple law of iterated expectations:
E(Y) = E(E(Y|X))
extended law of iterated expectations:
E(Y|X) = E(E(Y|X, Z)|X)
in general: let x and w be random vectors with x = f(w) for some function f:
E[Y|x] = E[E[Y|w]|x]
finally:
E(g(X)Y|X) = g(X)E(Y|X)
(for details see Wooldridge, Appendix 2A)
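The simple law of iterated expectations can be verified numerically on a small made-up joint pmf (the table is illustrative, not from the slides):

```python
from fractions import Fraction as F

# joint[(x, y)] = P(X = x, Y = y), an arbitrary valid joint pmf
joint = {(0, 0): F(3, 10), (0, 1): F(1, 5),
         (1, 0): F(1, 10), (1, 1): F(2, 5)}

pX = {}
for (x, y), prob in joint.items():
    pX[x] = pX.get(x, F(0)) + prob          # marginal of X

def cond_mean_Y(x):
    """E[Y | X = x] = sum_y y * P(X=x, Y=y) / P(X=x)."""
    return sum(y * prob for (xx, y), prob in joint.items() if xx == x) / pX[x]

EY_direct = sum(y * prob for (x, y), prob in joint.items())
EY_iterated = sum(cond_mean_Y(x) * px for x, px in pX.items())
print(EY_direct, EY_iterated)  # both equal 3/5
```

Averaging the conditional means E[Y|X = x] with weights P(X = x) reproduces the unconditional mean exactly, as E(Y) = E(E(Y|X)) requires.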
Independence
Two random variables are independent if knowing the value of one of the variables provides no information about the other. X and Y are independently distributed if, for all values x and y,
P(Y = y | X = x) = P(Y = y)
if X and Y are discrete, and
fY|X=x(y) = fY(y)
if X and Y are continuous. Alternatively, we can say that X and Y are independently distributed if the joint distribution equals the product of the marginal distributions, i.e.
P(X = x, Y = y) = P(X = x) P(Y = y)
fX,Y(x, y) = fX(x) fY(y)
Covariance and correlation
The covariance between X and Y is
σXY = Cov(X, Y) = E[(X − µX)(Y − µY)].
The correlation between X and Y is
ρXY = Corr(X, Y) = σXY / (σX σY).
Properties of Covariance and Correlation
−1 ≤ Corr(X, Y) ≤ 1 is a measure of linear dependence, free of units of measurement.
Cov(a + bX, c + dY) = bd Cov(X, Y)
Cov(X, Y) = E[XY] − E[X]E[Y]
X, Y are statistically independent ⇒ Cov(X, Y) = Corr(X, Y) = 0.
Cov(X, Y), Corr(X, Y) ≠ 0 ⇒ X, Y are statistically dependent.
Cov(X, Y) = Corr(X, Y) = 0 ⇏ X, Y are statistically independent.
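Two of these identities hold exactly for sample moments as well, which makes them easy to check in code (an illustrative sketch; the constants a, b, c, d and the data-generating process are arbitrary choices):

```python
import random

random.seed(0)
n = 200
X = [random.gauss(0, 1) for _ in range(n)]
Y = [0.5 * x + random.gauss(0, 1) for x in X]  # Y linearly dependent on X

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

# Identity 1: Cov(X, Y) = E[XY] - E[X]E[Y]
lhs = cov(X, Y)
rhs = mean([x * y for x, y in zip(X, Y)]) - mean(X) * mean(Y)

# Identity 2: Cov(a + bX, c + dY) = b*d*Cov(X, Y), here with b = 2, d = -3
scaled = cov([1 + 2 * x for x in X], [4 - 3 * y for y in Y])
print(lhs, rhs, scaled, -6 * lhs)
```

Both identities hold up to floating-point rounding, regardless of the sample drawn.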
Correlation and conditional mean
If E[Y|X] = µY, i.e. the conditional mean of Y does not depend on X, then Cov(X, Y) = 0 and Corr(X, Y) = 0.
However, Cov(X, Y) = 0 does not imply that the conditional mean of Y does not depend on X.
(see Example E2.23)
More on expectation, variance, and covariance
Consider the random variables X, Y and Z.
E(aX + bY) = aE(X) + bE(Y)
Var(aX + bY) = a2Var(X) + b2Var(Y) + 2abCov(X, Y)
If X and Y are independent, then
Var(X + Y) = Var(X) + Var(Y)
since Cov(X, Y) = 0. Finally,
Cov(a + bX + cY, Z) = bσXZ + cσYZ
(for Proofs see Appendix 2.1)
Example E2.6:
The following table gives the joint probability distribution between employment status and college graduation among those either employed or looking for work (unemployed) in the working-age U.S. population for 2008.
1. Compute E[Y].
2. The unemployment rate is the fraction of the labor force that is unemployed. Show that the unemployment rate is given by 1 − E[Y].
3. Calculate E[Y|X = 1] and E[Y|X = 0].
4. Calculate the unemployment rate for (i) college graduates and (ii) non-college graduates.
5. A randomly selected member of this population reports being unemployed. What is the probability that this worker is a college graduate? A non-college graduate?
6. Are educational achievement and employment status independent? Explain.
Example E2.19: Consider two random variables X and Y. Suppose that Y takes on k values y1, . . . , yk and that X takes on l values x1, . . . , xl.
1. Show that P(Y = yj) = Σ(i=1..l) P(Y = yj | X = xi) P(X = xi). [Hint: Use the definition of P(Y = yj | X = xi).]
2. Use your answer to 1. to verify the equation E[Y] = Σ(i=1..l) E[Y | X = xi] P(X = xi).
3. Suppose that X and Y are independent. Show that σXY = 0 and Corr(X, Y) = 0.
Example E2.20: Consider three random variables X, Y and Z. Suppose that Y takes on k values y1, . . . , yk, that X takes on l values x1, . . . , xl and that Z takes on m values z1, . . . , zm. The joint probability distribution of X, Y, Z is P(X = x, Y = y, Z = z), and the conditional probability distribution of Y given X and Z is
P(Y = y | X = x, Z = z) = P(X = x, Y = y, Z = z) / P(X = x, Z = z).
1. Explain how the marginal probability that Y = y can be calculated from the joint probability distribution. [Hint: This is a generalization of the equation P(Y = y) = Σ(i=1..l) P(X = xi, Y = y).]
2. Show that E[Y] = E[E[Y | X, Z]]. [Hint: This is a generalization of the equations E[Y] = Σ(i=1..l) E[Y | X = xi] P(X = xi) and E[Y] = E[E[Y | X]].]
Example E2.23: This exercise provides an example of a pair of random variables X and Y for which the conditional mean of Y given X depends on X but Corr(X, Y) = 0. Let X and Z be two independently distributed standard normal random variables, and let Y = X² + Z.
1. Show that E[Y | X] = X².
2. Show that µY = 1.
3. Show that E[XY] = 0. [Hint: Use the fact that the odd moments of a standard normal random variable are all zero.]
4. Show that Cov(X, Y) = 0 and thus Corr(X, Y) = 0.
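A seeded simulation sketch of E2.23 (the exercise asks for an analytical proof; the simulation only illustrates it). With Y = X² + Z, the sample correlation should be near zero even though E[Y|X] = X² clearly depends on X:

```python
import math
import random

random.seed(42)
n = 100_000
X = [random.gauss(0, 1) for _ in range(n)]
Y = [x * x + random.gauss(0, 1) for x in X]  # Y = X^2 + Z

mx = sum(X) / n
my = sum(Y) / n
cov = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / n
sx = math.sqrt(sum((x - mx) ** 2 for x in X) / n)
sy = math.sqrt(sum((y - my) ** 2 for y in Y) / n)
corr = cov / (sx * sy)
print(my, corr)  # my near 1 (since E[X^2] = 1), corr near 0
```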
Example E2.26: Suppose that Y1, Y2, . . . , Yn are random variables with a common mean µY, a common variance σ²Y, and the same correlation ρ (so that the correlation between Yi and Yj is equal to ρ for all pairs i and j where i ≠ j).
1. Show that Cov(Yi, Yj) = ρσ²Y for i ≠ j.
2. Suppose that n = 2. Show that E[Ȳ] = µY and Var[Ȳ] = (1/2)σ²Y + (1/2)ρσ²Y.
3. For n ≥ 2, show that E[Ȳ] = µY and Var[Ȳ] = (1/n)σ²Y + ((n−1)/n)ρσ²Y.
4. When n is very large, show that Var[Ȳ] ≈ ρσ²Y.
1.4 Random sampling and the distribution of the sample average
Random sampling
n objects, denoted by X1, X2, . . . , Xn, are randomly drawn from a population. That is, the Xi are random variables, independently and identically distributed (i.i.d.).
The sampling distribution of the sample average
The sample average is defined as X̄n = (1/n) Σ(i=1..n) Xi and is a random variable itself, i.e. it has a pdf called the sampling distribution. The mean and the variance of X̄n are given by
E[X̄n] = µX
and
Var[X̄n] = (1/n)σ²X.
If each Xi is normally distributed, i.e. Xi ∼ N(µX, σ²X), then X̄n ∼ N(µX, σ²X/n), since the sum of normally distributed random variables is again normally distributed.
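A seeded simulation sketch of the sampling distribution (µ, σ, n and the number of replications are illustrative choices): drawing many samples and averaging each one shows that the variance of the sample average is close to σ²/n:

```python
import random

random.seed(1)
mu, sigma, n, reps = 2.0, 3.0, 25, 20_000

means = []
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    means.append(sum(sample) / n)  # one realization of the sample average

grand_mean = sum(means) / reps
var_of_mean = sum((m - grand_mean) ** 2 for m in means) / reps
print(grand_mean, var_of_mean, sigma ** 2 / n)  # var_of_mean near 9/25 = 0.36
```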
1.5 Large-sample approximations to sampling distributions
We have seen that we can derive the exact sampling distribution of X̄n if each of the Xi is normally distributed. Since this result holds for any value of n, the sampling distribution is called the finite-sample distribution of X̄n.
In practice, we often do not know the distribution of the Xi. Nevertheless, in this situation we will be able to make statements about the asymptotic distribution of X̄n. That is, we will provide an approximation to the sampling distribution which becomes exact in the limit, i.e. for n → ∞.
Convergence in probability and the law of large numbers
Convergence in Probability
Let z1, z2, . . . , zn, . . . be a sequence of random variables. The sequence zn is said to converge in probability to a constant c if, for any ε > 0,
lim (n→∞) P(|zn − c| ≥ ε) = 0.
That is, the probability that zn is in the range c − ε to c + ε tends to 1 as n → ∞. Notation:
zn →p c or plim zn = c
A sequence of random vectors (matrices) converges in probability if each element converges in probability.
Useful result: if zn →p c and yn →p d, then
zn + yn →p c + d and zn yn →p cd
Law of Large Numbers (LLN)
If X1, . . . , Xn are independently and identically distributed (i.i.d.) with E[Xi] = µX and Var[Xi] < ∞, then
X̄n = (1/n) Σ(i=1..n) Xi →p µX.
The LLN says that the sample average X̄n converges in probability to the population mean.
We can prove the LLN by using Chebyshev’s inequality (see Appendix 17.2): if Y is a random variable and c is any constant, then
P(|Y − c| ≥ ε) ≤ E[(Y − c)²] / ε²
for any positive constant ε.
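The LLN is easy to watch in action. A seeded sketch (Bernoulli(0.25) draws are an arbitrary illustration): the running average drifts toward the population mean 0.25 as n grows:

```python
import random

random.seed(7)
p = 0.25
draws = [1 if random.random() < p else 0 for _ in range(100_000)]

def running_average(k):
    """Sample average of the first k draws."""
    return sum(draws[:k]) / k

for k in (10, 1_000, 100_000):
    print(k, running_average(k))
```

Individual draws remain 0 or 1 forever; only the average settles down.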
Convergence in distribution and the central limit theorem
Convergence in Distribution
Let Sn be a sequence of random variables and Fn the cdf of Sn. We say that Sn converges in distribution to a random variable S if the cdf Fn of Sn converges to the cdf F of S at every continuity point of F. We call F the asymptotic distribution of Sn.
Notation:
Sn →d S or simply Sn →d F
Central Limit Theorem (CLT)
Let Xi be i.i.d. with E[Xi] = µX and Var[Xi] = σ²X < ∞. Then,
√n (X̄n − µX)/σX →d N(0, 1).
The central limit theorem states that the distribution of the standardized sample average becomes arbitrarily well approximated by the standard normal distribution as n → ∞.
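A seeded CLT sketch (the Bernoulli(0.1) population and the sample sizes are arbitrary illustrations): standardized sample averages of a very non-normal distribution have roughly mean 0, variance 1, and about 95% of them land within ±1.96:

```python
import math
import random

random.seed(3)
p, n, reps = 0.1, 200, 10_000
mu, sigma = p, math.sqrt(p * (1 - p))  # Bernoulli mean and std. deviation

zs = []
for _ in range(reps):
    xbar = sum(1 if random.random() < p else 0 for _ in range(n)) / n
    zs.append(math.sqrt(n) * (xbar - mu) / sigma)  # standardized average

z_mean = sum(zs) / reps
z_var = sum((z - z_mean) ** 2 for z in zs) / reps
frac_within_196 = sum(abs(z) <= 1.96 for z in zs) / reps
print(z_mean, z_var, frac_within_196)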
Slutsky’s Theorem
If zn →p c and Sn →d S, then
1. zn + Sn →d c + S
2. zn Sn →d cS
3. Sn/zn →d S/c if c ≠ 0
Continuous Mapping Theorem
If g is a continuous function, then
1. if zn →p c then g(zn) →p g(c)
2. if Sn →d S then g(Sn) →d g(S)
2 Review of Statistics
2.1 Estimation of the population mean
2.2 Hypothesis tests concerning the population mean
2.3 Confidence intervals for the population mean
Statistical tools help us answer questions about unknown characteristics of distributions in populations of interest. E.g., what is the mean of the distribution of earnings of recent college graduates?
The key insight of statistics is that one can learn about a population distribution by selecting a random sample from that population. E.g., rather than survey the entire U.S. population, we might survey, say, 1000 members of the population, selected by random sampling.
Most of the interesting questions in economics involve relationships between two or more variables or comparisons between different populations. E.g., is there a gap between the mean earnings for male and female recent college graduates?
Three types of statistical methods are used in econometrics:
estimation: computing a “best guess” numerical value for an unknown characteristic of a population distribution.
hypothesis testing: formulating a specific hypothesis about the population, then using sample evidence to decide whether it is true.
confidence intervals: using a set of data to estimate an interval or range for an unknown population characteristic.
2.1 Estimation of the population mean
Estimator and estimate
Let X1, . . . , Xn be a sequence of i.i.d. random variables and ϑ an unknown characteristic of the distribution of Xi.
A function ϑ̂n = ϑ̂(X1, . . . , Xn) is called an estimator of ϑ.
The realized value ϑ̂(x1, . . . , xn) of an estimator ϑ̂(X1, . . . , Xn) is called the estimate of ϑ based on the sample x1, . . . , xn.
While ϑ̂(X1, . . . , Xn) is a random variable, ϑ̂(x1, . . . , xn) is a nonrandom number.
Properties of an estimator
An estimator ϑ̂n of an unknown parameter ϑ is
unbiased if
E(ϑ̂n) = ϑ
asymptotically unbiased if
lim (n→∞) E(ϑ̂n) = ϑ
consistent if ϑ̂n converges in probability to ϑ, that is, for all ε > 0 we have
lim (n→∞) P(|ϑ̂n − ϑ| ≥ ε) = 0
Let ϑ̃n be another estimator of ϑ and suppose that both ϑ̂n and ϑ̃n are unbiased. Then, ϑ̂n is said to be more efficient than ϑ̃n if Var[ϑ̂n] < Var[ϑ̃n].
Mean square error
The mean square error (MSE) of an estimator is defined as
MSE(ϑ̂n) = E[(ϑ̂n − ϑ)²] = (E[ϑ̂n] − ϑ)² + Var[ϑ̂n].
Hence, for unbiased estimators: MSE(ϑ̂n) = Var(ϑ̂n).
If the estimator is asymptotically unbiased and its variance goes to zero, i.e.
lim (n→∞) E[ϑ̂n] − ϑ = 0 and lim (n→∞) Var[ϑ̂n] = 0,
ϑ̂n is said to converge in mean square to ϑ, which implies that ϑ̂n is consistent.¹ Why?
¹A sequence of random variables zn converges in mean square to a constant c if
lim (n→∞) E[(zn − c)²] = 0
Convergence in mean square implies convergence in probability.
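The decomposition MSE = bias² + variance can be checked by Monte Carlo (a seeded sketch; the deliberately biased estimator X̄ + 1/n and all constants are illustrative choices):

```python
import random

random.seed(5)
theta, sigma, n, reps = 0.0, 1.0, 10, 50_000  # true mean, std. dev., sizes

est = []
for _ in range(reps):
    xbar = sum(random.gauss(theta, sigma) for _ in range(n)) / n
    est.append(xbar + 1 / n)       # estimator biased by exactly 1/n

m = sum(est) / reps
bias = m - theta                                 # close to 1/n = 0.1
var = sum((e - m) ** 2 for e in est) / reps      # close to sigma^2/n = 0.1
mse = sum((e - theta) ** 2 for e in est) / reps
print(bias, var, mse, bias ** 2 + var)
```

With these sample-moment definitions the identity mse = bias² + var holds exactly, not just approximately, which mirrors the population decomposition above.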
Properties of X̄
Example: What are the properties of the following three estimators of µX = E(X)?
a) ϑ̂a = (1/n) Σ(i=1..n) Xi
b) ϑ̂b = X1
c) ϑ̂c = ϑ̂a + 1/n
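A seeded Monte Carlo sketch of the three candidates (the exercise asks for an analytical answer; µ, σ, n are illustrative): a) is unbiased with variance σ²/n, b) is unbiased but has variance σ² and so is not consistent, c) carries a bias of exactly 1/n:

```python
import random

random.seed(11)
mu, sigma, n, reps = 1.0, 2.0, 20, 20_000

a_vals, b_vals, c_vals = [], [], []
for _ in range(reps):
    x = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(x) / n
    a_vals.append(xbar)          # sample mean
    b_vals.append(x[0])          # first observation only
    c_vals.append(xbar + 1 / n)  # sample mean shifted by 1/n

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((e - m) ** 2 for e in v) / len(v)

print(mean(a_vals), mean(b_vals), mean(c_vals))  # near 1.0, 1.0, 1.05
print(var(a_vals), var(b_vals))                  # near 0.2 vs. near 4.0
```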
Efficiency of X̄
Let µ̂X be an estimator of µX that is a linear function of X1, . . . , Xn, that is, µ̂X = Σ(i=1..n) ai Xi, where a1, . . . , an are nonrandom constants. If µ̂X is unbiased, then Var(X̄) < Var(µ̂X) unless µ̂X = X̄. Thus X̄ is the Best Linear Unbiased Estimator (BLUE); that is, X̄ is the most efficient estimator of µX among all unbiased estimators that are linear in the Xi’s. (Proof?)
Another motivation: for which choice of m is
Σ(i=1..n) (Xi − m)²
minimized, i.e. what is the best predictor of Xi in a mean square error sense? Again, this is X̄! (see Appendix 3.2)
Estimating the variance and covariance
The variance σ²X can be estimated by the sample variance:
s²X = (1/(n − 1)) Σ(i=1..n) (Xi − X̄)²
The sample variance is an unbiased estimator of the population variance (this is to be shown in E3.18). The sample variance is also consistent (see Appendix 3.3). Is sX = √(s²X) a consistent estimator of σX?
Similarly, we can show that the sample covariance
sXY = (1/(n − 1)) Σ(i=1..n) (Xi − X̄)(Yi − Ȳ)
is unbiased and consistent (see E3.20).
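The role of the n − 1 divisor can be seen in a seeded simulation sketch (σ² = 4 and n = 5 are illustrative): dividing by n − 1 gives an approximately unbiased variance estimate, while dividing by n is biased downward by the factor (n − 1)/n:

```python
import random

random.seed(2)
sigma2, n, reps = 4.0, 5, 100_000

s2_unbiased, s2_biased = 0.0, 0.0
for _ in range(reps):
    x = [random.gauss(0, 2) for _ in range(n)]
    xbar = sum(x) / n
    ss = sum((xi - xbar) ** 2 for xi in x)  # sum of squared deviations
    s2_unbiased += ss / (n - 1)
    s2_biased += ss / n
s2_unbiased /= reps
s2_biased /= reps
print(s2_unbiased, s2_biased)  # near 4.0 vs. near 3.2 = 4 * (n-1)/n
```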
Example E3.2: Let Y be a Bernoulli random variable with success probability P(Y = 1) = p, and let Y1, . . . , Yn be i.i.d. draws from this distribution. Let p̂ be the fraction of successes (1s) in this sample.
1. Show that p̂ = Ȳ.
2. Show that p̂ is an unbiased estimator of p.
3. Show that Var(p̂) = p(1 − p)/n.
Example E3.18: This exercise shows that the sample variance is an unbiased estimator of the population variance when Y1, . . . , Yn are i.i.d. with mean µY and variance σ²Y.
1. Use the equation Var(aX + bY) = a²σ²X + 2abσXY + b²σ²Y to show that E[(Yi − Ȳ)²] = Var(Yi) − 2Cov(Yi, Ȳ) + Var(Ȳ).
2. Use the equation Cov(a + bX + cV, Y) = bσXY + cσVY to show that Cov(Ȳ, Yi) = σ²Y/n.
3. Use the results in 1. and 2. to show that E[s²Y] = σ²Y.
Example E3.19:
1. Ȳ is an unbiased estimator of µY. Is Ȳ² an unbiased estimator of µ²Y?
2. Ȳ is a consistent estimator of µY. Is Ȳ² a consistent estimator of µ²Y?
2.2 Hypothesis tests concerning the population mean
We need to introduce some more distributions
Gauss statistic
Consider the sequence X1, . . . , Xn of random variables with Xi i.i.d. ∼ N(µX, σ²X). X̄n = (1/n) Σ(i=1..n) Xi is an estimator of µX with
X̄n ∼ N(µX, σ²X/n) and Z = (X̄n − µX)/(σX/√n) ∼ N(0, 1).
Z is called the Gauss statistic.
χ²-distribution
Consider the sequence X1, . . . , Xn of random variables with Xi i.i.d. ∼ N(0, 1). Then
Y = Σ(i=1..n) X²i ∼ χ²(n).
We say that Y is χ²-distributed with n degrees of freedom. What are the mean and variance of Y?
Later on we will use that (n − 1)s²X/σ²X ∼ χ²(n − 1). (To get some intuition see E2.24)
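A seeded simulation sketch of the question on the slide (n = 5 is an illustrative choice): a χ²(n) variable, built as a sum of n squared standard normals, has mean n and variance 2n:

```python
import random

random.seed(13)
n, reps = 5, 100_000

ys = []
for _ in range(reps):
    ys.append(sum(random.gauss(0, 1) ** 2 for _ in range(n)))  # chi^2(n) draw

m = sum(ys) / reps
v = sum((y - m) ** 2 for y in ys) / reps
print(m, v)  # near n = 5 and near 2n = 10
```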
t-distribution
Consider the random variables X ∼ N(0, 1) and Y ∼ χ²(n), with X and Y independent. Then the distribution of the random variable T with
T = X / √(Y/n)
is called the t-distribution with n degrees of freedom.
Null and alternative hypothesis
The starting point of statistical hypothesis testing is specifying the hypothesis to be tested, called the null hypothesis. Hypothesis testing entails using data to compare the null hypothesis to a second hypothesis, called the alternative hypothesis, that holds if the null does not.
We are interested in testing the hypothesis that the population mean µX takes on a specific value, µX,0:
H0 : µX = µX,0
The most general alternative hypothesis is
H1 : µX ≠ µX,0.
Because under the alternative µX can be either less than or greater than µX,0, it is called a two-sided alternative.
The problem facing the statistician is to use the evidence in a randomly selected sample of data to decide whether to “accept” the null hypothesis or to reject it in favor of the alternative hypothesis.
In any given sample, the sample average X̄ will rarely be exactly equal to the hypothesized value µX,0. Differences between X̄ and µX,0 can arise because the true mean in fact does not equal µX,0 (the null hypothesis is false) or because the true mean equals µX,0 (the null hypothesis is true) but X̄ differs from µX,0 because of random sampling.
When we undertake a statistical test, we can make two types of mistakes: we can incorrectly reject the null hypothesis when it is true (type I error), or we can fail to reject the null hypothesis when it is false (type II error).
How to proceed:
Specify the null and alternative hypothesis
Prespecify the probability of making a type I error, i.e. the significancelevel α:
α = P(rejecting H0|H0 is true).
Typically, we choose α = 0.05.
Derive a test statistic T = T(X1, . . . ,Xn) and its distribution under H0.
Determine a certain critical value such that the null hypothesis will berejected, if the test statistic exceeds this value. The set of values of thetest statistic for which the test rejects the null hypothesis is the rejectionregion (R), and the values of the test statistic for which it does not rejectthe null hypothesis is the acceptance region (A).
The null hypothesis is rejected if tact = T(x1, . . . , xn) ∈ R.
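As a minimal sketch, the whole procedure for a two-sided test with known σX can be written in a few lines of Python (function and variable names are illustrative, not from the course):

```python
import math

def two_sided_z_test(sample, mu_0, sigma_x, z_crit=1.96):
    """Test H0: mu_X = mu_0 against H1: mu_X != mu_0 when sigma_X is known.
    Rejects H0 iff |t_act| > z_crit (1.96 corresponds to alpha = 0.05)."""
    n = len(sample)
    xbar = sum(sample) / n
    # Test statistic T = sqrt(n) * (xbar - mu_0) / sigma_X
    t_act = math.sqrt(n) * (xbar - mu_0) / sigma_x
    return t_act, abs(t_act) > z_crit
```

The function returns both the realized test statistic tact and the reject/do-not-reject decision.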
The t-Statistic (two-sided test)
X1, ...,Xn are i.i.d. N (µX , σ2X).
Case 1: µX unknown, σ2X known.
1 H0 : µX = µX,0 against H1 : µX ≠ µX,0
2 e.g. α = 5%
3 If H0 is true, then
T = √n (X − µX,0)/σX ∼ N(0, 1)
4 Determine the critical value:
α = P(|T| > z1−α/2)
5 Reject H0 ⇔ |tact| > z1−α/2. (For α = 0.05 the critical value is z0.975 = 1.96.)
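The critical value z0.975 = 1.96 can be checked numerically by inverting the standard normal CDF Φ, for instance by bisection (a sketch; any root finder would do):

```python
import math

def phi(z):
    # Standard normal CDF, computed from the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def z_quantile(p, lo=-10.0, hi=10.0, tol=1e-10):
    """Find z with phi(z) = p by bisection (phi is strictly increasing)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

`z_quantile(0.975)` returns approximately 1.96, the two-sided 5% critical value.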
Case 2: µX unknown, σ2X unknown.
What happens if we replace σX by sX in the test statistic?
√n (X − µX,0)/sX = [√n (X − µX,0)/σX] / √[((n − 1)s²X/σ²X)/(n − 1)] ∼ t(n − 1)

since the numerator is N(0, 1), the denominator is the square root of a χ²(n − 1) random variable divided by its degrees of freedom, and X and s²X are independently distributed.
Now: X1, . . . , Xn are i.i.d., E[X⁴i] < ∞, but the distribution is unknown
Show that in this case:
T = √n (X − µX,0)/sX →d N(0, 1)
Thus, as long as the sample size is large, the distribution of the test statistic iswell approximated by the standard normal distribution.
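A small Monte Carlo simulation illustrates this approximation (a sketch with arbitrary choices of distribution, sample size, number of replications, and seed):

```python
import math
import random

random.seed(42)

def t_stat(sample, mu_0):
    """T = sqrt(n) * (xbar - mu_0) / s_X with the sample standard deviation."""
    n = len(sample)
    xbar = sum(sample) / n
    s2 = sum((x - xbar) ** 2 for x in sample) / (n - 1)
    return math.sqrt(n) * (xbar - mu_0) / math.sqrt(s2)

# Draw from a skewed, clearly non-normal distribution (exponential, mean 1)
# and record how often |T| > 1.96 when H0 is true.
n, reps = 500, 2000
rejections = sum(
    abs(t_stat([random.expovariate(1.0) for _ in range(n)], 1.0)) > 1.96
    for _ in range(reps)
)
rejection_rate = rejections / reps  # should be close to alpha = 0.05
```

Despite the non-normal population, the empirical rejection rate under the null is close to the nominal 5%.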
The p-Value
The p-value is the probability of drawing a statistic at least as adverse to thenull hypothesis as the one you actually computed in your sample, assumingthe null hypothesis is correct:
p-value = P(|(X − µX,0)/SE(X)| > |(Xact − µX,0)/SE(X)|) = P(|T| > |tact|) = 2(1 − Φ(|tact|))

with SE(X) = sX/√n being the standard error of X.
For a prespecified α, reject the null hypothesis if p < α, otherwise do notreject.
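In code, the two-sided p-value is a one-liner once Φ is available (a sketch using Python's math.erf):

```python
import math

def p_value_two_sided(t_act):
    """p = 2 * (1 - Phi(|t_act|)), Phi being the standard normal CDF."""
    phi = 0.5 * (1.0 + math.erf(abs(t_act) / math.sqrt(2.0)))
    return 2.0 * (1.0 - phi)
```

`p_value_two_sided(1.96)` is approximately 0.05, so for α = 0.05 the rejection rule p < α agrees with |tact| > 1.96.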
Sometimes we are interested in a one-sided alternative hypothesis, which can be written as
H1 : µX > µX,0
or
H1 : µX < µX,0.
In this case the p-value is given by
p-value = 1 − Φ(tact) for H1 : µX > µX,0, and p-value = Φ(tact) for H1 : µX < µX,0.
The N(0, 1) critical value for a one-sided test with α = 0.05 is 1.64 and −1.64, respectively.
2 Review of Statistics
2.3 Confidence intervals for the population mean
Because of random sampling error, it is impossible to learn the exact value of the population mean of X using only the information in the sample. However, it is possible to use data from a random sample to construct a set of values that contains the true population mean µX with a certain prespecified probability. Such a set is called a confidence interval and the probability 1 − α is called the confidence level.
E.g., for α = 0.05 the 95% confidence interval for the mean corresponds to the set of values for which the null hypothesis cannot be rejected:
[X − 1.96SE(X);X + 1.96SE(X)]
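A minimal sketch of this construction (1.96 is the α = 0.05 critical value; names are illustrative):

```python
import math

def confidence_interval_95(sample):
    """Return [xbar - 1.96*SE, xbar + 1.96*SE] with SE = s_X / sqrt(n)."""
    n = len(sample)
    xbar = sum(sample) / n
    s2 = sum((x - xbar) ** 2 for x in sample) / (n - 1)
    se = math.sqrt(s2 / n)
    return xbar - 1.96 * se, xbar + 1.96 * se
```

The interval is centered at the sample average and widens with the standard error.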
Example:
Assume that the body height X of the participants of a statistics lecture is normally distributed with σX = 10. The average height in a random sample of size n = 25 equals X = 183 cm. Test the hypothesis H0 : µX = 190 against the alternative H1 : µX ≠ 190 at the 5% significance level. What is the p-value of the test?
Now, assume that the standard deviation is unknown. However, you know that sX = 10 cm.
Finally, drop the assumption that X is normally distributed. Now, n = 50 and sX = 10 cm. How would you proceed?
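For the first part of the example (σX known), the numbers can be checked directly (a sketch; Φ is computed from math.erf):

```python
import math

n, xbar, sigma_x, mu_0 = 25, 183.0, 10.0, 190.0
t_act = math.sqrt(n) * (xbar - mu_0) / sigma_x      # = 5 * (-7) / 10 = -3.5
reject = abs(t_act) > 1.96                          # |t_act| = 3.5 > 1.96: reject H0
phi = 0.5 * (1.0 + math.erf(abs(t_act) / math.sqrt(2.0)))
p_value = 2.0 * (1.0 - phi)                         # well below 0.05
```

So H0 : µX = 190 is rejected at the 5% level, with a p-value far below the significance level.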
Example E2.24: Suppose Yi is distributed i.i.d. N(0, σ²Y) for i = 1, 2, . . . , n.
1 Show that E[Y²i /σ²Y] = 1.
2 Show that W = (1/σ²Y) ∑ni=1 Y²i is distributed χ²(n).
3 Show that E[W] = n. [Hint: Use your answer to 1.]
4 Show that
V = Y1 / √(∑ni=2 Y²i /(n − 1))
is distributed t(n − 1).
3 Matrix Algebra
3 Matrix Algebra
3.1 Basic principles
3.2 Multivariate statistics
3 Matrix Algebra
3.1 Basic principles
Basic principles
A matrix A is an n × K rectangular array of numbers, written as
A = ( a1,1 a1,2 · · · a1,K )
    ( a2,1 a2,2 · · · a2,K )
    (  ...   ...  . . .  ... )
    ( an,1 an,2 · · · an,K )
= (ai,j)i=1,...,n, j=1,...,K.
The transpose of a matrix, denoted A′, is obtained by flipping the matrix on its diagonal: the (i, j) element of A′ is aj,i, so A′ is a K × n matrix. Thus
A′ = (aj,i)i=1,...,K, j=1,...,n.
Example:
A = ( 1  2 3 )
    ( 0 −6 7 )

A′ = ( 1  0 )
     ( 2 −6 )
     ( 3  7 )
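With matrices stored as lists of rows, the transpose is a short function (a sketch reproducing the example above):

```python
def transpose(a):
    """Flip a matrix on its diagonal: entry (i, j) of the result is a[j][i]."""
    return [[a[i][j] for i in range(len(a))] for j in range(len(a[0]))]
```

Applying it twice returns the original matrix, since (A′)′ = A.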
Special matrices
A matrix A is
square if n = K.
symmetric if A = A′ which requires ai,j = aj,i.
diagonal if the off-diagonal elements are all zero, so that ai,j = 0 if i ≠ j.
upper (lower) triangular if all elements below (above) the diagonal equal zero.
An important diagonal matrix is the identity matrix, which has ones on the diagonal. The k × k identity matrix is denoted as
Ik = ( 1 0 · · · 0 )
     ( 0 1 · · · 0 )
     ( ... ... . . . ... )
     ( 0 0 · · · 1 ).
Basic operations
Matrix addition
A = (ai,j)i=1,...,n,j=1,...,K,B = (bi,j)i=1,...,n,j=1,...,K.
C = A + B = (ci,j)i=1,...,n,j=1,...,K = (ai,j + bi,j)i=1,...,n,j=1,...,K.
Example:
( 1 3 1 )   ( 0 0 5 )   ( 1+0 3+0 1+5 )   ( 1 3 6 )
( 1 0 0 ) + ( 7 5 0 ) = ( 1+7 0+5 0+0 ) = ( 8 5 0 )
Scalar multiplication
A = (ai,j)i=1,...,n,j=1,...,K, λ ∈ R.
λ · A = (λ · ai,j)i=1,...,n,j=1,...,K.
Example:
4 · ( 1  2 3 ) = ( 4   8 12 )
    ( 0 −6 7 )   ( 0 −24 28 )
Matrix multiplication
A = (ai,j)i=1,...,l,j=1,...,n,B = (bi,j)i=1,...,n,j=1,...,K.
C = A · B = (ci,j)i=1,...,l, j=1,...,K with ci,j = ∑nk=1 ai,k · bk,j
Example:
( 1  0 2 )   ( 3 1 )   ( 5 1 )
( −1 3 1 ) × ( 2 1 ) = ( 4 2 ).
             ( 1 0 )
with
5 = 1 · 3 + 0 · 2 + 2 · 1;
1 = 1 · 1 + 0 · 1 + 2 · 0;
4 = −1 · 3 + 3 · 2 + 1 · 1;
2 = −1 · 1 + 3 · 1 + 1 · 0.
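The definition translates directly into code (a sketch with matrices as lists of rows):

```python
def matmul(a, b):
    """Product of an l x n and an n x K matrix:
    c[i][j] = sum over k of a[i][k] * b[k][j]."""
    l, n, K = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(K)]
            for i in range(l)]
```

Running it on the matrices above reproduces the example's result.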
Properties
Matrix addition
i) A + B = B + A
ii) (A + B) + C = A + (B + C)
Matrix multiplication
i) (A · B) · C = A · (B · C)
ii) A · B ≠ B · A in general
Example:
( 1 2 ) · ( 0 1 ) = ( 0 1 ),   ( 0 1 ) · ( 1 2 ) = ( 3 4 ).
( 3 4 )   ( 0 0 )   ( 0 3 )    ( 0 0 )   ( 3 4 )   ( 0 0 )
iii) A · (B + C) = A · B + A · C and (B + C) · A = B · A + C · A
iv) Multiplication with identity matrix for any n × K matrix M:
M · IK = In · M = M.
v) A is idempotent if A · A = A.
vi) Transpose of a product: (A · B · C)′ = C′ · B′ · A′
Quadratic form
Consider a symmetric matrix A ∈ RK×K and a vector x ∈ RK×1. The expression
x′Ax = ∑Ki=1 ∑Kj=1 xi ai,j xj
is called a quadratic form.
Implications:
A is positive definite if x′Ax > 0 for all x ≠ 0.
A is negative definite if x′Ax < 0 for all x ≠ 0.
Problem: Show that the matrix
A = (  2 −1  0 )
    ( −1  2 −1 )
    (  0 −1  2 )
is positive definite.
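Evaluating x′Ax numerically makes the problem concrete (a sketch; a full proof would check, e.g., that all leading principal minors 2, 3, and 4 are positive):

```python
def quad_form(a, x):
    """x'Ax = sum over i and j of x_i * a_ij * x_j."""
    K = len(x)
    return sum(x[i] * a[i][j] * x[j] for i in range(K) for j in range(K))

# The matrix from the problem; x'Ax should be positive for every nonzero x.
A = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]
```

Spot-checking a few nonzero vectors gives strictly positive values, as the claim requires.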
Rank and inverse of a matrix
The rank of the n × K matrix (K ≤ n)
A = (a1, ..., aK)
is the number of linearly independent columns aj and is written as rank(A). A has full rank if rank(A) = K.
Properties:
A square k × k matrix A is said to be nonsingular if it has full rank, i.e., rank(A) = k. This means that there is no k × 1 vector c ≠ 0 such that Ac = 0.
If a square k × k matrix A is nonsingular, then there exists a unique k × k matrix A−1, called the inverse of A, which satisfies
AA−1 = A−1A = Ik.
If A is positive or negative definite, then A is nonsingular.
Problem: Compute the inverse of the matrix
A = [ 8 2 ]
    [ 2 1 ].
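For the 2 × 2 case the inverse has a closed form, A⁻¹ = (1/det A) [[d, −b], [−c, a]] for A = [[a, b], [c, d]], which is easy to sketch in code:

```python
def inverse_2x2(m):
    """Closed-form inverse of a 2x2 matrix [[a, b], [c, d]] via the adjugate."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]
```

For the problem's matrix, det A = 8 · 1 − 2 · 2 = 4, so the inverse has entries that are quarters of the adjugate's entries.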
Trace of a matrix
The trace of a k × k square matrix A is defined to be the sum of the elementson the main diagonal, i.e.,
tr(A) = ∑ki=1 ai,i
Properties for square matrices A and B and real λ are:
tr(λA) = λtr(A);
tr(A′) = tr(A);
tr(A + B) = tr(A) + tr(B);
tr(Ik) = k;
If A is an n × K matrix and B is a K × n matrix, then
tr(AB) = tr(BA).
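The last property, tr(AB) = tr(BA) for non-square A and B, can be checked with the matrices from the multiplication example (a sketch):

```python
def trace(a):
    """Sum of the main-diagonal elements of a square matrix."""
    return sum(a[i][i] for i in range(len(a)))

def matmul(a, b):
    """c[i][j] = sum over k of a[i][k] * b[k][j]."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

A = [[1, 0, 2], [-1, 3, 1]]   # 2 x 3
B = [[3, 1], [2, 1], [1, 0]]  # 3 x 2
```

AB is 2 × 2 and BA is 3 × 3, yet both products have the same trace.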