Introductory statistics
CM226: Machine Learning for Bioinformatics
Fall 2016
Sriram Sankararaman
Introductory statistics 1 / 49
Admin
• Course website: http://web.cs.ucla.edu/~sriram/courses/cm226.fall-2016/html/index.html
• Course format changes
• Homeworks: 50%
• Exam: 50%. Nov 16 (note date change)
Statistics
• Previous class: intro to genomics
• This class: intro to statistics
Statistics
• (X1, . . . , Xn) is the data, drawn from a distribution P
• θ is an unknown population parameter, i.e., a function of P
• Inference: learn θ from (X1, . . . , Xn)
• P links θ and the data
Tasks in statistical inference
Three broad and related inference tasks
• Estimation
• Hypothesis testing
• Prediction
For each of these tasks, we have procedures for finding an estimator or test or predictor, and procedures for evaluating them. We will focus on the first two tasks.
Outline
Point estimation
Hypothesis testing
Interval estimation
Point estimation
A statistic is a function of the data.
Goal of estimation: find a statistic t such that θ̂n = t(X1, . . . , Xn) is "close" to θ.
θ̂n is an estimator of θ.
• Find estimators
• Evaluate an estimator
Point estimation
Procedures to find estimators
• Maximum likelihood
• Method of moments
• Bayes
Maximum likelihood estimation
Given $X_1, \ldots, X_n \stackrel{iid}{\sim} P(x|\theta)$, the likelihood function
$$L(\theta|x) \equiv P(x_{1:n}|\theta) = \prod_{i=1}^n P(x_i|\theta)$$
A maximum likelihood estimator:
$$\hat{\theta}(x) \triangleq \arg\max_\theta L(\theta|x) = \arg\max_\theta LL(\theta|x)$$
where
$$LL(\theta|x) \triangleq \log L(\theta|x) = \sum_{i=1}^n \log P(x_i|\theta)$$
Maximum likelihood estimation
Given $X_1, \ldots, X_n \stackrel{iid}{\sim} P(x|\theta)$, the likelihood function
$$L(\theta|x) \equiv P(x_{1:n}|\theta) = \prod_{i=1}^n P(x_i|\theta)$$
For a differentiable log likelihood function, possible candidates for the MLE are the θ that solve
$$\frac{d}{d\theta} LL(\theta) = 0$$
Why only possible candidates?
• Not a sufficient condition for a maximum
• The maximum could also be on the boundary
• Often, no closed-form solution
Example 1: Bernoulli likelihood
Model
$$X_1, \ldots, X_n \stackrel{iid}{\sim} \mathrm{Ber}(p), \quad X_i \in \{0, 1\}$$
Likelihood
$$L(p) = \prod_{i=1}^n p^{x_i}(1 - p)^{1 - x_i}$$
Log likelihood
$$LL(p) = \sum_{i=1}^n x_i \log(p) + (1 - x_i)\log(1 - p)$$
Example 1: Bernoulli likelihood
Log likelihood
$$LL(p) = \sum_{i=1}^n x_i \log(p) + (1 - x_i)\log(1 - p)$$
MLE
$$\hat{p} = \bar{x}$$
Example 1: Bernoulli likelihood
5 coin flips, 2 heads
Likelihood
$$L(p) = p^2(1 - p)^3$$
$$LL(p) = 2\log(p) + 3\log(1 - p)$$
$$\frac{dLL}{dp}(p) = \frac{2}{p} - \frac{3}{1 - p}$$
MLE
$$\hat{p} = \frac{2}{5}$$
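The closed-form MLE can be checked numerically. A minimal sketch in Python (the grid search and the particular coin-flip data layout are illustrative, not from the slides):

```python
import math

def bernoulli_loglik(p, xs):
    # LL(p) = sum_i [ x_i log p + (1 - x_i) log(1 - p) ]
    return sum(x * math.log(p) + (1 - x) * math.log(1 - p) for x in xs)

xs = [1, 1, 0, 0, 0]            # 5 coin flips, 2 heads
p_hat = sum(xs) / len(xs)       # closed-form MLE: the sample mean, 2/5

# A grid search over (0, 1) agrees with the closed form,
# since LL is concave with its maximum at p = 2/5.
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=lambda p: bernoulli_loglik(p, xs))
```

Setting the derivative to zero and checking the grid maximum give the same answer here because the log likelihood is strictly concave on (0, 1).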
Example 2: Normal likelihood
Model
$$X_1, \ldots, X_n \stackrel{iid}{\sim} \mathcal{N}(\mu, \sigma^2)$$
Likelihood
$$L(\mu, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$
Log likelihood
$$LL(\mu, \sigma^2) = -\sum_{i=1}^n \frac{(x_i - \mu)^2}{2\sigma^2} - \frac{n}{2}\log(2\pi\sigma^2)$$
Example 2: Normal likelihood
Log likelihood
$$LL(\mu, \sigma^2) = -\sum_{i=1}^n \frac{(x_i - \mu)^2}{2\sigma^2} - \frac{n}{2}\log(2\pi\sigma^2)$$
MLE
$$\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2$$
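These estimators are easy to verify by simulation; a sketch (the sample size and true parameters are arbitrary choices):

```python
import random

random.seed(0)
n, mu, sigma = 10_000, 2.0, 3.0
xs = [random.gauss(mu, sigma) for _ in range(n)]

mu_hat = sum(xs) / n                              # MLE of mu: the sample mean
var_hat = sum((x - mu_hat) ** 2 for x in xs) / n  # MLE of sigma^2: note 1/n, not 1/(n-1)
```

With 10,000 draws, `mu_hat` should be close to 2 and `var_hat` close to 9.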
Evaluating estimators
• Finite sample
• Asymptotic
Evaluating estimators
Finite sample
Bias of an estimator:
$$\mathrm{bias}(\hat{\theta}_n) = E[\hat{\theta}_n] - \theta$$
An estimator is unbiased if bias = 0.
Variance of an estimator:
$$\mathrm{var}(\hat{\theta}_n) = E\left[\left(\hat{\theta}_n - E[\hat{\theta}_n]\right)^2\right]$$
Mean square error of an estimator:
$$\mathrm{mse}(\hat{\theta}_n) = E\left[\left(\hat{\theta}_n - \theta\right)^2\right]$$
The mse decomposes as $\mathrm{mse}(\hat{\theta}_n) = \mathrm{bias}(\hat{\theta}_n)^2 + \mathrm{var}(\hat{\theta}_n)$.
Evaluating estimators
Finite sample
Comparing estimators is not straightforward:
• It depends on the evaluation criterion
• Even with a single criterion, performance can depend on the true parameter.
Evaluating estimators
Finite sample
Normal likelihood
$$X_1, \ldots, X_n \stackrel{iid}{\sim} \mathcal{N}(\mu, \sigma^2)$$
• X1 is an unbiased estimator of µ.
• µ̂ = X̄ is also an unbiased estimator of µ.
• Var[X1] = σ² while Var[µ̂] = σ²/n.
Evaluating estimators
Finite sample
Normal likelihood
$$X_1, \ldots, X_n \stackrel{iid}{\sim} \mathcal{N}(\mu, \sigma^2)$$
• σ̂² is not an unbiased estimator of σ². Its bias is −σ²/n.
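The bias of σ̂² is easy to see in simulation; a sketch with σ² = 1 and n = 5, where E[σ̂²] should be about (n − 1)/n = 0.8 (sample size and number of trials are arbitrary):

```python
import random

random.seed(1)
n, trials = 5, 100_000
total = 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]   # true sigma^2 = 1
    xbar = sum(xs) / n
    total += sum((x - xbar) ** 2 for x in xs) / n     # MLE variance estimate
mean_est = total / trials
# E[sigma_hat^2] = sigma^2 (n - 1) / n = 0.8, so the bias is -sigma^2 / n = -0.2
```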
Evaluating estimators
Asymptotics
• Challenging to compute finite-sample properties.
• What happens when sample size goes to infinity?
Asymptotics
Theorem (Weak Law of Large Numbers). Let X1, . . . , Xn be iid random variables with E[Xi] = µ and Var[Xi] = σ² < ∞. Then $\bar{X}_n \stackrel{p}{\to} \mu$.
Theorem (Central Limit Theorem). Let X1, . . . , Xn be iid random variables with E[Xi] = µ and Var[Xi] = σ² < ∞. Then
$$\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \stackrel{d}{\to} \mathcal{N}(0, 1)$$
We will informally write $\bar{X}_n \sim \mathcal{AN}(\mu, \frac{\sigma^2}{n})$.
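The CLT can be checked by simulation: standardized sample means of, say, uniform draws should fall within ±1.96 about 95% of the time. A sketch (the distribution, n, and trial count are arbitrary choices):

```python
import math
import random

random.seed(2)
n, trials = 50, 40_000
mu, sigma = 0.5, math.sqrt(1 / 12)    # mean and sd of Uniform(0, 1)

inside = 0
for _ in range(trials):
    xbar = sum(random.random() for _ in range(n)) / n
    z = math.sqrt(n) * (xbar - mu) / sigma
    inside += abs(z) <= 1.96          # 1.96 is the 0.975 quantile of N(0, 1)
frac = inside / trials                # should be close to 0.95
```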
Evaluating estimators
Asymptotics
Consistency: a sequence of estimators θ̂n is a consistent sequence of estimators of a parameter θ if $\hat{\theta}_n \stackrel{p}{\to} \theta$.
Normal likelihood: µ̂n = x̄n is a consistent estimator of µ, and σ̂²n is a consistent estimator of σ².
Evaluating estimators
Asymptotics
Theorem (Consistency of MLE). Let $X_1, \ldots, X_n \stackrel{iid}{\sim} f(x|\theta)$ and let θ̂n denote the MLE of θ. Then $\hat{\theta}_n \stackrel{p}{\to} \theta$.
If the bias of θ̂n → 0 and its variance → 0, i.e., its MSE goes to 0, then it is consistent.
Evaluating estimators
Asymptotics
Theorem (Asymptotic Normality of MLE). Under "regularity conditions",
$$\sqrt{n}(\hat{\theta}_n - \theta) \stackrel{d}{\to} \mathcal{N}(0, I^{-1}(\theta))$$
Here
$$I(\theta) = -E\left[\frac{\partial^2 \log P(X|\theta)}{\partial \theta^2}\right]$$
is the Fisher information associated with the model. What is Fisher information?
Fisher information
Normal model
Focus on the mean. For a single observation X ∼ N(µ, σ²),
$$\frac{\partial \log P(X|\theta)}{\partial \mu} = \frac{x - \mu}{\sigma^2}$$
$$\frac{\partial^2 \log P(X|\theta)}{\partial \mu^2} = -\frac{1}{\sigma^2}$$
$$I(\mu) = \frac{1}{\sigma^2}$$
Makes intuitive sense:
• The inverse Fisher information is the variance.
• The smaller the variance, the larger the Fisher information, and the more precise the estimates of the mean.
Fisher information
Normal model
[Figure: log likelihood LL plotted against µ for several values of σ²]
Fisher information
Bernoulli model
$$\frac{\partial \log P(X|p)}{\partial p} = \frac{X}{p} - \frac{1 - X}{1 - p}$$
$$\frac{\partial^2 \log P(X|p)}{\partial p^2} = -\frac{X}{p^2} - \frac{1 - X}{(1 - p)^2}$$
$$I(p) = \frac{1}{p(1 - p)}$$
• Again, the inverse Fisher information is the variance.
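Numerically, the variance of the MLE p̂ matches the inverse Fisher information scaled by n: Var[p̂] = p(1 − p)/n = 1/(n I(p)). A sketch (the values of n, p, and the trial count are arbitrary):

```python
import random

random.seed(3)
n, p, trials = 50, 0.3, 50_000
# each estimate is the fraction of successes in n Bernoulli(p) draws
ests = [sum(random.random() < p for _ in range(n)) / n for _ in range(trials)]
mean = sum(ests) / trials
var = sum((e - mean) ** 2 for e in ests) / trials
# 1 / (n I(p)) = p (1 - p) / n = 0.0042; var should be close to this
```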
Outline
Point estimation
Hypothesis testing
Interval estimation
Hypothesis testing
Hypotheses
• An offspring has an equal chance of inheriting either of two alleles from a parent (Mendel's first law holds)
• Two genes segregate independently (Mendel's second law holds)
• A genetic mutation has no effect on disease
• A drug has no effect on blood pressure
Null hypothesis: H0 : θ ∈ Θ0
Alternative hypothesis: H1 : θ ∈ Θ1
Hypothesis testing
Hypotheses
Mendel's first law / Test parameter of Bernoulli
• We observe n gametes from a diploid individual.
• The probability of transmitting the maternal allele is p.
• We can tell for each gamete i whether it carries the paternal or the maternal allele: Xi ∼ Ber(p).
• After observing data from n gametes, accept or reject H0.
$$H_0 : p = \frac{1}{2}, \qquad H_1 : p \neq \frac{1}{2}$$
Hypothesis test
A rule that specifies the values of the sample for which H0 should be accepted or rejected: specify a test statistic t(X1, . . . , Xn) and a rejection/acceptance region.
Mendel's first law. One test:
Statistic: t(X1, . . . , Xn) = X̄.
Reject H0 if X̄ > 0.9 or X̄ < 0.1.
Rejection region: {(x1, . . . , xn) : x̄ > 0.9 or x̄ < 0.1}.
Hypothesis testing
Procedures to find tests
• Likelihood Ratio Test (LRT)
• Wald test
• Score test
• Bayesian test
Hypothesis testing
Likelihood Ratio Test
LRT statistic
$$\lambda(x) = \frac{\sup_{\theta \in \Theta_0} L(\theta|x)}{\sup_{\theta \in \Theta} L(\theta|x)} = \frac{L(\hat{\theta}_0|x)}{L(\hat{\theta}|x)}, \qquad \Theta = \Theta_0 \cup \Theta_1$$
Reject H0 if λ(x) < c.
Hypothesis testing
Likelihood Ratio Test example 1
Test mean of normal model: given $X_1, \ldots, X_n \stackrel{iid}{\sim} \mathcal{N}(\mu, 1)$, H0 : µ = 0 versus H1 : µ ≠ 0.
Hypothesis testing
Likelihood Ratio Test example 1
Test mean of normal model
$$\lambda(x) = \frac{\exp\left(-\sum_{i=1}^n \frac{x_i^2}{2}\right)}{\exp\left(-\sum_{i=1}^n \frac{(x_i - \hat{\mu})^2}{2}\right)} = \exp\left(-\frac{n\bar{x}^2}{2}\right)$$
Reject H0 if λ(x) < c, equivalent to: reject H0 if $|\bar{x}| > \sqrt{\frac{2\log(1/c)}{n}}$.
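The algebraic equivalence of the two forms of the rejection rule can be sanity-checked directly; a sketch with arbitrary choices of n and c:

```python
import math

def lrt_lambda(xbar, n):
    # LRT statistic for H0: mu = 0 in N(mu, 1): lambda = exp(-n xbar^2 / 2)
    return math.exp(-n * xbar ** 2 / 2)

n, c = 25, 0.1
threshold = math.sqrt(2 * math.log(1 / c) / n)   # reject H0 if |xbar| > threshold

# the two forms of the rejection rule agree for any sample mean
for xbar in [-0.6, -0.2, 0.0, 0.05, 0.2, 0.4, 0.6]:
    assert (lrt_lambda(xbar, n) < c) == (abs(xbar) > threshold)
```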
Hypothesis testing
Likelihood Ratio Test example 2
Mendel's first law
$$\lambda(x) = \frac{\left(\frac{1}{2}\right)^{\sum_i x_i} \left(\frac{1}{2}\right)^{n - \sum_i x_i}}{\hat{p}^{\sum_i x_i} (1 - \hat{p})^{n - \sum_i x_i}} = \frac{\left(\frac{1}{2}\right)^n}{\bar{x}^{n\bar{x}} (1 - \bar{x})^{n(1 - \bar{x})}} = 2^{nH(\bar{x}) - n}$$
Here $H(x) = -(x \log_2(x) + (1 - x)\log_2(1 - x))$, $x \in [0, 1]$, is the (binary) entropy function.
Reject H0 if λ(x) < c, equivalent to: reject H0 if $H(\bar{x}) < 1 - \frac{\log_2(1/c)}{n}$.
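The closed form λ = 2^{n(H(x̄) − 1)} can be checked against the ratio of likelihoods directly; a sketch with hypothetical counts (14 maternal alleles in n = 20 gametes; the numbers are illustrative):

```python
import math

def entropy2(q):
    # binary entropy in bits: H(q) = -(q log2 q + (1 - q) log2(1 - q))
    if q in (0.0, 1.0):
        return 0.0
    return -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

n, k = 20, 14                  # hypothetical: 14 maternal alleles in 20 gametes
p_hat = k / n

direct = 0.5 ** n / (p_hat ** k * (1 - p_hat) ** (n - k))   # L(1/2) / L(p_hat)
closed = 2 ** (n * (entropy2(p_hat) - 1))                   # 2^{n H(xbar) - n}
```

The two computations agree to floating-point precision.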
Hypothesis testing
Likelihood Ratio Test example 2
Mendel's first law: how do we choose c?
Hypothesis testing
Evaluating tests

              Decision
Truth     H0      H1
H0        TN      FP
H1        FN      TP

FP: False positive (type-I error)
FN: False negative (type-II error)
TP: True positive
TN: True negative
For θ ∈ Θ0, the probability of a false positive for a test is Pθ(Test rejects). For θ ∈ Θ1, the probability of a false negative is Pθ(Test accepts).
Hypothesis testing
Evaluating tests
To evaluate (or compare) tests, we need to trade off false positives and false negatives. For tests with false positive probability below some threshold, what is the false negative probability?
A test is a level α test if $\sup_{\theta \in \Theta_0} P_\theta(\text{Test rejects}) \le \alpha$.
• α is an upper bound on the false positive probability.
• Often we set α = 0.05.
Hypothesis testing
Level of LRT
• Choose c to control α: $\sup_{\theta \in \Theta_0} P_\theta(\lambda(X) \le c) = \alpha$.
• The trivial test is to never reject.
• We would like to maximize the probability of rejecting the null (i.e., the power, which we denote β(θ)) when it is false, while controlling the false positive probability.
Hypothesis testing
Normal mean
For the test of the normal mean, Θ0 = {0}. Under the null (θ = θ0), the test statistic $\bar{X} \sim \mathcal{N}(0, \frac{1}{n})$.
$$P_0(\lambda(X) \le c) = P_0\left(|\bar{X}| \ge \sqrt{\frac{2\log(1/c)}{n}}\right) = P\left(|Z| \ge \sqrt{2\log(1/c)}\right)$$
Here Z ∼ N(0, 1), and $P(|Z| \ge z_{1-\alpha/2}) = \alpha$ where $z_\alpha$ is the α-quantile of the standard normal distribution.
Normal mean
[Figure]
Hypothesis testing
p-value
• Level is a crude measure. If we reject the null hypothesis at level α = 0.05, it does not tell us how strong the evidence against H0 is.
• The p-value of a test is the smallest α at which the test rejects H0.
• Smaller p-values imply stronger evidence against H0.
For a statistic λ(X) such that small values of λ give evidence against H0, the p-value is
$$p(x) = \sup_{\theta \in \Theta_0} P_\theta(\lambda(X) \le \lambda(x))$$
Hypothesis testing
p-value for test of normal mean
• Observe 10 samples, mean is 0.8. Reject H0 at α = .05. p-value = 0.01.
• Observe 10 samples, mean is 2. Reject H0 at α = .05. p-value = 2 × 10⁻¹⁰.
• Observe 100 samples, mean is 0.8. Reject H0 at α = .05. p-value = 1 × 10⁻¹⁵.
LRT
• $\sup_{\theta \in \Theta_0} P_\theta(\lambda(X) \le c) = \alpha$.
• Choose c so that the false positive probability is bounded.
• To compute the false positive probability, we need to know how the statistic is distributed under the null (the sampling distribution of the statistic).
• Easy for the normal mean test: X̄.
• More complicated statistics? Two options: asymptotics and permutation tests.
Asymptotics of the LRT
Theorem. Let $X_1, \ldots, X_n \stackrel{iid}{\sim} p(x|\theta)$. For testing θ = θ0 versus θ ≠ θ0, under regularity conditions on p(x|θ),
$$-2\log\lambda(X^n) \stackrel{d}{\to} \chi^2_1$$
χ²₁ is the distribution of the square of a standard normal random variable.
For testing θ ∈ Θ0 versus θ ∉ Θ0, under regularity conditions on p(x|θ), if ν is the difference in the number of free parameters between θ ∈ Θ and θ ∈ Θ0:
$$-2\log\lambda(X^n) \stackrel{d}{\to} \chi^2_\nu$$
Rejecting H0 iff $-2\log\lambda(X) \ge \chi^2_{\nu, 1-\alpha}$ gives us an asymptotic (hence approximate) level α test.
Asymptotics of the LRT
Application to Bernoulli model
Using the asymptotic distribution of the LRT, the p-value for the LRT statistic:
$$p(x) = P(\lambda(X) \le \lambda(x)) = P(-2\log\lambda(X) \ge -2\log\lambda(x)) \approx P(\chi^2_1 \ge -2\log\lambda(x))$$
Rejecting H0 iff p(x) ≤ α gives an approximate level α test.
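A sketch of this computation for the Mendel example, with hypothetical counts (14 maternal alleles out of n = 20; the numbers are illustrative). It uses the fact that χ²₁ is a squared standard normal, so its survival function is erfc(√(t/2)):

```python
import math

def chi2_1_sf(t):
    # P(chi^2_1 >= t): chi^2_1 is a squared standard normal,
    # so P(Z^2 >= t) = P(|Z| >= sqrt(t)) = erfc(sqrt(t / 2))
    return math.erfc(math.sqrt(t / 2))

n, k = 20, 14                      # hypothetical data; H0: p = 1/2
p_hat = k / n
log_lam = n * math.log(0.5) - (k * math.log(p_hat) + (n - k) * math.log(1 - p_hat))
stat = -2 * log_lam                # -2 log lambda, asymptotically chi^2_1 under H0
p_value = chi2_1_sf(stat)          # approximate p-value
```

Here the statistic is about 3.29, giving an approximate p-value near 0.07: not significant at α = 0.05.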
Permutation tests
• In many instances, even asymptotic results are difficult. It is also not clear whether the sample size is big enough for the asymptotics to apply.
• Key idea: we don't know the sampling distribution, but we can sample from this distribution.
Permutation test example
Test if the distribution of two alleles is the same in disease vs healthy
$$X_1, \ldots, X_n \sim F$$
$$Y_1, \ldots, Y_n \sim G$$
$$H_0 : F = G = P_0$$
Statistic: $t((X_1, \ldots, X_n), (Y_1, \ldots, Y_n)) = \bar{X} - \bar{Y}$.
Key idea: under H0, all permutations of the data are equally likely. Use this idea to compute the sampling distribution.
Permutation test example
Test if the distribution of two alleles is the same in disease vs healthy
• Randomly choose n out of the 2n elements to belong to group 1; the remaining belong to group 2.
• Compute the difference in means between the two groups.
• How often does the difference in means on the permuted data exceed the observed difference?
In many cases, the number of permutations is too large to exhaustively enumerate. Then use a random sample of permutations (Monte Carlo approximation).
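The steps above can be sketched as a Monte Carlo permutation test; the allele counts for the two groups are made up for illustration:

```python
import random

def permutation_test(xs, ys, n_perm=10_000, seed=0):
    # Two-sided permutation p-value for H0: both samples come from the
    # same distribution, using the difference in means as the statistic.
    rng = random.Random(seed)
    observed = abs(sum(xs) / len(xs) - sum(ys) / len(ys))
    pooled = list(xs) + list(ys)
    n = len(xs)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)     # random relabeling of the pooled data
        diff = abs(sum(pooled[:n]) / n - sum(pooled[n:]) / len(ys))
        hits += diff >= observed
    return hits / n_perm

cases = [1, 2, 2, 1, 2, 2, 1, 2]       # hypothetical allele counts, disease group
controls = [0, 1, 0, 1, 1, 0, 1, 0]    # hypothetical allele counts, healthy group
p = permutation_test(cases, controls)
```

With these made-up data the observed difference in means is extreme among random relabelings, so the p-value comes out small.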
Outline
Point estimation
Hypothesis testing
Interval estimation
Interval estimation
Point estimates do not give you a measure of confidence.
For a scalar parameter θ, given data X, an interval estimator is a pair of functions L(x1, . . . , xn) and U(x1, . . . , xn), L ≤ U, such that [L(X), U(X)] covers θ. The interval is random while θ is fixed.
Evaluating interval estimators
The coverage probability of an interval estimator [L(X), U(X)] is the probability that this interval covers the true parameter θ, denoted Pθ(θ ∈ [L(X), U(X)]).
An interval estimator is termed a 1 − α confidence interval if $\inf_\theta P_\theta(\theta \in [L(X), U(X)]) \ge 1 - \alpha$.
Finding interval estimators
Inverting a test statistic
Given $X_1, \ldots, X_n \stackrel{iid}{\sim} \mathcal{N}(\mu, 1)$, H0 : µ = µ0 versus H1 : µ ≠ µ0.
One test is to reject H0 for $\{x : |\bar{x} - \mu_0| > \frac{z_{1-\alpha/2}}{\sqrt{n}}\}$. So H0 is accepted for $\{x : |\bar{x} - \mu_0| \le \frac{z_{1-\alpha/2}}{\sqrt{n}}\}$, or equivalently
$$\bar{x} - z_{1-\alpha/2}\frac{1}{\sqrt{n}} \le \mu_0 \le \bar{x} + z_{1-\alpha/2}\frac{1}{\sqrt{n}}$$
Since the test has size α, $P(H_0 \text{ is accepted} \mid \mu = \mu_0) = 1 - \alpha$, so
$$P\left(\bar{X} - z_{1-\alpha/2}\frac{1}{\sqrt{n}} \le \mu_0 \le \bar{X} + z_{1-\alpha/2}\frac{1}{\sqrt{n}} \,\middle|\, \mu = \mu_0\right) = 1 - \alpha$$
Since this is true for all µ0, we have
$$P_\mu\left(\bar{X} - z_{1-\alpha/2}\frac{1}{\sqrt{n}} \le \mu \le \bar{X} + z_{1-\alpha/2}\frac{1}{\sqrt{n}}\right) = 1 - \alpha$$
So $\left[\bar{X} - z_{1-\alpha/2}\frac{1}{\sqrt{n}},\; \bar{X} + z_{1-\alpha/2}\frac{1}{\sqrt{n}}\right]$ is a 1 − α confidence interval.
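The coverage claim can be checked by simulation; a sketch with known σ = 1 and arbitrary choices of µ, n, and trial count:

```python
import math
import random

def ci_95(xs):
    # invert the level-0.05 test: xbar +/- z_{0.975} / sqrt(n), z_{0.975} ~ 1.96
    n = len(xs)
    xbar = sum(xs) / n
    half = 1.959964 / math.sqrt(n)
    return xbar - half, xbar + half

random.seed(4)
mu, n, trials = 1.0, 20, 20_000
covered = 0
for _ in range(trials):
    lo, hi = ci_95([random.gauss(mu, 1.0) for _ in range(n)])
    covered += lo <= mu <= hi
coverage = covered / trials    # should be close to 0.95
```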
Finding interval estimators
Inverting a test statistic
• A general method
• Methods for constructing tests can be used to obtain confidence intervals
Summary
• Two major inferential tasks
• General procedures for each task, and procedures to evaluate their quality.
• Can evaluate finite sample and large sample properties.
• Estimation (point): maximum likelihood. Bias, variance, consistency.
• Testing: Likelihood Ratio Test. False positive and false negative probabilities (or power). Permutation is a useful tool in many settings.
• Interval estimation: can use tests to get at confidence intervals.