Page 1:

Introductory statistics
CM226: Machine Learning for Bioinformatics

Fall 2016

Sriram Sankararaman

Page 2:

Admin

• Course website: http://web.cs.ucla.edu/~sriram/courses/cm226.fall-2016/html/index.html
• Course format changes
• Homeworks: 50%
• Exam: 50%. Nov 16 (note date change)

Page 3:

Statistics

• Previous class: intro to genomics

• This class: intro to statistics

Page 4:

Statistics

• (X_1, …, X_n) is the data drawn from a distribution P.

• θ is an unknown population parameter, i.e., a function of P.

• Inference: learn θ from (X_1, …, X_n).

• P links θ and the data.

Page 5:

Tasks in statistical inference

Three broad and related inference tasks

• Estimation

• Hypothesis testing

• Prediction

For each of these tasks, we have procedures for finding an estimator or test or predictor, and procedures for evaluating them. We will focus on the first two tasks.

Page 6:

Outline

Point estimation

Hypothesis testing

Interval estimation

Page 7:

Point estimation

A statistic is a function of the data.

Goal of estimation: find a statistic t such that θ̂_n = t(X_1, …, X_n) is "close" to θ.

θ̂_n is an estimator of θ.

• Find estimators

• Evaluate an estimator

Page 8:

Point estimation: procedures to find estimators

• Maximum likelihood

• Method of moments

• Bayes

Page 9:

Maximum likelihood estimation

Given X_1, …, X_n iid ∼ P(x|θ), the likelihood function is

L(θ|x) ≡ P(x_{1:n}|θ) = ∏_{i=1}^n P(x_i|θ)

A maximum likelihood estimator:

θ̂(x) ≜ argmax_θ L(θ|x) = argmax_θ LL(θ|x)

where

LL(θ|x) ≜ log L(θ|x) = ∑_{i=1}^n log P(x_i|θ)

Page 10:

Maximum likelihood estimation

Given X_1, …, X_n iid ∼ P(x|θ), the likelihood function is

L(θ|x) ≡ P(x_{1:n}|θ) = ∏_{i=1}^n P(x_i|θ)

For a differentiable log likelihood function, possible candidates for the MLE are the θ that solve

(d/dθ) LL(θ) = 0

Why only possible candidates?

• Not a sufficient condition for a maximum

• Maximum could also be on the boundary

• Often, no closed-form solution

Page 11:

Example 1: Bernoulli likelihood

Model

X_1, …, X_n iid ∼ Ber(p),  X_i ∈ {0, 1}

Likelihood

L(p) = ∏_{i=1}^n p^{x_i} (1−p)^{1−x_i}

Log likelihood

LL(p) = ∑_{i=1}^n [x_i log(p) + (1−x_i) log(1−p)]

Page 12:

Example 1: Bernoulli likelihood

Log likelihood

LL(p) = ∑_{i=1}^n [x_i log(p) + (1−x_i) log(1−p)]

MLE

p̂ = x̄

Page 13:

Example 1: Bernoulli likelihood

5 coin flips, 2 heads

Likelihood

L(p) = p²(1−p)³

LL(p) = 2 log(p) + 3 log(1−p)

(dLL/dp)(p) = 2/p − 3/(1−p)

MLE

p̂ = 2/5
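As a numerical sanity check (an illustration, not part of the slides), the same MLE can be found by maximizing this log likelihood directly, here with scipy:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Log likelihood for 5 coin flips with 2 heads: LL(p) = 2 log p + 3 log(1 - p)
neg_ll = lambda p: -(2 * np.log(p) + 3 * np.log(1 - p))

# Maximize LL by minimizing -LL over the open interval (0, 1)
res = minimize_scalar(neg_ll, bounds=(1e-9, 1 - 1e-9), method="bounded")
print(res.x)  # ~0.4, matching the closed-form MLE 2/5
```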

Page 14:

Example 2: Normal likelihood

Model

X_1, …, X_n iid ∼ N(µ, σ²)

Likelihood

L(µ, σ²) = ∏_{i=1}^n (1/√(2πσ²)) exp(−(x_i − µ)²/(2σ²))

Log likelihood

LL(µ, σ²) = −∑_{i=1}^n (x_i − µ)²/(2σ²) − (n/2) log(2πσ²)

Page 15:

Example 2: Normal likelihood

Log likelihood

LL(µ, σ²) = −∑_{i=1}^n (x_i − µ)²/(2σ²) − (n/2) log(2πσ²)

MLE

µ̂ = x̄ = (1/n) ∑_{i=1}^n x_i

σ̂² = (1/n) ∑_{i=1}^n (x_i − x̄)²
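A quick numpy check (illustrative, not from the slides): np.mean and np.var with ddof=0 compute exactly these MLEs.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=1000)  # true µ = 2, σ² = 9

mu_hat = x.mean()              # MLE of µ: the sample mean
sigma2_hat = x.var(ddof=0)     # MLE of σ²: divides by n, not n - 1
print(mu_hat, sigma2_hat)
```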

Page 16:

Evaluating estimators

• Finite sample

• Asymptotic

Page 17:

Evaluating estimators: finite sample

Bias of an estimator

bias(θ̂_n) = E[θ̂_n] − θ

An estimator is unbiased if bias = 0.

Variance of an estimator

var(θ̂_n) = E[(θ̂_n − E[θ̂_n])²]

Mean square error of an estimator

mse(θ̂_n) = E[(θ̂_n − θ)²]

The mse decomposes as mse(θ̂_n) = bias(θ̂_n)² + var(θ̂_n).

Page 18:

Evaluating estimators: finite sample

Comparing estimators is not straightforward:

• It depends on the evaluation criterion.

• Even with a single criterion, performance can depend on the true parameter.

Page 19:

Evaluating estimators: finite sample

Normal likelihood

X_1, …, X_n iid ∼ N(µ, σ²)

• X_1 is an unbiased estimator of µ.

• µ̂ = X̄ is also an unbiased estimator of µ.

• Var[X_1] = σ², while Var[µ̂] = σ²/n.

Page 20:

Evaluating estimators: finite sample

Normal likelihood

X_1, …, X_n iid ∼ N(µ, σ²)

• σ̂² is not an unbiased estimator of σ². Its bias is −σ²/n.
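A small simulation (an illustrative sketch with made-up parameters) that checks both this bias and the decomposition mse = bias² + var from the earlier slide:

```python
import numpy as np

rng = np.random.default_rng(1)
n, mu, sigma2, reps = 10, 0.0, 4.0, 200_000

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
sigma2_hat = x.var(axis=1, ddof=0)           # MLE of σ² in each replicate

bias = sigma2_hat.mean() - sigma2            # ≈ -σ²/n = -0.4
var = sigma2_hat.var()
mse = ((sigma2_hat - sigma2) ** 2).mean()    # ≈ bias² + var
print(bias, mse, bias**2 + var)
```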

Page 21:

Evaluating estimators: asymptotics

• Challenging to compute finite-sample properties.

• What happens when sample size goes to infinity?

Page 22:

Asymptotics

Theorem (Weak Law of Large Numbers). Let X_1, …, X_n be iid random variables with E[X_i] = µ, Var[X_i] = σ² < ∞. Then X̄_n →p µ.

Theorem (Central Limit Theorem). Let X_1, …, X_n be iid random variables with E[X_i] = µ, Var[X_i] = σ² < ∞. Then

√n (X̄_n − µ)/σ →d N(0, 1)

We will informally write X̄_n ∼ AN(µ, σ²/n).
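A small simulation (illustrative, using exponential draws, which are far from normal) showing the CLT at work:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n, reps = 50, 100_000
mu, sigma = 1.0, 1.0                     # Exp(1) has mean 1 and variance 1

x = rng.exponential(scale=1.0, size=(reps, n))
z = np.sqrt(n) * (x.mean(axis=1) - mu) / sigma

# Tail probability of the standardized mean vs. the standard normal;
# the agreement improves as n grows
print((z > 1.96).mean(), norm.sf(1.96))
```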

Page 23:

Evaluating estimators: asymptotics

Consistency. A sequence of estimators θ̂_n is a consistent sequence of estimators of a parameter θ if θ̂_n →p θ.

Normal likelihood
µ̂_n = x̄_n is a consistent estimator of µ.
σ̂²_n is a consistent estimator of σ².

Page 24:

Evaluating estimators: asymptotics

Theorem (Consistency of MLE). Let X_1, …, X_n iid ∼ f(x|θ) and let θ̂_n denote the MLE of θ. Then

θ̂_n →p θ

If the bias of θ̂_n → 0 and its variance → 0, i.e., its MSE goes to 0, then it is consistent.

Page 25:

Evaluating estimators: asymptotics

Theorem (Asymptotic normality of MLE). Under "regularity conditions",

√n (θ̂_n − θ) →d N(0, I⁻¹(θ))

Here

I(θ) = −E[∂² log P(X|θ) / ∂θ²]

is the Fisher information associated with the model. What is Fisher information?

Page 26:

Fisher information: Normal model

Focus on the mean. For a single observation X ∼ N(µ, σ²):

∂ log P(X|θ)/∂µ = (x − µ)/σ²

∂² log P(X|θ)/∂µ² = −1/σ²

I(µ) = 1/σ²

Makes intuitive sense:

• The inverse Fisher information is the variance.

• The smaller the variance, the larger the Fisher information, and the more precise the estimates of the mean should be.

Page 27:

Fisher information: Normal model

[Figure: log likelihood LL as a function of µ for normal models with different values of σ²; the smaller σ², the more sharply curved the log likelihood around its maximum.]

Page 28:

Fisher information: Bernoulli model

∂ log P(X|p)/∂p = X/p − (1−X)/(1−p)

∂² log P(X|p)/∂p² = −X/p² − (1−X)/(1−p)²

I(p) = 1/(p(1−p))

• Again, the inverse Fisher information is the variance.
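A simulation sketch (illustrative, not from the slides): the variance of the Bernoulli MLE p̂ = x̄ is close to the inverse Fisher information divided by n, i.e. p(1−p)/n.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, reps = 100, 0.3, 100_000

x = rng.binomial(1, p, size=(reps, n))
p_hat = x.mean(axis=1)           # MLE in each replicate

print(p_hat.var())               # ≈ p(1-p)/n = 0.0021
print(p * (1 - p) / n)           # inverse Fisher information / n
```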

Page 29:

Outline

Point estimation

Hypothesis testing

Interval estimation

Page 30:

Hypothesis testing: hypotheses

• Offspring has an equal chance of inheriting either of two alleles from a parent (Mendel's first law holds)

• Two genes segregate independently (Mendel’s second law holds)

• A genetic mutation has no effect on disease

• A drug has no effect on blood pressure

Null hypothesis: H0: θ ∈ Θ0
Alternative hypothesis: H1: θ ∈ Θ1

Page 31:

Hypothesis testing: hypotheses

Mendel's first law: testing the parameter of a Bernoulli

• We observe n gametes from a diploid individual.

• Probability of transmitting maternal allele is p.

• We can tell for each gamete i if it is the paternal or the maternal allele: X_i ∼ Ber(p).

• After observing data from n gametes, accept or reject H0.

H0: p = 1/2
H1: p ≠ 1/2

Page 32:

Hypothesis test

A rule that specifies values of the sample for which H0 should be accepted or rejected. Specify a test statistic t(X_1, …, X_n) and a rejection/acceptance region.

Mendel's first law
One test:
Statistic: t(X_1, …, X_n) = X̄.
Reject H0 if X̄ > 0.9 or X̄ < 0.1.
Rejection region: {(x_1, …, x_n) : x̄ > 0.9 or x̄ < 0.1}.

Page 33:

Hypothesis testing: procedures to find tests

• Likelihood Ratio Test (LRT)

• Wald test

• Score test

• Bayesian test

Page 34:

Hypothesis testing: Likelihood Ratio Test

LRT statistic

λ(x) = sup_{θ∈Θ0} L(θ|x) / sup_{θ∈Θ} L(θ|x) = L(θ̂_0|x) / L(θ̂|x),   where Θ = Θ0 ∪ Θ1

Reject H0 if λ(x) < c.

Page 35:

Hypothesis testing: Likelihood Ratio Test example 1

Test mean of normal model
Given X_1, …, X_n iid ∼ N(µ, 1), test H0: µ = 0 versus H1: µ ≠ 0.

Page 36:

Hypothesis testing: Likelihood Ratio Test example 1

Test mean of normal model

λ(x) = exp(−∑_{i=1}^n x_i²/2) / exp(−∑_{i=1}^n (x_i − µ̂)²/2) = exp(−n x̄²/2)

Rejecting H0 if λ(x) < c is equivalent to rejecting H0 if |x̄| > √(2 log(1/c)/n).

Page 37:

Hypothesis testing: Likelihood Ratio Test example 2

Mendel's first law

λ(x) = (1/2)^{∑_i x_i} (1/2)^{n − ∑_i x_i} / [ p̂^{∑_i x_i} (1−p̂)^{n − ∑_i x_i} ]
     = (1/2)^n / [ x̄^{n x̄} (1−x̄)^{n(1−x̄)} ]
     = 2^{n H(x̄) − n}

Here H(x̄) = −(x̄ log₂(x̄) + (1−x̄) log₂(1−x̄)), x̄ ∈ [0, 1], is the (binary) entropy function.

Rejecting H0 if λ(x) < c is equivalent to rejecting H0 if H(x̄) < 1 − log₂(1/c)/n.
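A quick numerical check (illustrative, with hypothetical counts) that the entropy form agrees with computing the LRT statistic directly:

```python
import numpy as np

n, k = 100, 60                   # n gametes, k maternal transmissions (hypothetical)
xbar = k / n

# Direct LRT: null likelihood at p = 1/2 over maximized likelihood at p̂ = x̄
lam_direct = 0.5**n / (xbar**k * (1 - xbar)**(n - k))

# Entropy form: λ = 2^(n H(x̄) - n) with binary entropy H
H = -(xbar * np.log2(xbar) + (1 - xbar) * np.log2(1 - xbar))
lam_entropy = 2 ** (n * H - n)

print(lam_direct, lam_entropy)   # identical up to floating point
```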

Page 38:

Hypothesis testing: Likelihood Ratio Test example 2

Mendel's first law. How do we choose c?

Page 39:

Hypothesis testing: evaluating tests

            Decision
Truth    H0     H1
H0       TN     FP
H1       FN     TP

FP: false positive (type-I error)
FN: false negative (type-II error)
TP: true positive
TN: true negative

For θ ∈ Θ0, the probability of a false positive for a test is Pθ(test rejects).
For θ ∈ Θ1, the probability of a false negative is Pθ(test accepts).

Page 40:

Hypothesis testing: evaluating tests

To evaluate (or compare) tests, we need to trade off false positives and false negatives. For tests with false positive probability below some threshold, how small can the false negative probability be?

A test is a level-α test if sup_{θ∈Θ0} Pθ(test rejects) ≤ α.

• α is an upper bound on the false positive probability.

• Often we set α = 0.05.

Page 41:

Hypothesis testing: level of the LRT

• Choose c to control α: sup_{θ∈Θ0} Pθ(λ(X) ≤ c) = α.

• The trivial test is to never reject.

• We would like to maximize the probability of rejecting the null if it is false (i.e., the power, which we denote β(θ)) while controlling the false positive probability.

Page 42:

Hypothesis testing: normal mean

For the test of the normal mean, Θ0 = {0}. For θ = θ0 (under the null), the test statistic X̄ ∼ N(0, 1/n).

P0(λ(X) ≤ c) = P0(|X̄| ≥ √(2 log(1/c)/n))
             = P(|Z| ≥ √(2 log(1/c)))

Here Z ∼ N(0, 1), and P(|Z| ≥ z_{1−α/2}) = α, where z_α is the α-quantile of the standard normal distribution.

Page 43:

Normal mean

Page 44:

Hypothesis testing: p-value

• Level is a crude measure. If we reject the null hypothesis at level α = 0.05, it does not tell us how strong the evidence against H0 is.

• The p-value of a test is the smallest α at which the test rejects H0.

• Smaller p-values imply stronger evidence against H0.

For a statistic λ(X) such that small values of λ give evidence against H0, the p-value is

p(x) = sup_{θ∈Θ0} Pθ(λ(X) ≤ λ(x))

Page 45:

Hypothesis testing: p-value for the test of the normal mean

• Observe 10 samples with mean 0.8: reject H0 at α = 0.05; p-value = 0.01.

• Observe 10 samples with mean 2: reject H0 at α = 0.05; p-value = 2 × 10⁻¹⁰.

• Observe 100 samples with mean 0.8: reject H0 at α = 0.05; p-value = 1 × 10⁻¹⁵.
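These p-values follow from the null distribution X̄ ∼ N(0, 1/n); a minimal sketch (illustrative) reproducing them with scipy:

```python
import numpy as np
from scipy.stats import norm

def pvalue_normal_mean(xbar: float, n: int) -> float:
    """Two-sided p-value for H0: µ = 0, given X̄ ~ N(0, 1/n) under the null."""
    z = np.sqrt(n) * abs(xbar)
    return 2 * norm.sf(z)            # P(|Z| >= z) for Z ~ N(0, 1)

print(pvalue_normal_mean(0.8, 10))   # ≈ 0.011
print(pvalue_normal_mean(2.0, 10))   # ≈ 2.5e-10
print(pvalue_normal_mean(0.8, 100))  # ≈ 1.2e-15
```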

Page 46:

LRT

• sup_{θ∈Θ0} Pθ(λ(X) ≤ c) = α.

• Choose c so that the false positive probability is bounded.

• To compute the false positive probability, we need to know how the statistic is distributed under the null (the sampling distribution of the statistic).

• Easy for the normal mean test: X̄.

• More complicated statistics?

Asymptotics
Permutation test

Page 47:

Asymptotics of the LRT

Theorem. Let X_1, …, X_n iid ∼ p(x|θ). For testing θ = θ0 versus θ ≠ θ0, under regularity conditions on p(x|θ),

−2 log λ(X_n) →d χ²₁

χ²₁ is the distribution of the square of a standard normal random variable.

For testing θ ∈ Θ0 versus θ ∉ Θ0, under regularity conditions on p(x|θ), if ν is the difference in the number of free parameters between θ ∈ Θ and θ ∈ Θ0:

−2 log λ(X_n) →d χ²_ν

Rejecting H0 iff −2 log λ(x) ≥ χ²_{ν,1−α} gives us an asymptotic (hence approximate) level-α test.

Page 48:

Asymptotics of the LRT: application to the Bernoulli model

Using the asymptotic distribution of the LRT, the p-value for the LRT statistic is

p(x) = P(λ(X) ≤ λ(x))
     = P(−2 log λ(X) ≥ −2 log λ(x))
     ≈ P(χ²₁ ≥ −2 log λ(x))

Rejecting H0 iff p(x) ≤ α gives us an approximate level-α test.
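For the Bernoulli/Mendel example this yields a concrete recipe; a sketch (illustrative, reusing the hypothetical counts from the earlier check) using scipy:

```python
import numpy as np
from scipy.stats import chi2

n, k = 100, 60                  # n gametes, k maternal transmissions (hypothetical)
p_hat = k / n

# -2 log λ for H0: p = 1/2 in the Bernoulli model
stat = 2 * (k * np.log(p_hat / 0.5) + (n - k) * np.log((1 - p_hat) / 0.5))

p_value = chi2.sf(stat, df=1)   # asymptotic χ²₁ approximation
print(stat, p_value)            # ≈ 4.03, p ≈ 0.045: reject at α = 0.05
```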

Page 49:

Permutation tests

• In many instances, even asymptotic results are difficult to derive. It is also not clear whether the sample size is big enough for them to apply.

• Key idea: we don't know the sampling distribution, but we can sample from this distribution.

Page 50:

Permutation test example: test if the distribution of two alleles is the same in disease vs. healthy

X_1, …, X_n ∼ F
Y_1, …, Y_n ∼ G

H0: F = G = P0

Statistic: t((X_1, …, X_n), (Y_1, …, Y_n)) = X̄ − Ȳ.
Key idea: under H0, all permutations of the data are equally likely.
Use this idea to compute the sampling distribution.

Page 51:

Permutation test example: test if the distribution of two alleles is the same in disease vs. healthy

• Randomly choose n out of the 2n elements to belong to group 1. The remaining elements belong to group 2.

• Compute the difference in means between the two groups.

• How often does the difference in means on the permuted data exceed the observed difference?

In many cases, the number of permutations is too large to exhaustively enumerate. Then use a random sample of permutations (Monte Carlo approximation).
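A minimal Monte Carlo permutation test (an illustrative sketch with simulated data):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(0.5, 1.0, size=30)      # e.g. disease group
y = rng.normal(0.0, 1.0, size=30)      # e.g. healthy group

observed = abs(x.mean() - y.mean())
pooled = np.concatenate([x, y])

n_perm = 10_000
count = 0
for _ in range(n_perm):
    perm = rng.permutation(pooled)      # random relabeling of the groups
    diff = abs(perm[:30].mean() - perm[30:].mean())
    count += diff >= observed

print((count + 1) / (n_perm + 1))       # Monte Carlo p-value
```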

Page 52:

Permutation test example: test if the distribution of two alleles is the same in disease vs. healthy

Page 53:

Outline

Point estimation

Hypothesis testing

Interval estimation

Page 54:

Interval estimation

Point estimates do not give you a measure of confidence.

For a scalar parameter θ, given data X, an interval estimator is a pair of functions L(x_1, …, x_n) and U(x_1, …, x_n), L ≤ U, such that [L(X), U(X)] covers θ. The interval is random while θ is fixed.

Page 55:

Evaluating interval estimators

The coverage probability of an interval estimator [L(X), U(X)] is the probability that this interval covers the true parameter θ, denoted Pθ(θ ∈ [L(X), U(X)]).

An interval estimator is termed a 1−α confidence interval if inf_θ Pθ(θ ∈ [L(X), U(X)]) ≥ 1 − α.

Page 56:

Finding interval estimators: inverting a test statistic

Given X_1, …, X_n iid ∼ N(µ, 1), consider H0: µ = µ0 versus H1: µ ≠ µ0.

One test is to reject H0 if |x̄ − µ0| > z_{1−α/2}/√n. So H0 is accepted for {x : |x̄ − µ0| ≤ z_{1−α/2}/√n}, or equivalently

x̄ − z_{1−α/2}/√n ≤ µ0 ≤ x̄ + z_{1−α/2}/√n

Since the test has size α, P(H0 is accepted | µ = µ0) = 1 − α:

P(x̄ − z_{1−α/2}/√n ≤ µ0 ≤ x̄ + z_{1−α/2}/√n | µ = µ0) = 1 − α

Since this is true for all µ0, we have

Pµ(x̄ − z_{1−α/2}/√n ≤ µ ≤ x̄ + z_{1−α/2}/√n) = 1 − α

So [x̄ − z_{1−α/2}/√n, x̄ + z_{1−α/2}/√n] is a 1−α confidence interval.
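A usage sketch (illustrative, with simulated data of known unit variance) computing this interval:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
x = rng.normal(loc=1.0, scale=1.0, size=25)          # known σ = 1

alpha = 0.05
z = norm.ppf(1 - alpha / 2)                          # z_{1-α/2} ≈ 1.96
half_width = z / np.sqrt(len(x))

print(x.mean() - half_width, x.mean() + half_width)  # 95% CI for µ
```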

Page 57:

Finding interval estimators: inverting a test statistic

• A general method

• Methods for constructing tests can be used for getting confidence intervals

Page 58:

Summary

• Two major inferential tasks.

• General procedures for each task, and procedures to evaluate their quality.

• Can evaluate finite-sample and large-sample properties.

• Estimation (point): maximum likelihood. Bias, variance, consistency.

• Testing: Likelihood Ratio Test. False positive and false negative probabilities (or power). Permutation is a useful tool in many settings.

• Interval estimation: can use tests to get at confidence intervals.
