Introductory statistics
CM226: Machine Learning for Bioinformatics
Fall 2016
Sriram Sankararaman
Introductory statistics 1 / 49
Admin
• Course website: http://web.cs.ucla.edu/~sriram/courses/cm226.fall-2016/html/index.html
• Course format changes
• Homeworks: 50%
• Exam: 50%. Nov 16 (note date change)
Statistics
• Previous class: intro to genomics
• This class: intro to statistics
Statistics
• (X1, . . . , Xn) is the data, drawn from a distribution P
• θ is an unknown population parameter, i.e., a function of P
• Inference: learn θ from (X1, . . . , Xn)
• P links θ and the data
Tasks in statistical inference
Three broad and related inference tasks
• Estimation
• Hypothesis testing
• Prediction
For each of these tasks, we have procedures for finding an estimator or test or predictor, and procedures for evaluating them. We will focus on the first two tasks.
Outline
Point estimation
Hypothesis testing
Interval estimation
Point estimation
A statistic is a function of the data.
Goal of estimation: find a statistic t such that θ̂n = t(X1, . . . , Xn) is "close" to θ.
θ̂n is an estimator of θ.
• Find estimators
• Evaluate an estimator
Point estimation
Procedures to find estimators
• Maximum likelihood
• Method of moments
• Bayes
Maximum likelihood estimation
Given $X_1, \ldots, X_n \stackrel{iid}{\sim} P(x|\theta)$, the likelihood function
$$L(\theta|x) \equiv P(x_{1:n}|\theta) = \prod_{i=1}^n P(x_i|\theta)$$
A maximum likelihood estimator:
$$\hat{\theta}(x) \triangleq \arg\max_\theta L(\theta|x) = \arg\max_\theta LL(\theta|x)$$
where
$$LL(\theta|x) \triangleq \log L(\theta|x) = \sum_{i=1}^n \log P(x_i|\theta)$$
Maximum likelihood estimation
Given $X_1, \ldots, X_n \stackrel{iid}{\sim} P(x|\theta)$, the likelihood function
$$L(\theta|x) \equiv P(x_{1:n}|\theta) = \prod_{i=1}^n P(x_i|\theta)$$
For a differentiable log likelihood function, possible candidates for the MLE are the θ that solve
$$\frac{d}{d\theta} LL(\theta) = 0$$
Why only possible candidates?
• Not a sufficient condition for a maximum
• The maximum could also be on the boundary
• Often, no closed-form solution
Example 1: Bernoulli likelihood
Model
$$X_1, \ldots, X_n \stackrel{iid}{\sim} \mathrm{Ber}(p), \quad X_i \in \{0, 1\}$$
Likelihood
$$L(p) = \prod_{i=1}^n p^{x_i}(1 - p)^{1 - x_i}$$
Log likelihood
$$LL(p) = \sum_{i=1}^n x_i \log(p) + (1 - x_i)\log(1 - p)$$
Example 1: Bernoulli likelihood
Log likelihood
$$LL(p) = \sum_{i=1}^n x_i \log(p) + (1 - x_i)\log(1 - p)$$
MLE
$$\hat{p} = \bar{x}$$
Example 1: Bernoulli likelihood
5 coin flips, 2 heads
Likelihood
$$L(p) = p^2(1 - p)^3$$
$$LL(p) = 2\log(p) + 3\log(1 - p)$$
$$\frac{dLL}{dp}(p) = \frac{2}{p} - \frac{3}{1 - p}$$
MLE
$$\hat{p} = \frac{2}{5}$$
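The closed-form MLE can be checked numerically. A minimal sketch in Python (the grid search and the particular coin-flip data layout are illustrative, not from the slides):

```python
import math

def bernoulli_loglik(p, xs):
    # LL(p) = sum_i [ x_i log p + (1 - x_i) log(1 - p) ]
    return sum(x * math.log(p) + (1 - x) * math.log(1 - p) for x in xs)

xs = [1, 1, 0, 0, 0]            # 5 coin flips, 2 heads
p_hat = sum(xs) / len(xs)       # closed-form MLE: the sample mean, 2/5

# A grid search over (0, 1) agrees with the closed form,
# since LL is concave with its maximum at p = 2/5.
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=lambda p: bernoulli_loglik(p, xs))
```

Setting the derivative to zero and checking the grid maximum give the same answer here because the log likelihood is strictly concave on (0, 1).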
Example 2: Normal likelihood
Model
$$X_1, \ldots, X_n \stackrel{iid}{\sim} \mathcal{N}(\mu, \sigma^2)$$
Likelihood
$$L(\mu, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$
Log likelihood
$$LL(\mu, \sigma^2) = -\sum_{i=1}^n \frac{(x_i - \mu)^2}{2\sigma^2} - \frac{n}{2}\log(2\pi\sigma^2)$$
Example 2: Normal likelihood
Log likelihood
$$LL(\mu, \sigma^2) = -\sum_{i=1}^n \frac{(x_i - \mu)^2}{2\sigma^2} - \frac{n}{2}\log(2\pi\sigma^2)$$
MLE
$$\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2$$
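These estimators are easy to verify by simulation; a sketch (the sample size and true parameters are arbitrary choices):

```python
import random

random.seed(0)
n, mu, sigma = 10_000, 2.0, 3.0
xs = [random.gauss(mu, sigma) for _ in range(n)]

mu_hat = sum(xs) / n                              # MLE of mu: the sample mean
var_hat = sum((x - mu_hat) ** 2 for x in xs) / n  # MLE of sigma^2: note 1/n, not 1/(n-1)
```

With 10,000 draws, `mu_hat` should be close to 2 and `var_hat` close to 9.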
Evaluating estimators
• Finite sample
• Asymptotic
Evaluating estimators
Finite sample
Bias of an estimator:
$$\mathrm{bias}(\hat{\theta}_n) = E[\hat{\theta}_n] - \theta$$
An estimator is unbiased if bias = 0.
Variance of an estimator:
$$\mathrm{var}(\hat{\theta}_n) = E\left[\left(\hat{\theta}_n - E[\hat{\theta}_n]\right)^2\right]$$
Mean square error of an estimator:
$$\mathrm{mse}(\hat{\theta}_n) = E\left[\left(\hat{\theta}_n - \theta\right)^2\right]$$
The mse decomposes as $\mathrm{mse}(\hat{\theta}_n) = \mathrm{bias}(\hat{\theta}_n)^2 + \mathrm{var}(\hat{\theta}_n)$.
Evaluating estimators
Finite sample
Comparing estimators is not straightforward:
• It depends on the evaluation criterion
• Even with a single criterion, performance can depend on the true parameter.
Evaluating estimators
Finite sample
Normal likelihood
$$X_1, \ldots, X_n \stackrel{iid}{\sim} \mathcal{N}(\mu, \sigma^2)$$
• X1 is an unbiased estimator of µ.
• µ̂ = X̄ is also an unbiased estimator of µ.
• Var[X1] = σ² while Var[µ̂] = σ²/n.
Evaluating estimators
Finite sample
Normal likelihood
$$X_1, \ldots, X_n \stackrel{iid}{\sim} \mathcal{N}(\mu, \sigma^2)$$
• σ̂² is not an unbiased estimator of σ². Its bias is −σ²/n.
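The bias of σ̂² is easy to see in simulation; a sketch with σ² = 1 and n = 5, where E[σ̂²] should be about (n − 1)/n = 0.8 (sample size and number of trials are arbitrary):

```python
import random

random.seed(1)
n, trials = 5, 100_000
total = 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]   # true sigma^2 = 1
    xbar = sum(xs) / n
    total += sum((x - xbar) ** 2 for x in xs) / n     # MLE variance estimate
mean_est = total / trials
# E[sigma_hat^2] = sigma^2 (n - 1) / n = 0.8, so the bias is -sigma^2 / n = -0.2
```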
Evaluating estimators
Asymptotics
• Challenging to compute finite-sample properties.
• What happens when sample size goes to infinity?
Asymptotics
Theorem (Weak Law of Large Numbers). Let X1, . . . , Xn be iid random variables with E[Xi] = µ and Var[Xi] = σ² < ∞. Then $\bar{X}_n \stackrel{p}{\to} \mu$.
Theorem (Central Limit Theorem). Let X1, . . . , Xn be iid random variables with E[Xi] = µ and Var[Xi] = σ² < ∞. Then
$$\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \stackrel{d}{\to} \mathcal{N}(0, 1)$$
We will informally write $\bar{X}_n \sim \mathcal{AN}(\mu, \frac{\sigma^2}{n})$.
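The CLT can be checked by simulation: standardized sample means of, say, uniform draws should fall within ±1.96 about 95% of the time. A sketch (the distribution, n, and trial count are arbitrary choices):

```python
import math
import random

random.seed(2)
n, trials = 50, 40_000
mu, sigma = 0.5, math.sqrt(1 / 12)    # mean and sd of Uniform(0, 1)

inside = 0
for _ in range(trials):
    xbar = sum(random.random() for _ in range(n)) / n
    z = math.sqrt(n) * (xbar - mu) / sigma
    inside += abs(z) <= 1.96          # 1.96 is the 0.975 quantile of N(0, 1)
frac = inside / trials                # should be close to 0.95
```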
Evaluating estimators
Asymptotics
Consistency: a sequence of estimators θ̂n is a consistent sequence of estimators of a parameter θ if $\hat{\theta}_n \stackrel{p}{\to} \theta$.
Normal likelihood: µ̂n = x̄n is a consistent estimator of µ, and σ̂²n is a consistent estimator of σ².
Evaluating estimators
Asymptotics
Theorem (Consistency of MLE). Let $X_1, \ldots, X_n \stackrel{iid}{\sim} f(x|\theta)$ and let θ̂n denote the MLE of θ. Then $\hat{\theta}_n \stackrel{p}{\to} \theta$.
If the bias of θ̂n → 0 and its variance → 0, i.e., its MSE goes to 0, then it is consistent.
Evaluating estimators
Asymptotics
Theorem (Asymptotic Normality of MLE). Under "regularity conditions",
$$\sqrt{n}(\hat{\theta}_n - \theta) \stackrel{d}{\to} \mathcal{N}(0, I^{-1}(\theta))$$
Here
$$I(\theta) = -E\left[\frac{\partial^2 \log P(X|\theta)}{\partial \theta^2}\right]$$
is the Fisher information associated with the model. What is Fisher information?
Fisher information
Normal model
Focus on the mean. For a single observation X ∼ N(µ, σ²),
$$\frac{\partial \log P(X|\theta)}{\partial \mu} = \frac{x - \mu}{\sigma^2}$$
$$\frac{\partial^2 \log P(X|\theta)}{\partial \mu^2} = -\frac{1}{\sigma^2}$$
$$I(\mu) = \frac{1}{\sigma^2}$$
Makes intuitive sense:
• The inverse Fisher information is the variance.
• The smaller the variance, the larger the Fisher information, and the more precise the estimates of the mean.
Fisher information
Normal model
[Figure: log likelihood LL plotted against µ for several values of σ²]
Fisher information
Bernoulli model
$$\frac{\partial \log P(X|p)}{\partial p} = \frac{X}{p} - \frac{1 - X}{1 - p}$$
$$\frac{\partial^2 \log P(X|p)}{\partial p^2} = -\frac{X}{p^2} - \frac{1 - X}{(1 - p)^2}$$
$$I(p) = \frac{1}{p(1 - p)}$$
• Again, the inverse Fisher information is the variance.
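Numerically, the variance of the MLE p̂ matches the inverse Fisher information scaled by n: Var[p̂] = p(1 − p)/n = 1/(n I(p)). A sketch (the values of n, p, and the trial count are arbitrary):

```python
import random

random.seed(3)
n, p, trials = 50, 0.3, 50_000
# each estimate is the fraction of successes in n Bernoulli(p) draws
ests = [sum(random.random() < p for _ in range(n)) / n for _ in range(trials)]
mean = sum(ests) / trials
var = sum((e - mean) ** 2 for e in ests) / trials
# 1 / (n I(p)) = p (1 - p) / n = 0.0042; var should be close to this
```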
Outline
Point estimation
Hypothesis testing
Interval estimation
Hypothesis testing
Hypotheses
• An offspring has an equal chance of inheriting either of two alleles from a parent (Mendel's first law holds)
• Two genes segregate independently (Mendel's second law holds)
• A genetic mutation has no effect on disease
• A drug has no effect on blood pressure
Null hypothesis: H0 : θ ∈ Θ0
Alternative hypothesis: H1 : θ ∈ Θ1
Hypothesis testing
Hypotheses
Mendel's first law / Test parameter of Bernoulli
• We observe n gametes from a diploid individual.
• The probability of transmitting the maternal allele is p.
• We can tell for each gamete i whether it carries the paternal or the maternal allele: Xi ∼ Ber(p).
• After observing data from n gametes, accept or reject H0.
$$H_0 : p = \frac{1}{2}, \qquad H_1 : p \neq \frac{1}{2}$$
Hypothesis test
A rule that specifies the values of the sample for which H0 should be accepted or rejected: specify a test statistic t(X1, . . . , Xn) and a rejection/acceptance region.
Mendel's first law. One test:
Statistic: t(X1, . . . , Xn) = X̄.
Reject H0 if X̄ > 0.9 or X̄ < 0.1.
Rejection region: {(x1, . . . , xn) : x̄ > 0.9 or x̄ < 0.1}.
Hypothesis testing
Procedures to find tests
• Likelihood Ratio Test (LRT)
• Wald test
• Score test
• Bayesian test
Hypothesis testing
Likelihood Ratio Test
LRT statistic
$$\lambda(x) = \frac{\sup_{\theta \in \Theta_0} L(\theta|x)}{\sup_{\theta \in \Theta} L(\theta|x)} = \frac{L(\hat{\theta}_0|x)}{L(\hat{\theta}|x)}, \qquad \Theta = \Theta_0 \cup \Theta_1$$
Reject H0 if λ(x) < c.
Hypothesis testing
Likelihood Ratio Test example 1
Test mean of normal model: given $X_1, \ldots, X_n \stackrel{iid}{\sim} \mathcal{N}(\mu, 1)$, H0 : µ = 0 versus H1 : µ ≠ 0.
Hypothesis testing
Likelihood Ratio Test example 1
Test mean of normal model
$$\lambda(x) = \frac{\exp\left(-\sum_{i=1}^n \frac{x_i^2}{2}\right)}{\exp\left(-\sum_{i=1}^n \frac{(x_i - \hat{\mu})^2}{2}\right)} = \exp\left(-\frac{n\bar{x}^2}{2}\right)$$
Reject H0 if λ(x) < c, equivalent to: reject H0 if $|\bar{x}| > \sqrt{\frac{2\log(1/c)}{n}}$.
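The algebraic equivalence of the two forms of the rejection rule can be sanity-checked directly; a sketch with arbitrary choices of n and c:

```python
import math

def lrt_lambda(xbar, n):
    # LRT statistic for H0: mu = 0 in N(mu, 1): lambda = exp(-n xbar^2 / 2)
    return math.exp(-n * xbar ** 2 / 2)

n, c = 25, 0.1
threshold = math.sqrt(2 * math.log(1 / c) / n)   # reject H0 if |xbar| > threshold

# the two forms of the rejection rule agree for any sample mean
for xbar in [-0.6, -0.2, 0.0, 0.05, 0.2, 0.4, 0.6]:
    assert (lrt_lambda(xbar, n) < c) == (abs(xbar) > threshold)
```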
Hypothesis testing
Likelihood Ratio Test example 2
Mendel's first law
$$\lambda(x) = \frac{\left(\frac{1}{2}\right)^{\sum_i x_i} \left(\frac{1}{2}\right)^{n - \sum_i x_i}}{\hat{p}^{\sum_i x_i} (1 - \hat{p})^{n - \sum_i x_i}} = \frac{\left(\frac{1}{2}\right)^n}{\bar{x}^{n\bar{x}} (1 - \bar{x})^{n(1 - \bar{x})}} = 2^{nH(\bar{x}) - n}$$
Here $H(x) = -(x \log_2(x) + (1 - x)\log_2(1 - x))$, $x \in [0, 1]$, is the (binary) entropy function.
Reject H0 if λ(x) < c, equivalent to: reject H0 if $H(\bar{x}) < 1 - \frac{\log_2(1/c)}{n}$.
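The closed form λ = 2^{n(H(x̄) − 1)} can be checked against the ratio of likelihoods directly; a sketch with hypothetical counts (14 maternal alleles in n = 20 gametes; the numbers are illustrative):

```python
import math

def entropy2(q):
    # binary entropy in bits: H(q) = -(q log2 q + (1 - q) log2(1 - q))
    if q in (0.0, 1.0):
        return 0.0
    return -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

n, k = 20, 14                  # hypothetical: 14 maternal alleles in 20 gametes
p_hat = k / n

direct = 0.5 ** n / (p_hat ** k * (1 - p_hat) ** (n - k))   # L(1/2) / L(p_hat)
closed = 2 ** (n * (entropy2(p_hat) - 1))                   # 2^{n H(xbar) - n}
```

The two computations agree to floating-point precision.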
Hypothesis testing
Likelihood Ratio Test example 2
Mendel's first law: how do we choose c?
Hypothesis testing
Evaluating tests

              Decision
Truth     H0      H1
H0        TN      FP
H1        FN      TP

FP: False positive (type-I error)
FN: False negative (type-II error)
TP: True positive
TN: True negative
For θ ∈ Θ0, the probability of a false positive for a test is Pθ(Test rejects). For θ ∈ Θ1, the probability of a false negative is Pθ(Test accepts).
Hypothesis testing
Evaluating tests
To evaluate (or compare) tests, we need to trade off false positives and false negatives. For tests with false positive probability below some threshold, what is the false negative probability?
A test is a level α test if $\sup_{\theta \in \Theta_0} P_\theta(\text{Test rejects}) \le \alpha$.
• α is an upper bound on the false positive probability.
• Often we set α = 0.05.
Hypothesis testing
Level of LRT
• Choose c to control α: $\sup_{\theta \in \Theta_0} P_\theta(\lambda(X) \le c) = \alpha$.
• The trivial test is to never reject.
• We would like to maximize the probability of rejecting the null (i.e., the power, which we denote β(θ)) when it is false, while controlling the false positive probability.
Hypothesis testing
Normal mean
For the test of the normal mean, Θ0 = {0}. Under the null (θ = θ0), the test statistic $\bar{X} \sim \mathcal{N}(0, \frac{1}{n})$.
$$P_0(\lambda(X) \le c) = P_0\left(|\bar{X}| \ge \sqrt{\frac{2\log(1/c)}{n}}\right) = P\left(|Z| \ge \sqrt{2\log(1/c)}\right)$$
Here Z ∼ N(0, 1), and $P(|Z| \ge z_{1-\alpha/2}) = \alpha$ where $z_\alpha$ is the α-quantile of the standard normal distribution.
Normal mean
[Figure]
Hypothesis testing
p-value
• Level is a crude measure. If we reject the null hypothesis at level α = 0.05, it does not tell us how strong the evidence against H0 is.
• The p-value of a test is the smallest α at which the test rejects H0.
• Smaller p-values imply stronger evidence against H0.
For a statistic λ(X) such that small values of λ give evidence against H0, the p-value is
$$p(x) = \sup_{\theta \in \Theta_0} P_\theta(\lambda(X) \le \lambda(x))$$
Hypothesis testing
p-value for test of normal mean
• Observe 10 samples, mean is 0.8. Reject H0 at α = .05. p-value = 0.01.
• Observe 10 samples, mean is 2. Reject H0 at α = .05. p-value = 2 × 10⁻¹⁰.
• Observe 100 samples, mean is 0.8. Reject H0 at α = .05. p-value = 1 × 10⁻¹⁵.
LRT
• $\sup_{\theta \in \Theta_0} P_\theta(\lambda(X) \le c) = \alpha$.
• Choose c so that the false positive probability is bounded.
• To compute the false positive probability, we need to know how the statistic is distributed under the null (the sampling distribution of the statistic).
• Easy for the normal mean test: X̄.
• More complicated statistics? Two options: asymptotics and permutation tests.
Asymptotics of the LRT
Theorem. Let $X_1, \ldots, X_n \stackrel{iid}{\sim} p(x|\theta)$. For testing θ = θ0 versus θ ≠ θ0, under regularity conditions on p(x|θ),
$$-2\log\lambda(X^n) \stackrel{d}{\to} \chi^2_1$$
χ²₁ is the distribution of the square of a standard normal random variable.
For testing θ ∈ Θ0 versus θ ∉ Θ0, under regularity conditions on p(x|θ), if ν is the difference in the number of free parameters between θ ∈ Θ and θ ∈ Θ0:
$$-2\log\lambda(X^n) \stackrel{d}{\to} \chi^2_\nu$$
Rejecting H0 iff $-2\log\lambda(X) \ge \chi^2_{\nu, 1-\alpha}$ gives us an asymptotic (hence approximate) level α test.
Asymptotics of the LRT
Application to Bernoulli model
Using the asymptotic distribution of the LRT, the p-value for the LRT statistic:
$$p(x) = P(\lambda(X) \le \lambda(x)) = P(-2\log\lambda(X) \ge -2\log\lambda(x)) \approx P(\chi^2_1 \ge -2\log\lambda(x))$$
Rejecting H0 iff p(x) ≤ α gives an approximate level α test.
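A sketch of this computation for the Mendel example, with hypothetical counts (14 maternal alleles out of n = 20; the numbers are illustrative). It uses the fact that χ²₁ is a squared standard normal, so its survival function is erfc(√(t/2)):

```python
import math

def chi2_1_sf(t):
    # P(chi^2_1 >= t): chi^2_1 is a squared standard normal,
    # so P(Z^2 >= t) = P(|Z| >= sqrt(t)) = erfc(sqrt(t / 2))
    return math.erfc(math.sqrt(t / 2))

n, k = 20, 14                      # hypothetical data; H0: p = 1/2
p_hat = k / n
log_lam = n * math.log(0.5) - (k * math.log(p_hat) + (n - k) * math.log(1 - p_hat))
stat = -2 * log_lam                # -2 log lambda, asymptotically chi^2_1 under H0
p_value = chi2_1_sf(stat)          # approximate p-value
```

Here the statistic is about 3.29, giving an approximate p-value near 0.07: not significant at α = 0.05.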
Permutation tests
• In many instances, even asymptotic results are difficult. It is also not clear whether the sample size is big enough for the asymptotics to apply.
• Key idea: we don't know the sampling distribution, but we can sample from this distribution.
Permutation test example
Test if the distribution of two alleles is the same in disease vs healthy
$$X_1, \ldots, X_n \sim F$$
$$Y_1, \ldots, Y_n \sim G$$
$$H_0 : F = G = P_0$$
Statistic: $t((X_1, \ldots, X_n), (Y_1, \ldots, Y_n)) = \bar{X} - \bar{Y}$.
Key idea: under H0, all permutations of the data are equally likely. Use this idea to compute the sampling distribution.
Permutation test example
Test if the distribution of two alleles is the same in disease vs healthy
• Randomly choose n out of the 2n elements to belong to group 1; the remaining belong to group 2.
• Compute the difference in means between the two groups.
• How often does the difference in means on the permuted data exceed the observed difference?
In many cases, the number of permutations is too large to exhaustively enumerate. Then use a random sample of permutations (Monte Carlo approximation).
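The steps above can be sketched as a Monte Carlo permutation test; the allele counts for the two groups are made up for illustration:

```python
import random

def permutation_test(xs, ys, n_perm=10_000, seed=0):
    # Two-sided permutation p-value for H0: both samples come from the
    # same distribution, using the difference in means as the statistic.
    rng = random.Random(seed)
    observed = abs(sum(xs) / len(xs) - sum(ys) / len(ys))
    pooled = list(xs) + list(ys)
    n = len(xs)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)     # random relabeling of the pooled data
        diff = abs(sum(pooled[:n]) / n - sum(pooled[n:]) / len(ys))
        hits += diff >= observed
    return hits / n_perm

cases = [1, 2, 2, 1, 2, 2, 1, 2]       # hypothetical allele counts, disease group
controls = [0, 1, 0, 1, 1, 0, 1, 0]    # hypothetical allele counts, healthy group
p = permutation_test(cases, controls)
```

With these made-up data the observed difference in means is extreme among random relabelings, so the p-value comes out small.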
Outline
Point estimation
Hypothesis testing
Interval estimation
Interval estimation
Point estimates do not give you a measure of confidence.
For a scalar parameter θ, given data X, an interval estimator is a pair of functions L(x1, . . . , xn) and U(x1, . . . , xn), L ≤ U, such that [L(X), U(X)] covers θ. The interval is random while θ is fixed.
Evaluating interval estimators
The coverage probability of an interval estimator [L(X), U(X)] is the probability that this interval covers the true parameter θ, denoted Pθ(θ ∈ [L(X), U(X)]).
An interval estimator is termed a 1 − α confidence interval if $\inf_\theta P_\theta(\theta \in [L(X), U(X)]) \ge 1 - \alpha$.
Finding interval estimators
Inverting a test statistic
Given $X_1, \ldots, X_n \stackrel{iid}{\sim} \mathcal{N}(\mu, 1)$, H0 : µ = µ0 versus H1 : µ ≠ µ0.
One test is to reject H0 for $\{x : |\bar{x} - \mu_0| > \frac{z_{1-\alpha/2}}{\sqrt{n}}\}$. So H0 is accepted for $\{x : |\bar{x} - \mu_0| \le \frac{z_{1-\alpha/2}}{\sqrt{n}}\}$, or equivalently
$$\bar{x} - z_{1-\alpha/2}\frac{1}{\sqrt{n}} \le \mu_0 \le \bar{x} + z_{1-\alpha/2}\frac{1}{\sqrt{n}}$$
Since the test has size α, $P(H_0 \text{ is accepted} \mid \mu = \mu_0) = 1 - \alpha$, so
$$P\left(\bar{X} - z_{1-\alpha/2}\frac{1}{\sqrt{n}} \le \mu_0 \le \bar{X} + z_{1-\alpha/2}\frac{1}{\sqrt{n}} \,\middle|\, \mu = \mu_0\right) = 1 - \alpha$$
Since this is true for all µ0, we have
$$P_\mu\left(\bar{X} - z_{1-\alpha/2}\frac{1}{\sqrt{n}} \le \mu \le \bar{X} + z_{1-\alpha/2}\frac{1}{\sqrt{n}}\right) = 1 - \alpha$$
So $\left[\bar{X} - z_{1-\alpha/2}\frac{1}{\sqrt{n}},\; \bar{X} + z_{1-\alpha/2}\frac{1}{\sqrt{n}}\right]$ is a 1 − α confidence interval.
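The coverage claim can be checked by simulation; a sketch with known σ = 1 and arbitrary choices of µ, n, and trial count:

```python
import math
import random

def ci_95(xs):
    # invert the level-0.05 test: xbar +/- z_{0.975} / sqrt(n), z_{0.975} ~ 1.96
    n = len(xs)
    xbar = sum(xs) / n
    half = 1.959964 / math.sqrt(n)
    return xbar - half, xbar + half

random.seed(4)
mu, n, trials = 1.0, 20, 20_000
covered = 0
for _ in range(trials):
    lo, hi = ci_95([random.gauss(mu, 1.0) for _ in range(n)])
    covered += lo <= mu <= hi
coverage = covered / trials    # should be close to 0.95
```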
Finding interval estimators
Inverting a test statistic
• A general method
• Methods for constructing tests can be used to obtain confidence intervals
Summary
• Two major inferential tasks
• General procedures for each task, and procedures to evaluate their quality.
• Can evaluate finite sample and large sample properties.
• Estimation (point): maximum likelihood. Bias, variance, consistency.
• Testing: Likelihood Ratio Test. False positive and false negative probabilities (or power). Permutation is a useful tool in many settings.
• Interval estimation: can use tests to get at confidence intervals.