Model selection and multiple hypothesis testing
Kelvin Gu

My presentation notes for feature selection and multiple hypothesis testing.
Contents

1 Model selection
  1.1 Setup
  1.2 Motivation
  1.3 RSS-d.o.f. decomposition of the PE
  1.4 RSS-d.o.f. decomposition in action
  1.5 Bias-variance decomposition of the risk
  1.6 Bias-variance decomposition in action
  1.7 Two reasons why the methods above aren’t ideal
  1.8 Bayesian Info Criterion
  1.9 Stein’s unbiased risk estimate (SURE)
  1.10 SURE in action

2 Multiple hypothesis testing
  2.1 The setup
  2.2 Why do we need it?
  2.3 Controlling FDR using Benjamini Hochberg (BH)
  2.4 Proof of BH

3 References
1 Model selection
1.1 Setup
• Suppose we know $X = x$. We want to predict the value of $Y$.
• Define the prediction error to be $\mathrm{PE} = (Y - f(X))^2$
• We want to choose some function $f$ that minimizes the objective $E[\mathrm{PE} \mid X = x]$
  – the optimal solution is $\mu(x) = E[Y \mid X = x]$
• As a proxy for minimizing $E[(Y - f(X))^2 \mid X = x]$, we’ll minimize the risk: $R = E[(\mu(X) - f(X))^2]$
  – note that
$$E\{E[\mathrm{PE} \mid X = x]\} = E[\mathrm{PE}] = E[(\mu(X) + \varepsilon(X) - f(X))^2] = E[(\mu(X) - f(X))^2] + E[\varepsilon(X)^2] = R + \mathrm{Var}(Y)$$
  – so, the risk $R$ is a reasonable proxy to optimize
  – $\mathrm{Var}(Y)$ is unavoidable
• For notational convenience, we’ll write $\hat\mu = f(X)$ for the estimate and $\mu = \mu(X)$ for the truth
1.2 Motivation
• Why can’t we just use cross-validation for all tasks?
• The problem:
  – Suppose we’re doing ordinary least squares with $p = 30$ predictors (inputs)
  – we want to select a subset of the $p$ predictors with the smallest EPE
  – for each subset of predictors, we fit the model and then test on some held-out test set
  – there are $\binom{p}{2} = 435$ models of size 2, and $\binom{p}{15} = 155{,}117{,}520$ models of size 15
  – even if most of the size-15 models are terrible, after 155,117,520 opportunities, you’ll probably find one that fits the test data better than any of the size-2 models
  – This is “second-order” overfitting.
  – let $\mathcal{M}_{15}$ be the set of all size-15 models
$$\underbrace{E\left[\min_{m \in \mathcal{M}_{15}} \mathrm{PE}(m)\right]}_{\text{cross validation thinks you get this}} \;\ll\; \underbrace{\min_{m \in \mathcal{M}_{15}} E[\mathrm{PE}(m)]}_{\text{you actually get this}}$$
  – even if you have the computational power to try all models, it’s still a bad idea (without some modification); see the simulation sketch after this list
• How will we address this?
  – find better ways to estimate PE, and add an additional penalty to account for the overfitting problem presented above
  – it turns out that we need a penalty which depends not only on the model size $p$, but also the data size $n$
• Other ways to address this:
  – just avoid searching over a high-dimensional model space in the first place (e.g. ridge regression and LASSO both offer just a single parameter to vary)
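To make the second-order overfitting effect concrete, here is a minimal simulation sketch (my own illustration; the pure-noise setup and all names here are assumptions, not from the notes). Every model has the same true PE, yet the best of many size-15 models looks better on held-out data than a size-2 model:

```python
# Pure-noise predictors: no model has any real signal, so all held-out
# differences are luck. Searching over many models harvests that luck.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 30
X, y = rng.normal(size=(n, p)), rng.normal(size=n)        # training data (pure noise)
X_te, y_te = rng.normal(size=(n, p)), rng.normal(size=n)  # held-out test data

def heldout_mse(cols):
    """Fit OLS on the chosen columns, return held-out MSE."""
    beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    return np.mean((y_te - X_te[:, cols] @ beta) ** 2)

mse_size2 = heldout_mse([0, 1])                            # one arbitrary size-2 model
subsets = [rng.choice(p, 15, replace=False) for _ in range(2000)]
best_size15 = min(heldout_mse(s) for s in subsets)         # best of 2000 size-15 models
print(f"size-2 test MSE: {mse_size2:.3f}; best size-15 test MSE: {best_size15:.3f}")
```

With high probability the second number comes out smaller, even though no model is actually better.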
1.3 RSS-d.o.f. decomposition of the PE
• We just saw that expected prediction error could be decomposed as $E[\mathrm{PE}] = R + \mathrm{Var}(Y)$. Here is another decomposition.
• Let $(X, Y)$ be the training data, and let $(X, Y^*)$ be the test data.
  – Note that $X$ is the same in both cases but $Y$ is not!
  – this proof doesn’t work if the $X$ is different in training and test
• To make matters simple, just assume $X, Y \in \mathbb{R}$. It easily generalizes to the vector case.
• Our prediction for $Y^*$ is $\hat\mu$, which is a function of $(X, Y)$ because we trained on $(X, Y)$. To emphasize this, we’ll write $\hat\mu = \hat\mu_{X,Y}$.
• The prediction error (PE) is $(\hat\mu_{X,Y} - Y^*)^2$, and we have
$$\underbrace{E[(\hat\mu_{XY} - Y^*)^2]}_{E[\mathrm{PE}]} = \underbrace{E[(\hat\mu_{XY} - Y)^2]}_{E[\mathrm{RSS}]} + \underbrace{2\,\mathrm{Cov}(\hat\mu_{XY}, Y)}_{\text{d.o.f.}}$$
• Proof:
$$\begin{aligned}
E[(\hat\mu_{XY} - Y)^2] &= E[(\hat\mu_{XY} - \mu + \mu - Y)^2] \\
\underbrace{E[(\hat\mu_{XY} - Y)^2]}_{E[\mathrm{RSS}]} &= \underbrace{E[(\hat\mu_{XY} - \mu)^2] + E[(\mu - Y)^2]}_{E[\mathrm{PE}]} - \underbrace{2\,E[(\hat\mu_{XY} - \mu)(Y - \mu)]}_{\text{d.o.f.}}
\end{aligned}$$
  – The first two terms on the right sum to $E[\mathrm{PE}]$ because:
$$\begin{aligned}
E[\mathrm{PE}] = E[(\hat\mu_{XY} - Y^*)^2] &= E[(\hat\mu_{XY} - \mu + \mu - Y^*)^2] \\
&= E[(\hat\mu_{XY} - \mu)^2 + 2(\hat\mu_{XY} - \mu)(\mu - Y^*) + (\mu - Y^*)^2] \\
&= E[(\hat\mu_{XY} - \mu)^2] + E[(\mu - Y^*)^2]
\end{aligned}$$
    ∗ A key thing to note is that $E[(\hat\mu_{XY} - \mu)(\mu - Y^*)] = E[\hat\mu_{XY} - \mu]\,E[\mu - Y^*] = 0$
    ∗ the expectation factorizes because $Y^*$ and $(X, Y)$ are independent, and $E[\mu - Y^*] = 0$
    ∗ note also that $E[(\mu - Y^*)^2] = E[(\mu - Y)^2]$ since $Y$ and $Y^*$ are identically distributed
    ∗ finally, $E[(\hat\mu_{XY} - \mu)(Y - \mu)] = \mathrm{Cov}(\hat\mu_{XY}, Y)$ (treating $X$ as fixed), which gives the identity above after rearranging
1.4 RSS-d.o.f. decomposition in action
• Suppose we’re fitting a linear model $\hat\mu = HY$. Then we can compute $\mathrm{Tr}(\mathrm{Cov}(\hat\mu, Y))$:
$$\mathrm{Tr}(\mathrm{Cov}(HY, Y)) = \mathrm{Tr}(H\,\mathrm{Cov}(Y, Y)) = \mathrm{Tr}(H\Sigma)$$
• if $H = X(X^T X)^{-1} X^T$ and $\Sigma = \sigma^2 I$, we can make this even more explicit:
$$\mathrm{Tr}(H\Sigma) = \sigma^2\,\mathrm{rank}(X)$$
• $\|\hat\mu - Y\|^2 + 2\sigma^2\,\mathrm{rank}(X)$ is called the $C_p$ statistic
• We will see that model selection using $C_p$ has the same problems as cross-validation.
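As a small illustration (my own sketch; `cp_statistic` is a hypothetical helper, not from the notes), the $C_p$ statistic for an OLS fit can be computed directly from the formula above:

```python
import numpy as np

def cp_statistic(X, y, sigma2):
    """C_p = ||y - mu_hat||^2 + 2 * sigma2 * rank(X) for the OLS fit mu_hat = H y."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return rss + 2 * sigma2 * np.linalg.matrix_rank(X)
```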
1.5 Bias-variance decomposition of the risk
• Returning to $E[\mathrm{PE}] = R + \mathrm{Var}(Y)$
• We can decompose the risk term $R$ further:
$$\begin{aligned}
R = E[(\hat\mu - \mu)^2] &= E[(\hat\mu - E\hat\mu + E\hat\mu - \mu)^2] \\
&= E[(\hat\mu - E\hat\mu)^2 + 2(\hat\mu - E\hat\mu)(E\hat\mu - \mu) + (E\hat\mu - \mu)^2] \\
&= E[(\hat\mu - E\hat\mu)^2] + (E\hat\mu - \mu)^2 \\
&= \mathrm{Var}(\hat\mu) + \mathrm{Bias}(\hat\mu)^2
\end{aligned}$$
• same trick: expand and kill the cross term: $E[(\hat\mu - E\hat\mu)(E\hat\mu - \mu)] = 0 \cdot (E\hat\mu - \mu)$
• Intuition: as you increase model size, bias tends to go down but variance goes up (overfitting)
1.6 Bias-variance decomposition in action
• Denoising problem:
  – we observe $y \sim N(\mu, I)$
  – we think $\mu$ has sparsity, so we try to recover it by solving: $\min_{\hat y} \|y - \hat y\|^2 + 2\sigma^2 \|\hat y\|_0$
  – Note that the penalty $2\sigma^2 \|\hat y\|_0$ is just the $C_p$ penalty, $2\sigma^2\,\mathrm{rank}(X)$, when $X = I$
• bias-variance tradeoff:
  – consider each $\hat y_i$ separately
  – either you set $\hat y_i = y_i$ and pay $2\sigma^2$ (variance)
  – or you set $\hat y_i = 0$ and pay $y_i^2$ (bias)
• the solution is hard-thresholding (see the sketch below):
$$\hat y_i = \begin{cases} 0 & |y_i| \le \sqrt{2}\,\sigma \\ y_i & |y_i| > \sqrt{2}\,\sigma \end{cases}$$
• We can solve this problem even though it has an $L_0$ penalty. You would think this procedure should be good, right? Nope. It tends to keep too many entries (setting $\hat y_i = y_i$ when $\mu_i = 0$).
• We will see that model selection using a constant penalty on the $L_0$ norm suffers from the same problems as $C_p$ and cross-validation
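Here is a minimal sketch of the hard-thresholding rule just derived (my own code; `hard_threshold` is a hypothetical name):

```python
import numpy as np

def hard_threshold(y, sigma):
    """Coordinatewise solution of min ||y - yhat||^2 + 2*sigma^2*||yhat||_0."""
    return np.where(np.abs(y) > np.sqrt(2) * sigma, y, 0.0)
```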
1.7 Two reasons why the methods above aren’t ideal
• (Continuing the denoising problem)
1. Other estimators can achieve better prediction error
   • suppose the real $\mu = 0$
   • then our risk is: $E\|\hat y - \mu\|^2 = \sum_i E\big[y_i^2\, I_{|y_i| > \sqrt{2}\sigma}\big] \approx 0.57\, p\, \sigma^2$
   • consider instead the James-Stein estimator, which achieves risk $2\sigma^2$ here (see the comparison sketch below):
     – $\hat\mu_{JS} = \left[1 - \frac{p-2}{\|Y\|^2}\right] Y$
     – The following bound on the risk of $\hat\mu_{JS}$ is offered without proof:
$$E\|\hat\mu_{JS} - \mu\|^2 \le \sigma^2 \left( p - \frac{p-2}{1 + \frac{\|\mu\|^2}{p-2}} \right)$$
     – plugging in $\|\mu\|^2 = 0$, we get a bound of $2\sigma^2$. It doesn’t even depend on $p$!
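To see the gap numerically, here is a small simulation sketch (my own, assuming $\mu = 0$, $\sigma = 1$, $p = 100$) comparing hard thresholding with the James-Stein estimator defined above:

```python
import numpy as np

rng = np.random.default_rng(0)
p, trials = 100, 5000
risk_ht = risk_js = 0.0
for _ in range(trials):
    y = rng.normal(size=p)                          # Y ~ N(0, I), so the true mu = 0
    ht = np.where(np.abs(y) > np.sqrt(2), y, 0.0)   # hard thresholding at sqrt(2)*sigma
    js = (1 - (p - 2) / np.sum(y**2)) * y           # James-Stein shrinkage
    risk_ht += np.sum(ht**2) / trials               # squared error vs. mu = 0
    risk_js += np.sum(js**2) / trials
print(f"hard thresholding risk ~ {risk_ht:.1f} (about 0.57p); James-Stein risk ~ {risk_js:.1f} (<= 2)")
```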
2. They aren’t consistent (this is an $n \to \infty$ argument)
   • suppose we get more iid observations $y_i \sim N(\mu, I)$, $i = 1, \ldots, n$
   • suppose that $\mu$ has sparsity $k^*$ (say, the first $k^*$ coordinates are nonzero)
   • Now we want to minimize $C_p = \sum_{i=1}^n \|y_i - \hat y\|^2 + 2\sigma^2 \|\hat y\|_0$
   • Consistency would require that $P(C_p(k^*) < C_p(k)) \to 1$ for all $k \neq k^*$
   • We’re going to vary the parameter $k$ (keeping $\hat y_j = \bar y_j$ for $j \le k$ and $\hat y_j = 0$ otherwise) and compute $C_p(k)$ for each $k$
$$\begin{aligned}
C_p(k) &= \sum_{i=1}^n \|y_i - \hat y\|^2 + 2\sigma^2 \|\hat y\|_0
= \sum_{i=1}^n \left[ \sum_{j=1}^k (y_{ij} - \bar y_j)^2 + \sum_{j=k+1}^p y_{ij}^2 \right] + 2\sigma^2 k \\
&= \sum_{j=1}^k \sum_{i=1}^n (y_{ij} - \bar y_j)^2 + \sum_{j=k+1}^p \sum_{i=1}^n y_{ij}^2 + 2\sigma^2 k \\
&= \sum_{j=1}^k \left( \sum_{i=1}^n y_{ij}^2 - n \bar y_j^2 \right) + \sum_{j=k+1}^p \sum_{i=1}^n y_{ij}^2 + 2\sigma^2 k \\
&= \sum_{j=1}^p \sum_{i=1}^n y_{ij}^2 - n \sum_{j=1}^k \bar y_j^2 + 2\sigma^2 k \\
&= -n \sum_{j=1}^k \bar y_j^2 + 2\sigma^2 k + \underbrace{\sum_{i=1}^n \|y_i\|^2}_{\text{constant for all } k}
\end{aligned}$$
• consider $k > k^*$, so the extra coordinates $j = k^* + 1, \ldots, k$ are null ($\mu_j = 0$):
$$C_p(k) - C_p(k^*) = 2\sigma^2 (k - k^*) - \sum_{j=k^*+1}^{k} n \bar y_j^2$$
• each $\bar y_j \sim N(\mu_j, \frac{1}{n})$; for these null coordinates $\mu_j = 0$, so $n \bar y_j^2 \sim \chi^2_1$ and $\sum_{j=k^*+1}^{k} n \bar y_j^2 \sim \chi^2_{k - k^*}$
• this expression doesn’t depend on $n$ anymore, so there’s always a positive probability that $C_p(k^*) > C_p(k)$
• this proof seems slightly fishy, because for each model size $k$, I’m arbitrarily picking the first $k$ entries to threshold
  – but you can replace all the sums up to $k$ with sums over any size-$k$ subset
  – now compute $C_p$ over all subsets, and you’ll get the same result (a simulation sketch follows below)
• in contrast, the Bayesian information criterion works
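A quick Monte Carlo sketch of the inconsistency argument (my own check; the values $\mu = (5, 0, \ldots, 0)$, $p = 10$, $\sigma^2 = 1$ are assumptions): the probability that the $C_p$ minimizer overshoots $k^*$ does not shrink as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k_star, sigma2, trials = 10, 1, 1.0, 2000
mu = np.zeros(p)
mu[:k_star] = 5.0

for n in [10, 100, 1000, 10000]:
    overfit = 0
    for _ in range(trials):
        ybar = mu + rng.normal(size=p) / np.sqrt(n)   # ybar_j ~ N(mu_j, 1/n)
        order = np.argsort(-ybar**2)                  # best size-k subset = k largest coords
        cp = [-n * np.sum(ybar[order[:k]]**2) + 2 * sigma2 * k for k in range(p + 1)]
        overfit += int(np.argmin(cp) > k_star)
    print(f"n = {n:6d}: P(selected k > k*) ~ {overfit / trials:.2f}")
```

The overfitting frequency stays roughly flat in $n$, matching the $\chi^2$ argument above.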
1.8 Bayesian Info Criterion
• We have a collection of models to select from, $\{M_i\}$, indexed by $i$
• Model $M_i$ has a parameter vector $\theta_i$ associated with it. Let $|\theta_i|$ denote the dimension of $\theta_i$.
• Pick the model with the highest marginal probability: $P(y \mid M_i) = \int f(y \mid \theta_i)\, g_i(\theta_i)\, d\theta_i$
• $\log P(y \mid M_i) \approx \log L_{\hat\theta_i}(y) - \frac{|\theta_i|}{2} \log n$
• derivation steps:
  – take $\log(\exp(\cdot))$ of the integrand
  – Taylor expand around the MLE
  – recognize a Gaussian integral
  – take logs
  – deal with the Hessian term using the SLLN
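As a concrete sketch (my own; it assumes Gaussian linear models with the noise variance profiled out of the likelihood, which the notes don’t spell out), BIC-based selection just penalizes the maximized log-likelihood by $\frac{|\theta|}{2} \log n$:

```python
import numpy as np

def gaussian_bic(X, y):
    """-2 log L at the MLE + |theta| * log n, up to an additive constant."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return n * np.log(rss / n) + k * np.log(n)   # pick the model minimizing this
```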
1.9 Stein’s unbiased risk estimate (SURE)
• (just mentioning, not giving details here)
• suppose $X$ is a vector with mean $\mu$ and variance $\sigma^2 I$
• we estimate $\mu$ with $\hat\mu = X + g(X)$, where $g$ must be almost differentiable
• then
$$R = E\|\hat\mu - \mu\|^2 = n\sigma^2 + E\big[\|g(X)\|^2 + 2\sigma^2\, \mathrm{div}\, g(X)\big]$$
• where $\mathrm{div}\, g(X) = \sum_i \frac{\partial}{\partial X_i} g_i(X) = \mathrm{Tr}\left(\frac{\partial g}{\partial X}\right)$ (trace of the Jacobian)
• note that the quantity inside the expectation is computable from $X$ alone, without knowing $\mu$
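As a quick numerical sanity check (my own illustration, not in the notes): for the linear shrinkage $g(X) = -cX$ we have $\mathrm{div}\, g(X) = -cn$, and the SURE expression above matches the Monte Carlo risk:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, c = 50, 1.0, 0.3
mu = rng.normal(size=n)

risks, sures = [], []
for _ in range(20000):
    X = mu + np.sqrt(sigma2) * rng.normal(size=n)
    mu_hat = (1 - c) * X                          # i.e. g(X) = -c X
    risks.append(np.sum((mu_hat - mu) ** 2))
    # SURE = n*sigma^2 + ||g(X)||^2 + 2*sigma^2*div g(X); needs no knowledge of mu
    sures.append(n * sigma2 + c**2 * np.sum(X**2) + 2 * sigma2 * (-c * n))
print(f"Monte Carlo risk: {np.mean(risks):.2f}, mean SURE: {np.mean(sures):.2f}")
```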
1.10 SURE in action
• leads to the James-Stein estimator
2 Multiple hypothesis testing
2.1 The setup
• we have hypotheses $H_i$, $i = 1, \ldots, n$, to test
• we want to control the quality of our conclusions, using one of these metrics:
  – FWER: $P(\text{we make at least one false rejection})$
  – FDR: $E\left[\frac{\#\text{ of false rejections}}{\#\text{ of total rejections}}\right]$

2.2 Why do we need it?
• Suppose you don’t even care about making scientific conclusions. You just want to do good prediction.
• You can think of model selection as a way to induce sparsity.
• (Candes calls it “testimation”)
• Back to the thresholding example
2.3 Controlling FDR using Benjamini Hochberg (BH)
• if we have time
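Since this section is left open, here is a standard sketch of the BH step-up procedure (my own summary of the usual algorithm, not taken from the notes): sort the p-values, find the largest $k$ with $p_{(k)} \le \frac{k q}{n}$, and reject the $k$ smallest.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level q."""
    pvals = np.asarray(pvals)
    n = len(pvals)
    order = np.argsort(pvals)
    thresh = q * np.arange(1, n + 1) / n          # k*q/n for k = 1..n
    below = pvals[order] <= thresh
    reject = np.zeros(n, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])          # largest index meeting the bound
        reject[order[:k + 1]] = True              # reject the k+1 smallest p-values
    return reject
```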
2.4 Proof of BH
• if we have time
3 References
• http://nscs00.ucmerced.edu/~nkumar4/BhatKumarBIC.pdf
• The STATS 300 sequence, with thanks to Profs. Candes, Siegmund and Romano!