Model selection and multiple hypothesis testing
Kelvin Gu

My presentation notes for feature selection and multiple hypothesis testing.
Contents

1 Model selection
  1.1 Setup
  1.2 Motivation
  1.3 RSS-d.o.f. decomposition of the PE
  1.4 RSS-d.o.f. decomposition in action
  1.5 Bias-variance decomposition of the risk
  1.6 Bias-variance decomposition in action
  1.7 Two reasons why the methods above aren’t ideal
  1.8 Bayesian Info Criterion
  1.9 Stein’s unbiased risk estimate (SURE)
  1.10 SURE in action

2 Multiple hypothesis testing
  2.1 The setup
  2.2 Why do we need it?
  2.3 Controlling FDR using Benjamini Hochberg (BH)
  2.4 Proof of BH

3 References
1 Model selection
1.1 Setup
• Suppose we know $X = x$. We want to predict the value of $Y$.
• Define the prediction error to be $\mathrm{PE} = (Y - f(X))^2$
• We want to choose some function $f$ that minimizes the objective $E[\mathrm{PE} \mid X = x]$
  – the optimal solution is $\mu(x) = E[Y \mid X = x]$
• As a proxy for minimizing $E[(Y - f(X))^2 \mid X = x]$, we’ll minimize the risk: $R = E[(\mu(X) - f(X))^2]$
  – note that
$$E\{E[\mathrm{PE} \mid X = x]\} = E[\mathrm{PE}] = E[(\mu(X) + \varepsilon(X) - f(X))^2] = E[(\mu(X) - f(X))^2] + E[\varepsilon(X)^2] = R + \mathrm{Var}(Y)$$
  – so, the risk $R$ is a reasonable proxy to optimize
  – $\mathrm{Var}(Y)$ is unavoidable
• For notational convenience, we’ll write $\hat\mu = f(X)$ for the estimate and $\mu = \mu(X)$ for the truth
1.2 Motivation
• Why can’t we just use cross-validation for all tasks?
• The problem:
  – Suppose we’re doing ordinary least squares with $p = 30$ predictors (inputs)
  – we want to select a subset of the $p$ predictors with the smallest EPE
  – for each subset of predictors, we fit the model and then test on some held-out test set
  – there are $\binom{p}{2} = 435$ models of size 2, and $\binom{p}{15} = 155{,}117{,}520$ models of size 15
  – even if most of the size-15 models are terrible, after 155,117,520 opportunities, you’ll probably find one that fits the test data better than any of the size-2 models
  – This is “second-order” overfitting.
  – let $\mathcal{M}_{15}$ be the set of all size-15 models
$$\underbrace{E\left[\min_{m \in \mathcal{M}_{15}} \mathrm{PE}(m)\right]}_{\text{cross validation thinks you get this}} \;\ll\; \underbrace{\min_{m \in \mathcal{M}_{15}} E[\mathrm{PE}(m)]}_{\text{you actually get this}}$$
  – even if you have the computational power to try all models, it’s still a bad idea (without some modification); see the simulation sketch after this list
• How will we address this?
  – find better ways to estimate PE, and add an additional penalty to account for the overfitting problem presented above
  – it turns out that we need a penalty which depends not only on the model size $p$, but also the data size $n$
• Other ways to address this:
  – just avoid searching over a high-dimensional model space in the first place (e.g. ridge regression and LASSO both offer just a single parameter to vary)
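To make the second-order overfitting effect concrete, here is a minimal simulation sketch (my own illustration; the pure-noise setup and all names here are assumptions, not from the notes). Every model has the same true PE, yet the best of many size-15 models looks better on held-out data than a size-2 model:

```python
# Pure-noise predictors: no model has any real signal, so all held-out
# differences are luck. Searching over many models harvests that luck.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 30
X, y = rng.normal(size=(n, p)), rng.normal(size=n)        # training data (pure noise)
X_te, y_te = rng.normal(size=(n, p)), rng.normal(size=n)  # held-out test data

def heldout_mse(cols):
    """Fit OLS on the chosen columns, return held-out MSE."""
    beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    return np.mean((y_te - X_te[:, cols] @ beta) ** 2)

mse_size2 = heldout_mse([0, 1])                            # one arbitrary size-2 model
subsets = [rng.choice(p, 15, replace=False) for _ in range(2000)]
best_size15 = min(heldout_mse(s) for s in subsets)         # best of 2000 size-15 models
print(f"size-2 test MSE: {mse_size2:.3f}; best size-15 test MSE: {best_size15:.3f}")
```

With high probability the second number comes out smaller, even though no model is actually better.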
1.3 RSS-d.o.f. decomposition of the PE
• We just saw that expected prediction error could be decomposed as $E[\mathrm{PE}] = R + \mathrm{Var}(Y)$. Here is another decomposition.
• Let $(X, Y)$ be the training data, and let $(X, Y^*)$ be the test data.
  – Note that $X$ is the same in both cases but $Y$ is not!
  – this proof doesn’t work if the $X$ is different in training and test
• To make matters simple, just assume $X, Y \in \mathbb{R}$. It easily generalizes to the vector case.
• Our prediction for $Y^*$ is $\hat\mu$, which is a function of $(X, Y)$ because we trained on $(X, Y)$. To emphasize this, we’ll write $\hat\mu = \hat\mu_{X,Y}$.
• The prediction error (PE) is $(\hat\mu_{X,Y} - Y^*)^2$, and we have
$$\underbrace{E[(\hat\mu_{XY} - Y^*)^2]}_{E[\mathrm{PE}]} = \underbrace{E[(\hat\mu_{XY} - Y)^2]}_{E[\mathrm{RSS}]} + \underbrace{2\,\mathrm{Cov}(\hat\mu_{XY}, Y)}_{\text{d.o.f.}}$$
• Proof:
$$\begin{aligned}
E[(\hat\mu_{XY} - Y)^2] &= E[(\hat\mu_{XY} - \mu + \mu - Y)^2] \\
\underbrace{E[(\hat\mu_{XY} - Y)^2]}_{E[\mathrm{RSS}]} &= \underbrace{E[(\hat\mu_{XY} - \mu)^2] + E[(\mu - Y)^2]}_{E[\mathrm{PE}]} - \underbrace{2\,E[(\hat\mu_{XY} - \mu)(Y - \mu)]}_{\text{d.o.f.}}
\end{aligned}$$
  – The first two terms on the right sum to $E[\mathrm{PE}]$ because:
$$\begin{aligned}
E[\mathrm{PE}] = E[(\hat\mu_{XY} - Y^*)^2] &= E[(\hat\mu_{XY} - \mu + \mu - Y^*)^2] \\
&= E[(\hat\mu_{XY} - \mu)^2 + 2(\hat\mu_{XY} - \mu)(\mu - Y^*) + (\mu - Y^*)^2] \\
&= E[(\hat\mu_{XY} - \mu)^2] + E[(\mu - Y^*)^2]
\end{aligned}$$
    ∗ A key thing to note is that $E[(\hat\mu_{XY} - \mu)(\mu - Y^*)] = E[\hat\mu_{XY} - \mu]\,E[\mu - Y^*] = 0$
    ∗ the expectation factorizes because $Y^*$ and $(X, Y)$ are independent, and $E[\mu - Y^*] = 0$
    ∗ note also that $E[(\mu - Y^*)^2] = E[(\mu - Y)^2]$ since $Y$ and $Y^*$ are identically distributed
    ∗ finally, $E[(\hat\mu_{XY} - \mu)(Y - \mu)] = \mathrm{Cov}(\hat\mu_{XY}, Y)$ (treating $X$ as fixed), which gives the identity above after rearranging
1.4 RSS-d.o.f. decomposition in action
• Suppose we’re fitting a linear model $\hat\mu = HY$. Then we can compute $\mathrm{Tr}(\mathrm{Cov}(\hat\mu, Y))$:
$$\mathrm{Tr}(\mathrm{Cov}(HY, Y)) = \mathrm{Tr}(H\,\mathrm{Cov}(Y, Y)) = \mathrm{Tr}(H\Sigma)$$
• if $H = X(X^T X)^{-1} X^T$ and $\Sigma = \sigma^2 I$, we can make this even more explicit:
$$\mathrm{Tr}(H\Sigma) = \sigma^2\,\mathrm{rank}(X)$$
• $\|\hat\mu - Y\|^2 + 2\sigma^2\,\mathrm{rank}(X)$ is called the $C_p$ statistic
• We will see that model selection using $C_p$ has the same problems as cross-validation.
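As a small illustration (my own sketch; `cp_statistic` is a hypothetical helper, not from the notes), the $C_p$ statistic for an OLS fit can be computed directly from the formula above:

```python
import numpy as np

def cp_statistic(X, y, sigma2):
    """C_p = ||y - mu_hat||^2 + 2 * sigma2 * rank(X) for the OLS fit mu_hat = H y."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return rss + 2 * sigma2 * np.linalg.matrix_rank(X)
```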
1.5 Bias-variance decomposition of the risk
• Returning to $E[\mathrm{PE}] = R + \mathrm{Var}(Y)$
• We can decompose the risk term $R$ further:
$$\begin{aligned}
R = E[(\hat\mu - \mu)^2] &= E[(\hat\mu - E\hat\mu + E\hat\mu - \mu)^2] \\
&= E[(\hat\mu - E\hat\mu)^2 + 2(\hat\mu - E\hat\mu)(E\hat\mu - \mu) + (E\hat\mu - \mu)^2] \\
&= E[(\hat\mu - E\hat\mu)^2] + (E\hat\mu - \mu)^2 \\
&= \mathrm{Var}(\hat\mu) + \mathrm{Bias}(\hat\mu)^2
\end{aligned}$$
• same trick: expand and kill the cross term: $E[(\hat\mu - E\hat\mu)(E\hat\mu - \mu)] = 0 \cdot (E\hat\mu - \mu)$
• Intuition: as you increase model size, bias tends to go down but variance goes up (overfitting)
1.6 Bias-variance decomposition in action
• Denoising problem:
  – we observe $y \sim N(\mu, I)$
  – we think $\mu$ has sparsity, so we try to recover it by solving: $\min_{\hat y} \|y - \hat y\|^2 + 2\sigma^2 \|\hat y\|_0$
  – Note that the penalty $2\sigma^2 \|\hat y\|_0$ is just the $C_p$ penalty, $2\sigma^2\,\mathrm{rank}(X)$, when $X = I$
• bias-variance tradeoff:
  – consider each $\hat y_i$ separately
  – either you set $\hat y_i = y_i$ and pay $2\sigma^2$ (variance)
  – or you set $\hat y_i = 0$ and pay $y_i^2$ (bias)
• the solution is hard-thresholding (see the sketch below):
$$\hat y_i = \begin{cases} 0 & |y_i| \le \sqrt{2}\,\sigma \\ y_i & |y_i| > \sqrt{2}\,\sigma \end{cases}$$
• We can solve this problem even though it has an $L_0$ penalty. You would think this procedure should be good, right? Nope. It tends to keep too many entries (setting $\hat y_i = y_i$ when $\mu_i = 0$).
• We will see that model selection using a constant penalty on the $L_0$ norm suffers from the same problems as $C_p$ and cross-validation
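Here is a minimal sketch of the hard-thresholding rule just derived (my own code; `hard_threshold` is a hypothetical name):

```python
import numpy as np

def hard_threshold(y, sigma):
    """Coordinatewise solution of min ||y - yhat||^2 + 2*sigma^2*||yhat||_0."""
    return np.where(np.abs(y) > np.sqrt(2) * sigma, y, 0.0)
```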
1.7 Two reasons why the methods above aren’t ideal
• (Continuing the denoising problem)
1. Other estimators can achieve better prediction error
   • suppose the real $\mu = 0$
   • then our risk is: $E\|\hat y - \mu\|^2 = \sum_i E\big[y_i^2\, I_{|y_i| > \sqrt{2}\sigma}\big] \approx 0.57\, p\, \sigma^2$
   • consider instead the James-Stein estimator, which achieves risk $2\sigma^2$ here (see the comparison sketch below):
     – $\hat\mu_{JS} = \left[1 - \frac{p-2}{\|Y\|^2}\right] Y$
     – The following bound on the risk of $\hat\mu_{JS}$ is offered without proof:
$$E\|\hat\mu_{JS} - \mu\|^2 \le \sigma^2 \left( p - \frac{p-2}{1 + \frac{\|\mu\|^2}{p-2}} \right)$$
     – plugging in $\|\mu\|^2 = 0$, we get a bound of $2\sigma^2$. It doesn’t even depend on $p$!
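To see the gap numerically, here is a small simulation sketch (my own, assuming $\mu = 0$, $\sigma = 1$, $p = 100$) comparing hard thresholding with the James-Stein estimator defined above:

```python
import numpy as np

rng = np.random.default_rng(0)
p, trials = 100, 5000
risk_ht = risk_js = 0.0
for _ in range(trials):
    y = rng.normal(size=p)                          # Y ~ N(0, I), so the true mu = 0
    ht = np.where(np.abs(y) > np.sqrt(2), y, 0.0)   # hard thresholding at sqrt(2)*sigma
    js = (1 - (p - 2) / np.sum(y**2)) * y           # James-Stein shrinkage
    risk_ht += np.sum(ht**2) / trials               # squared error vs. mu = 0
    risk_js += np.sum(js**2) / trials
print(f"hard thresholding risk ~ {risk_ht:.1f} (about 0.57p); James-Stein risk ~ {risk_js:.1f} (<= 2)")
```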
2. They aren’t consistent (this is an $n \to \infty$ argument)
   • suppose we get more iid observations $y_i \sim N(\mu, I)$, $i = 1, \ldots, n$
   • suppose that $\mu$ has sparsity $k^*$ (say, the first $k^*$ coordinates are nonzero)
   • Now we want to minimize $C_p = \sum_{i=1}^n \|y_i - \hat y\|^2 + 2\sigma^2 \|\hat y\|_0$
   • Consistency would require that $P(C_p(k^*) < C_p(k)) \to 1$ for all $k \neq k^*$
   • We’re going to vary the parameter $k$ (keeping $\hat y_j = \bar y_j$ for $j \le k$ and $\hat y_j = 0$ otherwise) and compute $C_p(k)$ for each $k$
$$\begin{aligned}
C_p(k) &= \sum_{i=1}^n \|y_i - \hat y\|^2 + 2\sigma^2 \|\hat y\|_0
= \sum_{i=1}^n \left[ \sum_{j=1}^k (y_{ij} - \bar y_j)^2 + \sum_{j=k+1}^p y_{ij}^2 \right] + 2\sigma^2 k \\
&= \sum_{j=1}^k \sum_{i=1}^n (y_{ij} - \bar y_j)^2 + \sum_{j=k+1}^p \sum_{i=1}^n y_{ij}^2 + 2\sigma^2 k \\
&= \sum_{j=1}^k \left( \sum_{i=1}^n y_{ij}^2 - n \bar y_j^2 \right) + \sum_{j=k+1}^p \sum_{i=1}^n y_{ij}^2 + 2\sigma^2 k \\
&= \sum_{j=1}^p \sum_{i=1}^n y_{ij}^2 - n \sum_{j=1}^k \bar y_j^2 + 2\sigma^2 k \\
&= -n \sum_{j=1}^k \bar y_j^2 + 2\sigma^2 k + \underbrace{\sum_{i=1}^n \|y_i\|^2}_{\text{constant for all } k}
\end{aligned}$$
• consider $k > k^*$, so the extra coordinates $j = k^* + 1, \ldots, k$ are null ($\mu_j = 0$):
$$C_p(k) - C_p(k^*) = 2\sigma^2 (k - k^*) - \sum_{j=k^*+1}^{k} n \bar y_j^2$$
• each $\bar y_j \sim N(\mu_j, \frac{1}{n})$; for these null coordinates $\mu_j = 0$, so $n \bar y_j^2 \sim \chi^2_1$ and $\sum_{j=k^*+1}^{k} n \bar y_j^2 \sim \chi^2_{k - k^*}$
• this expression doesn’t depend on $n$ anymore, so there’s always a positive probability that $C_p(k^*) > C_p(k)$
• this proof seems slightly fishy, because for each model size $k$, I’m arbitrarily picking the first $k$ entries to threshold
  – but you can replace all the sums up to $k$ with sums over any size-$k$ subset
  – now compute $C_p$ over all subsets, and you’ll get the same result (a simulation sketch follows below)
• in contrast, the Bayesian information criterion works
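A quick Monte Carlo sketch of the inconsistency argument (my own check; the values $\mu = (5, 0, \ldots, 0)$, $p = 10$, $\sigma^2 = 1$ are assumptions): the probability that the $C_p$ minimizer overshoots $k^*$ does not shrink as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k_star, sigma2, trials = 10, 1, 1.0, 2000
mu = np.zeros(p)
mu[:k_star] = 5.0

for n in [10, 100, 1000, 10000]:
    overfit = 0
    for _ in range(trials):
        ybar = mu + rng.normal(size=p) / np.sqrt(n)   # ybar_j ~ N(mu_j, 1/n)
        order = np.argsort(-ybar**2)                  # best size-k subset = k largest coords
        cp = [-n * np.sum(ybar[order[:k]]**2) + 2 * sigma2 * k for k in range(p + 1)]
        overfit += int(np.argmin(cp) > k_star)
    print(f"n = {n:6d}: P(selected k > k*) ~ {overfit / trials:.2f}")
```

The overfitting frequency stays roughly flat in $n$, matching the $\chi^2$ argument above.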
1.8 Bayesian Info Criterion
• We have a collection of models to select from, $\{M_i\}$, indexed by $i$
• Model $M_i$ has a parameter vector $\theta_i$ associated with it. Let $|\theta_i|$ denote the dimension of $\theta_i$.
• Pick the model with the highest marginal probability: $P(y \mid M_i) = \int f(y \mid \theta_i)\, g_i(\theta_i)\, d\theta_i$
• $\log P(y \mid M_i) \approx \log L_{\hat\theta_i}(y) - \frac{|\theta_i|}{2} \log n$
• derivation steps:
  – take $\log(\exp(\cdot))$ of the integrand
  – Taylor expand around the MLE
  – recognize a Gaussian integral
  – take logs
  – deal with the Hessian term using the SLLN
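As a concrete sketch (my own; it assumes Gaussian linear models with the noise variance profiled out of the likelihood, which the notes don’t spell out), BIC-based selection just penalizes the maximized log-likelihood by $\frac{|\theta|}{2} \log n$:

```python
import numpy as np

def gaussian_bic(X, y):
    """-2 log L at the MLE + |theta| * log n, up to an additive constant."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return n * np.log(rss / n) + k * np.log(n)   # pick the model minimizing this
```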
1.9 Stein’s unbiased risk estimate (SURE)
• (just mentioning, not giving details here)
• suppose $X$ is a vector with mean $\mu$ and variance $\sigma^2 I$
• we estimate $\mu$ with $\hat\mu = X + g(X)$, where $g$ must be almost differentiable
• then
$$R = E\|\hat\mu - \mu\|^2 = n\sigma^2 + E\big[\|g(X)\|^2 + 2\sigma^2\, \mathrm{div}\, g(X)\big]$$
• where $\mathrm{div}\, g(X) = \sum_i \frac{\partial}{\partial X_i} g_i(X) = \mathrm{Tr}\left(\frac{\partial g}{\partial X}\right)$ (trace of the Jacobian)
• note that the quantity inside the expectation is computable from $X$ alone, without knowing $\mu$
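As a quick numerical sanity check (my own illustration, not in the notes): for the linear shrinkage $g(X) = -cX$ we have $\mathrm{div}\, g(X) = -cn$, and the SURE expression above matches the Monte Carlo risk:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, c = 50, 1.0, 0.3
mu = rng.normal(size=n)

risks, sures = [], []
for _ in range(20000):
    X = mu + np.sqrt(sigma2) * rng.normal(size=n)
    mu_hat = (1 - c) * X                          # i.e. g(X) = -c X
    risks.append(np.sum((mu_hat - mu) ** 2))
    # SURE = n*sigma^2 + ||g(X)||^2 + 2*sigma^2*div g(X); needs no knowledge of mu
    sures.append(n * sigma2 + c**2 * np.sum(X**2) + 2 * sigma2 * (-c * n))
print(f"Monte Carlo risk: {np.mean(risks):.2f}, mean SURE: {np.mean(sures):.2f}")
```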
1.10 SURE in action
• leads to the James-Stein estimator
2 Multiple hypothesis testing
2.1 The setup
• we have hypotheses $H_i$, $i = 1, \ldots, n$, to test
• we want to control the quality of our conclusions, using one of these metrics:
  – FWER: $P(\text{we make at least one false rejection})$
  – FDR: $E\left[\frac{\#\text{ of false rejections}}{\#\text{ of total rejections}}\right]$

2.2 Why do we need it?
• Suppose you don’t even care about making scientific conclusions. You just want to do good prediction.
• You can think of model selection as a way to induce sparsity.
• (Candes calls it “testimation”)
• Back to the thresholding example
2.3 Controlling FDR using Benjamini Hochberg (BH)
• if we have time
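Since this section is left open, here is a standard sketch of the BH step-up procedure (my own summary of the usual algorithm, not taken from the notes): sort the p-values, find the largest $k$ with $p_{(k)} \le \frac{k q}{n}$, and reject the $k$ smallest.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level q."""
    pvals = np.asarray(pvals)
    n = len(pvals)
    order = np.argsort(pvals)
    thresh = q * np.arange(1, n + 1) / n          # k*q/n for k = 1..n
    below = pvals[order] <= thresh
    reject = np.zeros(n, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])          # largest index meeting the bound
        reject[order[:k + 1]] = True              # reject the k+1 smallest p-values
    return reject
```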
2.4 Proof of BH
• if we have time
3 References
• http://nscs00.ucmerced.edu/~nkumar4/BhatKumarBIC.pdf
• The STATS 300 sequence, with thanks to Profs. Candes, Siegmund and Romano!