
High-Dimensional Variable Selection in Nonlinear Models that Controls the False Discovery Rate

Lucas Janson

Harvard University, Department of Statistics

CMSA Big Data Conference, August 18, 2017

Collaborators: Emmanuel Candès (Stanford), Yingying Fan, Jinchi Lv (USC)


Problem Statement


Controlled Variable Selection

Given:

Y an outcome of interest (AKA response or dependent variable),

X1, . . . , Xp a set of p potential explanatory variables (AKA covariates, features, or independent variables),

How can we select important explanatory variables with few mistakes?

Applications to:

Medicine/genetics/health care

Economics/political science

Industry/technology


Controlled Variable Selection (cont’d)

What is an important variable?

We consider Xj to be unimportant if the conditional distribution of Y given X1, . . . , Xp does not depend on Xj. Formally, Xj is unimportant if it is conditionally independent of Y given X−j:

$$Y \perp\!\!\!\perp X_j \mid X_{-j}$$

Markov blanket of Y: the smallest set S such that $Y \perp\!\!\!\perp X_{-S} \mid X_S$

For GLMs with no stochastically redundant covariates, this is equivalent to $\{j : \beta_j \neq 0\}$

To make sure we do not make too many mistakes, we seek to select a set S to control the false discovery rate (FDR):

$$\mathrm{FDR}(S) = \mathbb{E}\left[\frac{\#\{j \in S : X_j \text{ unimportant}\}}{\#\{j \in S\}}\right] \le q \quad (\text{e.g. } 10\%)$$

“Here is a set of variables S, 90% of which I expect to be important”
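In simulations, where the set of unimportant variables is known, the quantity inside the expectation above (the false discovery proportion) can be computed directly. A minimal Python sketch (function and argument names are mine):

```python
def false_discovery_proportion(selected, unimportant):
    """FDP of a selected set S: fraction of selections that are unimportant.

    The FDR is the expectation of this quantity over repeated experiments;
    the ground-truth `unimportant` set is only available in simulations.
    """
    selected, unimportant = set(selected), set(unimportant)
    if not selected:
        return 0.0  # convention: an empty selection makes no false discoveries
    return len(selected & unimportant) / len(selected)
```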


Sneak Peek

New interpretation of knockoffs solves the controlled variable selection problem

Allows any model for Y and X1, . . . , Xp

Allows any dimension (including p > n)

Finite-sample control (non-asymptotic) of FDR

Practical performance on real problems

Analysis of the genetic basis of Crohn’s Disease (WTCCC, 2007)

≈ 5,000 subjects (≈ 40% with Crohn's Disease)

≈ 375,000 single nucleotide polymorphisms (SNPs) for each subject

The original analysis of the data made 9 discoveries by running marginal tests and selecting p-values to target an FDR of 10%

Model-free knockoffs used the same FDR of 10% and made 18 discoveries, withmany of the new discoveries confirmed by a larger meta-analysis


Review of Methods for Controlled Variable Selection

What is required for valid inference?

            Low dimensions   Model for Y   Asymptotic regime   Sparsity   Random design
OLSp+BHq    Yes              Yes           No                  No         No
MLp+BHq     Yes              Yes           Yes                 No         No
HDp+BHq     No               Yes           Yes                 Yes        Yes
Orig KnO    Yes              Yes           No                  No         No
New KnO     No               No            No                  No         Yes*


The Knockoffs Idea


Knockoffs (Barber and Candès, 2015)

y and Xj are n × 1 column vectors of data: n draws from the random variables Y and Xj, respectively; design matrix X := [X1 · · · Xp]

(1) Construct knockoffs: the knockoffs X̃j must satisfy (with X̃ := [X̃1 · · · X̃p])

$$[X\ \tilde{X}]^\top [X\ \tilde{X}] = \begin{bmatrix} X^\top X & X^\top X - \mathrm{diag}\{s\} \\ X^\top X - \mathrm{diag}\{s\} & X^\top X \end{bmatrix}$$

(2) Compute knockoff statistics:

Sufficiency: Wj only a function of [X X̃]⊤[X X̃] and [X X̃]⊤y

Antisymmetry: swapping the values of Xj and X̃j flips the sign of Wj

(3) Find the knockoff threshold: (a code sketch follows the comments below)

Order the variables by decreasing |Wj| and proceed down the list

Select only variables with positive Wj until the last time #negatives / #positives ≤ q

Comments:

Finite-sample FDR control and leverages sparsity for power

Requires the data to follow a low-dimensional (n ≥ p) Gaussian linear model

Canonical approach: condition on X, rely heavily on model for y
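Step (3) is simple enough to state in code. A minimal Python sketch (function names mine), including the +1 offset of the knockoff+ variant from the Barber and Candès paper, which gives exact finite-sample FDR control:

```python
import numpy as np

def knockoff_threshold(W, q=0.10, offset=1):
    """Step (3): smallest t with (offset + #{W_j <= -t}) / #{W_j >= t} <= q.

    offset=1 is the knockoff+ rule (exact FDR control); offset=0 is the
    original rule, which controls a slightly modified FDR.
    """
    ts = np.sort(np.abs(W[W != 0]))  # candidate thresholds: the |W_j| values
    for t in ts:
        if (offset + np.sum(W <= -t)) / max(1, np.sum(W >= t)) <= q:
            return t
    return np.inf  # no feasible threshold: select nothing

def knockoff_select(W, q=0.10):
    """Indices of the variables with W_j at or above the threshold."""
    return np.flatnonzero(W >= knockoff_threshold(W, q))
```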


Generalizing the Knockoffs Procedure

(1) Construct knockoffs:

Artificial versions (“knockoffs”) of each variable

Act as controls for assessing the importance of the original variables

(2) Compute knockoff statistics:

Scalar statistic Wj for each variable

Measures how much more important a variable appears than its knockoff

Positive Wj denotes the original is more important; strength measured by magnitude

(3) Find the knockoff threshold: (same as before)

Order the variables by decreasing |Wj| and proceed down the list

Select only variables with positive Wj until the last time #negatives / #positives ≤ q

Coin-flipping property: the key to knockoffs is that steps (1) and (2) are done specifically to ensure that, conditional on |W1|, . . . , |Wp|, the signs of the unimportant/null Wj are independently ±1 with probability 1/2
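The coin-flipping property is exactly what makes the threshold in step (3) work: null variables are as likely to land below −t as above +t, so the count of negatives estimates the number of false positives among the selections. A small self-contained Python simulation (all names and parameter values are mine, chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
p, nulls, q, reps = 1000, 900, 0.10, 200  # first 900 variables are null
fdps = []
for _ in range(reps):
    mag = np.abs(rng.standard_normal(p))           # |W_j|, arbitrary magnitudes
    sign = np.where(rng.random(p) < 0.5, -1, 1)    # null signs: fair coin flips
    sign[nulls:] = 1                                # non-nulls: positive W_j ...
    W = mag * sign
    W[nulls:] += 2.0                                # ... with boosted magnitude
    # knockoff+ threshold from step (3)
    ts = np.sort(np.abs(W))
    t = next((t for t in ts
              if (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t)) <= q), np.inf)
    S = np.flatnonzero(W >= t)
    fdps.append(np.mean(S < nulls) if S.size else 0.0)
print("estimated FDR:", np.mean(fdps))  # should land at or below q = 0.10
```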


New Interpretation of Knockoffs


Knockoffs Without a Model for Y (Candès et al., 2016)

Instead of modeling y and conditioning on X, condition on y and model X (shifts the burden of knowledge from y onto X)

Explicitly,

$$\text{rows of } X = (X_{i,1}, \ldots, X_{i,p}) \overset{\text{iid}}{\sim} G,$$

where G can be arbitrary but is assumed known (estimating G is sketched at the end of this slide)

As compared to original knockoffs, removes

Restriction on dimension

Linear model requirement for Y | X1, . . . , Xp

“Sufficiency” constraint for Wj

The rows of X must be i.i.d., not the columns (covariates)

Nothing about y’s distribution is assumed or need be known

Robust to overfitting X’s distribution in preliminary experiments
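In practice G is estimated rather than known, which is what the robustness experiments below probe by varying the quality of a covariance estimate. A minimal Python sketch of the two estimators used there (function name mine):

```python
from sklearn.covariance import GraphicalLassoCV, empirical_covariance

def estimate_covariance(X, sparse=True):
    """Estimate the covariance of the covariate distribution G from data.

    sparse=True uses the graphical lasso with a cross-validated penalty,
    appropriate when the precision matrix is believed sparse; otherwise
    fall back to the plain empirical covariance.
    """
    if sparse:
        return GraphicalLassoCV().fit(X).covariance_
    return empirical_covariance(X)
```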


Robustness

[Figure: two panels plotting Power (left) and FDR (right) against the relative Frobenius norm error of the covariance estimate used to construct the knockoffs; the plotted methods are the exact covariance, the graphical lasso, and the empirical covariance computed on 50%, 62.5%, 75%, 87.5%, and 100% subsamples.]

Figure: Covariates are AR(1) with autocorrelation coefficient 0.3. n = 800, p = 1500, and target FDR is 10%. Y comes from a binomial linear model with logit link function with 50 nonzero entries.

Shifting the Burden of Knowledge

When is it appropriate?

1. Subjects sampled from a population, and

2a. Xj highly structured, well-studied, or well-understood, OR

2b. Large set of unsupervised X data (without Y ’s)

For instance, many genome-wide association studies satisfy all conditions:

1. Subjects sampled from a population (oversampling cases still valid)

2a. Strong spatial structure: linkage disequilibrium models, e.g., Markov chains, are well-studied and work well

2b. Other studies have collected same or similar SNP arrays on different subjects


The New Knockoffs Procedure

(1) Construct knockoffs: exchangeability

$$[X_1 \cdots X_j \cdots X_p \;\; \tilde{X}_1 \cdots \tilde{X}_j \cdots \tilde{X}_p] \overset{D}{=} [X_1 \cdots \tilde{X}_j \cdots X_p \;\; \tilde{X}_1 \cdots X_j \cdots \tilde{X}_p]$$

(2) Compute knockoff statistics:

Variable importance measure Z

Antisymmetric function fj : R² → R, i.e., fj(z1, z2) = −fj(z2, z1) (two common choices are sketched after this slide)

Wj = fj(Zj, Z̃j), where Zj and Z̃j are the variable importances of Xj and X̃j, respectively

(3) Find the knockoff threshold: (same as before)

Order the variables by decreasing |Wj| and proceed down the list

Select only variables with positive Wj until the last time #negatives / #positives ≤ q
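Any antisymmetric fj works. Two common choices, sketched in Python (function names mine):

```python
import numpy as np

def difference(z, z_tilde):
    """f(z1, z2) = z1 - z2: the choice behind the LCD statistic."""
    return z - z_tilde

def signed_max(z, z_tilde):
    """f(z1, z2) = max(z1, z2) * sign(z1 - z2); antisymmetric, 0 on ties."""
    return np.maximum(z, z_tilde) * np.sign(z - z_tilde)
```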


Step (1): Construct Knockoffs


Knockoff Construction

Proof that valid knockoff variables can be generated for any X distribution

If (X1, . . . , Xp) is multivariate Gaussian, exchangeability reduces to matching first and second moments when Xj and X̃j are swapped

For Cov(X1, . . . , Xp) = Σ:

$$\mathrm{Cov}(X_1, \ldots, X_p, \tilde{X}_1, \ldots, \tilde{X}_p) = \begin{bmatrix} \Sigma & \Sigma - \mathrm{diag}\{s\} \\ \Sigma - \mathrm{diag}\{s\} & \Sigma \end{bmatrix}$$

For non-Gaussian X, this still gives second-order-correct approximate knockoffs (a sampling sketch follows this slide)

Linear algebra and semidefinite programming to find good s

Recently: construction for Markov chains and HMMs (Sesia et al., 2017)

Constructions also possible for grouped variables (Dai and Barber, 2016)
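In the Gaussian case the construction is fully explicit: conditional on X, the knockoffs are again Gaussian. A minimal Python sketch (function name mine) using the simple equicorrelated choice of s, which assumes Σ is a correlation matrix; the semidefinite-programming choice mentioned above generally yields more powerful knockoffs:

```python
import numpy as np

def gaussian_knockoffs(X, mu, Sigma, seed=None):
    """Sample exact Gaussian knockoffs for rows of X drawn iid from N(mu, Sigma).

    Equicorrelated construction s_j = min(1, 2 * lambda_min(Sigma)), which
    assumes Sigma has unit diagonal; X is (n, p), returns an (n, p) array.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    s = min(1.0, 2.0 * np.linalg.eigvalsh(Sigma)[0]) * np.ones(p)
    D = np.diag(s)  # diag{s} keeps the 2p x 2p joint covariance PSD
    Sigma_inv = np.linalg.inv(Sigma)
    # Conditional law of the knockoffs given X, from the joint covariance
    # [[Sigma, Sigma - D], [Sigma - D, Sigma]]:
    #   Xtilde | X ~ N(mu + (Sigma - D) Sigma^{-1} (X - mu), 2D - D Sigma^{-1} D)
    cond_mean = mu + (X - mu) @ Sigma_inv @ (Sigma - D)
    cond_cov = 2 * D - D @ Sigma_inv @ D
    L = np.linalg.cholesky(cond_cov + 1e-10 * np.eye(p))  # jitter for stability
    return cond_mean + rng.standard_normal((n, p)) @ L.T
```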


Step (2): Compute Knockoff Statistics


Strategy for Choosing Knockoff Statistics

Recall that Wj is an antisymmetric function fj of Zj and Z̃j (the variable importances of Xj and X̃j, respectively):

$$W_j = f_j(Z_j, \tilde{Z}_j) = -f_j(\tilde{Z}_j, Z_j)$$

For example,

Z is the magnitude of a fitted coefficient β̂ from a lasso regression of y on [X X̃]

fj(z1, z2) = z1 − z2

Lasso Coefficient Difference (LCD) statistic:

$$W_j = |\hat{\beta}_j| - |\hat{\tilde{\beta}}_j|$$
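A minimal Python sketch of the LCD statistic (function name mine; this assumes a continuous response, and for binary outcomes like the Crohn's data one would swap in ℓ1-penalized logistic regression):

```python
import numpy as np
from sklearn.linear_model import LassoCV

def lcd_statistics(X, X_tilde, y, seed=0):
    """LCD: W_j = |bhat_j| - |bhat_{j+p}| from a lasso on [X, X_tilde].

    Cross-validating the penalty on the augmented design treats originals
    and knockoffs symmetrically, preserving the coin-flipping property.
    """
    p = X.shape[1]
    XX = np.hstack([X, X_tilde])
    b = LassoCV(cv=5, random_state=seed).fit(XX, y).coef_
    return np.abs(b[:p]) - np.abs(b[p:])
```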


Exchangeability Endows Coin-Flipping

Recall the exchangeability property: for any j,

$$[X_1 \cdots X_j \cdots X_p \;\; \tilde{X}_1 \cdots \tilde{X}_j \cdots \tilde{X}_p] \overset{D}{=} [X_1 \cdots \tilde{X}_j \cdots X_p \;\; \tilde{X}_1 \cdots X_j \cdots \tilde{X}_p]$$

Coin-flipping property for Wj: for any unimportant variable j,

$$\begin{aligned}
(Z_j, \tilde{Z}_j) &:= \Big(Z_j\big(y, [\cdots X_j \cdots \tilde{X}_j \cdots]\big),\; \tilde{Z}_j\big(y, [\cdots X_j \cdots \tilde{X}_j \cdots]\big)\Big) \\
&\overset{D}{=} \Big(Z_j\big(y, [\cdots \tilde{X}_j \cdots X_j \cdots]\big),\; \tilde{Z}_j\big(y, [\cdots \tilde{X}_j \cdots X_j \cdots]\big)\Big) \\
&= \Big(\tilde{Z}_j\big(y, [\cdots X_j \cdots \tilde{X}_j \cdots]\big),\; Z_j\big(y, [\cdots X_j \cdots \tilde{X}_j \cdots]\big)\Big) \\
&= (\tilde{Z}_j, Z_j)
\end{aligned}$$

The distributional equality uses exchangeability (together with $Y \perp\!\!\!\perp X_j \mid X_{-j}$ for the null j, so the swap also leaves the joint distribution with y unchanged), and the next equality simply relabels which column each importance is read from. Hence

$$W_j = f_j(Z_j, \tilde{Z}_j) \overset{D}{=} f_j(\tilde{Z}_j, Z_j) = -f_j(Z_j, \tilde{Z}_j) = -W_j$$


Adaptivity and Prior Information in Wj

Recall LCD: $W_j = |\hat{\beta}_j| - |\hat{\tilde{\beta}}_j|$, where $\hat{\beta}_j, \hat{\tilde{\beta}}_j$ come from ℓ1-penalized regression

Adaptivity

Cross-validation (on [X X̃]) to choose the penalty parameter in LCD

Higher-level adaptivity: CV to choose best-fitting model for inference

− E.g., fit a random forest and an ℓ1-penalized regression; derive the feature importance from whichever has lower CV error (sketched below); still strict FDR control

Can even let analyst look at (masked version of) data to choose Z function

Prior information

Bayesian approach: choose a prior and model, and Zj could be the posterior probability that Xj contributes to the model

Still strict FDR control, even if wrong prior or MCMC has not converged
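A Python sketch of the higher-level adaptivity bullet above (function names and the regression setting are mine): fit two models on the augmented design and take importances from whichever cross-validates better. Because the choice never distinguishes originals from knockoffs, the coin-flipping property, and hence FDR control, is unaffected.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_score

def adaptive_importances(XX, y, seed=0):
    """Importances Z for all 2p columns of the augmented design XX = [X, X_tilde].

    Downstream, W_j = Z[j] - Z[j + p] as in the LCD construction.
    """
    lasso = LassoCV(cv=5, random_state=seed)
    forest = RandomForestRegressor(n_estimators=200, random_state=seed)
    if (cross_val_score(lasso, XX, y, cv=5).mean()
            >= cross_val_score(forest, XX, y, cv=5).mean()):
        return np.abs(lasso.fit(XX, y).coef_)
    return forest.fit(XX, y).feature_importances_
```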


Step (3): Find the Knockoff Threshold


Find the Knockoff Threshold

Example with p = 10 and q = 20% = 1/5:

[Figure: a number line showing the signed Wj and, below it, the sorted |Wj|, annotated with the running ratio #{negative Wj with |Wj| ≥ t} / #{positive Wj with |Wj| ≥ t} as the candidate threshold t decreases: 0/1, 0/2, 0/3, 1/3, 1/4, 1/5, 2/5, 3/5, 3/6, 3/7.]

Take τ = min{ t : #{j : Wj ≤ −t} / #{j : Wj ≥ t} ≤ q }, the smallest magnitude at which the running ratio still does not exceed q. Here the ratio is last at most 20% at 1/5, so τ is the sixth-largest |Wj| and the selected set is

S = {1, 4, 5, 6, 7},

the five j with positive Wj above the threshold.
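The thresholding rule is short to implement; below is a sketch (my own code) with inputs chosen to reproduce the ratio sequence 0/1, 0/2, 0/3, 1/3, 1/4, 1/5, 2/5, 3/5, 3/6, 3/7 from the example above (the specific W values are assumed, not read off the slide):

    import numpy as np

    def knockoff_threshold(W, q, offset=0):
        """Smallest t among the nonzero |W_j| with
        (offset + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) <= q.
        offset=0 matches this example; offset=1 is the knockoff+ variant
        that appears in the mFDR bound later in the deck."""
        for t in np.sort(np.abs(W[W != 0])):
            if (offset + np.sum(W <= -t)) / max(1, np.sum(W >= t)) <= q:
                return t
        return np.inf

    W = np.array([3.0, 0.5, -1.2, 2.5, 2.2, 1.8, 1.6, -2.0, 0.7, -0.9])
    tau = knockoff_threshold(W, q=0.2)
    S = np.where(W >= tau)[0] + 1      # 1-indexed, as on the slide
    print(tau, S)                      # tau = 1.6, S = {1, 4, 5, 6, 7}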


Intuition for FDR Control

FDR = E( #{null Xj selected} / #{total Xj selected} )

    = E( #{null positive Wj with |Wj| > τ} / #{positive Wj with |Wj| > τ} )

    ≈ E( #{null negative Wj with |Wj| > τ} / #{positive Wj with |Wj| > τ} )    [coin flip: a null Wj is equally likely to be positive or negative]

    ≤ E( #{negative Wj with |Wj| > τ} / #{positive Wj with |Wj| > τ} )

    ≤ q,    since the last ratio is at most q pointwise by the definition of τ.
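This heuristic is easy to probe numerically. A self-contained Monte Carlo sketch (the W distribution is an idealization I am assuming for illustration, not the talk's simulation): null Wj are symmetric about zero, non-null Wj are shifted positive, and the realized false discovery proportion averages out near or below q.

    import numpy as np

    rng = np.random.default_rng(1)
    p, k, q = 200, 30, 0.2                   # features, non-nulls, target FDR
    fdp = []
    for _ in range(500):
        W = rng.standard_normal(p)           # null W_j: symmetric about 0
        W[:k] = np.abs(W[:k]) + 2.0          # non-null W_j: shifted positive
        # threshold: smallest t with #{W_j <= -t} / #{W_j >= t} <= q
        tau = next((t for t in np.sort(np.abs(W))
                    if np.sum(W <= -t) / max(1, np.sum(W >= t)) <= q), np.inf)
        sel = np.where(W >= tau)[0]
        fdp.append(np.mean(sel >= k) if sel.size else 0.0)
    print(np.mean(fdp))                      # estimated FDR, close to / below q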


GWAS Application


Genetic Analysis of Crohn’s Disease

2007 case-control study (WTCCC, 2007)

n ≈ 5,000, p ≈ 375,000; preprocessing mirrored the original analysis

Strong spatial structure: second-order knockoffs generated using a genetic covariance estimate (Wen and Stephens, 2010)

The entire analysis took 6 hours of serial computation time; 1 hour in parallel

Knockoffs made twice as many discoveries as the original analysis

− Some of the new discoveries were confirmed in a larger study
− Some are corroborated by work on nearby genes: promising candidates
− Similar results when HMM knockoffs were applied to the same data (Sesia et al., 2017)


Discussion


Summary and Next Steps

By conditioning on Y and modeling X, knockoffs can be applied to high-dimensional and nonlinear problems, where the method is powerful, flexible, and appears robust

Some future directions for research:

Theoretical: rigorous guarantees on robustness

Methodological: develop knockoff constructions for new X distributions

Applied: team up with domain experts who know/control their X, e.g., gene knockout/knockdown, climate-change modeling

Thank you!


Appendix


References

Barber, R. F. and Candès, E. J. (2015). Controlling the false discovery rate via knockoffs. Ann. Statist., 43(5):2055–2085.

Candès, E., Fan, Y., Janson, L., and Lv, J. (2016). Panning for gold: Model-free knockoffs for high-dimensional controlled variable selection. arXiv preprint arXiv:1610.02351.

Dai, R. and Barber, R. F. (2016). The knockoff filter for FDR control in group-sparse and multitask regression. arXiv preprint arXiv:1602.03589.

Sesia, M., Sabatti, C., and Candès, E. (2017). Gene hunting with knockoffs for hidden Markov models. arXiv preprint arXiv:1706.04677.

Wen, X. and Stephens, M. (2010). Using linear predictors to impute allele frequencies from summary or pooled genotype data. Ann. Appl. Stat., 4(3):1158–1182.

WTCCC (2007). Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447(7145):661–678.


Simulations in Low-Dimensional Linear Model

[Figure: two panels, Power (left) and FDR (right) versus coefficient amplitude (2–5), for BHq Marginal, BHq Max Lik., MF Knockoffs, and Orig. Knockoffs.]

Figure: Power and FDR (target is 10%) for MF knockoffs and alternative procedures. The design matrix is i.i.d. N(0, 1/n), n = 3000, p = 1000, and y comes from a Gaussian linear model with 60 nonzero regression coefficients having equal magnitudes and random signs. The noise variance is 1.
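This and the following three figures share one simulation template. Here is a hedged numpy sketch of a matching data generator (gen_data and its defaults are my own naming and my reading of the captions, not the talk's code); the later figures correspond to family="binomial", larger p, and nonzero rho:

    import numpy as np

    def gen_data(n=3000, p=1000, k=60, amplitude=3.5, family="gaussian",
                 rho=0.0, rng=None):
        """Design with AR(1) columns (rho=0 gives i.i.d. entries), each column
        marginally N(0, 1/n); k nonzero coefficients with equal magnitudes,
        random signs, and random locations; Gaussian or logit-binomial y."""
        rng = np.random.default_rng() if rng is None else rng
        Z = rng.standard_normal((n, p))
        X = np.empty_like(Z)
        X[:, 0] = Z[:, 0]
        for j in range(1, p):                          # AR(1) across columns
            X[:, j] = rho * X[:, j - 1] + np.sqrt(1 - rho**2) * Z[:, j]
        X /= np.sqrt(n)                                # marginally N(0, 1/n)
        beta = np.zeros(p)
        idx = rng.choice(p, size=k, replace=False)     # random signal locations
        beta[idx] = amplitude * rng.choice([-1.0, 1.0], size=k)
        eta = X @ beta
        if family == "gaussian":
            y = eta + rng.standard_normal(n)           # noise variance 1
        else:                                          # binomial, logit link
            y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
        return X, y, idx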


Simulations in Low-Dimensional Nonlinear Model

[Figure: two panels, Power (left) and FDR (right) versus coefficient amplitude (6–10), for BHq Marginal, BHq Max Lik., and MF Knockoffs.]

Figure: Power and FDR (target is 10%) for MF knockoffs and alternative procedures. The design matrix is i.i.d. N(0, 1/n), n = 3000, p = 1000, and y comes from a binomial linear model with logit link function, with 60 nonzero regression coefficients having equal magnitudes and random signs.


Simulations in High Dimensions

[Figure: two panels, Power (left) and FDR (right) versus coefficient amplitude (8–12), for BHq Marginal and MF Knockoffs.]

Figure: Power and FDR (target is 10%) for MF knockoffs and alternative procedures. The design matrix is i.i.d. N(0, 1/n), n = 3000, p = 6000, and y comes from a binomial linear model with logit link function, with 60 nonzero regression coefficients having equal magnitudes and random signs.


Simulations in High Dimensions with Dependence

[Figure: two panels, Power (left) and FDR (right) versus autocorrelation coefficient (0.0–0.8), for BHq Marginal and MF Knockoffs.]

Figure: Power and FDR (target is 10%) for MF knockoffs and alternative procedures. The design matrix has AR(1) columns, with each Xj marginally N(0, 1/n); n = 3000, p = 6000, and y follows a binomial linear model with logit link function, with 60 nonzero coefficients having random signs and randomly selected locations.


Checking Sensitivity to Misspecification Error

                                      Concern about misspecification of:
                                         Y | X          X
    Canonical (model Y, not X)           Yes            No
    Model X, not Y (knockoffs)           No             Yes
    Misspecification replicated
    in simulation?                       No             Yes

Can actually check sensitivity to misspecification error!


Robustness on Real Data

[Figure: two panels, Power (left) and FDR (right) versus coefficient amplitude (9–21).]

Figure: Power and FDR (target is 10%) for model-free knockoffs applied to subsamples of chromosome 1 of a real genetic design matrix; n ≈ 1,400.


Computation of Second-Order Knockoffs

Cov(X1, . . . , Xp) = Σ; we need

    Cov(X1, . . . , Xp, X̃1, . . . , X̃p) = [ Σ            Σ − diag{s} ]
                                            [ Σ − diag{s}  Σ           ]

Equicorrelated (EQ) (fast, less powerful): s_j^EQ = 2λmin(Σ) ∧ 1 for all j

Semidefinite program (SDP) (slower, more powerful):

    minimize    Σj |1 − s_j^SDP|
    subject to  s_j^SDP ≥ 0
                diag{s^SDP} ⪯ 2Σ

(New) Approximate SDP (a Gaussian sampling sketch follows below):

− Approximate Σ as block diagonal so that the SDP separates
− Bisection search over a scalar multiplier of the solution to account for the approximation
− Faster than SDP, more powerful than EQ, and easily parallelizable
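Given Σ and the EQ choice of s, sampling second-order knockoffs for (approximately) Gaussian X is ordinary multivariate-normal conditioning on the joint covariance above. A minimal sketch (the conditional-Gaussian formulas are standard and the function name is mine; the slide itself only specifies the joint covariance):

    import numpy as np

    def gaussian_knockoffs_eq(X, Sigma, mu=None, rng=None):
        """Second-order knockoffs via the equicorrelated construction.

        With D = diag(s), s_j = min(2*lambda_min(Sigma), 1) (Sigma assumed on
        the correlation scale), standard Gaussian conditioning on the joint
        covariance [[Sigma, Sigma-D], [Sigma-D, Sigma]] gives
        Xt | X ~ N(mu + (Sigma-D) Sigma^{-1} (X-mu), 2D - D Sigma^{-1} D).
        """
        rng = np.random.default_rng() if rng is None else rng
        n, p = X.shape
        mu = np.zeros(p) if mu is None else mu
        s = min(2.0 * np.linalg.eigvalsh(Sigma)[0], 1.0)  # eigvalsh: ascending
        D = s * np.eye(p)
        Sinv_D = np.linalg.solve(Sigma, D)                # Sigma^{-1} D
        cond_mean = mu + (X - mu) @ (np.eye(p) - Sinv_D)  # (Sigma-D)Sigma^{-1}(x-mu)
        cond_cov = 2.0 * D - D @ Sinv_D                   # 2D - D Sigma^{-1} D
        L = np.linalg.cholesky(cond_cov + 1e-10 * np.eye(p))  # jitter for safety
        return cond_mean + rng.standard_normal((n, p)) @ L.T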


Sequential Independent Pairs Generates Valid Knockoffs

Algorithm 1: Sequential Conditional Independent Pairs

    for j = 1, . . . , p:
        sample X̃j from L(Xj | X-j, X̃1:j−1), conditionally independently of Xj

Proof sketch (discrete case; a toy implementation follows below):

Denote the PMF of (X1:p, X̃1:j−1) by L(X-j, Xj, X̃1:j−1).

The conditional PMF of Xj given (X-j, X̃1:j−1) is

    L(X-j, Xj, X̃1:j−1) / Σu L(X-j, u, X̃1:j−1).

Since X̃j is drawn from this conditional independently of Xj, the joint PMF of (X1:p, X̃1:j) is

    L(X-j, Xj, X̃1:j−1) · L(X-j, X̃j, X̃1:j−1) / Σu L(X-j, u, X̃1:j−1),

which is symmetric in Xj and X̃j; induction over j then yields the full exchangeability property.
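In the fully discrete case the algorithm can be run exactly by tracking the joint PMF of (X, X̃1:j) as in the proof sketch. A toy implementation (my own, feasible only for very small p since the tracked table doubles at every step):

    import numpy as np

    def scip_sample(pmf, p, rng, vals=(0, 1)):
        """One draw of (X, Xt) from Sequential Conditional Independent Pairs.

        pmf: dict mapping tuples in vals^p to probabilities (the law of X).
        The dict `joint` holds the current law of (X, Xt_{1:j}); at step j,
        Xt_j is drawn from L(X_j | X_-j, Xt_{1:j-1}) independently of X_j,
        and the tracked law is extended by that same conditional.
        """
        keys = list(pmf)
        x = list(keys[rng.choice(len(keys), p=[pmf[k] for k in keys])])
        joint = dict(pmf)
        xt = []
        for j in range(p):
            # conditional weights of coordinate j given x_-j and xt
            w = np.array([joint[tuple(x[:j] + [u] + x[j+1:]) + tuple(xt)]
                          for u in vals], dtype=float)
            xt.append(vals[rng.choice(len(vals), p=w / w.sum())])
            # extend: L(X_-j, X_j, Xt_{1:j-1}) * L(X_-j, u, Xt_{1:j-1}) / sum_v L(X_-j, v, Xt_{1:j-1})
            new = {}
            for key, pr in joint.items():
                xx, tt = key[:p], key[p:]
                slots = [xx[:j] + (u,) + xx[j+1:] + tt for u in vals]
                den = sum(joint[s] for s in slots)
                for u, s in zip(vals, slots):
                    new[key + (u,)] = pr * joint[s] / den if den else 0.0
            joint = new
        return tuple(x), tuple(xt)

    # Example: a correlated binary pair
    pmf = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
    print(scip_sample(pmf, 2, np.random.default_rng(0)))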


Proof of Control

FDR = E( #{null Xj selected} / #{total Xj selected} )
    = E( #{null positive Wj with |Wj| > τ} / #{positive Wj with |Wj| > τ} )
    ≈ E( #{null negative Wj with |Wj| > τ} / #{positive Wj with |Wj| > τ} )
    ≤ E( #{negative Wj with |Wj| > τ} / #{positive Wj with |Wj| > τ} )
    ≤ q    [by definition of τ]

More precisely:

mFDR = E( #{null Xj selected} / (q⁻¹ + #{total Xj selected}) )

     = E( #{null positive |Wj| > τ} / (q⁻¹ + #{positive |Wj| > τ}) )

     = E( [ #{null positive |Wj| > τ} / (1 + #{null negative |Wj| > τ}) ] · [ (1 + #{null negative |Wj| > τ}) / (q⁻¹ + #{positive |Wj| > τ}) ] )

The first factor, viewed as the threshold decreases with τ a stopping time, is a supermartingale with expectation ≤ 1 (optional stopping); the second factor is ≤ q pointwise by the definition of τ. Hence mFDR ≤ q.
