
High-Dimensional Variable Selection in Nonlinear Models that Controls the False Discovery Rate

Lucas Janson

Harvard University, Department of Statistics

CMSA Big Data Conference, August 18, 2017

Collaborators: Emmanuel Candès (Stanford), Yingying Fan, Jinchi Lv (USC)


Problem Statement


Controlled Variable Selection

Given:

Y an outcome of interest (AKA response or dependent variable),

X1, . . . , Xp a set of p potential explanatory variables (AKA covariates, features, or independent variables),

How can we select important explanatory variables with few mistakes?

Applications to:

Medicine/genetics/health care

Economics/political science

Industry/technology


Controlled Variable Selection (cont’d)

What is an important variable?

We consider Xj to be unimportant if the conditional distribution of Y given X1, . . . , Xp does not depend on Xj. Formally, Xj is unimportant if it is conditionally independent of Y given X−j:

$$Y \perp\!\!\!\perp X_j \mid X_{-j}$$

Markov blanket of Y: the smallest set S such that $Y \perp\!\!\!\perp X_{-S} \mid X_S$

For GLMs with no stochastically redundant covariates, this is equivalent to $\{j : \beta_j \neq 0\}$

To make sure we do not make too many mistakes, we seek to select a set S to control the false discovery rate (FDR):

$$\mathrm{FDR}(S) = \mathbb{E}\left[\frac{\#\{j \in S : X_j \text{ unimportant}\}}{\#\{j \in S\}}\right] \le q \quad (\text{e.g. } 10\%)$$

“Here is a set of variables S, 90% of which I expect to be important”
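In simulations, where the set of unimportant variables is known, the quantity inside the expectation above (the false discovery proportion) can be computed directly. A minimal Python sketch (function and argument names are mine):

```python
def false_discovery_proportion(selected, unimportant):
    """FDP of a selected set S: fraction of selections that are unimportant.

    The FDR is the expectation of this quantity over repeated experiments;
    the ground-truth `unimportant` set is only available in simulations.
    """
    selected, unimportant = set(selected), set(unimportant)
    if not selected:
        return 0.0  # convention: an empty selection makes no false discoveries
    return len(selected & unimportant) / len(selected)
```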


Sneak Peek

New interpretation of knockoffs solves the controlled variable selection problem

Allows any model for Y and X1, . . . , Xp

Allows any dimension (including p > n)

Finite-sample control (non-asymptotic) of FDR

Practical performance on real problems

Analysis of the genetic basis of Crohn’s Disease (WTCCC, 2007)

≈ 5,000 subjects (≈ 40% with Crohn's Disease)

≈ 375,000 single nucleotide polymorphisms (SNPs) for each subject

The original analysis of the data made 9 discoveries by running marginal tests and selecting p-values to target an FDR of 10%

Model-free knockoffs used the same FDR of 10% and made 18 discoveries, withmany of the new discoveries confirmed by a larger meta-analysis


Review of Methods for Controlled Variable Selection

What is required for valid inference?

            Low dimensions   Model for Y   Asymptotic regime   Sparsity   Random design
OLSp+BHq    Yes              Yes           No                  No         No
MLp+BHq     Yes              Yes           Yes                 No         No
HDp+BHq     No               Yes           Yes                 Yes        Yes
Orig KnO    Yes              Yes           No                  No         No
New KnO     No               No            No                  No         Yes*


The Knockoffs Idea


Knockoffs (Barber and Candès, 2015)

y and Xj are n × 1 column vectors of data: n draws from the random variables Y and Xj, respectively; design matrix X := [X1 · · · Xp]

(1) Construct knockoffs: the knockoffs X̃j must satisfy (with X̃ := [X̃1 · · · X̃p])

$$[X\ \tilde{X}]^\top [X\ \tilde{X}] = \begin{bmatrix} X^\top X & X^\top X - \mathrm{diag}\{s\} \\ X^\top X - \mathrm{diag}\{s\} & X^\top X \end{bmatrix}$$

(2) Compute knockoff statistics:

Sufficiency: Wj only a function of [X X̃]⊤[X X̃] and [X X̃]⊤y

Antisymmetry: swapping the values of Xj and X̃j flips the sign of Wj

(3) Find the knockoff threshold: (a code sketch follows the comments below)

Order the variables by decreasing |Wj| and proceed down the list

Select only variables with positive Wj until the last time #negatives / #positives ≤ q

Comments:

Finite-sample FDR control and leverages sparsity for power

Requires the data to follow a low-dimensional (n ≥ p) Gaussian linear model

Canonical approach: condition on X, rely heavily on model for y
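Step (3) is simple enough to state in code. A minimal Python sketch (function names mine), including the +1 offset of the knockoff+ variant from the Barber and Candès paper, which gives exact finite-sample FDR control:

```python
import numpy as np

def knockoff_threshold(W, q=0.10, offset=1):
    """Step (3): smallest t with (offset + #{W_j <= -t}) / #{W_j >= t} <= q.

    offset=1 is the knockoff+ rule (exact FDR control); offset=0 is the
    original rule, which controls a slightly modified FDR.
    """
    ts = np.sort(np.abs(W[W != 0]))  # candidate thresholds: the |W_j| values
    for t in ts:
        if (offset + np.sum(W <= -t)) / max(1, np.sum(W >= t)) <= q:
            return t
    return np.inf  # no feasible threshold: select nothing

def knockoff_select(W, q=0.10):
    """Indices of the variables with W_j at or above the threshold."""
    return np.flatnonzero(W >= knockoff_threshold(W, q))
```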


Generalizing the Knockoffs Procedure

(1) Construct knockoffs:

Artificial versions (“knockoffs”) of each variable

Act as controls for assessing the importance of the original variables

(2) Compute knockoff statistics:

Scalar statistic Wj for each variable

Measures how much more important a variable appears than its knockoff

Positive Wj denotes the original is more important; strength measured by magnitude

(3) Find the knockoff threshold: (same as before)

Order the variables by decreasing |Wj| and proceed down the list

Select only variables with positive Wj until the last time #negatives / #positives ≤ q

Coin-flipping property: the key to knockoffs is that steps (1) and (2) are done specifically to ensure that, conditional on |W1|, . . . , |Wp|, the signs of the unimportant/null Wj are independently ±1 with probability 1/2
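The coin-flipping property is exactly what makes the threshold in step (3) work: null variables are as likely to land below −t as above +t, so the count of negatives estimates the number of false positives among the selections. A small self-contained Python simulation (all names and parameter values are mine, chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
p, nulls, q, reps = 1000, 900, 0.10, 200  # first 900 variables are null
fdps = []
for _ in range(reps):
    mag = np.abs(rng.standard_normal(p))           # |W_j|, arbitrary magnitudes
    sign = np.where(rng.random(p) < 0.5, -1, 1)    # null signs: fair coin flips
    sign[nulls:] = 1                                # non-nulls: positive W_j ...
    W = mag * sign
    W[nulls:] += 2.0                                # ... with boosted magnitude
    # knockoff+ threshold from step (3)
    ts = np.sort(np.abs(W))
    t = next((t for t in ts
              if (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t)) <= q), np.inf)
    S = np.flatnonzero(W >= t)
    fdps.append(np.mean(S < nulls) if S.size else 0.0)
print("estimated FDR:", np.mean(fdps))  # should land at or below q = 0.10
```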


New Interpretation of Knockoffs


Knockoffs Without a Model for Y (Candès et al., 2016)

Instead of modeling y and conditioning on X, condition on y and model X (shifts the burden of knowledge from y onto X)

Explicitly,

$$\text{rows of } X = (X_{i,1}, \ldots, X_{i,p}) \overset{\text{iid}}{\sim} G,$$

where G can be arbitrary but is assumed known (estimating G is sketched at the end of this slide)

As compared to original knockoffs, removes

Restriction on dimension

Linear model requirement for Y | X1, . . . , Xp

“Sufficiency” constraint for Wj

The rows of X must be i.i.d., not the columns (covariates)

Nothing about y’s distribution is assumed or need be known

Robust to overfitting X’s distribution in preliminary experiments
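In practice G is estimated rather than known, which is what the robustness experiments below probe by varying the quality of a covariance estimate. A minimal Python sketch of the two estimators used there (function name mine):

```python
from sklearn.covariance import GraphicalLassoCV, empirical_covariance

def estimate_covariance(X, sparse=True):
    """Estimate the covariance of the covariate distribution G from data.

    sparse=True uses the graphical lasso with a cross-validated penalty,
    appropriate when the precision matrix is believed sparse; otherwise
    fall back to the plain empirical covariance.
    """
    if sparse:
        return GraphicalLassoCV().fit(X).covariance_
    return empirical_covariance(X)
```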


Robustness

[Figure: two panels plotting Power (left) and FDR (right) against the relative Frobenius norm error of the covariance estimate used to construct the knockoffs; the plotted methods are the exact covariance, the graphical lasso, and the empirical covariance computed on 50%, 62.5%, 75%, 87.5%, and 100% subsamples.]

Figure: Covariates are AR(1) with autocorrelation coefficient 0.3. n = 800, p = 1500, and target FDR is 10%. Y comes from a binomial linear model with logit link function with 50 nonzero entries.

Shifting the Burden of Knowledge

When is it appropriate?

1. Subjects sampled from a population, and

2a. Xj highly structured, well-studied, or well-understood, OR

2b. Large set of unsupervised X data (without Y ’s)

For instance, many genome-wide association studies satisfy all conditions:

1. Subjects sampled from a population (oversampling cases still valid)

2a. Strong spatial structure: linkage disequilibrium models, e.g., Markov chains, are well-studied and work well

2b. Other studies have collected same or similar SNP arrays on different subjects


The New Knockoffs Procedure

(1) Construct knockoffs: exchangeability

$$[X_1 \cdots X_j \cdots X_p \;\; \tilde{X}_1 \cdots \tilde{X}_j \cdots \tilde{X}_p] \overset{D}{=} [X_1 \cdots \tilde{X}_j \cdots X_p \;\; \tilde{X}_1 \cdots X_j \cdots \tilde{X}_p]$$

(2) Compute knockoff statistics:

Variable importance measure Z

Antisymmetric function fj : R² → R, i.e., fj(z1, z2) = −fj(z2, z1) (two common choices are sketched after this slide)

Wj = fj(Zj, Z̃j), where Zj and Z̃j are the variable importances of Xj and X̃j, respectively

(3) Find the knockoff threshold: (same as before)

Order the variables by decreasing |Wj| and proceed down the list

Select only variables with positive Wj until the last time #negatives / #positives ≤ q
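Any antisymmetric fj works. Two common choices, sketched in Python (function names mine):

```python
import numpy as np

def difference(z, z_tilde):
    """f(z1, z2) = z1 - z2: the choice behind the LCD statistic."""
    return z - z_tilde

def signed_max(z, z_tilde):
    """f(z1, z2) = max(z1, z2) * sign(z1 - z2); antisymmetric, 0 on ties."""
    return np.maximum(z, z_tilde) * np.sign(z - z_tilde)
```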


Step (1): Construct Knockoffs


Knockoff Construction

Proof that valid knockoff variables can be generated for any X distribution

If (X1, . . . , Xp) is multivariate Gaussian, exchangeability reduces to matching first and second moments when Xj and X̃j are swapped

For Cov(X1, . . . , Xp) = Σ:

$$\mathrm{Cov}(X_1, \ldots, X_p, \tilde{X}_1, \ldots, \tilde{X}_p) = \begin{bmatrix} \Sigma & \Sigma - \mathrm{diag}\{s\} \\ \Sigma - \mathrm{diag}\{s\} & \Sigma \end{bmatrix}$$

For non-Gaussian X, this still gives second-order-correct approximate knockoffs (a sampling sketch follows this slide)

Linear algebra and semidefinite programming to find good s

Recently: construction for Markov chains and HMMs (Sesia et al., 2017)

Constructions also possible for grouped variables (Dai and Barber, 2016)
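In the Gaussian case the construction is fully explicit: conditional on X, the knockoffs are again Gaussian. A minimal Python sketch (function name mine) using the simple equicorrelated choice of s, which assumes Σ is a correlation matrix; the semidefinite-programming choice mentioned above generally yields more powerful knockoffs:

```python
import numpy as np

def gaussian_knockoffs(X, mu, Sigma, seed=None):
    """Sample exact Gaussian knockoffs for rows of X drawn iid from N(mu, Sigma).

    Equicorrelated construction s_j = min(1, 2 * lambda_min(Sigma)), which
    assumes Sigma has unit diagonal; X is (n, p), returns an (n, p) array.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    s = min(1.0, 2.0 * np.linalg.eigvalsh(Sigma)[0]) * np.ones(p)
    D = np.diag(s)  # diag{s} keeps the 2p x 2p joint covariance PSD
    Sigma_inv = np.linalg.inv(Sigma)
    # Conditional law of the knockoffs given X, from the joint covariance
    # [[Sigma, Sigma - D], [Sigma - D, Sigma]]:
    #   Xtilde | X ~ N(mu + (Sigma - D) Sigma^{-1} (X - mu), 2D - D Sigma^{-1} D)
    cond_mean = mu + (X - mu) @ Sigma_inv @ (Sigma - D)
    cond_cov = 2 * D - D @ Sigma_inv @ D
    L = np.linalg.cholesky(cond_cov + 1e-10 * np.eye(p))  # jitter for stability
    return cond_mean + rng.standard_normal((n, p)) @ L.T
```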


Step (2): Compute Knockoff Statistics


Strategy for Choosing Knockoff Statistics

Recall that Wj is an antisymmetric function fj of Zj and Z̃j (the variable importances of Xj and X̃j, respectively):

$$W_j = f_j(Z_j, \tilde{Z}_j) = -f_j(\tilde{Z}_j, Z_j)$$

For example,

Z is the magnitude of a fitted coefficient β̂ from a lasso regression of y on [X X̃]

fj(z1, z2) = z1 − z2

Lasso Coefficient Difference (LCD) statistic:

$$W_j = |\hat{\beta}_j| - |\hat{\tilde{\beta}}_j|$$
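A minimal Python sketch of the LCD statistic (function name mine; this assumes a continuous response, and for binary outcomes like the Crohn's data one would swap in ℓ1-penalized logistic regression):

```python
import numpy as np
from sklearn.linear_model import LassoCV

def lcd_statistics(X, X_tilde, y, seed=0):
    """LCD: W_j = |bhat_j| - |bhat_{j+p}| from a lasso on [X, X_tilde].

    Cross-validating the penalty on the augmented design treats originals
    and knockoffs symmetrically, preserving the coin-flipping property.
    """
    p = X.shape[1]
    XX = np.hstack([X, X_tilde])
    b = LassoCV(cv=5, random_state=seed).fit(XX, y).coef_
    return np.abs(b[:p]) - np.abs(b[p:])
```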


Exchangeability Endows Coin-Flipping

Recall the exchangeability property: for any j,

$$[X_1 \cdots X_j \cdots X_p \;\; \tilde{X}_1 \cdots \tilde{X}_j \cdots \tilde{X}_p] \overset{D}{=} [X_1 \cdots \tilde{X}_j \cdots X_p \;\; \tilde{X}_1 \cdots X_j \cdots \tilde{X}_p]$$

Coin-flipping property for Wj: for any unimportant variable j,

$$\begin{aligned}
(Z_j, \tilde{Z}_j) &:= \Big(Z_j\big(y, [\cdots X_j \cdots \tilde{X}_j \cdots]\big),\; \tilde{Z}_j\big(y, [\cdots X_j \cdots \tilde{X}_j \cdots]\big)\Big) \\
&\overset{D}{=} \Big(Z_j\big(y, [\cdots \tilde{X}_j \cdots X_j \cdots]\big),\; \tilde{Z}_j\big(y, [\cdots \tilde{X}_j \cdots X_j \cdots]\big)\Big) \\
&= \Big(\tilde{Z}_j\big(y, [\cdots X_j \cdots \tilde{X}_j \cdots]\big),\; Z_j\big(y, [\cdots X_j \cdots \tilde{X}_j \cdots]\big)\Big) \\
&= (\tilde{Z}_j, Z_j)
\end{aligned}$$

The distributional equality uses exchangeability (together with $Y \perp\!\!\!\perp X_j \mid X_{-j}$ for the null j, so the swap also leaves the joint distribution with y unchanged), and the next equality simply relabels which column each importance is read from. Hence

$$W_j = f_j(Z_j, \tilde{Z}_j) \overset{D}{=} f_j(\tilde{Z}_j, Z_j) = -f_j(Z_j, \tilde{Z}_j) = -W_j$$


Adaptivity and Prior Information in Wj

Recall LCD: $W_j = |\hat{\beta}_j| - |\hat{\tilde{\beta}}_j|$, where $\hat{\beta}_j, \hat{\tilde{\beta}}_j$ come from ℓ1-penalized regression

Adaptivity

Cross-validation (on [X X̃]) to choose the penalty parameter in LCD

Higher-level adaptivity: CV to choose best-fitting model for inference

− E.g., fit a random forest and an ℓ1-penalized regression; derive the feature importance from whichever has lower CV error (sketched below); still strict FDR control

Can even let analyst look at (masked version of) data to choose Z function

Prior information

Bayesian approach: choose a prior and model, and Zj could be the posterior probability that Xj contributes to the model

Still strict FDR control, even if wrong prior or MCMC has not converged
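A Python sketch of the higher-level adaptivity bullet above (function names and the regression setting are mine): fit two models on the augmented design and take importances from whichever cross-validates better. Because the choice never distinguishes originals from knockoffs, the coin-flipping property, and hence FDR control, is unaffected.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_score

def adaptive_importances(XX, y, seed=0):
    """Importances Z for all 2p columns of the augmented design XX = [X, X_tilde].

    Downstream, W_j = Z[j] - Z[j + p] as in the LCD construction.
    """
    lasso = LassoCV(cv=5, random_state=seed)
    forest = RandomForestRegressor(n_estimators=200, random_state=seed)
    if (cross_val_score(lasso, XX, y, cv=5).mean()
            >= cross_val_score(forest, XX, y, cv=5).mean()):
        return np.abs(lasso.fit(XX, y).coef_)
    return forest.fit(XX, y).feature_importances_
```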


Step (3): Find the Knockoff Threshold


Find the Knockoff Threshold

Example with p = 10 and q = 20% = 1/5:

[Figure: a number line showing the signed Wj and, below it, the sorted |Wj|, annotated with the running ratio #{negative Wj with |Wj| ≥ t} / #{positive Wj with |Wj| ≥ t} as the candidate threshold t decreases: 0/1, 0/2, 0/3, 1/3, 1/4, 1/5, 2/5, 3/5, 3/6, 3/7.]

Take τ = min{ t : #{j : Wj ≤ −t} / #{j : Wj ≥ t} ≤ q }, the smallest magnitude at which the running ratio still does not exceed q. Here the ratio is last at most 20% at 1/5, so τ is the sixth-largest |Wj| and the selected set is

S = {1, 4, 5, 6, 7},

the five j with positive Wj above the threshold.
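The thresholding rule is short to implement; below is a sketch (my own code) with inputs chosen to reproduce the ratio sequence 0/1, 0/2, 0/3, 1/3, 1/4, 1/5, 2/5, 3/5, 3/6, 3/7 from the example above (the specific W values are assumed, not read off the slide):

    import numpy as np

    def knockoff_threshold(W, q, offset=0):
        """Smallest t among the nonzero |W_j| with
        (offset + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) <= q.
        offset=0 matches this example; offset=1 is the knockoff+ variant
        that appears in the mFDR bound later in the deck."""
        for t in np.sort(np.abs(W[W != 0])):
            if (offset + np.sum(W <= -t)) / max(1, np.sum(W >= t)) <= q:
                return t
        return np.inf

    W = np.array([3.0, 0.5, -1.2, 2.5, 2.2, 1.8, 1.6, -2.0, 0.7, -0.9])
    tau = knockoff_threshold(W, q=0.2)
    S = np.where(W >= tau)[0] + 1      # 1-indexed, as on the slide
    print(tau, S)                      # tau = 1.6, S = {1, 4, 5, 6, 7}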


Intuition for FDR Control

FDR = E( #{null Xj selected} / #{total Xj selected} )

    = E( #{null positive Wj with |Wj| > τ} / #{positive Wj with |Wj| > τ} )

    ≈ E( #{null negative Wj with |Wj| > τ} / #{positive Wj with |Wj| > τ} )    [coin flip: a null Wj is equally likely to be positive or negative]

    ≤ E( #{negative Wj with |Wj| > τ} / #{positive Wj with |Wj| > τ} )

    ≤ q,    since the last ratio is at most q pointwise by the definition of τ.
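This heuristic is easy to probe numerically. A self-contained Monte Carlo sketch (the W distribution is an idealization I am assuming for illustration, not the talk's simulation): null Wj are symmetric about zero, non-null Wj are shifted positive, and the realized false discovery proportion averages out near or below q.

    import numpy as np

    rng = np.random.default_rng(1)
    p, k, q = 200, 30, 0.2                   # features, non-nulls, target FDR
    fdp = []
    for _ in range(500):
        W = rng.standard_normal(p)           # null W_j: symmetric about 0
        W[:k] = np.abs(W[:k]) + 2.0          # non-null W_j: shifted positive
        # threshold: smallest t with #{W_j <= -t} / #{W_j >= t} <= q
        tau = next((t for t in np.sort(np.abs(W))
                    if np.sum(W <= -t) / max(1, np.sum(W >= t)) <= q), np.inf)
        sel = np.where(W >= tau)[0]
        fdp.append(np.mean(sel >= k) if sel.size else 0.0)
    print(np.mean(fdp))                      # estimated FDR, close to / below q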


GWAS Application


Genetic Analysis of Crohn’s Disease

2007 case-control study (WTCCC, 2007)

n ≈ 5,000, p ≈ 375,000; preprocessing mirrored the original analysis

Strong spatial structure: second-order knockoffs generated using a genetic covariance estimate (Wen and Stephens, 2010)

The entire analysis took 6 hours of serial computation time; 1 hour in parallel

Knockoffs made twice as many discoveries as the original analysis

− Some of the new discoveries were confirmed in a larger study
− Some are corroborated by work on nearby genes: promising candidates
− Similar results when HMM knockoffs were applied to the same data (Sesia et al., 2017)


Discussion


Summary and Next Steps

By conditioning on Y and modeling X, knockoffs can be applied to high-dimensional and nonlinear problems, where the method is powerful, flexible, and appears robust

Some future directions for research:

Theoretical: rigorous guarantees on robustness

Methodological: develop knockoff constructions for new X distributions

Applied: team up with domain experts who know/control their X, e.g., gene knockout/knockdown, climate-change modeling

Thank you!


Appendix


References

Barber, R. F. and Candès, E. J. (2015). Controlling the false discovery rate via knockoffs. Ann. Statist., 43(5):2055–2085.

Candès, E., Fan, Y., Janson, L., and Lv, J. (2016). Panning for gold: Model-free knockoffs for high-dimensional controlled variable selection. arXiv preprint arXiv:1610.02351.

Dai, R. and Barber, R. F. (2016). The knockoff filter for FDR control in group-sparse and multitask regression. arXiv preprint arXiv:1602.03589.

Sesia, M., Sabatti, C., and Candès, E. (2017). Gene hunting with knockoffs for hidden Markov models. arXiv preprint arXiv:1706.04677.

Wen, X. and Stephens, M. (2010). Using linear predictors to impute allele frequencies from summary or pooled genotype data. Ann. Appl. Stat., 4(3):1158–1182.

WTCCC (2007). Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447(7145):661–678.


Simulations in Low-Dimensional Linear Model

[Figure: two panels, Power (left) and FDR (right) versus coefficient amplitude (2–5), for BHq Marginal, BHq Max Lik., MF Knockoffs, and Orig. Knockoffs.]

Figure: Power and FDR (target is 10%) for MF knockoffs and alternative procedures. The design matrix is i.i.d. N(0, 1/n), n = 3000, p = 1000, and y comes from a Gaussian linear model with 60 nonzero regression coefficients having equal magnitudes and random signs. The noise variance is 1.
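This and the following three figures share one simulation template. Here is a hedged numpy sketch of a matching data generator (gen_data and its defaults are my own naming and my reading of the captions, not the talk's code); the later figures correspond to family="binomial", larger p, and nonzero rho:

    import numpy as np

    def gen_data(n=3000, p=1000, k=60, amplitude=3.5, family="gaussian",
                 rho=0.0, rng=None):
        """Design with AR(1) columns (rho=0 gives i.i.d. entries), each column
        marginally N(0, 1/n); k nonzero coefficients with equal magnitudes,
        random signs, and random locations; Gaussian or logit-binomial y."""
        rng = np.random.default_rng() if rng is None else rng
        Z = rng.standard_normal((n, p))
        X = np.empty_like(Z)
        X[:, 0] = Z[:, 0]
        for j in range(1, p):                          # AR(1) across columns
            X[:, j] = rho * X[:, j - 1] + np.sqrt(1 - rho**2) * Z[:, j]
        X /= np.sqrt(n)                                # marginally N(0, 1/n)
        beta = np.zeros(p)
        idx = rng.choice(p, size=k, replace=False)     # random signal locations
        beta[idx] = amplitude * rng.choice([-1.0, 1.0], size=k)
        eta = X @ beta
        if family == "gaussian":
            y = eta + rng.standard_normal(n)           # noise variance 1
        else:                                          # binomial, logit link
            y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
        return X, y, idx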


Simulations in Low-Dimensional Nonlinear Model

[Figure: two panels, Power (left) and FDR (right) versus coefficient amplitude (6–10), for BHq Marginal, BHq Max Lik., and MF Knockoffs.]

Figure: Power and FDR (target is 10%) for MF knockoffs and alternative procedures. The design matrix is i.i.d. N(0, 1/n), n = 3000, p = 1000, and y comes from a binomial linear model with logit link function, with 60 nonzero regression coefficients having equal magnitudes and random signs.


Simulations in High Dimensions

[Figure: two panels, Power (left) and FDR (right) versus coefficient amplitude (8–12), for BHq Marginal and MF Knockoffs.]

Figure: Power and FDR (target is 10%) for MF knockoffs and alternative procedures. The design matrix is i.i.d. N(0, 1/n), n = 3000, p = 6000, and y comes from a binomial linear model with logit link function, with 60 nonzero regression coefficients having equal magnitudes and random signs.


Simulations in High Dimensions with Dependence

[Figure: two panels, Power (left) and FDR (right) versus autocorrelation coefficient (0.0–0.8), for BHq Marginal and MF Knockoffs.]

Figure: Power and FDR (target is 10%) for MF knockoffs and alternative procedures. The design matrix has AR(1) columns, with each Xj marginally N(0, 1/n); n = 3000, p = 6000, and y follows a binomial linear model with logit link function, with 60 nonzero coefficients having random signs and randomly selected locations.


Checking Sensitivity to Misspecification Error

                                      Concern about misspecification of:
                                         Y | X          X
    Canonical (model Y, not X)           Yes            No
    Model X, not Y (knockoffs)           No             Yes
    Misspecification replicated
    in simulation?                       No             Yes

Can actually check sensitivity to misspecification error!


Robustness on Real Data

[Figure: two panels, Power (left) and FDR (right) versus coefficient amplitude (9–21).]

Figure: Power and FDR (target is 10%) for model-free knockoffs applied to subsamples of chromosome 1 of a real genetic design matrix; n ≈ 1,400.


Computation of Second-Order Knockoffs

Cov(X1, . . . , Xp) = Σ; we need

    Cov(X1, . . . , Xp, X̃1, . . . , X̃p) = [ Σ            Σ − diag{s} ]
                                            [ Σ − diag{s}  Σ           ]

Equicorrelated (EQ) (fast, less powerful): s_j^EQ = 2λmin(Σ) ∧ 1 for all j

Semidefinite program (SDP) (slower, more powerful):

    minimize    Σj |1 − s_j^SDP|
    subject to  s_j^SDP ≥ 0
                diag{s^SDP} ⪯ 2Σ

(New) Approximate SDP (a Gaussian sampling sketch follows below):

− Approximate Σ as block diagonal so that the SDP separates
− Bisection search over a scalar multiplier of the solution to account for the approximation
− Faster than SDP, more powerful than EQ, and easily parallelizable
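Given Σ and the EQ choice of s, sampling second-order knockoffs for (approximately) Gaussian X is ordinary multivariate-normal conditioning on the joint covariance above. A minimal sketch (the conditional-Gaussian formulas are standard and the function name is mine; the slide itself only specifies the joint covariance):

    import numpy as np

    def gaussian_knockoffs_eq(X, Sigma, mu=None, rng=None):
        """Second-order knockoffs via the equicorrelated construction.

        With D = diag(s), s_j = min(2*lambda_min(Sigma), 1) (Sigma assumed on
        the correlation scale), standard Gaussian conditioning on the joint
        covariance [[Sigma, Sigma-D], [Sigma-D, Sigma]] gives
        Xt | X ~ N(mu + (Sigma-D) Sigma^{-1} (X-mu), 2D - D Sigma^{-1} D).
        """
        rng = np.random.default_rng() if rng is None else rng
        n, p = X.shape
        mu = np.zeros(p) if mu is None else mu
        s = min(2.0 * np.linalg.eigvalsh(Sigma)[0], 1.0)  # eigvalsh: ascending
        D = s * np.eye(p)
        Sinv_D = np.linalg.solve(Sigma, D)                # Sigma^{-1} D
        cond_mean = mu + (X - mu) @ (np.eye(p) - Sinv_D)  # (Sigma-D)Sigma^{-1}(x-mu)
        cond_cov = 2.0 * D - D @ Sinv_D                   # 2D - D Sigma^{-1} D
        L = np.linalg.cholesky(cond_cov + 1e-10 * np.eye(p))  # jitter for safety
        return cond_mean + rng.standard_normal((n, p)) @ L.T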


Sequential Independent Pairs Generates Valid Knockoffs

Algorithm 1: Sequential Conditional Independent Pairs

    for j = 1, . . . , p:
        sample X̃j from L(Xj | X-j, X̃1:j−1), conditionally independently of Xj

Proof sketch (discrete case; a toy implementation follows below):

Denote the PMF of (X1:p, X̃1:j−1) by L(X-j, Xj, X̃1:j−1).

The conditional PMF of Xj given (X-j, X̃1:j−1) is

    L(X-j, Xj, X̃1:j−1) / Σu L(X-j, u, X̃1:j−1).

Since X̃j is drawn from this conditional independently of Xj, the joint PMF of (X1:p, X̃1:j) is

    L(X-j, Xj, X̃1:j−1) · L(X-j, X̃j, X̃1:j−1) / Σu L(X-j, u, X̃1:j−1),

which is symmetric in Xj and X̃j; induction over j then yields the full exchangeability property.
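In the fully discrete case the algorithm can be run exactly by tracking the joint PMF of (X, X̃1:j) as in the proof sketch. A toy implementation (my own, feasible only for very small p since the tracked table doubles at every step):

    import numpy as np

    def scip_sample(pmf, p, rng, vals=(0, 1)):
        """One draw of (X, Xt) from Sequential Conditional Independent Pairs.

        pmf: dict mapping tuples in vals^p to probabilities (the law of X).
        The dict `joint` holds the current law of (X, Xt_{1:j}); at step j,
        Xt_j is drawn from L(X_j | X_-j, Xt_{1:j-1}) independently of X_j,
        and the tracked law is extended by that same conditional.
        """
        keys = list(pmf)
        x = list(keys[rng.choice(len(keys), p=[pmf[k] for k in keys])])
        joint = dict(pmf)
        xt = []
        for j in range(p):
            # conditional weights of coordinate j given x_-j and xt
            w = np.array([joint[tuple(x[:j] + [u] + x[j+1:]) + tuple(xt)]
                          for u in vals], dtype=float)
            xt.append(vals[rng.choice(len(vals), p=w / w.sum())])
            # extend: L(X_-j, X_j, Xt_{1:j-1}) * L(X_-j, u, Xt_{1:j-1}) / sum_v L(X_-j, v, Xt_{1:j-1})
            new = {}
            for key, pr in joint.items():
                xx, tt = key[:p], key[p:]
                slots = [xx[:j] + (u,) + xx[j+1:] + tt for u in vals]
                den = sum(joint[s] for s in slots)
                for u, s in zip(vals, slots):
                    new[key + (u,)] = pr * joint[s] / den if den else 0.0
            joint = new
        return tuple(x), tuple(xt)

    # Example: a correlated binary pair
    pmf = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
    print(scip_sample(pmf, 2, np.random.default_rng(0)))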


Proof of Control

FDR = E( #{null Xj selected} / #{total Xj selected} )
    = E( #{null positive Wj with |Wj| > τ} / #{positive Wj with |Wj| > τ} )
    ≈ E( #{null negative Wj with |Wj| > τ} / #{positive Wj with |Wj| > τ} )
    ≤ E( #{negative Wj with |Wj| > τ} / #{positive Wj with |Wj| > τ} )
    ≤ q    [by definition of τ]

More precisely:

mFDR = E( #{null Xj selected} / (q⁻¹ + #{total Xj selected}) )

     = E( #{null positive |Wj| > τ} / (q⁻¹ + #{positive |Wj| > τ}) )

     = E( [ #{null positive |Wj| > τ} / (1 + #{null negative |Wj| > τ}) ] · [ (1 + #{null negative |Wj| > τ}) / (q⁻¹ + #{positive |Wj| > τ}) ] )

The first factor, viewed as the threshold decreases with τ a stopping time, is a supermartingale with expectation ≤ 1 (optional stopping); the second factor is ≤ q pointwise by the definition of τ. Hence mFDR ≤ q.
