Emmanuel Candes, Stanford University (source: candes/talks/slides/wald2.pdf)
TRANSCRIPT
What’s Happening in Selective Inference II?
Emmanuel Candes, Stanford University
The 2017 Wald Lectures, Joint Statistical Meetings, Baltimore, August 2017
Lecture 2: Special dedication
Chiara Sabatti
Agenda: The knockoff machine
(1) The knockoff framework (mostly from yesterday)
(2) Knockoffs for fixed covariates
(3) Knockoffs for random covariates
(4) Knockoffs for genome-wide association studies (GWAS)
(5) Genetic data analysis
The Knockoffs Framework (Summary from Lecture 1)
Controlled variable selection
[Slide figure: Figure 4 of the Wellcome Trust Case Control Consortium study (Nature, Vol. 447, 7 June 2007): genome-wide scans plotting −log10(P) against chromosomal position for seven diseases (type 2 diabetes, coronary artery disease, Crohn's disease, hypertension, rheumatoid arthritis, type 1 diabetes, bipolar disorder), with P values below 1×10⁻⁵ highlighted in green and panels truncated at −log10(P) = 15.]
Response Y (e.g. disease status)
Features X1, . . . , Xp (e.g. SNPs)
Question: distribution of Y |X depends on X through which variables?
Goal: select a set of features Xj that are likely to be relevant, without too many false positives; do not run into the problem of irreproducibility

FDR = E[ # false positives / # features selected ],
where the ratio inside the expectation is the false discovery proportion (FDP)
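As a concrete reading of this definition, here is a minimal sketch (the function name `fdp` and the toy index sets are our own, not from the talk): the FDP of a selection is the fraction of selected features that are in fact null, and the FDR is its expectation over repeated experiments.

```python
def fdp(selected, truly_null):
    """False discovery proportion: # false positives / max(# selections, 1)."""
    selected = set(selected)
    false_positives = len(selected & set(truly_null))
    return false_positives / max(len(selected), 1)

# Toy example: 5 selections, 2 of which are null features -> FDP = 2/5
print(fdp({1, 2, 3, 4, 5}, {4, 5, 10, 11}))  # 0.4
```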
Which variables should we report?
Feature importance Zj from random forests
[Slide figure: scatter plot of feature importances Zj for variables 1 to 500 (vertical axis: feature importance, roughly 1 to 7).]
True positives?
Knockoffs as negative controls
[Slide figure: feature importances for the original variables and their knockoffs, plotted against variable index 1 to 1000 (vertical axis: feature importance, roughly 1 to 4); originals and knockoffs are shown in different colours.]
Exchangeability of feature importance statistics

Knockoff-agnostic feature importance Z:
(Z1, . . . , Zp, Z̃1, . . . , Z̃p) = z([X, X̃], y)
(the first p entries score the originals, the last p the knockoffs)
This lecture
Can construct knockoff features X̃ such that
j null =⇒ (Zj, Z̃j) d= (Z̃j, Zj)
more generally, T a subset of nulls =⇒ (Z, Z̃)swap(T) d= (Z, Z̃)
Knockoffs-adjusted scores

Ordering of variables + 1-bit p-values
[Diagram: +/− signs of the scores ordered by |W|; under the null, signs are random.]

Adjusted scores Wj with the flip-sign property:
Combine Zj and Z̃j into a single (knockoff) score Wj = wj(Zj, Z̃j), where wj(Z̃j, Zj) = −wj(Zj, Z̃j)
e.g. Wj = Zj − Z̃j, or Wj = (Zj ∨ Z̃j) · (+1 if Zj > Z̃j; −1 if Zj ≤ Z̃j)
=⇒ Conditional on |W|, the signs of the null Wj's are i.i.d. coin flips
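The two example scores can be sketched as follows; a minimal illustration assuming Z and Zt hold the original and knockoff importances (the function names are ours, not from the talk).

```python
import numpy as np

def difference_score(Z, Zt):
    """W_j = Z_j - Zt_j: swapping Z_j and Zt_j flips the sign of W_j."""
    return np.asarray(Z) - np.asarray(Zt)

def signed_max_score(Z, Zt):
    """W_j = max(Z_j, Zt_j), signed +1 if Z_j > Zt_j and -1 otherwise."""
    Z, Zt = np.asarray(Z), np.asarray(Zt)
    return np.maximum(Z, Zt) * np.where(Z > Zt, 1.0, -1.0)

Z = np.array([3.0, 0.5, 2.0])    # original importances
Zt = np.array([1.0, 0.7, 2.0])   # knockoff importances
print(difference_score(Z, Zt))   # approximately [ 2.0, -0.2, 0.0 ]
print(signed_max_score(Z, Zt))   # [ 3.0, -0.7, -2.0 ]
```

Either choice satisfies the flip-sign property, which is all the theory below requires.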
Selection by sequential testing

S+(t) = {j : Wj ≥ t},  S−(t) = {j : Wj ≤ −t}
Select S+(t)  =⇒  estimated FDP(t) = (1 + |S−(t)|) / (1 ∨ |S+(t)|)
[Diagram: signs of the Wj ordered by |W|; a threshold t separates the selected set S+(t).]

Theorem (Barber and C. ('15))
Select S+(τ), τ = min { t : FDP(t) ≤ q }
Knockoff:   E[ # false positives / (# selections + q⁻¹) ] ≤ q
Knockoff+:  E[ # false positives / # selections ] ≤ q
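The selection rule in the theorem can be sketched directly; a toy implementation (not the authors' code) that scans the observed |Wj| values as candidate thresholds:

```python
import numpy as np

def knockoff_plus_select(W, q):
    """Return {j : W_j >= tau}, tau = min{t : (1 + #{W_j <= -t}) / max(#{W_j >= t}, 1) <= q}."""
    W = np.asarray(W, dtype=float)
    for t in np.sort(np.abs(W[W != 0])):   # candidate thresholds
        fdp_hat = (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= q:
            return np.where(W >= t)[0]
    return np.array([], dtype=int)         # no feasible threshold: select nothing

W = np.array([4.0, 3.5, 3.0, 2.5, -2.0, 1.5, -1.0, 0.5])
print(knockoff_plus_select(W, q=0.4))  # [0 1 2 3 5], i.e. tau = 1.5
```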
Why Can We Invert the Estimate of FDP? (Proof Sketch of FDR Control)

Why does all this work?

τ = min { t : (1 + |S−(t)|) / (|S+(t)| ∨ 1) ≤ q }
S+(t) = {j : Wj ≥ t},  S−(t) = {j : Wj ≤ −t}

Write V+(τ) = #{j null : j ∈ S+(τ)} and V−(τ) = #{j null : j ∈ S−(τ)}. Then

FDP(τ) = V+(τ) / (#{j : j ∈ S+(τ)} ∨ 1)
       = [ V+(τ) / (1 + V−(τ)) ] · [ (1 + V−(τ)) / (#{j : j ∈ S+(τ)} ∨ 1) ]
       ≤ q · V+(τ) / (1 + V−(τ)),

since 1 + V−(τ) ≤ 1 + |S−(τ)| and, by the definition of τ, (1 + |S−(τ)|) / (|S+(τ)| ∨ 1) ≤ q.

To show:  E[ V+(τ) / (1 + V−(τ)) ] ≤ 1
Martingales

V+(t) / (1 + V−(t)) is a (super)martingale with respect to Ft = σ({V±(u)}, u ≤ t).
[Diagram: null signs along the |W| axis; at threshold t, V+(t) and V−(t) count null + and − signs, and at a larger threshold s the number of remaining nulls is V+(s) + V−(s) = m.]

Conditioned on V+(s) + V−(s), V+(s) is hypergeometric, and

E[ V+(s) / (1 + V−(s)) | V±(t), V+(s) + V−(s) ] ≤ V+(t) / (1 + V−(t))
Optional stopping theorem

Stopping the supermartingale at τ gives

FDR ≤ q · E[ V+(τ) / (1 + V−(τ)) ] ≤ q · E[ V+(0) / (1 + V−(0)) ] ≤ q,

where V+(0) ∼ Bin(#nulls, 1/2).
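The final bound can be checked numerically. A quick Monte-Carlo sketch (sizes made up for illustration) of E[V+(0) / (1 + V−(0))] when the null signs are i.i.d. fair coin flips:

```python
import numpy as np

rng = np.random.default_rng(0)
n_nulls, reps = 50, 20000
signs = rng.choice([-1, 1], size=(reps, n_nulls))  # null signs: fair coin flips
v_plus = (signs > 0).sum(axis=1)                   # V+(0) ~ Bin(n_nulls, 1/2)
v_minus = n_nulls - v_plus                         # V-(0)
ratio = np.mean(v_plus / (1 + v_minus))
print(ratio)  # close to, and in expectation at most, 1
```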
Knockoffs for Fixed Features
Joint with Barber
Linear model

y = Xβ + z,  Xβ = Σj βjXj
(y: n × 1,  X: n × p,  β: p × 1,  z: n × 1)
y ∼ N(Xβ, σ²I)

Fixed design X
Noise level σ unknown
Multiple testing: Hj : βj = 0  (is the jth variable in the model?)
Identifiability =⇒ p ≤ n
Inference (FDR control) will hold conditionally on X
Knockoff features (fixed X)

Originals X, knockoffs X̃:
X̃′jX̃k = X′jXk for all j, k
X′jX̃k = X′jXk for all j ≠ k

No need for new data or a new experiment
No knowledge of the response y
Knockoff construction (n ≥ 2p)

Problem: given X ∈ Rn×p, find X̃ ∈ Rn×p such that

[X X̃]′[X X̃] = [ Σ              Σ − diag{s} ]
               [ Σ − diag{s}    Σ           ]  := G ⪰ 0

G ⪰ 0  ⟺  diag{s} ⪰ 0 and 2Σ − diag{s} ⪰ 0

Solution:
X̃ = X(I − Σ⁻¹ diag{s}) + UC
U ∈ Rn×p with column space orthogonal to that of X
C′C a Cholesky factorization of 2 diag{s} − diag{s} Σ⁻¹ diag{s} ⪰ 0
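The construction above can be sketched numerically. A minimal NumPy illustration with equi-correlated sj on a random toy X (variable names are ours), verifying the two defining Gram identities at the end:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 3
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)          # standardize columns: Sigma has unit diagonal

Sigma = X.T @ X
# Equi-correlated choice s_j = 2*lambda_min(Sigma) ∧ 1, shrunk slightly so the
# Cholesky factor below is strictly positive definite.
s = 0.999 * min(2 * np.linalg.eigvalsh(Sigma)[0], 1.0)
S = s * np.eye(p)

# U: p orthonormal columns orthogonal to the column space of X
Q, _ = np.linalg.qr(np.hstack([X, rng.standard_normal((n, p))]))
U = Q[:, p:2 * p]

# C'C = 2*diag{s} - diag{s} Sigma^{-1} diag{s}
C = np.linalg.cholesky(2 * S - S @ np.linalg.inv(Sigma) @ S).T

Xtil = X @ (np.eye(p) - np.linalg.inv(Sigma) @ S) + U @ C

print(np.allclose(Xtil.T @ Xtil, Sigma))   # knockoffs preserve the Gram matrix
print(np.allclose(X.T @ Xtil, Sigma - S))  # cross-correlations match, minus diag{s}
```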
Knockoff construction (n ≥ 2p)

X′jX̃j = 1 − sj  (standardized columns)

Equi-correlated knockoffs: sj = 2λmin(Σ) ∧ 1
Under equivariance, minimizes the value of |⟨Xj, X̃j⟩|

SDP knockoffs: minimize Σj |1 − sj| subject to sj ≥ 0, diag{s} ⪯ 2Σ
A highly structured semidefinite program (SDP)

Other possibilities ...
Why?

For a null feature Xj:

X̃′jy = X̃′jXβ + X̃′jz  d=  X′jXβ + X′jz = X′jy

(when βj = 0, X̃′jXβ = X′jXβ because X̃′jXk = X′jXk for k ≠ j, and X̃′jz has the same distribution as X′jz)

Construct knockoffs
Jan 21 2015 Controlling false discovery rate via knockoffs 12/36
Why?

For any subset of nulls T:

[X X̃]′swap(T) y  d=  [X X̃]′ y
Exchangeability of feature importance statistics

Sufficiency:

(Z, Z̃) = z([X X̃]′ [X X̃], [X X̃]′ y)

Knockoff-agnostic: swapping originals and knockoffs ⟹ swaps the Z's

z([X X̃]_swap(T), y) = (Z, Z̃)_swap(T)

Theorem (Barber and C., '15)
For any subset T of nulls,
(Z, Z̃)_swap(T)  d=  (Z, Z̃)

⟹ FDR control (conditional on X)
Telling the effect direction

"[...] in classical statistics, the significance of comparisons (e.g., θ1 − θ2) is calibrated using the Type I error rate, relying on the assumption that the true difference is zero, which makes no sense in many applications. [...] a more relevant framework in which a true comparison can be positive or negative, and, based on the data, you can state 'θ1 > θ2 with confidence', 'θ2 > θ1 with confidence', or 'no claim with confidence'."
A. Gelman & F. Tuerlinckx

Directional FDR

Are any effects exactly zero?

FDR_dir = E[ (# selections with wrong effect direction) / (# selections) ]
(the ratio inside the expectation is the directional false discovery proportion)

Directional FDR (Benjamini & Yekutieli, '05)
Sign errors (Type-S) (Gelman & Tuerlinckx, '00)
Important for misspecified models — exact sparsity unlikely
Directional FDR control

(X_j − X̃_j)′y ∼ N(s_j · β_j, 2σ² · s_j) independently,  s_j ≥ 0

Sign estimate: sgn((X_j − X̃_j)′y)

Theorem (Barber and C., '16)
Exact same knockoff selection + sign estimate:
FDR ≤ FDR_dir ≤ q

[Diagram: signs of the W statistics, null and non-null, ordered by magnitude |W|.]

Null coin flips are unbiased. Great subtlety: the coin flips are now biased.
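The sign estimate attached to each selection is a one-liner given the knockoffs; a minimal sketch (function name and inputs illustrative, assuming fixed-X knockoffs for X):

```python
import numpy as np

def sign_estimates(X, X_tilde, y):
    """Directional calls sgn((X_j - X~_j)' y).

    Under the model above, the contrasts (X_j - X~_j)' y are independent
    N(s_j * beta_j, 2 sigma^2 s_j), so the sign points at the effect direction."""
    return np.sign((X - X_tilde).T @ y)
```

Each variable passing the knockoff selection step is then reported together with its estimated effect direction.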
Empirical results

Features N(0, I_n), n = 3000, p = 1000; k = 30 variables with regression coefficients of magnitude 3.5; nominal level q = 20%.

Method                          FDR (%)   Power (%)   Theor. FDR control?
Knockoff+ (equivariant)          14.40     60.99      Yes
Knockoff (equivariant)           17.82     66.73      No
Knockoff+ (SDP)                  15.05     61.54      Yes
Knockoff (SDP)                   18.72     67.50      No
BHq                              18.70     48.88      No
BHq + log-factor correction       2.20     19.09      Yes
BHq with whitened noise          18.79      2.33      Yes
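The knockoff and knockoff+ rows differ only in the "+1" offset of the data-dependent threshold of Barber and Candès ('15). A short sketch of that selection rule from the W statistics:

```python
import numpy as np

def knockoff_threshold(W, q, plus=True):
    """Smallest t > 0 with (offset + #{W_j <= -t}) / max(1, #{W_j >= t}) <= q.

    plus=True is knockoff+ (provable FDR control at level q);
    plus=False is the knockoff filter (controls a modified FDR)."""
    offset = 1 if plus else 0
    for t in np.sort(np.abs(W[W != 0])):
        if (offset + np.sum(W <= -t)) / max(1, np.sum(W >= t)) <= q:
            return t
    return np.inf                      # nothing selected

def knockoff_select(W, q, plus=True):
    """Indices j with W_j above the threshold."""
    return np.where(W >= knockoff_threshold(W, q, plus))[0]
```

The negative W's play the role of the null "coin flips": they estimate how many false positives sit above any candidate threshold.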
Effect of signal amplitude

Same setup with k = 30 (q = 0.2).

[Figure: FDR (%) and Power (%) versus amplitude A (2.8 to 4.2) for Knockoff, Knockoff+, and BHq, with the nominal level marked.]
Effect of feature correlation

Features ∼ N(0, Θ), Θ_jk = ρ^{|j−k|}; n = 3000, p = 1000, k = 30, amplitude = 3.5.

[Figure: FDR (%) and Power (%) versus feature correlation ρ (0 to 0.8) for Knockoff, Knockoff+, and BHq, with the nominal level marked.]
Fixed Design Knockoff Data Analysis

HIV drug resistance

Drug type   # drugs   Sample size   # protease or RT positions genotyped   # mutations appearing ≥ 3 times in sample
PI          6         848            99                                    209
NRTI        6         639           240                                    294
NNRTI       3         747           240                                    319

response y: log-fold-increase of lab-tested drug resistance
covariate X_j: presence or absence of mutation #j

Data from R. Shafer (Stanford) available at:
http://hivdb.stanford.edu/pages/published_analysis/genophenoPNAS2006/
HIV data

TSM list: mutations associated with the PI class of drugs in general, not specialized to the individual drugs in the class.

Results for PI-type drugs

[Figure: number of HIV-1 protease positions selected by Knockoff and BHq, split into "appear in TSM list" vs "not in TSM list", for resistance to APV (n=768, p=201), ATV (n=329, p=147), IDV (n=826, p=208), LPV (n=516, p=184), NFV (n=843, p=209), and SQV (n=825, p=208).]
HIV data

Results for NRTI-type drugs

[Figure: number of HIV-1 RT positions selected by Knockoff and BHq ("appear in TSM list" vs "not in TSM list"), for resistance to 3TC (n=633, p=292), ABC (n=628, p=294), AZT (n=630, p=292), D4T (n=630, p=293), DDI (n=632, p=292), and TDF (n=353, p=218).]

Results for NNRTI-type drugs

[Figure: same, for resistance to DLV (n=732, p=311), EFV (n=734, p=318), and NVP (n=746, p=319).]
High-dimensional setting

n ≈ 5,000 subjects
p ≈ 330,000 SNPs/variables to test
[Slide shows pages from a genome-wide association study of lipid levels (Nature Genetics, Feb. 2008): Manhattan plots of −log10 P values for HDL cholesterol, LDL cholesterol, and triglycerides across chromosomes 1–22, with followed-up loci labeled (e.g. GALNT2, APOB, PCSK9, GCKR, LPL, CETP, LDLR, the APOE cluster), and quantile-quantile plots of the test statistics.]
p > n ⟹ cannot construct knockoffs as before:

X̃′_j X̃_k = X′_j X_k ∀ j, k  and  X̃′_j X_k = X′_j X_k ∀ j ≠ k   ⟹   X̃_j = X_j ∀ j
High-dimensional knockoffs: screen and confirm

Split the original data set into two samples:
(y(1), X(1)): exploratory, screen on sample 1
(y(2), X(2)): confirmatory, inference on sample 2

Theory (Barber and C., '16)
Safe data re-use to improve power (Barber and C., '16)
Some extensions

y = X_1 β_1 + X_2 β_2 + · · · + N(0, σ² I_n),   X_g ∈ R^{n×p_g}

Group sparsity — build knockoffs at the group-wise level (Dai & Barber, 2015)
Identify key groups with PCA — build knockoffs only for the top PC in each group (Chen, Hou & Hou, 2017)
Build knockoffs only for prototypes selected from each group (Reid & Tibshirani, 2015)
Multilayer knockoffs to control FDR at the individual and group levels simultaneously (Katsevich & Sabatti, 2017)
Knockoffs for Random Features
Joint with Fan, Janson & Lv
Variable selection in arbitrary models

Random pair (X, Y) (perhaps thousands/millions of covariates)

p(Y | X) depends on X through which variables?

Working definition of null variables
Say j ∈ H_0 (j is null) iff Y ⊥⊥ X_j | X_{−j}

Local Markov property ⟹ the non-nulls are the smallest subset S (Markov blanket) s.t.

Y ⊥⊥ {X_j}_{j∈S^c} | {X_j}_{j∈S}

Logistic model: P(Y = 0 | X) = 1 / (1 + e^{X′β})

If the variables X_{1:p} are not perfectly dependent, then j ∈ H_0 ⟺ β_j = 0
Knockoff features (random X)

i.i.d. samples from p(X, Y)
Distribution of X known
Distribution of Y | X (likelihood) completely unknown

Originals X = (X_1, . . . , X_p); knockoffs X̃ = (X̃_1, . . . , X̃_p)

(1) Pairwise exchangeability

(X, X̃)_swap(S)  d=  (X, X̃)

e.g.

(X_1, X̃_2, X̃_3, X̃_1, X_2, X_3) = (X_1, X_2, X_3, X̃_1, X̃_2, X̃_3)_swap({2,3})  d=  (X_1, X_2, X_3, X̃_1, X̃_2, X̃_3)

(2) X̃ ⊥⊥ Y | X (ignore Y when constructing knockoffs)
Exchangeability of feature importance statistics

Theorem (C., Fan, Janson & Lv, '16)
For knockoff-agnostic scores and any subset T of nulls,
(Z, Z̃)_swap(T)  d=  (Z, Z̃)

This holds no matter the relationship between Y and X
This holds conditionally on Y

⟹ FDR control (conditional on Y) no matter the relationship between X and Y
Knockoffs for Gaussian features

X ∼ N(μ, Σ)

Swapping any subset of original and knockoff features leaves the joint distribution invariant, e.g. for T = {2, 3}:

(X_1, X̃_2, X̃_3, X̃_1, X_2, X_3)  d=  (X_1, X_2, X_3, X̃_1, X̃_2, X̃_3)

Note X̃ d= X

Possible solution

(X, X̃) ∼ N(μ*, Σ*),   μ* = [ μ ],   Σ* = [ Σ            Σ − diag{s} ]
                            [ μ ]         [ Σ − diag{s}  Σ           ]

s such that Σ* ⪰ 0
Given X, sample X̃ from X̃ | X (Gaussian regression formula)
Different from knockoff features for fixed X!
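For the Gaussian case the conditional "regression formula" is explicit, and a sampler is a few lines. This is an illustrative sketch assuming μ, Σ are known and s satisfies the semidefinite constraint; it is not the authors' reference implementation.

```python
import numpy as np

def gaussian_knockoffs(X, mu, Sigma, s, rng):
    """Sample X~ | X when (X, X~) is jointly Gaussian with covariance
    [[Sigma, Sigma - D], [Sigma - D, Sigma]], D = diag(s).

    Standard Gaussian conditioning gives
      X~ | X  ~  N(mu + (X - mu)(I - Sigma^{-1} D),  2D - D Sigma^{-1} D),
    with rows of X treated as independent observations."""
    p = len(mu)
    D = np.diag(s)
    A = np.eye(p) - np.linalg.solve(Sigma, D)            # I - Sigma^{-1} D
    cond_mean = mu + (X - mu) @ A
    cond_cov = 2 * D - D @ np.linalg.solve(Sigma, D)
    w, V = np.linalg.eigh(cond_cov)                      # PSD square root
    L = V @ np.diag(np.sqrt(np.clip(w, 0.0, None)))
    return cond_mean + rng.standard_normal(X.shape) @ L.T
```

On simulated data, the empirical covariance of the stacked matrix [X X̃] matches the target joint covariance Σ*.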
Robustness

[Figure: Power and FDR versus relative Frobenius-norm error of the covariance used to build the knockoffs, for: exact covariance, graphical lasso, and the empirical covariance estimated on 50%, 62.5%, 75%, 87.5%, and 100% subsamples. Covariates are AR(1) with autocorrelation coefficient 0.3; n = 800, p = 1500; target FDR is 10%; Y | X follows a logistic model with 50 nonzero entries.]
Robustness theory

Ongoing with R. Barber and R. Samworth

(Partial) subject of Rina F. Barber's 2017 Tweedie Award Lecture
Knockoffs inference with random features
Pros:
No parameters
No p-values
Holds for finite samples
No matter the dependence between Y and X
No matter the dimensionality
Cons: Need to know distribution of covariates
Relationship with classical setup

Classical:
Observations of X are fixed; inference is conditional on observed values
Strong model linking Y and X
Useful inference even if model inexact

MF Knockoffs:
Observations of X are random (1)
Model-free (2)
Useful inference even if model inexact (3)

(1) Often appropriate in 'big' data apps: e.g. SNPs of subjects randomly sampled
(2) Shifts the 'burden' of knowledge
(3) More later
Shift in the burden of knowledge
When are our assumptions useful?
When we have large amounts of unsupervised data (e.g. economic studies with the same covariate info but different responses)
When we have more prior information about the covariates than about theirrelationship with a response (e.g. GWAS)
When we control the distribution of X (experimental crosses in genetics,gene knockout experiments,...)
Obstacles to obtaining p-values

Y | X ∼ Bernoulli(logit(X′β))

[Figure: distribution of null logistic-regression p-values with n = 500 and p = 200; left panel: global null, AR(1) design; right panel: 20 nonzero coefficients, AR(1) design.]
Obstacles to obtaining p-values

P{p-val ≤ t}   Sett. (1)       Sett. (2)       Sett. (3)       Sett. (4)
t = 5%         16.89% (0.37)   19.17% (0.39)   16.88% (0.37)   16.78% (0.37)
t = 1%          6.78% (0.25)    8.49% (0.28)    7.02% (0.26)    7.03% (0.26)
t = 0.1%        1.53% (0.12)    2.27% (0.15)    1.87% (0.14)    2.04% (0.14)

Table: Inflated p-value probabilities with estimated Monte Carlo SEs
Shameless plug: distribution of high-dimensional LRTs

Wilks' phenomenon (1938):  2 log L →d χ²_df

Sur, Chen, Candès (2017):  2 log L →d κ(p/n) χ²_df

[Figure: histograms of p-values under the classical χ²_df calibration (left) and the rescaled κ(p/n) χ²_df calibration (right).]
'Low' dim. linear model with dependent covariates

Z_j = |β̂_j(λ_CV)|,   W_j = Z_j − Z̃_j

[Figure: Power and FDR versus autocorrelation coefficient (0 to 0.8) for BHq Marginal, BHq Max Lik., MF Knockoffs, and Orig. Knockoffs. Low-dimensional setting: n = 3000, p = 1000.]
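The lasso-coefficient-difference (LCD) statistic W_j = |β̂_j(λ_CV)| − |β̂̃_j(λ_CV)| can be computed with any lasso solver; a sketch using scikit-learn (an implementation choice of this sketch, not the authors' code):

```python
import numpy as np
from sklearn.linear_model import LassoCV

def lcd_statistics(X, X_tilde, y):
    """Lasso coefficient-difference statistics W_j = Z_j - Z~_j with
    Z_j = |beta_j(lambda_CV)|, fit once on the augmented design [X, X~]."""
    p = X.shape[1]
    fit = LassoCV(cv=5).fit(np.hstack([X, X_tilde]), y)
    Z = np.abs(fit.coef_)
    return Z[:p] - Z[p:]
```

A large positive W_j is evidence that X_j matters; swapping X_j with X̃_j flips the sign of W_j, which is the antisymmetry the knockoff filter needs.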
'Low' dim. logistic model with indep. covariates

Z_j = |β̂_j(λ_CV)|,   W_j = Z_j − Z̃_j

[Figure: Power and FDR versus coefficient amplitude (6 to 10) for BHq Marginal, BHq Max Lik., and MF Knockoffs. Low-dimensional setting: n = 3000, p = 1000.]
'High' dim. logistic model with dependent covariates

Z_j = |β̂_j(λ_CV)|,   W_j = Z_j − Z̃_j

[Figure: Power and FDR versus autocorrelation coefficient (0 to 0.8) for BHq Marginal and MF Knockoffs. High-dimensional setting: n = 3000, p = 6000.]
Bayesian knockoff statistics

LCD (lasso coefficient difference)
BVS (Bayesian variable selection):  Z_j = P(β_j ≠ 0 | y, X),   W_j = Z_j − Z̃_j

[Figure: Power and FDR versus amplitude (5 to 15) for BVS knockoffs and LCD knockoffs; n = 300, p = 1000, Bayesian linear model with 60 expected variables.]

Inference is correct even if the prior is wrong or the MCMC has not converged.
Partial summary

No valid p-values, even for logistic regression
Shifts the burden of knowledge to X (the covariates); makes sense in many contexts
Robustness: simulations show that the properties of the inference hold even when the model for X is only approximately right
We always have access to diagnostic checks (later)
When the assumptions are appropriate, we gain a lot of power and can use sophisticated selection techniques
How to Construct Knockoffs for Hidden Markov Models
Joint with Sabatti & Sesia
A general construction (C., Fan, Janson and Lv, '16)

Goal: build X̃ with (X, X̃)_swap(S) d= (X, X̃) for any subset S.

Algorithm: Sequential Conditional Independent Pairs
for j = 1, . . . , p:
    sample X̃_j from the law of X_j | X_{−j}, X̃_{1:j−1}

e.g. p = 3:
Sample X̃_1 from X_1 | X_{−1}; the joint law of (X, X̃_1) is known
Sample X̃_2 from X_2 | X_{−2}, X̃_1; the joint law of (X, X̃_{1:2}) is known
Sample X̃_3 from X_3 | X_{−3}, X̃_{1:2}; the joint law of (X, X̃) is known, and it is pairwise exchangeable!

Usually not practical; easy in some cases (e.g. Markov chains)
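For a small discrete chain, the algorithm can be checked by brute-force enumeration. The sketch below (illustrative, with a made-up toy chain; not the efficient implementation the Markov-chain slide refers to) samples X̃_j from the exact conditional law of X_j given X_{−j} and X̃_{1:j−1}, computed recursively from the sequentially constructed joint.

```python
import numpy as np

# Toy binary Markov chain with p = 3: p(x) = q1(x1) Q(x2|x1) Q(x3|x2).
q1 = np.array([0.6, 0.4])
Q = np.array([[0.7, 0.3],
              [0.2, 0.8]])            # Q[a, b] = P(next = b | current = a)

def pmf(x):
    """Joint pmf of the chain at x = (x1, x2, x3)."""
    return q1[x[0]] * Q[x[0], x[1]] * Q[x[1], x[2]]

def joint(x, xt):
    """Joint pmf of (X, X~_{1:j}) under the sequential construction."""
    if not xt:
        return pmf(x)
    j = len(xt) - 1
    return joint(x, xt[:-1]) * cond(j, x, xt[:-1])[xt[-1]]

def cond(j, x, xt_prev):
    """Law of X_j given X_{-j} = x_{-j} and X~_{1:j-1} = xt_prev."""
    w = np.array([joint(x[:j] + (b,) + x[j + 1:], xt_prev) for b in (0, 1)])
    return w / w.sum()

def sample_knockoff(x, rng):
    """One pass of Sequential Conditional Independent Pairs."""
    xt = ()
    for j in range(3):
        xt += (int(rng.random() < cond(j, x, xt)[1]),)
    return xt
```

Enumerating all (x, x̃) pairs verifies the pairwise-exchangeability claim: swapping, say, (x_1, x̃_1) leaves the joint pmf unchanged.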
Knockoff copies of a Markov chain

X = (X_1, X_2, . . . , X_p) is a Markov chain:

p(X_1, . . . , X_p) = q_1(X_1) ∏_{j=2}^{p} Q_j(X_j | X_{j−1})    (X ∼ MC(q_1, Q))

[Diagram: observed variables X_1 → X_2 → X_3 → X_4 with knockoff variables X̃_1, . . . , X̃_4 attached.]

The general algorithm can be implemented efficiently in the case of a Markov chain.
(via a recursive update of normalizing constants)
Hidden Markov Models (HMMs)

X = (X1, X2, . . . , Xp) is an HMM if

    H ∼ MC(q1, Q)                                    (latent Markov chain)
    Xj | H ∼ Xj | Hj, independently ∼ fj(Xj; Hj)     (emission distribution)

[Diagram: latent chain H1 → H2 → H3, each Hj emitting the observed Xj]

The H variables are latent and only the X variables are observed
Haplotypes and genotypes

Haplotype: set of alleles on a single chromosome (0/1 for common/rare allele)
Genotype: unordered pair of alleles at a single marker

Haplotype M:     0 1 0 1 1 0
Haplotype P:     1 1 0 0 1 1
Genotypes (M+P): 1 2 0 1 2 1
A phenomenological HMM for haplotype & genotype data

Figure: Six haplotypes; color indicates ‘ancestor’ at each marker (Scheet, ’06)

Classical uses of this HMM:
Haplotype estimation/phasing (Browning, ’11)
Imputation of missing SNPs (Marchini, ’10)
Software: fastPHASE (Scheet, ’06), IMPUTE (Marchini, ’07), MaCH (Li, ’10)

New application of the same HMM: generation of knockoff copies of genotypes! Each genotype is the sum of two independent HMM haplotype sequences
Knockoff copies of a hidden Markov model

Theorem (Sesia, Sabatti, C. ’17)
A knockoff copy X̃ of X can be constructed as follows:
(1) Sample H from p(H | X) using the forward-backward algorithm
(2) Generate a knockoff H̃ of H using the SCIP algorithm for a Markov chain
(3) Sample X̃ from the emission distribution of X given H = H̃

[Diagram: observed variables X1, X2, X3; imputed latent variables H1, H2, H3; knockoff latent variables H̃1, H̃2, H̃3; knockoff variables X̃1, X̃2, X̃3]
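The three steps of the theorem can be run end to end on a toy HMM. In the sketch below all parameters are illustrative assumptions, and step (2) uses brute-force SCIP conditionals rather than the efficient recursion used in practice.

```python
import numpy as np

# Toy HMM (illustrative parameters): 2 hidden states, 2 symbols, length 3.
p_len, K = 3, 2
q1 = np.array([0.5, 0.5])            # initial distribution of H_1
Q = np.array([[0.9, 0.1],            # latent-chain transitions
              [0.3, 0.7]])
F = np.array([[0.8, 0.2],            # emissions F[h, x] = f(x; h)
              [0.1, 0.9]])
rng = np.random.default_rng(2)

def chain_pmf(h):
    """Joint pmf of the latent Markov chain."""
    prob = q1[h[0]]
    for j in range(1, p_len):
        prob *= Q[h[j - 1], h[j]]
    return prob

def sample_h_given_x(x):
    """Step (1): forward filtering, backward sampling from p(H | X = x)."""
    alpha = np.zeros((p_len, K))
    alpha[0] = q1 * F[:, x[0]]
    for j in range(1, p_len):
        alpha[j] = (alpha[j - 1] @ Q) * F[:, x[j]]
    h = [0] * p_len
    w = alpha[-1]
    h[-1] = int(rng.choice(K, p=w / w.sum()))
    for j in range(p_len - 2, -1, -1):
        w = alpha[j] * Q[:, h[j + 1]]
        h[j] = int(rng.choice(K, p=w / w.sum()))
    return h

def scip_cond(j, h, ht_prefix):
    """Step (2) helper: law of H_j given H_{-j} and earlier knockoffs."""
    w = np.zeros(K)
    for k in range(K):
        y = list(h)
        y[j] = k
        prob = chain_pmf(y)
        for l in range(j):
            prob *= scip_cond(l, y, ht_prefix[:l])[ht_prefix[l]]
        w[k] = prob
    return w / w.sum()

def sample_h_knockoff(h):
    """Step (2): SCIP knockoff of the latent chain."""
    ht = []
    for j in range(p_len):
        ht.append(int(rng.choice(K, p=scip_cond(j, h, ht))))
    return ht

def sample_x_given_h(h):
    """Step (3): draw the knockoff observables from the emission model."""
    return [int(rng.choice(2, p=F[h[j]])) for j in range(p_len)]

x_obs = [0, 1, 1]
h_imputed = sample_h_given_x(x_obs)        # step (1)
h_knockoff = sample_h_knockoff(h_imputed)  # step (2)
x_knockoff = sample_x_given_h(h_knockoff)  # step (3)
```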
Some Examples
Simulations with a synthetic Markov chain

Markov chain covariates with 5 hidden states; binomial response

[Figure: Power and FDP against signal amplitude (4–20), over 100 repetitions (true FX)]

n = 1000, p = 1000, target FDR: α = 0.1
Zj = |β̂j(λCV)|, Wj = Zj − Z̃j
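The FDP control reported in these figures comes from applying the knockoff+ selection rule from Lecture 1 to the statistics Wj = Zj − Z̃j. A minimal sketch of that rule (the toy W values below are made up for illustration):

```python
import numpy as np

def knockoff_plus_select(W, q):
    """Knockoff+ filter: select {j : W_j >= tau}, where tau is the smallest
    t > 0 with (1 + #{j : W_j <= -t}) / max(1, #{j : W_j >= t}) <= q."""
    W = np.asarray(W, dtype=float)
    for t in np.sort(np.abs(W[W != 0])):       # candidate thresholds
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return np.flatnonzero(W >= t)
    return np.array([], dtype=int)             # nothing passes the filter

# Hypothetical statistics W_j = Z_j - Z~_j for six variables:
W = [5.0, 4.0, 3.0, -2.0, 1.0, -0.5]
selected = knockoff_plus_select(W, q=0.34)     # -> variables 0, 1, 2
```

Large positive Wj is evidence that the original variable beats its knockoff; the negative Wj act as an internal estimate of the number of false positives.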
Robustness

Markov chain covariates with 5 hidden states; binomial response

[Figure: Power and FDP against signal amplitude (4–20), over 100 repetitions (estimated FX)]

n = 1000, p = 1000, target FDR: α = 0.1
Zj = |β̂j(λCV)|, Wj = Zj − Z̃j
Simulations with a synthetic HMM

HMM covariates with a latent “clockwise” Markov chain; binomial response

[Figure: Power and FDP against signal amplitude (3–20), over 100 repetitions (true FX)]

n = 1000, p = 1000, target FDR: α = 0.1
Zj = |β̂j(λCV)|, Wj = Zj − Z̃j
Robustness

HMM covariates with a latent “clockwise” Markov chain; binomial response

[Figure: Power and FDP against signal amplitude (3–20), over 100 repetitions (estimated FX)]

n = 1000, p = 1000, target FDR: α = 0.1
Zj = |β̂j(λCV)|, Wj = Zj − Z̃j
Out-of-sample parameter estimation

Inhomogeneous Markov chain covariates with 5 hidden states; binomial response

[Figure: Power and FDP against the number of unsupervised observations (10–10,000), over 100 repetitions (estimated FX from an independent dataset)]

n = 1000, p = 1000, target FDR: α = 0.1
Zj = |β̂j(λCV)|, Wj = Zj − Z̃j
Genetic Data Analysis
Genetic analysis

Crohn’s disease (CD)
Wellcome Trust Case Control Consortium (WTCCC)
n ≈ 5,000 subjects (≈ 2,000 patients, ≈ 3,000 healthy controls)
p ≈ 400,000 SNPs
Previously analyzed in WTCCC (2007)

Lipid traits (HDL, LDL cholesterol)
Northern Finland 1966 Birth Cohort study of metabolic syndrome (NFBC)
n ≈ 4,700 subjects
p ≈ 330,000 SNPs
Previously analyzed in Sabatti et al. (2009)
High-level results

Knockoffs with nominal FDR level of 10%

Power is much higher:

Dataset | Original study | Knockoffs (average)
CD      | 9              | 22.8
HDL     | 5              | 8
LDL     | 6              | 9.8

Quite a few of the discoveries made by knockoffs were confirmed by larger GWAS (Franke et al., ’10; Willer et al., ’13)

Knockoffs made a number of new discoveries
Expect some (roughly 10%) of these to be false discoveries
It is likely that many of these correspond to true discoveries
Evidence from independent studies about adjacent genes shows some of the top unconfirmed hits to be promising candidates
Selection frequency | SNP (cluster size) | Chr. | Position range (Mb) | Franke et al. ’10 | WTCCC ’07
100% rs11209026 (2) 1 67.31–67.42 yes yes
99% rs6431654 (20) 2 233.94–234.11 yes yes
98% rs6688532 (33) 1 169.4–169.65 yes
97% rs17234657 (1) 5 40.44–40.44 yes yes
95% rs11805303 (16) 1 67.31–67.46 yes yes
91% rs7095491 (18) 10 101.26–101.32 yes yes
91% rs3135503 (16) 16 49.28–49.36 yes yes
81% rs7768538 (1145) 6 25.19–32.91 yes yes
80% rs6601764 (1) 10 3.85–3.85 yes
75% rs7655059 (5) 4 89.5–89.53
73% rs6500315 (4) 16 49.03–49.07 yes yes
72% rs2738758 (5) 20 61.71–61.82 yes
70% rs7726744 (46) 5 40.35–40.71 yes yes
68% rs11627513 (7) 14 96.61–96.63
66% rs4246045 (46) 5 150.07–150.41 yes yes
62% rs9783122 (234) 10 106.43–107.61
61% rs6825958 (3) 4 55.73–55.77
Table: SNP clusters found to be important for CD over 100 repetitions of knockoffs.
Selection frequency | SNP (cluster size) | Chr. | Position range (Mb) | Confirmed in Willer et al. ’13 | Found in Sabatti et al. ’09
100% rs1532085 (4) 15 58.68–58.7 yes yes
100% rs7499892 (1) 16 57.01–57.01 yes yes
100% rs1800961 (1) 20 43.04–43.04 yes
99% rs1532624 (2) 16 56.99–57.01 yes yes
95% rs255049 (142) 16 66.41–69.41 yes yes
Table: SNP clusters found to be important for HDL over 100 repetitions of knockoffs.
Selection frequency | SNP (cluster size) | Chr. | Position range (Mb) | Confirmed in Willer et al. ’13 | Found in Sabatti et al. ’09
99% rs4844614 (34) 1 207.3–207.88 yes
97% rs646776 (5) 1 109.8–109.82 yes yes
97% rs2228671 (2) 19 11.2–11.21 yes yes
94% rs157580 (4) 19 45.4–45.41 yes yes
92% rs557435 (21) 1 55.52–55.72 yes
80% rs10198175 (1) 2 21.13–21.13 yes yes
76% rs10953541 (58) 7 106.48–107.3
62% rs6575501 (1) 14 95.64–95.64
Table: SNP clusters found to be important for LDL over 100 repetitions of knockoffs.
[Figure: bar charts of the number of discoveries for HDL, LDL and CD, and of the proportion of confirmed discoveries for each trait]

Figure: Number of discoveries made on different GWAS datasets (left) and proportion of discoveries confirmed by a meta-analysis (right). Red lines correspond to the results published in the papers that first analyzed our datasets.
Data analysis issues

(1) Estimate the distribution of the SNPs (an HMM) to build knockoffs
(2) Deal with highly correlated SNPs

(1) Estimating the HMM
Methodology of Scheet and Stephens ’06
Fitted with fastPHASE (EM), K ≈ 10 possible hidden states
For each individual, making a knockoff copy of 70,000 SNPs takes about 1.3 s on an Intel Xeon CPU (2.6 GHz) (after parameter estimation)
Highly correlated SNPs

Hard to choose between two or more nearly identical variables if the data support at least one of them being selected

[Diagram: SNPs grouped into clusters, each with a representative]

Cluster SNPs using estimated correlations as a similarity measure and a single-linkage cutoff of 0.5; settle for discovering important SNP clusters among 71,145 candidates for CD and 59,005 for cholesterol
Cluster variables? Choose a representative SNP from each cluster (see also Reid and Tibshirani, ’15); approximate null: cluster rep ⊥⊥ Y | other reps
Which rep? The most significant SNP, as computed on 20% of the samples
Safe data re-use (to optimize power) as in Barber and C. (’16)
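The clustering step above can be sketched without real genotypes: cutting a single-linkage dendrogram at dissimilarity 0.5 is the same as taking the connected components of the graph that joins two SNPs whenever 1 − |corr| < 0.5. The toy matrix and cutoff below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "SNP" matrix (illustrative, not real genotypes): columns 0 and 1 are
# nearly identical, column 2 is independent of both.
base = rng.normal(size=500)
X = np.column_stack([base,
                     base + 0.05 * rng.normal(size=500),
                     rng.normal(size=500)])

def single_linkage_clusters(X, cutoff=0.5):
    """Cluster columns with dissimilarity 1 - |corr|: a single-linkage cut at
    `cutoff` equals the connected components of the graph that links columns
    i and j whenever their dissimilarity is below the cutoff."""
    d = 1.0 - np.abs(np.corrcoef(X, rowvar=False))
    m = X.shape[1]
    labels = list(range(m))
    for i in range(m):
        for j in range(i + 1, m):
            if d[i, j] < cutoff:               # merge the two components
                li, lj = labels[i], labels[j]
                labels = [li if l == lj else l for l in labels]
    return labels

labels = single_linkage_clusters(X)            # columns 0 and 1 share a cluster
```

In the actual analysis each cluster would then be represented by its most significant SNP, computed on a held-out 20% of the samples.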
Safe data re-use

We used an independent split of the data to select representative SNPs:
X(0): used for selecting the reps, and safely re-used for inference
X(1), X̃(1): used only for inference

[Diagram: signs of the statistics ordered by |W|; under the null, each sign is + or − with probability 1/2]

Re-use the data to improve the ordering, but not to compute the signs (1-bit p-values)
Simulations with genetic covariates

Real genetic covariates X; logistic conditional model Y | X with 60 variables

[Figure: Power and FDP against signal amplitude (8–20), over 100 repetitions]

Zj = |β̂j(λCV)|, Wj = Zj − Z̃j, target FDR: α = 0.1
Diagnostic plot: simulation with data from Chromosome 1

Feature importance Zj = |β̂j(λCV)|

[Figure: scatter plot of feature importances (0.00–0.15) against variable index (0–10,000) for the original and knockoff variables]
Diagnostic plot: simulation with data from Chromosome 1
Feature importance Zj = |βj(λCV)|
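The importance statistic in the plot, Zj = |βj(λCV)|, is just the absolute lasso coefficient at the cross-validated penalty. A minimal sketch on simulated data (the data, dimensions, and variable names here are ours, not from the lecture):

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Simulated design: n = 200 samples, p = 50 variables, first 5 are true signals.
rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = 1.0
y = X @ beta + rng.standard_normal(n)

# Fit the lasso along a path, picking lambda by 5-fold cross-validation.
lasso = LassoCV(cv=5, random_state=0).fit(X, y)

# Feature importance Z_j = |beta_j(lambda_CV)|, as in the diagnostic plot.
Z = np.abs(lasso.coef_)
```

In the knockoff filter these importances would be computed jointly for the original variables and their knockoff copies, and contrasted as Wj = Zj − Z̃j.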
[Figure: a second feature-importance scatter plot with the same axes (variable index 0–10,000 vs. feature importance 0.00–0.15).]
Results of data analysis
Selection frequency | SNP (cluster size) | Chr. | Position range (Mb) | Franke et al. '10 | WTCCC '07
100% | rs11209026 (2)    |  1 | 67.31–67.42   | yes | yes
99%  | rs6431654 (20)    |  2 | 233.94–234.11 | yes | yes
98%  | rs6688532 (33)    |  1 | 169.4–169.65  | yes |
97%  | rs17234657 (1)    |  5 | 40.44–40.44   | yes | yes
95%  | rs11805303 (16)   |  1 | 67.31–67.46   | yes | yes
91%  | rs7095491 (18)    | 10 | 101.26–101.32 | yes | yes
91%  | rs3135503 (16)    | 16 | 49.28–49.36   | yes | yes
81%  | rs7768538 (1145)  |  6 | 25.19–32.91   | yes | yes
80%  | rs6601764 (1)     | 10 | 3.85–3.85     | yes |
75%  | rs7655059 (5)     |  4 | 89.5–89.53    |     |
73%  | rs6500315 (4)     | 16 | 49.03–49.07   | yes | yes
72%  | rs2738758 (5)     | 20 | 61.71–61.82   | yes |
70%  | rs7726744 (46)    |  5 | 40.35–40.71   | yes | yes
68%  | rs11627513 (7)    | 14 | 96.61–96.63   |     |
66%  | rs4246045 (46)    |  5 | 150.07–150.41 | yes | yes
62%  | rs9783122 (234)   | 10 | 106.43–107.61 |     |
61%  | rs6825958 (3)     |  4 | 55.73–55.77   |     |
Table: SNP clusters found to be important for CD over 100 repetitions of knockoffs.
Selection frequency | SNP (cluster size) | Chr. | Position range (Mb) | Confirmed in Willer et al. '13 | Found in Sabatti et al. '09
100% | rs1532085 (4)  | 15 | 58.68–58.7  | yes | yes
100% | rs7499892 (1)  | 16 | 57.01–57.01 | yes | yes
100% | rs1800961 (1)  | 20 | 43.04–43.04 | yes |
99%  | rs1532624 (2)  | 16 | 56.99–57.01 | yes | yes
95%  | rs255049 (142) | 16 | 66.41–69.41 | yes | yes
Table: SNP clusters found to be important for HDL over 100 repetitions of knockoffs.
Selection frequency | SNP (cluster size) | Chr. | Position range (Mb) | Confirmed in Willer et al. '13 | Found in Sabatti et al. '09
99% | rs4844614 (34)  |  1 | 207.3–207.88  | yes |
97% | rs646776 (5)    |  1 | 109.8–109.82  | yes | yes
97% | rs2228671 (2)   | 19 | 11.2–11.21    | yes | yes
94% | rs157580 (4)    | 19 | 45.4–45.41    | yes | yes
92% | rs557435 (21)   |  1 | 55.52–55.72   | yes |
80% | rs10198175 (1)  |  2 | 21.13–21.13   | yes | yes
76% | rs10953541 (58) |  7 | 106.48–107.3  |     |
62% | rs6575501 (1)   | 14 | 95.64–95.64   |     |
Table: SNP clusters found to be important for LDL over 100 repetitions of knockoffs.
Summary and open questions
Knockoffs offers finite-sample inferential properties in subtle and important problems
Knockoffs is a powerful, flexible, and robust solution whenever there is considerable outside information on the distribution of X, as in GWAS
Knockoffs addresses the replicability issue
Where is the burden of knowledge?
Robustness theory (Barber, Samworth and C.)
Derandomization (multiple knockoffs)
Knockoff constructions and statistics for other applications
What’s happening in selective inference III?
Lecture 3 (Thu. 8:30 a.m.)
Other views on selective inference: geography & vignettes
False coverage rate (Benjamini & Yekutieli)
POSI (Berk, Brown, Buja, Zhang, Zhao)
Inference after Lasso (Taylor et al.)
Selective hypothesis testing (Fithian et al.)
Thank You!
Derandomization
Combine information from multiple knockoffs: who's consistently showing up?
[Figure: Cartoon representation of the W statistics from different sample realizations of knockoffs — the ordering of variables by |W| changes from one realization to the next.]
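The selection frequencies reported in the tables above follow this idea: rerun the (randomized) knockoff filter many times and keep the variables that show up consistently. A minimal sketch, where `select_once` is a hypothetical stand-in for one run of the knockoff filter (the function names and the 60% threshold are our own choices, not from the lecture):

```python
import numpy as np

def derandomized_selection(select_once, n_reps=100, freq_threshold=0.6):
    """Aggregate repeated runs of a randomized selection procedure.

    select_once() should perform one run of the knockoff filter and
    return a boolean mask of selected variables (hypothetical interface).
    Returns the per-variable selection frequency and the indices selected
    in at least a freq_threshold fraction of the runs.
    """
    counts = None
    for _ in range(n_reps):
        sel = np.asarray(select_once(), dtype=float)
        counts = sel if counts is None else counts + sel
    freq = counts / n_reps
    return freq, np.flatnonzero(freq >= freq_threshold)
```

With 100 repetitions this reproduces the "selection frequency" column of the data-analysis tables: a variable reported at 95% was selected in 95 of the 100 knockoff realizations.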
For a Markov chain X1 → X2 → · · · → Xp with initial distribution q1 and transition kernels Qj(·|·), the knockoffs X̃1, X̃2, . . . are sampled sequentially.

Sampling X̃1
p(X1 | X−1) = p(X1 | X2) = p(X1, X2) / p(X2) = q1(X1) Q2(X2|X1) / Z1(X2),
where Z1(z) = Σu q1(u) Q2(z|u).

Sampling X̃2
p(X2 | X−2, X̃1) = p(X2 | X1, X3, X̃1) ∝ Q2(X2|X1) Q3(X3|X2) · Q2(X2|X̃1) / Z1(X2),
with normalization constant Z2(X3), where
Z2(z) = Σu Q2(u|X1) Q3(z|u) Q2(u|X̃1) / Z1(u).
Sampling X̃3
p(X3 | X−3, X̃1, X̃2) = p(X3 | X2, X4, X̃1, X̃2) ∝ Q3(X3|X2) Q4(X4|X3) · Q3(X3|X̃2) / Z2(X3),
with normalization constant Z3(X4), where
Z3(z) = Σu Q3(u|X2) Q4(z|u) Q3(u|X̃2) / Z2(u).

And so on, sampling each X̃j in turn.
Computationally efficient: O(p)
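For a chain on finitely many states, the recursion above can be sketched directly: maintain the normalization function Z from the previous step and sample each knockoff coordinate in turn. This is our own illustrative implementation (the function name and interface are not from the lecture); with K states each step costs O(K²), so the whole pass is O(p) in the chain length, as the slide notes.

```python
import numpy as np

def markov_knockoffs(x, q1, Q, rng=None):
    """Sample a knockoff copy of a discrete Markov chain, sketching the
    sequential sampler above.

    x:  observed chain, length p, values in {0, ..., K-1}
    q1: initial distribution, shape (K,)
    Q:  list of p-1 transition matrices; Q[j][u, v] = P(X_{j+2}=v | X_{j+1}=u)
    """
    rng = np.random.default_rng() if rng is None else rng
    p, K = len(x), len(q1)
    xt = np.empty(p, dtype=int)
    Z = np.ones(K)  # running normalization Z_{j-1}(z); Z_0 is identically 1
    for j in range(p):
        # "forward" factor: q1(z) at the first step, Q_j(z | x_{j-1}) afterwards
        fwd = q1 if j == 0 else Q[j - 1][x[j - 1], :]
        # knockoff coupling factor Q_j(z | x~_{j-1}) / Z_{j-1}(z)
        base = fwd if j == 0 else fwd * Q[j - 1][xt[j - 1], :] / Z
        if j + 1 < p:
            w = base * Q[j][:, x[j + 1]]  # times Q_{j+1}(x_{j+1} | z)
            Z = base @ Q[j]               # Z_j(z) = sum_u base(u) Q_{j+1}(z | u)
        else:
            w = base                      # last coordinate: no forward factor
        xt[j] = rng.choice(K, p=w / w.sum())
    return xt
```

Checking the first two steps against the formulas: at j = 0 the weights are q1(z) Q2(x2|z) with Z1 = q1 @ Q2, and at j = 1 they are Q2(z|x1) Q3(x3|z) Q2(z|x̃1)/Z1(z), matching the expressions on the slides.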