What’s Happening in Selective Inference II? Emmanuel Candès, Stanford University. The 2017 Wald Lectures, Joint Statistical Meetings, Baltimore, August 2017


Page 1:

What’s Happening in Selective Inference II?

Emmanuel Candès, Stanford University

The 2017 Wald Lectures, Joint Statistical Meetings, Baltimore, August 2017

Page 2:

Lecture 2: Special dedication

Chiara Sabatti

Page 3:

Agenda: The knockoff machine

(1) The knockoff framework (mostly from yesterday)

(2) Knockoffs for fixed covariates

(3) Knockoffs for random covariates

(4) Knockoffs for genome-wide association studies (GWAS)

(5) Genetic data analysis

Page 4:

The Knockoffs Framework (Summary from Lecture 1)

Page 5:

Controlled variable selection

[Background figure: Fig. 4 of the WTCCC genome-wide association study (Nature, vol. 447, 7 June 2007), "Genome-wide scan for seven diseases": −log10(P) of trend-test P values plotted against chromosomal position for type 2 diabetes, coronary artery disease, Crohn's disease, hypertension, rheumatoid arthritis, type 1 diabetes and bipolar disorder; panels truncated at −log10(P) = 15, P < 10⁻⁵ highlighted.]

Response Y (e.g. disease status)

Features X1, . . . , Xp (e.g. SNPs)

Question: distribution of Y |X depends on X through which variables?

Goal: select a set of features X_j that are likely to be relevant without too many false positives, so as not to run into the problem of irreproducibility

FDR = E[ #false positives / #features selected ]   (the ratio inside the expectation is the FDP)
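As a toy illustration of the quantity being controlled (indices and selections here are hypothetical), the FDP of a selected set is a simple ratio, and the FDR is its expectation over repeated experiments:

```python
# Toy illustration (hypothetical indices): the FDP of a selected set, whose
# expectation over repeated experiments is the FDR.
def fdp(selected, true_nulls):
    """#false positives / #features selected (0 if nothing selected)."""
    false_pos = len(set(selected) & set(true_nulls))
    return false_pos / max(len(selected), 1)

nulls = set(range(5, 100))          # variables 0-4 relevant, the rest null
print(fdp([0, 1, 2, 7, 9], nulls))  # → 0.4 (two of five selections are null)
```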


Page 7:

Which variables should we report?

Feature importance Zj from random forests

[Scatter plot: feature importance (y-axis, roughly 1 to 7) against variable index (x-axis, 0 to 500); axes labeled "Variables" and "Feature Importance".]

Page 8:

Which variables should we report?

Feature importance Zj from random forests

[Same feature-importance scatter plot as on the previous slide, with some of the largest importances singled out.]

True positives?

Page 9:

Knockoffs as negative controls

[Scatter plot: feature importances for variables 0 to 1000, original features and their knockoffs overlaid; legend: Original, Knockoffs.]

Page 10:

Exchangeability of feature importance statistics

Knockoff-agnostic feature importance Z:

(Z_1, …, Z_p, Z̃_1, …, Z̃_p) = z([X, X̃], y)

(the first p coordinates score the originals, the last p the knockoffs)

[Scatter plot: feature importances for originals and knockoffs, as on the previous slide.]

This lecture: can construct knockoff features such that

j null ⟹ (Z_j, Z̃_j) =_d (Z̃_j, Z_j)

and, more generally, T a subset of nulls ⟹ (Z, Z̃)_swap(T) =_d (Z, Z̃)


Page 12:

Knockoffs-adjusted scores

[Diagram: the W_j's laid out by sign along the |W| axis; under the null, each sign is equally likely.]

Ordering of variables + 1-bit p-values

Adjusted scores W_j with flip-sign property: combine Z_j and Z̃_j into a single (knockoff) score

W_j = w_j(Z_j, Z̃_j),   with   w_j(Z̃_j, Z_j) = −w_j(Z_j, Z̃_j)

e.g.  W_j = Z_j − Z̃_j,   or   W_j = (Z_j ∨ Z̃_j) · (+1 if Z_j > Z̃_j, −1 if Z_j ≤ Z̃_j)

⟹ Conditional on |W|, the signs of the null W_j's are i.i.d. coin flips
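The two example statistics above can be sketched in a few lines (numpy assumed; function names are hypothetical). Both satisfy the flip-sign property: swapping Z_j and Z̃_j flips the sign of W_j.

```python
# Sketch (names hypothetical): the two flip-sign statistics from the slide,
# assuming importance vectors Z (originals) and Z_tilde (knockoffs) are given.
import numpy as np

def w_difference(Z, Z_tilde):
    # W_j = Z_j - Z_tilde_j
    return np.asarray(Z, float) - np.asarray(Z_tilde, float)

def w_signed_max(Z, Z_tilde):
    # W_j = (Z_j v Z_tilde_j), with sign +1 if Z_j > Z_tilde_j, else -1
    Z, Zt = np.asarray(Z, float), np.asarray(Z_tilde, float)
    return np.maximum(Z, Zt) * np.where(Z > Zt, 1.0, -1.0)

Z, Zt = [3.0, 0.2, 1.5], [0.5, 0.4, 1.5]
print(w_difference(Z, Zt))
print(w_signed_max(Z, Zt))
```

A large positive W_j says the original variable beat its knockoff; under the null the two are exchangeable, so the sign is a coin flip.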

Page 13:

Selection by sequential testing

[Diagram: signs of the W_j's ordered by |W_j|; a threshold t separates the selections.]

S+(t) = {j : W_j ≥ t},   S−(t) = {j : W_j ≤ −t}

Select S+(t) ⟹ FDP_hat(t) = (1 + |S−(t)|) / (1 ∨ |S+(t)|)

Theorem (Barber and C. ('15)). Select S+(τ) with τ = min{t : FDP_hat(t) ≤ q}. Then

Knockoff:   E[ #false positives / (#selections + q^{-1}) ] ≤ q
Knockoff+:  E[ #false positives / #selections ] ≤ q
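The threshold τ in the theorem can be sketched as a minimal knockoff+ selection rule, assuming flip-sign scores W are already computed (the helper name is hypothetical):

```python
# Minimal sketch of knockoff+ selection (helper name hypothetical): given
# flip-sign scores W, find tau = min{t : (1 + #{W_j <= -t}) / (#{W_j >= t} v 1) <= q}
# over candidate thresholds t in {|W_j| : W_j != 0}, and return S+ = {j : W_j >= tau}.
import numpy as np

def knockoff_plus_select(W, q=0.1):
    W = np.asarray(W, dtype=float)
    for t in np.sort(np.unique(np.abs(W[W != 0]))):   # candidate thresholds
        fdp_hat = (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= q:
            return np.flatnonzero(W >= t)
    return np.array([], dtype=int)                    # nothing passes: select nothing

W = np.array([4.0, 3.5, 3.0, 2.5, 2.0, 1.5, -1.0, 0.8])
print(knockoff_plus_select(W, q=0.3))  # → [0 1 2 3 4 5 7]
```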

Page 14:

Why Can We Invert the Estimate of FDP? (Proof Sketch of FDR Control)

Pages 15-19: Why does all this work?

τ = min{ t : (1 + |S−(t)|) / (|S+(t)| ∨ 1) ≤ q },   where S+(t) = {j : W_j ≥ t}, S−(t) = {j : W_j ≤ −t}

[Diagram: signs of the W_j's ordered by |W_j|, with the threshold sweeping in from 0.]

Write V+(t) = #{j null : j ∈ S+(t)} and V−(t) = #{j null : j ∈ S−(t)}. Then

FDP(τ) = V+(τ) / (|S+(τ)| ∨ 1)
       = [V+(τ) / (1 + V−(τ))] · [(1 + V−(τ)) / (|S+(τ)| ∨ 1)]
       ≤ q · V+(τ) / (1 + V−(τ)),

since V−(τ) ≤ |S−(τ)| and, by the definition of τ, (1 + |S−(τ)|) / (|S+(τ)| ∨ 1) ≤ q.

To show:   E[ V+(τ) / (1 + V−(τ)) ] ≤ 1

Pages 20-23: Martingales

V+(t) / (1 + V−(t)) is a (super)martingale with respect to F_t = σ({V±(u)}_{u ≤ t}).

[Diagram: null W_j's on the |W| axis with thresholds t < s and V+(s) + V−(s) = m.]

Conditioned on V+(s) + V−(s), V+(s) is hypergeometric (the null signs are i.i.d. coin flips given |W|), and

E[ V+(s) / (1 + V−(s)) | V±(t), V+(s) + V−(s) ] ≤ V+(t) / (1 + V−(t))

Page 24: Optional stopping theorem

With stopping time τ,

FDR ≤ q · E[ V+(τ) / (1 + V−(τ)) ] ≤ q · E[ V+(0) / (1 + V−(0)) ] ≤ q

since V+(0) ∼ Bin(#nulls, 1/2).
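For concreteness, the stopping rule behind this bound is the knockoff+ threshold from Lecture 1: τ is the smallest t at which the estimated FDP drops below q. A minimal numpy sketch, with made-up W statistics:

```python
import numpy as np

def knockoff_plus_threshold(W, q):
    """tau = min{ t : (1 + #{j : W_j <= -t}) / max(1, #{j : W_j >= t}) <= q },
    the estimated-FDP stopping rule behind the bound above."""
    for t in np.sort(np.abs(W[W != 0])):
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return t
    return np.inf   # no threshold achieves level q

W = np.array([4.0, -0.5, 3.2, 2.1, -1.0, 5.5, 0.7, -0.2, 2.9, 3.8])
tau = knockoff_plus_threshold(W, q=0.2)
selected = np.where(W >= tau)[0]   # features with W_j >= tau are reported
```

The "+1" in the numerator is exactly the "1 + V−" appearing in the martingale above; it is what buys exact (rather than modified) FDR control.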

Page 25

Knockoffs for Fixed Features

Joint with Barber

Page 26: Linear model

y = Xβ + z,   where Xβ = Σj βj Xj   (y: n×1, X: n×p, β: p×1, z: n×1)

y ∼ N(Xβ, σ²I)

Fixed design X

Noise level σ unknown

Multiple testing: Hj : βj = 0 (is the jth variable in the model?)

Identifiability =⇒ p ≤ n

Inference (FDR control) will hold conditionally on X

Page 27: Knockoff features (fixed X)

Originals X1, . . . , Xp — Knockoffs X̃1, . . . , X̃p

X̃′j X̃k = X′j Xk for all j, k

X′j X̃k = X′j Xk for all j ≠ k

No need for new data or experiment

No knowledge of response y

Page 30: Knockoff construction (n ≥ 2p)

Problem: given X ∈ R^{n×p}, find X̃ ∈ R^{n×p} s.t.

[X X̃]′ [X X̃] = [ Σ            Σ − diag{s} ]
                [ Σ − diag{s}  Σ           ]  := G ⪰ 0

G ⪰ 0 ⇐⇒ diag{s} ⪰ 0 and 2Σ − diag{s} ⪰ 0

Solution

X̃ = X(I − Σ⁻¹ diag{s}) + UC

U ∈ R^{n×p} with column space orthogonal to that of X

C′C Cholesky factorization of 2 diag{s} − diag{s} Σ⁻¹ diag{s} ⪰ 0
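The solution above can be sketched in a few lines of numpy. This is a toy sketch: the random design, the dimensions, and the slight shrinkage of the equi-correlated choice sj = min(2λmin(Σ), 1) (for numerical safety) are assumptions made here, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 20                      # need n >= 2p

# Toy design with centered, unit-norm (standardized) columns
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)
X /= np.linalg.norm(X, axis=0)
Sigma = X.T @ X

# Equi-correlated choice s_j = min(2 lambda_min(Sigma), 1), shrunk slightly so
# that 2 diag{s} - diag{s} Sigma^{-1} diag{s} stays positive definite
s = 0.999 * min(2 * np.linalg.eigvalsh(Sigma)[0], 1.0)
D = s * np.eye(p)

# U: orthonormal columns orthogonal to the column space of X (uses n >= 2p)
Q, _ = np.linalg.qr(np.hstack([X, rng.standard_normal((n, p))]))
U = Q[:, p:2 * p]

# C'C = 2 diag{s} - diag{s} Sigma^{-1} diag{s}   (Cholesky factorization)
C = np.linalg.cholesky(2 * D - D @ np.linalg.solve(Sigma, D)).T

X_ko = X @ (np.eye(p) - np.linalg.solve(Sigma, D)) + U @ C

# The Gram-matrix requirements from the previous slide hold:
assert np.allclose(X_ko.T @ X_ko, Sigma, atol=1e-8)
assert np.allclose(X.T @ X_ko, Sigma - D, atol=1e-8)
```

Note that the construction uses X alone: the response y is never touched, as the slides emphasize.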

Page 34: Knockoff construction (n ≥ 2p)

X′j X̃j = 1 − sj (standardized columns)

Equi-correlated knockoffs: sj = 2λmin(Σ) ∧ 1

Under equivariance, minimizes the value of |⟨Xj, X̃j⟩|

SDP knockoffs: minimize Σj |1 − sj| subject to sj ≥ 0, diag{s} ⪯ 2Σ

Highly structured semidefinite program (SDP)

Other possibilities ...

Page 37: Why?

For a null feature Xj (βj = 0),

X′j y = X′j Xβ + X′j z  =d  X̃′j Xβ + X̃′j z = X̃′j y

Page 39: Why?

For any subset of nulls T,

[X X̃]′_swap(T) y  =d  [X X̃]′ y

where swap(T) exchanges the columns Xj and X̃j for j ∈ T

Page 40: Exchangeability of feature importance statistics

Sufficiency: (Z, Z̃) = z([X X̃]′ [X X̃], [X X̃]′ y)

Knockoff-agnostic: swapping originals and knockoffs =⇒ swaps the Z’s

z([X X̃]_swap(T), y) = (Z, Z̃)_swap(T)

Theorem (Barber and C. (’15))

For any subset T of nulls

(Z, Z̃)_swap(T) =d (Z, Z̃)

=⇒ FDR control (conditional on X)

Page 42: Telling the effect direction

[...] in classical statistics, the significance of comparisons (e.g., θ1 − θ2) is calibrated using Type I error rate, relying on the assumption that the true difference is zero, which makes no sense in many applications. [...] a more relevant framework in which a true comparison can be positive or negative, and, based on the data, you can state “θ1 > θ2 with confidence”, “θ2 > θ1 with confidence”, or “no claim with confidence”.

A. Gelman & F. Tuerlinckx

Page 43: Directional FDR

Are any effects exactly zero?

FDRdir = E[ (# selections with wrong effect direction) / (# selections) ]

(directional false discovery rate = expected directional false discovery proportion)

Directional FDR (Benjamini & Yekutieli, ’05)

Sign errors (Type-S) (Gelman & Tuerlinckx, ’00)

Important for misspecified models — exact sparsity unlikely

Page 44: Directional FDR control

(Xj − X̃j)′ y ∼ind N(sj · βj, 2σ² · sj),   sj ≥ 0

Sign estimate: sgn((Xj − X̃j)′ y)

Theorem (Barber and C., ’16)

Exact same knockoff selection + sign estimate:

FDR ≤ FDRdir ≤ q

[Figure: signs (±) of null and non-null Wj along the |W| axis]

Null coin flips are unbiased

Page 47: Directional FDR control

Great subtlety: coin flips are now biased
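A quick Monte Carlo makes the distributional claim above concrete. This is a toy sketch under assumed settings (an orthogonal design, so Σ = I and sj = 1, with the knockoff copy taken as an orthonormal complement of col(X)), not the construction used in the data analyses:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma, reps = 200, 5, 1.0, 5000

# Toy orthogonal design: Sigma = I, hence s_j = 1, and the knockoff copy can be
# taken as an orthonormal U with columns orthogonal to col(X) (assumption)
Q, _ = np.linalg.qr(rng.standard_normal((n, 2 * p)))
X, X_ko = Q[:, :p], Q[:, p:]

beta = np.array([3.0, -3.0, 0.0, 0.0, 0.0])
Y = (X @ beta)[:, None] + sigma * rng.standard_normal((n, reps))
D = (X - X_ko).T @ Y        # column r holds (X_j - X~_j)' y for replicate r

# Each row of D is approximately N(s_j * beta_j, 2 * sigma^2 * s_j), s_j = 1 here
means, variances = D.mean(axis=1), D.var(axis=1)
sign_hat = np.sign(D)       # estimated effect directions, one column per replicate
```

Across replicates the empirical means track β and the variances sit near 2σ²sj = 2, so the sign estimate recovers the direction of the two nonzero effects with high probability.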

Page 48: Empirical results

Features N(0, In), n = 3000, p = 1000

k = 30 variables with regression coefficients of magnitude 3.5

Method (nominal level q = 20%)   FDR (%)   Power (%)   Theor. FDR control?
Knockoff+ (equivariant)           14.40     60.99       Yes
Knockoff (equivariant)            17.82     66.73       No
Knockoff+ (SDP)                   15.05     61.54       Yes
Knockoff (SDP)                    18.72     67.50       No
BHq                               18.70     48.88       No
BHq + log-factor correction        2.20     19.09       Yes
BHq with whitened noise           18.79      2.33       Yes

Page 49: Effect of signal amplitude

Same setup with k = 30 (q = 0.2)

[Figure: FDR (%) and Power (%) vs. amplitude A ∈ [2.8, 4.2] for Knockoff, Knockoff+, and BHq, with the nominal level marked]

Page 50: Effect of feature correlation

Features ∼ N(0, Θ), Θjk = ρ^|j−k|

n = 3000, p = 1000, k = 30, amplitude = 3.5

[Figure: FDR (%) and Power (%) vs. feature correlation ρ ∈ [0, 0.8] for Knockoff, Knockoff+, and BHq, with the nominal level marked]

Page 51

Fixed Design Knockoff Data Analysis

Page 52: HIV drug resistance

Drug type   # drugs   Sample size   # protease or RT positions genotyped   # mutations appearing ≥ 3 times in sample
PI              6         848              99                                  209
NRTI            6         639             240                                  294
NNRTI           3         747             240                                  319

response y: log-fold-increase of lab-tested drug resistance

covariate Xj: presence or absence of mutation #j

Data from R. Shafer (Stanford) available at:

http://hivdb.stanford.edu/pages/published_analysis/genophenoPNAS2006/

Page 53: HIV data

TSM list: mutations associated with the PI class of drugs in general; it is not specialized to the individual drugs in the class

Results for PI-type drugs

[Figure: # HIV-1 protease positions selected by Knockoff vs. BHq, split into "appear in TSM list" / "not in TSM list", for resistance to APV (n=768, p=201), ATV (n=329, p=147), IDV (n=826, p=208), LPV (n=516, p=184), NFV (n=843, p=209), and SQV (n=825, p=208)]

Page 54: HIV data

Results for NRTI-type drugs

[Figure: # HIV-1 RT positions selected by Knockoff vs. BHq ("appear in TSM list" / "not in TSM list") for resistance to X3TC (n=633, p=292), ABC (n=628, p=294), AZT (n=630, p=292), D4T (n=630, p=293), DDI (n=632, p=292), and TDF (n=353, p=218)]

Results for NNRTI-type drugs

[Figure: # HIV-1 RT positions selected by Knockoff vs. BHq ("appear in TSM list" / "not in TSM list") for resistance to DLV (n=732, p=311), EFV (n=734, p=318), and NVP (n=746, p=319)]

Page 55: High-dimensional setting

n ≈ 5,000 subjects

p ≈ 330,000 SNPs/vars to test

[Figure: genome-wide association scan results (Nature Genetics 40(2), Feb. 2008): Manhattan plots of −log10(P value) across chromosomes 1–22 for HDL cholesterol, LDL cholesterol and triglycerides, with followed-up loci highlighted, plus quantile-quantile plots of the test statistics]

p > n −→ cannot construct knockoffs as before:

X̃′j X̃k = X′j Xk ∀ j, k
X′j X̃k = X′j Xk ∀ j ≠ k      =⇒ X̃j = Xj ∀ j

Page 56: High dimensional knockoffs: screen and confirm

Original data set (y(0), X(0)), split into two samples

Screen on sample 1 (exploratory): (y(1), X(1))

Inference with knockoffs on sample 2 (confirmatory)

Theory (Barber and C., ’16)

Safe data re-use to improve power (Barber and C., ’16)
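One way to sketch the screen-and-confirm split in numpy. The marginal-correlation screen, the dimensions, and the signal used here are illustrative assumptions; any screening rule computed on sample 1 alone would do:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 300, 400                       # p > n: fixed-X knockoffs impossible directly
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = 5.0                        # three strong (assumed) signals
y = X @ beta + rng.standard_normal(n)

# Split: sample 1 for screening (exploratory), sample 2 for inference (confirmatory)
X1, y1 = X[:150], y[:150]
X2, y2 = X[150:], y[150:]

# Screen on sample 1 only: keep the k features most correlated with y1
k = 50
scores = np.abs(X1.T @ (y1 - y1.mean()))
keep = np.sort(np.argsort(scores)[-k:])

# The confirmatory half restricted to the screened set has p' = 50 <= n2/2 = 75,
# so the fixed-X knockoff filter can now be run on (y2, X2[:, keep])
X2_screened = X2[:, keep]
```

Because the screening step never looks at sample 2, inference on the confirmatory half remains valid conditional on the screened set.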

Page 60: Some extensions

y = X1 β1 + X2 β2 + · · · + N(0, σ²In),   with feature blocks Xg : n × pg

Group sparsity — build knockoffs at the group-wise level (Dai & Barber 2015)

Identify key groups with PCA — build knockoffs only for the top PC in each group (Chen, Hou, Hou 2017)

Build knockoffs only for prototypes selected from each group (Reid & Tibshirani 2015)

Multilayer knockoffs to control FDR at the individual and group levels simultaneously (Katsevich & Sabatti 2017)

Page 61

Knockoffs for Random Features

Joint with Fan, Janson & Lv

Page 62: Variable selection in arbitrary models

Random pair (X, Y) (perhaps thousands/millions of covariates)

p(Y|X) depends on X through which variables?

Working definition of null variables

Say j ∈ H0 is null iff Y ⊥⊥ Xj | X−j

Local Markov property =⇒ non-nulls are the smallest subset S (Markov blanket) s.t.

Y ⊥⊥ {Xj}j∈Sc | {Xj}j∈S

Logistic model: P(Y = 0|X) = 1 / (1 + e^{X⊤β})

If the variables X1:p are not perfectly dependent, then j ∈ H0 ⇐⇒ βj = 0

Page 66: Knockoff features (random X)

i.i.d. samples from p(X, Y)

Distribution of X known

Distribution of Y|X (likelihood) completely unknown

Originals X = (X1, . . . , Xp)

Knockoffs X̃ = (X̃1, . . . , X̃p)

(1) Pairwise exchangeability

(X, X̃)_swap(S) =d (X, X̃)

e.g. (X1, X̃2, X̃3, X̃1, X2, X3) = (X1, X2, X3, X̃1, X̃2, X̃3)_swap({2,3}) =d (X1, X2, X3, X̃1, X̃2, X̃3)

(2) X̃ ⊥⊥ Y | X (ignore Y when constructing knockoffs)

Page 70: Exchangeability of feature importance statistics

Theorem (C., Fan, Janson & Lv (’16))

For knockoff-agnostic scores and any subset T of nulls

(Z, Z̃)_swap(T) =d (Z, Z̃)

This holds no matter the relationship between Y and X

This holds conditionally on Y

=⇒ FDR control (conditional on Y) no matter the relationship between X and Y

Page 72: Knockoffs for Gaussian features

Swapping any subset of original and knockoff features leaves the (joint) distribution invariant

e.g. T = {2, 3}: (X1, X̃2, X̃3, X̃1, X2, X3) =d (X1, X2, X3, X̃1, X̃2, X̃3)

Note X̃ =d X

X ∼ N(µ, Σ)

Possible solution

(X, X̃) ∼ N(∗, ∗∗)   with   ∗ = [µ; µ]   and   ∗∗ = [ Σ            Σ − diag{s} ]
                                                    [ Σ − diag{s}  Σ           ]

s such that ∗∗ ⪰ 0

Given X, sample X̃ from X̃ | X (regression formula)

Different from knockoff features for fixed X!

Page 78:

Robustness

[Figure: Power (left) and FDR (right) versus relative Frobenius norm error of the estimated covariance, for knockoffs built from the exact covariance, the graphical lasso, and the empirical covariance computed from 50%, 62.5%, 75%, 87.5%, and 100% of the data.]

Figure: Covariates are AR(1) with autocorrelation coefficient 0.3; n = 800, p = 1500, and target FDR is 10%. Y | X follows a logistic model with 50 nonzero entries.

Page 85:

Robustness theory

Ongoing work with R. Barber and R. Samworth

(Partial) subject of the 2017 Tweedie Award Lecture

Rina F. Barber

Page 86:

Knockoffs inference with random features

Pros:

No parameters

No p-values

Holds for finite samples

No matter the dependence between Y and X

No matter the dimensionality

Cons: Need to know distribution of covariates

Page 87:

Relationship with classical setup

Classical                                     MF Knockoffs
Observations of X are fixed;                  Observations of X are random (1)
inference is conditional on observed values
Strong model linking Y and X                  Model free (2)
Useful inference even if model inexact        Useful inference even if model inexact (3)

(1) Often appropriate in 'big' data apps: e.g. SNPs of subjects randomly sampled
(2) Shifts the 'burden' of knowledge
(3) More later

Page 91:

Shift in the burden of knowledge

When are our assumptions useful?

When we have large amounts of unsupervised data (e.g. economic studies with the same covariate info but different responses)

When we have more prior information about the covariates than about their relationship with a response (e.g. GWAS)

When we control the distribution of X (experimental crosses in genetics, gene knockout experiments, ...)

Page 92:

Obstacles to obtaining p-values

Y | X ∼ Bernoulli(logit(Xᵀβ))

[Figure: Histograms of p-values (x-axis: p-values, y-axis: count); left panel: global null, AR(1) design; right panel: 20 nonzero coefficients, AR(1) design.]

Figure: Distribution of null logistic regression p-values with n = 500 and p = 200

Page 93:

Obstacles to obtaining p-values

P{p-val ≤ ...%}    Sett. (1)        Sett. (2)        Sett. (3)        Sett. (4)
5%                 16.89% (0.37)    19.17% (0.39)    16.88% (0.37)    16.78% (0.37)
1%                  6.78% (0.25)     8.49% (0.28)     7.02% (0.26)     7.03% (0.26)
0.1%                1.53% (0.12)     2.27% (0.15)     1.87% (0.14)     2.04% (0.14)

Table: Inflated p-value probabilities with estimated Monte Carlo SEs

Page 94:

Shameless plug: distribution of high-dimensional LRTs

Wilks' phenomenon (1938):       2 log L  d→  χ²_df

Sur, Chen, Candès (2017):       2 log L  d→  κ(p/n) χ²_df

[Figure: Histograms of p-values (x-axis: p-values, y-axis: counts) computed from each approximation.]

Page 96:

'Low' dim. linear model with dependent covariates

Zj = |β̂j(λCV)|,   Wj = Zj − Z̃j

[Figure: Power (left) and FDR (right) versus autocorrelation coefficient for BHq Marginal, BHq Max Lik., MF Knockoffs, and Orig. Knockoffs.]

Figure: Low-dimensional setting: n = 3000, p = 1000
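The statistics Zj = |β̂j(λCV)| and Wj = Zj − Z̃j can be computed by fitting a cross-validated lasso to the augmented design [X, X̃]. The sketch below is illustrative, not the authors' code: it uses scikit-learn's LassoCV, and the random column swap (a common precaution so that solver ordering cannot favor originals over knockoffs) is my addition.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def lcd_statistics(X, Xk, y, rng=None):
    """W_j = Z_j - Z~_j, with Z_j = |beta_j(lambda_CV)| from the lasso
    fit on the augmented design [X, Xk] (lasso coefficient difference)."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    # randomly swap original and knockoff columns before fitting
    swap = rng.random(p) < 0.5
    Xa, Xka = X.copy(), Xk.copy()
    Xa[:, swap], Xka[:, swap] = Xk[:, swap], X[:, swap]
    fit = LassoCV(cv=5).fit(np.hstack([Xa, Xka]), y)
    Z = np.abs(fit.coef_[:p])
    Zk = np.abs(fit.coef_[p:])
    W = Z - Zk
    W[swap] *= -1  # undo the swap in the signs of W
    return W
```

By the exchangeability of knockoffs, null Wj are symmetric around zero, while signal variables tend to get large positive Wj.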

Page 97:

'Low' dim. logistic model with indep. covariates

Zj = |β̂j(λCV)|,   Wj = Zj − Z̃j

[Figure: Power (left) and FDR (right) versus coefficient amplitude for BHq Marginal, BHq Max Lik., and MF Knockoffs.]

Figure: Low-dimensional setting: n = 3000, p = 1000

Page 98:

'High' dim. logistic model with dependent covariates

Zj = |β̂j(λCV)|,   Wj = Zj − Z̃j

[Figure: Power (left) and FDR (right) versus autocorrelation coefficient for BHq Marginal and MF Knockoffs.]

Figure: High-dimensional setting: n = 3000, p = 6000

Page 99:

Bayesian knockoff statistics

LCD (Lasso coefficient difference)

BVS (Bayesian variable selection):   Zj = P(βj ≠ 0 | y, X),   Wj = Zj − Z̃j

[Figure: Power (left) and FDR (right) versus amplitude for BVS Knockoffs and LCD Knockoffs.]

Figure: n = 300, p = 1000 and Bayesian linear model with 60 expected variables

Inference is correct even if prior is wrong or MCMC has not converged

Page 101:

Partial summary

No valid p-values, even for logistic regression

Shifts the burden of knowledge to X (the covariates); makes sense in many contexts

Robustness: simulations show the properties of the inference hold even when the model for X is only approximately right

We always have access to these diagnostic checks (later)

When the assumptions are appropriate, we gain a lot of power and can use sophisticated selection techniques

Page 102:

How to Construct Knockoffs for Hidden Markov Models

Joint with Sabatti & Sesia

Page 103:

A general construction (C., Fan, Janson and Lv, '16)

(X1, X̃2, X̃3, X̃1, X2, X3) d= (X1, X2, X3, X̃1, X̃2, X̃3)

Algorithm: Sequential Conditional Independent Pairs (SCIP)

for j = 1, . . . , p do
    Sample X̃j from the law of Xj | X−j, X̃1:j−1
end

e.g. p = 3:

Sample X̃1 from X1 | X−1        → joint law of (X, X̃1) is known
Sample X̃2 from X2 | X−2, X̃1   → joint law of (X, X̃1:2) is known
Sample X̃3 from X3 | X−3, X̃1:2 → joint law of (X, X̃) is known and is pairwise exchangeable!

Usually not practical, but easy in some cases (e.g. Markov chains)
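The SCIP loop can be written down directly for a tiny discrete distribution by brute-force enumeration. The recursion below is an illustrative sketch with hypothetical names; it also makes plain why the construction is "usually not practical": each conditional is evaluated by recursing over all the conditionals built so far, so the cost explodes with p.

```python
import numpy as np

def scip_sample(pmf, x, states, rng):
    """Sequential Conditional Independent Pairs by brute-force enumeration.

    pmf(tuple) -> probability of X = tuple; x is the observed tuple;
    states is the common finite support of each coordinate.
    At step j, X~_j is drawn from the law of X_j | X_{-j}, X~_{1:j-1}.
    """
    p = len(x)

    def q(j, v, xfull, prefix):
        # P(X_j = v | X_{-j} = xfull_{-j}, X~_{1:j-1} = prefix)
        def joint(xc):
            # joint weight of (X = xc, X~_{1:len(prefix)} = prefix)
            w = pmf(xc)
            for m, t in enumerate(prefix):
                w *= q(m, t, xc, prefix[:m])
            return w
        weights = {u: joint(xfull[:j] + (u,) + xfull[j + 1:]) for u in states}
        return weights[v] / sum(weights.values())

    xk = ()
    for j in range(p):
        probs = np.array([q(j, v, x, xk) for v in states])
        idx = rng.choice(len(states), p=probs / probs.sum())
        xk = xk + (states[idx],)
    return xk
```

Sanity check: when the coordinates of X are independent, each conditional collapses to the marginal of Xj, so the knockoffs are simply independent copies.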

Page 112:

Knockoff copies of a Markov chain

X = (X1, X2, . . . , Xp) is a Markov chain:

p(X1, . . . , Xp) = q1(X1) ∏_{j=2}^{p} Qj(Xj | Xj−1)    (X ∼ MC(q1, Q))

[Diagram: observed variables X1, . . . , X4 and knockoff variables X̃1, . . . , X̃4.]

The general algorithm can be implemented efficiently in the case of a Markov chain
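The factorization above is straightforward to evaluate for a discrete chain. A minimal sketch (names are mine), e.g. as the pmf one could feed to a brute-force SCIP sampler:

```python
import numpy as np

def mc_pmf(x, q1, Q):
    """p(x) = q1(x_1) * prod_{j>=2} Q_j(x_j | x_{j-1}) for a discrete chain.

    q1: initial distribution over states (1-D array);
    Q:  list of transition matrices, Q[j-1][a, b] = P(X_{j+1} = b | X_j = a).
    """
    prob = q1[x[0]]
    for j in range(1, len(x)):
        prob *= Q[j - 1][x[j - 1], x[j]]
    return prob
```

The efficient knockoff construction exploits exactly this product structure: the SCIP conditionals involve only neighboring factors plus running normalizing constants, which can be updated recursively along the chain.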

Page 123:

Recursive update of normalizing constants

Page 124:

Hidden Markov Models (HMMs)

X = (X1, X2, . . . , Xp) is an HMM if

    H ∼ MC(q1, Q)                                        (latent Markov chain)
    Xj | H ∼ Xj | Hj, conditionally independent, with Xj ∼ fj(Xj; Hj)    (emission distribution)

[Diagram: latent chain H1 → H2 → H3, each Hj emitting Xj.]

The H variables are latent and only the X variables are observed

Page 130:

Haplotypes and genotypes

Haplotype: set of alleles on a single chromosome; 0/1 for common/rare allele

Genotype: unordered pair of alleles at a single marker

Haplotype M:    0 1 0 1 1 0
Haplotype P:  + 1 1 0 0 1 1
Genotypes:      1 2 0 1 2 1

Page 131:

A phenomenological HMM for haplotype & genotype data

Figure: Six haplotypes; color indicates 'ancestor' at each marker (Scheet, '06)

Haplotype estimation/phasing (Browning, '11); imputation of missing SNPs (Marchini, '10)

fastPHASE (Scheet, '06), IMPUTE (Marchini, '07), MaCH (Li, '10)

New application of the same HMM: generation of knockoff copies of genotypes!
Each genotype: sum of two independent HMM haplotype sequences

Page 134: Emmanuel Cand es, Stanford Universitycandes/talks/slides/Wald2.pdfepisodic central nervous system disease, including seizures, ataxias! log 10 (P) 0 5 10 15 0 5 10 15 0 5 10 15 0 5

Knockoff copies of a hidden Markov model

Theorem (Sesia, Sabatti, C. '17)

A knockoff copy X̃ of X can be constructed as:

(1) Sample H from p(H|X) using the forward-backward algorithm

(2) Generate a knockoff H̃ of H using the SCIP algorithm for a Markov chain

(3) Sample X̃ from the emission distribution of X given H = H̃

[Diagram: imputed latent variables H1, H2, H3 emit the observed variables X1, X2, X3; the knockoff latent variables H̃1, H̃2, H̃3 emit the knockoff variables X̃1, X̃2, X̃3.]
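Step (1) is standard forward filtering / backward sampling. A minimal sketch on a toy discrete HMM (parameters invented for illustration; steps (2) and (3), the SCIP knockoff step for the latent chain and the final emission step, are not shown here):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy discrete HMM (made-up parameters, for illustration only).
K, p = 3, 6
init = np.array([0.5, 0.3, 0.2])
trans = np.array([[0.8, 0.1, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8]])
emit = np.array([[0.9, 0.1],   # P(X_j = 0 or 1 | state k), same at every marker
                 [0.5, 0.5],
                 [0.1, 0.9]])

x = np.array([0, 0, 1, 1, 1, 0])  # observed sequence

# Step (1): sample H ~ p(H | X) by forward filtering / backward sampling.
alpha = np.zeros((p, K))
alpha[0] = init * emit[:, x[0]]
alpha[0] /= alpha[0].sum()
for j in range(1, p):
    alpha[j] = emit[:, x[j]] * (alpha[j - 1] @ trans)
    alpha[j] /= alpha[j].sum()          # normalize for numerical stability

h = np.empty(p, dtype=int)
h[p - 1] = rng.choice(K, p=alpha[p - 1])
for j in range(p - 2, -1, -1):
    w = alpha[j] * trans[:, h[j + 1]]
    h[j] = rng.choice(K, p=w / w.sum())

print(h)  # one posterior draw of the latent path
```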


Some Examples


Simulations with synthetic Markov chain
Markov chain covariates with 5 hidden states. Binomial response

Figure: Power and FDP over 100 repetitions (true FX), plotted against signal amplitude (4–20)
n = 1000, p = 1000, target FDR: α = 0.1
Zj = |β̂j(λCV)|, Wj = Zj − Z̃j
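Given statistics Wj = Zj − Z̃j, variables are selected with the knockoff(+) threshold of Barber and Candès. A hedged sketch on synthetic W values (the lasso fit on the augmented matrix [X, X̃] that produces the Zj is not shown; the signal/null split below is invented for illustration):

```python
import numpy as np

def knockoff_threshold(W, q=0.1, plus=True):
    """Smallest t with (offset + #{W_j <= -t}) / max(1, #{W_j >= t}) <= q."""
    offset = 1 if plus else 0          # offset 1 gives knockoff+ (exact FDR control)
    for t in np.sort(np.abs(W[W != 0])):
        fdp_hat = (offset + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return t
    return np.inf

# Toy statistics: signals give large positive W, nulls have symmetric signs.
rng = np.random.default_rng(2)
W = np.concatenate([rng.uniform(2, 5, 20),      # "signals"
                    rng.normal(0, 0.5, 200)])   # "nulls"
tau = knockoff_threshold(W, q=0.1)
selected = np.where(W >= tau)[0]
print(tau, len(selected))
```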


Robustness
Markov chain covariates with 5 hidden states. Binomial response

Figure: Power and FDP over 100 repetitions (estimated FX), plotted against signal amplitude (4–20)
n = 1000, p = 1000, target FDR: α = 0.1
Zj = |β̂j(λCV)|, Wj = Zj − Z̃j


Simulations with synthetic HMM
HMM covariates with latent "clockwise" Markov chain. Binomial response

Figure: Power and FDP over 100 repetitions (true FX), plotted against signal amplitude (3–20)
n = 1000, p = 1000, target FDR: α = 0.1
Zj = |β̂j(λCV)|, Wj = Zj − Z̃j


Robustness
HMM covariates with latent "clockwise" Markov chain. Binomial response

Figure: Power and FDP over 100 repetitions (estimated FX), plotted against signal amplitude (3–20)
n = 1000, p = 1000, target FDR: α = 0.1
Zj = |β̂j(λCV)|, Wj = Zj − Z̃j


Out-of-sample parameter estimation
Inhomogeneous Markov chain covariates with 5 hidden states. Binomial response

Figure: Power and FDP over 100 repetitions (FX estimated from an independent dataset), plotted against the number of unsupervised observations (10–10,000)
n = 1000, p = 1000, target FDR: α = 0.1
Zj = |β̂j(λCV)|, Wj = Zj − Z̃j


Genetic Data Analysis


Genetic analysis

Crohn's disease (CD)
Wellcome Trust Case Control Consortium (WTCCC)
n ≈ 5,000 subjects (≈ 2,000 patients, ≈ 3,000 healthy controls)
p ≈ 400,000 SNPs
Previously analyzed in WTCCC (2007)

Lipid traits (HDL, LDL cholesterol)
Northern Finland 1966 Birth Cohort study of metabolic syndrome (NFBC)
n ≈ 4,700 subjects
p ≈ 330,000 SNPs
Previously analyzed in Sabatti et al. (2009)


High-level results

Knockoffs with nominal FDR level of 10%

Power is much higher:

Dataset   Original study   Knockoffs (average)
CD        9                22.8
HDL       5                8
LDL       6                9.8

Quite a few of the discoveries made by knockoffs were confirmed by larger GWAS (Franke et al., '10; Willer et al., '13)

Knockoffs made a number of new discoveries:
Expect some (roughly 10%) of these to be false discoveries
It is likely that many of these correspond to true discoveries
Evidence from independent studies about adjacent genes shows some of the top unconfirmed hits to be promising candidates


Columns: selection frequency; SNP (cluster size); Chr.; position range (Mb); confirmed in Franke et al. '10; found in WTCCC '07

100% rs11209026 (2) 1 67.31–67.42 yes yes
99% rs6431654 (20) 2 233.94–234.11 yes yes
98% rs6688532 (33) 1 169.4–169.65 yes
97% rs17234657 (1) 5 40.44–40.44 yes yes
95% rs11805303 (16) 1 67.31–67.46 yes yes
91% rs7095491 (18) 10 101.26–101.32 yes yes
91% rs3135503 (16) 16 49.28–49.36 yes yes
81% rs7768538 (1145) 6 25.19–32.91 yes yes
80% rs6601764 (1) 10 3.85–3.85 yes
75% rs7655059 (5) 4 89.5–89.53
73% rs6500315 (4) 16 49.03–49.07 yes yes
72% rs2738758 (5) 20 61.71–61.82 yes
70% rs7726744 (46) 5 40.35–40.71 yes yes
68% rs11627513 (7) 14 96.61–96.63
66% rs4246045 (46) 5 150.07–150.41 yes yes
62% rs9783122 (234) 10 106.43–107.61
61% rs6825958 (3) 4 55.73–55.77

Table: SNP clusters found to be important for CD over 100 repetitions of knockoffs.


Columns: selection frequency; SNP (cluster size); Chr.; position range (Mb); confirmed in Willer et al. '13; found in Sabatti et al. '09

100% rs1532085 (4) 15 58.68–58.7 yes yes
100% rs7499892 (1) 16 57.01–57.01 yes yes
100% rs1800961 (1) 20 43.04–43.04 yes
99% rs1532624 (2) 16 56.99–57.01 yes yes
95% rs255049 (142) 16 66.41–69.41 yes yes

Table: SNP clusters found to be important for HDL over 100 repetitions of knockoffs.

(Same columns as above.)

99% rs4844614 (34) 1 207.3–207.88 yes
97% rs646776 (5) 1 109.8–109.82 yes yes
97% rs2228671 (2) 19 11.2–11.21 yes yes
94% rs157580 (4) 19 45.4–45.41 yes yes
92% rs557435 (21) 1 55.52–55.72 yes
80% rs10198175 (1) 2 21.13–21.13 yes yes
76% rs10953541 (58) 7 106.48–107.3
62% rs6575501 (1) 14 95.64–95.64

Table: SNP clusters found to be important for LDL over 100 repetitions of knockoffs.


Figure: Number of discoveries made on the different GWAS datasets (left; HDL, LDL, CD) and proportion of discoveries confirmed by a meta-analysis (right). Red lines correspond to results published in the papers that first analyzed our datasets.


Data analysis issues

(1) Estimate distribution of SNPs (HMM) to build knockoffs
(2) Highly correlated SNPs

(1) Estimating the HMM

Methodology of Scheet and Stephens '06
Fitted with fastPHASE (EM), K ≈ 10 possible hidden states
For each individual, making a knockoff copy of 70,000 SNPs takes about 1.3 sec on an Intel Xeon CPU (2.6 GHz), after parameter estimation


Highly correlated SNPs

Hard to choose between two or more nearly identical variables if the data supports at least one of them being selected


Clustering

Cluster SNPs using estimated correlations as similarity measure and a single-linkage cutoff of 0.5; settle for discovering important SNP clusters among 71,145 candidates for CD and 59,005 for cholesterol

Cluster variables? Choose a representative SNP from each cluster (see also Reid and Tibshirani, '15); approximate null: cluster rep ⊥⊥ Y | other reps

Which rep? Most significant SNP as computed on 20% of the samples

Safe data re-use (optimize power) as in Barber and C. ('16)
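The clustering step above can be sketched with scipy. The genotype matrix and its two correlated blocks below are invented for illustration; cutting the single-linkage tree at distance 1 − 0.5 = 0.5 corresponds to the correlation cutoff of 0.5.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(3)

# Toy genotype matrix: 200 individuals x 12 SNPs, two correlated blocks of 6.
base = rng.binomial(2, 0.3, size=(200, 2)).astype(float)
noise = rng.normal(0, 0.4, size=(200, 12))
X = np.repeat(base, 6, axis=1) + noise

corr = np.corrcoef(X, rowvar=False)
dist = 1.0 - np.abs(corr)              # correlation similarity -> distance
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="single")

# Cut the single-linkage dendrogram where |correlation| drops below 0.5,
# i.e. at distance 0.5; each label identifies one SNP cluster.
labels = fcluster(Z, t=0.5, criterion="distance")
print(labels)
```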


Safe data re-use

We used an independent split of the data to select representative SNPs: the split X(0) is used for selecting reps and safely re-used for inference, while X(1) is used only for inference

[Diagram: signs of the Wj ordered by |Wj|, e.g. ++____ +++__++__; under the null, the signs are independent coin flips]

Re-use data to improve ordering but not to compute signs (1-bit p-values)


Simulations with genetic covariates

Real genetic covariates X
Logistic conditional model Y|X with 60 variables

Figure: Power and FDP over 100 repetitions, plotted against signal amplitude (8–20)
Zj = |β̂j(λCV)|, Wj = Zj − Z̃j, target FDR: α = 0.1


Diagnostic plot: simulation with data from Chromosome 1

Feature importance Zj = |β̂j(λCV)|

[Scatter plot of feature importance Zj against variable index; x-axis: Variables (0–10,000), y-axis: Feature Importance (0.00–0.15).]

Page 169: Emmanuel Candès, Stanford University — candes/talks/slides/Wald2.pdf

Diagnostic plot: simulation with data from Chromosome 1

Feature importance Z_j = |β̂_j(λ_CV)|

[Figure: scatter of feature importances (y-axis "Feature Importance", 0.00–0.15) against variable index ("Variables", 0–10,000)]
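The importance statistic above, Z_j = |β̂_j(λ_CV)|, is the absolute lasso coefficient at a cross-validated penalty. A minimal sketch with a hand-rolled proximal-gradient (ISTA) solver; the solver, the fixed penalty `lam`, and the toy data are illustrative assumptions, not the lecture's actual pipeline:

```python
import numpy as np

def lasso_importances(X, y, lam, n_iter=500):
    """Feature importances Z_j = |beta_j(lam)| from a lasso fit.
    Solved by proximal gradient descent (ISTA) on
    (1/2n)||y - X b||^2 + lam ||b||_1 -- an illustrative stand-in
    for a cross-validated lasso."""
    n, p = X.shape
    beta = np.zeros(p)
    L = np.linalg.norm(X, 2) ** 2 / n          # Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        b = beta - grad / L
        beta = np.sign(b) * np.maximum(np.abs(b) - lam / L, 0.0)  # soft-thresholding
    return np.abs(beta)

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] + rng.standard_normal(n)     # only variable 0 carries signal
Z = lasso_importances(X, y, lam=0.1)
print(Z[0] > Z[1:].max())
```

In the knockoff filter, the same statistic is computed for each original variable and its knockoff, and the two are contrasted.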


Results of data analysis

Selection frequency | SNP (cluster size) | Chr. | Position range (Mb) | Franke et al. '10 | WTCCC '07

100% rs11209026 (2) 1 67.31–67.42 yes yes

99% rs6431654 (20) 2 233.94–234.11 yes yes

98% rs6688532 (33) 1 169.4–169.65 yes

97% rs17234657 (1) 5 40.44–40.44 yes yes

95% rs11805303 (16) 1 67.31–67.46 yes yes

91% rs7095491 (18) 10 101.26–101.32 yes yes

91% rs3135503 (16) 16 49.28–49.36 yes yes

81% rs7768538 (1145) 6 25.19–32.91 yes yes

80% rs6601764 (1) 10 3.85–3.85 yes

75% rs7655059 (5) 4 89.5–89.53

73% rs6500315 (4) 16 49.03–49.07 yes yes

72% rs2738758 (5) 20 61.71–61.82 yes

70% rs7726744 (46) 5 40.35–40.71 yes yes

68% rs11627513 (7) 14 96.61–96.63

66% rs4246045 (46) 5 150.07–150.41 yes yes

62% rs9783122 (234) 10 106.43–107.61

61% rs6825958 (3) 4 55.73–55.77

Table: SNP clusters found to be important for CD over 100 repetitions of knockoffs.


Selection frequency | SNP (cluster size) | Chr. | Position range (Mb) | Confirmed in Willer et al. '13 | Found in Sabatti et al. '09

100% rs1532085 (4) 15 58.68–58.7 yes yes

100% rs7499892 (1) 16 57.01–57.01 yes yes

100% rs1800961 (1) 20 43.04–43.04 yes

99% rs1532624 (2) 16 56.99–57.01 yes yes

95% rs255049 (142) 16 66.41–69.41 yes yes

Table: SNP clusters found to be important for HDL over 100 repetitions of knockoffs.

Selection frequency | SNP (cluster size) | Chr. | Position range (Mb) | Confirmed in Willer et al. '13 | Found in Sabatti et al. '09

99% rs4844614 (34) 1 207.3–207.88 yes

97% rs646776 (5) 1 109.8–109.82 yes yes

97% rs2228671 (2) 19 11.2–11.21 yes yes

94% rs157580 (4) 19 45.4–45.41 yes yes

92% rs557435 (21) 1 55.52–55.72 yes

80% rs10198175 (1) 2 21.13–21.13 yes yes

76% rs10953541 (58) 7 106.48–107.3

62% rs6575501 (1) 14 95.64–95.64

Table: SNP clusters found to be important for LDL over 100 repetitions of knockoffs.


Summary and open questions

Knockoffs offers finite-sample inferential properties in subtle and important problems

Knockoffs is a powerful, flexible, and robust solution whenever there is considerable outside information on the distribution of X, as in GWAS

Knockoffs addresses the replicability issue

Where is the burden of knowledge?

Robustness theory (Barber, Samworth and C.)

Derandomization (multiple knockoffs)

Knockoff constructions and statistics for other applications



What’s happening in selective inference III?

Lecture 3 (Thu. 8:30 a.m.)

Other views on selective inference: geography & vignettes

False coverage rate (Benjamini & Yekutieli)

POSI (Berk, Brown, Buja, Zhang, Zhao)

Inference after the Lasso (Taylor et al.)

Selective hypothesis testing (Fithian et al.)


Thank You!


Derandomization

Combine information from multiple knockoffs: who's consistently showing up?


Figure: Cartoon representation of the W's from different sample realizations of knockoffs
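The derandomization idea (combining several knockoff runs by selection frequency, as in the tables' "over 100 repetitions") can be sketched as follows; `run_filter` is a hypothetical stand-in for one full knockoff analysis, and the 0.9/0.1 toy probabilities are illustrative assumptions:

```python
import numpy as np

def derandomized_selection(run_filter, B, pi, p, seed=0):
    """Run a knockoff filter B times (fresh knockoffs each run), record
    each variable's selection frequency, and keep those selected in at
    least a fraction pi of the runs."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(p)
    for _ in range(B):
        counts += run_filter(rng)              # boolean selection vector
    freq = counts / B
    return freq, np.flatnonzero(freq >= pi)

# Toy stand-in for one knockoff analysis: variables 0-2 are selected
# 90% of the time, the remaining noise variables 10% of the time.
def toy_filter(rng, p=10):
    probs = np.where(np.arange(p) < 3, 0.9, 0.1)
    return rng.random(p) < probs

freq, selected = derandomized_selection(toy_filter, B=200, pi=0.5, p=10)
print(selected)
```

Variables that show up consistently across runs survive the aggregation, which is the point of the cartoon above.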


Sampling X̃1

p(X1 | X_{-1}) = p(X1 | X2) = p(X1, X2) / p(X2) = q1(X1) Q2(X2 | X1) / Z1(X2),

where Z1(z) = Σ_u q1(u) Q2(z | u)

Sampling X̃2

p(X2 | X_{-2}, X̃1) = p(X2 | X1, X3, X̃1) ∝ Q2(X2 | X1) Q3(X3 | X2) Q2(X2 | X̃1) / Z1(X2),

with normalization constant Z2(X3), where

Z2(z) = Σ_u Q2(u | X1) Q3(z | u) Q2(u | X̃1) / Z1(u)
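For a discrete chain, the normalizer Z1(z) = Σ_u q1(u) Q2(z | u) is exactly the marginal distribution of X2, which makes the first step easy to sanity-check numerically; the 3-state chain below is an illustrative assumption:

```python
import numpy as np

# Assumed toy 3-state chain: initial distribution q1 and transition
# matrix with Q2[u, z] = Q2(z | u).
q1 = np.array([0.5, 0.3, 0.2])
Q2 = np.array([[0.7, 0.2, 0.1],
               [0.1, 0.8, 0.1],
               [0.2, 0.3, 0.5]])

# Z1(z) = sum_u q1(u) Q2(z | u) is both the normalizer of p(X1 | X2 = z)
# and the marginal distribution of X2.
Z1 = q1 @ Q2
print(Z1.sum())                                # a probability vector: sums to 1

# p(X1 = u | X2 = z) = q1(u) Q2(z | u) / Z1(z); each column sums to 1.
cond = q1[:, None] * Q2 / Z1[None, :]
print(np.allclose(cond.sum(axis=0), 1.0))
```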







Sampling X̃3

p(X3 | X_{-3}, X̃1, X̃2) = p(X3 | X2, X4, X̃2) ∝ Q3(X3 | X2) Q4(X4 | X3) Q3(X3 | X̃2) / Z2(X3),

with normalization constant Z3(X4), where

Z3(z) = Σ_u Q3(u | X2) Q4(z | u) Q3(u | X̃2) / Z2(u)

And so on, sampling X̃j for j = 4, ..., p

Computationally efficient: O(p)
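The recursions above can be sketched as an O(p) sequential sampler for a discrete Markov chain. The function names, the toy homogeneous chain, and the array layout (`Q[j][u, z]` = transition probability from state u to state z) are assumptions for illustration; the recursion itself follows the slides:

```python
import numpy as np

def markov_knockoffs(x, q1, Q, rng):
    """Sample knockoffs X~_1, ..., X~_p for a discrete Markov chain,
    one variable at a time, carrying forward the normalizers Z_j.
      x  : observed sequence (length p, values in {0, ..., K-1})
      q1 : initial distribution, shape (K,)
      Q  : list of p-1 transition matrices, Q[j][u, z] = prob(z | u)
    Each step costs O(K^2), so the whole pass is O(p K^2)."""
    p, K = len(x), len(q1)
    xt = np.empty(p, dtype=int)

    # Z_prev(z) plays the role of Z_{j-1}(z) on the slides.
    Z_prev = q1 @ Q[0]                              # Z1(z) = sum_u q1(u) Q2(z|u)
    w = q1 * Q[0][:, x[1]] / Z_prev[x[1]]           # p(X~_1 = . | data)
    xt[0] = rng.choice(K, p=w)

    for j in range(1, p):
        # weights over candidate values z of X~_j:
        #   Q_j(z | x_{j-1}) Q_j(z | x~_{j-1}) / Z_{j-1}(z)
        w = Q[j-1][x[j-1], :] * Q[j-1][xt[j-1], :] / Z_prev
        if j < p - 1:
            Z_j = w @ Q[j]                          # Z_j(z) = sum_u w(u) Q_{j+1}(z|u)
            w = w * Q[j][:, x[j+1]] / Z_j[x[j+1]]   # include the future term
            Z_prev = Z_j
        else:
            w = w / w.sum()                         # last variable: no future term
        xt[j] = rng.choice(K, p=w)
    return xt

# Toy homogeneous 3-state chain (illustrative assumption).
rng = np.random.default_rng(1)
K, p_vars = 3, 6
T = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])
q1 = np.full(K, 1.0 / K)
Q = [T] * (p_vars - 1)

# Sample a chain, then its knockoff copy.
x = np.empty(p_vars, dtype=int)
x[0] = rng.choice(K, p=q1)
for j in range(1, p_vars):
    x[j] = rng.choice(K, p=T[x[j-1]])
xt = markov_knockoffs(x, q1, Q, rng)
print(x, xt)
```

Each conditional involves only the two neighbors of position j plus the previously sampled knockoff, which is what makes the one-pass O(p) cost possible.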


