Emmanuel Candes, Stanford University (source: candes/talks/slides/wald2.pdf)
TRANSCRIPT
What’s Happening in Selective Inference II?
Emmanuel Candes, Stanford University
The 2017 Wald Lectures, Joint Statistical Meetings, Baltimore, August 2017
Lecture 2: Special dedication
Chiara Sabatti
Agenda: The knockoff machine
(1) The knockoff framework (mostly from yesterday)
(2) Knockoffs for fixed covariates
(3) Knockoffs for random covariates
(4) Knockoffs for genome-wide association studies (GWAS)
(5) Genetic data analysis
The Knockoffs Framework (Summary from Lecture 1)
Controlled variable selection
[Slide figure: Figure 4 of the Wellcome Trust Case Control Consortium study (Nature, Vol. 447, 7 June 2007): genome-wide scans plotting −log10(P) against chromosomal position for seven diseases (type 2 diabetes, coronary artery disease, Crohn's disease, hypertension, rheumatoid arthritis, type 1 diabetes, bipolar disorder), with P values below 1×10⁻⁵ highlighted in green and panels truncated at −log10(P) = 15.]
Response Y (e.g. disease status)
Features X1, . . . , Xp (e.g. SNPs)
Question: distribution of Y |X depends on X through which variables?
Goal: select a set of features Xj that are likely to be relevant, without too many false positives; do not run into the problem of irreproducibility

FDR = E[ # false positives / # features selected ],
where the ratio inside the expectation is the false discovery proportion (FDP)
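As a concrete reading of this definition, here is a minimal sketch (the function name `fdp` and the toy index sets are our own, not from the talk): the FDP of a selection is the fraction of selected features that are in fact null, and the FDR is its expectation over repeated experiments.

```python
def fdp(selected, truly_null):
    """False discovery proportion: # false positives / max(# selections, 1)."""
    selected = set(selected)
    false_positives = len(selected & set(truly_null))
    return false_positives / max(len(selected), 1)

# Toy example: 5 selections, 2 of which are null features -> FDP = 2/5
print(fdp({1, 2, 3, 4, 5}, {4, 5, 10, 11}))  # 0.4
```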
Which variables should we report?
Feature importance Zj from random forests
[Slide figure: scatter plot of feature importances Zj for variables 1 to 500 (vertical axis: feature importance, roughly 1 to 7).]
True positives?
Knockoffs as negative controls
[Slide figure: feature importances for the original variables and their knockoffs, plotted against variable index 1 to 1000 (vertical axis: feature importance, roughly 1 to 4); originals and knockoffs are shown in different colours.]
Exchangeability of feature importance statistics

Knockoff-agnostic feature importance Z:
(Z1, . . . , Zp, Z̃1, . . . , Z̃p) = z([X, X̃], y)
(the first p entries score the originals, the last p the knockoffs)
This lecture
Can construct knockoff features X̃ such that
j null =⇒ (Zj, Z̃j) d= (Z̃j, Zj)
more generally, T a subset of nulls =⇒ (Z, Z̃)swap(T) d= (Z, Z̃)
Knockoffs-adjusted scores

Ordering of variables + 1-bit p-values
[Diagram: +/− signs of the scores ordered by |W|; under the null, signs are random.]

Adjusted scores Wj with the flip-sign property:
Combine Zj and Z̃j into a single (knockoff) score Wj = wj(Zj, Z̃j), where wj(Z̃j, Zj) = −wj(Zj, Z̃j)
e.g. Wj = Zj − Z̃j, or Wj = (Zj ∨ Z̃j) · (+1 if Zj > Z̃j; −1 if Zj ≤ Z̃j)
=⇒ Conditional on |W|, the signs of the null Wj's are i.i.d. coin flips
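The two example scores can be sketched as follows; a minimal illustration assuming Z and Zt hold the original and knockoff importances (the function names are ours, not from the talk).

```python
import numpy as np

def difference_score(Z, Zt):
    """W_j = Z_j - Zt_j: swapping Z_j and Zt_j flips the sign of W_j."""
    return np.asarray(Z) - np.asarray(Zt)

def signed_max_score(Z, Zt):
    """W_j = max(Z_j, Zt_j), signed +1 if Z_j > Zt_j and -1 otherwise."""
    Z, Zt = np.asarray(Z), np.asarray(Zt)
    return np.maximum(Z, Zt) * np.where(Z > Zt, 1.0, -1.0)

Z = np.array([3.0, 0.5, 2.0])    # original importances
Zt = np.array([1.0, 0.7, 2.0])   # knockoff importances
print(difference_score(Z, Zt))   # approximately [ 2.0, -0.2, 0.0 ]
print(signed_max_score(Z, Zt))   # [ 3.0, -0.7, -2.0 ]
```

Either choice satisfies the flip-sign property, which is all the theory below requires.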
Selection by sequential testing

S+(t) = {j : Wj ≥ t},  S−(t) = {j : Wj ≤ −t}
Select S+(t)  =⇒  estimated FDP(t) = (1 + |S−(t)|) / (1 ∨ |S+(t)|)
[Diagram: signs of the Wj ordered by |W|; a threshold t separates the selected set S+(t).]

Theorem (Barber and C. ('15))
Select S+(τ), τ = min { t : FDP(t) ≤ q }
Knockoff:   E[ # false positives / (# selections + q⁻¹) ] ≤ q
Knockoff+:  E[ # false positives / # selections ] ≤ q
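The selection rule in the theorem can be sketched directly; a toy implementation (not the authors' code) that scans the observed |Wj| values as candidate thresholds:

```python
import numpy as np

def knockoff_plus_select(W, q):
    """Return {j : W_j >= tau}, tau = min{t : (1 + #{W_j <= -t}) / max(#{W_j >= t}, 1) <= q}."""
    W = np.asarray(W, dtype=float)
    for t in np.sort(np.abs(W[W != 0])):   # candidate thresholds
        fdp_hat = (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= q:
            return np.where(W >= t)[0]
    return np.array([], dtype=int)         # no feasible threshold: select nothing

W = np.array([4.0, 3.5, 3.0, 2.5, -2.0, 1.5, -1.0, 0.5])
print(knockoff_plus_select(W, q=0.4))  # [0 1 2 3 5], i.e. tau = 1.5
```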
Why Can We Invert the Estimate of FDP? (Proof Sketch of FDR Control)

Why does all this work?

τ = min { t : (1 + |S−(t)|) / (|S+(t)| ∨ 1) ≤ q }
S+(t) = {j : Wj ≥ t},  S−(t) = {j : Wj ≤ −t}

Write V+(τ) = #{j null : j ∈ S+(τ)} and V−(τ) = #{j null : j ∈ S−(τ)}. Then

FDP(τ) = V+(τ) / (#{j : j ∈ S+(τ)} ∨ 1)
       = [ V+(τ) / (1 + V−(τ)) ] · [ (1 + V−(τ)) / (#{j : j ∈ S+(τ)} ∨ 1) ]
       ≤ q · V+(τ) / (1 + V−(τ)),

since 1 + V−(τ) ≤ 1 + |S−(τ)| and, by the definition of τ, (1 + |S−(τ)|) / (|S+(τ)| ∨ 1) ≤ q.

To show:  E[ V+(τ) / (1 + V−(τ)) ] ≤ 1
Martingales

V+(t) / (1 + V−(t)) is a (super)martingale with respect to Ft = σ({V±(u)}, u ≤ t).
[Diagram: null signs along the |W| axis; at threshold t, V+(t) and V−(t) count null + and − signs, and at a larger threshold s the number of remaining nulls is V+(s) + V−(s) = m.]

Conditioned on V+(s) + V−(s), V+(s) is hypergeometric, and

E[ V+(s) / (1 + V−(s)) | V±(t), V+(s) + V−(s) ] ≤ V+(t) / (1 + V−(t))
Optional stopping theorem

Stopping the supermartingale at τ gives

FDR ≤ q · E[ V+(τ) / (1 + V−(τ)) ] ≤ q · E[ V+(0) / (1 + V−(0)) ] ≤ q,

where V+(0) ∼ Bin(#nulls, 1/2).
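The final bound can be checked numerically. A quick Monte-Carlo sketch (sizes made up for illustration) of E[V+(0) / (1 + V−(0))] when the null signs are i.i.d. fair coin flips:

```python
import numpy as np

rng = np.random.default_rng(0)
n_nulls, reps = 50, 20000
signs = rng.choice([-1, 1], size=(reps, n_nulls))  # null signs: fair coin flips
v_plus = (signs > 0).sum(axis=1)                   # V+(0) ~ Bin(n_nulls, 1/2)
v_minus = n_nulls - v_plus                         # V-(0)
ratio = np.mean(v_plus / (1 + v_minus))
print(ratio)  # close to, and in expectation at most, 1
```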
Knockoffs for Fixed Features
Joint with Barber
Linear model

y = Xβ + z,  Xβ = Σj βjXj
(y: n × 1,  X: n × p,  β: p × 1,  z: n × 1)
y ∼ N(Xβ, σ²I)

Fixed design X
Noise level σ unknown
Multiple testing: Hj : βj = 0  (is the jth variable in the model?)
Identifiability =⇒ p ≤ n
Inference (FDR control) will hold conditionally on X
Knockoff features (fixed X)

Originals X, knockoffs X̃:
X̃′jX̃k = X′jXk for all j, k
X′jX̃k = X′jXk for all j ≠ k

No need for new data or a new experiment
No knowledge of the response y
Knockoff construction (n ≥ 2p)

Problem: given X ∈ Rn×p, find X̃ ∈ Rn×p such that

[X X̃]′[X X̃] = [ Σ              Σ − diag{s} ]
               [ Σ − diag{s}    Σ           ]  := G ⪰ 0

G ⪰ 0  ⟺  diag{s} ⪰ 0 and 2Σ − diag{s} ⪰ 0

Solution:
X̃ = X(I − Σ⁻¹ diag{s}) + UC
U ∈ Rn×p with column space orthogonal to that of X
C′C a Cholesky factorization of 2 diag{s} − diag{s} Σ⁻¹ diag{s} ⪰ 0
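The construction above can be sketched numerically. A minimal NumPy illustration with equi-correlated sj on a random toy X (variable names are ours), verifying the two defining Gram identities at the end:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 3
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)          # standardize columns: Sigma has unit diagonal

Sigma = X.T @ X
# Equi-correlated choice s_j = 2*lambda_min(Sigma) ∧ 1, shrunk slightly so the
# Cholesky factor below is strictly positive definite.
s = 0.999 * min(2 * np.linalg.eigvalsh(Sigma)[0], 1.0)
S = s * np.eye(p)

# U: p orthonormal columns orthogonal to the column space of X
Q, _ = np.linalg.qr(np.hstack([X, rng.standard_normal((n, p))]))
U = Q[:, p:2 * p]

# C'C = 2*diag{s} - diag{s} Sigma^{-1} diag{s}
C = np.linalg.cholesky(2 * S - S @ np.linalg.inv(Sigma) @ S).T

Xtil = X @ (np.eye(p) - np.linalg.inv(Sigma) @ S) + U @ C

print(np.allclose(Xtil.T @ Xtil, Sigma))   # knockoffs preserve the Gram matrix
print(np.allclose(X.T @ Xtil, Sigma - S))  # cross-correlations match, minus diag{s}
```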
Knockoff construction (n ≥ 2p)

X′jX̃j = 1 − sj  (standardized columns)

Equi-correlated knockoffs: sj = 2λmin(Σ) ∧ 1
Under equivariance, minimizes the value of |⟨Xj, X̃j⟩|

SDP knockoffs: minimize Σj |1 − sj| subject to sj ≥ 0, diag{s} ⪯ 2Σ
A highly structured semidefinite program (SDP)

Other possibilities ...
Why?

For a null feature Xj:

X̃′jy = X̃′jXβ + X̃′jz  d=  X′jXβ + X′jz = X′jy

(when βj = 0, X̃′jXβ = X′jXβ because X̃′jXk = X′jXk for k ≠ j, and X̃′jz has the same distribution as X′jz)

Construct knockoffs
Jan 21 2015 Controlling false discovery rate via knockoffs 12/36
Why?

For any subset of nulls T:

[X X̃]′swap(T) y  d=  [X X̃]′ y
Exchangeability of feature importance statistics

Sufficiency:

(Z, Z̃) = z([X X̃]′ [X X̃], [X X̃]′ y)

Knockoff-agnostic: swapping originals and knockoffs ⟹ swaps the Z's

z([X X̃]_swap(T), y) = (Z, Z̃)_swap(T)

Theorem (Barber and C., '15)
For any subset T of nulls,
(Z, Z̃)_swap(T)  d=  (Z, Z̃)

⟹ FDR control (conditional on X)
Telling the effect direction

"[...] in classical statistics, the significance of comparisons (e.g., θ1 − θ2) is calibrated using the Type I error rate, relying on the assumption that the true difference is zero, which makes no sense in many applications. [...] a more relevant framework in which a true comparison can be positive or negative, and, based on the data, you can state 'θ1 > θ2 with confidence', 'θ2 > θ1 with confidence', or 'no claim with confidence'."
A. Gelman & F. Tuerlinckx

Directional FDR

Are any effects exactly zero?

FDR_dir = E[ (# selections with wrong effect direction) / (# selections) ]
(the ratio inside the expectation is the directional false discovery proportion)

Directional FDR (Benjamini & Yekutieli, '05)
Sign errors (Type-S) (Gelman & Tuerlinckx, '00)
Important for misspecified models — exact sparsity unlikely
Directional FDR control

(X_j − X̃_j)′y ∼ N(s_j · β_j, 2σ² · s_j) independently,  s_j ≥ 0

Sign estimate: sgn((X_j − X̃_j)′y)

Theorem (Barber and C., '16)
Exact same knockoff selection + sign estimate:
FDR ≤ FDR_dir ≤ q

[Diagram: signs of the W statistics, null and non-null, ordered by magnitude |W|.]

Null coin flips are unbiased. Great subtlety: the coin flips are now biased.
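The sign estimate attached to each selection is a one-liner given the knockoffs; a minimal sketch (function name and inputs illustrative, assuming fixed-X knockoffs for X):

```python
import numpy as np

def sign_estimates(X, X_tilde, y):
    """Directional calls sgn((X_j - X~_j)' y).

    Under the model above, the contrasts (X_j - X~_j)' y are independent
    N(s_j * beta_j, 2 sigma^2 s_j), so the sign points at the effect direction."""
    return np.sign((X - X_tilde).T @ y)
```

Each variable passing the knockoff selection step is then reported together with its estimated effect direction.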
Empirical results

Features N(0, I_n), n = 3000, p = 1000; k = 30 variables with regression coefficients of magnitude 3.5; nominal level q = 20%.

Method                          FDR (%)   Power (%)   Theor. FDR control?
Knockoff+ (equivariant)          14.40     60.99      Yes
Knockoff (equivariant)           17.82     66.73      No
Knockoff+ (SDP)                  15.05     61.54      Yes
Knockoff (SDP)                   18.72     67.50      No
BHq                              18.70     48.88      No
BHq + log-factor correction       2.20     19.09      Yes
BHq with whitened noise          18.79      2.33      Yes
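The knockoff and knockoff+ rows differ only in the "+1" offset of the data-dependent threshold of Barber and Candès ('15). A short sketch of that selection rule from the W statistics:

```python
import numpy as np

def knockoff_threshold(W, q, plus=True):
    """Smallest t > 0 with (offset + #{W_j <= -t}) / max(1, #{W_j >= t}) <= q.

    plus=True is knockoff+ (provable FDR control at level q);
    plus=False is the knockoff filter (controls a modified FDR)."""
    offset = 1 if plus else 0
    for t in np.sort(np.abs(W[W != 0])):
        if (offset + np.sum(W <= -t)) / max(1, np.sum(W >= t)) <= q:
            return t
    return np.inf                      # nothing selected

def knockoff_select(W, q, plus=True):
    """Indices j with W_j above the threshold."""
    return np.where(W >= knockoff_threshold(W, q, plus))[0]
```

The negative W's play the role of the null "coin flips": they estimate how many false positives sit above any candidate threshold.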
Effect of signal amplitude

Same setup with k = 30 (q = 0.2).

[Figure: FDR (%) and Power (%) versus amplitude A (2.8 to 4.2) for Knockoff, Knockoff+, and BHq, with the nominal level marked.]
Effect of feature correlation

Features ∼ N(0, Θ), Θ_jk = ρ^{|j−k|}; n = 3000, p = 1000, k = 30, amplitude = 3.5.

[Figure: FDR (%) and Power (%) versus feature correlation ρ (0 to 0.8) for Knockoff, Knockoff+, and BHq, with the nominal level marked.]
Fixed Design Knockoff Data Analysis

HIV drug resistance

Drug type   # drugs   Sample size   # protease or RT positions genotyped   # mutations appearing ≥ 3 times in sample
PI          6         848            99                                    209
NRTI        6         639           240                                    294
NNRTI       3         747           240                                    319

response y: log-fold-increase of lab-tested drug resistance
covariate X_j: presence or absence of mutation #j

Data from R. Shafer (Stanford) available at:
http://hivdb.stanford.edu/pages/published_analysis/genophenoPNAS2006/
HIV data

TSM list: mutations associated with the PI class of drugs in general, not specialized to the individual drugs in the class.

Results for PI-type drugs

[Figure: number of HIV-1 protease positions selected by Knockoff and BHq, split into "appear in TSM list" vs "not in TSM list", for resistance to APV (n=768, p=201), ATV (n=329, p=147), IDV (n=826, p=208), LPV (n=516, p=184), NFV (n=843, p=209), and SQV (n=825, p=208).]
HIV data

Results for NRTI-type drugs

[Figure: number of HIV-1 RT positions selected by Knockoff and BHq ("appear in TSM list" vs "not in TSM list"), for resistance to 3TC (n=633, p=292), ABC (n=628, p=294), AZT (n=630, p=292), D4T (n=630, p=293), DDI (n=632, p=292), and TDF (n=353, p=218).]

Results for NNRTI-type drugs

[Figure: same, for resistance to DLV (n=732, p=311), EFV (n=734, p=318), and NVP (n=746, p=319).]
High-dimensional setting

n ≈ 5,000 subjects
p ≈ 330,000 SNPs/variables to test
[Slide shows pages from a genome-wide association study of lipid levels (Nature Genetics, Feb. 2008): Manhattan plots of −log10 P values for HDL cholesterol, LDL cholesterol, and triglycerides across chromosomes 1–22, with followed-up loci labeled (e.g. GALNT2, APOB, PCSK9, GCKR, LPL, CETP, LDLR, the APOE cluster), and quantile-quantile plots of the test statistics.]
p > n ⟹ cannot construct knockoffs as before:

X̃′_j X̃_k = X′_j X_k ∀ j, k  and  X̃′_j X_k = X′_j X_k ∀ j ≠ k   ⟹   X̃_j = X_j ∀ j
High-dimensional knockoffs: screen and confirm

Split the original data set into two samples:
(y(1), X(1)): exploratory, screen on sample 1
(y(2), X(2)): confirmatory, inference on sample 2

Theory (Barber and C., '16)
Safe data re-use to improve power (Barber and C., '16)
Some extensions

y = X_1 β_1 + X_2 β_2 + · · · + N(0, σ² I_n),   X_g ∈ R^{n×p_g}

Group sparsity — build knockoffs at the group-wise level (Dai & Barber, 2015)
Identify key groups with PCA — build knockoffs only for the top PC in each group (Chen, Hou & Hou, 2017)
Build knockoffs only for prototypes selected from each group (Reid & Tibshirani, 2015)
Multilayer knockoffs to control FDR at the individual and group levels simultaneously (Katsevich & Sabatti, 2017)
Knockoffs for Random Features
Joint with Fan, Janson & Lv
Variable selection in arbitrary models

Random pair (X, Y) (perhaps thousands/millions of covariates)

p(Y | X) depends on X through which variables?

Working definition of null variables
Say j ∈ H_0 (j is null) iff Y ⊥⊥ X_j | X_{−j}

Local Markov property ⟹ the non-nulls are the smallest subset S (Markov blanket) s.t.

Y ⊥⊥ {X_j}_{j∈S^c} | {X_j}_{j∈S}

Logistic model: P(Y = 0 | X) = 1 / (1 + e^{X′β})

If the variables X_{1:p} are not perfectly dependent, then j ∈ H_0 ⟺ β_j = 0
Knockoff features (random X)

i.i.d. samples from p(X, Y)
Distribution of X known
Distribution of Y | X (likelihood) completely unknown

Originals X = (X_1, . . . , X_p); knockoffs X̃ = (X̃_1, . . . , X̃_p)

(1) Pairwise exchangeability

(X, X̃)_swap(S)  d=  (X, X̃)

e.g.

(X_1, X̃_2, X̃_3, X̃_1, X_2, X_3) = (X_1, X_2, X_3, X̃_1, X̃_2, X̃_3)_swap({2,3})  d=  (X_1, X_2, X_3, X̃_1, X̃_2, X̃_3)

(2) X̃ ⊥⊥ Y | X (ignore Y when constructing knockoffs)
Exchangeability of feature importance statistics

Theorem (C., Fan, Janson & Lv, '16)
For knockoff-agnostic scores and any subset T of nulls,
(Z, Z̃)_swap(T)  d=  (Z, Z̃)

This holds no matter the relationship between Y and X
This holds conditionally on Y

⟹ FDR control (conditional on Y) no matter the relationship between X and Y
Knockoffs for Gaussian features

X ∼ N(μ, Σ)

Swapping any subset of original and knockoff features leaves the joint distribution invariant, e.g. for T = {2, 3}:

(X_1, X̃_2, X̃_3, X̃_1, X_2, X_3)  d=  (X_1, X_2, X_3, X̃_1, X̃_2, X̃_3)

Note X̃ d= X

Possible solution

(X, X̃) ∼ N(μ*, Σ*),   μ* = [ μ ],   Σ* = [ Σ            Σ − diag{s} ]
                            [ μ ]         [ Σ − diag{s}  Σ           ]

s such that Σ* ⪰ 0
Given X, sample X̃ from X̃ | X (Gaussian regression formula)
Different from knockoff features for fixed X!
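For the Gaussian case the conditional "regression formula" is explicit, and a sampler is a few lines. This is an illustrative sketch assuming μ, Σ are known and s satisfies the semidefinite constraint; it is not the authors' reference implementation.

```python
import numpy as np

def gaussian_knockoffs(X, mu, Sigma, s, rng):
    """Sample X~ | X when (X, X~) is jointly Gaussian with covariance
    [[Sigma, Sigma - D], [Sigma - D, Sigma]], D = diag(s).

    Standard Gaussian conditioning gives
      X~ | X  ~  N(mu + (X - mu)(I - Sigma^{-1} D),  2D - D Sigma^{-1} D),
    with rows of X treated as independent observations."""
    p = len(mu)
    D = np.diag(s)
    A = np.eye(p) - np.linalg.solve(Sigma, D)            # I - Sigma^{-1} D
    cond_mean = mu + (X - mu) @ A
    cond_cov = 2 * D - D @ np.linalg.solve(Sigma, D)
    w, V = np.linalg.eigh(cond_cov)                      # PSD square root
    L = V @ np.diag(np.sqrt(np.clip(w, 0.0, None)))
    return cond_mean + rng.standard_normal(X.shape) @ L.T
```

On simulated data, the empirical covariance of the stacked matrix [X X̃] matches the target joint covariance Σ*.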
Robustness

[Figure: Power and FDR versus relative Frobenius-norm error of the covariance used to build the knockoffs, for: exact covariance, graphical lasso, and the empirical covariance estimated on 50%, 62.5%, 75%, 87.5%, and 100% subsamples. Covariates are AR(1) with autocorrelation coefficient 0.3; n = 800, p = 1500; target FDR is 10%; Y | X follows a logistic model with 50 nonzero entries.]
Robustness theory

Ongoing with R. Barber and R. Samworth

(Partial) subject of Rina F. Barber's 2017 Tweedie Award Lecture
Knockoffs inference with random features
Pros:
No parameters
No p-values
Holds for finite samples
No matter the dependence between Y and X
No matter the dimensionality
Cons: Need to know distribution of covariates
Relationship with classical setup

Classical:
Observations of X are fixed; inference is conditional on observed values
Strong model linking Y and X
Useful inference even if model inexact

MF Knockoffs:
Observations of X are random (1)
Model-free (2)
Useful inference even if model inexact (3)

(1) Often appropriate in 'big' data apps: e.g. SNPs of subjects randomly sampled
(2) Shifts the 'burden' of knowledge
(3) More later
Shift in the burden of knowledge
When are our assumptions useful?
When we have large amounts of unsupervised data (e.g. economic studies with the same covariate info but different responses)
When we have more prior information about the covariates than about theirrelationship with a response (e.g. GWAS)
When we control the distribution of X (experimental crosses in genetics,gene knockout experiments,...)
Obstacles to obtaining p-values

Y | X ∼ Bernoulli(logit(X′β))

[Figure: distribution of null logistic-regression p-values with n = 500 and p = 200; left panel: global null, AR(1) design; right panel: 20 nonzero coefficients, AR(1) design.]
Obstacles to obtaining p-values

P{p-val ≤ t}   Sett. (1)       Sett. (2)       Sett. (3)       Sett. (4)
t = 5%         16.89% (0.37)   19.17% (0.39)   16.88% (0.37)   16.78% (0.37)
t = 1%          6.78% (0.25)    8.49% (0.28)    7.02% (0.26)    7.03% (0.26)
t = 0.1%        1.53% (0.12)    2.27% (0.15)    1.87% (0.14)    2.04% (0.14)

Table: Inflated p-value probabilities with estimated Monte Carlo SEs
Shameless plug: distribution of high-dimensional LRTs

Wilks' phenomenon (1938):  2 log L →d χ²_df

Sur, Chen, Candès (2017):  2 log L →d κ(p/n) χ²_df

[Figure: histograms of p-values under the classical χ²_df calibration (left) and the rescaled κ(p/n) χ²_df calibration (right).]
'Low' dim. linear model with dependent covariates

Z_j = |β̂_j(λ_CV)|,   W_j = Z_j − Z̃_j

[Figure: Power and FDR versus autocorrelation coefficient (0 to 0.8) for BHq Marginal, BHq Max Lik., MF Knockoffs, and Orig. Knockoffs. Low-dimensional setting: n = 3000, p = 1000.]
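The lasso-coefficient-difference (LCD) statistic W_j = |β̂_j(λ_CV)| − |β̂̃_j(λ_CV)| can be computed with any lasso solver; a sketch using scikit-learn (an implementation choice of this sketch, not the authors' code):

```python
import numpy as np
from sklearn.linear_model import LassoCV

def lcd_statistics(X, X_tilde, y):
    """Lasso coefficient-difference statistics W_j = Z_j - Z~_j with
    Z_j = |beta_j(lambda_CV)|, fit once on the augmented design [X, X~]."""
    p = X.shape[1]
    fit = LassoCV(cv=5).fit(np.hstack([X, X_tilde]), y)
    Z = np.abs(fit.coef_)
    return Z[:p] - Z[p:]
```

A large positive W_j is evidence that X_j matters; swapping X_j with X̃_j flips the sign of W_j, which is the antisymmetry the knockoff filter needs.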
'Low' dim. logistic model with indep. covariates

Z_j = |β̂_j(λ_CV)|,   W_j = Z_j − Z̃_j

[Figure: Power and FDR versus coefficient amplitude (6 to 10) for BHq Marginal, BHq Max Lik., and MF Knockoffs. Low-dimensional setting: n = 3000, p = 1000.]
'High' dim. logistic model with dependent covariates

Z_j = |β̂_j(λ_CV)|,   W_j = Z_j − Z̃_j

[Figure: Power and FDR versus autocorrelation coefficient (0 to 0.8) for BHq Marginal and MF Knockoffs. High-dimensional setting: n = 3000, p = 6000.]
Bayesian knockoff statistics

LCD (lasso coefficient difference)
BVS (Bayesian variable selection):  Z_j = P(β_j ≠ 0 | y, X),   W_j = Z_j − Z̃_j

[Figure: Power and FDR versus amplitude (5 to 15) for BVS knockoffs and LCD knockoffs; n = 300, p = 1000, Bayesian linear model with 60 expected variables.]

Inference is correct even if the prior is wrong or the MCMC has not converged.
Partial summary

No valid p-values, even for logistic regression
Shifts the burden of knowledge to X (the covariates); makes sense in many contexts
Robustness: simulations show that the properties of the inference hold even when the model for X is only approximately right
We always have access to diagnostic checks (later)
When the assumptions are appropriate, we gain a lot of power and can use sophisticated selection techniques
How to Construct Knockoffs for Hidden Markov Models
Joint with Sabatti & Sesia
A general construction (C., Fan, Janson and Lv, '16)

Goal: build X̃ with (X, X̃)_swap(S) d= (X, X̃) for any subset S.

Algorithm: Sequential Conditional Independent Pairs
for j = 1, . . . , p:
    sample X̃_j from the law of X_j | X_{−j}, X̃_{1:j−1}

e.g. p = 3:
Sample X̃_1 from X_1 | X_{−1}; the joint law of (X, X̃_1) is known
Sample X̃_2 from X_2 | X_{−2}, X̃_1; the joint law of (X, X̃_{1:2}) is known
Sample X̃_3 from X_3 | X_{−3}, X̃_{1:2}; the joint law of (X, X̃) is known, and it is pairwise exchangeable!

Usually not practical; easy in some cases (e.g. Markov chains)
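For a small discrete chain, the algorithm can be checked by brute-force enumeration. The sketch below (illustrative, with a made-up toy chain; not the efficient implementation the Markov-chain slide refers to) samples X̃_j from the exact conditional law of X_j given X_{−j} and X̃_{1:j−1}, computed recursively from the sequentially constructed joint.

```python
import numpy as np

# Toy binary Markov chain with p = 3: p(x) = q1(x1) Q(x2|x1) Q(x3|x2).
q1 = np.array([0.6, 0.4])
Q = np.array([[0.7, 0.3],
              [0.2, 0.8]])            # Q[a, b] = P(next = b | current = a)

def pmf(x):
    """Joint pmf of the chain at x = (x1, x2, x3)."""
    return q1[x[0]] * Q[x[0], x[1]] * Q[x[1], x[2]]

def joint(x, xt):
    """Joint pmf of (X, X~_{1:j}) under the sequential construction."""
    if not xt:
        return pmf(x)
    j = len(xt) - 1
    return joint(x, xt[:-1]) * cond(j, x, xt[:-1])[xt[-1]]

def cond(j, x, xt_prev):
    """Law of X_j given X_{-j} = x_{-j} and X~_{1:j-1} = xt_prev."""
    w = np.array([joint(x[:j] + (b,) + x[j + 1:], xt_prev) for b in (0, 1)])
    return w / w.sum()

def sample_knockoff(x, rng):
    """One pass of Sequential Conditional Independent Pairs."""
    xt = ()
    for j in range(3):
        xt += (int(rng.random() < cond(j, x, xt)[1]),)
    return xt
```

Enumerating all (x, x̃) pairs verifies the pairwise-exchangeability claim: swapping, say, (x_1, x̃_1) leaves the joint pmf unchanged.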
Knockoff copies of a Markov chain

X = (X_1, X_2, . . . , X_p) is a Markov chain:

p(X_1, . . . , X_p) = q_1(X_1) ∏_{j=2}^{p} Q_j(X_j | X_{j−1})    (X ∼ MC(q_1, Q))

[Diagram: observed variables X_1 → X_2 → X_3 → X_4 with knockoff variables X̃_1, . . . , X̃_4 attached.]

The general algorithm can be implemented efficiently in the case of a Markov chain.
(via a recursive update of normalizing constants)
Hidden Markov Models (HMMs)

X = (X1, X2, . . . , Xp) is an HMM if

    H ∼ MC(q1, Q)                                    (latent Markov chain)
    Xj | H ∼ Xj | Hj, independently ∼ fj(Xj; Hj)     (emission distribution)

[Diagram: latent chain H1 → H2 → H3, each Hj emitting the observed Xj]

The H variables are latent and only the X variables are observed
Haplotypes and genotypes

Haplotype: set of alleles on a single chromosome (0/1 for common/rare allele)
Genotype: unordered pair of alleles at a single marker

Haplotype M:     0 1 0 1 1 0
Haplotype P:     1 1 0 0 1 1
Genotypes (M+P): 1 2 0 1 2 1
A phenomenological HMM for haplotype & genotype data

Figure: Six haplotypes; color indicates ‘ancestor’ at each marker (Scheet, ’06)

Classical uses of this HMM:
Haplotype estimation/phasing (Browning, ’11)
Imputation of missing SNPs (Marchini, ’10)
Software: fastPHASE (Scheet, ’06), IMPUTE (Marchini, ’07), MaCH (Li, ’10)

New application of the same HMM: generation of knockoff copies of genotypes! Each genotype is the sum of two independent HMM haplotype sequences
Knockoff copies of a hidden Markov model

Theorem (Sesia, Sabatti, C. ’17)
A knockoff copy X̃ of X can be constructed as follows:
(1) Sample H from p(H | X) using the forward-backward algorithm
(2) Generate a knockoff H̃ of H using the SCIP algorithm for a Markov chain
(3) Sample X̃ from the emission distribution of X given H = H̃

[Diagram: observed variables X1, X2, X3; imputed latent variables H1, H2, H3; knockoff latent variables H̃1, H̃2, H̃3; knockoff variables X̃1, X̃2, X̃3]
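The three steps of the theorem can be run end to end on a toy HMM. In the sketch below all parameters are illustrative assumptions, and step (2) uses brute-force SCIP conditionals rather than the efficient recursion used in practice.

```python
import numpy as np

# Toy HMM (illustrative parameters): 2 hidden states, 2 symbols, length 3.
p_len, K = 3, 2
q1 = np.array([0.5, 0.5])            # initial distribution of H_1
Q = np.array([[0.9, 0.1],            # latent-chain transitions
              [0.3, 0.7]])
F = np.array([[0.8, 0.2],            # emissions F[h, x] = f(x; h)
              [0.1, 0.9]])
rng = np.random.default_rng(2)

def chain_pmf(h):
    """Joint pmf of the latent Markov chain."""
    prob = q1[h[0]]
    for j in range(1, p_len):
        prob *= Q[h[j - 1], h[j]]
    return prob

def sample_h_given_x(x):
    """Step (1): forward filtering, backward sampling from p(H | X = x)."""
    alpha = np.zeros((p_len, K))
    alpha[0] = q1 * F[:, x[0]]
    for j in range(1, p_len):
        alpha[j] = (alpha[j - 1] @ Q) * F[:, x[j]]
    h = [0] * p_len
    w = alpha[-1]
    h[-1] = int(rng.choice(K, p=w / w.sum()))
    for j in range(p_len - 2, -1, -1):
        w = alpha[j] * Q[:, h[j + 1]]
        h[j] = int(rng.choice(K, p=w / w.sum()))
    return h

def scip_cond(j, h, ht_prefix):
    """Step (2) helper: law of H_j given H_{-j} and earlier knockoffs."""
    w = np.zeros(K)
    for k in range(K):
        y = list(h)
        y[j] = k
        prob = chain_pmf(y)
        for l in range(j):
            prob *= scip_cond(l, y, ht_prefix[:l])[ht_prefix[l]]
        w[k] = prob
    return w / w.sum()

def sample_h_knockoff(h):
    """Step (2): SCIP knockoff of the latent chain."""
    ht = []
    for j in range(p_len):
        ht.append(int(rng.choice(K, p=scip_cond(j, h, ht))))
    return ht

def sample_x_given_h(h):
    """Step (3): draw the knockoff observables from the emission model."""
    return [int(rng.choice(2, p=F[h[j]])) for j in range(p_len)]

x_obs = [0, 1, 1]
h_imputed = sample_h_given_x(x_obs)        # step (1)
h_knockoff = sample_h_knockoff(h_imputed)  # step (2)
x_knockoff = sample_x_given_h(h_knockoff)  # step (3)
```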
Some Examples
Simulations with a synthetic Markov chain

Markov chain covariates with 5 hidden states; binomial response

[Figure: Power and FDP against signal amplitude (4–20), over 100 repetitions (true FX)]

n = 1000, p = 1000, target FDR: α = 0.1
Zj = |β̂j(λCV)|, Wj = Zj − Z̃j
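The FDP control reported in these figures comes from applying the knockoff+ selection rule from Lecture 1 to the statistics Wj = Zj − Z̃j. A minimal sketch of that rule (the toy W values below are made up for illustration):

```python
import numpy as np

def knockoff_plus_select(W, q):
    """Knockoff+ filter: select {j : W_j >= tau}, where tau is the smallest
    t > 0 with (1 + #{j : W_j <= -t}) / max(1, #{j : W_j >= t}) <= q."""
    W = np.asarray(W, dtype=float)
    for t in np.sort(np.abs(W[W != 0])):       # candidate thresholds
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return np.flatnonzero(W >= t)
    return np.array([], dtype=int)             # nothing passes the filter

# Hypothetical statistics W_j = Z_j - Z~_j for six variables:
W = [5.0, 4.0, 3.0, -2.0, 1.0, -0.5]
selected = knockoff_plus_select(W, q=0.34)     # -> variables 0, 1, 2
```

Large positive Wj is evidence that the original variable beats its knockoff; the negative Wj act as an internal estimate of the number of false positives.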
Robustness

Markov chain covariates with 5 hidden states; binomial response

[Figure: Power and FDP against signal amplitude (4–20), over 100 repetitions (estimated FX)]

n = 1000, p = 1000, target FDR: α = 0.1
Zj = |β̂j(λCV)|, Wj = Zj − Z̃j
Simulations with a synthetic HMM

HMM covariates with a latent “clockwise” Markov chain; binomial response

[Figure: Power and FDP against signal amplitude (3–20), over 100 repetitions (true FX)]

n = 1000, p = 1000, target FDR: α = 0.1
Zj = |β̂j(λCV)|, Wj = Zj − Z̃j
Robustness

HMM covariates with a latent “clockwise” Markov chain; binomial response

[Figure: Power and FDP against signal amplitude (3–20), over 100 repetitions (estimated FX)]

n = 1000, p = 1000, target FDR: α = 0.1
Zj = |β̂j(λCV)|, Wj = Zj − Z̃j
Out-of-sample parameter estimation

Inhomogeneous Markov chain covariates with 5 hidden states; binomial response

[Figure: Power and FDP against the number of unsupervised observations (10–10,000), over 100 repetitions (estimated FX from an independent dataset)]

n = 1000, p = 1000, target FDR: α = 0.1
Zj = |β̂j(λCV)|, Wj = Zj − Z̃j
Genetic Data Analysis
Genetic analysis

Crohn’s disease (CD)
Wellcome Trust Case Control Consortium (WTCCC)
n ≈ 5,000 subjects (≈ 2,000 patients, ≈ 3,000 healthy controls)
p ≈ 400,000 SNPs
Previously analyzed in WTCCC (2007)

Lipid traits (HDL, LDL cholesterol)
Northern Finland 1966 Birth Cohort study of metabolic syndrome (NFBC)
n ≈ 4,700 subjects
p ≈ 330,000 SNPs
Previously analyzed in Sabatti et al. (2009)
High-level results

Knockoffs with nominal FDR level of 10%

Power is much higher:

Dataset | Original study | Knockoffs (average)
CD      | 9              | 22.8
HDL     | 5              | 8
LDL     | 6              | 9.8

Quite a few of the discoveries made by knockoffs were confirmed by larger GWAS (Franke et al., ’10; Willer et al., ’13)

Knockoffs made a number of new discoveries
Expect some (roughly 10%) of these to be false discoveries
It is likely that many of these correspond to true discoveries
Evidence from independent studies about adjacent genes shows some of the top unconfirmed hits to be promising candidates
Selection frequency | SNP (cluster size) | Chr. | Position range (Mb) | Franke et al. ’10 | WTCCC ’07
100% rs11209026 (2) 1 67.31–67.42 yes yes
99% rs6431654 (20) 2 233.94–234.11 yes yes
98% rs6688532 (33) 1 169.4–169.65 yes
97% rs17234657 (1) 5 40.44–40.44 yes yes
95% rs11805303 (16) 1 67.31–67.46 yes yes
91% rs7095491 (18) 10 101.26–101.32 yes yes
91% rs3135503 (16) 16 49.28–49.36 yes yes
81% rs7768538 (1145) 6 25.19–32.91 yes yes
80% rs6601764 (1) 10 3.85–3.85 yes
75% rs7655059 (5) 4 89.5–89.53
73% rs6500315 (4) 16 49.03–49.07 yes yes
72% rs2738758 (5) 20 61.71–61.82 yes
70% rs7726744 (46) 5 40.35–40.71 yes yes
68% rs11627513 (7) 14 96.61–96.63
66% rs4246045 (46) 5 150.07–150.41 yes yes
62% rs9783122 (234) 10 106.43–107.61
61% rs6825958 (3) 4 55.73–55.77
Table: SNP clusters found to be important for CD over 100 repetitions of knockoffs.
Selection frequency | SNP (cluster size) | Chr. | Position range (Mb) | Confirmed in Willer et al. ’13 | Found in Sabatti et al. ’09
100% rs1532085 (4) 15 58.68–58.7 yes yes
100% rs7499892 (1) 16 57.01–57.01 yes yes
100% rs1800961 (1) 20 43.04–43.04 yes
99% rs1532624 (2) 16 56.99–57.01 yes yes
95% rs255049 (142) 16 66.41–69.41 yes yes
Table: SNP clusters found to be important for HDL over 100 repetitions of knockoffs.
Selection frequency | SNP (cluster size) | Chr. | Position range (Mb) | Confirmed in Willer et al. ’13 | Found in Sabatti et al. ’09
99% rs4844614 (34) 1 207.3–207.88 yes
97% rs646776 (5) 1 109.8–109.82 yes yes
97% rs2228671 (2) 19 11.2–11.21 yes yes
94% rs157580 (4) 19 45.4–45.41 yes yes
92% rs557435 (21) 1 55.52–55.72 yes
80% rs10198175 (1) 2 21.13–21.13 yes yes
76% rs10953541 (58) 7 106.48–107.3
62% rs6575501 (1) 14 95.64–95.64
Table: SNP clusters found to be important for LDL over 100 repetitions of knockoffs.
[Figure: bar charts of the number of discoveries for HDL, LDL and CD, and of the proportion of confirmed discoveries for each trait]

Figure: Number of discoveries made on different GWAS datasets (left) and proportion of discoveries confirmed by a meta-analysis (right). Red lines correspond to the results published in the papers that first analyzed our datasets.
Data analysis issues

(1) Estimate the distribution of the SNPs (an HMM) to build knockoffs
(2) Deal with highly correlated SNPs

(1) Estimating the HMM
Methodology of Scheet and Stephens ’06
Fitted with fastPHASE (EM), K ≈ 10 possible hidden states
For each individual, making a knockoff copy of 70,000 SNPs takes about 1.3 s on an Intel Xeon CPU (2.6 GHz) (after parameter estimation)
Highly correlated SNPs

Hard to choose between two or more nearly identical variables if the data support at least one of them being selected

[Diagram: SNPs grouped into clusters, each with a representative]

Cluster SNPs using estimated correlations as a similarity measure and a single-linkage cutoff of 0.5; settle for discovering important SNP clusters among 71,145 candidates for CD and 59,005 for cholesterol
Cluster variables? Choose a representative SNP from each cluster (see also Reid and Tibshirani, ’15); approximate null: cluster rep ⊥⊥ Y | other reps
Which rep? The most significant SNP, as computed on 20% of the samples
Safe data re-use (to optimize power) as in Barber and C. (’16)
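The clustering step above can be sketched without real genotypes: cutting a single-linkage dendrogram at dissimilarity 0.5 is the same as taking the connected components of the graph that joins two SNPs whenever 1 − |corr| < 0.5. The toy matrix and cutoff below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "SNP" matrix (illustrative, not real genotypes): columns 0 and 1 are
# nearly identical, column 2 is independent of both.
base = rng.normal(size=500)
X = np.column_stack([base,
                     base + 0.05 * rng.normal(size=500),
                     rng.normal(size=500)])

def single_linkage_clusters(X, cutoff=0.5):
    """Cluster columns with dissimilarity 1 - |corr|: a single-linkage cut at
    `cutoff` equals the connected components of the graph that links columns
    i and j whenever their dissimilarity is below the cutoff."""
    d = 1.0 - np.abs(np.corrcoef(X, rowvar=False))
    m = X.shape[1]
    labels = list(range(m))
    for i in range(m):
        for j in range(i + 1, m):
            if d[i, j] < cutoff:               # merge the two components
                li, lj = labels[i], labels[j]
                labels = [li if l == lj else l for l in labels]
    return labels

labels = single_linkage_clusters(X)            # columns 0 and 1 share a cluster
```

In the actual analysis each cluster would then be represented by its most significant SNP, computed on a held-out 20% of the samples.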
Safe data re-use

We used an independent split of the data to select representative SNPs:
X(0): used for selecting the reps, and safely re-used for inference
X(1), X̃(1): used only for inference

[Diagram: signs of the statistics ordered by |W|; under the null, each sign is + or − with probability 1/2]

Re-use the data to improve the ordering, but not to compute the signs (1-bit p-values)
Simulations with genetic covariates

Real genetic covariates X; logistic conditional model Y | X with 60 variables

[Figure: Power and FDP against signal amplitude (8–20), over 100 repetitions]

Zj = |β̂j(λCV)|, Wj = Zj − Z̃j, target FDR: α = 0.1
Diagnostic plot: simulation with data from Chromosome 1

Feature importance Zj = |β̂j(λCV)|

[Figure: scatter plot of feature importances (0.00–0.15) against variable index (0–10,000) for the original and knockoff variables]
Diagnostic plot: simulation with data from Chromosome 1
Feature importance Zj = |βj(λCV)|
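The importance statistic in the plot, Zj = |βj(λCV)|, is just the absolute lasso coefficient at the cross-validated penalty. A minimal sketch on simulated data (the data, dimensions, and variable names here are ours, not from the lecture):

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Simulated design: n = 200 samples, p = 50 variables, first 5 are true signals.
rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = 1.0
y = X @ beta + rng.standard_normal(n)

# Fit the lasso along a path, picking lambda by 5-fold cross-validation.
lasso = LassoCV(cv=5, random_state=0).fit(X, y)

# Feature importance Z_j = |beta_j(lambda_CV)|, as in the diagnostic plot.
Z = np.abs(lasso.coef_)
```

In the knockoff filter these importances would be computed jointly for the original variables and their knockoff copies, and contrasted as Wj = Zj − Z̃j.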
[Figure: a second feature-importance scatter plot with the same axes (variable index 0–10,000 vs. feature importance 0.00–0.15).]
Results of data analysis
Selection frequency | SNP (cluster size) | Chr. | Position range (Mb) | Franke et al. '10 | WTCCC '07
100% | rs11209026 (2)    |  1 | 67.31–67.42   | yes | yes
99%  | rs6431654 (20)    |  2 | 233.94–234.11 | yes | yes
98%  | rs6688532 (33)    |  1 | 169.4–169.65  | yes |
97%  | rs17234657 (1)    |  5 | 40.44–40.44   | yes | yes
95%  | rs11805303 (16)   |  1 | 67.31–67.46   | yes | yes
91%  | rs7095491 (18)    | 10 | 101.26–101.32 | yes | yes
91%  | rs3135503 (16)    | 16 | 49.28–49.36   | yes | yes
81%  | rs7768538 (1145)  |  6 | 25.19–32.91   | yes | yes
80%  | rs6601764 (1)     | 10 | 3.85–3.85     | yes |
75%  | rs7655059 (5)     |  4 | 89.5–89.53    |     |
73%  | rs6500315 (4)     | 16 | 49.03–49.07   | yes | yes
72%  | rs2738758 (5)     | 20 | 61.71–61.82   | yes |
70%  | rs7726744 (46)    |  5 | 40.35–40.71   | yes | yes
68%  | rs11627513 (7)    | 14 | 96.61–96.63   |     |
66%  | rs4246045 (46)    |  5 | 150.07–150.41 | yes | yes
62%  | rs9783122 (234)   | 10 | 106.43–107.61 |     |
61%  | rs6825958 (3)     |  4 | 55.73–55.77   |     |
Table: SNP clusters found to be important for CD over 100 repetitions of knockoffs.
Selection frequency | SNP (cluster size) | Chr. | Position range (Mb) | Confirmed in Willer et al. '13 | Found in Sabatti et al. '09
100% | rs1532085 (4)  | 15 | 58.68–58.7  | yes | yes
100% | rs7499892 (1)  | 16 | 57.01–57.01 | yes | yes
100% | rs1800961 (1)  | 20 | 43.04–43.04 | yes |
99%  | rs1532624 (2)  | 16 | 56.99–57.01 | yes | yes
95%  | rs255049 (142) | 16 | 66.41–69.41 | yes | yes
Table: SNP clusters found to be important for HDL over 100 repetitions of knockoffs.
Selection frequency | SNP (cluster size) | Chr. | Position range (Mb) | Confirmed in Willer et al. '13 | Found in Sabatti et al. '09
99% | rs4844614 (34)  |  1 | 207.3–207.88  | yes |
97% | rs646776 (5)    |  1 | 109.8–109.82  | yes | yes
97% | rs2228671 (2)   | 19 | 11.2–11.21    | yes | yes
94% | rs157580 (4)    | 19 | 45.4–45.41    | yes | yes
92% | rs557435 (21)   |  1 | 55.52–55.72   | yes |
80% | rs10198175 (1)  |  2 | 21.13–21.13   | yes | yes
76% | rs10953541 (58) |  7 | 106.48–107.3  |     |
62% | rs6575501 (1)   | 14 | 95.64–95.64   |     |
Table: SNP clusters found to be important for LDL over 100 repetitions of knockoffs.
Summary and open questions
Knockoffs offers finite-sample inferential properties in subtle and important problems
Knockoffs is a powerful, flexible, and robust solution whenever there is considerable outside information on the distribution of X, as in GWAS
Knockoffs addresses the replicability issue
Where is the burden of knowledge?
Robustness theory (Barber, Samworth and C.)
Derandomization (multiple knockoffs)
Knockoff constructions and statistics for other applications
What’s happening in selective inference III?
Lecture 3 (Thu. 8:30 a.m.)
Other views on selective inference: geography & vignettes
False coverage rate (Benjamini & Yekutieli)
POSI (Berk, Brown, Buja, Zhang, Zhao)
Inference after Lasso (Taylor et al.)
Selective hypothesis testing (Fithian et al.)
Thank You!
Derandomization
Combine information from multiple knockoffs: who's consistently showing up?
[Figure: Cartoon representation of the W statistics from different sample realizations of knockoffs — the ordering of variables by |W| changes from one realization to the next.]
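The selection frequencies reported in the tables above follow this idea: rerun the (randomized) knockoff filter many times and keep the variables that show up consistently. A minimal sketch, where `select_once` is a hypothetical stand-in for one run of the knockoff filter (the function names and the 60% threshold are our own choices, not from the lecture):

```python
import numpy as np

def derandomized_selection(select_once, n_reps=100, freq_threshold=0.6):
    """Aggregate repeated runs of a randomized selection procedure.

    select_once() should perform one run of the knockoff filter and
    return a boolean mask of selected variables (hypothetical interface).
    Returns the per-variable selection frequency and the indices selected
    in at least a freq_threshold fraction of the runs.
    """
    counts = None
    for _ in range(n_reps):
        sel = np.asarray(select_once(), dtype=float)
        counts = sel if counts is None else counts + sel
    freq = counts / n_reps
    return freq, np.flatnonzero(freq >= freq_threshold)
```

With 100 repetitions this reproduces the "selection frequency" column of the data-analysis tables: a variable reported at 95% was selected in 95 of the 100 knockoff realizations.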
For a Markov chain X1 → X2 → · · · → Xp with initial distribution q1 and transition kernels Qj(·|·), the knockoffs X̃1, X̃2, . . . are sampled sequentially.

Sampling X̃1
p(X1 | X−1) = p(X1 | X2) = p(X1, X2) / p(X2) = q1(X1) Q2(X2|X1) / Z1(X2),
where Z1(z) = Σu q1(u) Q2(z|u).

Sampling X̃2
p(X2 | X−2, X̃1) = p(X2 | X1, X3, X̃1) ∝ Q2(X2|X1) Q3(X3|X2) · Q2(X2|X̃1) / Z1(X2),
with normalization constant Z2(X3), where
Z2(z) = Σu Q2(u|X1) Q3(z|u) Q2(u|X̃1) / Z1(u).
Sampling X̃3
p(X3 | X−3, X̃1, X̃2) = p(X3 | X2, X4, X̃1, X̃2) ∝ Q3(X3|X2) Q4(X4|X3) · Q3(X3|X̃2) / Z2(X3),
with normalization constant Z3(X4), where
Z3(z) = Σu Q3(u|X2) Q4(z|u) Q3(u|X̃2) / Z2(u).

And so on, sampling each X̃j in turn.
Computationally efficient: O(p)
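For a chain on finitely many states, the recursion above can be sketched directly: maintain the normalization function Z from the previous step and sample each knockoff coordinate in turn. This is our own illustrative implementation (the function name and interface are not from the lecture); with K states each step costs O(K²), so the whole pass is O(p) in the chain length, as the slide notes.

```python
import numpy as np

def markov_knockoffs(x, q1, Q, rng=None):
    """Sample a knockoff copy of a discrete Markov chain, sketching the
    sequential sampler above.

    x:  observed chain, length p, values in {0, ..., K-1}
    q1: initial distribution, shape (K,)
    Q:  list of p-1 transition matrices; Q[j][u, v] = P(X_{j+2}=v | X_{j+1}=u)
    """
    rng = np.random.default_rng() if rng is None else rng
    p, K = len(x), len(q1)
    xt = np.empty(p, dtype=int)
    Z = np.ones(K)  # running normalization Z_{j-1}(z); Z_0 is identically 1
    for j in range(p):
        # "forward" factor: q1(z) at the first step, Q_j(z | x_{j-1}) afterwards
        fwd = q1 if j == 0 else Q[j - 1][x[j - 1], :]
        # knockoff coupling factor Q_j(z | x~_{j-1}) / Z_{j-1}(z)
        base = fwd if j == 0 else fwd * Q[j - 1][xt[j - 1], :] / Z
        if j + 1 < p:
            w = base * Q[j][:, x[j + 1]]  # times Q_{j+1}(x_{j+1} | z)
            Z = base @ Q[j]               # Z_j(z) = sum_u base(u) Q_{j+1}(z | u)
        else:
            w = base                      # last coordinate: no forward factor
        xt[j] = rng.choice(K, p=w / w.sum())
    return xt
```

Checking the first two steps against the formulas: at j = 0 the weights are q1(z) Q2(x2|z) with Z1 = q1 @ Q2, and at j = 1 they are Q2(z|x1) Q3(x3|z) Q2(z|x̃1)/Z1(z), matching the expressions on the slides.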