machine learning for predicting type 1 diabetes from high-throughput data · 2015. 1. 24. ·...
Introduction Genomics dataset Proteomics dataset Methods
Machine learning for predicting type 1 diabetes from high-throughput data
Angermueller Christof
Institute of Computational Biology, Helmholtz-Zentrum Munich
September 18, 2013
Angermueller Christof Machine learning for predicting T1D 1/38
1 Introduction
2 Genomics dataset
  Introduction
  HLA model
  HLA/non-HLA model
  Feature selection
  Conclusions
3 Proteomics dataset
  Introduction
  No feature selection
  Feature selection
  Conclusions
4 Methods
  Classification
  Feature selection
Type 1 diabetes
Autoimmune disease
Autoantibodies destroy beta cells of pancreas
Lack of insulin production
High blood glucose levels
Consequences
Cardiovascular diseases
Kidney failure
Blindness
Disease factors
Genetic factors
HLA genes
non-HLA genes
Environmental factors
Coxsackie virus
Antibiotics
Milk
Gluten
Vitamin D
Stages in the development of T1D
[Figure: β cell mass (y-axis) over years (x-axis), from prediabetes to overt diabetes]
Genetic predisposition
Environmental triggers
Occurrence of islet autoantibodies
Progressive loss of islet β cells
Reduction of insulin production
Blood glucose levels above clinical threshold
Stop of insulin production
Too late!
2 Genomics dataset
Introduction
Aims
Autoantibodies against beta cells
First indication of T1D development
Costly and sample-dependent
Genomics data
Early prediction
Cost-effective
HLA genes: 30%-50% of total genetic risk
non-HLA genes: not used for prediction so far
Aims
1 Improve T1D prediction using HLA and 40 non-HLA genes
2 Quantify effect of HLA/non-HLA genes on T1D risk
Introduction
Dataset
Training set
Type 1 Diabetes Genetics Consortium
4587 cases + 1208 controls
Test set
Institute of Diabetes Research
765 cases + 423 controls
Features
1 HLA risk score: 0, 1, 2, 3, 4, 5
2 40 non-HLA risk scores: 0, 1, 2
HLA model
ROC logistic regression
[Figure: ROC curve, true positive rate vs. false positive rate]
AUC: Train 0.82, CV 0.81, Test 0.78
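The AUC values on this slide summarise the ROC curve in a single number. A minimal sketch of how such a value could be computed, assuming binary labels (1 = case, 0 = control) and real-valued classifier scores; the function name `roc_auc` and the toy data are illustrative and not taken from the talk:

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (case, control) pairs ranked correctly.

    Ties between a case and a control count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): three cases, three controls.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
auc_example = roc_auc(labels, scores)  # 8 of 9 pairs are ordered correctly
```

The pairwise form is quadratic in the number of samples, which is fine for a sketch; a rank-based formula gives the same value faster on large sets.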
HLA model
Logistic regression coefficients
[Figure: logistic regression coefficients (axis from 0 to 4) for HLA1 to HLA5; legend: p-value (0, 0.005]]
HLA/non-HLA model
AUC classification models
        HLA   LR    LASSO  LIN SVM  RBF SVM  RF
Train   0.82  0.87  0.87   0.87     1.00     1.00
CV      0.81  0.86  0.86   0.86     0.79     0.85
Test    0.78  0.84  0.84   0.84     0.75     0.82
HLA/non-HLA model
ROC logistic regression
[Figure: ROC curve, true positive rate vs. false positive rate]
AUC: Train 0.87, CV 0.86, Test 0.84
HLA/non-HLA model
Logistic regression coefficients
[Figure: logistic regression coefficients (axis from −0.25 to 0.75) for HLA and the 40 non-HLA features: HLA, PTPN22, INS, IL2R, ERBB3, IL10, ORMDL3, PRKD2, BACH2, GLIS3, RNLS, SH2B3, IL27, UBASH3A, RS10517086, IL7R, RS5753037, IL2B, RGS1, IFIH1, RS7221109, TNFAIP3, GAB3, IL18RAP, SCAP2, CTLA4, ZFP36L1, PRKCQ, RS7202877, COBL, C6ORF173, CD226, TLR8, KIAA, SIRPG, CD69, TAGAP, RS4900384, PTPN2, IL2, CTSH; legend: p-value (0, 0.005], (0.005, 0.05], (0.05, 0.5], (0.5, 1]]
Feature selection
Feature rankings
[Figure: rankings of HLA and the 40 non-HLA features by six methods (RF-R, MSVM RFE, SVM RFE, LASSO-R, FSCORE, VARBVS); x-axis: rank (0 to 40). All six rankings start with HLA, INS, PTPN22.]
Feature selection
Kendall τ rank correlation
          RF-R  MSVM RFE  SVM RFE  LASSO-R  FSCORE  VARBVS
RF-R      1.00  0.82      0.70     0.67     0.55    1.00
MSVM RFE  0.82  1.00      0.69     0.64     0.53    0.82
SVM RFE   0.70  0.69      1.00     0.69     0.49    0.70
LASSO-R   0.67  0.64      0.69     1.00     0.49    0.67
FSCORE    0.55  0.53      0.49     0.49     1.00    0.55
VARBVS    1.00  0.82      0.70     0.67     0.55    1.00
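Each entry of this matrix compares two feature rankings. A minimal sketch of the Kendall τ computation for two rankings of the same features without ties; the function name `kendall_tau` and the four-feature example are illustrative assumptions, not the code used for the slide:

```python
from itertools import combinations


def kendall_tau(ranking_a, ranking_b):
    """Kendall tau between two orderings of the same items (no ties).

    tau = (concordant pairs - discordant pairs) / total pairs.
    """
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    if set(pos_a) != set(pos_b) or len(pos_a) != len(ranking_a):
        raise ValueError("rankings must be permutations of the same items")
    if len(ranking_a) < 2:
        raise ValueError("need at least two items")
    concordant = discordant = 0
    for f, g in combinations(ranking_a, 2):
        # Same sign of the position differences means the pair is ordered alike.
        if (pos_a[f] - pos_a[g]) * (pos_b[f] - pos_b[g]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical rankings that differ only in the order of the last two features.
rank_1 = ["hla", "ins", "ptpn22", "erbb3"]
rank_2 = ["hla", "ins", "erbb3", "ptpn22"]
tau_example = kendall_tau(rank_1, rank_2)  # 5 concordant, 1 discordant pair
```

A value of 1.00 means two methods produced the same order, and values near 0 mean the orders are essentially unrelated.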
Feature selection
AUC
[Figure: AUC (axis from 0.82 to 0.86) vs. number of top-ranked features used for prediction (1 to 40), one curve per ranking method: FSCORE, LASSO-R, SVM RFE, MSVM RFE, RF-R, VARBVS]
Conclusions
Conclusions
HLA: 0.78
HLA + 40 non-HLA: 0.84
HLA + INS + PTPN22 + ERBB3: 0.82
1 HLA genes affect T1D risk most
2 non-HLA genes can improve prediction in combination
3 Proteomics dataset
Introduction
Stages in the development of T1D
[Figure: β cell mass (y-axis) over years (x-axis), from prediabetes to overt diabetes]
Time varies
Introduction
Aims
Problem
Time from autoantibodies to T1D onset varies a lot
Rapid progressor: T1D ≤ 3 years after autoantibodies
Slow progressor: no T1D > 10 years after autoantibodies
Aims
1 Discriminate slow/rapid progressors using peptides
2 Identify peptide markers
Introduction
Dataset
4384 blood serum peptides via mass-spectrometry
30 BABYDIAB samples
15 rapid progressors
15 slow progressors
Challenges
Large p, small n
No test set
No feature selection
ROC classification models
[Figure: ROC curves, true positive rate vs. false positive rate, one curve per model]
AUC: LASSO 0.54, LIN SVM 0.4, RBF SVM 0.32, RF 0.61
Feature selection
Feature rankings
[Figure: top-ranked peptides for each of six methods (RF-R, FSCORE, LASSO-R, MSVM RFE, SVM RFE, VARBVS); x-axis: rank (0 to 50)]
Feature selection
Kendall τ rank correlation
          RF-R   FSCORE  LASSO-R  MSVM RFE  SVM RFE  VARBVS
RF-R      1.00   0.59    0.15     −0.04     0.14     0.02
FSCORE    0.59   1.00    0.10     −0.06     0.15     0.00
LASSO-R   0.15   0.10    1.00     0.03      −0.01    0.20
MSVM RFE  −0.04  −0.06   0.03     1.00      0.07     0.00
SVM RFE   0.14   0.15    −0.01    0.07      1.00     −0.14
VARBVS    0.02   0.00    0.20     0.00      −0.14    1.00
Feature selection
AUC
[Figure: AUC (axis from 0.4 to 1.0) vs. number of top-ranked features used for prediction (1 to 100), one curve per ranking method: FSCORE, LASSO-R, SVM RFE, MSVM RFE, RF-R, VARBVS]
Feature selection
What I did wrong
Evaluation of prediction performance
1 Rank features using all samples
2 Estimate AUC by cross-validation
AUC is estimated on samples that were already used for ranking
AUC is too optimistic
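One way to avoid this optimism is to repeat the feature ranking inside every cross-validation fold, so that held-out samples never influence which features are selected. A minimal sketch of that idea follows; the ranking rule (difference of class means) and the centroid classifier are hypothetical stand-ins chosen to keep the example short, not the methods used in the talk, and the toy data are invented:

```python
def rank_by_mean_difference(X, y):
    """Stand-in ranking: sort features by |mean(class 1) - mean(class 0)|."""
    n_features = len(X[0])
    diffs = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        diffs.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return sorted(range(n_features), key=lambda j: diffs[j], reverse=True)


def column_means(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def fit_centroids(X, y):
    """Stand-in classifier: store the mean feature vector of each class."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    return column_means(pos), column_means(neg)


def centroid_score(model, x):
    """Positive score: x lies on the class-1 side of the centroid midpoint."""
    mu_pos, mu_neg = model
    return sum((xi - (p + n) / 2.0) * (p - n)
               for xi, p, n in zip(x, mu_pos, mu_neg))


def cross_validated_scores(X, y, folds, n_top):
    """Score every sample with a model that never saw it, ranking included."""
    scores = [None] * len(y)
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        # Ranking uses the training samples of this fold only.
        top = rank_by_mean_difference(X_train, y_train)[:n_top]
        model = fit_centroids([[row[j] for j in top] for row in X_train], y_train)
        for i in test_idx:
            scores[i] = centroid_score(model, [X[i][j] for j in top])
    return scores


# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
X = [[2.0, 0.1], [2.2, -0.3], [1.9, 0.2], [0.1, 0.0], [-0.2, 0.3], [0.0, -0.1]]
y = [1, 1, 1, 0, 0, 0]
loo_folds = [[i] for i in range(len(y))]  # leave-one-out
loo_scores = cross_validated_scores(X, y, loo_folds, n_top=1)
```

The AUC would then be computed from `loo_scores`. With 30 samples and 4384 peptides, ranking on all samples first can let noise features look predictive, which is the effect the slide describes.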
Feature selection
Biological functions
Nr  SVM RFE                        MSVM RFE
    Symbol  Name                   Symbol  Name
1   APOA4   apolipoprotein         APOA4   apolipoprotein
2   FN1     fibronectin 1          HP      haptoglobin NA
3   MASP1   mannan-binding         SAA1    serum amyloid A1
4   SAA1    serum amyloid A1       C4BPB   complement
5   ATXN2   ataxin 2               FN1     fibronectin 1
6   ITIH1   inter-alpha-trypsin    SAA1    serum amyloid A1
7   SPTA1   spectrin, alpha,       FN1     fibronectin 1
8   FCN3    ficolin                FCN3    ficolin
9   ALB     albumin NA             SAA1    serum amyloid A1
10  CP      ceruloplasmin          C1QB    complement
11  ALB     albumin NA             SAA1    serum amyloid A1
12  SAA1    serum amyloid A1       FN1     fibronectin 1
13  C4BPB   complement             CFHR4   complement
14  PCYOX1  prenylcysteine         FN1     fibronectin 1
15  HP      haptoglobin NA         FN1     fibronectin 1
Feature selection
Complement system
Conclusions
Conclusions
SVM RFE good for biomarker discovery
Few peptides discriminate between rapid/slow progressors
Peptides supported by literature
Results need to be verified
Larger sample
Feature ranking external to AUC estimation
Conclusions
Questions
Questions?
4 Methods
Classification
Classification models
LR Logistic regression
LASSO Logistic regression with L1 regularization
LIN SVM SVM with linear kernel
RBF SVM SVM with RBF kernel
RF Random forest
Classification
Logistic regression
\[ P(y = 1 \mid x, \theta) = \frac{1}{1 + \exp(-\theta^T x)} \]
LASSO: L1 norm regularization
\[ -\lambda \sum_j |\theta_j| \]
[Figure: logistic function g(z) for z from −5 to 5, with values between 0 and 1]
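The two formulas on this slide can be combined into an L1-penalised log-likelihood. A minimal sketch, assuming labels in {0, 1} and that the penalty term is added to the log-likelihood that is maximised; the function names are illustrative, and whether an intercept is excluded from the penalty is not stated on the slide (this sketch penalises every coefficient):

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(theta, x):
    """P(y = 1 | x, theta) for one sample x."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def l1_penalized_log_likelihood(theta, X, y, lam):
    """Log-likelihood of labels y in {0, 1} minus lam * sum_j |theta_j|."""
    log_lik = 0.0
    for x, label in zip(X, y):
        p = predict_proba(theta, x)
        log_lik += label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return log_lik - lam * sum(abs(t) for t in theta)


# Toy data (hypothetical): two features, four samples.
X_toy = [[1.0, 0.0], [2.0, 1.0], [0.0, 1.0], [0.0, 2.0]]
y_toy = [1, 1, 0, 0]
theta_toy = [1.0, -2.0]
unpenalized = l1_penalized_log_likelihood(theta_toy, X_toy, y_toy, lam=0.0)
penalized = l1_penalized_log_likelihood(theta_toy, X_toy, y_toy, lam=0.5)
```

Larger values of `lam` make non-zero coefficients more costly, which is what drives some of them to exactly zero in a LASSO fit.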
Classification
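Support vector machine
\[ h(x) = \mathrm{sign}\left( \sum_{i=1}^{k} \alpha_i y^{(i)} K(x^{(i)}, x) \right) \]
Linear kernel
\[ K(u, v) = u^T v \]
RBF kernel
\[ K(u, v) = \exp(-\gamma \|u - v\|_2^2) \]
[Figure: separating hyperplane with two support vectors marked]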
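The decision function and the two kernels can be written out directly. A minimal sketch, assuming labels in {−1, +1} and already-trained coefficients α; like the slide formula it has no bias term, and the support vectors and coefficients below are invented for illustration:

```python
import math


def linear_kernel(u, v):
    """K(u, v) = u^T v."""
    return sum(a * b for a, b in zip(u, v))


def rbf_kernel(u, v, gamma=1.0):
    """K(u, v) = exp(-gamma * ||u - v||_2^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def svm_decision(x, support_vectors, labels, alphas, kernel):
    """h(x) = sign(sum_i alpha_i * y_i * K(x_i, x)); a sum of 0 maps to +1 here."""
    total = sum(a * y * kernel(sv, x)
                for sv, y, a in zip(support_vectors, labels, alphas))
    return 1 if total >= 0 else -1


# Hypothetical trained model with one support vector per class.
support_vectors = [[1.0, 1.0], [-1.0, -1.0]]
sv_labels = [1, -1]
alphas = [0.5, 0.5]

pred_linear = svm_decision([2.0, 0.5], support_vectors, sv_labels, alphas, linear_kernel)
pred_rbf = svm_decision([-1.0, -3.0], support_vectors, sv_labels, alphas,
                        lambda u, v: rbf_kernel(u, v, gamma=0.5))
```

Only the support vectors enter the sum, so prediction cost depends on their number k rather than on the size of the training set.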
Classification
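Random forest
[Figure: example decision tree with root split x1 < 5 and second-level splits x2 < 6 and x2 < 3 (Yes/No branches), leaves y1 to y4, the matching partition of the (x1, x2) plane, and an example point x = (4, 8)]
Ensemble of K trees grown from bootstrap samples
L random variables considered per split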
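The example tree can be traversed in a few lines, and a forest prediction is a vote over such trees. A minimal sketch, assuming that the "Yes" branch is the left child and that the leaves y1 to y4 are ordered left to right (the extracted slide text does not make this explicit); the dictionary layout and function names are illustrative:

```python
from collections import Counter

# Tree from the slide under the stated assumptions; features are 0-indexed,
# so x1 is feature 0 and x2 is feature 1.
TREE = {
    "feature": 0, "threshold": 5.0,
    "yes": {"feature": 1, "threshold": 6.0, "yes": "y1", "no": "y2"},
    "no": {"feature": 1, "threshold": 3.0, "yes": "y3", "no": "y4"},
}


def predict_tree(node, x):
    """Follow the splits until a leaf (a plain label) is reached."""
    while isinstance(node, dict):
        branch = "yes" if x[node["feature"]] < node["threshold"] else "no"
        node = node[branch]
    return node


def predict_forest(trees, x):
    """Majority vote over the predictions of all trees."""
    votes = Counter(predict_tree(tree, x) for tree in trees)
    return votes.most_common(1)[0][0]


leaf_example = predict_tree(TREE, (4.0, 8.0))  # x1 < 5 holds, x2 < 6 does not
# A toy "forest": the tree above plus two single-leaf trees (hypothetical).
forest_example = predict_forest([TREE, "y2", "y1"], (4.0, 8.0))
```

In a real random forest each tree would be grown on its own bootstrap sample with L randomly chosen candidate variables per split; this sketch only covers prediction.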
Feature selection
Feature selection
[Figure: data matrix with samples 1 to n on one axis and m features on the other; m >> n]
Feature selection
Feature selection
[Figure: data matrix with samples 1 to n on one axis and m features on the other; after selection m << n]
1 All features: 1 2 3 4 5 6 7 8 9 10
2 Ranked features: 5 7 2 1 10 6 9 3 8 4
3 Selected features: 5 7 2 1 10 6 9 3 8 4
Feature selection
Feature ranking methods
FSCORE
[Figure: normalised concentration of peptide p267 (VTN) in rapid vs. slow progressors]
LASSO-R
[Figure: regularization path, coefficients θ(λ) of features x1 to x10 plotted against −log(λ)]
SVM RFE
1 Compute SVM weights θ
2 Select the feature j with the smallest θ_j^2
3 Remove feature j
4 Repeat
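The four steps form a backward elimination loop. A minimal sketch of that loop follows; to stay self-contained it does not train an SVM, and the weight function `mean_difference_weights` is a hypothetical stand-in (difference of class means) that would be swapped for the weights of a linear SVM in real SVM RFE. The toy data are invented:

```python
def mean_difference_weights(X, y):
    """Stand-in for linear SVM weights: mean(class +1) - mean(class -1) per feature."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == -1]
    return [sum(p) / len(pos) - sum(n) / len(neg)
            for p, n in zip(zip(*pos), zip(*neg))]


def recursive_feature_elimination(X, y, fit_weights):
    """Return feature indices ranked from most to least important."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while remaining:
        X_sub = [[row[j] for j in remaining] for row in X]
        theta = fit_weights(X_sub, y)            # step 1: compute weights
        worst = min(range(len(remaining)),        # step 2: smallest theta_j^2
                    key=lambda k: theta[k] ** 2)
        eliminated.append(remaining.pop(worst))   # step 3: remove feature j
    # step 4 is the loop itself; the last feature removed ranks first.
    return eliminated[::-1]


# Toy data (hypothetical): feature 2 separates best, feature 1 not at all.
X_rfe = [[1.0, 0.5, 3.0], [1.2, 0.4, 2.8], [0.0, 0.5, 0.0], [0.2, 0.4, -0.2]]
y_rfe = [1, 1, -1, -1]
rfe_ranking = recursive_feature_elimination(X_rfe, y_rfe, mean_difference_weights)
```

With the stand-in weights each feature is scored independently, so refitting changes nothing here; with a real SVM the weights of the remaining features shift after every removal, which is the reason for recomputing them in each round.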
MSVM RFE
Multiple SVMs
RF-R
Random forest
Increase out-of-bag error
VARBVS
Variational Bayesian Variable Selection
P(included in model|data)