machine learning for predicting type 1 diabetes from high-throughput data · 2015. 1. 24. ·...


Machine learning for predicting type 1 diabetes from high-throughput data

Angermueller Christof

Institute of Computational Biology, Helmholtz-Zentrum Munich

September 18, 2013

Angermueller Christof Machine learning for predicting T1D 1/38


1 Introduction

2 Genomics dataset
   Introduction
   HLA model
   HLA/non-HLA model
   Feature selection
   Conclusions

3 Proteomics dataset
   Introduction
   No feature selection
   Feature selection
   Conclusions

4 Methods
   Classification
   Feature selection



Type 1 diabetes

Autoimmune disease

Autoantibodies destroy beta cells of pancreas

Lack of insulin production

High blood glucose levels

Consequences

Cardiovascular diseases

Kidney failure

Blindness



Disease factors

Genetic factors

HLA genes

non-HLA genes

Environmental factors

Coxsackie virus

Antibiotics

Milk

Gluten

Vitamin D



Stages in the development of T1D

[Figure: β cell mass (y-axis) over time in years (x-axis), divided into a prediabetes phase and an overt diabetes phase]

Genetic predisposition

Environmental triggers

Occurrence of islet autoantibodies

Progressive loss of islet β cells

Reduction of insulin production

Blood glucose levels above clinical threshold

Stop insulin production

Too late!










Introduction

Aims

Autoantibodies against beta cells

First indication of T1D development

Costly + sample dependent

Genomics data

Early prediction

Cost-effective

HLA genes: 30%-50% of total genetic risk

non-HLA genes: not used for prediction so far

Aims

1 Improve T1D prediction using HLA and 40 non-HLA genes

2 Quantify effect of HLA/non-HLA genes on T1D risk



Introduction

Dataset

Training set

Type 1 Diabetes Genetics Consortium

4587 cases + 1208 controls

Test set

Institute of Diabetes Research

765 cases + 423 controls

Features

1 HLA risk score: 0, 1, 2, 3, 4, 5

2 40 non-HLA risk scores: 0, 1, 2



HLA model

ROC logistic regression

[Figure: ROC curve, true positive rate (y-axis) against false positive rate (x-axis)]

AUC: Test 0.78, CV 0.81, Train 0.82
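The legend values summarise each ROC curve as a single number, the area under the curve (AUC). Below is a minimal sketch of how such a value can be computed from classifier scores, using the rank formulation (the probability that a randomly chosen case scores higher than a randomly chosen control). The function name `roc_auc` and the toy scores are illustrative assumptions, not the code or data behind the slides.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random case outscores a random control.

    labels: sequence of 0/1 (1 = case); scores: classifier outputs.
    A tie between a case and a control counts as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example with made-up scores: 3 cases followed by 3 controls.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

Here 8 of the 9 case/control pairs are ordered correctly, so the sketch reports 8/9. An AUC of 0.5 corresponds to random guessing, 1.0 to perfect separation.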



HLA model

Logistic regression coefficients

[Figure: logistic regression coefficients (axis range 0 to 4) for the HLA risk score levels HLA1 to HLA5; p-value legend: (0, 0.005]]



HLA/non-HLA model

AUC classification models

       HLA    LR     LASSO  LIN SVM  RBF SVM  RF
Train  0.82   0.87   0.87   0.87     1.00     1.00
CV     0.81   0.86   0.86   0.86     0.79     0.85
Test   0.78   0.84   0.84   0.84     0.75     0.82



HLA/non-HLA model

ROC logistic regression

[Figure: ROC curve, true positive rate (y-axis) against false positive rate (x-axis)]

AUC: Test 0.84, CV 0.86, Train 0.87



HLA/non-HLA model

Logistic regression coefficients

[Figure: logistic regression coefficients (axis range −0.25 to 0.75) for HLA and the 40 non-HLA features, coloured by p-value bin: (0, 0.005], (0.005, 0.05], (0.05, 0.5], (0.5, 1]]

Features in the order they appear on the axis: HLA, PTPN22, INS, IL2R, ERBB3, IL10, ORMDL3, PRKD2, BACH2, GLIS3, RNLS, SH2B3, IL27, UBASH3A, RS10517086, IL7R, RS5753037, IL2B, RGS1, IFIH1, RS7221109, TNFAIP3, GAB3, IL18RAP, SCAP2, CTLA4, ZFP36L1, PRKCQ, RS7202877, COBL, C6ORF173, CD226, TLR8, KIAA, SIRPG, CD69, TAGAP, RS4900384, PTPN2, IL2, CTSH



Feature selection

Feature rankings

[Figure: rank (x-axis, 0 to 40) of HLA and the 40 non-HLA features under each ranking method: RF-R, MSVM RFE, SVM RFE, LASSO-R, FSCORE, VARBVS. Every method ranks HLA, INS and PTPN22 first, second and third.]


Feature selection

Kendall τ rank correlation

           RF-R   MSVM RFE  SVM RFE  LASSO-R  FSCORE  VARBVS
RF-R       1.00   0.82      0.70     0.67     0.55    1.00
MSVM RFE   0.82   1.00      0.69     0.64     0.53    0.82
SVM RFE    0.70   0.69      1.00     0.69     0.49    0.70
LASSO-R    0.67   0.64      0.69     1.00     0.49    0.67
FSCORE     0.55   0.53      0.49     0.49     1.00    0.55
VARBVS     1.00   0.82      0.70     0.67     0.55    1.00
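Kendall's τ measures how similarly two methods order the same features: it counts feature pairs placed in the same relative order (concordant) against pairs placed in opposite order (discordant). A minimal sketch for rankings without ties follows; the function name `kendall_tau` and the two short example rankings are made up for illustration and are not the rankings from the slides.

```python
from itertools import combinations


def kendall_tau(ranking_a, ranking_b):
    """Kendall tau between two rankings of the same items (no ties).

    Each ranking is a list of items ordered from best to worst.
    Returns (concordant - discordant) / number_of_pairs, in [-1, 1].
    """
    if len(ranking_a) < 2:
        raise ValueError("need at least two items")
    if len(set(ranking_a)) != len(ranking_a) or set(ranking_a) != set(ranking_b):
        raise ValueError("rankings must contain the same distinct items")
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    concordant = discordant = 0
    for x, y in combinations(ranking_a, 2):
        # Same sign of the position differences means the pair is ordered alike.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    n_pairs = len(ranking_a) * (len(ranking_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical rankings that differ only in the order of the last two features.
method_1 = ["hla", "ins", "ptpn22", "erbb3"]
method_2 = ["hla", "ins", "erbb3", "ptpn22"]
tau = kendall_tau(method_1, method_2)
print(f"tau = {tau:.3f}")
```

With four features there are six pairs; one swapped pair gives τ = (5 − 1) / 6 ≈ 0.667. Identical rankings give 1, fully reversed rankings give −1.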



Feature selection

AUC

[Figure: AUC (y-axis, 0.82 to 0.86) against the number of top-ranked features used for prediction (x-axis, 1 to 40), one curve per ranking method: FSCORE, LASSO-R, SVM RFE, MSVM RFE, RF-R, VARBVS]



Conclusions

Conclusions

HLA: 0.78

HLA + 40 non-HLA: 0.84

HLA + INS + PTPN22 + ERBB3: 0.82

1 HLA genes affect T1D risk most

2 non-HLA genes can improve prediction in combination









Introduction


[Figure: β cell mass (y-axis) over time in years (x-axis), up to the stop of insulin production]

Time varies



Introduction

Aims

Problem

Time from autoantibodies to T1D onset varies a lot

Rapid progressor: T1D ≤ 3 years after autoantibodies

Slow progressor: no T1D > 10 years after autoantibodies

Aims

1 Discriminate slow/rapid progressors using peptides

2 Identify peptide markers



Introduction

Dataset

4384 blood serum peptides via mass-spectrometry

30 BABYDIAB samples

15 rapid progressors

15 slow progressors

Challenges

Large p, small n

No test set



No feature selection

ROC classification models

[Figure: ROC curves of the classification models, true positive rate (y-axis) against false positive rate (x-axis)]

AUC: LASSO 0.54, LIN SVM 0.40, RBF SVM 0.32, RF 0.61



Feature selection

Feature rankings

[Figure: top-ranked peptides (x-axis, rank 0 to 50) under each ranking method: RF-R, FSCORE, LASSO-R, MSVM RFE, SVM RFE, VARBVS]


Feature selection

Kendall τ rank correlation

           RF-R   FSCORE  LASSO-R  MSVM RFE  SVM RFE  VARBVS
RF-R       1.00   0.59    0.15     -0.04     0.14     0.02
FSCORE     0.59   1.00    0.10     -0.06     0.15     0.00
LASSO-R    0.15   0.10    1.00     0.03      -0.01    0.20
MSVM RFE   -0.04  -0.06   0.03     1.00      0.07     0.00
SVM RFE    0.14   0.15    -0.01    0.07      1.00     -0.14
VARBVS     0.02   0.00    0.20     0.00      -0.14    1.00



Feature selection

AUC

[Figure: AUC (y-axis, 0.4 to 1.0) against the number of top-ranked features used for prediction (x-axis, 1 to 100), one curve per ranking method: FSCORE, LASSO-R, SVM RFE, MSVM RFE, RF-R, VARBVS]



Feature selection

What I did wrong

Evaluation prediction performance

1 Rank features with all samples

2 Estimate AUC by cross-validation

AUC estimated on samples used for ranking

AUC too optimistic
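The remedy for the mistake above is to repeat the feature ranking inside every cross-validation fold, using only that fold's training samples. The sketch below contrasts both procedures on pure-noise data of the same shape as the proteomics setting (30 samples, many features). Everything in it is an illustrative assumption: the ranking by class-mean difference and the centroid-based scoring are simple stand-ins for the ranking methods and classifiers on the slides, and held-out scores are pooled across folds before computing one AUC.

```python
import random


def mean(values):
    return sum(values) / len(values)


def class_means(X, y, j):
    """Mean of feature j among label-1 and among label-0 samples."""
    pos = mean([row[j] for row, label in zip(X, y) if label == 1])
    neg = mean([row[j] for row, label in zip(X, y) if label == 0])
    return pos, neg


def rank_features(X, y):
    """Stand-in ranking: sort features by absolute class-mean difference."""
    gaps = []
    for j in range(len(X[0])):
        pos, neg = class_means(X, y, j)
        gaps.append((abs(pos - neg), j))
    return [j for _, j in sorted(gaps, reverse=True)]


def centroid_scores(X_train, y_train, X_test, features):
    """Stand-in classifier: project test rows onto the class-mean difference."""
    stats = [class_means(X_train, y_train, j) for j in features]
    scores = []
    for row in X_test:
        s = 0.0
        for j, (pos, neg) in zip(features, stats):
            s += (row[j] - (pos + neg) / 2) * (pos - neg)
        scores.append(s)
    return scores


def roc_auc(labels, scores):
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cv_auc(X, y, n_top, rank_inside_folds, n_folds=5, seed=0):
    """Cross-validated AUC using the n_top best-ranked features."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    leaky_ranking = rank_features(X, y)  # sees all samples, including test folds
    labels, scores = [], []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        # Correct procedure: rank using the training part of this fold only.
        ranking = rank_features(X_tr, y_tr) if rank_inside_folds else leaky_ranking
        X_te = [X[i] for i in test_idx]
        scores += centroid_scores(X_tr, y_tr, X_te, ranking[:n_top])
        labels += [y[i] for i in test_idx]
    return roc_auc(labels, scores)


# Pure noise: no feature carries any information about the labels.
rng = random.Random(1)
y = [1] * 15 + [0] * 15
X = [[rng.gauss(0.0, 1.0) for _ in range(500)] for _ in y]
leaky = cv_auc(X, y, n_top=10, rank_inside_folds=False)
honest = cv_auc(X, y, n_top=10, rank_inside_folds=True)
print(f"ranking on all samples:   AUC = {leaky:.2f}")
print(f"ranking inside each fold: AUC = {honest:.2f}")
```

Because the data are noise, an unbiased estimate should land near 0.5. Ranking on all samples tends to report a clearly higher AUC, which is the optimism described on this slide.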



Feature selection

Biological functions

     SVM RFE                        MSVM RFE
Nr   Symbol   Name                  Symbol   Name
1    APOA4    apolipoprotein        APOA4    apolipoprotein
2    FN1      fibronectin 1         HP       haptoglobin NA
3    MASP1    mannan-binding        SAA1     serum amyloid A1
4    SAA1     serum amyloid A1      C4BPB    complement
5    ATXN2    ataxin 2              FN1      fibronectin 1
6    ITIH1    inter-alpha-trypsin   SAA1     serum amyloid A1
7    SPTA1    spectrin, alpha,      FN1      fibronectin 1
8    FCN3     ficolin               FCN3     ficolin
9    ALB      albumin NA            SAA1     serum amyloid A1
10   CP       ceruloplasmin         C1QB     complement
11   ALB      albumin NA            SAA1     serum amyloid A1
12   SAA1     serum amyloid A1      FN1      fibronectin 1
13   C4BPB    complement            CFHR4    complement
14   PCYOX1   prenylcysteine        FN1      fibronectin 1
15   HP       haptoglobin NA        FN1      fibronectin 1



Feature selection

Complement system



Conclusions

Conclusions

SVM RFE good for biomarker discovery

Few peptides discriminate between rapid/slow progressors

Peptides supported by literature

Results need to be verified

Larger sample

Feature ranking external to AUC estimation




Conclusions

Questions

Questions?









Classification

Classification models

LR Logistic regression

LASSO Logistic regression with L1 regularization

LIN SVM SVM with linear kernel

RBF SVM SVM with RBF kernel

RF Random forest



Classification

Logistic regression

$$P(y = 1 \mid x, \theta) = \frac{1}{1 + \exp(-\theta^T x)}$$

LASSO: L1 norm regularization

$$-\lambda \sum_j |\theta_j|$$

[Figure: logistic function g(z) (y-axis, 0 to 1) for z from −5 to 5]
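The two formulas on this slide can be written out directly. The sketch below implements the logistic function, the predicted probability P(y = 1 | x, θ), and the L1-penalised log-likelihood that LASSO maximises. The function names are illustrative assumptions, the intercept is left out for brevity, and no optimiser is included.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z)), arranged to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(theta, x):
    """P(y = 1 | x, theta) for a weight vector theta and feature vector x."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def penalized_log_likelihood(theta, X, y, lam):
    """Log-likelihood of labels y (0/1) minus lam * sum_j |theta_j|."""
    eps = 1e-12  # keeps log() finite when a probability rounds to 0 or 1
    ll = 0.0
    for x, label in zip(X, y):
        p = min(max(predict_proba(theta, x), eps), 1.0 - eps)
        ll += math.log(p) if label == 1 else math.log(1.0 - p)
    return ll - lam * sum(abs(t) for t in theta)


# Made-up example: two samples, two features.
X = [[1.0, 1.0], [1.0, 1.0]]
y = [1, 0]
print(predict_proba([0.0, 0.0], X[0]))                      # 0.5
print(penalized_log_likelihood([1.0, -1.0], X, y, lam=0.1))
```

A larger λ makes nonzero weights more expensive, which drives some θ_j to exactly zero and is what lets LASSO act as a feature selector.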



Classification

Support vector machine

$$h(x) = \operatorname{sign}\left( \sum_{i=1}^{k} \alpha_i y^{(i)} K(x^{(i)}, x) \right)$$

Linear kernel

$$K(u, v) = u^T v$$

RBF kernel

$$K(u, v) = \exp(-\gamma \|u - v\|_2^2)$$

[Figure: illustration with two support vectors labelled]
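The decision function and both kernels translate almost line by line into code. The sketch below evaluates h(x) for given support vectors, labels and coefficients α. The support vectors and α values are hypothetical numbers chosen by hand, not the result of training, and the function names are assumptions for illustration.

```python
import math


def linear_kernel(u, v):
    """K(u, v) = u^T v"""
    return sum(a * b for a, b in zip(u, v))


def rbf_kernel(u, v, gamma=1.0):
    """K(u, v) = exp(-gamma * ||u - v||_2^2)"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def svm_predict(x, support_vectors, labels, alphas, kernel):
    """h(x) = sign(sum_i alpha_i * y_i * K(x_i, x)), with no bias term."""
    total = sum(
        a * y * kernel(sv, x)
        for sv, y, a in zip(support_vectors, labels, alphas)
    )
    return 1 if total >= 0 else -1


# Hypothetical model: one support vector per class, equal weights.
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
alphas = [0.5, 0.5]

print(svm_predict((2.0, 0.0), support_vectors, labels, alphas, linear_kernel))
print(svm_predict((-2.0, 0.0), support_vectors, labels, alphas, rbf_kernel))
```

Only the kernel changes between the linear and RBF variants. The rest of the decision function stays the same.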



Classification

Random forest

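[Figure: decision tree with root split x1 < 5 and second-level splits x2 < 6 and x2 < 3 (branches labelled Yes/No), leaves y1, y2, y3, y4, and example input x = (4, 8). Beside it, the corresponding partition of the (x1, x2) plane at x1 = 5, x2 = 6 and x2 = 3 into the regions y1 to y4.]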

Random forest

K ensemble trees from bootstrap samples

L random variables per split
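The example tree from the figure can be written as nested conditions, and a forest prediction as a majority vote over several trees. The sketch assumes that the "Yes" branch is drawn on the left, so that x2 < 6 follows x1 < 5 and x2 < 3 follows x1 ≥ 5. The two extra trees in the forest are invented for illustration, and only prediction is shown, not training on bootstrap samples with random variables per split.

```python
from collections import Counter


def example_tree(x):
    """The tree from the slide, assuming 'Yes' branches are on the left."""
    x1, x2 = x
    if x1 < 5:
        return "y1" if x2 < 6 else "y2"
    return "y3" if x2 < 3 else "y4"


def forest_predict(trees, x):
    """Majority vote over the predictions of the individual trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


# Two hypothetical extra trees, standing in for trees grown on other
# bootstrap samples.
def stump_on_x1(x):
    return "y2" if x[0] < 5 else "y4"


def constant_tree(x):
    return "y1"


x = (4, 8)
print(example_tree(x))
print(forest_predict([example_tree, stump_on_x1, constant_tree], x))
```

For x = (4, 8) the first test x1 < 5 holds and the second test x2 < 6 fails, so under the stated assumption the tree returns y2. The forest also returns y2, with two of three votes.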



Feature selection

Feature selection

[Figure: data matrix of samples by features, with dimensions labelled n and m; m >> n]



Feature selection

Feature selection

[Figure: data matrix of samples by features after feature selection; m << n]

1 All features: 1 2 3 4 5 6 7 8 9 10

2 Ranked features: 5 7 2 1 10 6 9 3 8 4

3 Selected features: 5 7 2 1 10 6 9 3 8 4



Feature selection

Feature ranking methods

FSCORE

LASSO-R

SVM RFE

MSVM RFE

RF-R

VARBVS

[Figure: normalised concentration (y-axis, −1 to 2) of peptide p267 (VTN) in rapid and slow progressors]
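The slide does not define FSCORE. The sketch below assumes it refers to the univariate F-score commonly used for feature ranking: the squared distances of the two class means from the overall mean, divided by the sum of the two within-class sample variances. The function name and the toy concentrations are invented for illustration, and the definition used for the slides may differ.

```python
def f_score(group_a, group_b):
    """F-score of one feature from its values in two classes.

    Assumed definition:
      ((mean_a - mean_all)^2 + (mean_b - mean_all)^2) / (var_a + var_b)
    with sample variances (denominator n - 1). Larger means better separation.
    """
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("need at least two values per class")
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    mean_all = (sum(group_a) + sum(group_b)) / (n_a + n_b)
    var_a = sum((v - mean_a) ** 2 for v in group_a) / (n_a - 1)
    var_b = sum((v - mean_b) ** 2 for v in group_b) / (n_b - 1)
    if var_a + var_b == 0:
        raise ValueError("feature is constant within both classes")
    between = (mean_a - mean_all) ** 2 + (mean_b - mean_all) ** 2
    return between / (var_a + var_b)


# Made-up normalised concentrations of one peptide in the two groups.
rapid = [1.0, 2.0, 3.0]
slow = [-1.0, 0.0, 1.0]
score = f_score(rapid, slow)
print(score)
```

Each feature is scored on its own, so features are ranked by sorting these scores. The toy values give a score of 1.0 (class means 2 and 0, overall mean 1, both variances 1).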



Feature selection


FSCORE

LASSO-R

SVM RFE

MSVM RFE

RF-R

VARBVS

[Figure: LASSO regularization path, coefficients θ(λ) (y-axis, −0.2 to 0.8) against −log(λ) (x-axis) for the features x1 to x10]



Feature selection


FSCORE

LASSO-R

SVM RFE

MSVM RFE

RF-R

VARBVS

1 Compute SVM weights θ

2 Select the feature j with minimal $\theta_j^2$

3 Remove feature j

4 Repeat
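The four steps above amount to a short loop: train, drop the feature with the smallest squared weight, retrain on the rest. The sketch below keeps everything in plain Python, so the linear SVM is a deliberately simple full-batch subgradient descent on the regularised hinge loss without an intercept. The hyperparameters (`lam`, `lr`, `epochs`), the function names and the toy data are assumptions for illustration; a real analysis would use an established SVM library.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Linear SVM weights via subgradient descent on the hinge loss.

    Labels in y must be +1 or -1. No intercept, to keep the sketch short.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [lam * wj for wj in w]  # gradient of the L2 penalty
        for row, label in zip(X, y):
            margin = label * sum(wj * xj for wj, xj in zip(w, row))
            if margin < 1:  # sample violates the margin
                for j in range(d):
                    grad[j] -= label * row[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def svm_rfe(X, y):
    """Recursive feature elimination; returns features from best to worst."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while remaining:
        X_sub = [[row[j] for j in remaining] for row in X]
        w = train_linear_svm(X_sub, y)                            # step 1
        worst = min(range(len(remaining)), key=lambda k: w[k] ** 2)  # step 2
        eliminated.append(remaining.pop(worst))                   # step 3
    return eliminated[::-1]  # last eliminated feature is ranked first


# Toy data: feature 0 separates the classes, features 1 and 2 do not.
y = [1, 1, -1, -1]
X = [
    [2.0, 0.5, -0.5],
    [2.0, -0.5, 0.5],
    [-2.0, 0.5, -0.5],
    [-2.0, -0.5, 0.5],
]
ranking = svm_rfe(X, y)
print(ranking)
```

Features removed late are the ones the SVM relied on most, so the elimination order reversed gives the ranking. On the toy data the informative feature 0 comes out first.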



Feature selection


FSCORE

LASSO-R

SVM RFE

MSVM RFE

RF-R

VARBVS

Multiple SVMs



Feature selection


FSCORE

LASSO-R

SVM RFE

MSVM RFE

RF-R

VARBVS

Random forest

Increase in out-of-bag error



Feature selection


FSCORE

LASSO-R

SVM RFE

MSVM RFE

RF-R

VARBVS

Variational Bayesian Variable Selection

P(included in model|data)

