a risk stratification model for lung cancer based on gene … · 2019. 7. 30. · risk score (1...

12
Research Article A Risk Stratification Model for Lung Cancer Based on Gene Coexpression Network and Deep Learning Hongyoon Choi 1 and Kwon Joong Na 2,3 1 Cheonan Public Health Center, Chungnam, Republic of Korea 2 Department of Community Health, Korea Health Promotion Institute, Seoul, Republic of Korea 3 Department of Clinical Medical Sciences, Seoul National University, College of Medicine, Seoul, Republic of Korea Correspondence should be addressed to Hongyoon Choi; [email protected] and Kwon Joong Na; [email protected] Received 13 October 2017; Revised 7 December 2017; Accepted 11 December 2017; Published 16 January 2018 Academic Editor: Jialiang Yang Copyright © 2018 Hongyoon Choi and Kwon Joong Na. is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Risk stratification model for lung cancer with gene expression profile is of great interest. Instead of previous models based on individual prognostic genes, we aimed to develop a novel system-level risk stratification model for lung adenocarcinoma based on gene coexpression network. Using multiple microarray, gene coexpression network analysis was performed to identify survival- related networks. A deep learning based risk stratification model was constructed with representative genes of these networks. e model was validated in two test sets. Survival analysis was performed using the output of the model to evaluate whether it could predict patients’ survival independent of clinicopathological variables. Five networks were significantly associated with patients’ survival. Considering prognostic significance and representativeness, genes of the two survival-related networks were selected for input of the model. e output of the model was significantly associated with patients’ survival in two test sets and training set ( < 0.00001, < 0.0001 and = 0.02 for training and test sets 1 and 2, resp.). In multivariate analyses, the model was associated with patients’ prognosis independent of other clinicopathological features. Our study presents a new perspective on incorporating gene coexpression networks into the gene expression signature and clinical application of deep learning in genomic data science for prognosis prediction. 1. Introduction Risk stratification based on gene expression profiles is of major biomedical interest in lung cancer research [1–6]. Previous studies developed risk stratification models that mostly fo- cused on individual prognostic genes. However, these studies have not fully considered the nature of biological networks and their systematic properties. Since it is more evident that biological processes are derived from numerous interactions between many cellular components, gene network analysis could provide valuable information of cancer pathogenesis [7]. Among the various biological networks, gene coexpres- sion network has some strengths: not relying on prior infor- mation about genes, avoiding biologically wrong assumptions about independence of gene expression levels, and alleviating multiple testing problems [8]. Lung cancer, mainly, non-small-cell lung cancer, is one of the most common cancers and is the leading cause of cancer- related death worldwide [9, 10]. Currently, TNM staging system is a universal guideline for prognosis prediction and treatment decision. However, heterogeneous molecular fea- tures of lung cancer require diverse adjuvant treatment op- tions and lead to different prognosis even in the same stage [11]. Hence, there has been a constant need for developing better risk stratification models to predict accurate prognosis and to improve cancer-related survival. e main objectives of this study were (1) to identify survival-related gene coexpression network modules (2) and to propose a deep learning- (DL-) based risk stratification model reflecting survival-related network modules. Using public microarray datasets from the Gene Expression Omni- bus (GEO), we identified survival-related network modules Hindawi BioMed Research International Volume 2018, Article ID 2914280, 11 pages https://doi.org/10.1155/2018/2914280

Upload: others

Post on 30-Jul-2021

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A Risk Stratification Model for Lung Cancer Based on Gene … · 2019. 7. 30. · Risk score (1 dimension) Gene1 Gene10 (a) p < 0.00001 p < 0.0001 p = 0.02 Training Test set 1 Test

Research ArticleA Risk Stratification Model for Lung Cancer Based onGene Coexpression Network and Deep Learning

Hongyoon Choi 1 and Kwon Joong Na 2,3

1Cheonan Public Health Center, Chungnam, Republic of Korea2Department of Community Health, Korea Health Promotion Institute, Seoul, Republic of Korea3Department of Clinical Medical Sciences, Seoul National University, College of Medicine, Seoul, Republic of Korea

Correspondence should be addressed to Hongyoon Choi; [email protected] and Kwon Joong Na; [email protected]

Received 13 October 2017; Revised 7 December 2017; Accepted 11 December 2017; Published 16 January 2018

Academic Editor: Jialiang Yang

Copyright © 2018 Hongyoon Choi and Kwon Joong Na. This is an open access article distributed under the Creative CommonsAttribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work isproperly cited.

Risk stratification model for lung cancer with gene expression profile is of great interest. Instead of previous models based onindividual prognostic genes, we aimed to develop a novel system-level risk stratification model for lung adenocarcinoma based ongene coexpression network. Using multiple microarray, gene coexpression network analysis was performed to identify survival-related networks. A deep learning based risk stratification model was constructed with representative genes of these networks. Themodel was validated in two test sets. Survival analysis was performed using the output of the model to evaluate whether it couldpredict patients’ survival independent of clinicopathological variables. Five networks were significantly associated with patients’survival. Considering prognostic significance and representativeness, genes of the two survival-related networks were selected forinput of the model. The output of the model was significantly associated with patients’ survival in two test sets and training set(𝑝 < 0.00001, 𝑝 < 0.0001 and 𝑝 = 0.02 for training and test sets 1 and 2, resp.). In multivariate analyses, the model was associatedwith patients’ prognosis independent of other clinicopathological features. Our study presents a new perspective on incorporatinggene coexpression networks into the gene expression signature and clinical application of deep learning in genomic data sciencefor prognosis prediction.

1. Introduction

Risk stratification based on gene expression profiles is of majorbiomedical interest in lung cancer research [1–6]. Previousstudies developed risk stratification models that mostly fo-cused on individual prognostic genes. However, these studieshave not fully considered the nature of biological networksand their systematic properties. Since it is more evident thatbiological processes are derived from numerous interactionsbetween many cellular components, gene network analysiscould provide valuable information of cancer pathogenesis[7]. Among the various biological networks, gene coexpres-sion network has some strengths: not relying on prior infor-mation about genes, avoiding biologicallywrong assumptionsabout independence of gene expression levels, and alleviatingmultiple testing problems [8].

Lung cancer, mainly, non-small-cell lung cancer, is one ofthe most common cancers and is the leading cause of cancer-related death worldwide [9, 10]. Currently, TNM stagingsystem is a universal guideline for prognosis prediction andtreatment decision. However, heterogeneous molecular fea-tures of lung cancer require diverse adjuvant treatment op-tions and lead to different prognosis even in the same stage[11]. Hence, there has been a constant need for developingbetter risk stratification models to predict accurate prognosisand to improve cancer-related survival.

The main objectives of this study were (1) to identifysurvival-related gene coexpression network modules (2) andto propose a deep learning- (DL-) based risk stratificationmodel reflecting survival-related network modules. Usingpublic microarray datasets from the Gene Expression Omni-bus (GEO), we identified survival-related network modules

HindawiBioMed Research InternationalVolume 2018, Article ID 2914280, 11 pageshttps://doi.org/10.1155/2018/2914280

Page 2: A Risk Stratification Model for Lung Cancer Based on Gene … · 2019. 7. 30. · Risk score (1 dimension) Gene1 Gene10 (a) p < 0.00001 p < 0.0001 p = 0.02 Training Test set 1 Test

2 BioMed Research International

of lung adenocarcinoma. Subsequently, we constructed DL-based prognostic score using representative genes of survival-related network modules and it showed great prognosticproperty in all cohorts.

2. Materials and Methods

2.1. Gene Expression Data. Total eleven microarray datasetsfrom NCBI GEO were included in the study. Two datasetswith survival information were set as independent testsets and the others were merged and set as training set(Supplementary Table 1). Detailed preprocessingmethods aredescribed in Supplementary Methods (available here).

2.2. Weighted Gene Coexpression Network Construction fromthe Training Set. We used weighted gene coexpression net-work analysis (WGCNA) package [8, 12] to build a weightedgene coexpression network from the training set. We createda correlationmatrix on the basis of Pearson’s correlation coef-ficient for all pairwise genes across all samples. The power,the key parameter for weighted network, was selected tooptimize both the scale-free topology and sufficient nodeconnectivity and we chose a threshold of 6 in this study(Supplementary Figure 1). The correlation matrix was trans-formed into adjacency matrix using the power function,and pairwise topological overlap (TO) between genes wascalculated.We identified networkmodules using hierarchicalclustering method with TO dissimilarity as the distancemeasure. The modules were detected using dynamic tree cutalgorithm [13], defining height cutoff value of 0.99, deep splitas 2, andminimummodule size cutoff value of 30. Genes thatwere not assigned to any module were classified to color gray(Figure 1).

2.3. Identification and Validation of Survival-Related NetworkModules. For each module, we summarized the moduleexpression profile by module eigengene (ME), which is thefirst principal component of the expressionmatrix of the cor-respondingmodule.WeusedMEas the representative of eachmodule to evaluate association with overall survival (OS).The survival-related network modules were identified usingCox regression analysis in the training set. For validation,the same genes included in the network construction wereextracted from each test set. ME was calculated based on theexpression profile of each test set, and the association betweenME and OS was evaluated using Cox regression analysis tosee whether the modules identified from the training setare also associated with OS in each test set. The moduleswith uncorrected 𝑝 value under 0.05 were regarded as sig-nificant survival-related network modules. We functionallyannotated all survival-related network modules with geneontology biological process terms using hypergeometric test(Supplementary Methods).

2.4. DL-Based Risk Stratification Model. To simplify riskstratification model, we selected representative genes of thesurvival-related modules for model construction. Represen-tativeness of a gene was measured by gene module member-ship (GMM), a correlation coefficient between gene expres-sion profile and ME of given module. Expression profiles of

representative geneswere used for the input of theDLbecausethey were expected to preserve coexpression patterns and toreflect the systematic properties of survival-related networkmodules. Convolutional neural network (CNN) was specifi-cally used to extract gene expression patterns of modules. Itfinally produced gene network prognostic score (NetScore).Details of selection of representative genes and architectureof DL framework are described in Supplementary Methods.

The DL-based risk stratification model was generatedusing patients’ data of the training set. Parameters related totraining of the neural network including number of layers,nodes, training epoch and learning rate were determined by5-fold cross-validation. Training set was randomly dividedinto 5 subsets. At each step, a single subset was left for testingand other four subsets were used for training. The perfor-mance of the model was measured by Harrell’s C-index of thefinal output score of the model [14]. The optimal parameterswere selected according to the maximum average C-indexacross the 5-fold of the loop.The predictive value of NetScorewas independently validated in two test sets.C-index for eachtest set was also evaluated.

2.5. Survival Analysis Using NetScore in All Cohorts. Prog-nostic property of NetScore as a continuous variable wasevaluated by univariate Cox analysis. To define risk groups,NetScore was dichotomized using the median value in eachcohort. Kaplan-Meier method was used to assess survivalrates according to the risk groups and survival rate differenceswere assessed with the log-rank test. Additionally, indepen-dent prognostic value of NetScore was assessed by multi-variate and subgroup analysis. Multivariate Cox analysis wasperformed using clinical and pathological variables as well asNetScore. Subgroups were divided on the basis of clinical andpathological features, and univariate Cox analysis ofNetScorewas performed in each subgroup.

3. Results

3.1. GeneCoexpressionNetworkModules from the Training Set.We aimed at developing a risk stratification model based ongene coexpression networks (Figure 1(a)).The networks wereconstructed from the training set which consists of microar-ray data of 510 lung adenocarcinoma samples. The clinico-pathological features of all samples from the training set aredetailed in Supplementary Table 2. Using WGCNA, 23 coex-pression network modules were identified from the trainingset (Figure 1(b)).The relationship between modules is visual-ized with hierarchical clustering dendrogram and heatmap ofthe corresponding ME (Supplementary Figure 2).

3.2. Identification of Survival-Related Modules from the Train-ing Set and Validation in Test Sets. Total five modules weresignificantly associated with OS (Figure 1(c)): red (𝑝 <0.0001), turquoise (𝑝 = 0.018), magenta (𝑝 = 0.029), black(𝑝 = 0.043), and light green (𝑝 = 0.044). To validate thesurvival-related modules, we conducted survival analysisin two independent test sets (GSE31210 as test set 1 andGSE30219 as test set 2; 𝑛 = 226 and 84, resp.). Consequently,

Page 3: A Risk Stratification Model for Lung Cancer Based on Gene … · 2019. 7. 30. · Risk score (1 dimension) Gene1 Gene10 (a) p < 0.00001 p < 0.0001 p = 0.02 Training Test set 1 Test

BioMed Research International 3

Coexpression network Survival-related modules Representative genes Gene networkprognostic score

Validating survival-related modules Validating scoring system

Trai

ning

Test

(a)Gene dendrogram and module colors

Dynamic

0.4

0.5

0.6

0.7

0.8

0.9

1.0

Hei

ght

Tree Cut

Number of genes = 4615Number of subjects = 510

(b)

Gre

y 60

Cyan

Roya

l blu

ePi

nkM

idni

ght b

lue

Gre

ySa

lmon

Dar

k gr

een

Tan

Gre

enBr

own

Modules

Ligh

t yel

low

Purp

leLi

ght c

yan

Dar

k re

dBl

ueYe

llow

Gre

en y

ello

wLi

ght g

reen

Blac

kM

agen

taTu

rquo

ise Red

−log1

0(p

val

ue)

4

3

2

1

0

(c)

3

2

1

0

−log1

0(p

val

ue)

−log1

0(p

val

ue)

Test set 1

Blac

k

Mag

enta

Turq

uoiseRe

d

Ligh

t gre

en

Modules

Test set 2

Blac

k

Mag

enta

Turq

uoiseRe

d

Ligh

t gre

en

Modules

0.0

0.5

1.0

1.5

2.0

(d)

Figure 1: Gene coexpression network construction and survival-related modules identification. (a) A schematic diagram summarizing our riskstratification modeling strategy. Gene coexpression network was constructed from the training set. Gene network modules were extractedbased on topological overlap. Survival-related modules were identified from the training set and validated in the two test sets. We selectedrepresentative genes from survival-related modules, and built network-based prognostic scoring system using deep learning. (b) Genedendrogram and modules identified by weighted gene coexpression network analysis from the training set. Modules were labeled withdifferent colors. (c) Univariate Cox regression analysis of module eigengene in the training set was performed. Module eigengene is arepresentative expression value of genes of each module calculated by the principal component analysis. The dotted line represents cutoffvalue (𝑝 value = 0.05) for significance, and five modules were identified as survival-related network modules. (d) Survival-related networkmodules were validated in the two test sets using Cox regression analysis. Three modules from test set 1 and two modules from test set 2 weresignificantly associated with overall survival.

Page 4: A Risk Stratification Model for Lung Cancer Based on Gene … · 2019. 7. 30. · Risk score (1 dimension) Gene1 Gene10 (a) p < 0.00001 p < 0.0001 p = 0.02 Training Test set 1 Test

4 BioMed Research International

r = 0.18 r = 0.12 r = 0.16

r = 0.53 r = 0.35

p = 0.008 p = 0.42 p = 0.007

p < 1 × 10−19 p < 1 × 10−26

6

4

2

0

6

4

2

0

−log1

0(p

val

ue)

0.00 0.00 0.25 0.500.25 0.50 0.750.75

(a)

(b) (c)

Figure 2: Selection of representative genes of survival-related network modules. (a) To construct risk stratification model, representative geneswere selected according to the gene module membership. Gene module membership was correlated with the significance of associationbetween individual gene expression and survival. 𝑦-axis represents statistical significance calculated by univariate Cox analysis of individualgenes. A strong correlation was found in the red and turquoise modules (𝑟 = 0.53 and 𝑝 < 1 × 10−19 for red module; 𝑟 = 0.35 and 𝑝 < 1 × 10−23for turquoise module). Coexpression networks of red (b) and turquoise (c) modules were visualized. Note that 160 genes among 880 genesof turquoise module and their connections were shown. 160 genes were selected according to the gene module membership. Size of nodes isproportional to gene module membership.

turquoise (𝑝 = 0.0005), light green (𝑝 = 0.019), and red(𝑝 = 0.030) modules in test set 1 and turquoise (𝑝 = 0.011)and red (𝑝 = 0.049) modules in test set 2 were significantlyassociated with OS (Figure 1(d)).

The networks of two common survival-related networkmodules (red and turquoise) are presented in Figures 2(a)and 2(b). The significantly enriched gene ontology terms ofthe red module included “organic acid catabolic process,”

Page 5: A Risk Stratification Model for Lung Cancer Based on Gene … · 2019. 7. 30. · Risk score (1 dimension) Gene1 Gene10 (a) p < 0.00001 p < 0.0001 p = 0.02 Training Test set 1 Test

BioMed Research International 5

……

.

Mod

ule 2

Mod

ule 1

Convolution24 filters

24 features 24 features 24 features

FC layerdropout

FC layerdropout

FC layerdropout

Risk score(1 dimension)

Gene1

Gene10

(a)

p < 0.00001 p < 0.0001 p = 0.02

Training Test set 1 Test set 2

1.41.21.00.8 1.41.21.00.81.0 1.2 1.40.8

Gene network prognostic score

0

5

10

15

Haz

ard

ratio

(b)

0.00

0.25

0.50

0.75

1.00

Surv

ival

pro

babi

lity

150500 100Time (months)

Number at risk by time43136

137 241256

87

150500 100Time

Group

Low risk

High risk

p < 0.0001

High RiskLow Risk

Gro

up

(c)

0.00

0.25

0.50

0.75

1.00

Surv

ival

pro

babi

lity

80 100 12060400 20Time (months)

Number at risk by time

114112 110

1069380

6446

1719

65

01

10080 1200 4020 60Time

Group

Low risk

High risk

p < 0.0001

High RiskLow Risk

Gro

up

(d)

Figure 3: Continued.

Page 6: A Risk Stratification Model for Lung Cancer Based on Gene … · 2019. 7. 30. · Risk score (1 dimension) Gene1 Gene10 (a) p < 0.00001 p < 0.0001 p = 0.02 Training Test set 1 Test

6 BioMed Research International

0.00

0.25

0.50

0.75

1.00

Surv

ival

pro

babi

lity

15050 2000 100Time (months)

Number at risk by time

73

42 223242 14

1302

150 200500 100Time

Group

Low risk

High risk

p = 0.054

High RiskLow Risk

Gro

up

(e)

Figure 3: Risk stratification model using representative genes of survival-related network modules. (a) To construct risk stratification model,deep convolutional neural network was used. Input data were expression value of top 10 genes from each of red and turquoise module. Thefirst layer consists of one-dimensional convolutional filters which extract gene expression patterns of each module. Three additional fullyconnected (FC) layers were followed and connected to the output score gene network prognostic score (NetScore). Detailed training processand architecture of the neural network are described in Supplementary Methods. (b) Univariate Cox regression analysis of NetScore as acontinuous variable was performed in the training and two test sets. It shows significant association between the score and overall survival inall sets. The blue line represents hazard ratio for overall survival and the blue area represents 95% confidence interval. (c–e) Overall survivalof dichotomized group according to NetScore was depicted by Kaplan-Meier survival curve. The statistical difference was tested by log-ranktest. The high-risk group showed worse survival in the training set (c) and test set 1 (d) with statistical significance. The high-risk group ofthe test set 2 (e) also showed worse prognosis though the difference did not reach statistical significance.

“carboxylic acid catabolic process,” and “small moleculecatabolic process,” and the turquoise module included “DNAstrand elongation involved in DNA replication,” “mitotic cellcycle phase transition,” “DNA-dependent DNA replication”(Supplementary Table 3).

3.3. DL-Based Risk Stratification Model Using RepresentativeGenes of Survival-Related Module. By measuring the corre-lation between gene significance for OS (𝑝 value) and GMMin each survival-related module, we identified two modulesdemonstrating high correlation with statistical significance(𝑟 = 0.53, 𝑝 < 1 × 10−19 and 𝑟 = 0.35, 𝑝 < 1 × 10−26 forred and turquoise module, resp.; Figure 2(c)). Based on thestrong correlation, we could assume that the genes with highrepresentativeness measured by GMMhave high significancefor OS and are the most important elements of the module;therefore, we selected top 10 genes according to GMMfrom the red and turquoise modules for the DL-based riskstratification model construction (Supplementary Figure 3).

The expression profiles of selected 20 genes were usedas input data of the risk stratification model (Figure 3(a)).NetScore, the final output of our model, was significantlyassociated with OS in the training and two test sets(Figure 3(b)) (𝑝 < 0.00001, 𝑝 < 0.0001 and 𝑝 = 0.02 for

training set and test sets 1 and 2, resp.). Subjects were dividedinto two groups, high- and low-risk groups, according tothe median value of NetScore in each cohort. The high-riskgroup was significantly associated with OS in the trainingset (𝑝 < 0.0001; Figure 3(c)) and in test set 1 (𝑝 < 0.0001;Figure 3(d)). A trend of the association was also shown in testset 2 (𝑝 = 0.054; Figure 3(e)).

3.4. NetScore as an Independent Predictive Factor for Prog-nosis. Cox multivariate analysis revealed that the risk groupwas associated with OS independent of stage as well as otherclinicopathological features in the training set and test set 1(Table 1). The independent predictive factors for OS in Coxmultivariate analysis were the risk group (𝑝 = 0.001) andT-stage 3 (𝑝 = 0.030) in training set and the risk group(𝑝 = 0.01) and EGFR mutation status (𝑝 = 0.005) in testset 1. In test set 2, there was no feature significantly associatedwith OS in univariate Cox analysis, though the high-riskgroup showed a trend of unfavorable prognosis (𝑝 = 0.06).We also evaluated the prognostic value of NetScore insubgroups divided by clinical andpathological features. In thetraining set, the high-risk group was significantly associatedwith poor prognosis in subgroups regardless of age and T-stage. In all subgroups, a trend of close relationship between

Page 7: A Risk Stratification Model for Lung Cancer Based on Gene … · 2019. 7. 30. · Risk score (1 dimension) Gene1 Gene10 (a) p < 0.00001 p < 0.0001 p = 0.02 Training Test set 1 Test

BioMed Research International 7

Table1:Univaria

teandmultiv

ariateCox

regressio

nanalysisof

ther

iskstratifi

catio

nmod

elandclinicopathologicalvaria

bles.

Varia

bles

Univaria

teanalysis

Multiv

ariateanalysis

Hazardratio

(95%

CI)

𝑝value

Hazardratio

(95%

CI)

𝑝value

Training

set

Genen

etworkprogno

stics

core,highris

kgrou

p2.51

(1.65–3.81)

<0.001

3.057(1.556–6

.006)

0.001

Age,older

than

601.6

6(0.96–

2.86)

0.06

8Sex,male

1.28(0.86–

1.88)

0.221

Smok

ingsta

tus,ex-smoker

0.59

(0.31

–1.12

)0.108

Smok

ingsta

tus,neversmoker

0.51

(0.22–1.19)

0.120

Tsta

ge:II

2.50

(1.31

–4.79)

0.00

61.2

66(0.599–2.674)

0.537

Tsta

ge:III

13.32

(2.89–

61.32

)0.001

5.895(1.18

9–29.237)

0.030

Nsta

ge:I

2.27

(1.28–4.05)

0.005

1.762

(0.943–3.294)

0.076

Testset1

Genen

etworkprogno

sticscore,high

riskgrou

p4.39

(1.92–10.06)

0.00

042.97

(1.25–7.0

9)0.01

Age,older

than

601.2

7(0.65–2.48)

0.49

Sex,male

1.52(0.78–2.96)

0.22

Smok

ingsta

tus,neversmoker

0.61

(0.31

–1.19

)0.15

Stage:II

4.23

(2.17

–8.24)

0.00

002

EGFR

mutation+

0.47

(0.24–

0.93)

0.03

2.74

(1.36–

5.54)

0.005

KRASmutation+

0.87

(0.27–2.85)

0.82

0.64

(0.32

–1.27)

0.20

Testset2

Genen

etworkprogno

sticscore,high

riskgrou

p1.8

1(0.98–3.36)

0.06

Age,older

than

601.3

3(0.73–2.43)

0.35

Sex,male

0.83

(0.40–

1.75)

0.63

Stage:T2

1.65(0.84–

3.25)

0.14

Page 8: A Risk Stratification Model for Lung Cancer Based on Gene … · 2019. 7. 30. · Risk score (1 dimension) Gene1 Gene10 (a) p < 0.00001 p < 0.0001 p = 0.02 Training Test set 1 Test

8 BioMed Research International

1.594e − 05

>60 (n = 83)

≤60 (n = 137)

Subgroup analysis in patients with high prognostic scoreSubgroup

Overall (n = 273) 2.511 (1.653–3.814)

HR (95% CI) p value

1.787 (0.981–3.256)

3.026 (1.633–5.608)

4.389 (1.891–10.18)

2.346 (1.249–4.443)

2.787 (0.880–8.831)

7.233 (2.415–21.66)

0.956 (0.189–4.824)

6.186 (1.837–20.83)

2.305 (1.081–4.914)

3.609 (1.748–7.448)

3.133 (0.992–9.890)

2.775 (1.134–6.790)

1.884 (0.889–3.996)

3.526 (1.164–10.68)

SexFemale (n = 142)

Current (n = 34)

Ex-smoker (n = 64)

Never (n = 26)

T1 (n = 51)

T stage

Smoking

N stage

T2 (n = 82)

N0 (n = 98)

N1 (n = 29)

Stage

1A (n = 62)

1B (n = 93)

2 (n = 49)

Male (n = 131)

Age (years)

0.0258

0.0253

0.0516

0.000519

0.0306

0.00327

0.957

0.000408

0.0815

0.096

0.0081

0.000573

0.000433

0.0578

Hazard ratio0 1 10

(a)

0.000471

>60 (n = 118)

≤60 (n = 108)

Subgroup analysis in patients with high prognostic score

(Test set 1)Subgroup

Overall (n = 226) 4.39 (1.916–10.06)

2.092 (0.754–5.089)

HR (95% CI) p value

14.94 (1.991–112.1)

5.571 (1.853–16.75)

5.853 (1.357–25.24)

3.321 (1.13–9.763)

5.432 (1.096–26.92)

9.041 (1.13–72.31)

0.847 (0.278–2.578)

8.685 (1.94–38.88)

0.769 (0.069–8.576)

5.449 (1.236–24.01)

Sex

Female (n = 121)

Ever (n = 111)

Never (n = 115)

Stage

Smoking

1A (n = 114)

1B (n = 54)

2 (n = 58)

Male (n = 105)

Age (years)

0.00855

0.002244

0.0081

0.018

0.0292

0.0382

0.0379

0.7703

0.0047

0.831

0.0251

0.156

Gene mutation

EGFR+ (n = 127)

KRAS+ (n = 20)

Negative (n = 68)

Hazard ratio1−1 10 20

(b)

Figure 4: Continued.

Page 9: A Risk Stratification Model for Lung Cancer Based on Gene … · 2019. 7. 30. · Risk score (1 dimension) Gene1 Gene10 (a) p < 0.00001 p < 0.0001 p = 0.02 Training Test set 1 Test

BioMed Research International 9

Subgroup analysis in patients with high prognostic score(Test set 2)

Subgroup HR (95% CI) p value

>60 (n = 42)

≤60 (n = 42)

Overall (n = 84)

SexFemale (n = 18)

Male (n = 66)

Age (years)

1.831 (0.979–3.358)

6.051 (1.206–30.37)

1.449 (0.729–2.881)

1.740 (0.802–3.775)

2.096 (0.734–5.987)

0.0287

0.29

0.161

0.167

Hazard ratio1−1 10 20

0.0586

(c)

Figure 4: Subgroup analysis using NetScore. (a) Predictive value of our risk stratification model was tested in subgroups classified byclinicopathological characteristics of the training set. A trend of association between the risk group and overall survival was found in allsubgroups. (b, c) The same subgroup survival analysis was also performed in both test sets. (b) The risk group was associated with overallsurvival regardless of clinicopathological variables except female, stage II, and KRAS mutation subgroups in test set 1. (c) Regardless ofsubgroups, a trend of poor prognosis in high-risk group was also found in test set 2.

the risk group and OS was found except never-smokingsubgroup (Figure 4(a), Supplementary Figure 4). Accordingto subgroup analysis in test set 1, the risk group was closelyassociated with OS in male, old-aged, ever/never smokers,stage IA/IB, EGFR positive and all negative mutation sub-groups (Figure 4(b), Supplementary Figure 5). A trend ofassociation between the risk group and OS was also revealedin each subgroup of test set 2, regardless of clinical featuresincluding sex, age, and T-stage (Figure 4(c), SupplementaryFigure 6).

4. Discussion

In this study, we developed a risk stratification model forlung adenocarcinoma based on gene coexpression networksand deep learning. Survival-related network modules wereidentified in multiple cohorts and representative genes ofthese modules were selected for risk stratification modeling.Themodel constructed by deep CNN reflects gene expressionpatterns of survival-related network modules and it providesprognostic score, NetScore. The NetScore was significantlyassociated with OS in all cohorts and also an independentpredictor for OS from clinicopathological variables.

The model based on survival-related network modulescan provide more robust risk stratification compared withmodels focusing on statistical combination of individualprognostic genes which have been proposed in the previousstudies [1–6]. In spite of previous promising results of individ-ual gene-basedmodels, they failed to validate in independentsamples of other study [4]. Furthermore, there were few over-lapping significant prognostic genes in the previous models.A meta-analysis of published gene expression data revealedthat few genes were associated with survival of lung adeno-carcinoma [15].The result of few significant prognostic genesin large samples implied the limitation of usage of individual

genes for risk stratification. Besides, selection of individualsignificant genes has a substantial problem of multiple statis-tical testing [16]. Instead of these previous approaches, sys-temic approach integrating gene interaction as well as indi-vidual genes would be a breakthrough for robust risk stratifi-cation modeling because variation patterns of their expres-sion levels can be associated with prognosis.

Recently, DL has dramatically improved data analysis ingenomics and imaging fields [17, 18]. The main contributionof DL for our risk stratification model is to apply deepneural network to gene expression data. It employed convolu-tional layers for extracting multiple gene expression patterns.Another contribution is to solve regression problems ofsurvival data by using a specialized loss function [19] (seeSupplementary Methods). We compared predictive accuracyof DL-based model and conventional Cox proportional haz-ard model obtained from the expression level of selected 20genes. Predictability of the DL-based model was significantlyhigher than that of the Cox model in test set 1 (C-index =0.709 ± 0.042 and 0.608 ± 0.046, resp.; 𝑝 = 0.004). It was alsohigher in the training set and test set 2 though the differ-ence did not reach statistical significance (SupplementaryMethods, Supplementary Figure 7). Furthermore, to confirmrobustness of NetScore, the model was retrained by thedataset combined by original training and test set 1 andvalidated in test set 2. NetScore of the retrained model wasalso significantly associated with OS in the test set 2 (𝑝 =0.003; C-index = 0.651 ± 0.042). To our knowledge, NetScoreis the first study that apply deep convolutional neural net-work to high-dimensional gene expression data for predict-ing prognosis. By applying this novel approach to variousgenomic data, risk stratification and survival prediction couldbe improved compared with conventional Cox model.

NetScore was trained by various samples with differentclinicopathological characteristics. We found NetScore wasassociated with sex, smoking status, stage, and molecular

Page 10: A Risk Stratification Model for Lung Cancer Based on Gene … · 2019. 7. 30. · Risk score (1 dimension) Gene1 Gene10 (a) p < 0.00001 p < 0.0001 p = 0.02 Training Test set 1 Test

10 BioMed Research International

subtypes (Supplementary Figure 8). Briefly, a trend of highNetScore was found in male, smokers, late stage, and KRASmutation positive samples. Nonetheless, NetScore was signif-icantly associatedwithOS independent of clinicopathologicalvariables according tomultivariate and subgroup analyses. Ofnote, NetScore was significant predictor in early stage sub-groups (stage IA/IB).This finding could be important becausethe new risk stratification could identify patients who mightneed adjuvant chemotherapy. For example, a recent clinicaltrial using 15-gene signature based on individual prognosticgenes showed successful selection of patients with stageIB and II NSCLC who would most likely benefit from adju-vant chemotherapy [20]. In the future, as a new prognosticbiomarker based on gene network, the usefulness of NetScoreshould be tested whether it could affect clinical decisionand compared with the previous prognostic models usingindividual genes.

We developed a risk stratification model for lung adeno-carcinoma using gene coexpression network. A future exten-sion of our work would be to apply this approach to thecoexpression networks of other cancer types. In terms oftechnical improvement, modification of DL architecture andselection process of representative genes could improve theprediction accuracy. Finally, we expected that a prospectivelydesigned clinical trial with well-controlled clinicopathologi-cal variables would help find clinical application of our newrisk stratification model.

Disclosure

This research did not receive any specific grant from fundingagencies in the public, commercial, or not-for-profit sectors.

Conflicts of Interest

The authors declare no financial conflicts of interest.

Authors’ Contributions

Hongyoon Choi and Kwon Joong Na contributed equally tothis work.

Supplementary Materials

Supplementary Methods for public data collection and pre-processing, functional annotation and network visualiza-tion of survival-related network modules, representativegenes selection for risk stratification model construction,comparison of predictability between DL-based model andconventional Cox proportional hazard model, and convolu-tional neural network for risk stratification. SupplementaryFigure 1: gene coexpression network topology analysis forvarious soft threshold powers. Supplementary Figure 2: therelationship between coexpression networkmodules. Supple-mentary Figure 3: representative genes selected from the redand turquoise modules. Supplementary Figure 4: subgroupanalysis using NetScore in the training set. SupplementaryFigure 5: subgroup analysis using NetScore in the test set 1.

Supplementary Figure 6: subgroup analysis using NetScorein the test set 2. Supplementary Figure 7: comparison of pre-dictability between deep neural network and conventionalCox regression models. Supplementary Figure 8: associ-ation of gene network prognostic score (NetScore) withclinicopathological variables. Supplementary Table 1: mi-croarray data sets from Gene Expression Omnibus used inthis study. Supplementary Table 2: demographic and baselineclinical characteristics of patients. Supplementary Table 3:significantly enriched gene ontology biological process termsof five survival-related network modules. (SupplementaryMaterials)

References

[1] H. Y. Chen, S. L. Yu, C. H. Chen et al., “A five-gene signatureand clinical outcome in non-small-cell lung cancer,” The NewEngland Journal of Medicine, vol. 356, no. 1, pp. 11–20, 2007.

[2] Y. Xie, G. Xiao, K. R. Coombes et al., “Robust gene expressionsignature from formalin-fixed paraffin-embedded samples pre-dicts prognosis of non-small-cell lung cancer patients,” ClinicalCancer Research, vol. 17, no. 17, pp. 5705–5714, 2011.

[3] M. Skrzypski, E. Jassem,M. Taron et al., “Three-gene expressionsignature predicts survival in early-stage squamous cell carci-noma of the lung,” Clinical Cancer Research, vol. 14, no. 15, pp.4794–4799, 2008.

[4] K. Shedden, J. M. G. Taylor, S. A. Enkemann et al., “Gene ex-pression-based survival prediction in lung adenocarcinoma: amulti-site, blinded validation study,” Nature Medicine, vol. 14,no. 8, pp. 822–827, 2008.

[5] P. Roepman, J. Jassem, E. F. Smit et al., “An immune responseenriched 72-gene prognostic profile for early-stage non-small-cell lung cancer,”Clinical Cancer Research, vol. 15, no. 1, pp. 284–290, 2009.

[6] P. C. Boutros, S. K. Lau,M. Pintilie et al., “Prognostic gene signa-tures for non-small-cell lung cancer,”Proceedings of the NationalAcadamy of Sciences of the United States of America, vol. 106, no.8, pp. 2824–2828, 2009.

[7] A. Barabasi, N. Gulbahce, and J. Loscalzo, “Network medicine:a network-based approach to human disease,” Nature ReviewsGenetics, vol. 12, no. 1, pp. 56–68, 2011.

[8] B. Zhang and S. Horvath, “A general framework for weightedgene co-expression network analysis,” Statistical Applications inGenetics and Molecular Biology, vol. 4, no. 1, article 1128, 2005.

[9] R. L. Siegel, K. D. Miller, and A. Jemal, “Cancer statistics, 2015,”CA: A Cancer Journal for Clinicians, vol. 65, no. 1, pp. 5–29, 2015.

[10] L. A. Torre, F. Bray, R. L. Siegel, J. Ferlay, and J. Lortet-Tieulent,“Global cancer statistics, 2012,” CA: A Cancer Journal forClinicians, vol. 65, no. 2, pp. 87–108, 2015.

[11] W. Pao and N. Girard, “New driver mutations in non-small-celllung cancer,” The Lancet Oncology, vol. 12, no. 2, pp. 175–180,2011.

[12] P. Langfelder and S. Horvath, “WGCNA: an R package forweighted correlation network analysis,” BMC Bioinformatics,vol. 9, article 559, 2008.

[13] P. Langfelder, B. Zhang, and S. Horvath, “Defining clusters froma hierarchical cluster tree: the dynamic tree cut package for R,”Bioinformatics, vol. 24, no. 5, pp. 719-720, 2008.

[14] F. E. Harrell, K. L. Lee, R.M. Califf, D. B. Pryor, and R. A. Rosati,“Regression modelling strategies for improved prognostic pre-diction,” Statistics in Medicine, vol. 3, no. 2, pp. 143–152, 1984.

Page 11: A Risk Stratification Model for Lung Cancer Based on Gene … · 2019. 7. 30. · Risk score (1 dimension) Gene1 Gene10 (a) p < 0.00001 p < 0.0001 p = 0.02 Training Test set 1 Test

BioMed Research International 11

[15] B. Gyorffy, P. Surowiak, J. Budczies, and A. Lanczky, “Onlinesurvival analysis software to assess the prognostic value ofbiomarkers using transcriptomic data in non-small-cell lungcancer,” PLoS ONE, vol. 8, no. 12, Article ID e82241, 2013.

[16] D. B. Allison, X. Cui, G. P. Page, and M. Sabripour, “Microarraydata analysis: from disarray to consolidation and consensus,”Nature Reviews Genetics, vol. 7, no. 1, pp. 55–65, 2006.

[17] C. Angermueller, T. Parnamaa, L. Parts, and O. Stegle, “Deeplearning for computational biology,”Molecular Systems Biology,vol. 12, no. 7, article no. 878, 2016.

[18] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature,vol. 521, no. 7553, pp. 436–444, 2015.

[19] J. Katzman, U. Shaham, A. Cloninger, J. Bates, T. Jiang, and Y.Kluger, “DeepSurv: personalized treatment recommender sys-tem using a cox proportional hazards deep neural network,”https://arxiv.org/abs/1606.00931.

[20] C.-Q. Zhu,K.Ding,D. Strumpf et al., “Prognostic andpredictivegene signature for adjuvant chemotherapy in resected non-small-cell lung cancer,” Journal of Clinical Oncology, vol. 28, no.29, pp. 4417–4424, 2010.

Page 12: A Risk Stratification Model for Lung Cancer Based on Gene … · 2019. 7. 30. · Risk score (1 dimension) Gene1 Gene10 (a) p < 0.00001 p < 0.0001 p = 0.02 Training Test set 1 Test

Hindawiwww.hindawi.com

International Journal of

Volume 2018

Zoology

Hindawiwww.hindawi.com Volume 2018

Anatomy Research International

PeptidesInternational Journal of

Hindawiwww.hindawi.com Volume 2018

Hindawiwww.hindawi.com Volume 2018

Journal of Parasitology Research

GenomicsInternational Journal of

Hindawiwww.hindawi.com Volume 2018

Hindawi Publishing Corporation http://www.hindawi.com Volume 2013Hindawiwww.hindawi.com

The Scientific World Journal

Volume 2018

Hindawiwww.hindawi.com Volume 2018

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawiwww.hindawi.com Volume 2018

Hindawiwww.hindawi.com Volume 2018

Neuroscience Journal

Hindawiwww.hindawi.com Volume 2018

BioMed Research International

Cell BiologyInternational Journal of

Hindawiwww.hindawi.com Volume 2018

Hindawiwww.hindawi.com Volume 2018

Biochemistry Research International

ArchaeaHindawiwww.hindawi.com Volume 2018

Hindawiwww.hindawi.com Volume 2018

Genetics Research International

Hindawiwww.hindawi.com Volume 2018

Advances in

Virolog y Stem Cells International

Hindawiwww.hindawi.com Volume 2018

Hindawiwww.hindawi.com Volume 2018

Enzyme Research

Hindawiwww.hindawi.com Volume 2018

International Journal of

MicrobiologyHindawiwww.hindawi.com

Nucleic AcidsJournal of

Volume 2018

Submit your manuscripts atwww.hindawi.com