Using network information with gene expression data - Jean Yee Hwa Yang
DESCRIPTION
Large-scale molecular interaction networks are dynamic in nature, and changes in these networks, rather than changes in individual genes/proteins, are often drivers of complex diseases such as cancer. In this talk, I use data from stage III melanoma patients, provided by Prof. Mann's lab and comprising clinical, mRNA and miRNA data, to discuss how network information can be utilised in the analysis of gene expression data to aid biological interpretation. I will also present an R software package, Variability Analysis in Networks (VAN), that enables an integrative analysis of protein-protein or microRNA-gene networks and expression data to identify hubs (i.e. highly connected proteins/microRNAs in a network) that are dysregulated in terms of expression correlation with their interaction partners.
TRANSCRIPT
Using network information with gene expression data
Jean Yee Hwa Yang School of Mathematics and Statistics
Central dogma of molecular biology
Image source: Central dogma of molecular biology, Wikipedia; http://en.wikipedia.org/wiki/Central_dogma_of_molecular_biology
Microarrays are used to detect the extent to which genes are being expressed.
Technologies ~ measuring expression
[Figure: expression data matrix with variables (5000-30000 genes or 2000 miRNAs) by N samples. mRNA: expression data from microarrays. microRNA: count data from next-gen sequencing.]
Motivation: Melanoma prognosis
› Melanomas are common in a large demographic of the population, especially in Caucasians living in sunny climates. Of those that metastasise (Stage III), about 40% go on to live cancer free, but another 40% succumb to the disease in less than 1 year.
› Samples were obtained from Professor Graham Mann's group from the Westmead Institute for Cancer Research and Melanoma Institute Australia.
› Aim: To predict survival prognosis for Stage III melanoma patients.
Currently, we have gene expression data for 79 Stage III individuals. In addition, we have clinical data consisting of patient stage at diagnosis and survival status, as well as histology, pathology and mutation information.
Research aims
› New prognostic markers
- To determine whether there are significant biomarker and pathway differences between melanomas of good and bad prognosis after resection of nodal metastatic disease;
› New therapeutic targets
- To identify and validate the principal regulatory pathway abnormalities that characterise metastatic (stage III and IV) melanomas;
- To investigate novel genomic drivers of melanoma tumour progression and outcome.
Provided by Sara-Jane Schramm (Usyd)
Survival outcome
[Figure: histogram of survival time of stage III melanoma patients; x-axis: survival time (years), 0 to 12; y-axis: frequency, 0 to 20.]
Two survival groups
Bad prognosis: Survival < 1 year and died due to melanoma
Good prognosis: Survival ≥ 4 years with no sign of relapse
Gene expression (microarray data)
No correlation with BRAF mutation
[Figure: expression values for the PP (poor prognosis) and GP (good prognosis) groups. Pink: no BRAF mutation; gray: BRAF mutation.]
Gene expression : DE analysis
1. Differential expression (DE) analysis: finding DE genes between two classes (e.g. good prognosis vs poor prognosis).
2. Cluster analysis: finding common patterns between samples / genes.
3. Classification & prediction: predicting an outcome based on a set of explanatory variables (features) and a model (classifier).
Three main types of questions
Image reproduced from JOURNAL OF INVESTIGATIVE DERMATOLOGY|Vol 133|2013
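As a rough illustration of question type 1, the sketch below ranks genes by a plain Welch t-statistic between two classes. It is a minimal Python sketch, not the analysis used in the talk (a later slide refers to a moderated-t method, which is not reproduced here). The function names and the toy expression values are made up for illustration.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Welch two-sample t-statistic for two lists of expression values."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se


def rank_genes(expr, labels, group_a="GP", group_b="PP"):
    """Rank genes by |t|, largest first.

    expr   : dict mapping gene name -> list of expression values (one per sample)
    labels : list of class labels, aligned with the sample order in expr
    """
    ranked = []
    for gene, values in expr.items():
        a = [v for v, lab in zip(values, labels) if lab == group_a]
        b = [v for v, lab in zip(values, labels) if lab == group_b]
        ranked.append((gene, welch_t(a, b)))
    return sorted(ranked, key=lambda pair: abs(pair[1]), reverse=True)


# Toy data, invented for illustration only.
labels = ["GP", "GP", "GP", "PP", "PP", "PP"]
expr = {
    "geneA": [5.1, 5.3, 4.9, 7.0, 7.2, 6.8],  # shifted between the classes
    "geneB": [6.0, 6.4, 5.8, 6.1, 6.3, 5.9],  # essentially unchanged
}
ranking = rank_genes(expr, labels)
print(ranking)
```

Genes at the top of the ranking are the candidates for "single gene" DE; in practice the statistic would be followed by a multiple-testing adjustment.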
Gene expression : cluster analysis
Gene expression : classification
[Figure: example classification tree with splits on Gene 1 (Mi1 < -0.67) and Gene 2 (Mi2 > 0.18) and leaves B-ALL, AML and T-ALL; plot of error rate against number of features (genes).]
Most of the approaches to date can be considered as “single gene” analysis.
Three different levels of DE analysis
1. Single gene level: this is gene-by-gene analysis (individual node)
2. Gene set level: the features are subsets of genes (set of nodes), e.g. gene set test.
3. Network level: examine a subset of genes (nodes in the network) together with information on relationships between the genes (the edges in the network).
Let's think of performing DE analysis at three different levels:
Networks
[Figure: example network whose nodes are the genes CD19, CD38, LYN, VAV1, VAV2, ABL1, LCP2, NCK1, VAV3, ZAP70, YWHAQ, CD6, GRAP2, GRB2, SHB, MAP4K1, SYK, RHOA, DVL2, ELAVL4, SWAP70, THY1, RHOG, CDC42, RAC1, KLK3, EGFR, ALK and LCK.]
A network is made up of nodes and edges:
Network discovery vs predefined networks
› Network discovery: use microarray information to find genes with highly correlated gene expression probes, and define edges accordingly (e.g. WGCNA).
› Predefined networks: use predefined gene interaction databases such as MetaCore or iRefWeb.
- E.g. protein-protein interaction networks: a node represents a protein-coding gene, and an edge between two nodes represents an interaction between the proteins coded for by the genes.
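A minimal sketch of the network discovery idea, assuming Pearson correlation and a hard cutoff of 0.8 on its absolute value. Both choices are illustrative assumptions; WGCNA itself builds weighted networks and is not reproduced here. The function names and toy values are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def discover_edges(expr, threshold=0.8):
    """Return gene pairs whose absolute correlation is at least `threshold`.

    expr : dict mapping gene name -> list of expression values (one per sample)
    """
    edges = []
    for g1, g2 in combinations(sorted(expr), 2):
        if abs(pearson(expr[g1], expr[g2])) >= threshold:
            edges.append((g1, g2))
    return edges


# Toy data, invented for illustration only.
expr = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "g2": [2.1, 3.9, 6.2, 8.0, 9.8],  # rises with g1
    "g3": [5.0, 1.0, 4.0, 2.0, 3.0],  # unrelated to g1 and g2
}
edges = discover_edges(expr)
print(edges)
```

With a predefined network, the edge list would instead be read from the interaction database and no expression data would be needed to define it.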
We have used two different methods for defining networks:
Protein-protein interaction data
› Human Protein Reference Database - Keshava Prasad et al. 2009
› iRefWeb - Turner et al. 2010
› BioGRID - Chatr-aryamontri et al. 2013
› MetaCore - From GeneGo Inc.
Hairball image generated using Cytoscape
(Smoot et al. 2011)
Thanks to Simone Li and Drs Igy Pang and David Fung at the Systems Biology Initiative, the University of New South Wales
VAV3 hub subnetwork
[Figure: hub gene VAV3 with its interactors RHOG, CDC42, RAC1, RHOA, KLK3, EGFR, GRB2, ALK, LCP2, LCK and SYK.]
MetaCore network dataset
› Split the network into subnetworks, each containing a central hub gene (a gene with 5 interactors) and its immediate interactors.
› For example, one network dataset from the MetaCore database consists of 1273 hub subnetworks with a total of 3607 genes in common with the microarray dataset.
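The splitting step can be sketched as follows. This is a minimal Python sketch under stated assumptions: the network is given as an undirected edge list, only genes present on the microarray are kept, and a hub is taken to be any gene with at least `min_interactors` interactors (the "at least" reading of the slide's threshold is an assumption). All names are hypothetical.

```python
from collections import defaultdict


def hub_subnetworks(edges, measured_genes, min_interactors=5):
    """Map each hub gene to the sorted list of its immediate interactors.

    edges          : iterable of (gene_a, gene_b) pairs from a predefined network
    measured_genes : set of genes that are also present in the expression data
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        # Keep only edges whose two ends were both measured; ignore self-loops.
        if a in measured_genes and b in measured_genes and a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)
    return {
        hub: sorted(inter)
        for hub, inter in neighbours.items()
        if len(inter) >= min_interactors
    }


# Toy network, invented for illustration only.
edges = [
    ("H", "I1"), ("H", "I2"), ("H", "I3"), ("H", "I4"), ("H", "I5"),
    ("I1", "I2"),
    ("H", "X"),  # X is not on the array, so this edge is dropped
]
measured = {"H", "I1", "I2", "I3", "I4", "I5"}
subnetworks = hub_subnetworks(edges, measured)
print(subnetworks)
```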
Taylor et al., Nature Biotech, 2009
$P\text{-value}_{\mathrm{Hub}} = \frac{\#\{\text{random average hub difference} > \text{real average hub difference}\}}{1000}$
NATURE BIOTECH.|Vol 27|2009
Finding hubs of interest
For a given sub-network (predefined hub) i:
[Figure: a hub gene connected to its interactor genes.]
› For each edge k, the correlation difference between the two classes (GP and PP) was calculated:
$\Delta_{PP,GP,k} = \mathrm{PPcor}_k - \mathrm{GPcor}_k$
› For each sub-network i, calculate the average absolute difference in hub-interactor correlation:
$\mathrm{AveHubDiff}_i = \frac{\sum_{k=1}^{n_i} |\Delta_{PP,GP,k}|}{n_i - 1}$
where $n_i$ is the number of interactors of the central hub gene in the network i.
Rank the hub subnetworks based on their AveHubDiff values, or use a permutation test to determine the statistical significance of each hub.
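The hub score and its permutation test can be sketched as below. This is a minimal Python sketch, not the VAN implementation: it assumes Pearson correlation, divides by n_i - 1 as the formula appears on the slide, permutes the class labels 1000 times, and treats the correlation of a constant vector as 0. The function names and toy values are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation; returns 0.0 for a constant vector (an assumption)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def ave_hub_diff(hub, interactors, labels):
    """Sum of |PPcor_k - GPcor_k| over the hub's edges, divided by n_i - 1."""
    pp = [i for i, lab in enumerate(labels) if lab == "PP"]
    gp = [i for i, lab in enumerate(labels) if lab == "GP"]
    total = 0.0
    for inter in interactors:
        pp_cor = pearson([hub[i] for i in pp], [inter[i] for i in pp])
        gp_cor = pearson([hub[i] for i in gp], [inter[i] for i in gp])
        total += abs(pp_cor - gp_cor)
    return total / (len(interactors) - 1)


def permutation_p_value(hub, interactors, labels, n_perm=1000, seed=0):
    """Fraction of label permutations whose AveHubDiff exceeds the real one."""
    rng = random.Random(seed)
    real = ave_hub_diff(hub, interactors, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if ave_hub_diff(hub, interactors, shuffled) > real:
            exceed += 1
    return exceed / n_perm


# Toy data, invented for illustration: each interactor is positively
# correlated with the hub in PP and negatively correlated in GP.
labels = ["PP"] * 4 + ["GP"] * 4
hub = [1.0, 2.0, 3.0, 4.0, 1.5, 2.5, 3.5, 4.5]
interactors = [
    [1.1, 2.1, 3.1, 4.1, 4.4, 3.4, 2.4, 1.4],
    [2.0, 4.0, 6.0, 8.0, 9.0, 7.0, 5.0, 3.0],
    [0.5, 1.0, 1.5, 2.0, 5.5, 4.5, 3.5, 2.5],
]
real = ave_hub_diff(hub, interactors, labels)
p_value = permutation_p_value(hub, interactors, labels)
print(real, p_value)
```

Running this over every hub subnetwork gives one score and one p-value per hub, which can then be used for the ranking described above.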
Applying to Melanoma gene expression data
› A: Patients surviving >4yr post resection of metastatic disease
› B: Patients surviving <1yr post resection of metastatic disease
› C & D: Enlarged view (HDAC)
Results – gene co-expression networks are significantly disturbed among patients with good and poor clinical outcomes
PIG. CELL & MEL. RES.|In press|2013 Provided by Sara-Jane Schramm (Usyd)
Software: VAN
[Figure: VAN workflow.]
› Transcriptomics data (... states)
› Network data (PPI, microRNA-gene)
› Data analysis
- Obtain hub-interactor correlations in each state
- Perform tests of significance to identify hubs where average correlation (with interactors) varies across states
- Perform meta-analysis (based on p-values) if multiple datasets are considered
› Hubs of interest
› Cancer gene census data
› Hubs causally implicated in cancer
› Network visualization using R (one hub at a time) or Cytoscape (multiple hubs at the same time)
[Figure: panels "Hub and interactors - alive" and "Hub and interactors - ddm" showing the same hub subnetwork in the two states (legible gene labels include CASP8, KCNQ1, HNF4A, HGS, EIF3A and PDGFRB), alongside a Cytoscape view of multiple hubs and their interactors.]
VAN: Identifying biologically perturbed networks using differential variability analysis
[Figure: hub and interactors; panel labels: ANSR, DM.]
Moving to classification
1. Differential expression (DE) analysis: finding DE genes between two classes (e.g. good prognosis vs poor prognosis).
2. Cluster analysis: finding common patterns between genes.
3. Classification & prediction: predicting an outcome based on a set of explanatory variables (features) and a model (classifier).
Three main categories of question
How to extend this concept from DE analysis to classification and prediction?
DE analysis → Classification
1. Gene-based features: rank genes using network information (e.g. NetRank), or construct weights for genes using network information (e.g. weighted lasso).
2. Network-based features: define some network measure which can be used to quantify network perturbation between the two classes; rank the networks accordingly (e.g. Rapaport et al., Taylor et al., BSS/WSS).
› Note that it is surprisingly difficult to come up with a network measure which can be translated from a DE framework into a classification framework.
Constructing features for the network approach in two main ways:
Taylor et al.: feature
› Instead of using the top ranked networks as the classification features, Taylor et al. use the edges in the top ranked networks.
› Each edge k in the selected networks is assigned the feature value.
[Figure: hub H connected to interactors I1, I2, I3, I4 and I5.]
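The transcript does not show how the feature value for an edge is computed. A minimal Python sketch, assuming the feature for edge k in a given sample is the hub expression minus the interactor expression in that sample (an assumption suggested by the talk's later mention of the difference between hub and interactor expression). The function name and toy values are hypothetical.

```python
def edge_features(expr, hub, interactors):
    """Build one feature vector per sample, with one entry per hub-interactor edge.

    expr        : dict mapping gene name -> list of expression values (one per sample)
    hub         : name of the hub gene
    interactors : list of interactor gene names, one per edge
    """
    n_samples = len(expr[hub])
    return [
        [expr[hub][s] - expr[gene][s] for gene in interactors]
        for s in range(n_samples)
    ]


# Toy data for two samples, invented for illustration only.
expr = {
    "H": [2.0, 3.0],
    "I1": [1.0, 3.5],
    "I2": [0.5, 1.0],
}
features = edge_features(expr, "H", ["I1", "I2"])
print(features)
```

Each row can then be passed to any classifier as the feature vector for one patient.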
Looking at one hub
› Some individual networks are capable of separating the classes reasonably well, by considering the difference between hub and interactor expression (the LDA method).
[Figure: scatter plot for CEBPB (49 interactors); x-axis: expression for the CEBPB gene, -2 to 2; y-axis: median absolute expression for the interactors, 0.4 to 1.4; points labelled GP and PP.]
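A minimal sketch of a single-hub classifier in the spirit of the plot above. It assumes two features per sample (hub expression and median absolute interactor expression, matching the plot axes) and a standard two-class linear discriminant with pooled covariance and equal priors. Whether this matches the LDA method used in the talk is an assumption; the function names and toy values are hypothetical.

```python
from statistics import mean, median


def hub_features(hub, interactors):
    """One (hub expression, median absolute interactor expression) pair per sample."""
    return [
        (hub[s], median([abs(inter[s]) for inter in interactors]))
        for s in range(len(hub))
    ]


def fit_lda(points, labels, class_a="GP", class_b="PP"):
    """Fit a two-class LDA rule on 2D points; returns (weights, cutoff)."""
    groups = {c: [p for p, lab in zip(points, labels) if lab == c]
              for c in (class_a, class_b)}
    means = {c: (mean([x for x, _ in g]), mean([y for _, y in g]))
             for c, g in groups.items()}
    # Pooled within-class scatter; its positive scale factor does not change the rule.
    sxx = syy = sxy = 0.0
    for c, g in groups.items():
        mx, my = means[c]
        for x, y in g:
            sxx += (x - mx) ** 2
            syy += (y - my) ** 2
            sxy += (x - mx) * (y - my)
    det = sxx * syy - sxy ** 2
    dx = means[class_a][0] - means[class_b][0]
    dy = means[class_a][1] - means[class_b][1]
    # w = S^{-1} (mean_a - mean_b), using the closed-form inverse of a 2x2 matrix.
    w = ((syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det)
    mid = ((means[class_a][0] + means[class_b][0]) / 2,
           (means[class_a][1] + means[class_b][1]) / 2)
    cutoff = w[0] * mid[0] + w[1] * mid[1]
    return w, cutoff


def predict(model, point, class_a="GP", class_b="PP"):
    """Assign class_a when the discriminant score is above the midpoint cutoff."""
    w, cutoff = model
    score = w[0] * point[0] + w[1] * point[1]
    return class_a if score > cutoff else class_b


# Toy training points, invented for illustration only.
points = [(-1.0, 0.6), (-0.5, 0.7), (-0.8, 0.5), (0.9, 1.1), (0.6, 1.2), (1.1, 1.0)]
labels = ["GP", "GP", "GP", "PP", "PP", "PP"]
model = fit_lda(points, labels)
predictions = [predict(model, p) for p in points]
print(predictions)
```

In a real evaluation the rule would be fitted and tested on separate samples, for example inside a cross-validation loop.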
Classification procedure
Other network based approaches
Winter et al., PLoS Computational Biology, 2012
Cross-validation error rate
[Figure: random forest classification error (y-axis, 0.2 to 0.6) for each feature selection method: Mod-t, unweighted lasso, average expression, Taylor, Rapaport, inner product, BSS/WSS, weighted lasso (hub) and weighted lasso (all). Group labels: single-gene, gene set, network-based features, gene-based features.]
› Error rates for Taylor's method are only slightly better than for the classical single-gene moderated-t method.
› However, the two methods are capturing different information: they are correctly classifying different subsets of patients.
Summary and discussion
› VAN (R package) enables the testing of modules for dysregulation based on two or more conditions; it is also suitable for the examination of changes across developmental timelines.
› The majority of network methods based on the discovery network do not perform as well as methods based on the predefined network.
› Combining Taylor's method and the single-gene method could yield a more accurate classifier.
› Using the LDA method, some hub subnetworks independently act as accurate prognostic predictors.
› The best performing network feature selection methods only select small hub subnetworks.
Acknowledgements
› Graham Mann (Usyd)
- Gulietta Pupo & Varsha Tembe
› Sara-Jane Schramm
› John Thompson
› Richard Scolyer (RPA)
› Marc Wilkins (UNSW)
- Simone Li
- Chi Nam Ignatius Pang
- David Fung
- Apurv Goel
- Natalie Twine
› School of Mathematics and Statistics (Usyd)
- Samuel Mueller
- Vivek Jayaswal
- Kaushala Jayawardana
- Rebecca Barter
- Shila Ghanazfar
- Anna Campain