Using network information with gene expression data

Jean Yee Hwa Yang, School of Mathematics and Statistics

Central dogma of molecular biology

Image source: Central dogma of molecular biology, Wikipedia; http://en.wikipedia.org/wiki/Central_dogma_of_molecular_biology

Microarrays are used to detect the extent to which genes are being expressed.

Technologies ~ measuring expression

[Figure: expression data matrix of variables (5000-30000 genes or 2000 miRNAs) by N samples. mRNA and microRNA expression is measured by microarray; next-gen sequencing gives count data.]

Motivation: Melanoma prognosis

›  Melanomas are common in a large demographic of the population, especially in Caucasians living in sunny climates. Of those that metastasise (Stage III), about 40% go on to live cancer free, but another 40% succumb to the disease in less than 1 year.

›  Samples were obtained from Professor Graham Mann's group from the Westmead Institute for Cancer Research and Melanoma Institute Australia.

›  Aim: To predict survival prognosis for Stage III melanoma patients.

Currently, we have gene expression data for 79 Stage III individuals. In addition, we have clinical data consisting of patient stage at diagnosis, survival status, and histology, pathology and mutation information.

Research aims

› New prognostic markers

-  To determine whether there are significant biomarker and pathway differences between melanomas of good and bad prognosis after resection of nodal metastatic disease;

› New therapeutic targets

-  To identify and validate the principal regulatory pathway abnormalities that characterise metastatic (stage III and IV) melanomas;

-  To investigate novel genomic drivers of melanoma tumour progression and outcome.

Provided by Sara-Jane Schramm (Usyd)

Survival outcome

[Figure: histogram of survival time of stage III melanoma patients. x-axis: survival time (years); y-axis: frequency.]

Two survival groups

Bad prognosis: Survival < 1 year and died due to melanoma

Good prognosis: Survival ≥ 4 years with no sign of relapse

Gene expression (microarray data)

No correlation with BRAF mutation

[Figure: expression values for the PP and GP groups. Pink: no BRAF mutation; gray: BRAF mutation.]

Gene expression : DE analysis

1.  Differential expression (DE) analysis: finding DE genes between two classes (e.g. good prognosis vs poor prognosis).

2.  Cluster analysis: finding common patterns between samples / genes.

3.  Classification & prediction: predicting an outcome based on a set of explanatory variables (features) and a model (classifier).
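As a minimal sketch of the first question type (single-gene DE analysis), the snippet below ranks genes by a plain Welch two-sample t-statistic between two prognosis groups. The function names and the toy expression values are hypothetical, and the plain t-statistic is only a stand-in for whichever test is actually used in practice (for example a moderated t).

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Welch two-sample t-statistic for one gene (unequal variances)."""
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se


def rank_genes(expr_good, expr_poor):
    """Rank genes by |t|, largest first.

    expr_good / expr_poor: dict mapping gene name -> list of expression
    values, one value per sample in that prognosis group.
    """
    stats = {g: welch_t(expr_good[g], expr_poor[g]) for g in expr_good}
    return sorted(stats, key=lambda g: abs(stats[g]), reverse=True)


# Toy data (made up for illustration only).
expr_good = {"geneA": [5.1, 5.3, 4.9, 5.2], "geneB": [7.0, 7.4, 6.8, 7.1]}
expr_poor = {"geneA": [5.0, 5.2, 5.1, 4.8], "geneB": [8.9, 9.2, 9.0, 8.7]}
ranking = rank_genes(expr_good, expr_poor)
```

Each gene is tested on its own here, which is exactly the "single gene" view that the network-level methods later in the talk try to move beyond.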


Three main types of questions

Image reproduced from JOURNAL OF INVESTIGATIVE DERMATOLOGY|Vol 133|2013

Gene expression : cluster analysis





Gene expression : classification




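[Figure: decision-tree classifier example. Splits on Gene 1 (Mi1 < -0.67) and Gene 2 (Mi2 > 0.18), with yes/no branches leading to B-ALL, AML and T-ALL; accompanying plot of error rate against number of features (genes).]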

Most of the approaches to date can be considered as “single gene” analysis.

Three different levels of DE analysis

1. Single gene level: this is gene-by-gene analysis (individual node)

2. Gene set level: the features are subsets of genes (set of nodes), e.g. gene set test.

3. Network level: examine a subset of genes (nodes in the network) together with information on relationships between the genes (the edges in the network).

Let's think of performing DE analysis at 3 different levels:

Networks

[Figure: example network. Nodes: CD19, CD38, LYN, VAV1, VAV2, ABL1, LCP2, NCK1, VAV3, ZAP70, YWHAQ, CD6, GRAP2, GRB2, SHB, MAP4K1, SYK, RHOA, DVL2, ELAVL4, SWAP70, THY1, RHOG, CDC42, RAC1, KLK3, EGFR, ALK, LCK.]

A network is made up of nodes and edges:

Network discovery vs predefined networks

› Network discovery: use microarray information to find genes with highly correlated gene expression probes, and define edges accordingly (e.g. WGCNA).

›  Predefined networks: use predefined gene interaction databases such as MetaCore or iRefWeb.

-  E.g. protein-protein interaction networks: a node represents a protein-coding gene, and an edge between two nodes represents an interaction between the proteins coded for by the genes.
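To make the network discovery route concrete, here is a minimal sketch that defines an edge between two genes whenever the absolute Pearson correlation of their expression profiles reaches a cut-off. The hard threshold of 0.8, the function names and the toy data are assumptions for illustration; this is a simplification and not the WGCNA procedure itself.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def discover_edges(expr, threshold=0.8):
    """Return the set of gene pairs whose |correlation| >= threshold.

    expr: dict mapping gene name -> list of expression values across samples.
    The threshold is an arbitrary illustrative choice.
    """
    edges = set()
    for g1, g2 in combinations(sorted(expr), 2):
        if abs(pearson(expr[g1], expr[g2])) >= threshold:
            edges.add((g1, g2))
    return edges


# Toy data (made up): g2 tracks g1 closely, g3 does not.
expr = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "g2": [1.1, 2.1, 2.9, 4.2, 5.0],
    "g3": [3.0, 1.0, 4.0, 1.0, 5.0],
}
edges = discover_edges(expr)
```

A predefined network skips this step entirely: the edge set is read from a database rather than estimated from the expression data.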


We have used two different methods for defining networks:

Protein-protein interaction data

›  Human Protein Reference Database -  Keshava Prasad et al. 2009

›  iRefWeb -  Turner et al. 2010

›  BioGRID -  Chatr-aryamontri et al. 2013

›  MetaCore -  From GeneGo Inc.

Hairball image generated using Cytoscape

(Smoot et al. 2011)

Thanks to Simone Li and Drs Igy Pang and David Fung at the Systems Biology Initiative, the University of New South Wales

VAV3 hub subnetwork

[Figure: VAV3 hub with its interactors RHOG, CDC42, RAC1, RHOA, KLK3, EGFR, GRB2, ALK, LCP2, LCK and SYK.]

MetaCore network dataset

›  Split the network into subnetworks, each containing a central hub gene (a gene with at least 5 interactors) and its immediate interactors.

›  For example, one network dataset from the MetaCore database consists of 1273 hub subnetworks, with a total of 3607 genes in common with the microarray dataset.
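The splitting step can be sketched as follows, assuming the network is available as a list of undirected edges and that only genes present on the microarray are kept. The function name, the reading of the hub rule as "at least 5 interactors" and the toy edge list are assumptions for illustration.

```python
from collections import defaultdict


def hub_subnetworks(edges, measured_genes, min_interactors=5):
    """Map each hub gene to the sorted list of its immediate interactors.

    edges: iterable of (gene_a, gene_b) pairs from a predefined network.
    measured_genes: genes that are also present in the expression dataset.
    min_interactors: smallest neighbourhood size for a gene to count as a
    hub (an assumed reading of the hub definition).
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        if a in measured_genes and b in measured_genes and a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)
    return {
        gene: sorted(nbrs)
        for gene, nbrs in neighbours.items()
        if len(nbrs) >= min_interactors
    }


# Toy edge list (illustrative only, not taken from any database).
edges = [("VAV3", g) for g in ["RHOG", "CDC42", "RAC1", "RHOA", "GRB2", "SYK"]]
edges.append(("GRB2", "SYK"))
measured = {"VAV3", "RHOG", "CDC42", "RAC1", "RHOA", "GRB2", "SYK"}
hubs = hub_subnetworks(edges, measured)
```

In this toy example only VAV3 qualifies as a hub; GRB2 and SYK each have two neighbours and are kept only as interactors.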


VAV3 hub subnetwork

Taylor et al., Nature Biotech, Vol 27, 2009

$$P\text{-value}_{\mathrm{Hub}} = \frac{\text{frequency of (random average hub difference} > \text{real average hub difference)}}{1000}$$

Finding hubs of interest


For a given sub-network (predefined hub) i:

[Figure: a hub gene connected to its interactor genes.]

›  For each edge k, the correlation difference between the two classes (GP and PP) was calculated:

$$\Delta_{PP,GP,k} = \mathrm{PPcor}_k - \mathrm{GPcor}_k$$

›  For each sub-network i, calculate the average absolute difference in hub-interactor correlation:

$$\mathrm{AveHubDiff}_i = \frac{\sum_{k=1}^{n_i} \left|\Delta_{PP,GP,k}\right|}{n_i - 1}$$

where n_i is the number of interactors of the central hub gene in the network i.

Rank the hub subnetworks based on their AveHubDiff values, or use a permutation test to determine the statistical significance of each hub.
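The hub statistic and its permutation test might be sketched as below. This assumes the statistic is the sum of absolute hub-interactor correlation differences divided by (n_i - 1), as read from the slide, and that significance is the fraction of label permutations whose statistic exceeds the observed one. The function names, the default of 1000 permutations with a fixed seed, and the toy data are illustrative assumptions, not the VAN implementation.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ave_hub_diff(hub, interactors, expr_pp, expr_gp):
    """Sum over edges of |PPcor_k - GPcor_k|, divided by (n_i - 1).

    Needs at least two interactors so the denominator is non-zero.
    """
    diffs = [
        abs(pearson(expr_pp[hub], expr_pp[g]) - pearson(expr_gp[hub], expr_gp[g]))
        for g in interactors
    ]
    return sum(diffs) / (len(interactors) - 1)


def permutation_p_value(hub, interactors, expr_pp, expr_gp, n_perm=1000, seed=0):
    """Fraction of sample-label permutations with a larger AveHubDiff."""
    rng = random.Random(seed)
    observed = ave_hub_diff(hub, interactors, expr_pp, expr_gp)
    genes = [hub] + list(interactors)
    n_pp = len(expr_pp[hub])
    pooled = {g: expr_pp[g] + expr_gp[g] for g in genes}
    idx = list(range(len(pooled[hub])))
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(idx)  # reassign samples to the two classes at random
        perm_pp = {g: [pooled[g][j] for j in idx[:n_pp]] for g in genes}
        perm_gp = {g: [pooled[g][j] for j in idx[n_pp:]] for g in genes}
        if ave_hub_diff(hub, interactors, perm_pp, perm_gp) > observed:
            exceed += 1
    return exceed / n_perm


# Toy data (made up): hub H and interactor A are positively correlated in
# the PP group and negatively correlated in the GP group; B barely changes.
expr_pp = {"H": [1.0, 2.0, 3.0, 4.0, 5.0],
           "A": [1.2, 1.9, 3.1, 4.3, 4.8],
           "B": [2.0, 1.0, 4.0, 3.0, 5.5]}
expr_gp = {"H": [1.5, 2.5, 3.5, 4.5, 5.5],
           "A": [5.1, 3.9, 3.2, 2.2, 0.9],
           "B": [2.1, 1.3, 4.2, 3.3, 5.2]}
observed = ave_hub_diff("H", ["A", "B"], expr_pp, expr_gp)
p_value = permutation_p_value("H", ["A", "B"], expr_pp, expr_gp)
```

Permuting whole samples (rather than individual values) keeps the within-sample relationship between hub and interactors intact, so only the class labels are randomised.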

Applying to Melanoma gene expression data

›  A: Patients surviving >4yr post resection of metastatic disease

›  B: Patients surviving <1yr post resection of metastatic disease

›  C & D: Enlarged view (HDAC)

Results – gene co-expression networks are significantly disturbed among patients with good and poor clinical outcomes

PIG. CELL & MEL. RES.|In press|2013 Provided by Sara-Jane Schramm (Usyd)

Software: VAN

[Figure: VAN workflow, with example hub plots and a Cytoscape network view. Recoverable text from the figure follows.]

›  Transcriptomics data (two or more states)

›  Network data (PPI, microRNA-gene)

›  Cancer gene census data: hubs causally implicated in cancer

Data analysis

›  Obtain hub-interactor correlations in each state.

›  Perform tests of significance to identify hubs where average correlation (with interactors) varies across states.

›  Perform meta-analysis (based on p-values) if multiple datasets are considered.

Hubs of interest

›  Network visualization using R (one hub at a time) or Cytoscape (multiple hubs at the same time).

›  Example panels: hub and interactors for the alive group and for the ddm group.

VAN: Identifying biologically perturbed networks using differential variability analysis

[Figure: hub and interactors, shown for the ANSR and DM groups.]


Moving to classification


1.  Differential expression (DE) analysis: finding DE genes between two classes (e.g. good prognosis vs poor prognosis).

2.  Cluster analysis: finding common patterns between genes.

3.  Classification & prediction: predicting an outcome based on a set of explanatory variables (features) and a model (classifier).

Three main categories of question

How to extend this concept from DE analysis to classification and prediction?

1.  Gene-based features: rank genes using network information (e.g. NetRank), or construct weights for genes using network information (e.g. weighted lasso).

2.  Network-based features: define some network measure which can be used to quantify network perturbation between the two classes; rank the networks accordingly (e.g. Rapaport et al., Taylor et al., BSS/WSS).

› Note that it is surprisingly difficult to come up with a network measure which can be translated from a DE framework into a classification framework.
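One of the measures named above, BSS/WSS, is the ratio of between-class to within-class sums of squares for a single feature. The sketch below computes that ratio for one feature vector; how the feature is derived from a network (for example from a hub subnetwork summary) is not specified here, so the function name and the toy values are purely illustrative.

```python
from statistics import mean


def bss_wss(values, labels):
    """Between-class over within-class sum of squares for one feature.

    values: feature value per sample; labels: class label per sample.
    Larger ratios mean the feature separates the classes better.
    Raises ZeroDivisionError if every class is constant.
    """
    overall = mean(values)
    bss = 0.0
    wss = 0.0
    for cls in set(labels):
        group = [v for v, lab in zip(values, labels) if lab == cls]
        centre = mean(group)
        bss += len(group) * (centre - overall) ** 2
        wss += sum((v - centre) ** 2 for v in group)
    return bss / wss


# Toy features (made up): the first separates GP from PP, the second does not.
labels = ["GP", "GP", "GP", "PP", "PP", "PP"]
separating = bss_wss([1, 2, 3, 7, 8, 9], labels)
overlapping = bss_wss([1, 8, 3, 7, 2, 9], labels)
```

Ranking features by this ratio and keeping the top few is one simple way to turn a DE-style score into a feature selection step for a classifier.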


Constructing features for the network approach in two main ways:

Taylor et al.: feature

›  Instead of using the top ranked networks as the classification features, Taylor et al. use the edges in the top ranked networks.

›  Each edge k in the selected networks is assigned the feature value.

[Figure: hub H connected to interactors I1, I2, I3, I4 and I5.]

Looking at one hub

›  Some individual networks are capable of separating the classes reasonably well, by considering the difference between hub and interactor expression (the LDA method).
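As one possible reading of the single-hub idea, the sketch below summarises each patient by two numbers (the hub's expression and the median absolute expression of its interactors) and fits a two-class linear discriminant with a pooled covariance matrix and equal priors. The feature choice follows the axes of the CEBPB plot; the function names, class coding (0 = GP, 1 = PP) and toy feature values are assumptions, and this is not necessarily the exact LDA method used in the talk.

```python
from statistics import mean, median


def hub_features(sample, hub, interactors):
    """Two features for one patient: hub expression and the median
    absolute expression of the hub's interactors."""
    return (sample[hub], median(abs(sample[g]) for g in interactors))


def fit_lda(x0, x1):
    """Two-class LDA on 2-D features (pooled covariance, equal priors).

    Returns (w, c); a point x is assigned to class 1 when w . x > c.
    """
    m0 = [mean(col) for col in zip(*x0)]
    m1 = [mean(col) for col in zip(*x1)]
    s = [[0.0, 0.0], [0.0, 0.0]]
    for points, m in ((x0, m0), (x1, m1)):
        for p in points:
            d = (p[0] - m[0], p[1] - m[1])
            for a in range(2):
                for b in range(2):
                    s[a][b] += d[a] * d[b]
    dof = len(x0) + len(x1) - 2
    s = [[v / dof for v in row] for row in s]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    diff = (m1[0] - m0[0], m1[1] - m0[1])
    # w = S^{-1} (m1 - m0), using the closed-form inverse of a 2x2 matrix
    w = ((s[1][1] * diff[0] - s[0][1] * diff[1]) / det,
         (s[0][0] * diff[1] - s[1][0] * diff[0]) / det)
    # threshold at the projection of the midpoint between the class means
    c = w[0] * (m0[0] + m1[0]) / 2 + w[1] * (m0[1] + m1[1]) / 2
    return w, c


def predict(x, w, c):
    """Return 1 if x falls on the class-1 side of the boundary, else 0."""
    return 1 if w[0] * x[0] + w[1] * x[1] > c else 0


# Toy feature values (made up): class 0 = GP, class 1 = PP.
gp = [(-1.0, 0.5), (-0.8, 0.6), (-1.2, 0.45), (-0.9, 0.7)]
pp = [(1.0, 1.1), (0.9, 1.3), (1.2, 1.0), (0.8, 1.2)]
w, c = fit_lda(gp, pp)
example = hub_features({"H": 1.0, "I1": -2.0, "I2": 0.5, "I3": 1.5},
                       "H", ["I1", "I2", "I3"])
```

With real data the fitted rule would be assessed on held-out patients (for example by cross-validation) rather than on the training points.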

[Figure: CEBPB (49 interactors). x-axis: expression for the CEBPB gene; y-axis: median absolute expression for the interactors; points labelled GP and PP.]

Classification procedure

Other network-based approaches

Winter et al., PLoS Computational Biology, 2012

Cross-validation error rate

[Figure: random forest classification error (y-axis, 0.2 to 0.6) for each feature selection method: Mod-t, Unweighted lasso, Average expression, Taylor, Rapaport, Inner product, BSS/WSS, Weighted lasso (hub), Weighted lasso (all). Methods are grouped as single-gene, gene-set, network-based and gene-based features.]

›  Error rates for Taylor's method are only slightly better than for the classical single-gene moderated-t method.

› However, the two methods are capturing different information: they are correctly classifying different subsets of patients.


Summary and discussion

›  VAN (R package) enables the testing of modules for dysregulation based on two or more conditions; it is also suitable for the examination of changes across developmental timelines.

› The majority of network methods based on the discovery network do not perform as well as methods based on the predefined network.

› Combining Taylor's method and the single-gene method could yield a more accurate classifier.

› Using the LDA method, some hub subnetworks independently act as accurate prognostic predictors.

›  The best performing network feature selection methods only select small hub subnetworks.


Acknowledgements

›  Graham Mann (Usyd)

-  Gulietta Pupo & Varsha Tembe

›  Sara-Jane Schramm

›  John Thompson

›  Richard Scolyer (RPA)

›  Marc Wilkins (UNSW)

-  Simone Li

-  Chi Nam Ignatius Pang

-  David Fung

-  Apurv Goel

-  Natalie Twine

›  School of Mathematics and Statistics (Usyd)

-  Samuel Mueller

-  Vivek Jayaswal

-  Kaushala Jayawardana

-  Rebecca Barter

-  Shila Ghanazfar

-  Anna Campain
