
1

Robust diagnosis of DLBCL from gene expression data from different laboratories

DIMACS - RUTCOR Workshop on Boolean and Pseudo-Boolean Functions in Memory of Peter L. Hammer

January 19-22, 2009

2

Peter L Hammer, Sorin Alexe, David E Axelrod

RUTGERS UNIV

Gustavo Stolovitzky

IBM TJ WATSON RESEARCH

Gyan Bhanot, Arnold J Levine

INSTITUTE FOR ADVANCED STUDY PRINCETON

David Weissmann

CANCER INSTITUTE OF NEW JERSEY

3

Overview

Motivation

Pattern-based ensemble classifiers

Case study: compare data from two labs for DLBCL vs FL diagnosis

Shipp et al. (2002) Nature Med.; 8(1), 68-74. (Whitehead Lab)

Stolovitzky G. (2005) In Deisboeck et al Complex Systems Science in BioMedicine (in press) (preprint: http://www.wkap.nl/prod/a/Stolovitzky.pdf). (DallaFavera Lab)

Alexe, Alexe, Axelrod, Hammer, Weissmann (2005) Artificial Intelligence in Medicine

Bhanot, Alexe, Stolovitzky, Levine (2005) Genome Informatics

4

Non-Hodgkin lymphomas

FL: low grade non-Hodgkin lymphoma / no cure if advanced stage

• second most frequent subtype of nodal lymphoid malignancies
• incidence has risen from 2-3 to more than 5-7 per 100,000/year ('50 to '00)
• t(14;18) translocation: over-expression of anti-apoptotic bcl2
• 25-60% of FL cases evolve to DLBCL

DLBCL: high grade non-Hodgkin lymphoma / high variability to treatment

• most frequent subtype of NHL
• < 2 years survival if untreated

Biomarkers: FL transformation to DLBCL

• p53/MDM2 (Moller et al., 1999)
• p16 (Pyniol, 1998)
• p38MAPK (Elenitoba-Johnson et al., 2003)
• c-myc (Lossos et al., 2002)

5

Gene arrays

Gene arrays are a way to study the variation of mRNA levels between different types of cells.

This allows diagnosis and inference of pathways that cause disease / early stage diagnosis

Identify molecular profiles of disease: personalized medicine

6

Lymphoma datasets

Data: WI (Shipp et al., 2002) Affy HuGeneFL

CU (DallaFavera Lab, Stolovitzky, 2005) Affy Hu95Av2

Samples:

WI: 58 DLBCL & 19 FL

CU: 14 DLBCL & 7 FL

Genes:

WI: 6817

CU: 12581

7

Diagnosis problem

Input

Training (biomedical) data: 2 classes: FL and DLBCL

m samples described by N >> m features

Output

Collection of robust biomarkers, models

Robust, accurate classifier / tested on out-of-sample data

8

[Flowchart: meta-classifier pipeline. Boxes recovered from the diagram:]

Input data

Data preprocessing: creating training and test data, normalization, noise estimation

Robust feature selection: filtering, support set selection

Biology-based feature selection: filtering, support set selection

Raw data (training) / Pattern data (training) / Principal Components

INDIVIDUAL CLASSIFIERS: Artificial Neural Networks, Support Vector Machines, Weighted Voting System (LAD), k-Nearest Neighbors, Decision Trees (C4.5), Logistic Regression

INTERMEDIATE CLASSIFIERS / Calibration

META-CLASSIFIER: Classifier (Weighted Voting)

Validation (test data)
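A minimal sketch of how a weighted-voting meta-classifier can combine the individual classifiers. The slides do not give the exact combination rule, so everything here is an assumption: the function name `weighted_vote`, the +1 (DLBCL) / -1 (FL) encoding, the tie-break toward +1, and the example weights and votes are all hypothetical.

```python
def weighted_vote(predictions, weights):
    """Combine classifier votes by a normalized weighted sum.

    predictions: one list per classifier, each holding +1 (DLBCL) or -1 (FL)
                 per sample (encoding is an assumption, not from the slides).
    weights:     one non-negative weight per classifier.
    Returns (labels, scores); a score >= 0 is labeled +1.
    """
    total = sum(weights)
    n_samples = len(predictions[0])
    scores = []
    for j in range(n_samples):
        # Weighted average of the votes cast for sample j.
        s = sum(w * p[j] for w, p in zip(weights, predictions)) / total
        scores.append(s)
    labels = [1 if s >= 0 else -1 for s in scores]
    return labels, scores


# Hypothetical votes from three classifiers on three samples.
votes = [
    [1, -1, 1],
    [1, -1, -1],
    [-1, 1, 1],
]
labels, scores = weighted_vote(votes, weights=[0.5, 0.3, 0.2])
print(labels)
```

The same function can be applied to any number of classifiers, since the weights are normalized inside it.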

9

Patterns (Logical Analysis of Data, Hammer 1988)

Positive Patterns / Negative Patterns

Model

-Exhaustive collections of patterns

-Pattern space

-Classification / attribute analysis / new class identification

10

Data Preprocessing

Filtering: 50% P (present) calls; upper limit UL = 16000, lower limit LL = 20

Split WI data 2/1 into stratified train/test sets; CU data used for test only

Normalize data to median 1000 per array

Generate 500 data sets using noise + k-fold stratified sampling + jackknife

Find genes with high correlation to phenotype using t-test or SNR

Keep genes that are in > 90% of datasets
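The preprocessing steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the order "clip, then normalize" is assumed, SNR is taken to be the signal-to-noise ratio (mean difference divided by the sum of class standard deviations), and all function names and example values are hypothetical.

```python
from collections import Counter
from statistics import mean, median, stdev

LL, UL, TARGET_MEDIAN = 20.0, 16000.0, 1000.0  # values quoted on the slide


def clip_array(values, lo=LL, hi=UL):
    """Clamp raw intensities of one array to [LL, UL]."""
    return [min(max(v, lo), hi) for v in values]


def normalize_to_median(values, target=TARGET_MEDIAN):
    """Rescale one array so that its median equals the target."""
    m = median(values)
    return [v * target / m for v in values]


def snr(pos, neg):
    """Signal-to-noise ratio of one gene (assumed definition):
    (mean_pos - mean_neg) / (sd_pos + sd_neg)."""
    return (mean(pos) - mean(neg)) / (stdev(pos) + stdev(neg))


def stable_genes(selections, min_fraction=0.9):
    """Keep genes selected in more than min_fraction of resampled datasets."""
    counts = Counter(g for sel in selections for g in set(sel))
    n = len(selections)
    return sorted(g for g, c in counts.items() if c / n > min_fraction)


# Hypothetical raw intensities for one array.
arr = normalize_to_median(clip_array([5, 500, 2000, 40000]))
print(arr, median(arr))

# Hypothetical selections from 10 perturbed datasets: g1 in all 10, g2 in 9.
runs = [["g1", "g2"]] * 9 + [["g1"]]
print(stable_genes(runs))
```

With the strict "> 90%" rule, a gene selected in exactly 9 of 10 datasets is dropped.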

11

Choosing support sets

Create quality patterns using small subsets of genes, validate using weighted voting with 10-fold cross validation

Sort genes by their appearance in good patterns

Select top genes to cover each sample by at least 10 patterns

Alexe, Alexe, Hammer, Vizvari (2005)
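One possible reading of the gene-ranking and coverage steps above, as a minimal sketch. The procedure in Alexe, Alexe, Hammer, Vizvari (2005) may differ; the representation of a pattern as a (genes, covered samples) pair, the function name `choose_support_set`, the alphabetical tie-break, and the toy data are all assumptions.

```python
from collections import Counter


def choose_support_set(patterns, samples, min_cover=10):
    """Add genes in order of how often they appear in good patterns until
    every sample is covered by at least min_cover patterns that use only
    the selected genes.

    patterns: list of (set_of_genes, set_of_covered_samples) pairs.
    Returns the selected genes in rank order, or None if the coverage
    target cannot be reached.
    """
    freq = Counter(g for genes, _ in patterns for g in genes)
    # Most frequent genes first; ties broken alphabetically for determinism.
    ranked = sorted(freq, key=lambda g: (-freq[g], g))
    selected = set()
    for gene in ranked:
        selected.add(gene)
        # Patterns that can be built from the genes selected so far.
        usable = [cov for genes, cov in patterns if genes <= selected]
        if all(sum(s in cov for cov in usable) >= min_cover for s in samples):
            return [g for g in ranked if g in selected]
    return None


# Toy example with hypothetical genes A, B, C and samples s1..s3.
toy_patterns = [
    ({"A", "B"}, {"s1", "s2"}),
    ({"A"}, {"s3"}),
    ({"C"}, {"s1"}),
]
print(choose_support_set(toy_patterns, ["s1", "s2", "s3"], min_cover=1))
```

The slide uses a coverage of at least 10 patterns per sample; the toy call uses `min_cover=1` only to keep the example small.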

12

The 30 genes that best distinguish FL from DLBCL

Columns: Gene symbol | Shipp et al. | Genes@Work | t-test | p53 regulated | Biological function

SEPP1 * * * oxidative stress

TXNIP * * metastases suppressor

DNASE1L3 * * apoptosis

CDH11 * * * cell adhesion

LUCA15 * apoptosis

GPR18 * * * signaling pathway

CLU * * * apoptosis

LY9 * * cell adhesion

RHOH * * T-cell differentiation

ELF2 transcription

CCNG2 * cell cycle

CR2 complement activation

CDKN2D * cell cycle

PPP2R5C * signal transduction

G18 cell growth

LY86 * apoptosis

ARPC1B cell motility

MCM7 * * * * cell cycle

BCL2A1 * * * apoptosis

IMPDH2 * * GMP biosynthesis

RRP45 * immune response

STAT1 NF-kappaB cascade

DLG7 * * * cell-cell signaling

SLC1A5 * * transport

TUBB2 * * microtubule movement

PSMA6 protein catabolism

PSMC1 * * * spinocerebellar ataxia

LGALS3 * * * sugar binding

CLTA * * transport

PAGA * * cell proliferation

13

# | Gene index | Gene description | Accession # | Pearson correlation of genes in support set with DLBCL vs FL outcome | Frequency of participation in the definition of combinatorial biomarkers | Functional gene group # (*)

1 | 506 | DNA REPLICATION LICENSING FACTOR CDC47 HOMOLOG | D55716_at | 0.45 | 42.08 | 1
2 | 1612 | (clone GPCR W) G protein-linked receptor gene (GPCR) gene, 5' end of cds | L42324_at | -0.49 | 30.00 | 2
3 | 972 | Rad2 | HG4074-HT4344_at | 0.45 | 23.33 | 1
4 | 2137 | HIGH AFFINITY IMMUNOGLOBULIN GAMMA FC RECEPTOR I "A FORM" PRECURSOR | M63835_at | 0.43 | 23.33 | 2
5 | 605 | 5-aminoimidazole-4-carboxamide-1-beta-D-ribonucleotide transformylase/inosinicase | D82348_at | 0.53 | 22.50 | -
6 | 6815 | Tubulin, Beta 2 | HG1980-HT2023_at | 0.50 | 8.33 | 4
7 | 7102 | HLA-A MHC class I protein HLA-A (HLA-A28,-B40, -Cw3) | M94880_f_at | -0.43 | 8.33 | 2
8 | 2988 | RCH1 RAG (recombination activating gene) cohort 1 | U28386_at | 0.48 | 7.08 | 1
9 | 4028 | LDHA Lactate dehydrogenase A | X02152_at | 0.62 | 6.25 | 6
10 | 4292 | PKM2 Pyruvate kinase, muscle | X56494_at | 0.55 | 5.00 | 6
11 | 4485 | IDH2 Isocitrate dehydrogenase 2 (NADP+), mitochondrial | X69433_at | 0.47 | 5.00 | 6
12 | 1430 | Protein tyrosine phosphatase (CIP2) mRNA | L25876_at | 0.44 | 4.17 | 5
13 | 1988 | INSULIN-LIKE GROWTH FACTOR BINDING PROTEIN 3 PRECURSOR | M35878_at | -0.28 | 4.17 | 2
14 | 582 | KIAA0175 gene | D79997_at | 0.45 | 2.08 | -
15 | 1092 | GAMMA-INTERFERON-INDUCIBLE PROTEIN IP-30 PRECURSOR | J03909_at | 0.53 | 2.08 | 3
16 | 2929 | Mitochondrial serine hydroxymethyltransferase gene, nuclear encoded mitochondrion protein | U23143_at | 0.42 | 2.08 | -
17 | 3005 | Bcl-2 related (Bfl-1) mRNA | U29680_at | 0.44 | 2.08 | 5
18 | 4010 | PGK1 Phosphoglycerate kinase 1 | V00572_at | 0.36 | 2.08 | 6
19 | 2789 | CENPA Centromere protein A (17kD) | U14518_at | 0.51 | 0.00 | 5
20 | 6703 | Dents Disease candidate gene | X81836_s_at | 0.37 | 0.00 | -

Table 1. Selected non-minimal support set of 20 genes for distinguishing DLBCL from FL cases. (*) Functional gene groups: 1: DNA replication, recombination and repair; 2: cell surface proteins and receptors; 3: protein synthesis and degradation; 4: structural proteins; 5: cell cycle and apoptosis; 6: metabolism; -: other.

Genes identified by LAD (AIIM 2005) to distinguish DLBCL from FL

14

Examples of FL and DLBCL patterns

Gene symbols used in the patterns: GPR18, CLU, DLG7, MCM7

Pattern | Conditions | Prevalence (%), training set: Pos / Neg | Prevalence (%), test set: Pos / Neg

P1 | > -1.13, > -0.62 | 97 / 0 | 91 / 23
P2 | ≤ 0.91, > -0.77 | 95 / 0 | 79 / 31
N1 | > -0.26, ≤ -0.55 | 0 / 100 | 3 / 54

WI training data:

Each DLBCL case satisfies at least one of the patterns P1 and P2

Each FL case satisfies the pattern N1 (and none of the patterns P1 and P2)
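A minimal sketch of how a pattern and its prevalence can be evaluated: a pattern is a conjunction of cutpoint conditions on gene expression values, and its prevalence in a class is the percentage of that class's samples satisfying all conditions. The gene names `geneA` and `geneB`, their assignment to the two cutpoints, and the sample values are hypothetical; the transcript does not preserve which gene each cutpoint of P1 applies to.

```python
import operator

OPS = {">": operator.gt, "<=": operator.le}


def satisfies(sample, pattern):
    """True if the sample meets every (gene, op, cutpoint) condition."""
    return all(OPS[op](sample[gene], cut) for gene, op, cut in pattern)


def prevalence(samples, pattern):
    """Percentage of samples that satisfy the pattern."""
    hits = sum(satisfies(s, pattern) for s in samples)
    return 100.0 * hits / len(samples)


# Cutpoints taken from P1; the gene assignment is an assumption.
p1 = [("geneA", ">", -1.13), ("geneB", ">", -0.62)]

# Hypothetical normalized expression values.
dlbcl = [
    {"geneA": 0.2, "geneB": 0.1},
    {"geneA": -0.5, "geneB": -0.3},
    {"geneA": -2.0, "geneB": 0.4},
    {"geneA": 1.0, "geneB": 0.0},
]
fl = [
    {"geneA": -1.5, "geneB": -1.0},
    {"geneA": 0.0, "geneB": -0.9},
]
print(prevalence(dlbcl, p1), prevalence(fl, p1))
```

A positive pattern is useful when its prevalence is high on DLBCL and low on FL, as in the Pos / Neg columns of the table.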

15

Pattern data

[Figure: pattern data. Panels: WI training data, WI test data, CU test data. Axes: positive patterns and negative patterns vs. DLBCL and FL samples.]

16

Meta-classifier performance

Classifier | Weight | Training: Sensitivity (%) | Training: Specificity (%) | Training: Error rate (%) | Test: Sensitivity (%) | Test: Specificity (%) | Test: Error rate (%)

Trained on raw data
ANN | 0.08 | 94.74 | 92.31 | 5.88 | 82.35 | 84.62 | 17.02
SVM | 0.08 | 97.37 | 92.31 | 3.92 | 97.06 | 76.92 | 8.51
kNN | 0.09 | 97.37 | 100.00 | 1.96 | 91.18 | 84.62 | 10.64
WV | 0.07 | 92.11 | 92.31 | 7.84 | 94.12 | 76.92 | 10.64
C4.5 | 0.06 | 94.74 | 84.62 | 7.84 | 94.12 | 69.23 | 12.77
LR | 0.07 | 97.37 | 84.62 | 5.88 | 94.12 | 69.23 | 12.77

Trained on pattern data
ANN | 0.10 | 100.00 | 100.00 | 0.00 | 97.06 | 76.92 | 8.51
SVM | 0.10 | 100.00 | 100.00 | 0.00 | 97.06 | 76.92 | 8.51
kNN | 0.10 | 100.00 | 100.00 | 0.00 | 100.00 | 69.23 | 8.51
WV | 0.10 | 100.00 | 100.00 | 0.00 | 97.06 | 76.92 | 8.51
C4.5 | 0.10 | 100.00 | 100.00 | 0.00 | 91.18 | 76.92 | 12.77
LR | 0.05 | 100.00 | 76.92 | 5.88 | 100.00 | 61.54 | 10.64

Meta-classifier | | 100.00 | 100.00 | 0.00 | 100.00 | 76.92 | 6.38
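A short sketch of how the three reported measures relate to a confusion matrix, taking DLBCL as the positive class (an assumption). The counts in the example are not given in the slides; they are inferred here from the reported percentages for ANN trained on raw data (test columns) and should be read as illustrative only.

```python
def rates(tp, fn, tn, fp):
    """Return (sensitivity, specificity, error rate) in percent.

    tp/fn: positive (DLBCL) cases classified correctly / incorrectly.
    tn/fp: negative (FL) cases classified correctly / incorrectly.
    """
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    error_rate = 100.0 * (fn + fp) / (tp + fn + tn + fp)
    return round(sensitivity, 2), round(specificity, 2), round(error_rate, 2)


# Inferred (not reported) counts: 34 DLBCL and 13 FL test cases.
print(rates(tp=28, fn=6, tn=11, fp=2))  # (82.35, 84.62, 17.02)
```

The error rate pools both classes, so a classifier can have a low error rate while its specificity on the smaller FL class stays modest.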

17

Error distribution: raw and pattern data

[Figure: error distribution (axis 0 to 50) for classifiers trained on raw data, classifiers trained on pattern data, and the meta-classifier. Series: WI test data, CU test data.]

18

Biology based method

19

p53 related genes identified by filtering procedure

Gene symbol

CCNB1 EPRS PMAIP1 E2F3
MCM7 GSK3B ACAA2 MDM4
BRCA1 COL6A1 E2F5* AMPD2
BCL2A1 HRAS POLA RBBP4
PPP2R4 SERPING1 HMGB2 CCNG2*
EIF2S2 CCNA2 PSMB5 HARS
COMT CCT6A ACTA2 CASP6
IARS PRKDC INSR RPS6KA1
MPI CAD SNRPA GRP58
ALAS1 TNFRSF1B G1P2 TP53
MRPL3 ZNF184* IMPDH1 SMAD2
NCF2 ALDOA MAP2K2 ATP5C1
AARS KARS TOP2A TIMP3
KIF11 MAD2L1 CXCL1 THBS2
CDK4 GOT1 BAG1 MYCBP
ATP1B1 CDC25B TOP1 DTR
CDC20 PSMA1 MAP4 TIMP3
PRIM1 KIAA0101 FDFT1 CBS
CDC2 PCNA MTA1 CDKN2D*
TOP2A TCF3 CDKN1A RELA
CDK2 CYC1 HLAE*
MYC UPP1 PLK1
CCNE1 TOPBP1 CDK7

[Figure labels: FL, DLBCL, progression]

20

p53 pattern data

[Figure: p53 pattern data. Panels: WI data, CU data. Axes: positive patterns and negative patterns vs. DLBCL and FL samples.]

21

Examples of p53 responsive genes patterns

Gene symbols used in the patterns: MCM7, CCNB1, BCL2A1, CCNE1, KIAA0101, CDC2, CBS, E2F5

Pattern | Conditions | Prevalence (%), training set: Pos / Neg | Prevalence (%), test set: Pos / Neg

P1 | > -0.66, > -0.89 | 93 / 11 | 86 / 29
P2 | > -0.66, > -0.78 | 90 / 11 | 71 / 29
P3 | > -0.8, > -0.33 | 69 / 11 | 64 / 14
N1 | ≤ -0.66 | 3 / 74 | 14 / 71
N2 | ≤ -0.56, ≤ -0.18 | 3 / 68 | 21 / 57
N3 | ≤ -0.11, > 0.11 | 3 / 68 | 7 / 71

WI data:

Each DLBCL case satisfies one of the patterns P1, P2, P3

Each FL case satisfies one of the patterns N1, N2, N3

22

p53 combinatorial biomarker

77% FL & 21% DLBCL cases (3.7 fold): at most one gene over-expressed

79% DLBCL & 23% FL cases (3.4 fold): at least two genes over-expressed

[Bar chart: % cases (0 to 90) for DLBCL and FL vs. # of over-expressed genes in DLBCL vs. FL (p53, PLK1, CDK2); bins: <= 1 and >= 2.]

Each individual gene: over-expressed in about 40-70% DLBCL & 20-40% FL (specificity 50-60%, sensitivity 60-70%)
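The counting rule behind this biomarker can be sketched as below: a case is called DLBCL when at least two of the three genes are over-expressed, and FL otherwise. How "over-expressed" is decided for each gene is not stated in the slides, so the sketch takes Boolean flags as input; the function names and example cases are hypothetical. The fold values are simply the ratios of the quoted percentages.

```python
GENES = ("p53", "PLK1", "CDK2")


def count_rule(overexpressed, min_genes=2):
    """Classify a case from per-gene over-expression flags (True/False)."""
    n_over = sum(bool(overexpressed[g]) for g in GENES)
    return "DLBCL" if n_over >= min_genes else "FL"


def fold(pct_a, pct_b):
    """Enrichment of one class over the other, rounded to one decimal."""
    return round(pct_a / pct_b, 1)


# Hypothetical cases.
print(count_rule({"p53": True, "PLK1": True, "CDK2": False}))   # DLBCL
print(count_rule({"p53": False, "PLK1": True, "CDK2": False}))  # FL

# Ratios of the percentages quoted on the slide.
print(fold(77, 21), fold(79, 23))  # 3.7 3.4
```

Combining the three genes gives a sharper separation than any single gene, which is the point of the slide's comparison with the per-gene figures.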

23

What are these genes?

Plk1 (stpk13): polo-like kinase, serine threonine protein kinase 13, M-phase specific

• cell transformation, neoplastic; drives quiescent cells into mitosis
• over-expressed in various human tumors
• Takai et al., Oncogene, 2005: plk1 potential target for cancer therapy, new prognostic marker for cancer
• Mito et al, Leuk Lymph, 2005: plk1 biomarker for DLBCL

Cdk2 (p33): cyclin-dependent kinase: G2/M transition of mitotic cell cycle, interacts with cyclins A, B3, D, E

P53 tumor suppressor gene (Levine 1982)

24

Conclusions

Pattern-based meta-classifier is robust against noise

Good prediction of FL DLBCL

Biology based analysis also possible

Yields useful biomarker

Should study biologically motivated sets of genes to build pathways

25

Thank you for your attention!