NetBioSIG2014 talk by David Amar


DESCRIPTION

NetBioSIG2014 at ISMB in Boston, MA, USA on July 11, 2014

TRANSCRIPT

1

Pathways as robust biomarkers for cancer classification: the power of big expression data

David Amar, Tom Hait, and Ron Shamir

Blavatnik School of Computer Science

Tel Aviv University

2

Motivation and introduction

3

Comparative genomics
Standard expression experiments: cases vs. controls -> differential genes -> interpretation
Problems:
- Small number of samples
- Non-specific signal
- Interpretation of a gene set / gene ranking
Goal: find specific changes for a tested disease, e.g., an up-regulated pathway. Crucial for clinical studies.

Previous integrative classification studies: Huang et al. 2010 PNAS (9,160 samples); Schmid et al. 2012 PNAS (3,030); Lee et al. 2013 Bioinformatics (~14,000)
- Multi-label classification
- Global expression patterns
- Only 1-3 platforms
- Many datasets were removed from GEO
- No "healthy" class (Huang); no diseases (Lee)

Pathprint (Altschuler et al. 2013)
- Uses pathways
- Tissue classification (as in Lee et al.)

4

Integrating pathways and molecular profiles
- Enrichment tests: improve interpretability
- GSEA/GSA: rank based, higher statistical power
- Classification: extract pathway features. Example: given a pathway, remove non-differential genes
- Not clear whether prediction performance improves compared to using genes (Staiger et al. 2013)

5

6

Pathway-based gene expression database

- Pathways: KEGG, Reactome, Biocarta, NCI
- Expression profiles: GSE, GDS, TCGA
- Sample labels: disease, dataset/sample description

[Schematic] Single sample - single pathway analysis: for each pathway, compute its mean and SD over the sample's gene scores, producing per platform a pathway-feature matrix X (samples x pathway features) and a label matrix Y.

[Schematic] Single sample analysis: for sample j, the ranked genes/transcripts g1, g2, g3, ..., gk are converted into weighted ranks w1, w2, w3, ..., wk, yielding a standardized profile running from low to high expression.

7

Single sample analysis
Input: an expression profile of a sample - a vector of real values for each patient
Step 1: rank the genes
Step 2: calculate a score for each gene from the rank of gene g in sample s and the total number of ranked genes (Yang et al. 2012, 2013)
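
A minimal sketch of this per-sample scoring, assuming the score is an exponential weight of the gene's normalized rank (the exact weighting function is not given in the transcript); the function name and use of NumPy are illustrative.

```python
# Minimal sketch of the single-sample transformation described above.
# Assumption: the score is an exponential weight of the normalized rank;
# the exact weighting used in the talk is not given in the transcript.
import numpy as np

def single_sample_scores(expression: np.ndarray) -> np.ndarray:
    """Turn one sample's expression vector into rank-based gene scores."""
    k = expression.size                           # total number of ranked genes
    # Step 1: rank the genes (rank 1 = most highly expressed gene).
    ranks = k - np.argsort(np.argsort(expression))
    # Step 2: score each gene from its rank and the total number of genes.
    return np.exp(-ranks / k)                     # scores in (0, 1]; high expression -> near 1
```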

8

Pathway features
- 1,723 pathways in total, covering 7,842 genes; mean size 36.35 (median 15)
- Score all genes that are in the pathway databases
- Pathway statistics: mean score, standard deviation, skewness, KS test
- Pathway DBs: KEGG, Reactome, Biocarta, NCI
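
A short sketch of how the pathway statistics listed above could be computed from one sample's gene scores; the helper name pathway_features and the use of SciPy are illustrative, not taken from the talk.

```python
# Sketch: summarize one sample's gene scores over one pathway's member genes,
# producing the four statistics listed on the slide (mean, SD, skewness, KS test).
import numpy as np
from scipy import stats

def pathway_features(gene_scores: dict, pathway_genes: set) -> dict:
    in_path = np.array([s for g, s in gene_scores.items() if g in pathway_genes])
    background = np.array(list(gene_scores.values()))
    # KS test: do the pathway genes' scores deviate from the overall score distribution?
    ks_stat, _ = stats.ks_2samp(in_path, background)
    return {
        "mean": in_path.mean(),
        "sd": in_path.std(ddof=1),
        "skewness": stats.skew(in_path),
        "ks": ks_stat,
    }
```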

9

Patient labels
- Unite ~180 datasets, >14,000 samples
- Public databases contain 'free text'
- Problem: automatic mapping fails. Example: GDS4358, "lymph-node biopsies from classic Hodgkins lymphoma HIV- patients before ABVD chemotherapy"; MetaMap's top-scoring term is "HIV infections"
- Solution: manual analysis - read the descriptions and papers

10

Current microarray data
- Data from GEO: 13,314 samples, 17 platforms
- Sample annotation: ignore terms with fewer than 100 samples or 5 datasets, leaving 48 disease terms

[Schematic] Pathway-feature matrix X (samples x pathway features) and label matrix Y (samples x disease terms, {0,1})

11

12

Analysis and results

13

Multi-label classification algorithms
- Learn a single classifier for each disease; ignores class dependencies
- Adaptation: Bayesian correction - learn single classifiers, then correct errors using the DO DAG
- Transformation: use the label power sets and learn a multiclass model; using RF: multi-label trees
- Was better than most approaches in an experimental study (Madjarov et al. 2012)
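
A hedged sketch of the two strategies above using scikit-learn: one random-forest classifier per disease, and the label power-set transformation. It assumes a pathway-feature matrix X and a binary label matrix Y are already built, and is not the talk's exact pipeline.

```python
# Sketch of two multi-label strategies: (i) one binary random forest per
# disease term, ignoring class dependencies, and (ii) the label power-set
# transformation into a single multiclass random forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

def binary_relevance(X, Y):
    """One RF per disease term (columns of Y); class dependencies are ignored."""
    model = MultiOutputClassifier(RandomForestClassifier(n_estimators=500))
    return model.fit(X, Y)

def label_powerset(X, Y):
    """Treat each distinct label combination as one class of a multiclass RF."""
    combos, y_multiclass = np.unique(Y, axis=0, return_inverse=True)
    clf = RandomForestClassifier(n_estimators=500).fit(X, y_multiclass)
    return clf, combos
```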

14

How to validate a classifier?
- Use leave-dataset-out cross-validation
- Global AUC scores: each prediction Pij vs. the correct label Yij
- Disease-based AUC scores: consider each column separately

[Schematic] Test set: the label matrix Y (samples x disease terms, {0,1}) and the output of the multi-label learner, a probability matrix P (samples x disease terms, values in [0,1])
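
A sketch of leave-dataset-out cross-validation with the two AUC summaries above; `dataset_ids` and the `fit_predict` callback are assumed helpers, not part of the original method description.

```python
# Sketch: hold out one dataset at a time, train on the rest, then compute a
# global AUC over all (sample, disease) entries and one AUC per disease column.
import numpy as np
from sklearn.metrics import roc_auc_score

def leave_dataset_out_auc(X, Y, dataset_ids, fit_predict):
    P = np.zeros_like(Y, dtype=float)
    for d in np.unique(dataset_ids):
        test = dataset_ids == d
        # Train on all other datasets, predict probabilities for the held-out one.
        P[test] = fit_predict(X[~test], Y[~test], X[test])
    # Global AUC: every prediction P_ij against its correct label Y_ij.
    global_auc = roc_auc_score(Y.ravel(), P.ravel())
    # Disease-based AUC: one score per column (disease term) with both classes present.
    per_disease = [roc_auc_score(Y[:, j], P[:, j])
                   for j in range(Y.shape[1]) if len(np.unique(Y[:, j])) == 2]
    return global_auc, per_disease
```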

15

A problem (!)
What is in the background? For a disease D define:
- Positives: disease samples
- Negatives: direct controls
- Background controls
Example: 500 positives, 500 negatives, 10,000 BGCs

16

Multistep validation
- It is recommended to use several scores (Lee et al. 2013)
- Measure global AUPR
- For each disease we calculate three scores:

Measure | Used (additional) information
AUPR: check separation between positives and all others | Sick vs. not sick
ROC: test for separation between positives and negatives | Direct use of negatives
Meta-analysis p-value: calculate the overall separation significance within the original datasets | Mapping of samples to datasets
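
A sketch of the three per-disease scores in the table above; the per-dataset test (Wilcoxon rank-sum) and the way its p-values are combined (Fisher's method) are assumptions made for illustration, as the transcript does not specify them.

```python
# Sketch: AUPR vs. all other samples, ROC vs. direct negatives only, and a
# meta-analysis p-value over the original datasets (assumed: rank-sum tests
# combined with Fisher's method).
import numpy as np
from scipy import stats
from sklearn.metrics import average_precision_score, roc_auc_score

def disease_scores(p, y_pos, y_neg, dataset_ids):
    """p: predicted probabilities; y_pos / y_neg: boolean masks of positives and
    direct negatives; all remaining samples are background controls."""
    # AUPR: positives vs. everything else (sick vs. not sick).
    aupr = average_precision_score(y_pos.astype(int), p)
    # ROC: positives vs. direct negatives (direct use of negatives).
    mask = y_pos | y_neg
    roc = roc_auc_score(y_pos[mask].astype(int), p[mask])
    # Meta-analysis: test separation within each original dataset, then combine.
    pvals = []
    for d in np.unique(dataset_ids[mask]):
        in_d = mask & (dataset_ids == d)
        pos, neg = p[in_d & y_pos], p[in_d & y_neg]
        if len(pos) and len(neg):
            pvals.append(stats.ranksums(pos, neg, alternative="greater").pvalue)
    _, meta_p = stats.combine_pvalues(pvals, method="fisher")
    return aupr, roc, meta_p
```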

17

Performance results

[Figure] Per-disease ROC (positives vs. negatives) and AUPR; filled boxes mark diseases with meta-analysis q-value < 0.001

18

Performance results

8.5% improvement in recall, 12% in precision, compared to Huang et al.

Validation on RNA-Seq: data from TCGA, 1,699 samples

19

Pathway-Disease network
Steps (for each of the selected diseases):
1. Disease-pathway edges: (a) RF importance - select the top features; (b) test for disease relevance
2. Add edges between diseases: use the DO structure
3. Add edges between pathways: based on significant overlap in genes
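
A sketch of two of the steps above: picking top pathway features by random-forest importance, and scoring pathway-pathway gene overlap. The transcript says "significant overlap in genes" without naming a test, so the hypergeometric test and the cutoff of 10 top features are assumptions.

```python
# Sketch: (1) rank pathway features by a fitted random forest's importances,
# (2) compute a hypergeometric tail probability for the gene overlap of two pathways.
import numpy as np
from scipy import stats

def top_pathways_by_importance(rf, pathway_names, n_top=10):
    """rf: a fitted RandomForestClassifier trained on pathway features."""
    order = np.argsort(rf.feature_importances_)[::-1][:n_top]
    return [pathway_names[i] for i in order]

def pathway_overlap_pvalue(genes_a, genes_b, n_universe):
    """Probability of seeing at least this many shared genes by chance."""
    a, b = set(genes_a), set(genes_b)
    overlap = len(a & b)
    return stats.hypergeom.sf(overlap - 1, n_universe, len(a), len(b))
```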

20

Cancer network

[Network figure; legend: Down / Up]

23

Cardiovascular disease

[Network figure; legend: Down / Up]

Gastric cancers

25

Summary
- Large-scale integration
- Multi-label learning
- Careful validation
- Pathway-based features as biomarkers
- Summary of the results in a network
Currently: adding genes to overcome missing values; shows improvement in validation

Acknowledgements: Ron Shamir, Tom Hait
