Inference of patient-specific pathway activities from multi-dimensional cancer genomics data
using PARADIGM. Bioinformatics, 2010
C. J. Vaske et al.
May 22, 2013
Presented by: Rami Eitan
Complex Genomic Rearrangements
• Cancer tissues experience molecular changes
• Varied genomic data are available
  – copy number variations
  – mutations, gene expression
• Stratification of cancers can improve:
  – diagnosis
  – prognosis
  – risk assessment
  – response to treatment
Complex Genomic Rearrangements
• Genetic alterations differ between patients
• The affected pathways are often shared
Pathways
• What is a pathway?
Figure: The p53 pathway
Pathways
• A pathway is a set of interactions between entities, logically grouped together around a biological process.
• Entities: protein-coding genes, small molecules, complexes, gene families, abstract processes
• Available databases: Reactome, KEGG, NCI
Motivation
• Integrative analysis of cancer genome data
  – Copy number variations, gene expression
• Leverage pathway information to find frequently occurring pathway perturbations
  – NCI Pathway Interaction Database, KEGG, etc.
Observed Data
Figure: Gene expression
Figure: Copy number
Motivation
• Pathway information captures how genes are expected to behave
Input
• Infer integrated pathway activities (IPAs)
• Produce a matrix A, where A_ij is the inferred activity of entity i in patient j
PARADIGM
Factor graph
• A factor graph is a probabilistic graphical model.
• It consists of variables and factors.
Figure: A simple factor graph
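To make the variable/factor idea concrete, here is a minimal sketch of a two-variable factor graph with a brute-force marginal. The three states (-1, 0, 1) follow the slides; the variable names, factor values, and the `marginal` helper are hypothetical and chosen only for illustration. Real inference would use message passing rather than enumeration.

```python
from itertools import product

STATES = (-1, 0, 1)

def marginal(variables, factors, target):
    """Brute-force marginal of `target`: sum the product of all
    factors over every joint assignment, then normalize."""
    totals = {s: 0.0 for s in STATES}
    for values in product(STATES, repeat=len(variables)):
        assignment = dict(zip(variables, values))
        weight = 1.0
        for scope, fn in factors:
            weight *= fn(*(assignment[v] for v in scope))
        totals[assignment[target]] += weight
    z = sum(totals.values())  # normalization constant
    return {s: w / z for s, w in totals.items()}

# Hypothetical example: a unary factor on "a" and a pairwise factor
# that rewards "b" for agreeing with "a".
factors = [
    (("a",), lambda a: 0.8 if a == 1 else 0.1),
    (("a", "b"), lambda a, b: 0.9 if a == b else 0.05),
]
b_marginal = marginal(["a", "b"], factors, "b")
print(b_marginal)
```

Enumeration is exponential in the number of variables, so it only serves to show what a marginal means on a toy graph.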
PARADIGM Model
• Factor graph representation of the various entities corresponding to a single gene
PARADIGM Model: Gene Interactions
PARADIGM Model:
• A factor graph for a pathway
Model Specification
• Convert an NCI pathway into a factor graph
  – NCI pathway to Bayesian network
• Directed network
• Each variable takes a value of -1 (deactivation), 0 (normal), or 1 (activation)
  – mRNA: overexpression for activation
  – Copy number variations: more than two copies for activation
• Probability distribution of each node
  – Labeled edges for positive/negative interactions
  – Set the value of the child node as a weighted vote from its parents
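The weighted vote can be sketched as follows. This is a simplified illustration, not the exact rule from the paper: the function name, the +1/-1 encoding of edge labels, and the default equal weights are all assumptions.

```python
def expected_child_state(parent_states, edge_signs, weights=None):
    """Return the child's expected state (-1, 0, or 1) from its parents.

    parent_states: states of the parents, each in {-1, 0, 1}
    edge_signs:    +1 for a positive (activating) edge,
                   -1 for a negative (inhibiting) edge
    weights:       optional per-edge weights (assumed equal if omitted)
    """
    if weights is None:
        weights = [1.0] * len(parent_states)
    vote = sum(w * s * x for w, s, x in zip(weights, edge_signs, parent_states))
    # Sign of the vote: 1 if positive, -1 if negative, 0 on a tie.
    return (vote > 0) - (vote < 0)

# An active activator, an active inhibitor, and an inactive activator:
# the votes are +1, -1, and -1, so the child is expected to be inactive.
print(expected_child_state([1, 1, -1], [1, -1, 1]))
```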
Model Specification
• Converting the Bayesian network to a factor graph
  – Assign a factor to each group of variables consisting of a node and its parents
• Z: normalization constant
• ε = 0.001
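Z and ε belong to the joint distribution and the per-node factor. Assuming the definitions used in the PARADIGM paper, they read roughly as follows, where X_j is the set of variables in factor j and m is the number of factors (notation assumed here):

```latex
P(X) = \frac{1}{Z} \prod_{j=1}^{m} \phi_j(X_j)

\phi_i\bigl(x_i, \mathrm{Parents}(x_i)\bigr) =
\begin{cases}
1 - \epsilon & \text{if } x_i \text{ is the state expected from } \mathrm{Parents}(x_i) \\
\epsilon / 2 & \text{otherwise}
\end{cases}
```

With ε = 0.001, a node almost always takes the state its parents vote for, and the remaining probability is split between the other two states.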
Inference
• Observed variables: copy number variations, gene expression
• Unobserved variables: protein, protein activity, overall pathway activity state
• Learn model parameters with the EM algorithm
  – E step: infer the probabilities of the unobserved variables
  – M step: update the parameters to maximize the likelihood given those probabilities
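The E and M steps can be sketched for a single hidden variable. This is a toy illustration, not PARADIGM's implementation: it assumes each sample already comes with a prior over the hidden state (standing in for what the rest of the factor graph would provide) and learns only a hypothetical table P(observed | hidden). All names and starting values are illustrative.

```python
STATES = (-1, 0, 1)

def em_observation_table(priors, observations, n_iter=50, pseudo=1e-3):
    """Learn table[h][o] = P(observed = o | hidden = h) with EM.

    priors:       one dict per sample, hidden state -> prior probability
    observations: one observed state per sample
    pseudo:       small pseudocount that keeps every entry positive
    """
    # Start from a table that mildly favors agreement (assumed initial guess).
    table = {h: {o: (0.6 if o == h else 0.2) for o in STATES} for h in STATES}
    for _ in range(n_iter):
        # E step: posterior over the hidden state for each sample.
        posteriors = []
        for prior, o in zip(priors, observations):
            w = {h: prior[h] * table[h][o] for h in STATES}
            z = sum(w.values())
            posteriors.append({h: w[h] / z for h in STATES})
        # M step: re-estimate each row from the expected counts.
        for h in STATES:
            counts = {o: pseudo for o in STATES}
            for q, o in zip(posteriors, observations):
                counts[o] += q[h]
            total = sum(counts.values())
            table[h] = {o: counts[o] / total for o in STATES}
    return table

# Hypothetical data: five samples that look activated, five deactivated.
priors = [{-1: 0.05, 0: 0.05, 1: 0.9}] * 5 + [{-1: 0.9, 0: 0.05, 1: 0.05}] * 5
observations = [1] * 5 + [-1] * 5
table = em_observation_table(priors, observations)
print(table[1])
```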
Expectation Maximization
Figure: EM algorithm
Log-likelihood Ratio Test
• Test statistic for assessing entity i's activity given data D
  – The probabilities can be obtained by performing inference on the factor graph
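Assuming the form used in the PARADIGM paper, the test statistic for entity i taking value a can be written as below. Φ denotes the model parameters (notation assumed here). The second equality follows from Bayes' rule and shows why posterior and prior marginals from the factor graph are enough to compute it:

```latex
L(i, a) = \log \frac{P(D \mid x_i = a, \Phi)}{P(D \mid x_i \neq a, \Phi)}
        = \log \frac{P(x_i = a \mid D, \Phi)}{P(x_i \neq a \mid D, \Phi)}
        - \log \frac{P(x_i = a \mid \Phi)}{P(x_i \neq a \mid \Phi)}
```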
Significance assessment
• Permute the labels of the observed data
• "Within" permutation: choose random genes from the same pathway
• "Any" permutation: choose any random genes
• 1000 permutations of each type are used to determine the null distribution
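The two permutation schemes and the resulting significance estimate can be sketched as follows. This is a simplified, single-sample illustration: the function names, the dict-based data layout, and the two-sided empirical p-value are assumptions, not the authors' code.

```python
import random

def permuted_sample(observed, pathway_genes, all_genes, mode, rng):
    """Build one permuted sample for a pathway.

    observed: dict mapping gene -> observed value
    mode:     "within" draws donor genes from the same pathway,
              "any" draws donor genes from all genes
    """
    pool = pathway_genes if mode == "within" else all_genes
    return {gene: observed[rng.choice(pool)] for gene in pathway_genes}

def empirical_p_value(real_score, null_scores):
    """Fraction of null scores at least as extreme as the real score,
    with a +1 correction so the estimate is never exactly zero."""
    hits = sum(1 for s in null_scores if abs(s) >= abs(real_score))
    return (hits + 1) / (len(null_scores) + 1)

# Hypothetical toy data.
rng = random.Random(0)
observed = {"A": 1, "B": -1, "C": 0, "D": 1}
sample = permuted_sample(observed, ["A", "B"], ["A", "B", "C", "D"], "within", rng)
print(sample)
print(empirical_p_value(3.0, [0.1, -0.2, 5.0]))
```

In practice each permuted sample would be run through inference to obtain a null score, repeated 1000 times per permutation type as stated above.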
Decoy paths
• Create decoy paths by replacing genes with random genes
• Maintain the same structure
• All complexes and abstract processes remain the same
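Decoy construction can be sketched as a relabeling of the gene nodes. The edge-list representation, the function name, and the choice to sample distinct replacement genes are assumptions made for illustration; the gene and process names in the example are toy values.

```python
import random

def make_decoy(edges, pathway_genes, genome_genes, rng):
    """Replace every gene in a pathway with a random gene, keeping the
    structure. Entities that are not in `pathway_genes` (complexes,
    abstract processes) are left unchanged.

    edges: list of (source, target, label) tuples
    """
    replacements = rng.sample(genome_genes, len(pathway_genes))
    mapping = dict(zip(pathway_genes, replacements))
    return [
        (mapping.get(src, src), mapping.get(dst, dst), label)
        for src, dst, label in edges
    ]

# Hypothetical toy pathway with two genes and one abstract process.
edges = [
    ("TP53", "MDM2", "activates"),
    ("MDM2", "TP53", "inhibits"),
    ("TP53", "apoptosis", "activates"),
]
decoy = make_decoy(edges, ["TP53", "MDM2"], ["G1", "G2", "G3", "G4"], random.Random(1))
print(decoy)
```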
Log-likelihood Ratio Test
• Aggregating over the multiple values entity i can take
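One way to aggregate the three per-state log-likelihood ratios into a single signed score is sketched below. It assumes the rule described in the PARADIGM paper: report the ratio for activation if it is the largest, the negated ratio for deactivation if that is the largest, and 0 otherwise. The function names and the dict-based inputs are illustrative.

```python
import math

def log_likelihood_ratio(posterior, prior, a):
    """Log posterior odds minus log prior odds of state `a`.
    Probabilities must lie strictly between 0 and 1."""
    post_odds = posterior[a] / (1 - posterior[a])
    prior_odds = prior[a] / (1 - prior[a])
    return math.log(post_odds) - math.log(prior_odds)

def ipa(posterior, prior):
    """Signed activity score for one entity in one patient."""
    ratio = {a: log_likelihood_ratio(posterior, prior, a) for a in (-1, 0, 1)}
    if ratio[1] > ratio[-1] and ratio[1] > ratio[0]:
        return ratio[1]
    if ratio[-1] > ratio[1] and ratio[-1] > ratio[0]:
        return -ratio[-1]
    return 0.0

# Hypothetical marginals: uniform prior, posterior favoring activation.
prior = {-1: 1 / 3, 0: 1 / 3, 1: 1 / 3}
posterior = {-1: 0.1, 0: 0.1, 1: 0.8}
print(ipa(posterior, prior))
```

Collecting this score for every entity i and patient j would fill the matrix A described earlier.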
Dataset
• Breast cancer copy number and gene expression data
• TCGA Glioblastoma copy number and gene expression data
• Pathways from the NCI Pathway Interaction Database (PID)
Results - breast cancer
• Breast cancer dataset:
  – 56172 IPAs (7%) found to be significantly higher
  – 497 significant entities per patient on average
  – 103 out of 127 pathways had at least one entity altered in 20% or more of the patients
Results - GBM
• GBM dataset:
  – 141682 IPAs (9%) found to be significantly higher
  – 616 significant entities per patient on average
  – 110 out of 127 pathways had at least one entity altered in 20% or more of the patients
EM Convergence
• Original data vs. permuted data
Red: real data; green: permuted data
Results - decoy paths
Distinguishing decoy from real pathways
Figure: PARADIGM vs. SPIA: FP rate
Results - decoy paths
• Distinguishing decoy from real pathways
• Breast cancer AUC:
  – PARADIGM: 0.669
  – SPIA: 0.602
• GBM AUC:
  – PARADIGM: 0.642
  – SPIA: 0.604
Top PARADIGM Pathways of Breast Cancer
Top PARADIGM Pathways of Glioblastoma
Glioblastoma Subtypes
Survival Rates for Each Subtype
Results - Patient vs permutation
Figure: Patient vs. permuted IPAs
Results - Patient vs permutation
Figure: Patient vs. permuted IPAs. Source: Broad Institute/Dana-Farber Cancer Institute/Harvard Medical School
Summary
• PARADIGM integrates different types of data, including gene expression, copy number variation, and pathway databases, in order to infer pathway activities for individual cancer patients.
  – Factor graph model for representing pathways and modeling datasets
  – Pathway activities inferred by PARADIGM can be used to identify cancer subtypes
Questions
Discussion
• Can the method be successfully extended to more types of observed data?
• Instead of using the pathways as is, can this method be used to find new pathways and interactions?