The MORPH Algorithm
MORPH = MOdule guided Ranking of candidate PatHway genes
high throughput data
Slides: Rachel E. Bell, June 2013
Motivation
Challenges in studying biological pathways
• Identify missing pathway members
• Information gaps on participating genes:
  a) e.g. nature of interactions between metabolites and gene expression
  b) understanding control mechanisms, feedback, cross-talk
• Many genes in genome(s) have unknown function
Biological Pathways: Overview
What is a pathway?
A series of interactions between genes (proteins) involved in performing a certain biological function
Cell input = extracellular/endogenous: e.g. stress, changes in pH, UV exposure, nutrients
Cell output = response: e.g. transcription of genes, sucrose degradation
MORPH Algorithm: Overview
INPUT: High throughput data of gene expression, networks and biological pathways
ALGORITHM: Machine learning and validation methods
OUTPUT: Predict genes involved in biological pathways
Other methods for functional prediction
Coexpression-based methods (& possibly pathways)
e.g.: ACT, GeneCat, ATTED-II, MapMan
Assumptions:
1) Similar expression patterns -> similar function or regulation
2) Pathway genes -> coordinated expression
Network-based methods (& gene expression)
e.g.: Markov random field (MRF) models, k-nearest neighbours (k-NN), ADOMETA: coexpression, phylogeny, clustering on chromosomes, metabolic networks
Assumption: Closer nodes -> common functions
Introduction: MORPH Algorithm
MORPH uses pathway information, gene expression data and network information
Compared to other methods, MORPH:
• offers robustness (performs well on many pathways)
• increases network coverage
• is applied to different organisms
Talk outline
1. MORPH input types: (a) gene expression data, (b) pathways and (c) networks
2. Types of clustering (modules) methods
3. The MORPH algorithm and validation
4. Results
5. Comparison to other methods
6. Summary
MORPH Introduction
MORPH was developed on 2 model organisms:
• Arabidopsis thaliana
• Solanum lycopersicum (Tomato)
MORPH Input: Arabidopsis thaliana
Pathways: 66 AraCyc, 164 MapMan
Preprocessing: filter pathways with <10 genes with expression data
Total 230 pathways, 2 sets
Gene Expression datasets: seedlings, tissues (leaves, roots, flowers, seeds), seed developmental stages, DS1
Preprocessing: filter low variance and detection call, average replicates, normalize to controls, standardize experiments
Total 216 GE profiles, 4 datasets, ~12500 genes
MORPH Input: Arabidopsis thaliana
Metabolic (MD) Network (AraCyc)
Node = metabolic genes (enzymes)
Edges = nodes share a metabolite (reactant or product)
Preprocessing: remove most common metabolites (they connect enzymes with weak functional associations)
Total: 1987 genes, 56244 interactions
PPI Network (PAIR & Interactome Map databases)
Node = genes (proteins)
Edges = interactions between proteins
Preprocessing: unite (predicted & experimental) interactions from both databases
Total: 4642 genes, 149229 interactions
MORPH Goal
Given a specific biological pathway, MORPH seeks candidate genes that participate in (or regulate) the pathway.
A key step in MORPH is the partitioning of genes into modules (clusters).
MORPH receives 3 types of input:
1. Pathways
2. Gene expression data
3. Partitioning into modules
Assumptions of clustering data into modules
Q: Why use modules?
• Modules reflect broad functions
• Some functions are related to target pathway
• Pathway genes -> more coordinated expression than random genes
Input: Partitioning Gene Modules and Networks
Different strategies for partitioning genes
Expression based clustering
• SOM = self-organizing map (partitions all genes)
• CLICK = CLuster Identification via Connectivity Kernels (partitions most genes)
Network based clustering
• Matisse*
• Markov cluster algorithm (MCL)
Annotation based clustering
• Enzyme/not enzyme
• Orthologs in rice & maize/no orthologs
Input: Partitioning Networks
Goal: construct modules using gene expression data and networks
Reminder: MATISSE seeks connected sub-networks with high expression similarity (Ulitsky & Shamir, 2007)
Figure legend: Interaction; High expression similarity
Problem: low coverage of MD network
Input: Partitioning Networks - MATISSE*
Motivation - overcome low coverage of networks
MATISSE* (modified MATISSE)
• Add genes with high correlation
• Repeat until module correlation <0.4
• Connectivity ignored
Results:
• Matisse* increased MD network coverage to ~4500 genes
• Matisse* performed similarly to Matisse
Summary: Methods of Partitioning Gene Modules and Networks
Gene expression-based clustering
  Clustering algorithm            Method
  SOM                             Co-expression
  CLICK                           Co-expression
Modules using network data
  Clustering algorithm            Network
  Markov cluster process (MCL)    PPI
  MATISSE*                        PPI
  MATISSE*                        MD network
Annotation-based clustering
  Bipartition                     Categories
  Enzymes                         Y/N
  Orthologs                       Y/N
No clustering - single module
Total of 8 clustering solutions
MORPH = MOdule guided Ranking of candidate PatHway genes
MORPH is an algorithm for prioritizing novel candidate genes in a given specific pathway.
Input:
1. Pathway genes S = {s1,s2,…,sl}
2. Gene expression profiles
3. Partition solution for genes with gene expression data: k modules = M1,…,Mk
4. Similarity function (D): Pearson/Spearman
Module-Guided Ranking Algorithm
Step #1: Partition genes into k modules M1,M2,…,Mk
Step #2:
• Identify pathway genes s1,s2,…,sl and candidate genes g
• Ignore modules with no pathway genes
• Add module for non-partitioned pathway genes
Step #3: Analyze each module separately
Step #4: For each g (candidate gene) in module Mi calculate mean similarity with sj (pathway genes) using gene expression data
• Similarity function (Pearson's corr.) is computed between candidate genes and pathway genes in the pre-defined module
• Provides ranking within module
Module-Guided Ranking Algorithm
Step #5: Standardize mean similarity scores within each module (z-score of each candidate gene, using the mean and stdev of the mean similarity scores of all candidate genes in module Mi)
Step #6: Rank all candidate genes (using standardized z-scores)
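Steps #1 to #6 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `morph_rank`, the dict/list data layout, and the choice to skip modules with fewer than two candidates or zero score variance are all assumptions made for the example. The sketch also omits the extra module for non-partitioned pathway genes mentioned in Step #2.

```python
import numpy as np

def morph_rank(expr, modules, pathway):
    """Rank candidate genes for one pathway (illustrative sketch).

    expr    : dict mapping gene name -> 1-D NumPy array (expression profile)
    modules : list of lists of gene names (one partition solution)
    pathway : set of known pathway gene names
    """
    scored = []
    for module in modules:  # Step 3: analyze each module separately
        known = [g for g in module if g in pathway]
        candidates = [g for g in module if g not in pathway]
        # Step 2: ignore modules with no pathway genes. Assumption: modules
        # with fewer than two candidates are skipped too, since they cannot
        # be standardized.
        if not known or len(candidates) < 2:
            continue
        # Step 4: mean Pearson correlation of each candidate with the
        # pathway genes in the same module.
        means = np.array([
            np.mean([np.corrcoef(expr[c], expr[s])[0, 1] for s in known])
            for c in candidates
        ])
        # Step 5: standardize (z-score) within the module.
        sd = means.std()
        if sd == 0:
            continue
        z = (means - means.mean()) / sd
        scored.extend(zip(candidates, z.tolist()))
    # Step 6: one global ranking over all modules, best candidates first.
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Calling `morph_rank(expr, modules, pathway)` returns `(gene, z_score)` pairs; because scores are standardized per module, candidates from different modules become comparable in the final ranking.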
How do we assess predictions of many pathways?
Given a clustering solution AND gene dataset, run algorithm for each pathway (Arabidopsis thaliana: 230 pathways)
Assessment of pathways using Leave-One-Out Cross-Validation (LOOCV) procedure (Kharchenko et al., 2006)
Leave-One-Out Cross-Validation (LOOCV) procedure
LOOCV generates for each pathway gene -> SELF-RANK
Definition: SELF RANK of a gene is its position in ranking, when left out of algorithm calculation
Meaning: Self rank of pathway gene = its overall strength of association with remaining pathway genes
Self-Rank Curve: AUSR score
LOOCV procedure
For each pathway S:
1. Remove one gene (v) -> S\{v}
2. Consider S\{v} = known pathway set (v is the held-out test gene)
3. Generate ranking of v using S\{v}
4. Repeat for every v
• Calculate self-rank for all v in S
• Create self-rank plot
• Self-rank threshold of k=1..1000
• Calculate area under self-rank curve (AUSR)
Figure 2: Self-rank plot of the Carotenoid Biosynthetic Pathway (13 genes; SOM clustering solution), compared with a random gene set of size 13 genes. x-axis: self-rank threshold (k).
AUSR score assesses pathway solutions (given input combinations, discussed next)
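The LOOCV procedure and AUSR score above can be sketched as follows. This is a hedged illustration: `rank_fn` is a hypothetical callable (any ranker that takes the remaining pathway genes and returns an ordered list of candidate genes), and normalizing the area by the number of thresholds so that AUSR lies in [0, 1] is an assumption of this example.

```python
def loocv_self_ranks(pathway, rank_fn):
    """Self-rank of every pathway gene v: its 1-based position in the
    ranking produced from the remaining genes S \\ {v}."""
    ranks = []
    for v in pathway:
        ranking = rank_fn(pathway - {v})  # v is now an ordinary candidate
        # Genes that never appear in the ranking get an infinite self-rank.
        ranks.append(ranking.index(v) + 1 if v in ranking else float("inf"))
    return ranks

def self_rank_curve(self_ranks, max_k=1000):
    """For each threshold k = 1..max_k, the fraction of pathway genes
    whose self-rank is at most k."""
    n = len(self_ranks)
    return [sum(r <= k for r in self_ranks) / n for k in range(1, max_k + 1)]

def ausr(self_ranks, max_k=1000):
    """Area under the self-rank curve, divided by max_k (assumed
    normalization) so that a perfect pathway scores 1.0."""
    return sum(self_rank_curve(self_ranks, max_k)) / max_k
```

Under this normalization, a pathway whose genes all have self-rank 1 scores 1.0, and a pathway whose genes all rank beyond the threshold scores 0.0.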
FIGURE 3: Comparison of 2 gene expr. datasets
AUSR(seedlings) - AUSR(DS1)
Different: gene expression dataset
Same: MD network, Matisse*, 66 AraCyc Pathways
Inspired adoption of selection (learning configuration)
Different input produces different AUSR scores
Learning Configuration
Definition: Learning configuration = combination of: gene expression dataset (4) AND clustering solution (8)
Every pathway tested with gene expression dataset and partitioning solution (modules)
Total of 4x8 = 32 combinations
Machine Learning
MORPH applies a selection procedure
LOOCV used to select optimal learning configuration (i.e. data set and clustering) for each examined pathway.
LOOCV avoids overfitting, since test gene is left out.
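The selection procedure amounts to scoring every learning configuration and keeping the best one per pathway. The sketch below assumes a hypothetical callable `ausr_for(dataset, clustering)` that returns the LOOCV AUSR of the pathway under that configuration; the dataset and clustering labels are shorthand taken from the slides, not identifiers from the MORPH software.

```python
from itertools import product

DATASETS = ["seedlings", "tissues", "seed development", "DS1"]
CLUSTERINGS = [
    "SOM", "CLICK",                                  # expression based
    "MCL (PPI)", "MATISSE* (PPI)", "MATISSE* (MD)",  # network based
    "enzymes", "orthologs",                          # annotation based
    "no clustering",                                 # single module
]

def select_configuration(ausr_for, datasets=DATASETS, clusterings=CLUSTERINGS):
    """Return the (dataset, clustering) pair with the highest AUSR,
    together with the number of configurations that were tried."""
    configs = list(product(datasets, clusterings))
    best = max(configs, key=lambda cfg: ausr_for(*cfg))
    return best, len(configs)
```

With 4 datasets and 8 clustering solutions, `select_configuration` evaluates the 32 combinations listed above for each pathway.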
Comparison of selection process to other 'fixed' configurations
Figure 4: The average AUSR for each learning combination (gene expr. dataset + clustering solution), 66 AraCyc metabolic pathways
Results
• Better: enzymes or MD network (metabolic genes had higher corr.)
• Poorer: PPI network, no clustering, SOM, CLICK & Orthologs
Selection improved on all configurations
Robustness of selection method
Real vs. Random Pathways (66 AraCyc pathways)
Random pathways: randomly selected sets with same size (repeated 100 times for each size)
Results
• 29/66 AUSR > maximal random score
• AUSR > 0.75: 15/66 real pathways, 0 random
Figure 5: AUSR Scores of Real and Random Pathways (x-axis: sizes; y-axis: AUSR)
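The real-versus-random comparison can be sketched as below. The helper names and the `score_fn` callable (anything that maps a gene set to an AUSR score) are hypothetical; only the idea of drawing 100 random gene sets of the same size and comparing against the maximal random score comes from the slide.

```python
import random

def random_baseline(all_genes, size, score_fn, repeats=100, seed=0):
    """Score `repeats` random gene sets of the given size.

    all_genes : list of gene names to sample from
    score_fn  : hypothetical callable mapping a gene set to an AUSR score
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return [score_fn(set(rng.sample(all_genes, size))) for _ in range(repeats)]

def beats_random(real_score, random_scores):
    """True if the real pathway scores above the maximal random score."""
    return real_score > max(random_scores)
```

A pathway counted among the 29/66 would be one for which `beats_random` returns `True` against the random sets of its own size.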
Comparison of MORPH to other methods: Arabidopsis thaliana pathways
66 AraCyc Pathways
Input:
Gene expression: seeds, tissues, seedlings, DS1
Networks: PPI and MD networks
Pathways: AraCyc, MapMan
Methods compared:
• Coexpression (no network data) methods using reference datasets: ACT, DS1
• Markov Random Field (MRF) methods (network data)
  CMRF = total # of pathway genes in neighbourhood
  WMRF = total similarity with pathway genes in neighbourhood
• k-Nearest Neighbour (k-NN) (network data)
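The two neighbourhood scores defined above can be illustrated with a short sketch. It assumes the network is given as a dict of adjacency lists and that "similarity" means Pearson correlation of expression profiles; both are assumptions for this example, and the function name is hypothetical.

```python
import numpy as np

def neighbourhood_scores(gene, neighbours, pathway, expr):
    """Return (CMRF, WMRF) for one candidate gene.

    neighbours : dict mapping gene -> list of adjacent genes in the network
    pathway    : set of known pathway genes
    expr       : dict mapping gene -> 1-D NumPy array (expression profile)
    """
    pathway_nbrs = [n for n in neighbours.get(gene, ()) if n in pathway]
    # CMRF: number of pathway genes in the neighbourhood.
    cmrf = len(pathway_nbrs)
    # WMRF: summed similarity (assumed Pearson) with those pathway genes.
    wmrf = sum(np.corrcoef(expr[gene], expr[n])[0, 1] for n in pathway_nbrs)
    return cmrf, float(wmrf)
```

Both scores are zero for genes with no pathway neighbours, which is why network-only predictors cannot rank genes outside the network.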
Figures 4B & 4C
164 MapMan Pathways
AraCyc pathways with AUSR>0.8
MapMan pathways with AUSR>0.7
k-NN predictor complements MORPH
Figure 4D & 4E: Comparison to other methods
My analysis: AUSR scores of MORPH and k-NN
For high AUSR scores (>0.9), k-NN has twice as many pathways as MORPH (6 compared to 3)
Data retrieved from Supplemental Data Set 3
Carotenoid Pathway and the MORPH Candidate genes
Carotenoids are antioxidants, perform stress response functions
Candidate Genes (Numbered Octagons)
• 8/25 top candidates have predicted functions, with few details on their roles in plants
• Other predictions include genes with similar functions, e.g. response to oxidative stress
SQE3 - catalyzes the precursor of a pathway whose expression is coordinated with the carotenoid pathway
SPS2 - Plastoquinone pathway, essential for carotenoid pathway
Comparison of MORPH to other methods: 93 Tomato pathways
Figure 7: Predictors include MORPH, k-NN, MRF-based, and coexpression-based classifiers.
(A) Average and median AUSR scores.
(B) The number of pathways that had AUSR score above 0.7
Summary: Advantages of MORPH
1. Robust - different pathways
2. k-NN considers only genes in the network, MORPH increases network coverage
3. k-NN is more dependent on sub-network diameter (higher diameter, lower AUSR), MORPH is more robust
4. Self-rank k=1000 threshold for AUSR ignores poor pathway gene correlations
5. Potentially useful predictions
Summary: Drawbacks of MORPH
1. If pathway genes are not coherent, it is better to select the best/top module(s) than to average
2. Dependent on input quality (e.g. AraCyc > MapMan)
3. Predicts close pathways (drawback/advantage)
4. Requires known pathway info for predictions
Questions?
Top AUC scores for tested pathways
Pathway                                               Spearman AUC  Pearson AUC  Size
photosynthesis light reactions                        0.995115      0.994654     26
Chlorophyllide biosynthesis I                         0.952         0.950643     14
Carotenoids Core pathway                              0.859312      0.868158     13
tRNA charging pathway                                 0.832438      0.831844     32
gluconeogenesis                                       0.831634      0.833135     30
triacylglycerol degradation                           0.78642       0.770003     12
cysteine biosynthesis I                               0.785097      0.787916     11
fatty acid β-oxidation II (core pathway)              0.746601      0.752534     15
glycolysis I                                          0.742482      0.747914     44
glycolysis IV (plant cytosol)                         0.730273      0.74716      44
Calvin-Benson-Bassham cycle                           0.723338      0.729027     29
glucosinolate biosynthesis from homomethionine        0.721732      0.721641     11
homogalacturonan biosynthesis                         0.720999      0.729749     12
glucosinolate biosynthesis from hexahomomethionine    0.719277      0.719277     11
glucosinolate biosynthesis from pentahomomethionine   0.719277      0.719277     11
ethylene biosynthesis from methionine                 0.709665      0.766496     12
MORPH Classifications
3 types of input data:
• Pathway genes (s1,s2,…,sl): 66 Arabidopsis thaliana pathways
• Gene expression: 4 datasets
• Partition gene expression data into k modules = M1,…,Mk: 8 partitioning methods