mining coherent dense subgraphs across multiple biological networks

Mining Coherent Dense Subgraphs across Multiple Biological Networks

Vahid Mirjalili, CSE 891

• Motivation:
– Finding patterns across multiple networks to identify biological modules and predict function
• Current algorithms are too costly
• Developed a novel algorithm: CODENSE
– Scalable in the number and size of networks
– Adjustable for exact or approximate pattern mining

• Clustering can detect meaningful biological modules
– e.g. a dense protein interaction sub-network may correspond to a protein complex
– A dense co-expression sub-network may represent a co-expression cluster
• Biological modules are expected to be active across multiple conditions
• One idea: aggregate all the networks and identify dense sub-graphs in the aggregated network
– Risk of false positive detection

Aggregated graph: False positive in the aggregated graph

• Adding six graphs together and deleting the edges that occur fewer than 3 times gives the resulting summary graph

Solution to false positives in the summary graph

• Frequent sub-graphs
• Mine the dense sub-graphs directly in each original network
• A sub-graph is frequent if it occurs multiple times in a set of graphs
• In biological networks, each gene occurs only once in a graph, so there is no isomorphism problem

Frequent dense sub-graph

• A frequent dense sub-graph doesn't show accurate information
– Some edges in the frequent sub-graph shown above do not occur together in the same graphs of the original set
– It is more meaningful to divide it into two sub-graphs

Coherent Dense Sub-graphs

• All edges in a coherent sub-graph should have correlated occurrences in the original graph set

• CODENSE reduces the networks to 2 meta-graphs and performs clustering on these two graphs only (instead of on the individual networks)
– CODENSE can distinguish the two modules
– Good scalability
– Discovery of overlapping clusters

Overlapping Sub-graphs

• Partition-based clustering algorithms fail to identify overlapping sub-graphs

• Mining Overlapping Dense Sub-graphs (MODES)

Application
• Identify frequent co-expression clusters across multiple microarray datasets

Microarray dataset:
– Un-weighted, undirected graph
– Each node represents a gene
– Two genes are connected by an edge if they show high expression correlation
• A densely connected sub-graph corresponds to a tight co-expression cluster
• Clusters from a single microarray dataset include spurious links, and may not be homogeneous in function and regulation

Problem Formulation

• A relation graph dataset D contains n simple graphs, $D = \{G_1, \dots, G_n\}$ with $G_i = (V, E_i)$
– A common vertex set V is shared by the graphs
• Support(G): the number of graphs in the relation graph dataset (D) that contain G
• A graph is frequent if support(G) > threshold
• Summary graph: an un-weighted graph extracted from D, where an edge exists only if it occurs in more than k graphs in D
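The support and summary-graph definitions can be illustrated with a minimal sketch. The toy dataset, the function names, and the value of k below are hypothetical choices made for illustration; they are not taken from the slides.

```python
from collections import Counter


def edge_key(u, v):
    """Represent an undirected edge as an order-independent tuple."""
    return (u, v) if u <= v else (v, u)


def edge_support(graphs):
    """Count, for every edge, the number of graphs that contain it."""
    counts = Counter()
    for edges in graphs:
        # A set guards against counting a duplicated edge twice in one graph.
        counts.update({edge_key(u, v) for u, v in edges})
    return counts


def summary_graph(graphs, k):
    """Keep an edge only if it occurs in more than k graphs."""
    return {e for e, c in edge_support(graphs).items() if c > k}


# Hypothetical toy dataset: six graphs over the shared vertex set {a, b, c}.
D = [
    [("a", "b")], [("a", "b")], [("b", "a"), ("a", "c")],
    [("b", "c")], [("b", "c")], [("c", "b")],
]
support = edge_support(D)
summary = summary_graph(D, k=2)
```

Here the edge a-c occurs in only one graph and is dropped, while a-b and b-c each occur three times and survive. Both end up in the same summary graph even though they never occur in the same original graph.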

Problem Formulation

• Edge Support Vector: $w_i(e)$ is the weight of edge e in graph i (for an un-weighted graph it would be 0 or 1)

$$w(a, b) = [1,1,1,0,0,0], \qquad w(b, c) = [0,0,0,1,1,1]$$

• Second-Order Graph $S = (V \times V, E_s)$: each node represents an edge from the relation graph dataset (D), and an edge between nodes u and v exists if w(u) and w(v) are highly correlated
• For efficiency, only construct the S graph for a sub-graph of the summary graph
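The second-order graph construction can be sketched as follows. This is a rough illustration, assuming Pearson correlation between edge support vectors as the similarity measure and a hypothetical cutoff of 0.9. The toy dataset and all function names are invented for the example.

```python
from itertools import combinations
from math import sqrt


def support_vector(edge, graphs):
    """w(e): entry i is 1 if the edge occurs in graph i, else 0."""
    u, v = edge
    return [1 if (u, v) in g or (v, u) in g else 0 for g in graphs]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def second_order_graph(edges, graphs, threshold):
    """Connect two first-order edges if their support vectors correlate strongly."""
    vectors = {e: support_vector(e, graphs) for e in edges}
    return {
        (e1, e2)
        for e1, e2 in combinations(edges, 2)
        if pearson(vectors[e1], vectors[e2]) >= threshold
    }


# Hypothetical dataset: one module is active in graphs 1-3, another in graphs 4-6.
D = [{("a", "b"), ("a", "c")}] * 3 + [{("b", "c"), ("c", "d")}] * 3
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
S = second_order_graph(edges, D, threshold=0.9)
```

In this toy case S links a-b with a-c and b-c with c-d, but never an edge from the first group with one from the second, which is how the two modules are kept apart.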

• Coherent Graph: a sub-graph sub($\hat{G}$) extracted from the summary graph $\hat{G}$ is coherent if
– All its edges have support > k: $\forall e \in \mathrm{sub}(\hat{G}):\ \mathrm{support}(e) > k$
– Its second-order graph is dense
• Graph Density:

$$\mathrm{density}(G) = \frac{2m}{n(n-1)}$$

m: number of edges
n: number of nodes
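As a quick check of the density formula, here is a small sketch. The function name and the convention of returning 0.0 for graphs with fewer than two nodes are assumptions made for the example.

```python
def density(num_nodes, num_edges):
    """density(G) = 2m / (n(n-1)) for an undirected simple graph."""
    if num_nodes < 2:
        return 0.0  # assumed convention: no possible edges, so density is 0
    return 2 * num_edges / (num_nodes * (num_nodes - 1))


clique_density = density(4, 6)  # complete graph on 4 nodes has all 6 possible edges
path_density = density(5, 4)    # path on 5 nodes has 4 of the 10 possible edges
```

A complete graph reaches the maximum density of 1, while sparse structures such as a path fall well below it.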

Two facts:
• If a frequent sub-graph is dense, then it must be dense in the summary graph as well, but the converse does not always hold
• If a sub-graph is coherent (its edges have high correlation across the dataset), then its second-order sub-graph is dense

• Aggregate the graphs into a summary graph
• Eliminate infrequent edges

MODES: Mining Overlapping DEnse Subgraphs

• Developed based on HCS: Highly Connected Sub-graphs

• Can efficiently identify dense sub-graphs
• Can mine overlapping sub-graphs
• Two approaches:
– Minimum cut
– Normalized cut (Shi, Malik 2000)
• Apply the normalized cut in the initial steps of the HCS algorithm, then, if the size of the partitions is small, proceed with minimum cut
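For reference, the normalized cut criterion of Shi and Malik for splitting a vertex set V into A and B can be written as below. The symbols cut, assoc, and w(u, v) follow that paper's notation, not the slides. For an un-weighted graph, w(u, v) is 1 if the edge exists and 0 otherwise.

```latex
\mathrm{Ncut}(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(A, V)} + \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(B, V)},
\qquad
\mathrm{cut}(A, B) = \sum_{u \in A,\, v \in B} w(u, v),
\qquad
\mathrm{assoc}(A, V) = \sum_{u \in A,\, t \in V} w(u, t)
```

Unlike the plain minimum cut, this criterion penalizes splitting off very small vertex sets, which is one plausible reason to use it in the early, large-graph steps.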

CODENSE analysis

• Simplify the identification of coherent dense sub-graphs across n graphs into mining in two special graphs: summary graph + second-order graph

• Can mine network modules
• Can mine both exact and approximate patterns (by modifying the similarity threshold)
• Can be extended to weighted graphs (using Pearson correlation instead of Euclidean distance)

Experimental Study: co-expression network

• 39 yeast microarray datasets
• 6661 genes
• Calculate the Pearson correlation (r) between the expression levels
• Construct the relation graph (connectivity of two genes determined by the Pearson correlation):

$$\frac{(n-2)\,r^2}{1-r^2}$$

n: number of measurements
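One common way to turn a Pearson correlation r over n measurements into a significance score is the t-statistic $t = r\sqrt{(n-2)/(1-r^2)}$, whose square is $(n-2)r^2/(1-r^2)$. The sketch below computes both. Whether the original study used exactly this form, and which cutoff decided that two genes are connected, are assumptions here. The function names are hypothetical.

```python
from math import copysign, inf, sqrt


def pearson_r(x, y):
    """Pearson correlation between two expression profiles of equal length."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumed convention for a constant profile
    return sxy / sqrt(sxx * syy)


def t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)), with n the number of measurements."""
    if abs(r) >= 1:
        return copysign(inf, r)  # perfect correlation: statistic is unbounded
    return r * sqrt((n - 2) / (1 - r * r))


r_perfect = pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
t_example = t_statistic(0.6, 18)
```

Because the statistic grows with n, the same correlation value counts as stronger evidence in a dataset with more measurements, which makes scores comparable across datasets of different sizes.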

• Create the summary graph $\hat{G}$, removing edges that occur fewer than 6 times across the 39 graphs
• Apply MODES to identify dense sub-graphs sub($\hat{G}$) with cutoff density d1
• For each sub($\hat{G}$), construct the second-order graph S
• Apply MODES to S to identify sub-graphs with density > d2
• Transform the edges back into vertices, and apply MODES again to identify the dense sub-graphs with density > d3

Functional Module Discovery: MODES vs CODENSE

• MODES identified 366 clusters, but only 151 were functionally homogeneous (42%)

• CODENSE identified 770 clusters, of which 76% were functionally homogeneous

• The improvement is due to the second-order graph, which eliminates edges that do not show co-occurrence across all networks

• Example of a MODES false positive: MODES identified 5 genes (MSF1, PHB1, CBP4, NDI1, SCO2) which are not functionally homogeneous

Categories shown in the figure: protein biosynthesis, replicative cell aging, mitochondrial electron transfer

Functional prediction:

• CODENSE identified this 6-node sub-graph
• 5 genes belong to the “protein biosynthesis” category
• Predict: ASC1 must be involved in protein biosynthesis as well

Test with 448 known genes: 50% accuracy
