Mining Coherent Dense Subgraphs across Multiple Biological Networks
Vahid Mirjalili, CSE 891
• Motivation:
– Finding patterns across multiple networks to identify biological modules and predict function
• Current algorithms are too costly
• Developed a novel algorithm: CODENSE
– Scalable in the number and size of networks
– Adjustable for exact or approximate pattern mining
• Clustering can detect meaningful biological modules
– e.g. a dense protein interaction sub-network may correspond to a protein complex
– A dense co-expression sub-network may represent a co-expression cluster
• Biological modules are expected to be active across multiple conditions
• One idea: aggregate all the networks and identify dense sub-graphs in the aggregated network
– Risk of false-positive detection
Aggregated graph: False positive in the aggregated graph
• Adding six graphs together and deleting the edges that occur fewer than 3 times yields the summary graph
– A sub-graph can appear dense in the summary graph without being dense in the individual graphs
Solution to false positives in the summary graph
• Frequent sub-graphs
• Mine the dense sub-graphs directly in each original network
• A sub-graph is frequent if it occurs multiple times in a set of graphs
• In biological networks, each gene occurs only once in a graph, so there is no sub-graph isomorphism problem
Frequent dense sub-graph
• A frequent dense sub-graph doesn’t show accurate information
– Some edges in the frequent sub-graph shown above do not occur in the original set
– It is more meaningful to divide it into two sub-graphs
Coherent Dense Sub-graphs
• All edges in a coherent sub-graph should have correlated occurrences in the original graph set
• CODENSE summarizes the networks in 2 meta-graphs (the summary graph and the second-order graph) and performs clustering on these two graphs only (instead of on the individual networks)
– CODENSE can distinguish the two modules
– Good scalability
– Discovery of overlapping clusters
Overlapping Sub-graphs
• Partition-based clustering algorithms fail to identify overlapping sub-graphs
• Mining Overlapping Dense Sub-graphs (MODES)
Application
• Identify frequent co-expression clusters across multiple microarray datasets
Microarray dataset:
– Un-weighted, undirected graph
– Each node represents a gene
– Two genes are connected by an edge if they show high expression correlation
• A densely connected sub-graph corresponds to a tight co-expression cluster
• Clusters from a single microarray dataset include spurious links, and may not be homogeneous in function and regulation
Problem Formulation
• A relation graph dataset contains n simple graphs, $D = \{G_1, G_2, \ldots, G_n\}$ with $G_i = (V, E_i)$
– A common vertex set V is shared by the graphs
• Support(G): the number of graphs in the relation graph dataset (D) that contain G
• A graph G is frequent if support(G) > threshold
• Summary graph: an un-weighted graph extracted from D, where an edge exists only if it occurs in more than k graphs in D
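A minimal sketch of the support count and the summary graph defined above, assuming each graph is given as a list of undirected edges. The function names and the toy dataset are illustrative assumptions, not part of the original slides.

```python
from collections import Counter

def edge_support(graphs):
    """Count, for every undirected edge, how many graphs contain it."""
    support = Counter()
    for edges in graphs:
        # frozenset makes (u, v) and (v, u) the same edge;
        # the set comprehension counts each edge once per graph
        support.update({frozenset(e) for e in edges})
    return support

def summary_graph(graphs, k):
    """Keep only the edges that occur in more than k graphs."""
    return {e for e, s in edge_support(graphs).items() if s > k}

# Toy relation graph dataset (invented): three graphs over vertices a, b, c, d
D = [
    [("a", "b"), ("b", "c")],
    [("a", "b"), ("c", "d")],
    [("b", "a"), ("b", "c")],
]
summary = summary_graph(D, k=1)  # a-b occurs 3 times, b-c twice, c-d once
```

With k = 1 the edge c-d is dropped because it occurs in only one graph.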
Problem Formulation
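• Edge Support Vector: $w(e) = [w_1(e), w_2(e), \ldots, w_n(e)]$, where $w_i(e)$ is the weight of edge e in graph i (for an un-weighted graph it is 0 or 1)
– Example with six graphs: $w(a, b) = [1, 1, 1, 0, 0, 0]$ and $w(b, c) = [0, 0, 0, 1, 1, 1]$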
• Second-Order Graph: $S = (V \times V, E_s)$, where each node represents an edge from the relation graph dataset (D) and an edge between nodes u and v exists if w(u) and w(v) are highly correlated
• For efficiency, only construct the second-order graph S for a sub-graph of the summary graph
• Coherent Graph: a sub-graph sub(G) extracted from the summary graph G is coherent if
– All its edges have support > k: $\forall e \in \mathrm{sub}(G):\ \mathrm{support}(e) > k$
– Its second-order graph is dense
• Graph Density: $\mathrm{density}(G) = \frac{2m}{n(n-1)}$
– m: number of edges
– n: number of nodes
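The definitions on this slide can be sketched as follows for un-weighted graphs. This is only an illustration: it assumes Euclidean distance between edge support vectors as the similarity measure, and the threshold `max_dist`, the helper names and the toy dataset are all hypothetical.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+

def edge(u, v):
    """Undirected edge as an order-independent, hashable pair."""
    return frozenset((u, v))

def support_vector(e, graphs):
    """w(e): entry i is 1 if edge e occurs in graph i, else 0."""
    return [1 if e in g else 0 for g in graphs]

def second_order_graph(edges, graphs, max_dist):
    """Connect two first-order edges whose support vectors are close.

    max_dist is an assumed similarity threshold, not a value from the slides.
    """
    w = {e: support_vector(e, graphs) for e in edges}
    return {
        frozenset((u, v))
        for u, v in combinations(w, 2)
        if dist(w[u], w[v]) <= max_dist
    }

def density(n, m):
    """density = 2m / (n(n-1)) for n nodes and m edges (0.0 if n < 2)."""
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0

# Toy dataset (invented): triangle a-b-c occurs in graphs 0-2,
# triangle c-d-e occurs in graphs 3-5
t1 = {edge("a", "b"), edge("b", "c"), edge("a", "c")}
t2 = {edge("c", "d"), edge("d", "e"), edge("c", "e")}
D = [t1, t1, t1, t2, t2, t2]

S = second_order_graph(t1 | t2, D, max_dist=1.0)
```

In this toy case the second-order graph links edges only within each triangle, so each triangle forms a second-order clique (density 1.0), while the six edges taken together have second-order density 0.4.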
Two facts:
• If a frequent sub-graph is dense, then it must also be dense in the summary graph, but the converse does not always hold
• If a sub-graph is coherent (its edges have highly correlated occurrences across the dataset), then its second-order graph is dense
• Aggregate the graphs into a summary graph
• Eliminate infrequent edges
MODES: Mining Overlapping DEnse Subgraphs
• Developed based on HCS: Highly Connected Sub-graphs
• Can efficiently identify dense sub-graphs
• Can mine overlapping sub-graphs
• Two approaches:
– Minimum cut
– Normalized cut (Shi, Malik 2000)
• Apply the normalized cut in the initial steps of the HCS algorithm; then, if the partitions are small, proceed with the minimum cut
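For orientation, here is a rough sketch of the plain HCS recursion that MODES builds on, not of MODES itself: it uses minimum cuts only, omits the normalized-cut step, and does not produce overlapping sub-graphs. It assumes the usual HCS criterion (a graph on n nodes is highly connected if its minimum cut is larger than n/2) and uses a brute-force minimum cut that is only practical for tiny graphs. All names and the example graph are illustrative.

```python
from itertools import combinations

def cut_size(edges, part):
    """Number of edges with exactly one endpoint in part."""
    return sum(1 for u, v in edges if (u in part) != (v in part))

def min_cut(nodes, edges):
    """Brute-force minimum edge cut; exponential, for tiny graphs only."""
    nodes = sorted(nodes)
    anchor, rest = nodes[0], nodes[1:]
    best = None
    # The anchor's side never takes all remaining nodes, so both sides are non-empty
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            part = {anchor, *extra}
            size = cut_size(edges, part)
            if best is None or size < best[0]:
                best = (size, part)
    return best

def hcs(nodes, edges):
    """Return the highly connected sub-graphs found by recursive min-cut splitting."""
    nodes = set(nodes)
    if len(nodes) < 2:
        return []  # singletons are discarded
    edges = [(u, v) for u, v in edges if u in nodes and v in nodes]
    size, part = min_cut(nodes, edges)
    if size > len(nodes) / 2:
        return [nodes]
    return hcs(part, edges) + hcs(nodes - part, edges)

# Two triangles joined by a single bridge edge c-d (invented example)
E = [("a", "b"), ("b", "c"), ("a", "c"),
     ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
clusters = hcs("abcdef", E)
```

Here the bridge c-d is the minimum cut of the whole graph, so the recursion separates the two triangles, and each triangle passes the connectivity test.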
CODENSE analysis
• Simplifies the identification of coherent dense sub-graphs across n graphs into mining two special graphs: summary graph + second-order graph
• Can mine network modules
• Can mine both exact and approximate patterns (by modifying the similarity threshold)
• Can be extended to weighted graphs (using Pearson correlation instead of Euclidean distance)
Experimental Study: co-expression network
• 39 yeast microarray datasets
• 6661 genes
• Calculate the Pearson correlation (r) between the expression levels
• Construct the relation graph (connectivity of two genes is determined by the Pearson correlation), using the statistic $\sqrt{\frac{(n-2)\,r^2}{1-r^2}}$
– n: number of measurements
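A small sketch of how an edge of the relation graph could be decided from two expression profiles, using the statistic sqrt((n - 2) r^2 / (1 - r^2)). The cutoff `t_cutoff` is a hypothetical parameter (the slides do not give a value), and the profiles are invented. Because r is squared, this sketch treats strong negative correlation the same as strong positive correlation, which is also an assumption.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def t_statistic(r, n):
    """sqrt((n - 2) r^2 / (1 - r^2)); infinite when |r| reaches 1."""
    if r * r >= 1:
        return float("inf")
    return sqrt((n - 2) * r * r / (1 - r * r))

def is_edge(x, y, t_cutoff):
    """Connect two genes if the transformed correlation reaches the assumed cutoff."""
    return t_statistic(pearson(x, y), len(x)) >= t_cutoff

# Invented expression profiles over n = 5 measurements
gene_x = [1, 2, 3, 4, 5]
gene_y = [2, 1, 4, 3, 5]
r = pearson(gene_x, gene_y)        # 0.8 up to floating-point rounding
t = t_statistic(r, len(gene_x))    # about 2.31
```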
• Create the summary graph G, removing edges that occur fewer than 6 times across the 39 graphs
• Apply MODES to identify dense sub-graphs sub(G) with cutoff density d1
• For each sub(G), construct the second-order graph S
• Apply MODES to S to identify sub-graphs with density > d2
• Transform the second-order sub-graphs back into first-order graphs (each node of S is an edge between two vertices), and apply MODES again to identify the dense sub-graphs with density > d3
Functional Module Discovery: MODES vs CODENSE
• MODES identified 366 clusters, but only 151 were functionally homogeneous (42%)
• CODENSE identified 770 clusters, 76% of which were homogeneous
• The improvement is due to the second-order graph, which eliminates edges that do not show co-occurrence across the networks
• Example of a MODES false positive: MODES identified 5 genes (MSF1, PHB1, CBP4, NDI1, SCO2) which are not functionally homogeneous
– Annotated functions shown: protein biosynthesis, replicative cell aging, mitochondrial electron transfer
Functional prediction:
• CODENSE identified this 6-node sub-graph
• 5 genes belong to the “protein biosynthesis” category
• Predict: ASC1 must be involved in protein biosynthesis as well
• Test with 448 known genes: 50% accuracy
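The prediction step can be illustrated with a simple majority-vote rule over a cluster. This is a sketch under assumptions: the slides do not state the exact rule, the `min_fraction` threshold is hypothetical, and `gene1` to `gene5` are placeholder names standing in for the five annotated genes of the sub-graph.

```python
from collections import Counter

def predict_function(cluster, annotations, min_fraction=0.5):
    """Return {gene: category} for unannotated genes if one category dominates the cluster."""
    known = [annotations[g] for g in cluster if g in annotations]
    if not known:
        return {}
    category, count = Counter(known).most_common(1)[0]
    # Require the top category to cover more than min_fraction of the whole cluster
    if count / len(cluster) <= min_fraction:
        return {}
    return {g: category for g in cluster if g not in annotations}

# ASC1 is the unannotated gene; gene1..gene5 are placeholders
cluster = ["ASC1", "gene1", "gene2", "gene3", "gene4", "gene5"]
annotations = {g: "protein biosynthesis" for g in cluster[1:]}
prediction = predict_function(cluster, annotations)
```

With 5 of 6 genes sharing one category, the rule assigns that category to ASC1. A cluster with mixed annotations yields no prediction.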