
Copyright 2006, Data Mining Research Laboratory

Integrated Mining of PPI Networks: A Case for Ensemble Clustering

Srinivasan Parthasarathy
Department of Computer Science and Engineering
The Ohio State University

Joint work with Sitaram Asur and Duygu Ucar


I. Preliminaries and Motivation


Proteins

• Central component of cell machinery and life
  – It is the proteins dynamically generated by a cell that execute the genetic program [Kahn 1995]
• Proteins work with other proteins [Von Mering et al 2002]
  – Form large interaction networks, typically referred to as protein-protein interaction (PPI) networks
  – Regulate and support each other for specific functionality or process


Protein-Protein Interaction Networks
• Why analyze?
  – To fully understand cellular machinery, simply listing proteins is not enough; (clusters of) interactions need to be delineated as well [Von Mering et al 2002]
    • Understanding the organism
  – Protein function prediction
    • E.g. no functional annotations for one-third of baker’s yeast
  – Drug design
• Goal: To find modular clusters


Challenges in analyzing PPI Networks

• Noisy data
  – False positives [Deane 2002], false negatives [Hsu 06]
• Existence of hub nodes
  – Particularly problematic for standard clustering and graph partitioning algorithms: they lead to very large core clusters and not much else!
• Proteins can be multi-faceted
  – Can belong to multiple functional groups; most clustering algorithms are hard, hence the need for soft or fuzzy clustering
• Data integration issues
  – Multiple sources
    • Yeast two-hybrid (Y2H), mass spectrometry, genetic co-occurrence
  – Different targets
    • Y2H, mass spectrometry: target binding
    • Gene co-occurrence: target functional
  – Different weaknesses (missing certain interactions)
    • Y2H: translation
    • Mass spectrometry: transport & sensing


Ensemble Clustering

• A useful approach to combine the results from multiple clustering arrangements into a single arrangement based on consensus [SG03]

• Objective: Map the clusters obtained by different algorithms to a single clustering arrangement
• Our hypothesis: Potentially offers a viable solution for these problems simultaneously
  – Given nice theory in the context of classification, it is likely to be particularly useful in a noisy environment
    • A weak analogy to the audience vote in "Millionaire"
  – Naturally handles arrangements produced from different sources or domain-driven segmentation


Ensemble Clustering on PPI Networks: Key Questions
• What are the base clustering methods and arrangements to use in the context of interaction networks?
  – How to handle the influence of noise and hubs?
• How do we scale to problems of the size of interaction networks?
• How do we address the issue of soft clustering?
• How to address the issue of data integration?
  – Another day, another time


II. Ensemble Clustering Framework


Bird's-eye view (coarse grained)
[Figure: scale-free graph → topology-based similarity metrics (x) → clustering algorithms (y) → x·y base clustering arrangements → cluster representation → (soft) consensus clustering → final clusters]


Similarity Metrics

• Central to any clustering algorithm
• Key idea:
  – Leverage topological information to determine the similarity between two proteins in the interaction network
  – With the ensemble approach we are not limited to one!
• Metrics:
  – Clustering coefficient based (edge oriented, local)
  – Edge betweenness based (edge oriented, global)
  – Neighborhood based (local, non-edge oriented)


Clustering coefficient-based similarity

• Clustering coefficient
  – "all-my-friends-know-each-other" property
  – Measures the interconnectivity of a node’s neighbors
• Clustering coefficient-based similarity of two connected nodes vi and vj
  – Measures the contribution of the edge between the nodes towards the clustering coefficient of the nodes

[Figure: example graph with nodes 1-6; vi and vj label the two connected nodes]
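The similarity formula itself is not in the extracted slide text. The sketch below is one plausible reading of "contribution of the edge": the change in the two endpoints' clustering coefficients when the edge is removed. That form, the function names and the toy graph are assumptions for illustration, not the authors' exact definition.

```python
from itertools import combinations

def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbors that are themselves connected."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))

def cc_similarity(adj, vi, vj):
    """Assumed form: CC(vi) + CC(vj) with the edge, minus the same sum without it."""
    before = clustering_coefficient(adj, vi) + clustering_coefficient(adj, vj)
    pruned = {v: set(n) for v, n in adj.items()}
    pruned[vi].discard(vj)
    pruned[vj].discard(vi)
    after = clustering_coefficient(pruned, vi) + clustering_coefficient(pruned, vj)
    return before - after

# Toy graph: triangle a-b-c with a pendant node d attached to a.
toy = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
triangle_edge = cc_similarity(toy, "a", "b")
pendant_edge = cc_similarity(toy, "a", "d")
```

Under this reading, an edge inside a triangle scores higher than an edge to a pendant node, which is the behavior one would want when down-weighting likely false positives.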


Edge betweenness-based similarity

• Shortest-path edge betweenness [Newman et al]
  – “I-am-between-every-pair” property
  – Computes the fraction of shortest paths passing through an edge
  – Edges that lie between communities have high values of betweenness
• Edge betweenness-based similarity

[Figure: example graph with nodes 1-8]
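A minimal sketch of shortest-path edge betweenness on an unweighted, undirected graph, using a Brandes-style breadth-first accumulation. The conversion to a similarity (one minus betweenness divided by the maximum betweenness) is an assumption chosen so that inter-community edges get low weight; the slide text does not preserve the authors' formula.

```python
from collections import deque

def edge_betweenness(adj):
    """Number of shortest paths (fractional on ties) that cross each edge."""
    bw = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        dist = {s: 0}
        sigma = {v: 0.0 for v in adj}
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Push path counts back from the farthest nodes toward s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                bw[frozenset((v, w))] += share
                delta[v] += share
    # Every unordered pair was visited from both of its endpoints.
    return {e: b / 2.0 for e, b in bw.items()}

def betweenness_similarity(adj):
    """Assumed similarity: 1 - betweenness / max betweenness."""
    bw = edge_betweenness(adj)
    if not bw:
        return {}
    top = max(bw.values())
    return {e: 1.0 - b / top for e, b in bw.items()}

# Toy graph: two triangles (1,2,3) and (4,5,6) joined by the bridge 3-4.
toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
similarity = betweenness_similarity(toy)
```

In the toy graph the bridge carries every path between the two triangles, so it receives the lowest similarity.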


Neighborhood-based similarity

• “my-friends-are-your-friends” property
• Based on the number of common neighbors between nodes (Czekanowski-Dice metric [Brun et al, 2004])
  where Int(i) = number of neighbors of node i

[Figure: example graph with nodes 1-6]
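The Czekanowski-Dice expression is not in the extracted text. A sketch of the usual form follows: the size of the symmetric difference of the two neighbor sets, divided by the size of their union plus the size of their intersection. It assumes Int(i) is handled as the set of neighbors of i with i itself included, so that two directly linked proteins already share members. The function names are hypothetical.

```python
def czekanowski_dice_distance(adj, i, j):
    """0 when i and j have identical neighborhoods, 1 when they share nothing."""
    int_i = adj[i] | {i}  # assumption: a node counts as part of its own neighborhood
    int_j = adj[j] | {j}
    differing = len(int_i ^ int_j)
    return differing / (len(int_i | int_j) + len(int_i & int_j))

def neighborhood_similarity(adj, i, j):
    """Similarity as the complement of the distance."""
    return 1.0 - czekanowski_dice_distance(adj, i, j)

# Toy graph: a triangle a-b-c and a separate edge d-e.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"e"}, "e": {"d"}}
```

Because the measure does not require an edge between i and j, it can assign high similarity to unlinked pairs, which is how it can address false negatives as well as false positives.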


Base Clustering
• Base clustering algorithms: different criteria
  – kMetis
  – Repeated bisections
  – Direct k-way partitioning
• Topology-based similarity measures: weight interactions
  – Clustering coefficient-based: local, targets false positives (FP)
  – Edge betweenness-based: global, targets FP
  – Neighborhood: local, potentially targets false negatives (FN) & FP
• 3 × 3 = 9 arrangements (variance is good!)
  – K clusters per arrangement (9 × K base clusters in total)


PCA-based Consensus Technique

Cluster Purification

Dimensionality Reduction

Consensus Clustering


Cluster Purification

• Goal: Prune unreliable base clusters
• Intra-cluster similarity measure
  where SP(i,j) represents the shortest path between i and j
• Low intra-cluster distance => high reliability
• Remove clusters with low reliability
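The purification measure is not in the extracted text beyond its use of SP(i,j). As a sketch, the code below scores a cluster by the mean shortest-path length over all member pairs and keeps clusters at or under a cutoff. Averaging over pairs, using unweighted hop counts, the unreachable-pair penalty and the `max_mean_distance` parameter are all assumptions.

```python
from collections import deque
from itertools import combinations

def bfs_distances(adj, source):
    """Hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def mean_intra_cluster_distance(adj, cluster):
    """Mean SP(i, j) over all pairs in the cluster."""
    members = sorted(cluster)
    if len(members) < 2:
        return 0.0
    dist = {v: bfs_distances(adj, v) for v in members}
    pairs = list(combinations(members, 2))
    # Unreachable pairs get len(adj), which exceeds any real path length.
    total = sum(dist[a].get(b, len(adj)) for a, b in pairs)
    return total / len(pairs)

def purify(adj, clusters, max_mean_distance):
    """Keep only clusters whose members sit close together in the graph."""
    return [c for c in clusters
            if mean_intra_cluster_distance(adj, c) <= max_mean_distance]

# Toy graph: the path a-b-c-d-e-f.
toy = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"},
       "d": {"c", "e"}, "e": {"d", "f"}, "f": {"e"}}
kept = purify(toy, [{"a", "b"}, {"a", "f"}], max_mean_distance=2.0)
```

Here the tight cluster {a, b} survives, while {a, f}, whose members are five hops apart, is pruned.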


Dimensionality Reduction

• Cluster membership matrix to represent pruned base clusters

• Dimensions likely to be high (9 × K)
• Clustering is inefficient for high-dimensional data
  – Distance metric computations do not scale well
• Lot of noise and redundancy in the matrix
• Solution: Reduce dimensions of the matrix
  – Apply logistic PCA
  – Variant of PCA for binary data (Schein et al, 2003)



Consensus Clustering
• Agglomerative hierarchical clustering
  – Bottom-up clustering algorithm
  – Begin with each point in a separate cluster
  – Iteratively merge clusters that are similar
• Recursive bisection (RBR) algorithm
• Soft clustering variants
  – Find initial clusters using agglo or RBR
  – Assign points to multiple clusters based on similarity
  – Hub nodes have high propensity for multiple membership
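A rough sketch of the two-stage idea on points in a reduced space: bottom-up merging down to k clusters, then a soft pass that lets a point join every cluster whose centroid is nearly as close as its nearest one. Centroid linkage, Euclidean distance and the `slack` parameter are assumptions; the slides do not state which linkage or membership rule was used.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points, members):
    dims = len(points[members[0]])
    return [sum(points[m][d] for m in members) / len(members) for d in range(dims)]

def agglomerative(points, k):
    """Start with singletons; repeatedly merge the two closest centroids."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = euclidean(centroid(points, clusters[a]),
                              centroid(points, clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

def soft_assign(points, clusters, slack):
    """Add each point to all clusters within `slack` of its nearest centroid."""
    centers = [centroid(points, c) for c in clusters]
    soft = [set(c) for c in clusters]
    for i, p in enumerate(points):
        dists = [euclidean(p, c) for c in centers]
        nearest = min(dists)
        for idx, d in enumerate(dists):
            if d <= nearest + slack:
                soft[idx].add(i)
    return soft

# Two tight groups plus point 4 sitting between them.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0], [5.0, 0.5]]
hard = agglomerative(points, 2)
soft = soft_assign(points, hard, slack=2.0)
```

The in-between point plays the role of a hub here: the hard clustering forces it into one group, while the soft pass places it in both.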


Ensemble Framework (Detailed View)
[Figure: topological metrics supply weights → weighted graph → base clustering → base clustering arrangements → cluster purification (pruning) → principal component analysis → agglomerative clustering (PCA-agglo), PCA-rbr, or soft variants (PCA-soft-variants) → final clusters]

III. Evaluation


Validation Metrics: Domain Independent

• Topological measure: Modularity [Newman & Girvan 04]
  – Measures the modularity within clusters
  – dij represents the fraction of edges linking nodes in clusters i and j
• Information-theoretic measure: Normalized Mutual Information [Strehl & Ghosh 03]
  – Measures the shared information between the consensus and base clustering arrangements
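Both formulas are missing from the extracted text. The sketch below uses the standard forms: Newman-Girvan modularity as the sum over clusters i of d_ii minus the square of the row sum of d_ij, and NMI as mutual information divided by the geometric mean of the two entropies. Splitting each inter-cluster edge evenly between d_ij and d_ji, and the hard-label input format for NMI, are assumptions.

```python
import math
from collections import Counter

def modularity(edges, clusters):
    """Sum over clusters of d_ii - (sum_j d_ij)^2 for a hard clustering."""
    label = {v: idx for idx, c in enumerate(clusters) for v in c}
    k = len(clusters)
    m = len(edges)
    d = [[0.0] * k for _ in range(k)]
    for u, v in edges:
        i, j = label[u], label[v]
        if i == j:
            d[i][i] += 1.0 / m
        else:
            # Split an inter-cluster edge evenly so d stays symmetric.
            d[i][j] += 0.5 / m
            d[j][i] += 0.5 / m
    return sum(d[i][i] - sum(d[i]) ** 2 for i in range(k))

def nmi(labels_a, labels_b):
    """Mutual information normalized by sqrt(H(a) * H(b))."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = sum(c / n * math.log(c * n / (count_a[a] * count_b[b]))
             for (a, b), c in joint.items())
    h_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    h_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    if h_a == 0.0 or h_b == 0.0:
        return 0.0
    return mi / math.sqrt(h_a * h_b)

# Two triangles joined by one bridge edge, clustered into the two triangles.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
q = modularity(edges, [{1, 2, 3}, {4, 5, 6}])
```

Putting every node in one cluster gives modularity 0, while the two-triangle split scores 5/14, so higher values indicate more edges inside clusters than expected by chance.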


Validation Metric: Domain Dependent

• Domain-based measure:
  – Gene Ontology annotations for each cluster of proteins
    • Cellular Component
    • Molecular Function
    • Biological Process
  – P-value to measure statistical significance of clusters
    • Computes the probability of the grouping being random
    • Smaller p-values represent higher biological significance
  – Clustering score to measure overall clustering arrangement
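The slide does not name the statistical test behind the p-value. A common choice for Gene Ontology enrichment is the hypergeometric tail probability, which this sketch assumes: the chance of seeing at least `hits` annotated proteins in a cluster drawn at random. All parameter names are hypothetical.

```python
from math import comb

def enrichment_p_value(genome_size, annotated, cluster_size, hits):
    """P(X >= hits) when cluster_size proteins are drawn without replacement
    from genome_size proteins, of which `annotated` carry the GO term."""
    total = comb(genome_size, cluster_size)
    upper = min(annotated, cluster_size)
    tail = sum(comb(annotated, i) * comb(genome_size - annotated, cluster_size - i)
               for i in range(hits, upper + 1))
    return tail / total

# Toy numbers: 10 proteins, 5 carry the term, and a cluster of 5 contains all of them.
p = enrichment_p_value(genome_size=10, annotated=5, cluster_size=5, hits=5)
```

A cluster whose members nearly all share one annotation gets a very small p-value, matching the reading that smaller values indicate higher biological significance.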


Experimental Setup

• Algorithms proposed by Strehl et al, 2003
  – HyperGraph Partitioning Algorithm (HGPA)
    • Minimal hyperedge separator using HMetis
  – Meta-CLustering Algorithm (MCLA)
    • Group related hyperedges to form meta-clusters
    • Assign each point to the closest meta-cluster
  – Cluster-based Similarity Partitioning Algorithm (CSPA)
    • Pairwise similarity matrix is partitioned with METIS
• Algorithms proposed by Gionis et al, ICDE 2005
  – Agglomerative algorithm (CE-agglo)
  – Density-based clustering algorithm (CE-balls)
  – Use strict thresholds and are non-parametric
• Database of Interacting Proteins (DIP)
  – 4928 proteins, 17194 interactions


Modularity and NMI

CSPA algorithm ran out of memory.
CE-agglo and CE-balls algorithms resulted in pairs and singleton clusters (cluster-sizes 2121 and 2783 respectively).
PCA-based consensus methods provide best scores!

Algorithm    Modularity    NMI
PCA-agglo    0.471         0.66
PCA-rbr      0.46          0.656
MCLA         0.41          0.614
HGPA         0.1           0.275


Comparison with Ensemble Algorithms

[Figure: bar chart of clustering score (0 to 1) for the Process, Function and Component ontologies, by ensemble algorithm: CE-balls, CE-agglo, HGPA, PCA-agglo, PCA-rbr, MCLA, Wt-agglo]

PCA-based Consensus methods outperform all other algorithms!

MCLA performs best of the other algorithms


Existing Solutions to Identify Dense Regions

• Molecular Complex Detection (MCODE)
  – Bader et al, 2003
  – Use local neighborhood density to identify seed vertices
  – Group highly weighted vertices around seed vertices
• Markov Cluster Algorithm (MCL)
  – Dongen et al 2000
  – Random walks on the graph will infrequently go from one natural cluster to another
  – Cluster structure separates out
  – Fast, scalable and non-parametric
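To illustrate the random-walk intuition behind MCL, here is a compact sketch of its expansion and inflation loop on a dense matrix. It is a teaching sketch rather than the reference implementation: the added self-loops, the fixed iteration count, the `1e-6` cutoff and reading clusters off the non-empty rows are simplifying assumptions.

```python
def normalize_columns(mat):
    """Rescale every column to sum to 1 (column-stochastic)."""
    n = len(mat)
    for c in range(n):
        total = sum(mat[r][c] for r in range(n))
        for r in range(n):
            mat[r][c] /= total

def mcl(adj, inflation=2.0, iterations=30):
    nodes = sorted(adj)
    n = len(nodes)
    idx = {v: i for i, v in enumerate(nodes)}
    # Transition matrix with self-loops added to every node.
    m = [[0.0] * n for _ in range(n)]
    for v in nodes:
        m[idx[v]][idx[v]] = 1.0
        for w in adj[v]:
            m[idx[w]][idx[v]] = 1.0
    normalize_columns(m)
    for _ in range(iterations):
        # Expansion: take two random-walk steps (matrix squared).
        m = [[sum(m[r][k] * m[k][c] for k in range(n)) for c in range(n)]
             for r in range(n)]
        # Inflation: boost strong flows, damp weak ones, then renormalize.
        m = [[x ** inflation for x in row] for row in m]
        normalize_columns(m)
    # Each non-empty row is an attractor; its non-zero columns form a cluster.
    clusters = set()
    for r in range(n):
        members = frozenset(nodes[c] for c in range(n) if m[r][c] > 1e-6)
        if members:
            clusters.add(members)
    return clusters

# Two separate triangles: flow can never cross between them.
toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}, 4: {5, 6}, 5: {4, 6}, 6: {4, 5}}
found = mcl(toy)
```

The only tunable value is the inflation exponent, which controls granularity; that is consistent with the slide's description of the method as fast and essentially non-parametric.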


Comparison with MCODE and MCL

• MCODE produced only 59 clusters
  – Not all proteins clustered (794/4928)
  – 10-20 clusters insignificant
• MCL produced 1246 clusters
  – Most of the clusters insignificant (close to 75-80%)

Algorithm    Modularity
PCA-agglo    0.471
MCL          0.217
MCODE        0.372


Soft Clustering: Comparison with Hub Duplication (Ucar 2006)
[Figure: hub duplication pipeline. For each hub Hi (i++): hub-induced subgraph Si → dense components of Si → duplicate Hi (D'i); followed by graph partitioning]


Benefits of Soft Ensemble Clustering


A closer look at soft clustering performance

• CKA1 (hub protein)

Base Algorithm    Annotation
Direct-bet        Kinase CK2 complex
Direct-cc         rRNA metabolism
RBR-bet           Kinase CK2 complex
RBR-cc            Kinase CK2 complex
Metis-bet         Cell organization and biogenesis
Metis-cc

PCA-agglo: Kinase CK2 complex
PCA-softagglo: Kinase CK2 complex; rRNA metabolism; Cell organization and biogenesis


Concluding Remarks

• Clustering PPI networks is challenging
  – Noise
  – Presence of hubs
  – Need for soft clustering
  – Integration
• Ensemble clustering shows promise as a unified method to handle these problems
  – Competes well against existing stand-alone solutions
  – Scalable: straightforward parallelization for the most part
• Ongoing work
  – General applicability
    • WWW applications
    • Social network analysis
  – Explicit modeling of domain knowledge
    • E.g. encoding directionality
  – Data integration
    • Key is to weight edges and/or components of the ensemble
  – Leveraging graphical models
  – More robust base models
    • Extrinsic similarity measures
    • Impact of anomalies


Questions?

• We acknowledge the following grants for support

  – NSF: CAREER-IIS-0347662
  – NSF: NGS-CNS-0406386
  – NSF: RI-CNS-0403342
  – DOE: ECPI-FG02
• Graduate student colleagues
  – S. Asur and D. Ucar
• Details
  – http://dmrl.cse.ohio-state.edu
  – www.cse.ohio-state.edu/~srini/
