Copyright 2006, Data Mining Research Laboratory

Integrated Mining of PPI Networks: A Case for Ensemble Clustering

Srinivasan Parthasarathy
Department of Computer Science and Engineering
The Ohio State University

Joint work with Sitaram Asur and Duygu Ucar

I. Preliminaries and Motivation

Proteins

• Central component of cell machinery and life
  – It is the proteins dynamically generated by a cell that execute the genetic program [Kahn 1995]
• Proteins work with other proteins [Von Mering et al 2002]
  – Form large interaction networks, typically referred to as protein-protein interaction (PPI) networks
  – Regulate and support each other for specific functionality or process

Protein-Protein Interaction Networks

• Why analyze?
  – To fully understand cellular machinery, simply listing proteins is not enough; (clusters of) interactions need to be delineated as well [v.Mering 2002]
• Understanding the organism
  – Protein function prediction
    • E.g. no functional annotations for one-third of baker’s yeast
  – Drug design
• Goal: To find modular clusters

Challenges in analyzing PPI Networks

– Noisy data
  • False positives [Deane 2002], false negatives [Hsu 06]
– Existence of Hub Nodes
  • Particularly problematic for standard clustering and graph partitioning algorithms: they lead to very large core clusters and not much else!
– Proteins can be multi-faceted
  • Can belong to multiple functional groups; most clustering algorithms are hard, hence the need for soft or fuzzy clustering
– Data Integration Issues
  • Multiple Sources
    – 2-Hybrid (Y2H), Mass Spectrometry, genetic co-occurrence
  • Different targets
    – Y2H, Mass Spec: target binding
    – Gene co-occurrence: target functional
  • Different weaknesses (missing certain interactions)
    – Y2H: translation
    – Mass spectrometry: transport & sensing

Ensemble Clustering

• A useful approach to combine the results from multiple clustering arrangements into a single arrangement based on consensus [SG03]

• Objective: Mapping between clusters obtained by different algorithms to a single clustering arrangement

• Our hypothesis: Potentially offers a viable solution for these problems simultaneously
  – Given nice theory in the context of classification, it is likely to be particularly useful in a noisy environment.
    • A weak analogy to the audience vote in "Who Wants to Be a Millionaire"
  – Naturally handles arrangements produced from different sources or domain driven segmentation.
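
One simple way to picture consensus, in the spirit of the audience-vote analogy, is a co-association count: for each pair of proteins, the fraction of arrangements that place them in the same cluster. The sketch below is a generic illustration only and is not the PCA-based method presented later in the talk; the function name and the toy label vectors are assumptions.

```python
from itertools import combinations

def co_association(arrangements):
    """Fraction of arrangements in which each pair of items shares a cluster.

    `arrangements` is a list of label vectors, one per base clustering;
    labels only need to be consistent within a single vector.
    """
    n = len(arrangements[0])
    votes = {}
    for i, j in combinations(range(n), 2):
        agree = sum(1 for labels in arrangements if labels[i] == labels[j])
        votes[(i, j)] = agree / len(arrangements)
    return votes

# Three hypothetical base arrangements over five proteins (indices 0..4).
arrangements = [
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
]
votes = co_association(arrangements)
for pair in [(0, 1), (2, 3), (0, 4)]:
    print(pair, round(votes[pair], 3))
```

Pairs with a high vote are strong candidates for sharing a consensus cluster, while a protein that splits its votes between groups (index 2 here) is the kind of case that motivates soft clustering.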

Ensemble Clustering on PPI networks: Key Questions

• What are the base clustering methods and arrangements to use in the context of interaction networks?
  – How to handle the influence of noise and hubs?
• How do we scale to problems the size of interaction networks?
• How do we address the issue of soft clustering?
• How to address the issue of data integration?
  – Another day, another time

II. Ensemble Clustering Framework

Birds-eye-view (coarse grained)

[Figure: pipeline. Scale-free graph → topology-based similarity metrics (x) → clustering algorithms (y) → xy base clustering arrangements → cluster representation → (soft) consensus clustering → final clusters]

Similarity Metrics

• Central to any clustering algorithm
• Key idea:
  – Leverage topological information to determine the similarity between two proteins in the interaction network
  – With ensemble approach we are not limited to one!
• Metrics:
  – Clustering coefficient based (edge oriented, local)
  – Edge betweenness based (edge oriented, global)
  – Neighborhood based (local, non-edge oriented)

Clustering coefficient-based similarity

• Clustering coefficient
  – "all-my-friends-know-each-other" property
  – Measures the interconnectivity of a node’s neighbors.
• Clustering coefficient-based similarity of two connected nodes vi and vj
  – Measures the contribution of the edge between the nodes towards the clustering coefficient of the nodes

[Figure: example graph with nodes 1 to 6; vi and vj mark the two connected nodes being compared]
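
As a rough illustration of the quantity this metric builds on, the sketch below computes the standard clustering coefficient of a node: the fraction of pairs of its neighbors that are themselves connected. The edge-level similarity formula from the slide is not preserved in this transcript, so it is not reproduced here; the toy graph and function name are assumptions.

```python
from itertools import combinations

def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbors that are directly connected."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0  # fewer than two neighbors: no pairs to check
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))

# Hypothetical toy graph: a triangle 1-2-3 with a pendant node 4 attached to 3.
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

cc = {v: clustering_coefficient(adj, v) for v in adj}
print(cc)
```

Node 3 scores lower than nodes 1 and 2 because its extra neighbor (node 4) knows none of its other neighbors, which is the "all-my-friends-know-each-other" property failing for that node.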

Edge betweenness-based similarity

• Shortest path edge betweenness [Newman et al]
  – “I-am-between-every-pair” property
  – Computes the fraction of shortest paths passing through an edge
  – Edges that lie between communities have high values of betweenness
  – Edge betweenness-based similarity

[Figure: example graph with nodes 1 to 8]
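
A minimal sketch of shortest-path edge betweenness on an unweighted graph, using a Brandes-style accumulation: for every pair of nodes, each edge is credited with the fraction of that pair's shortest paths that cross it. This is offered as an illustration of the measure only; how the talk converts betweenness into a similarity weight is not preserved in the transcript, and the toy graph is an assumption.

```python
from collections import deque

def edge_betweenness(adj):
    """Unnormalized shortest-path edge betweenness for an undirected graph."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        dist = {s: 0}
        sigma = {v: 0.0 for v in adj}
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, pushing path fractions onto edges.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((v, w))] += share
                delta[v] += share
    # Every unordered pair was visited from both of its endpoints.
    return {e: b / 2.0 for e, b in bet.items()}

# Hypothetical toy graph: two triangles joined by the bridge 3-4.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

bet = edge_betweenness(adj)
print(bet[frozenset((3, 4))], bet[frozenset((1, 2))])
```

The bridge between the two triangles carries every cross-community shortest path, so it receives by far the highest value, matching the observation that edges lying between communities have high betweenness.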

Neighborhood-based similarity

• “my-friends-are-your-friends” property
• Based on the number of common neighbors between nodes (Czekanowski-Dice metric [Brun et al, 2004])
  where Int(i) = number of neighbors of node i

[Figure: example graph with nodes 1 to 6]
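
The slide's formula did not survive in this transcript. As a hedged sketch, the code below uses the Czekanowski-Dice distance in its commonly stated set form, D(i,j) = |Int(i) Δ Int(j)| / (|Int(i) ∪ Int(j)| + |Int(i) ∩ Int(j)|). Treating Int(i) as the set made of node i plus its neighbors is an assumption of this sketch (the slide text describes Int(i) as a count), as are the function name and the toy graph.

```python
def czekanowski_dice_distance(adj, i, j):
    """Czekanowski-Dice distance between the neighborhoods of i and j.

    Assumption: Int(x) is taken to be x together with its direct neighbors.
    0.0 means identical neighborhoods, 1.0 means nothing in common.
    """
    int_i = adj[i] | {i}
    int_j = adj[j] | {j}
    sym_diff = len(int_i ^ int_j)
    return sym_diff / (len(int_i | int_j) + len(int_i & int_j))

# Hypothetical toy graph: a triangle 1-2-3 with a pendant node 4 attached to 3.
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

d_12 = czekanowski_dice_distance(adj, 1, 2)
d_14 = czekanowski_dice_distance(adj, 1, 4)
print(d_12, d_14)
```

Because the measure depends only on shared neighbors, it can score pairs that have no direct edge between them, which is one reading of why the talk describes it as non-edge oriented.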

Base Clustering

• Base clustering algorithms: Different criteria
  – kMetis
  – Repeated bisections
  – Direct k-way partitioning
• Topology-based similarity measures: weight interactions
  – Clustering coefficient-based: local, targets false positives (FP)
  – Edge betweenness-based: global, targets FP
  – Neighborhood: local, potentially targets false negatives (FN) & FP
• 3X3 = 9 arrangements (variance is good!)
  – K clusters per arrangement (9K clusters in total)
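
The bookkeeping behind the nine arrangements can be sketched as a simple cross product of algorithms and similarity metrics. The short names follow the labels that appear later in the talk (for example "Direct-bet" and "RBR-cc"); the label "nbr" for the neighborhood metric, the function name, and the value of k are assumptions made for illustration.

```python
from itertools import product

# Labels follow the talk's later naming; "nbr" is a hypothetical label.
ALGORITHMS = ["Metis", "RBR", "Direct"]
METRICS = ["cc", "bet", "nbr"]

def base_arrangements(k):
    """Map every algorithm/metric pairing to its number of clusters."""
    return {f"{alg}-{metric}": k for alg, metric in product(ALGORITHMS, METRICS)}

arrangements = base_arrangements(k=50)  # k=50 is an arbitrary example value
total_clusters = sum(arrangements.values())
print(sorted(arrangements), total_clusters)
```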

PCA-based Consensus Technique

Cluster Purification

Dimensionality Reduction

Consensus Clustering

Cluster Purification

• Goal: Prune unreliable base clusters
• Intra-cluster similarity measure
  where SP(i,j) represents the shortest path between i and j
• Low intra-cluster distance => high reliability
• Remove clusters with low reliability
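
The exact intra-cluster measure is not preserved in this transcript. The sketch below assumes one plausible form, the mean shortest-path length SP(i,j) over all pairs in a cluster, computed with breadth-first search on an unweighted graph, and drops clusters whose mean exceeds a cutoff. The averaging, the unweighted distances, the cutoff value and all function names are assumptions.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def mean_intra_cluster_distance(adj, cluster):
    """Average SP(i, j) over all unordered pairs of cluster members."""
    members = sorted(cluster)
    if len(members) < 2:
        return 0.0
    total, pairs = 0.0, 0
    for idx, i in enumerate(members):
        dist = bfs_distances(adj, i)
        for j in members[idx + 1:]:
            total += dist.get(j, float("inf"))  # unreachable pairs count as infinite
            pairs += 1
    return total / pairs

def purify(adj, clusters, max_mean_distance):
    """Keep only clusters whose members sit close together in the graph."""
    return [c for c in clusters
            if mean_intra_cluster_distance(adj, c) <= max_mean_distance]

# Hypothetical toy graph: two triangles joined by the bridge 3-4.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

clusters = [{1, 2, 3}, {1, 5, 6}]
kept = purify(adj, clusters, max_mean_distance=1.5)
print(kept)
```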

Dimensionality Reduction

• Cluster membership matrix to represent pruned base clusters
• Dimensions likely to be high (9 X k)
• Clustering inefficient for high-dimensional data
  – Distance metric computations do not scale well
• Lot of noise and redundancy in the matrix
• Solution: Reduce dimensions of the matrix
  – Apply logistic PCA
  – Variant of PCA for binary data (Schein et al, 2003)
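
To make the input to the reduction step concrete, the sketch below builds a binary cluster membership matrix with one row per protein and one column per surviving base cluster. Logistic PCA itself is not implemented here; the protein names, the example clusters and the function name are assumptions.

```python
def membership_matrix(proteins, base_clusters):
    """Rows are proteins, columns are base clusters; 1 marks membership."""
    row_of = {p: r for r, p in enumerate(proteins)}
    matrix = [[0] * len(base_clusters) for _ in proteins]
    for col, cluster in enumerate(base_clusters):
        for p in cluster:
            matrix[row_of[p]][col] = 1
    return matrix

# Hypothetical example: four proteins and four pruned base clusters
# drawn from two different arrangements.
proteins = ["A", "B", "C", "D"]
base_clusters = [{"A", "B"}, {"C", "D"}, {"A", "B", "C"}, {"D"}]

matrix = membership_matrix(proteins, base_clusters)
for name, row in zip(proteins, matrix):
    print(name, row)
```

With nine arrangements of k clusters each, the matrix has up to 9k binary columns, which is the high-dimensional, redundant representation that the talk reduces with logistic PCA before consensus clustering.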

Consensus Clustering

• Agglomerative Hierarchical Clustering
  – Bottom-up clustering algorithm
  – Begin with each point in a separate cluster
  – Iteratively merge clusters that are similar
• Recursive Bisection (RBR) algorithm
• Soft Clustering Variants
  – Find initial clusters using agglo or RBR
  – Assign points to multiple clusters based on similarity
  – Hub nodes have high propensity for multiple membership
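
The soft variant can be sketched as a second pass over a hard clustering: compute a centroid per initial cluster, then let each point join every cluster whose similarity is close to its best one. The talk does not state the similarity measure or the assignment rule, so the cosine similarity, the `ratio` cutoff and the toy vectors below are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroids(points, labels):
    """Mean vector of the points assigned to each hard cluster."""
    sums, counts = {}, {}
    for p, c in zip(points, labels):
        acc = sums.setdefault(c, [0.0] * len(p))
        for d, x in enumerate(p):
            acc[d] += x
        counts[c] = counts.get(c, 0) + 1
    return {c: [x / counts[c] for x in acc] for c, acc in sums.items()}

def soft_assign(points, labels, ratio=0.8):
    """Assign each point to all clusters within `ratio` of its best similarity."""
    cents = centroids(points, labels)
    result = []
    for p in points:
        sims = {c: cosine(p, cen) for c, cen in cents.items()}
        best = max(sims.values())
        result.append(sorted(c for c, s in sims.items() if s >= ratio * best))
    return result

# Hypothetical reduced vectors; the last point sits between both groups,
# loosely playing the role of a hub protein.
points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.7, 0.7]]
hard_labels = [0, 0, 1, 1, 0]

soft = soft_assign(points, hard_labels, ratio=0.8)
print(soft)
```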

Ensemble Framework (Detailed View)

[Figure: detailed pipeline. Topological metrics (weights) → weighted graph → base clustering → base clustering arrangements → cluster purification (pruning) → principal component analysis → consensus clustering (agglomerative clustering: PCA-agglo; PCA-rbr; soft: PCA-soft-variants) → final clusters]

III. Evaluation

Validation Metrics: Domain Independent

• Topological measure: Modularity [Newman & Girvan 04]
  – Measures the modularity within clusters
  – dij represents fraction of edges linking nodes in clusters i and j
• Information theoretic measure: Normalized Mutual Information [Strehl & Ghosh 03]
  – Measures the shared information between the consensus and base clustering arrangements
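
Both measures can be sketched in a few lines. Modularity is computed in the Newman-Girvan form Q = Σ_i (d_ii − (Σ_j d_ij)²), with d_ij as defined above and each between-cluster edge split evenly between d_ij and d_ji. NMI is computed as I(X;Y) / sqrt(H(X)·H(Y)), the normalization used by Strehl and Ghosh. The function names and toy data are assumptions; how the talk aggregates NMI over the nine base arrangements is not stated in the transcript.

```python
import math
from collections import Counter

def modularity(edges, cluster_of):
    """Newman-Girvan modularity of a hard clustering of an undirected graph."""
    m = len(edges)
    within = Counter()   # d_ii: fraction of edges inside cluster i
    row_sum = Counter()  # sum over j of d_ij
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        if cu == cv:
            within[cu] += 1.0 / m
            row_sum[cu] += 1.0 / m
        else:
            row_sum[cu] += 0.5 / m
            row_sum[cv] += 0.5 / m
    return sum(within[c] - row_sum[c] ** 2 for c in row_sum)

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(x, y):
    """Normalized mutual information between two label vectors."""
    n = len(x)
    px, py = Counter(x), Counter(y)
    joint = Counter(zip(x, y))
    mi = sum((c / n) * math.log(c * n / (px[a] * py[b]))
             for (a, b), c in joint.items())
    hx, hy = entropy(x), entropy(y)
    return mi / math.sqrt(hx * hy) if hx > 0 and hy > 0 else 0.0

# Hypothetical toy graph: two triangles joined by the bridge 3-4.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
cluster_of = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}

q = modularity(edges, cluster_of)
same = nmi([0, 0, 1, 1], [1, 1, 0, 0])       # same partition, labels renamed
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # statistically independent labels
print(round(q, 4), same, unrelated)
```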

Validation Metric: Domain Dependent

• Domain-based measure:
  – Gene ontology annotations for each cluster of proteins
    • Cellular Component
    • Molecular Function
    • Biological Process
  – P-value to measure statistical significance of clusters
    • Computes the probability of the grouping being random
    • Smaller p-values represent higher biological significance
  – Clustering Score to measure overall clustering arrangement
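
The transcript does not say which statistical test produces the p-value. A common choice for annotation enrichment is the hypergeometric tail, sketched below under that assumption: the chance of seeing at least k annotated proteins in a cluster of size n when M of the N proteins overall carry the annotation. The parameter names are assumptions, and the Clustering Score formula is not preserved in the transcript, so it is not reproduced.

```python
from math import comb

def enrichment_p_value(N, M, n, k):
    """P(X >= k) for a hypergeometric draw.

    N: proteins in the network, M: proteins carrying the annotation,
    n: cluster size, k: annotated proteins observed in the cluster.
    """
    upper = min(n, M)
    hits = sum(comb(M, i) * comb(N - M, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)

# Hypothetical numbers: 10 proteins, 3 annotated, a cluster of 3.
p_all_three = enrichment_p_value(N=10, M=3, n=3, k=3)
p_at_least_zero = enrichment_p_value(N=10, M=3, n=3, k=0)
print(p_all_three, p_at_least_zero)
```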

Experimental Setup

• Algorithms proposed by Strehl et al, 2003
  – HyperGraph Partitioning Algorithm (HGPA)
    • Minimal Hyperedge Separator using HMetis
  – Meta-Clustering Algorithm (MCLA)
    • Group related hyperedges to form meta-clusters
    • Assign each point to the closest meta-cluster
  – Cluster-based Similarity Partitioning Algorithm (CSPA)
    • Pairwise similarity matrix is partitioned with METIS
• Algorithms proposed by Gionis et al, ICDE 2005
  – Agglomerative algorithm (CE-agglo)
  – Density-based clustering algorithm (CE-balls)
  – Use strict thresholds and are non-parametric
• Database of Interacting Proteins (DIP)
  – 4928 proteins, 17194 interactions

Modularity and NMI

CSPA algorithm ran out of memory
CE-agglo and CE-balls algorithms resulted in pairs and singleton clusters (cluster-sizes 2121 and 2783 respectively)

PCA-based consensus methods provide best scores!

Algorithm    Modularity   NMI
PCA-agglo    0.471        0.66
PCA-rbr      0.46         0.656
MCLA         0.41         0.614
HGPA         0.1          0.275

Comparison with Ensemble Algorithms

[Figure: bar chart of Clustering Score (0 to 1) for the Process, Function and Component ontologies, comparing the ensemble algorithms CE-balls, CE-agglo, HGPA, PCA-agglo, PCA-rbr, MCLA and Wt-agglo]

PCA-based Consensus methods outperform all other algorithms!

MCLA performs best of the other algorithms

Existing Solutions to Identify Dense Regions

• Molecular Complex Detection (MCODE)
  – Bader et al, 2003
  – Use local neighborhood density to identify seed vertices
  – Group highly weighted vertices around seed vertices
• Markov Cluster Algorithm (MCL)
  – Dongen et al 2000
  – Random walks on the graph will infrequently go from one natural cluster to another
  – Cluster structure separates out
  – Fast, scalable and non-parametric

Comparison with MCODE and MCL

• MCODE produced only 59 clusters
  – Not all proteins clustered (794/4928)
  – 10-20 clusters insignificant
• MCL produced 1246 clusters
  – Most of the clusters insignificant (close to 75-80%)

Algorithm    Modularity
PCA-agglo    0.471
MCL          0.217
MCODE        0.372

Soft Clustering: Comparison with Hub Duplication (Ucar 2006)

[Figure: hub duplication workflow. For each hub Hi (i++): hub-induced subgraph Si → dense components of Si → duplicate Hi (D’i); then graph partitioning]

Benefits of Soft Ensemble Clustering

A closer look at soft clustering performance

• CKA1 (hub protein)

Base Algorithm   Annotation
Direct-bet       Kinase CK2 complex
Direct-cc        rRNA metabolism
RBR-bet          Kinase CK2 complex
RBR-cc           Kinase CK2 complex
Metis-bet        Cell organization and biogenesis
Metis-cc

PCA-agglo:       Kinase CK2 complex
PCA-softagglo:   Kinase CK2 complex; rRNA metabolism; Cell organization and biogenesis

Concluding Remarks

• Clustering PPI networks is challenging

  – Noise
  – Presence of hubs
  – Need for soft clustering
  – Integration
• Ensemble clustering shows promise as a unified method to handle these problems
  – Competes well against existing stand-alone solutions
  – Scalable: straightforward parallelization for the most part
• Ongoing work
  – General applicability
    • WWW applications
    • Social network analysis
  – Explicit modeling of domain knowledge
    • E.g. encoding directionality
  – Data Integration
    • Key is to weight edges and/or components of the ensemble
  – Leveraging graphical models
  – More robust base models
    • Extrinsic similarity measures
    • Impact of anomalies

Questions?

• We acknowledge the following grants for support
  – NSF: CAREER-IIS-0347662
  – NSF: NGS-CNS-0406386
  – NSF: RI-CNS-0403342
  – DOE: ECPI-FG02
• Graduate Student Colleagues
  – S. Asur and D. Ucar
• Details
  – http://dmrl.cse.ohio-state.edu
  – www.cse.ohio-state.edu/~srini/