Copyright 2006, Data Mining Research Laboratory
Integrated Mining of PPI Networks: A Case for Ensemble Clustering
Srinivasan Parthasarathy
Department of Computer Science and Engineering
The Ohio State University
Joint work with Sitaram Asur and Duygu Ucar
I. Preliminaries and Motivation
Proteins
• Central component of cell machinery and life
  – It is the proteins dynamically generated by a cell that execute the genetic program [Kahn 1995]
• Proteins work with other proteins [Von Mering et al 2002]
  – Form large interaction networks, typically referred to as protein-protein interaction (PPI) networks
  – Regulate and support each other for specific functionality or process
Protein-Protein Interaction Networks
• Why analyze?
  – To fully understand cellular machinery, simply listing proteins is not enough; (clusters of) interactions need to be delineated as well [v.Mering 2002]
• Understanding the organism
  – Protein function prediction
    • E.g. no functional annotations for one-third of baker’s yeast
  – Drug design
• Goal: To find modular clusters
Challenges in analyzing PPI Networks
– Noisy data
  • False positives [Deane 2002], false negatives [Hsu 06]
– Existence of hub nodes
  • Particularly problematic for standard clustering and graph partitioning algorithms: they lead to very large core clusters and not much else!
– Proteins can be multi-faceted
  • Can belong to multiple functional groups; most clustering algorithms are hard, hence the need for soft or fuzzy clustering
– Data integration issues
  • Multiple sources
    – Two-hybrid (Y2H), mass spectrometry, genetic co-occurrence
  • Different targets
    – Y2H, mass spec: target binding
    – Gene co-occurrence: target functional
  • Different weaknesses (missing certain interactions)
    – Y2H: translation
    – Mass spectrometry: transport & sensing
Ensemble Clustering
• A useful approach to combine the results from multiple clustering arrangements into a single arrangement based on consensus [SG03]
• Objective: Map the clusters obtained by different algorithms to a single clustering arrangement
• Our hypothesis: Potentially offers a viable solution for several of these problems simultaneously
  – Given the nice theory in the context of classification, it is likely to be particularly useful in a noisy environment
    • A weak analogy: the audience vote in "Who Wants to Be a Millionaire"
  – Naturally handles arrangements produced from different sources or domain-driven segmentation
Ensemble Clustering on PPI networks: Key Questions
• What are the base clustering methods and arrangements to use in the context of interaction networks?
  – How to handle the influence of noise and hubs?
• How do we scale to problems of the size of interaction networks?
• How do we address the issue of soft clustering?
• How to address the issue of data integration?
  – Another day, another time
II. Ensemble Clustering Framework
Bird's-eye view (coarse grained)
[Figure: framework overview. Labels: scale-free graph; topology-based similarity metrics; clustering algorithms; clustering arrangements; x, y; xy base clustering arrangements; cluster representation; (soft) consensus clustering; final clusters]
Similarity Metrics
• Central to any clustering algorithm
• Key idea:
  – Leverage topological information to determine the similarity between two proteins in the interaction network
  – With the ensemble approach we are not limited to one!
• Metrics:
  – Clustering coefficient based (edge oriented, local)
  – Edge betweenness based (edge oriented, global)
  – Neighborhood based (local, non-edge oriented)
Clustering coefficient-based similarity
• Clustering coefficient
  – "all-my-friends-know-each-other" property
  – Measures the interconnectivity of a node’s neighbors
• Clustering coefficient-based similarity of two connected nodes vi and vj
– Measures the contribution of the edge between the nodes towards the clustering coefficient of the nodes
[Figure: example graph with nodes 1 to 6; vi and vj mark the two connected nodes being compared]
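The slide's formula did not survive in the transcript, so the following is only a sketch of one plausible reading: the similarity of an edge is taken to be the drop in the two endpoints' clustering coefficients when that edge is removed. The function names and the dict-of-sets graph representation are assumptions made for illustration.

```python
from itertools import combinations

def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbors that are themselves connected."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))

def cc_edge_similarity(adj, vi, vj):
    """Assumed form: CC(vi) + CC(vj) with the edge, minus the same sum without it."""
    without = {v: set(n) for v, n in adj.items()}
    without[vi].discard(vj)
    without[vj].discard(vi)
    before = clustering_coefficient(adj, vi) + clustering_coefficient(adj, vj)
    after = clustering_coefficient(without, vi) + clustering_coefficient(without, vj)
    return before - after

# Triangle 0-1-2 with a pendant node 3 attached to node 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(cc_edge_similarity(adj, 0, 1))  # edge inside the triangle: positive
print(cc_edge_similarity(adj, 0, 3))  # pendant edge: negative
```

Under this reading, an edge that closes triangles scores high, while an edge that merely dilutes a node's neighborhood scores low, which fits the slide's later remark that this metric targets false positives.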
Edge betweenness-based similarity
• Shortest path edge betweenness [Newman et al]
  – “I-am-between-every-pair” property
  – Computes the fraction of shortest paths passing through an edge
  – Edges that lie between communities have high values of betweenness
  – Edge betweenness-based similarity
[Figure: example graph with nodes 1 to 8]
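As an illustration of shortest-path edge betweenness, here is a minimal pure-Python sketch using breadth-first search with path counting (the Brandes approach) on an unweighted, undirected graph. It returns path counts per edge rather than fractions; dividing by the number of node pairs would give the fraction the slide mentions. How the talk converts betweenness into a similarity weight is not shown in the transcript, so that step is left out.

```python
from collections import deque

def edge_betweenness(adj):
    """Number of shortest paths through each edge (ties shared fractionally).

    adj maps each node to the set of its neighbors; keys of the result are
    frozenset({u, v}) edge identifiers.
    """
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and recording predecessors.
        dist = {s: 0}
        sigma = {v: 0.0 for v in adj}
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, pushing dependency onto edges.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    # Every unordered pair was counted once from each endpoint.
    return {e: c / 2.0 for e, c in score.items()}

# Two triangles joined by the bridge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
eb = edge_betweenness(adj)
print(max(eb, key=eb.get))  # the bridge between the two communities
```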
Neighborhood-based similarity
• “my-friends-are-your-friends” property
• Based on the number of common neighbors between nodes (Czekanowski-Dice metric [Brun et al, 2004])
  where Int(i) = number of neighbors of node i
[Figure: example graph with nodes 1 to 6]
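The formula on this slide is missing from the transcript. A minimal sketch of the Czekanowski-Dice distance as it is commonly stated follows; treating Int(i) as the neighbor set of i plus i itself is the convention usually attributed to Brun et al. and is an assumption here, as is the function name.

```python
def czekanowski_dice_distance(adj, i, j):
    """Size of the symmetric difference of the two neighborhoods, divided by
    the size of their union plus the size of their intersection."""
    int_i = set(adj[i]) | {i}   # assumed convention: a node belongs to its own list
    int_j = set(adj[j]) | {j}
    sym_diff = len(int_i ^ int_j)
    return sym_diff / (len(int_i | int_j) + len(int_i & int_j))

# Nodes 0 and 1 share all neighbors; nodes 0 and 3 share none.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4}, 4: {3}}
print(czekanowski_dice_distance(adj, 0, 1))
print(czekanowski_dice_distance(adj, 0, 3))
```

A distance of 0 means identical neighborhoods and 1 means disjoint ones, so a similarity weight can be taken as one minus this value.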
Base Clustering
• Base clustering algorithms: different criteria
  – kMetis
  – Repeated bisections
  – Direct k-way partitioning
• Topology-based similarity measures: weight interactions
  – Clustering coefficient-based: local, targets FP
  – Edge betweenness-based: global, targets FP
  – Neighborhood: local, potentially targets FN & FP
• 3 × 3 = 9 arrangements (variance is good!)
  – K clusters per arrangement (K clusters)
PCA-based Consensus Technique
Cluster Purification
Dimensionality Reduction
Consensus Clustering
Cluster Purification
• Goal: Prune unreliable base clusters
• Intra-cluster similarity measure
  where SP(i,j) represents shortest path between i and j
• Low intra-cluster distance => high reliability
• Remove clusters with low reliability
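The exact intra-cluster measure is not preserved in the transcript. As a rough sketch, assume it is the mean shortest-path length SP(i,j) over all pairs of nodes in a cluster, and that clusters above a cutoff are dropped. The cutoff parameter `max_mean_distance` and all function names are hypothetical.

```python
from collections import deque
from itertools import combinations

def shortest_path_length(adj, src, dst):
    """BFS hop count from src to dst; None if dst is unreachable."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        if v == dst:
            return dist[v]
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return None

def mean_intra_cluster_distance(adj, cluster):
    """Average SP(i, j) over all node pairs of the cluster (assumed measure)."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0  # a singleton has no pairs to measure
    total = 0
    for i, j in pairs:
        d = shortest_path_length(adj, i, j)
        if d is None:
            return float("inf")  # disconnected members make the cluster unreliable
        total += d
    return total / len(pairs)

def purify(adj, clusters, max_mean_distance):
    """Keep only clusters whose members are close together in the graph."""
    return [c for c in clusters
            if mean_intra_cluster_distance(adj, c) <= max_mean_distance]

# Path graph 0-1-2-3-4: [0, 1, 2] is compact, [0, 4] is stretched out.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(purify(adj, [[0, 1, 2], [0, 4]], max_mean_distance=2.0))
```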
Dimensionality Reduction
• Cluster membership matrix to represent pruned base clusters
• Dimensions likely to be high (9 × k)
• Clustering inefficient for high-dimensional data
  – Distance metric computations do not scale well
• Lot of noise and redundancy in the matrix
• Solution: Reduce dimensions of the matrix
  – Apply logistic PCA
  – Variant of PCA for binary data (Schein et al, 2003)
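To make the cluster membership matrix concrete, the sketch below builds a binary matrix with one row per protein and one column per retained base cluster. The layout (rows as proteins) and the function name are assumptions; the logistic PCA step itself is not shown.

```python
def membership_matrix(nodes, arrangements):
    """Binary matrix: entry is 1 when the node belongs to that base cluster.

    arrangements is a list of clustering arrangements, each a list of
    clusters (sets of nodes); columns follow that nesting order.
    """
    columns = [cluster for arrangement in arrangements for cluster in arrangement]
    return [[1 if node in cluster else 0 for cluster in columns] for node in nodes]

nodes = ["a", "b", "c"]
arrangements = [
    [{"a", "b"}, {"c"}],   # arrangement 1
    [{"a"}, {"b", "c"}],   # arrangement 2
]
for row in membership_matrix(nodes, arrangements):
    print(row)
```

With 9 arrangements of k clusters each, every row has up to 9 × k binary entries, which is the high-dimensional input the slide proposes to reduce.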
Consensus Clustering
• Agglomerative Hierarchical Clustering
  – Bottom-up clustering algorithm
  – Begin with each point in a separate cluster
  – Iteratively merge clusters that are similar
• Recursive Bisection (RBR) algorithm
• Soft Clustering Variants
  – Find initial clusters using agglo or RBR
  – Assign points to multiple clusters based on similarity
– Hub nodes have high propensity for multiple membership
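The bullets above can be sketched as a small agglomerative pass followed by a soft reassignment. This is an illustration only: cosine similarity between centroids, the stopping rule (a target number of clusters `k`), and the `threshold` for extra memberships are all assumptions, not details given in the talk.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(points, members):
    dim = len(points[0])
    return [sum(points[m][d] for m in members) / len(members) for d in range(dim)]

def agglomerate(points, k):
    """Start with singletons and merge the most similar pair until k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = cosine(centroid(points, clusters[a]), centroid(points, clusters[b]))
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

def soften(points, clusters, threshold):
    """Let a point also join every cluster whose centroid it resembles enough."""
    centers = [centroid(points, c) for c in clusters]
    soft = [set(c) for c in clusters]
    for i, p in enumerate(points):
        for c, center in enumerate(centers):
            if cosine(p, center) >= threshold:
                soft[c].add(i)
    return soft

# Point 4 sits between the two groups, much like a hub protein.
points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.8, 0.6]]
hard = agglomerate(points, k=2)
print(hard)
print(soften(points, hard, threshold=0.6))
```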
Ensemble Framework (Detailed View)
[Figure: detailed pipeline. Labels: topological metrics; weights; weighted graph; base clustering; base clustering arrangements; pruning; cluster purification; principal component analysis; consensus clustering; agglomerative clustering; soft; PCA-agglo; PCA-rbr; PCA-soft-variants; final clusters]
III. Evaluation
Validation Metrics: Domain Independent
• Topological measure: Modularity [Newman&Girvan04]
  – Measures the modularity within clusters
  – dij represents the fraction of edges linking nodes in clusters i and j
• Information theoretic measure: Normalized Mutual Information [Strehl & Ghosh03]
  – Measures the shared information between the consensus and base clustering arrangements
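Neither formula survived in the transcript, so the sketch below uses the standard definitions from the cited works: Newman-Girvan modularity, Q = Σ_i (d_ii − (Σ_j d_ij)²), with an edge between two clusters split evenly across d_ij and d_ji, and NMI normalized by the square root of the product of the two entropies. Function names and input formats are assumptions.

```python
import math
from collections import Counter

def modularity(edges, label):
    """Newman-Girvan modularity for an undirected edge list and a node -> cluster map."""
    m = len(edges)
    d = {}
    for u, v in edges:
        cu, cv = label[u], label[v]
        # Each edge adds 1/m in total; a cross-cluster edge is split between (cu, cv) and (cv, cu).
        d[(cu, cv)] = d.get((cu, cv), 0.0) + 0.5 / m
        d[(cv, cu)] = d.get((cv, cu), 0.0) + 0.5 / m
    clusters = set(label.values())
    q = 0.0
    for i in clusters:
        a_i = sum(d.get((i, j), 0.0) for j in clusters)
        q += d.get((i, i), 0.0) - a_i ** 2
    return q

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def nmi(a, b):
    """Mutual information of two labelings divided by sqrt(H(a) * H(b))."""
    n = len(a)
    joint = Counter(zip(a, b))
    ca, cb = Counter(a), Counter(b)
    mi = sum(c / n * math.log(c * n / (ca[x] * cb[y])) for (x, y), c in joint.items())
    denom = math.sqrt(entropy(a) * entropy(b))
    return mi / denom if denom else 0.0

# Two triangles joined by one bridge edge, split into their natural clusters.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
label = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
print(modularity(edges, label))
print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))
```

To score a consensus arrangement against the ensemble, `nmi` would be averaged over all base arrangements.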
Validation Metric: Domain Dependent
• Domain-based measure:
  – Gene ontology annotations for each cluster of proteins
    • Cellular Component
    • Molecular Function
    • Biological Process
  – P-value to measure statistical significance of clusters
    • Computes the probability of the grouping being random
    • Smaller p-values represent higher biological significance
  – Clustering Score to measure overall clustering arrangement
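The transcript does not give the p-value formula. A common choice for Gene Ontology enrichment is the hypergeometric tail probability, sketched below under that assumption; the function and parameter names are illustrative, and the Clustering Score is not reproduced because its definition is not in the transcript.

```python
from math import comb

def enrichment_p_value(population, annotated, cluster_size, hits):
    """Probability that a random cluster of this size contains at least `hits`
    proteins carrying the annotation (hypergeometric upper tail)."""
    total = comb(population, cluster_size)
    upper = min(annotated, cluster_size)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, cluster_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total

# 10 proteins overall, 3 annotated; a cluster of 3 that captures all 3 of them.
print(enrichment_p_value(population=10, annotated=3, cluster_size=3, hits=3))
```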
Experimental Setup
• Algorithms proposed by Strehl et al, 2003
  – HyperGraph Partitioning Algorithm (HGPA)
    • Minimal hyperedge separator using HMetis
  – Meta-CLustering Algorithm (MCLA)
    • Group related hyperedges to form meta-clusters
    • Assign each point to the closest meta-cluster
  – Cluster-based Similarity Partitioning (CSPA)
    • Pairwise similarity matrix is partitioned with METIS
• Algorithms proposed by Gionis et al, ICDE 2005
  – Agglomerative algorithm (CE-agglo)
  – Density-based clustering algorithm (CE-balls)
  – Use strict thresholds and are non-parametric
• Database of Interacting Proteins (DIP)
  – 4928 proteins, 17194 interactions
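For the CSPA baseline, the pairwise similarity matrix is typically the co-association matrix: the fraction of base arrangements that place two points in the same cluster. The sketch below builds it for a toy input (the function name and list-of-labels format are assumptions) and leaves out the METIS partitioning step.

```python
def co_association(labelings):
    """n x n matrix: fraction of arrangements in which points i and j co-cluster."""
    n = len(labelings[0])
    r = len(labelings)
    sim = [[0.0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    sim[i][j] += 1.0 / r
    return sim

# Two arrangements over three points.
for row in co_association([[0, 0, 1], [0, 1, 1]]):
    print(row)
```

The matrix is quadratic in the number of points, so for the 4928 proteins in DIP it already holds about 24 million entries.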
Modularity and NMI
CSPA algorithm ran out of memory
CE-agglo and CE-balls algorithms resulted in pairs and singleton clusters (cluster-sizes 2121 and 2783 respectively)
PCA-based consensus methods provide best scores!
Algorithm Modularity NMI
PCA-agglo 0.471 0.66
PCA-rbr 0.46 0.656
MCLA 0.41 0.614
HGPA 0.1 0.275
Comparison with Ensemble Algorithms
[Figure: bar chart "Ensemble Algorithms". Y-axis: Clustering Score (0 to 1). X-axis: CE-balls, CE-agglo, HGPA, PCA-agglo, PCA-rbr, MCLA, Wt-agglo. Series: Process, Function, Component]
PCA-based Consensus methods outperform all other algorithms!
MCLA performs best of the other algorithms
Existing Solutions to Identify Dense Regions
• Molecular Complex Detection (MCODE)
  – Bader et al, 2003
  – Use local neighborhood density to identify seed vertices
  – Group highly weighted vertices around seed vertices
• Markov Cluster Algorithm (MCL)
  – Dongen et al 2000
  – Random walks on the graph will infrequently go from one natural cluster to another
  – Cluster structure separates out
  – Fast, scalable and non-parametric
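The random-walk intuition behind MCL can be sketched as alternating expansion (squaring the transition matrix) and inflation (raising entries to a power and renormalizing columns). This toy version uses dense lists and a fixed iteration count; the parameter defaults and the cluster read-out rule are simplifications chosen for illustration, not the reference implementation.

```python
def normalize_columns(m):
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]

def mcl(adj, inflation=2.0, iterations=20, eps=1e-6):
    """Toy Markov clustering on a dict-of-sets graph; returns a set of frozensets."""
    nodes = sorted(adj)
    n = len(nodes)
    index = {v: k for k, v in enumerate(nodes)}
    # Column-stochastic transition matrix, with a self-loop added at every node.
    m = [[0.0] * n for _ in range(n)]
    for v in nodes:
        for w in set(adj[v]) | {v}:
            m[index[w]][index[v]] = 1.0
    m = normalize_columns(m)
    for _ in range(iterations):
        # Expansion: two steps of the random walk.
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: strengthen strong flows, weaken weak ones, renormalize.
        m = normalize_columns([[x ** inflation for x in row] for row in m])
    # Rows that keep mass on the diagonal act as attractors; the columns they
    # still reach form one cluster.
    return {
        frozenset(nodes[j] for j in range(n) if m[i][j] > eps)
        for i in range(n) if m[i][i] > eps
    }

# Two disconnected triangles: walks never cross, so each triangle is a cluster.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
print(mcl(adj))
```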
Comparison with MCODE and MCL
• MCODE produced only 59 clusters
  – Not all proteins clustered (794/4928)
  – 10-20 clusters insignificant
• MCL produced 1246 clusters
  – Most of the clusters insignificant (close to 75-80%)
Algorithm Modularity
PCA-agglo 0.471
MCL 0.217
MCODE 0.372
Soft Clustering: Comparison with Hub Duplication (Ucar 2006)
[Figure: hub duplication procedure. Labels: for each hub Hi (i++); hub-induced subgraph Si; dense components of Si; duplicate Hi (D’i); graph partitioning]
Benefits of Soft Ensemble Clustering
A closer look at soft clustering performance
• CKA1 (hub protein)
Columns: Base Algorithm, Annotation, PCA-agglo, PCA-softagglo
Direct-bet Kinase CK2 complex Kinase CK2 complex
Kinase CK2 complex
Direct-cc rRNA metabolism rRNA metabolism
RBR-bet Kinase CK2 complex Cell organization and biogenesis
RBR-cc Kinase CK2 complex
Metis-bet Cell organization and biogenesis
Metis-cc
Concluding Remarks
• Clustering PPI networks is challenging
  – Noise
  – Presence of hubs
  – Need for soft clustering
  – Integration
• Ensemble clustering shows promise as a unified method to handle these problems
  – Competes well against existing stand-alone solutions
  – Scalable: straightforward parallelization for the most part
• Ongoing work
  – General applicability
    • WWW applications
    • Social network analysis
  – Explicit modeling of domain knowledge
    • E.g. encoding directionality
  – Data integration
    • Key is to weight edges and/or components of the ensemble
  – Leveraging graphical models
  – More robust base models
    • Extrinsic similarity measures
    • Impact of anomalies
Questions?
• We acknowledge the following grants for support
  – NSF: CAREER-IIS-0347662
  – NSF: NGS-CNS-0406386
  – NSF: RI-CNS-0403342
  – DOE: ECPI-FG02
• Graduate Student Colleagues
  – S. Asur and D. Ucar
• Details
  – http://dmrl.cse.ohio-state.edu
  – www.cse.ohio-state.edu/~srini/