Anindya Bhattacharya and Rajat K. De, Bioinformatics, 2008
TRANSCRIPT
Outline:
- Introduction
- Divisive Correlation Clustering Algorithm
- Results
- Conclusions
Correlation Clustering
Correlation clustering was proposed by Bansal et al. (Machine Learning, 2004). It is based on the notion of graph partitioning.
How to construct the graph? Nodes are genes; edges represent correlations between genes.
There are two types of edges: positive edges and negative edges.
For example:
[Figure: an edge between genes X and Y is a positive edge when their correlation coefficient is positive, and a negative edge when it is negative. Graph construction: nodes A through H are connected by positive and negative edges. Graph partitioning: the graph is split into Cluster 1 and Cluster 2.]
How to measure the quality of clusters? By the number of agreements and the number of disagreements.
The number of agreements: the number of edges placed consistently with the clustering.
The number of disagreements: the number of edges placed inconsistently with the clustering (wrongly clustered).
For example, consider genes A-E partitioned into Cluster 1 and Cluster 2.
The measure of agreements is the sum of:
(1) the number of positive edges within the same clusters,
(2) the number of negative edges between different clusters.
The measure of disagreements is the sum of:
(1) the number of negative edges within the same clusters,
(2) the number of positive edges between different clusters.
In the example: agreements = 4 + 4 = 8; disagreements = 0 + 2 = 2.
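The agreement/disagreement counting above can be sketched in a few lines of Python. The edge signs below are hypothetical, chosen only to reproduce the counts from the slide's five-gene example.

```python
# Hypothetical signed edges over genes A-E (+1 = positive edge, -1 = negative edge),
# chosen to match the slide's counts; Cluster 1 = {A, B, C}, Cluster 2 = {D, E}.
edges = {
    ("A", "B"): +1, ("A", "C"): +1, ("B", "C"): +1, ("D", "E"): +1,
    ("A", "D"): -1, ("A", "E"): -1, ("B", "D"): -1, ("C", "E"): -1,
    ("B", "E"): +1, ("C", "D"): +1,
}
cluster_of = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2}

def score(edges, cluster_of):
    """Return (agreements, disagreements) for a signed graph and a clustering."""
    agree = disagree = 0
    for (u, v), sign in edges.items():
        same = cluster_of[u] == cluster_of[v]
        # Agreement: positive edge inside a cluster, or negative edge across clusters.
        if (sign > 0) == same:
            agree += 1
        else:
            disagree += 1
    return agree, disagree

print(score(edges, cluster_of))  # (8, 2), matching 4 + 4 and 0 + 2
```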
The goal is the minimization of disagreements, or equivalently the maximization of agreements.
However, Bansal et al. (2004) proved this problem to be NP-complete.
Another limitation is that the formulation uses only the signs of the correlation coefficients, ignoring their magnitudes.
Divisive Correlation Clustering Algorithm (DCCA):
- Pearson correlation coefficient
- Terms and measurements used in DCCA
- The algorithm
Consider a set of genes $X = \{x_1, \dots, x_n\}$, for each of which $m$ expression values are given.
The Pearson correlation coefficient between two genes $x_i$ and $x_j$ is defined as:

$$\rho(x_i, x_j) = \frac{\sum_{l=1}^{m} (x_{il} - \bar{x}_i)(x_{jl} - \bar{x}_j)}{\sqrt{\sum_{l=1}^{m} (x_{il} - \bar{x}_i)^2}\,\sqrt{\sum_{l=1}^{m} (x_{jl} - \bar{x}_j)^2}}$$

where $x_{il}$ is the $l$th sample value of gene $x_i$, and $\bar{x}_i$ is the mean value of gene $x_i$ over the $m$ samples.
$\rho(x_i, x_j) > 0$: $x_i$ and $x_j$ are positively correlated, with the degree of correlation given by its magnitude.
$\rho(x_i, x_j) < 0$: $x_i$ and $x_j$ are negatively correlated, with value $\rho(x_i, x_j)$.
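As a quick sanity check of the definition, `pearson` below is a minimal pure-Python sketch (not the authors' code):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length expression profiles."""
    m = len(x)
    mean_x, mean_y = sum(x) / m, sum(y) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sy = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sx * sy)

# A gene rising in step with another is perfectly positively correlated;
# a mirrored expression profile is perfectly negatively correlated.
print(round(pearson([1, 2, 3, 4], [2, 4, 6, 8]), 6))   # 1.0
print(round(pearson([1, 2, 3, 4], [8, 6, 4, 2]), 6))   # -1.0
```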
We define some terms and measurements used in DCCA:
- Attraction
- Repulsion
- Attraction/Repulsion value
- Average correlation value
Attraction: there is an attraction between $x_i$ and $x_j$ if $\rho(x_i, x_j) > 0$.
Repulsion: there is a repulsion between $x_i$ and $x_j$ if $\rho(x_i, x_j) < 0$.
Attraction/Repulsion value: the magnitude $|\rho(x_i, x_j)|$ is the strength of the attraction or repulsion.
The genes will be grouped into $K$ disjoint clusters $C_1, C_2, \dots, C_K$.
Average correlation value: the average correlation value for a gene $x_i$ with respect to a cluster $C_k$ is defined as:

$$A(x_i, C_k) = \frac{1}{|C_k|} \sum_{x_j \in C_k} \rho(x_i, x_j)$$

where $|C_k|$ is the number of data points in $C_k$. $A(x_i, C_k)$ indicates the average correlation of gene $x_i$ with the genes inside cluster $C_k$, and reflects the degree of inclusion of $x_i$ in cluster $C_k$.
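The average correlation value follows directly from the definition. The `rho` table below holds hypothetical pairwise correlation values, used only for illustration:

```python
# Hypothetical pairwise Pearson correlations between four genes (illustrative only).
rho = {
    ("g1", "g2"): 0.9, ("g1", "g3"): -0.4, ("g1", "g4"): 0.8,
    ("g2", "g3"): -0.5, ("g2", "g4"): 0.7, ("g3", "g4"): -0.6,
}

def corr(a, b):
    """Symmetric lookup into the pairwise correlation table."""
    return rho.get((a, b), rho.get((b, a)))

def avg_correlation(gene, cluster):
    """Average correlation value of `gene` with respect to `cluster` (A(x_i, C_k))."""
    others = [g for g in cluster if g != gene]
    return sum(corr(gene, g) for g in others) / len(others)

# g1 is attracted to the cluster {g2, g4} and repelled by {g3},
# so its degree of inclusion is higher in the former.
print(round(avg_correlation("g1", ["g2", "g4"]), 6))  # 0.85
print(round(avg_correlation("g1", ["g3"]), 6))        # -0.4
```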
Divisive Correlation Clustering Algorithm
[Figure: DCCA takes as input an expression matrix of n genes $x_1, \dots, x_n$ over m samples, and outputs K disjoint clusters $C_1, C_2, \dots, C_K$.]
Step 1:
Step 2: for each iteration, do:
Step 2-i:
Step 2-ii:
Step 2-iii:
[Figure: among the clusters $C_1, C_2, \dots, C_p$, find the cluster containing the largest repulsion value: cluster C.]
Step 2-iv:
[Figure: cluster C, containing the mutually repelled genes $x_i$ and $x_j$ together with other genes $x_k$, is split into two clusters $C_p$ and $C_q$, with $x_i$ and $x_j$ placed in different clusters.]
Step 2-v: [Figure: for each gene $x_k$, place a copy of $x_k$ into whichever of the clusters $C_1, C_2, \dots, C_K$ gives it the highest average correlation value, forming the new clusters $C_{NEW}$.]
Step 2-vi: [Figure: compare the old clusters $C_1, C_2, \dots, C_K$ with the new clusters $C_{NEW}$; if any gene has changed cluster, repeat Step 2.]
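Putting the steps together, here is a heavily simplified pure-Python sketch of the DCCA loop as read from the slides. It is not the authors' implementation; in particular, the seed-based split rule in Step 2-iv is an assumption.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    m = len(x)
    mx, my = sum(x) / m, sum(y) / m
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def dcca_sketch(genes):
    """genes: dict mapping gene name -> expression profile. Returns clusters (sets)."""
    names = sorted(genes)
    rho = {(a, b): pearson(genes[a], genes[b]) for a in names for b in names}
    clusters = [set(names)]                    # Step 1: start with a single cluster
    while True:
        # Steps 2-iii/2-iv: find the strongest repulsion (most negative correlation)
        # inside any cluster, and split that cluster around the repelled pair.
        worst, pair, target = 0.0, None, None
        for c in clusters:
            for a in c:
                for b in c:
                    if rho[(a, b)] < worst:
                        worst, pair, target = rho[(a, b)], (a, b), c
        if pair is not None:
            xi, xj = pair
            clusters.remove(target)
            cp, cq = {xi}, {xj}                # assumption: split seeded by the pair
            for g in target - {xi, xj}:
                (cp if rho[(g, xi)] >= rho[(g, xj)] else cq).add(g)
            clusters += [cp, cq]
        # Step 2-v: reassign every gene to the cluster with the highest
        # average correlation value.
        new_clusters = [set() for _ in clusters]
        for g in names:
            scores = []
            for c in clusters:
                others = [h for h in c if h != g]
                scores.append(sum(rho[(g, h)] for h in others) / len(others)
                              if others else float("-inf"))
            new_clusters[scores.index(max(scores))].add(g)
        new_clusters = [c for c in new_clusters if c]
        # Step 2-vi: stop once no repulsion remains and no gene moved.
        done = (pair is None and
                sorted(map(sorted, new_clusters)) == sorted(map(sorted, clusters)))
        clusters = new_clusters
        if done:
            return clusters

# Toy profiles: g1/g2 rise together, g3/g4 fall together.
genes = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8],
         "g3": [4, 3, 2, 1], "g4": [8, 6, 4, 2]}
print(sorted(sorted(c) for c in dcca_sketch(genes)))  # [['g1', 'g2'], ['g3', 'g4']]
```

Note that, as in the slides, the expected number of clusters is never supplied: splitting continues only while some cluster still contains a repulsion.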
Results
Performance comparison:
- A synthetic dataset ADS
- Nine gene expression datasets
A synthetic dataset ADS: three groups.
Experimental results: [Figures: DCCA recovers the three groups of ADS correctly, while the other methods produce undesired clusters.]
Five yeast datasets: Yeast ATP, Yeast PHO, Yeast AFR, Yeast AFRt, Yeast Cho et al.
Four mammalian datasets: GDS958 Wild type, GDS958 Knocked out, GDS1423, GDS2745.
Performance comparison: the z-score is calculated by observing the relation between a clustering result and the functional annotation of the genes in the clusters.
The z-score is based on the mutual information (MI) between the clustering result and a set of $N_A$ functional annotation attributes. It uses:
- the entropies $H(C, A_j)$ for each cluster-attribute pair,
- the entropy $H(C)$ of the clustering result independent of the attributes,
- the entropies $H(A_j)$ for each of the $N_A$ attributes independent of the clusters,

so that for each attribute $MI(C, A_j) = H(C) + H(A_j) - H(C, A_j)$, summed over the $N_A$ attributes. The z-score is defined as:

$$z = \frac{MI_{real} - \overline{MI}_{random}}{s_{random}}$$

where $MI_{real}$ is the MI computed for the clustered data using the attribute database. $MI_{random}$ is computed by calculating the MI for a clustering obtained by randomly assigning genes to clusters of uniform size, repeated until a distribution of values is obtained; $\overline{MI}_{random}$ is the mean and $s_{random}$ the standard deviation of these MI values.
A higher value of z indicates that genes are better clustered by function, indicating a more biologically relevant clustering result.
The Gibbons ClusterJudge tool is used to calculate the z-score for the five yeast datasets.
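Once the MI values are in hand, the z-score itself is a one-liner; a sketch with made-up MI numbers (illustrative only, not values from the paper):

```python
import statistics

def z_score(mi_real, mi_random_samples):
    """z = (MI_real - mean(MI_random)) / stdev(MI_random)."""
    return ((mi_real - statistics.mean(mi_random_samples))
            / statistics.stdev(mi_random_samples))

# Illustrative numbers: the real clustering's MI stands about 12.6 standard
# deviations above the MI of random clusterings, i.e. the clustering groups
# genes by function far better than chance would.
print(round(z_score(2.0, [1.0, 1.1, 0.9, 1.05, 0.95]), 3))  # 12.649
```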
Experimental results: [Figures: performance comparisons on the nine gene expression datasets.]
Conclusions
Pros:
DCCA is able to obtain a clustering solution from a gene-expression dataset with high biological significance.
DCCA detects clusters whose genes have similar variation patterns in their expression profiles, without taking the expected number of clusters as an input.
Cons:
The computational cost of repairing any misplacement occurring in the clustering step is high.
DCCA will not work if the dataset contains fewer than 3 samples: with only two samples, every correlation value is either +1 or -1.
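The two-sample degeneracy noted above can be verified directly: with m = 2 samples, the Pearson coefficient of any two non-constant profiles is exactly ±1, so correlation carries no magnitude information. A small check (the `pearson` helper is the textbook formula, not the authors' code):

```python
import math
import random

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    m = len(x)
    mx, my = sum(x) / m, sum(y) / m
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

random.seed(1)
# Every random pair of 2-sample profiles is perfectly (anti-)correlated.
for _ in range(5):
    x = [random.random() for _ in range(2)]
    y = [random.random() for _ in range(2)]
    print(round(pearson(x, y), 6))  # always 1.0 or -1.0
```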