April 22, 2023 Data Mining: Concepts and Techniques
δ-Clusters: Capturing Subspace Correlation in a Large Data Set
Authors: Yang Jiong, Wei Wang, et al. (ICDE 2002)
Presenter: Xuehua Shen
Presentation Layout
- Overview of Clustering
- Related Work of δ-Clusters
- δ-Clusters Model
- FLOC algorithm
Clustering
- Clustering: the process of grouping a set of objects into classes of similar objects
  - Similar to one another within the same cluster
  - Dissimilar to the objects in other clusters
Major Clustering Methods
- Partitioning algorithms
- Hierarchical algorithms
- Density-based
- Grid-based
- Model-based
Similarity
- Clustering: the process of grouping a set of objects into classes of similar objects
- But how should similarity be defined?
Similarity cont.
- Traditional clustering model: based on distance functions
- Some popular ones include the Minkowski distance:

  $$d(i, j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q \right)^{1/q}$$

  where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer
- But strong correlations may still exist among a set of objects even if they are far apart from each other as measured by the distance function
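The point about distance functions can be illustrated with a minimal sketch. The function name `minkowski` and the two sample vectors are assumptions made for illustration; they do not come from the slides.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimensionality")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


# Hypothetical objects: b follows exactly the same pattern as a,
# shifted upward by a constant 100 on every dimension.
a = [1, 5, 2, 8]
b = [101, 105, 102, 108]

manhattan = minkowski(a, b, 1)  # q = 1
euclidean = minkowski(a, b, 2)  # q = 2
print(manhattan, euclidean)
```

Both distances are large (400 and 200) although the two objects are perfectly correlated, which is the situation a distance-based model would miss.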
Similarity cont.
- δ-Clusters model: objects are similar when they exhibit a coherent pattern on a subset of dimensions
- Can cluster objects that show a shifting pattern or a scaling pattern
Similarity cont.
- Examples of coherent patterns:
  - Shifting pattern
  - Scaling pattern
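One way to write the two patterns down, as a sketch: the symbols $d_{1j}$ and $d_{2j}$ (values of two objects on attribute $j$ of the chosen subset $J$) and the constant $c$ are notation assumed here for illustration.

```latex
% Shifting pattern: object 2 equals object 1 plus a constant offset c
d_{2j} = d_{1j} + c \quad \text{for all } j \in J

% Scaling pattern: object 2 equals object 1 times a constant factor c
d_{2j} = c \cdot d_{1j} \quad \text{for all } j \in J
```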
Subspace Clustering
- From high-dimensional clustering (problematic) to subspace clustering
- Not restricted to a fixed ordering of columns, in contrast to patterns in time-series data
- Challenge: the curse of dimensionality!
Subspace Clustering cont.
- Example of subspace clustering

  Full data set:

           CH11   CH1B   CH1D   CH2I   CH2B
  CTFC3    4392    284   4108    280    228
  VPS8      401    281    120    275    298
  EFB1      318    280     37    277    215
  SSA1      401    292    109    580    238
  FUN14    2857    285   2576    271    226
  SP07      228    290     48    285    224
  MDM10     538    272    266    277    236
  CYS3      322    288     41    278    219

  Subspace cluster (rows VPS8, EFB1, CYS3; columns CH11, CH1D, CH2B):

           CH11   CH1D   CH2B
  VPS8      401    120    298
  EFB1      318     37    215
  CYS3      322     41    219
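The selected rows and columns in the example can be checked for a shifting pattern with a short sketch. The helper name `is_shifting_pattern` is hypothetical; the numbers are copied from the table.

```python
# The subspace cluster from the example: three genes, three conditions.
cluster = {
    "VPS8": [401, 120, 298],
    "EFB1": [318, 37, 215],
    "CYS3": [322, 41, 219],
}

# The same three genes over all five conditions of the full table.
full = {
    "VPS8": [401, 281, 120, 275, 298],
    "EFB1": [318, 280, 37, 277, 215],
    "CYS3": [322, 288, 41, 278, 219],
}


def is_shifting_pattern(rows):
    """True if every row differs from the first row by one constant offset."""
    names = list(rows)
    base = rows[names[0]]
    for name in names[1:]:
        offsets = {value - ref for ref, value in zip(base, rows[name])}
        if len(offsets) != 1:
            return False
    return True


print(is_shifting_pattern(cluster))  # coherent on the three chosen columns
print(is_shifting_pattern(full))     # not coherent on all five columns
```

On the three chosen columns EFB1 is VPS8 minus 83 and CYS3 is VPS8 minus 79, so the pattern only shows up once the right subset of columns is selected.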
Applications
- Microarray Data Analysis in Biology
- E-Commerce
Microarray Data Analysis
- Matrix (dense)
  - Rows: genes
  - Columns: various samples (experimental conditions or tissues)
- Values in the matrix: expression level, i.e. the relative abundance of the mRNA of a gene under a specific condition
Microarray Data Analysis cont.
- From scaling pattern to shifting pattern:

  $$d_{ij} = \log\left(\frac{\text{Red Intensity}}{\text{Green Intensity}}\right)$$

- Red: gene of interest; green: control gene
- Investigations show that several genes contribute to a disease, which motivates researchers to identify a subset of genes whose expression levels rise and fall coherently under a subset of conditions
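Why the logarithm turns a scaling pattern into a shifting pattern can be seen in a small sketch. The intensity values and the factor 3 are made up for illustration.

```python
import math

# Hypothetical intensity ratios of one gene under four conditions.
gene_a = [1.0, 4.0, 2.0, 8.0]
# A second gene whose values are a scaled copy of the first (factor 3).
gene_b = [3.0 * value for value in gene_a]

log_a = [math.log(value) for value in gene_a]
log_b = [math.log(value) for value in gene_b]

# After the log transform, the two genes differ by a constant offset,
# because log(3 * x) = log(3) + log(x).
offsets = [lb - la for la, lb in zip(log_a, log_b)]
print(offsets)
```

Every offset equals log(3) up to floating-point rounding, so a model that handles shifting patterns also covers scaling patterns on log-transformed data.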
E-Commerce
- Example: movie ratings (1: lowest rating, 10: highest rating)

             Movie1  Movie2  Movie3  Movie4
  Viewer1         1       2       3       6
  Viewer2         2       3       4       7
  Viewer3         4       5       6       9

- Shifting pattern: if a new movie is rated 7 by the 1st viewer and 9 by the 3rd viewer, the 2nd viewer will probably like it too
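The prediction in the movie example can be sketched with a simple offset heuristic. This is an illustrative assumption, not a method stated in the slides: `predict` is a hypothetical helper that averages the target viewer's offset to each viewer who has already rated the new movie.

```python
ratings = {
    "Viewer1": [1, 2, 3, 6],
    "Viewer2": [2, 3, 4, 7],
    "Viewer3": [4, 5, 6, 9],
}
# Ratings of the new movie given in the example.
new_movie = {"Viewer1": 7, "Viewer3": 9}


def predict(ratings, new_ratings, target):
    """Estimate target's rating from its mean offset to each known rater."""
    estimates = []
    for viewer, new_rating in new_ratings.items():
        offsets = [t - o for t, o in zip(ratings[target], ratings[viewer])]
        estimates.append(new_rating + sum(offsets) / len(offsets))
    return sum(estimates) / len(estimates)


prediction = predict(ratings, new_movie, "Viewer2")
print(prediction)  # 7.5
```

Viewer2 rates one point above Viewer1 and two points below Viewer3, giving estimates of 8 and 7; their average of 7.5 supports the claim that the 2nd viewer will probably like the movie.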
Presentation Layout
- Overview of Clustering
- Related Work of δ-Clusters
- δ-Clusters Model
- FLOC algorithm
Related Work
- CLIQUE, ORCLUS, PROCLUS (subspace clustering)
  - Can capture neither the shifting pattern nor the scaling pattern
- Bicluster model: proposed as a measure of the coherence of genes and conditions in a submatrix of a DNA array
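Bicluster
- Model: mean squared residue score of a submatrix:

  $$H(I, J) = \frac{1}{|I||J|} \sum_{i \in I,\, j \in J} \left( d_{ij} - d_{iJ} - d_{Ij} + d_{IJ} \right)^2$$

  where

  $$d_{iJ} = \frac{1}{|J|} \sum_{j \in J} d_{ij}, \qquad d_{Ij} = \frac{1}{|I|} \sum_{i \in I} d_{ij}, \qquad d_{IJ} = \frac{1}{|I||J|} \sum_{i \in I,\, j \in J} d_{ij}$$

- A submatrix $A_{IJ}$ is called a δ-bicluster if $H(I, J) \le \delta$
- Algorithm: a randomized algorithm that gives an approximate answer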
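The mean squared residue score can be computed directly from its definition. The following is a minimal sketch for a fully specified submatrix given as a list of rows; the function name and the test matrices are assumptions for illustration.

```python
def mean_squared_residue(matrix):
    """H(I, J) for a fully specified submatrix given as a list of rows."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(matrix[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(sum(row) for row in matrix) / (n_rows * n_cols)
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = matrix[i][j] - row_means[i] - col_means[j] + overall
            total += residue ** 2
    return total / (n_rows * n_cols)


# A perfect shifting pattern (the movie ratings) scores approximately 0.
coherent = [[1, 2, 3, 6], [2, 3, 4, 7], [4, 5, 6, 9]]
# A hypothetical submatrix whose rows move in opposite directions.
incoherent = [[1, 2], [2, 1]]

print(mean_squared_residue(coherent))
print(mean_squared_residue(incoherent))  # 0.25
```

A submatrix would then qualify as a δ-bicluster when the returned score does not exceed the chosen δ.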
Weakness of bicluster
- Missing values
- Constraints
Presentation Layout
- Overview
- Related Work
- δ-Clusters Model
- FLOC algorithm
Occupancy Threshold
- α: a parameter that controls the percentage of missing values in a submatrix
- $|J'_i|$ is the number of specified attributes for object i in the δ-Cluster
- $|J|$ is the number of attributes in the δ-Cluster
- Occupancy of object i:

  $$\frac{|J'_i|}{|J|}$$
Occupancy Threshold cont.
- A similar occupancy threshold applies to each attribute j in the δ-Cluster
- Example: α = 0.6
  [Figure: example matrices with missing entries; the cell layout was lost in extraction]
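An occupancy check can be sketched as follows. The matrix is hypothetical (it is not the one from the slide), `None` marks a missing value, and the sketch assumes the threshold is inclusive, i.e. a row or column passes when its fraction of specified entries is at least α.

```python
def satisfies_occupancy(matrix, alpha):
    """Check the occupancy threshold for every row and every column.

    Assumption: a ratio equal to alpha is accepted (inclusive threshold).
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    for row in matrix:
        specified = sum(value is not None for value in row)
        if specified / n_cols < alpha:
            return False
    for j in range(n_cols):
        specified = sum(matrix[i][j] is not None for i in range(n_rows))
        if specified / n_rows < alpha:
            return False
    return True


# Hypothetical 3 x 3 submatrix: each row and column has 2 of 3 entries.
example = [
    [1, None, 3],
    [None, 4, 5],
    [3, 4, None],
]

print(satisfies_occupancy(example, 0.6))  # True, since 2/3 >= 0.6
print(satisfies_occupancy(example, 0.7))  # False, since 2/3 < 0.7
```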
Volume
- The volume of a δ-Cluster (I, J) is the number of specified entries $d_{ij}$ in (I, J)
- Example: the volume is 3 × 3 = 9

  1  3  3
  3  4  5
  3  4  4
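Counting the volume is a one-liner once missing values have a marker. This sketch assumes `None` marks a missing entry; the second matrix is a made-up variant with one value removed.

```python
def volume(matrix):
    """Number of specified (non-missing) entries in the submatrix."""
    return sum(value is not None for row in matrix for value in row)


# The fully specified example from the slide.
full_example = [[1, 3, 3], [3, 4, 5], [3, 4, 4]]
# Hypothetical variant with one missing entry.
with_missing = [[1, 3, 3], [3, None, 5], [3, 4, 4]]

print(volume(full_example))  # 9
print(volume(with_missing))  # 8
```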
Base
- Object base:

  $$d_{iJ} = \frac{\sum_{j \in J'_i} d_{ij}}{|J'_i|}$$

- Attribute base:

  $$d_{Ij} = \frac{\sum_{i \in I'_j} d_{ij}}{|I'_j|}$$
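Base cont.
- δ-Cluster base, where $v_{IJ}$ is the volume of the δ-Cluster:

  $$d_{IJ} = \frac{\sum_{i \in I,\, j \in J} d_{ij}}{v_{IJ}}$$

- For a perfect δ-Cluster:

  $$d_{ij} = d_{iJ} + d_{Ij} - d_{IJ}$$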
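The three bases and the perfect-cluster condition can be sketched together. The function names are hypothetical, `None` marks a missing value, and the sketch assumes every row and column has at least one specified entry (otherwise the division would fail).

```python
def bases(matrix):
    """Return (object bases, attribute bases, cluster base), skipping None."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    object_bases = []
    for row in matrix:
        values = [v for v in row if v is not None]
        object_bases.append(sum(values) / len(values))
    attribute_bases = []
    for j in range(n_cols):
        values = [matrix[i][j] for i in range(n_rows)
                  if matrix[i][j] is not None]
        attribute_bases.append(sum(values) / len(values))
    everything = [v for row in matrix for v in row if v is not None]
    cluster_base = sum(everything) / len(everything)
    return object_bases, attribute_bases, cluster_base


def is_perfect(matrix, tol=1e-9):
    """True if d_ij = d_iJ + d_Ij - d_IJ holds for every specified entry."""
    object_bases, attribute_bases, cluster_base = bases(matrix)
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is None:
                continue
            expected = object_bases[i] + attribute_bases[j] - cluster_base
            if abs(value - expected) > tol:
                return False
    return True


ratings = [[1, 2, 3, 6], [2, 3, 4, 7], [4, 5, 6, 9]]
object_bases, attribute_bases, cluster_base = bases(ratings)
print(object_bases)         # [3.0, 4.0, 6.0]
print(is_perfect(ratings))  # True
```

The movie ratings form a perfect cluster because each viewer's row is a shifted copy of the others.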
Residue
- Entry residue:

  $$r_{ij} = \begin{cases} d_{ij} - d_{iJ} - d_{Ij} + d_{IJ} & \text{if } d_{ij} \text{ is specified} \\ 0 & \text{otherwise} \end{cases}$$
Residue cont.
- δ-Cluster residue:

  $$r_{IJ} = \frac{\sum_{i \in I,\, j \in J} |r_{ij}|}{v_{IJ}}$$

- A δ-Cluster is an r-residue δ-Cluster if its residue is less than or equal to r
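The cluster residue can be sketched as the mean absolute entry residue over the specified entries. The use of the absolute value, the `None` marker for missing values, and the function names are assumptions of this sketch; it also assumes every row and column has at least one specified entry.

```python
def mean(values):
    return sum(values) / len(values)


def cluster_residue(matrix):
    """Mean absolute entry residue over the specified entries."""
    cells = [(i, j) for i, row in enumerate(matrix)
             for j, value in enumerate(row) if value is not None]
    row_values, col_values = {}, {}
    for i, j in cells:
        row_values.setdefault(i, []).append(matrix[i][j])
        col_values.setdefault(j, []).append(matrix[i][j])
    cluster_base = mean([matrix[i][j] for i, j in cells])
    residues = [
        abs(matrix[i][j] - mean(row_values[i]) - mean(col_values[j])
            + cluster_base)
        for i, j in cells
    ]
    # Dividing by the number of specified cells is dividing by the volume.
    return sum(residues) / len(cells)


def is_r_residue_cluster(matrix, r):
    return cluster_residue(matrix) <= r


ratings = [[1, 2, 3, 6], [2, 3, 4, 7], [4, 5, 6, 9]]
print(cluster_residue(ratings))           # approximately 0
print(cluster_residue([[1, 2], [2, 1]]))  # 0.5
```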
Presentation Layout
- Overview of Clustering
- Related Work of δ-Clusters
- δ-Clusters Model
- FLOC algorithm (Flexible Overlapping Clustering)
Flow Chart
1. Generate initial clusters
2. Determine the best action for each row and each column
3. Perform the best actions sequentially
4. Improved?
   - Y: go back to step 2
   - N: stop
Initial Cluster
- Randomly generate k initial clusters
- Different parameters produce clusters of different sizes
Choose best actions
- For every object or attribute, there are k candidate actions, one per cluster
- Choose the best action among the k candidates according to gain
- Gain is the difference between the original residue and the residue that would result if the action were performed on the cluster
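The gain of a single action can be sketched as below. It assumes an action toggles the membership of one row in a cluster (a column action would be symmetric), represents a cluster as a set of row indices plus a set of column indices, and uses the mean absolute residue; the names and the data are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def residue(data, rows, cols):
    """Mean absolute residue of the cluster (rows, cols); None is missing."""
    cells = [(i, j) for i in rows for j in cols if data[i][j] is not None]
    if not cells:
        return 0.0
    row_values, col_values = {}, {}
    for i, j in cells:
        row_values.setdefault(i, []).append(data[i][j])
        col_values.setdefault(j, []).append(data[i][j])
    base = mean([data[i][j] for i, j in cells])
    return mean([abs(data[i][j] - mean(row_values[i]) - mean(col_values[j])
                     + base) for i, j in cells])


def row_gain(data, rows, cols, row):
    """Residue before minus residue after toggling `row` in the cluster."""
    toggled = rows ^ {row}  # add the row if absent, remove it if present
    return residue(data, rows, cols) - residue(data, toggled, cols)


# Rows 0-2 are the movie ratings; row 3 is a made-up incoherent row.
data = [[1, 2, 3, 6], [2, 3, 4, 7], [4, 5, 6, 9], [9, 1, 8, 2]]
cols = {0, 1, 2, 3}

print(row_gain(data, {0, 1, 3}, cols, 3))  # removing the noisy row: 2.0
print(row_gain(data, {0, 1}, cols, 3))     # adding the noisy row: -2.0
```

A positive gain means the action lowers the residue, so among the k candidate clusters the action with the largest gain would be chosen.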
Choose Best Actions cont.
- Even if the gain is negative, the action is sometimes still performed in order to have a chance of reaching the global optimum
Do the actions sequentially
- Generate the action sequence:
  1) the same order in all iterations
  2) random order
  3) weighted random order
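The three ordering strategies can be sketched as follows. The weighting scheme is an assumption made for illustration: each action's weight is its gain shifted to be positive, so higher-gain actions tend to appear earlier. The action labels and gains are made up.

```python
import random


def fixed_order(actions):
    """1) The same order in every iteration."""
    return list(actions)


def random_order(actions, rng):
    """2) A uniformly random permutation."""
    order = list(actions)
    rng.shuffle(order)
    return order


def weighted_random_order(actions, gains, rng):
    """3) Sample without replacement, favoring actions with higher gain."""
    remaining = list(actions)
    if not remaining:
        return []
    floor = min(gains[a] for a in remaining)
    order = []
    while remaining:
        # Assumed weighting: shift gains so every weight is at least 1.
        weights = [gains[a] - floor + 1.0 for a in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        remaining.remove(pick)
        order.append(pick)
    return order


actions = ["row0", "row1", "col0", "col1"]
gains = {"row0": 0.5, "row1": -0.2, "col0": 2.0, "col1": 0.0}
rng = random.Random(0)  # seeded only to make the run repeatable

print(fixed_order(actions))
print(random_order(actions, rng))
print(weighted_random_order(actions, gains, rng))
```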
Output the Best cluster
- After some iterations, when the minimum residue no longer improves, the algorithm stops and the k best clusters are output
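The overall loop can be put together in a simplified sketch. It is not the authors' implementation: it uses a fixed action order, takes an action to be a membership toggle, scores a clustering by the mean of the cluster residues, and adds a hypothetical `min_size` guard so that clusters cannot collapse to a single row or column.

```python
def mean(values):
    return sum(values) / len(values)


def residue(data, rows, cols):
    """Mean absolute residue of the cluster (rows, cols); None is missing."""
    cells = [(i, j) for i in rows for j in cols if data[i][j] is not None]
    if not cells:
        return 0.0
    row_values, col_values = {}, {}
    for i, j in cells:
        row_values.setdefault(i, []).append(data[i][j])
        col_values.setdefault(j, []).append(data[i][j])
    base = mean([data[i][j] for i, j in cells])
    return mean([abs(data[i][j] - mean(row_values[i]) - mean(col_values[j])
                     + base) for i, j in cells])


def toggle(cluster, kind, index):
    rows, cols = cluster
    if kind == "row":
        return (rows ^ {index}, cols)
    return (rows, cols ^ {index})


def mean_residue(data, clusters):
    return mean([residue(data, rows, cols) for rows, cols in clusters])


def floc_sketch(data, clusters, min_size=2, max_iter=20):
    items = [("row", i) for i in range(len(data))]
    items += [("col", j) for j in range(len(data[0]))]
    best, best_score = list(clusters), mean_residue(data, clusters)
    for _ in range(max_iter):
        current, round_best, round_score = list(best), None, best_score
        for kind, index in items:
            candidates = []
            for c, cluster in enumerate(current):
                changed = toggle(cluster, kind, index)
                if len(changed[0]) < min_size or len(changed[1]) < min_size:
                    continue  # blocked by the hypothetical size guard
                gain = residue(data, *cluster) - residue(data, *changed)
                candidates.append((gain, c, changed))
            if not candidates:
                continue
            # Perform the best action, even when its gain is negative.
            _, c, changed = max(candidates, key=lambda t: t[0])
            current[c] = changed
            score = mean_residue(data, current)
            if score < round_score:
                round_best, round_score = list(current), score
        if round_best is None:  # no improvement in this iteration: stop
            break
        best, best_score = round_best, round_score
    return best, best_score


# Rows 0-2 are the movie ratings; row 3 is a made-up incoherent row.
data = [[1, 2, 3, 6], [2, 3, 4, 7], [4, 5, 6, 9], [9, 1, 8, 2]]
initial = [({0, 1, 3}, {0, 1, 2, 3})]
clusters, score = floc_sketch(data, initial)
print(clusters, score)
```

On this toy input the loop drops the incoherent row and stops once an iteration brings no further improvement; with k greater than one, each row or column would pick the cluster where its toggle yields the largest gain.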
End
Thank you!