δ-Clusters: Capturing Subspace Correlation in a Large Data Set

Data Mining: Concepts and Techniques

Authors: Jiong Yang, Wei Wang, et al. (ICDE 2002)

Presenter: Xuehua Shen [email protected]



Page 2: Presentation Layout

Overview of Clustering

Related Work of δ-Clusters

δ-Clusters Model

FLOC Algorithm

Page 3: Clustering

Clustering: the process of grouping a set of objects into classes of similar objects.

Objects are similar to one another within the same cluster.

Objects are dissimilar to the objects in other clusters.

Page 4: Major Clustering Methods

Partitioning algorithms

Hierarchical algorithms

Density-based

Grid-based

Model-based

Page 5: Similarity

Clustering: the process of grouping a set of objects into classes of similar objects.

But how should similarity be defined?

Page 6: Similarity (cont.)

Traditional clustering model: based on distance functions.

Some popular ones include the Minkowski distance:

$$d(i, j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q}$$

where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer.

But strong correlations may still exist among a set of objects even if they are far apart from each other as measured by the distance function.
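The last point can be illustrated with a small sketch. The two vectors below are hypothetical (not from the slides): they follow exactly the same pattern, shifted by a constant, yet any Minkowski distance between them is large.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimensionality")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


# Hypothetical objects: b is a shifted by a constant offset of 100.
a = [1, 5, 3, 8]
b = [101, 105, 103, 108]

manhattan = minkowski(a, b, 1)  # q = 1
euclidean = minkowski(a, b, 2)  # q = 2

# The per-dimension differences are all identical, so the objects are
# perfectly correlated even though both distances are large.
differences = [y - x for x, y in zip(a, b)]
print(manhattan, euclidean, differences)
```

A distance-based method would likely place `a` and `b` in different clusters, which is the gap the δ-cluster model is meant to address.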

Page 7: Similarity (cont.)

δ-Clusters model: objects are similar when they exhibit a coherent pattern on a subset of dimensions.

The model can cluster objects that show a shifting pattern or a scaling pattern.

Page 8: Similarity (cont.)

Examples of coherent patterns (figures): shifting pattern, scaling pattern.

Page 9: Subspace Clustering

From clustering in the full high-dimensional space (problematic) to subspace clustering.

In contrast to patterns in time-series data, the model is not restricted to a fixed ordering of columns.

Challenge: the curse of dimensionality!

Page 10: Subspace Clustering (cont.)

Example of subspace clustering

| Gene | CH11 | CH1B | CH1D | CH2I | CH2B |
|---|---|---|---|---|---|
| CTFC3 | 4392 | 284 | 4108 | 280 | 228 |
| VPS8 | 401 | 281 | 120 | 275 | 298 |
| EFB1 | 318 | 280 | 37 | 277 | 215 |
| SSA1 | 401 | 292 | 109 | 580 | 238 |
| FUN14 | 2857 | 285 | 2576 | 271 | 226 |
| SP07 | 228 | 290 | 48 | 285 | 224 |
| MDM10 | 538 | 272 | 266 | 277 | 236 |
| CYS3 | 322 | 288 | 41 | 278 | 219 |

Submatrix on genes VPS8, EFB1, CYS3 and conditions CH11, CH1D, CH2B:

| Gene | CH11 | CH1D | CH2B |
|---|---|---|---|
| VPS8 | 401 | 120 | 298 |
| EFB1 | 318 | 37 | 215 |
| CYS3 | 322 | 41 | 219 |
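A quick check makes the coherence in this example concrete. The sketch below uses the three rows from the table; the helper name `is_shifting_pattern` and the column indices are illustrative assumptions, not part of the original slides.

```python
# Full rows for the three genes, in the column order of the slide's table.
full = {
    "VPS8": [401, 281, 120, 275, 298],
    "EFB1": [318, 280, 37, 277, 215],
    "CYS3": [322, 288, 41, 278, 219],
}


def is_shifting_pattern(rows, tol=0):
    """True if every row equals the first row plus a per-row constant."""
    values = list(rows.values())
    ref = values[0]
    for row in values[1:]:
        diffs = [x - r for x, r in zip(row, ref)]
        if max(diffs) - min(diffs) > tol:
            return False
    return True


# Keep only the 1st, 3rd and 5th conditions (the subspace shown in the slide).
subspace = [0, 2, 4]
sub = {gene: [row[c] for c in subspace] for gene, row in full.items()}

# Offset of each gene relative to EFB1 on the selected conditions.
offsets = {gene: row[0] - sub["EFB1"][0] for gene, row in sub.items()}

print(is_shifting_pattern(full), is_shifting_pattern(sub), offsets)
```

On all five conditions the three genes do not shift together, but on the selected subspace VPS8 and CYS3 are exactly EFB1 plus 83 and plus 4 respectively.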

Page 11: Applications

Microarray data analysis in biology

E-commerce

Page 12: Microarray Data Analysis

Matrix (dense)

Rows: genes

Columns: various samples (experiment conditions or tissues)

Values in the matrix: expression level, the relative abundance of the mRNA of a gene under a specific condition

Page 13: Microarray Data Analysis (cont.)

From scaling pattern to shifting pattern:

$$d_{ij} = \log\left(\frac{\text{Red Intensity}}{\text{Green Intensity}}\right)$$

Red: gene of interest; green: control gene.

Investigations show that several genes contribute to a disease, which motivates researchers to identify a subset of genes whose expression levels rise and fall coherently under a subset of conditions.
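Why the logarithm turns a scaling pattern into a shifting pattern can be written out in one line. In this sketch, $R_{ij}$ and $G_{ij}$ denote the red and green intensities of gene $i$ under condition $j$, and $s$ is a constant scaling factor between genes $i$ and $k$; this notation is an assumption introduced here, not taken from the slides.

```latex
% Assumption: the intensity ratios of genes i and k differ by a constant
% factor s > 0 on every condition j in the subset J (a scaling pattern).
\frac{R_{ij}}{G_{ij}} = s \cdot \frac{R_{kj}}{G_{kj}} \quad \forall j \in J
\;\Longrightarrow\;
d_{ij} = \log\frac{R_{ij}}{G_{ij}}
       = \log s + \log\frac{R_{kj}}{G_{kj}}
       = d_{kj} + \log s
```

After the transform the two genes differ by the constant offset $\log s$ on every condition in $J$, which is a shifting pattern.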

Page 14: E-Commerce

Example: ratings of movies (1: lowest rating, 10: highest rating)

| | Movie1 | Movie2 | Movie3 | Movie4 |
|---|---|---|---|---|
| Viewer1 | 1 | 2 | 3 | 6 |
| Viewer2 | 2 | 3 | 4 | 7 |
| Viewer3 | 4 | 5 | 6 | 9 |

Shifting pattern: if, for a new movie, Viewer1 gives a rating of 7 and Viewer3 gives a rating of 9, Viewer2 will probably like this movie too.
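One way to turn this intuition into a number is to estimate Viewer2's rating from each known viewer's rating plus the average offset between them. This is only an illustrative sketch; the slides do not prescribe a prediction rule, and the function names are assumptions.

```python
# Ratings from the slide's table (Movie1..Movie4).
ratings = {
    "Viewer1": [1, 2, 3, 6],
    "Viewer2": [2, 3, 4, 7],
    "Viewer3": [4, 5, 6, 9],
}


def mean_offset(target, other):
    """Average amount by which `target` rates above `other` on shared movies."""
    diffs = [t - o for t, o in zip(ratings[target], ratings[other])]
    return sum(diffs) / len(diffs)


def predict(target, new_ratings):
    """Average the per-viewer estimates: known rating plus mean offset."""
    estimates = [r + mean_offset(target, v) for v, r in new_ratings.items()]
    return sum(estimates) / len(estimates)


# New movie: Viewer1 rates it 7, Viewer3 rates it 9.
prediction = predict("Viewer2", {"Viewer1": 7, "Viewer3": 9})
print(prediction)
```

Viewer2 rates one point above Viewer1 and two points below Viewer3 on every known movie, so the two estimates are 8 and 7, and the averaged prediction is 7.5.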


Page 16: Related Work

CLIQUE, ORCLUS, PROCLUS (subspace clustering)

These can capture neither the shifting pattern nor the scaling pattern.

Bicluster model: proposed as a measure of the coherence of genes and conditions in a submatrix of a DNA array.

Page 17: Bicluster

Model: mean squared residue score of a submatrix.

$$d_{iJ} = \frac{1}{|J|}\sum_{j \in J} d_{ij}, \qquad d_{Ij} = \frac{1}{|I|}\sum_{i \in I} d_{ij}, \qquad d_{IJ} = \frac{1}{|I||J|}\sum_{i \in I,\, j \in J} d_{ij}$$

$$H(I, J) = \frac{1}{|I||J|}\sum_{i \in I,\, j \in J} \left(d_{ij} - d_{iJ} - d_{Ij} + d_{IJ}\right)^2$$

A submatrix $A_{IJ}$ is called a δ-bicluster if $H(I, J) \le \delta$.

Algorithm: a randomized algorithm that gives an approximate answer.
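The mean squared residue score can be computed directly from its definition. The sketch below assumes the submatrix has already been selected and is fully specified (no missing values); the function name and the `noisy` matrix are illustrative.

```python
def mean_squared_residue(d):
    """H(I, J) for a fully specified submatrix given as a list of rows."""
    n_rows, n_cols = len(d), len(d[0])
    row_mean = [sum(row) / n_cols for row in d]                    # d_iJ
    col_mean = [sum(d[i][j] for i in range(n_rows)) / n_rows
                for j in range(n_cols)]                            # d_Ij
    all_mean = sum(row_mean) / n_rows                              # d_IJ
    total = sum(
        (d[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
        for i in range(n_rows)
        for j in range(n_cols)
    )
    return total / (n_rows * n_cols)


# A perfectly shifting submatrix versus a hypothetical noisy one.
coherent = [[401, 120, 298], [318, 37, 215], [322, 41, 219]]
noisy = [[1, 2, 3], [2, 3, 4], [4, 5, 9]]

print(mean_squared_residue(coherent), mean_squared_residue(noisy))
```

A perfectly shifting submatrix scores zero, so it qualifies as a δ-bicluster for any δ ≥ 0, while the noisy matrix scores strictly above zero.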

Page 18: Weakness of Bicluster

Missing values

Constraints


Page 20: Occupancy Threshold

α: a parameter to control the percentage of missing values in a submatrix. For every object i in the δ-cluster:

$$\frac{|J'_i|}{|J|} \ge \alpha$$

$|J'_i|$ is the number of specified attributes for object i in the δ-cluster.

$|J|$ is the number of attributes in the δ-cluster.

Page 21: Occupancy Threshold (cont.)

A similar occupancy threshold applies to each attribute j in the δ-cluster.

Example: α = 0.6

[Figure: example matrices with and without unspecified entries]
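The occupancy condition can be sketched as a simple check over rows and columns. The matrices below are hypothetical (the slide's own example figure is not reproduced here), `None` marks an unspecified entry, and the threshold is written `alpha`.

```python
def has_occupancy(matrix, alpha):
    """True if every row and every column has at least an `alpha`
    fraction of specified (non-None) entries."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    for row in matrix:
        specified = sum(v is not None for v in row)
        if specified / n_cols < alpha:
            return False
    for j in range(n_cols):
        specified = sum(matrix[i][j] is not None for i in range(n_rows))
        if specified / n_rows < alpha:
            return False
    return True


# Hypothetical example: each row and column has 2 of 3 entries specified.
balanced = [[1, None, 3], [4, 5, None], [None, 4, 4]]
# Hypothetical example: the first row has only 1 of 3 entries specified.
sparse_row = [[1, None, None], [4, 5, 3], [3, 4, 4]]

print(has_occupancy(balanced, 0.6), has_occupancy(sparse_row, 0.6))
```

With a threshold of 0.6, a 3-attribute cluster needs at least two specified values per object, so the first matrix passes and the second does not.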

Page 22: Volume

The volume of a δ-cluster (I, J) is the number of specified entries $d_{ij}$ in (I, J).

Example: the volume is 3 × 3 = 9.

$$\begin{pmatrix} 1 & 3 & 3 \\ 3 & 4 & 5 \\ 3 & 4 & 4 \end{pmatrix}$$

Page 23: Base

Object base:

$$d_{iJ} = \frac{\sum_{j \in J'_i} d_{ij}}{|J'_i|}$$

Attribute base:

$$d_{Ij} = \frac{\sum_{i \in I'_j} d_{ij}}{|I'_j|}$$

Page 24: Base (cont.)

δ-cluster base:

$$d_{IJ} = \frac{\sum_{i \in I,\, j \in J} d_{ij}}{v_{IJ}}$$

For a perfect δ-cluster:

$$d_{ij} = d_{iJ} + d_{Ij} - d_{IJ}$$
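The three bases are plain averages over the specified entries only. A minimal sketch follows, assuming `None` marks an unspecified entry and that every row and column has at least one specified value; the matrix is hypothetical and the function name is illustrative.

```python
def bases(matrix):
    """Return (object_bases, attribute_bases, cluster_base, volume),
    averaging over specified (non-None) entries only."""
    n_rows, n_cols = len(matrix), len(matrix[0])

    def mean(values):
        known = [v for v in values if v is not None]
        return sum(known) / len(known)

    object_bases = [mean(row) for row in matrix]                    # d_iJ
    attribute_bases = [mean([matrix[i][j] for i in range(n_rows)])
                       for j in range(n_cols)]                      # d_Ij
    specified = [v for row in matrix for v in row if v is not None]
    volume = len(specified)                                         # v_IJ
    cluster_base = sum(specified) / volume                          # d_IJ
    return object_bases, attribute_bases, cluster_base, volume


# Hypothetical delta-cluster with two unspecified entries.
cluster = [[1, None, 3], [3, 4, 5], [3, 4, None]]
object_bases, attribute_bases, cluster_base, volume = bases(cluster)
print(object_bases, attribute_bases, cluster_base, volume)
```

Dividing by the number of specified entries rather than by the full row or column length is what lets the model tolerate missing values.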

Page 25: Residue

Entry residue:

$$r_{ij} = \begin{cases} d_{ij} - d_{iJ} - d_{Ij} + d_{IJ} & \text{if } d_{ij} \text{ is specified} \\ 0 & \text{otherwise} \end{cases}$$

Page 26: Residue (cont.)

δ-cluster residue:

$$r_{IJ} = \frac{\sum_{i \in I,\, j \in J} |r_{ij}|}{v_{IJ}}$$

A δ-cluster is an r-residue δ-cluster if its residue is equal to or smaller than r.
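Putting the bases and the entry residue together gives the cluster residue. The sketch below assumes the residue averages the absolute entry residues over the volume and that `None` marks an unspecified entry; function names and the `perturbed` matrix are illustrative.

```python
def cluster_residue(matrix):
    """Mean absolute entry residue of a delta-cluster (None = missing)."""
    n_rows, n_cols = len(matrix), len(matrix[0])

    def mean(values):
        known = [v for v in values if v is not None]
        return sum(known) / len(known)

    object_base = [mean(row) for row in matrix]
    attribute_base = [mean([row[j] for row in matrix]) for j in range(n_cols)]
    cluster_base = mean([v for row in matrix for v in row])

    total, volume = 0.0, 0
    for i in range(n_rows):
        for j in range(n_cols):
            d = matrix[i][j]
            if d is None:
                continue  # unspecified: residue 0, not counted in the volume
            total += abs(d - object_base[i] - attribute_base[j] + cluster_base)
            volume += 1
    return total / volume


def is_r_residue_cluster(matrix, r):
    """True if the cluster residue does not exceed the threshold r."""
    return cluster_residue(matrix) <= r


# Perfect shifting pattern versus the same matrix with one entry changed.
shifting = [[1, 2, 3, 6], [2, 3, 4, 7], [4, 5, 6, 9]]
perturbed = [[1, 2, 3, 6], [2, 3, 4, 7], [4, 5, 6, 1]]
print(cluster_residue(shifting), cluster_residue(perturbed))
```

The perfectly shifting matrix has a residue of zero up to floating-point rounding, while changing a single entry raises the residue for the whole cluster.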

Page 27: Presentation Layout

Overview of Clustering

Related Work of δ-Clusters

δ-Clusters Model

FLOC Algorithm (Flexible Overlapping Clustering)

Page 28: Flow Chart

1) Generate the initial clusters.

2) Determine the best action for each row and each column.

3) Perform the best actions sequentially.

4) Improved? If yes (Y), return to step 2; if no (N), stop.

Page 29: Initial Clusters

Randomly generate k initial clusters.

Different parameter settings produce initial clusters of different sizes.

Page 30: Choose Best Actions

For every object or attribute there are k possible actions, one for each of the k clusters.

Choose the best action among the k candidates according to gain.

Gain is the difference between the original residue and the residue the cluster would have if the action were performed.
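The gain computation for a single object can be sketched as follows. This is a simplified illustration under several assumptions: the data has no missing values, an action toggles the object's membership in a cluster, gain is the plain residue difference described above, and all names (`residue`, `row_action_gain`, `best_row_action`) are hypothetical.

```python
def residue(data, rows, cols):
    """Mean absolute residue of the submatrix data[rows][cols]."""
    row_base = {i: sum(data[i][j] for j in cols) / len(cols) for i in rows}
    col_base = {j: sum(data[i][j] for i in rows) / len(rows) for j in cols}
    base = sum(row_base.values()) / len(rows)
    total = sum(abs(data[i][j] - row_base[i] - col_base[j] + base)
                for i in rows for j in cols)
    return total / (len(rows) * len(cols))


def row_action_gain(data, cluster, row):
    """Gain of toggling `row` in `cluster`: residue before minus after."""
    rows, cols = cluster
    new_rows = rows ^ {row}  # insert the row if absent, remove it if present
    if not new_rows:
        return float("-inf")  # never empty a cluster in this sketch
    return residue(data, rows, cols) - residue(data, new_rows, cols)


def best_row_action(data, clusters, row):
    """Index of the cluster whose action on `row` has the largest gain."""
    gains = [row_action_gain(data, c, row) for c in clusters]
    best = max(range(len(clusters)), key=gains.__getitem__)
    return best, gains[best]


# Rows 0-2 follow a shifting pattern; row 3 is noise.
data = [[1, 2, 3, 6], [2, 3, 4, 7], [4, 5, 6, 9], [9, 1, 7, 2]]
all_cols = {0, 1, 2, 3}
clusters = [({0, 1, 2, 3}, all_cols), ({0, 1}, all_cols)]

best, gain = best_row_action(data, clusters, 3)
print(best, gain)
```

Removing the noisy row from the first cluster lowers its residue (positive gain), whereas inserting it into the second cluster would raise the residue (negative gain), so the first action is chosen.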

Page 31: Choose Best Actions (cont.)

Even when the gain is negative, the action is sometimes still performed in order to reach the global optimum.

Page 32: Do the Actions Sequentially

Generate the action sequence in one of three ways:

1) the same order in all iterations

2) a random order

3) a weighted random order
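The three orderings might be sketched as below. The weighting scheme is an assumption made for illustration: gains are shifted to be positive and used as weights so that actions with larger gain tend to come earlier; the slides do not specify how the weights are chosen, and the action labels are hypothetical.

```python
import random


def fixed_order(actions):
    """1) The same order in every iteration."""
    return list(actions)


def random_order(actions, rng):
    """2) A uniformly random permutation of the actions."""
    order = list(actions)
    rng.shuffle(order)
    return order


def weighted_random_order(actions, gains, rng):
    """3) A random permutation biased so that higher-gain actions tend to
    come first (weighted sampling without replacement via random keys)."""
    low = min(gains)
    weights = [g - low + 1e-6 for g in gains]  # assumption: shift to > 0
    keys = [rng.random() ** (1.0 / w) for w in weights]
    ranked = sorted(zip(keys, range(len(actions))), reverse=True)
    return [actions[i] for _, i in ranked]


# Hypothetical action labels and gains.
actions = ["row0", "row1", "col0", "col1"]
gains = [0.5, -0.2, 1.3, 0.0]
rng = random.Random(0)  # seeded only to make the sketch repeatable

print(fixed_order(actions))
print(random_order(actions, rng))
print(weighted_random_order(actions, gains, rng))
```

Each function returns a permutation of the same actions; only the order in which they would be performed differs.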

Page 33: Output the Best Clusters

When further iterations no longer improve the minimum residue, the algorithm stops and outputs the k best clusters.

Page 34: End

Thank you!