Intelligent Database Systems Lab, N.Y.U.S.T. I. M.

Determining the best K for clustering transactional datasets – A coverage density-based approach

Presenter: Lin, Shu-Han

Authors: Hua Yan, Keke Chen, Ling Liu, Joonsoo Bae

Data & Knowledge Engineering (DKE) 68 (2009) 28–48

Outline

Motivation

Objective

Methodology

Experiments

Conclusion

Comments

Motivation

Cluster transactional datasets, a special kind of categorical data

Time complexity: O(dmN^2 log N)

Name   Buy
Jane   Coke, Milk
Mary   Coke, Pepsi
Tom    Milk, Water
Denny  Milk, Juice
Tina   Juice, Red Wine, Pepsi

Boolean values

Name   Coke  Milk  Pepsi  Water  Juice  Red Wine
Jane   1     1     0      0      0      0
Mary   1     0     1      0      0      0
Tom    0     1     0      1      0      0
Denny  0     1     0      0      1      0
Tina   0     0     1      0      1      1
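The conversion from purchase lists to Boolean values on this slide can be sketched in a few lines of Python. This is only an illustration: the helper name `to_boolean_rows` and the variable names are hypothetical, and the data is copied from the slide's example.

```python
# Purchases from the slide's example, keyed by customer name.
transactions = {
    "Jane": ["Coke", "Milk"],
    "Mary": ["Coke", "Pepsi"],
    "Tom": ["Milk", "Water"],
    "Denny": ["Milk", "Juice"],
    "Tina": ["Juice", "Red Wine", "Pepsi"],
}

# Column order used on the slide.
items = ["Coke", "Milk", "Pepsi", "Water", "Juice", "Red Wine"]


def to_boolean_rows(transactions, items):
    """Hypothetical helper: map each customer to a 0/1 row, one column per item."""
    rows = {}
    for name, bought in transactions.items():
        basket = set(bought)
        rows[name] = [1 if item in basket else 0 for item in items]
    return rows


boolean_rows = to_boolean_rows(transactions, items)

for name, row in boolean_rows.items():
    print(name, *row)
```

Each row is mostly zeros even in this tiny example, which is one reason transactional data is treated as a special case of categorical data.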

Objectives

Design a method, ACTD (Agglomerative Clustering algorithm with Transactional-cluster-modes Dissimilarity), specifically for transactional data

Instead of ACE (Agglomerative Categorical clustering with Entropy criterion)

Find the best K

More efficiently

Methodology – Overview of SCALE

(Sampling, Clustering structure Assessment, cLustering & domain-specific Evaluation)

[Figure: overview of the SCALE framework, with the labels "Agglomerative", "ACE", "ACTD", "BKPlot" and "DMDI"]

Methodology – ACTD: Intra-cluster similarity

Coverage Density

7 / (N_k × M_k) = 7 / (3 × 3) = 7/9

Transactional-cluster-mode: a subset of items

Item frequencies: c = 1, b = 2/3, a = 2/3; threshold = 0.8

In this case, only c is in the transactional-cluster-mode
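The numbers on this slide can be reproduced with a short Python sketch. It rests on assumptions that the transcript does not state outright: that coverage density is the total number of item occurrences divided by N_k × M_k, and that the transactional-cluster-mode is the set of items whose relative frequency reaches a threshold (0.8 here). The three-transaction cluster and all function names are hypothetical, chosen only so that a and b each occur twice and c three times.

```python
from collections import Counter
from fractions import Fraction


def item_counts(cluster):
    """Count how many transactions of the cluster contain each item."""
    return Counter(item for transaction in cluster for item in set(transaction))


def coverage_density(cluster):
    """Assumed definition: total item occurrences / (N_k * M_k)."""
    counts = item_counts(cluster)
    n_k = len(cluster)  # number of transactions in the cluster
    m_k = len(counts)   # number of distinct items in the cluster
    return Fraction(sum(counts.values()), n_k * m_k)


def cluster_mode(cluster, threshold):
    """Assumed rule: keep items whose relative frequency is >= threshold."""
    counts = item_counts(cluster)
    n_k = len(cluster)
    return {item for item, c in counts.items() if Fraction(c, n_k) >= threshold}


# Hypothetical cluster: a and b appear in 2 of 3 transactions, c in all 3.
example_cluster = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}]

cd = coverage_density(example_cluster)
mode = cluster_mode(example_cluster, Fraction(4, 5))
print(cd, mode)
```

With this cluster the sketch gives a coverage density of 7/9 and a mode containing only c, matching the slide. An empty cluster would raise a division error; the sketch leaves that case unhandled.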

Methodology – ACTD: Inter-cluster similarity

Transactional-cluster-mode dissimilarity

1 - (3 + 3) / (2 × 3) = 0
1 - (3 + 3) / (2 × 5) = 1 - 6/10 = 2/5
1 - (3 + 3) / (2 × 6) = 1 - 6/12 = 1/2

Range: [0, .5]

Time complexity: O(dmN^2 log N), O(MN^2 log N)
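The figures on this slide are consistent with a dissimilarity of the form 1 - (|A| + |B|) / (2 × |A ∪ B|) between two cluster modes A and B. That reading is an assumption inferred from the numbers and the stated range [0, .5], not a formula given in the transcript. A minimal Python sketch under that assumption, with a hypothetical function name and made-up item sets:

```python
from fractions import Fraction


def mode_dissimilarity(mode_i, mode_j):
    """Assumed form: 1 - (|A| + |B|) / (2 * |A union B|) for two item sets."""
    union_size = len(mode_i | mode_j)
    if union_size == 0:
        raise ValueError("both modes are empty")
    return 1 - Fraction(len(mode_i) + len(mode_j), 2 * union_size)


# Hypothetical modes of three items each.
same = mode_dissimilarity({"a", "b", "c"}, {"a", "b", "c"})      # union of 3
partial = mode_dissimilarity({"a", "b", "c"}, {"c", "d", "e"})   # union of 5
disjoint = mode_dissimilarity({"a", "b", "c"}, {"d", "e", "f"})  # union of 6

print(same, partial, disjoint)
```

Identical modes give 0 and disjoint modes of equal size give 1/2. Because the union is never larger than the two sizes added together, the value cannot exceed 1/2 under this form.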

Methodology – DMDI

Valleys, where the value changes dramatically
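The transcript does not say how DMDI is computed, so the sketch below only illustrates the generic idea of picking valleys from a curve indexed by K. The function name and the numbers are made up for illustration; they are not results from the paper.

```python
def find_valleys(values_by_k):
    """Return the K values whose value is lower than both neighbouring K values."""
    ks = sorted(values_by_k)
    return [
        k
        for prev, k, nxt in zip(ks, ks[1:], ks[2:])
        if values_by_k[k] < values_by_k[prev] and values_by_k[k] < values_by_k[nxt]
    ]


# Made-up index values per K, for illustration only.
curve = {2: 0.30, 3: 0.10, 4: 0.25, 5: 0.22, 6: 0.05, 7: 0.20}
candidate_ks = find_valleys(curve)
print(candidate_ks)
```

The smallest and largest K are never reported because they have only one neighbour. A real implementation would probably also require a minimum depth before counting a dip as a valley.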

Experiments – Performance

Experiments – Quality

Experiments – Quality on sample dataset

With noise

Conclusions

ACTD: the coverage density-based method is promising for transactional datasets

Faster and more stable than the entropy-based method

The agglomerative hierarchical clustering algorithm and DMDI can help to find the best K

Comments

Advantage …

Drawback …

Application …