kdd poster nurjahan begum

Observation 1: The convergence of DTW and Euclidean distance results for increasing data sizes.

Observation 2: The increasing effectiveness of lower-bounding pruning for increasing data sizes.

Accelerating Dynamic Time Warping Clustering with a Novel Admissible Pruning Strategy

Nurjahan Begum, Liudmila Ulanova, Jun Wang¹ and Eamonn Keogh

University of California, Riverside; UT Dallas¹

Why is DTW Clustering Hard?

Motivation of DTW Clustering

Density Peaks (DP) Algorithm

Why Existing Work is not the Answer?

TADPole: Our Proposed Algorithm

How ‘good’ are TADPole Clusters?

Case Study 1: Electromagnetic Articulograph

How Effective is TADPole’s Pruning?

[Figure: hashtag time series for #kanyewest, #Michael, #MichaelJackson and #taylorswift over 0 to 120 hours. Labels: Synonym Discovery?, Association Discovery?. Annotation: “I’mma let you finish”.]

[Figure: Bos taurus, Hyperoodon ampullatus and Talpa europaea grouped under DTW and under ED. Label: Cetartiodactyla.]

[Figure: 1-NN error rate (0.01 to 0.07) versus size of training set (0 to 2000), for Euclidean and DTW.]

[Figure: Rand Index (0.6 to 0.9) versus dataset size (0 to 2000), for DTW and Euclidean.]

Neither of these two observations helps!

[Figure: toy dataset of 13 numbered objects, with the annotations "Mislabeled by k-means" and "Outlier".]

Scalability Issue: DTW is not a metric, so it is very difficult to index.

Quality Issue: We need a clustering algorithm that is insensitive to outliers.
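
The claim that DTW is not a metric can be checked on a tiny example. The sketch below is an illustration, not the poster's code: it assumes unconstrained DTW with an absolute-difference cost, and the three short series are chosen only to expose a triangle-inequality violation.

```python
def dtw(a, b):
    """Unconstrained DTW with absolute-difference cost (assumed cost function)."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Three illustrative series (not from the poster).
x, y, z = [0], [1, 2], [2, 3, 3]
d_xy, d_yz, d_xz = dtw(x, y), dtw(y, z), dtw(x, z)

# A metric would guarantee d_xz <= d_xy + d_yz; DTW does not.
triangle_violated = d_xz > d_xy + d_yz
print(d_xy, d_yz, d_xz, triangle_violated)
```

Because the triangle inequality can fail, index structures that rely on it to discard candidates cannot be applied to DTW directly.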

3 steps

1. Density Calculation

2. NN within Higher Density List Calculation

3. Cluster Assignment
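
The three steps above can be sketched as follows. This is a minimal illustration of the Density Peaks idea on a precomputed, symmetric distance matrix, not the authors' implementation. The function name `density_peaks`, the tie-breaking by index and the choice of centers by largest ρ·δ are assumptions made for the example.

```python
def density_peaks(D, dc, k):
    """Cluster n objects given a full distance matrix D, cutoff dc, and k clusters."""
    n = len(D)

    # Step 1: local density = number of other objects closer than dc.
    rho = [sum(1 for j in range(n) if j != i and D[i][j] < dc) for i in range(n)]

    # Visit objects from densest to sparsest (ties broken by index: an assumption).
    order = sorted(range(n), key=lambda i: (-rho[i], i))

    # Step 2: nearest neighbor within each object's higher density list.
    nn = [-1] * n
    delta = [0.0] * n
    for pos, i in enumerate(order):
        if pos == 0:
            delta[i] = max(D[i])  # densest object has no higher density list
            continue
        j = min(order[:pos], key=lambda h: D[i][h])
        nn[i], delta[i] = j, D[i][j]

    # Step 3: pick k centers (densest object first, then largest rho * delta),
    # then let every other object inherit the label of its NN from step 2.
    ranked = sorted(range(n), key=lambda i: -rho[i] * delta[i])
    centers = [order[0]] + [i for i in ranked if i != order[0]][: k - 1]
    label = [-1] * n
    for c, i in enumerate(centers):
        label[i] = c
    for i in order:
        if label[i] == -1:
            label[i] = label[nn[i]]
    return rho, nn, label


# Toy usage: six points on a line forming two obvious groups.
pts = [0, 1, 2, 10, 11, 12]
D = [[abs(p - q) for q in pts] for p in pts]
rho, nn, label = density_peaks(D, dc=1.5, k=2)
print(rho, label)
```

Processing objects in decreasing density order guarantees that an object's nearest higher-density neighbor already has a label when the object is reached.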

[Figure: Density Peaks on the toy dataset of 13 objects. Local density ρ for objects 1 to 13: 4, 3, 6, 4, 5, 3, 1, 3, 1, 1, 2, 2, 2. Elements with higher density than item 1: items 3 and 5 (values 4.2 and 6 shown with them).]

Item 1’s cluster label = item 3’s cluster label

$\rho_i = \sum_j \chi(d_{ij} - d_c)$, where $\chi(x) = 1$ if $x < 0$ and $0$ otherwise.

Pruning During Local Density Computation

[Figure: four cases, A) to D), for a pair of objects i and j, showing where the cutoff dc falls relative to LBMatrix(i,j), the true distance Dij and UBMatrix(i,j). Case A) is Dij = 0.]
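
A hedged sketch of how lower and upper bounds can remove DTW computations from the density step: if the upper bound is already below dc, j is certainly within dc; if the lower bound is at or above dc, j is certainly outside; only when dc falls between the two bounds is the true distance needed. The bound matrices are taken as given inputs here, since the poster text does not say how they are built. `dtw_fn` and the scaled toy bounds are hypothetical stand-ins, not the authors' code.

```python
def local_density_pruned(lb, ub, dc, dtw_fn):
    """Return (rho, computed): local densities and the number of true distances computed.

    lb[i][j] <= D_ij <= ub[i][j] is assumed for every pair (admissible bounds).
    dtw_fn(i, j) is a hypothetical callback returning the true distance D_ij.
    """
    n = len(lb)
    rho = [0] * n
    computed = 0
    for i in range(n):
        for j in range(i + 1, n):  # i == j is skipped: D_ii = 0 needs no work
            if ub[i][j] < dc:
                within = True      # D_ij <= UB < dc: certainly inside, pruned
            elif lb[i][j] >= dc:
                within = False     # D_ij >= LB >= dc: certainly outside, pruned
            else:
                within = dtw_fn(i, j) < dc  # bounds straddle dc: compute D_ij
                computed += 1
            if within:
                rho[i] += 1
                rho[j] += 1
    return rho, computed


# Toy usage: true distances are absolute differences; the bounds are simply
# scaled copies of them (illustrative only).
pts = [0, 1, 2, 10]
true_d = [[abs(p - q) for q in pts] for p in pts]
lb = [[0.5 * d for d in row] for row in true_d]
ub = [[1.2 * d for d in row] for row in true_d]
rho, computed = local_density_pruned(lb, ub, dc=1.5, dtw_fn=lambda i, j: true_d[i][j])
print(rho, computed)  # brute force would compute all 6 pairwise distances
```

The result is identical to what brute force would return, because a pair is skipped only when both bounds agree on which side of dc the true distance lies.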

Pruning During NN Distance Calculation From Higher Density List

[Figure: four cases, A) to D), for an object i and candidates j1 to j4 from its higher density list, comparing the distances D1 to D4 with LBMatrix(i,j1) to LBMatrix(i,j4) and UBMatrix(i,j1) to UBMatrix(i,j4).]
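
One way to read this step, sketched under assumptions: the nearest-neighbor distance of i within its higher density list can never exceed the smallest upper bound among the candidates, so any candidate whose lower bound exceeds that cap (or the best true distance found so far) can be skipped without changing the answer. Visiting candidates in increasing lower-bound order is a choice made for this example. `dtw_fn` is a hypothetical callback, and the poster text does not confirm this exact procedure.

```python
def nn_in_higher_density(i, candidates, lb, ub, dtw_fn):
    """Return (nn, nn_dist, computed) for object i over its higher density list.

    lb[i][j] <= D_ij <= ub[i][j] is assumed for every candidate j.
    """
    if not candidates:  # the densest object has no higher density list
        return None, float("inf"), 0

    # The NN distance cannot exceed the smallest upper bound.
    cap = min(ub[i][j] for j in candidates)
    nn, nn_dist, computed = None, float("inf"), 0

    # Increasing lower-bound order: once one candidate is pruned, the rest are too.
    for j in sorted(candidates, key=lambda j: lb[i][j]):
        if lb[i][j] > min(cap, nn_dist):
            break  # D_ij >= LB > current best: j cannot be the NN
        d = dtw_fn(i, j)
        computed += 1
        if d < nn_dist:
            nn, nn_dist = j, d
    return nn, nn_dist, computed


# Toy usage with illustrative bounds (scaled copies of the true distances).
pts = [0, 1, 2, 10]
true_d = [[abs(p - q) for q in pts] for p in pts]
lb = [[0.8 * d for d in row] for row in true_d]
ub = [[1.2 * d for d in row] for row in true_d]
nn, nn_dist, computed = nn_in_higher_density(
    0, [1, 2, 3], lb, ub, dtw_fn=lambda i, j: true_d[i][j]
)
print(nn, nn_dist, computed)  # only one of the three candidates is computed
```

The pruning is admissible in the sense that the returned neighbor is the same one an exhaustive scan would find (up to ties in distance).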

[Figure: distance calculations versus number of objects (0 to 3500), with curves labeled Brute force and TADPole, in two panels: absolute number (up to 7 x 10^6) and percentage (0 to 100).]

DP: 9 hours. TADPole: 9 minutes.

Distance Computation Ordering: Anytime TADPole

[Figure: Rand Index (0.4 to 1) versus Distance Computation Percentage (0 to 100%) for Oracle Order, TADPole Order and Random Order, with Euclidean Distance as a reference. A second panel, "Zoom-In of Above Figure", shows the 0 to 10% range.]

Annotations: "This reflects the 90% of DTW calculations that were admissibly pruned" and "This reflects the 10% of DTW calculations that were calculated in anytime ordering".

[Figure: Y and Z traces (0 to 150) with labels 1 to 7, and Rand Index (0.84 to 1) versus Distance Computation Percentage for Oracle Order, Random Order and TADPole Order, with Euclidean Distance as a reference.]

Pruning: 94%

Case Study 2: Pulsus Dataset

[Figure: pulse oximeter schematic (LED, photo detector, vein, artery) and PPG examples labeled Healthy, Suspected Pulsus and Severe Pulsus. Panels A) to F): power spectral density versus frequency, with Normalized Respiration Rate and Normalized Heart Rate marked, for Patient 639, Patient 523, Patient 618 and Patient 2975918; PPG traces (200 to 1800) for Non-Severe Pulsus and Severe Pulsus.]

Reproducibility

All the code and datasets used in this paper are publicly available at: www.cs.ucr.edu/~nbegu001/SpeededClusteringDTW

Pruning: 88%
