kdd poster nurjahan begum
Observation 1: The convergence of DTW and Euclidean distance results for increasing data sizes.
Observation 2: The increasing effectiveness of lower-bounding pruning for increasing data sizes.
Accelerating Dynamic Time Warping Clustering with a Novel Admissible Pruning Strategy
Nurjahan Begum, Liudmila Ulanova, Jun Wang¹ and Eamonn Keogh
University of California, Riverside; ¹UT Dallas
Why is DTW Clustering Hard?
Motivation of DTW Clustering
Density Peaks (DP) Algorithm
Why Existing Work is not the Answer?
TADPole: Our Proposed Algorithm
How ‘good’ are TADPole Clusters?
Case Study 1: Electromagnetic Articulograph
How Effective is TADPole’s Pruning?
Figure: Hashtag time series (#kanyewest, #taylorswift, #Michael, #MichaelJackson) over 0 to 120 hours. Annotations: "Synonym Discovery?", "Association Discovery?", “I’mma let you finish”.
Figure: Groupings of Bos taurus, Hyperoodon ampullatus and Talpa europaea under DTW and under ED. Group label: Cetartiodactyla.
Figure: 1-NN error rate (0.01 to 0.07) versus size of training set (0 to 2000), for Euclidean and DTW.
Figure: Rand Index (0.6 to 0.9) versus dataset size (0 to 2000), for DTW and Euclidean.
Neither of these two observations helps!
Figure: Thirteen numbered objects (1 to 13), drawn twice. Annotations: "Mislabeled by k-means", "Outlier".
Scalability Issue: DTW is not a metric and is therefore very difficult to index
Quality Issue: We need a clustering algorithm that is insensitive to outliers
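The scalability issue can be made concrete with a small counterexample. The sketch below is illustrative only: it assumes an unconstrained DTW with absolute-difference cost (the poster does not state its cost function), and the three toy sequences x, y, z are made up for the demonstration. It shows a triple for which the triangle inequality fails, which is why metric indexing structures cannot be applied directly.

```python
def dtw(a, b):
    """Unconstrained DTW with absolute-difference cost (an illustrative choice)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # acc[i][j] = cost of the best warping path aligning a[:i] with b[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            acc[i][j] = step + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]

# Hypothetical toy sequences chosen to expose the violation.
x = [0, 0, 0, 0]
y = [0, 1]
z = [1]

d_xz = dtw(x, z)  # every 0 in x must align with the single 1 in z
d_xy = dtw(x, y)  # only the last alignment costs anything
d_yz = dtw(y, z)
triangle_holds = d_xz <= d_xy + d_yz
print(d_xz, d_xy, d_yz, triangle_holds)  # 4.0 1.0 1.0 False
```

Going from x to z through y is cheaper than going directly, so the triangle inequality is violated.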
3 steps
1. Density Calculation
2. NN within Higher Density List Calculation
3. Cluster Assignment
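The three steps above can be sketched on a precomputed distance matrix. This is a minimal sketch under stated assumptions, not the authors' code: the function name `density_peaks`, the tie-breaking by index, and the choice of cluster centers as the objects with the largest density times NN-distance product are all assumptions made for illustration.

```python
import numpy as np

def density_peaks(dist, dc, n_clusters):
    """Cluster from a full, symmetric pairwise distance matrix `dist`."""
    n = dist.shape[0]

    # Step 1: local density = number of other objects closer than dc.
    rho = (dist < dc).sum(axis=1) - 1  # subtract the object itself

    # Step 2: nearest neighbor within the higher density list.
    order = np.argsort(-rho, kind="stable")  # densest first, ties by index
    delta = np.zeros(n)
    nn_higher = np.full(n, -1)
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = dist[i].max()  # densest object has no denser neighbor
            continue
        higher = order[:rank]
        k = higher[np.argmin(dist[i, higher])]
        delta[i] = dist[i, k]
        nn_higher[i] = k

    # Step 3: choose centers (assumed heuristic: largest rho * delta), then
    # give every other object the label of its NN in the higher density list.
    centers = np.argsort(-(rho * delta), kind="stable")[:n_clusters]
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_clusters)
    for i in order:  # denser objects are labeled before sparser ones
        if labels[i] == -1:
            labels[i] = labels[nn_higher[i]]
    return rho, delta, labels

# Toy data: two well-separated groups on a line.
pts = np.array([0.0, 0.1, 0.2, 0.3, 10.0, 10.1, 10.2])
dist = np.abs(pts[:, None] - pts[None, :])
rho, delta, labels = density_peaks(dist, dc=0.25, n_clusters=2)
print(labels.tolist())  # [0, 0, 0, 0, 1, 1, 1]
```

Processing objects from densest to sparsest means each object's nearest denser neighbor already has a label when it is needed, so a single pass suffices for assignment.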
Figure: Density Peaks example on 13 numbered objects with their local densities ρ. Annotations: "Elements with higher density" (3, 5; values 4.2, 6) and "Item 1’s cluster label = item 3’s cluster label".
Local density: $\rho_i = \sum_j \chi(d_{ij} - d_c)$
Pruning During Local Density Computation
Figure: Four cases for a pair of objects i and j. A) i = j, so Dij = 0. B), C), D) Different positions of the cutoff dc relative to the interval from LBMatrix(i,j) to UBMatrix(i,j), which contains Dij.
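One way to read the cases above is as a three-way test on the bounds. The following is a sketch under assumptions: the function name `local_density_pruned`, the callback `true_dist`, and the synthetic bounds (80% and 120% of the true distance) are hypothetical stand-ins, and the lower and upper bound matrices are taken as given. The true distance is computed only when dc falls between the two bounds.

```python
import numpy as np

def local_density_pruned(lb, ub, dc, true_dist):
    """Count neighbors within dc, calling true_dist only when bounds disagree."""
    n = lb.shape[0]
    rho = np.zeros(n, dtype=int)
    computed = 0
    for i in range(n):
        for j in range(i + 1, n):  # i == j is skipped: that distance is 0
            if ub[i, j] < dc:
                within = True   # true distance <= upper bound < dc
            elif lb[i, j] >= dc:
                within = False  # true distance >= lower bound >= dc
            else:
                within = true_dist(i, j) < dc  # bounds straddle dc
                computed += 1
            if within:
                rho[i] += 1
                rho[j] += 1
    return rho, computed

# Synthetic check: made-up points, with bounds derived from the true distance.
rng = np.random.default_rng(0)
pts = rng.normal(size=(30, 2))
d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
lb, ub = 0.8 * d, 1.2 * d
dc = float(np.median(d))

rho, computed = local_density_pruned(lb, ub, dc, lambda i, j: d[i, j])
brute = (d < dc).sum(axis=1) - 1  # densities from all pairwise distances
total_pairs = 30 * 29 // 2
print(np.array_equal(rho, brute), computed, total_pairs)
```

Because both shortcuts follow from valid bounds, the densities match the brute-force result exactly; only the number of true distance computations changes.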
Pruning During NN Distance Calculation From Higher Density List
Figure: Four panels, A) to D), pairing object i with candidates j1 to j4 from its higher density list. Each panel shows LBMatrix(i,jk), the distance Dk and UBMatrix(i,jk).
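A plausible reading of this step is a best-so-far search over the higher density list. The sketch below is an assumption-laden illustration rather than the poster's exact procedure: the function name, the visiting order (ascending lower bound) and the toy distances are hypothetical. A candidate is skipped when its lower bound already exceeds either the smallest upper bound or the best true distance found so far.

```python
import numpy as np

def nn_in_higher_density_list(i, candidates, lb, ub, true_dist):
    """Return (nearest candidate, its distance, number of true distances computed)."""
    # The smallest upper bound caps the nearest-neighbor distance.
    best_ub = min(ub[i][j] for j in candidates)
    best_j, best_d, computed = None, float("inf"), 0
    # Visit candidates from the smallest lower bound upward.
    for j in sorted(candidates, key=lambda j: lb[i][j]):
        if lb[i][j] > min(best_d, best_ub):
            break  # this candidate and all later ones cannot be nearer
        dist_ij = true_dist(i, j)
        computed += 1
        if dist_ij < best_d:
            best_j, best_d = j, dist_ij
    return best_j, best_d, computed

# Toy example: made-up true distances from object 0 to candidates 1..4.
d = np.zeros((5, 5))
d[0, 1:] = [5.0, 1.0, 8.0, 1.1]
lb, ub = 0.8 * d, 1.2 * d  # synthetic bounds around the true distance

best_j, best_d, computed = nn_in_higher_density_list(
    0, [1, 2, 3, 4], lb, ub, lambda i, j: d[i, j]
)
print(best_j, best_d, computed)  # 2 1.0 2
```

Here only two of the four true distances are computed; candidates 1 and 3 are discarded from their lower bounds alone.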
Figure: Distance calculations versus number of objects (0 to 3500). Panel 1: absolute number (1 to 7 x 10^6), TADPole. Panel 2: percentage (0 to 100), Brute force and TADPole.
DP: 9 Hours TADPole: 9 minutes
Distance Computation Ordering: Anytime TADPole
Figure: Rand Index (0.4 to 1) versus Distance Computation Percentage (0 to 100%), for Euclidean Distance, Oracle Order, TADPole Order and Random Order.
Zoom-In of Above Figure: the same curves over 0 to 10% of distance computations.
Annotations: "This reflects the 90% of DTW calculations that were admissibly pruned" and "This reflects the 10% of DTW calculations that were calculated in anytime ordering".
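The plots on this poster score cluster quality with the Rand Index. For readers unfamiliar with it, here is a minimal sketch of the standard definition: the fraction of object pairs on which two labelings agree (both put the pair together, or both put it apart). The function name and the toy labelings are illustrative assumptions.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of object pairs treated the same way by both labelings."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Made-up labelings of six objects.
truth = [0, 0, 0, 1, 1, 1]
found = [1, 1, 0, 0, 0, 0]
score = rand_index(truth, found)
print(round(score, 4))  # 0.6667 (10 of 15 pairs agree)
```

The score depends only on which objects are grouped together, not on the label values themselves, so a labeling compared with a renamed copy of itself scores 1.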
Figure: Y and Z traces (0 to 150), items labeled 1 to 7. Rand Index (0.84 to 1) versus Distance Computation Percentage, for Euclidean Distance, Oracle Order, Random Order and TADPole Order.
Pruning: 94%
Case Study 2: Pulsus Dataset
Figure: Panels A) to F). Oximeter diagram (LED, Photo Detector, Vein, Artery). Classes: Healthy, Suspected Pulsus, Severe Pulsus. Recordings for Patient 639, Patient 523, Patient 618 and Patient 2975918 (0 to 60). Power Spectral Density versus Frequency, with Normalized Respiration Rate and Normalized Heart Rate marked. PPG examples of Non-Severe Pulsus and Severe Pulsus (200 to 1800).
Reproducibility
All the code and datasets used in this paper are publicly available at: www.cs.ucr.edu/~nbegu001/SpeededClusteringDTW
Pruning: 88%