TRANSCRIPT
k-Shape: Efficient and Accurate Clustering of Time Series
John Paparrizos and Luis Gravano
Sigmod 2015
Time Series: Sequentially Collected Observations
• Time series are ubiquitous and abundant in many disciplines
• Medicine: electrocardiogram (ECG); voltage over time, showing atria activation, ventricle activation, and the recovery wave
• Engineering: human gaits; ground force over time [Weyand et al., Journal of Applied Physiology 2000]
Time-Series Analysis: Similar Challenges Across Tasks
• Popular time-series analysis tasks: querying, classification, and clustering
• Manipulation of time series is challenging and time-consuming
• Operations must handle data distortions, noise, missing data, high dimensionality, …
• Each domain has different requirements and needs
• Choice of distance measure is critical for effective time-series analysis
(Figure: three tasks with their input and output. Querying: a query series. Classification: a new time series, Class A or B? Clustering: our focus.)
Shape-Based Clustering: Group Series with Similar Patterns
Input: a set of time series. Output: k time-series clusters.
Funnel cluster Cylinder cluster Bell cluster
• Group time series into clusters based on their shape similarity (i.e., regardless of any differences in amplitude and phase)
• Offer scaling and translation invariances: ignore differences in amplitude (in the figure, the distance between the two series drops from 483 to 37)
• Offer shift invariance: ignore differences in phase (in the figure, the distance drops from 694 to 17)
Shape-Based Clustering: k-Means
• Objective: Find the partition P* that minimizes the within-cluster sum of squared distances between time series and centroids
• k-Means finds a locally optimal solution by iteratively performing two steps:
  • Assignment step: assigns each time series to the cluster of its nearest centroid
  • Refinement step: updates centroids to reflect changes in cluster membership
• Centroid computation finds the time series that minimizes the sum of squared distances to all other time series in the cluster
• Requirements: a distance measure and a centroid computation method
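The assignment/refinement iteration above can be illustrated with plain k-means under Euclidean distance; this is a minimal sketch, and the deterministic initialization from the first k points is a simplification (real implementations initialize randomly):

```python
import numpy as np

def k_means(X, k, iters=100):
    """Minimal k-means sketch: iterate assignment and refinement steps."""
    X = np.asarray(X, float)
    centroids = X[:k].copy()  # first k points as initial centroids (simplification)
    for _ in range(iters):
        # assignment step: each series joins its nearest centroid's cluster
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # refinement step: the arithmetic mean minimizes the within-cluster
        # sum of squared distances, so it becomes the new centroid
        new = np.array([X[labels == j].mean(0) if (labels == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids
```

With two well-separated groups, the loop converges in a couple of iterations to one centroid per group.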
Shape-Based Clustering: Existing Approaches
• Choice of distance measures: Euclidean Distance (ED) and Dynamic Time Warping (DTW)
• Choice of centroid computation methods:
  • Arithmetic mean of the coordinates of time series (AVG)
  • Non-linear alignment and averaging filters (NLAAF)
  • Prioritized shape averaging (PSA)
  • Dynamic time warping barycenter averaging (DBA)
• Issues with existing approaches: cannot scale, as they rely on expensive methods (e.g., DTW)
(Figure: ED compares points one-to-one in time, whereas DTW aligns points non-linearly.)
k-Shape: A Novel Instantiation of k-Means for Efficient Shape-Based Clustering
k-Shape accounts for the shapes of time series during clustering
• k-Shape is a scale-, translate-, and shift-invariant clustering method
• Distance measure: A normalized version of the cross-correlation measure
• Centroid computation: a novel method based on that distance measure
(Figure: correlation of the input series across possible shifts; k-Shape's centroid captures the shared shape, unlike the k-means centroid.)
k-Shape’s Scale-, Translate-, and Shift-Invariant Distance Measure
• Cross-correlation measure: Keep one sequence static and slide the other to compute the correlation (i.e., inner product) for each shift
• Intuition: Determine shift that exhibits the maximum correlation
k-Shape’s Scale-, Translate-, and Shift-Invariant Distance Measure
• Key issue: Time-series normalizations and cross-correlation sequence
• Shape-Based Distance (SBD):
  • Z-normalize the time series (i.e., mean = 0 and std = 1)
  • Divide the cross-correlation sequence by the geometric mean of the autocorrelations of the individual time series
• A naïve implementation makes SBD appear as slow as DTW
• SBD becomes very efficient with the use of the Fast Fourier Transform
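A minimal NumPy sketch of SBD along these lines; the power-of-two FFT padding follows the slide's description, but the function name and details such as shift tie-breaking are simplifications:

```python
import numpy as np

def sbd(x, y):
    """Shape-Based Distance sketch: 1 - max normalized cross-correlation.

    Steps: z-normalize both series, compute the cross-correlation for
    every shift via FFT, and normalize by the geometric mean of the
    lag-zero autocorrelations (||x|| * ||y||)."""
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    n = len(x)
    # pad to a power of two >= 2n - 1 so circular correlation becomes linear
    fft_size = 1 << (2 * n - 1).bit_length()
    cc = np.fft.irfft(np.fft.rfft(x, fft_size) * np.conj(np.fft.rfft(y, fft_size)),
                      fft_size)
    # reorder to shifts -(n-1) .. (n-1)
    cc = np.concatenate((cc[-(n - 1):], cc[:n]))
    ncc = cc / (np.linalg.norm(x) * np.linalg.norm(y))
    shift = int(ncc.argmax()) - (n - 1)
    return 1 - float(ncc.max()), shift
```

Because of the z-normalization, a series and any amplitude-scaled or translated copy of it have distance 0.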
(Figure: two perfectly aligned time series of length 1024; their cross-correlation sequence should be maximized at shift 1024. Inadequate normalizations produce misleading results, whereas k-Shape's coefficient normalization peaks at the correct shift.)
k-Shape’s Time-Series Centroid Computation Method
• The centroid computation of k-means is problematic for misaligned time series
• k-Shape performs two steps in its centroid computation:
  • First step: align time series towards a reference sequence
  • Second step: compute the time series that maximizes the sum of squared correlations to all other time series in the cluster
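The two steps might be sketched as below; the brute-force alignment search (an FFT would be used in practice) and the eigenvector formulation of "maximize the sum of squared correlations" are a simplified reading of the method, and all function names are illustrative:

```python
import numpy as np

def zscore(a):
    sd = a.std()
    return (a - a.mean()) / sd if sd > 0 else a - a.mean()

def shift_pad(x, s):
    """Shift x by s positions, padding with zeros (no wraparound)."""
    out = np.zeros_like(x)
    if s >= 0:
        out[s:] = x[:len(x) - s]
    else:
        out[:s] = x[-s:]
    return out

def shape_extraction(series, reference):
    """Step 1: align each series to the reference.
    Step 2: the centroid maximizing the sum of squared correlations is the
    principal eigenvector of the aligned data's centered scatter matrix."""
    n = len(reference)
    ref = zscore(np.asarray(reference, float))
    aligned = []
    for x in series:
        x = zscore(np.asarray(x, float))
        best = max(range(-(n - 1), n), key=lambda s: np.dot(shift_pad(x, s), ref))
        aligned.append(shift_pad(x, best))
    X = np.array(aligned)
    Q = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    M = Q @ (X.T @ X) @ Q
    centroid = np.linalg.eigh(M)[1][:, -1]     # eigenvector of largest eigenvalue
    if np.dot(centroid, X[0]) < 0:             # eigenvector sign is arbitrary
        centroid = -centroid
    return zscore(centroid)
```

For a cluster of shifted copies of one shape, the extracted centroid correlates strongly with that shape.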
(Figure: the input time series before and after alignment; k-Shape's centroid captures the common shape, whereas the k-means centroid does not.)
k-Shape’s Algorithm
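Putting the pieces together, the overall iteration can be sketched as below. Note two simplifications labeled in the comments: the centroid update is a plain cluster mean over z-normalized series rather than the paper's shape-extraction step, and the initialization is deterministic for illustration only:

```python
import numpy as np

def zscore(a):
    sd = a.std()
    return (a - a.mean()) / sd if sd > 0 else a - a.mean()

def sbd_dist(x, y):
    """1 - max normalized cross-correlation over all shifts, via FFT."""
    n = len(x)
    fft_size = 1 << (2 * n - 1).bit_length()
    cc = np.fft.irfft(np.fft.rfft(x, fft_size) * np.conj(np.fft.rfft(y, fft_size)),
                      fft_size)
    cc = np.concatenate((cc[-(n - 1):], cc[:n]))
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return 1 - cc.max() / denom if denom > 0 else 1.0

def k_shape(X, k, iters=20):
    """k-Shape-style loop sketch: SBD assignment + centroid refinement."""
    X = np.array([zscore(np.asarray(x, float)) for x in X])
    centroids = X[:k].copy()          # deterministic init (simplification)
    labels = np.full(len(X), -1)
    for _ in range(iters):
        # assignment step: nearest centroid under SBD
        new = np.array([min(range(k), key=lambda j: sbd_dist(x, centroids[j]))
                        for x in X])
        if np.array_equal(new, labels):
            break
        labels = new
        # refinement step: cluster mean of z-normalized series
        # (the paper uses shape extraction here instead)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = zscore(members.mean(axis=0))
    return labels
```

Because SBD is shift-invariant, shifted copies of the same shape end up in the same cluster even under this simplified centroid update.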
Experimental Settings
Datasets: largest public collection of annotated time-series datasets (UCR Archive [Keogh et al. 2011]); 48 real and synthetic datasets

Baselines for distance measures (Dist):
• Euclidean Distance (ED): efficient yet accurate measure [Faloutsos et al., SIGMOD 1994]
• Dynamic Time Warping (DTW): the currently best-performing, but expensive, distance measure [Keogh, VLDB 2002]
• constrained Dynamic Time Warping (cDTW): constrained version of DTW with improved accuracy and efficiency [Keogh, VLDB 2002]

Baselines for scalable clustering methods:
• k-AVG+Dist: k-means with distance measure Dist
• KSC: k-means with pairwise scaling and shifting in distance and centroid
• k-DBA: k-means with DTW and a suitable centroid computation (i.e., DBA)

Evaluation metrics:
• For runtime: CPU time utilization
• For distance measures: one-nearest-neighbor classification accuracy
• For clustering methods: Rand Index
• For statistical analysis: Wilcoxon test and Friedman test
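For reference, the Rand Index used for clustering evaluation is the fraction of pairs of series on which the ground truth and the predicted clustering agree (both in the same cluster, or both in different clusters); a minimal sketch:

```python
from itertools import combinations

def rand_index(true, pred):
    """Fraction of pairs on which the two clusterings agree."""
    pairs = list(combinations(range(len(true)), 2))
    agree = sum((true[i] == true[j]) == (pred[i] == pred[j]) for i, j in pairs)
    return agree / len(pairs)
```

Note the index is invariant to cluster relabeling: a clustering identical up to label permutation scores 1.0.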
SBD Against State-of-the-Art Distance Measures
Results are relative to ED; runtimes are multiples of ED's runtime, and paired values correspond to the measure's variants (e.g., without / with lower bounding). >, =, <: number of datasets where the measure is more, equally, or less accurate than ED.

Distance Measure                >    =    <    Better   Avg. Accuracy   Runtime
DTW / DTW_LB                    29   2    17   ✔        0.788           15573x / 6040x
cDTW_opt                        31   15   2    ✔        0.814           2873x / 322x
cDTW_5                          34   3    11   ✔        0.809           1558x / 122x
cDTW_10                         33   1    14   ✔        0.804           2940x / 364x
SBD_NoFFT / SBD_NoPow2 / SBD    30   12   6    ✔        0.795           224x / 8.7x / 4.4x

LB subscript: Keogh's lower bounding; 5, 10, and opt superscripts: 5%, 10%, and optimal % of time-series length
• Accuracy:
  • SBD significantly outperforms ED
  • SBD is as competitive as DTW and cDTW
• Efficiency:
  • SBD is one and two orders of magnitude faster than cDTW and DTW, respectively
k-Shape Against Scalable Clustering Methods
Results are relative to k-AVG+ED
Algorithm > = < Better Worse Rand Index Runtime
k-AVG+SBD 32 1 15 ✖ ✖ 0.745 3.6x
k-AVG+DTW 10 0 38 ✖ ✔ 0.584 3444x
KSC 22 0 26 ✖ ✖ 0.636 448x
k-DBA 18 0 30 ✖ ✖ 0.733 3892x
k-Shape+DTW 19 1 28 ✖ ✖ 0.698 4175x
k-Shape 36 1 11 ✔ ✖ 0.772 12.4x
• Accuracy:
  • k-Shape is the only scalable method that outperforms k-AVG+ED
  • k-Shape outperforms both KSC and k-DBA
• Efficiency:
  • k-Shape is one and two orders of magnitude faster than KSC and k-DBA, respectively
Shape-Based Time-Series Clustering: Conclusion
• k-Shape outperforms all but one of the state-of-the-art partitional, hierarchical, and spectral clustering approaches
• That one method achieves similar performance, but it is two orders of magnitude slower than k-Shape, and its distance measure requires tuning, unlike k-Shape's
• Overall, k-Shape is a domain-independent, accurate, and scalable approach for time-series clustering.
Thank you!