TRANSCRIPT
k-Shape: Efficient and Accurate Clustering of Time Series
John Paparrizos and Luis Gravano
Sigmod 2015
Time Series: Sequentially Collected Observations
• Time series are ubiquitous and abundant in many disciplines
• Medicine: electrocardiogram (ECG); voltage over time, showing atria activation, ventricle activation, and the recovery wave
• Engineering: human gaits; ground force over time [Weyand et al., Journal of Applied Physiology 2000]
Time-Series Analysis: Similar Challenges Across Tasks
• Popular time-series analysis tasks: querying, classification, and clustering
• Manipulation of time series is challenging and time-consuming
• Operations must handle data distortions, noise, missing data, high dimensionality, …
• Each domain has different requirements and needs
• Choice of distance measure is critical for effective time-series analysis
(Figure: three tasks with their input and output. Querying: a query series. Classification: a new time series, Class A or B? Clustering: our focus.)
Shape-Based Clustering: Group Series with Similar Patterns
Input: a set of time series. Output: k time-series clusters.
Funnel cluster Cylinder cluster Bell cluster
• Group time series into clusters based on their shape similarity (i.e., regardless of any differences in amplitude and phase)
• Offer scaling and translation invariances: ignore differences in amplitude (in the figure, the distance between the two series drops from 483 to 37)
• Offer shift invariance: ignore differences in phase (in the figure, the distance drops from 694 to 17)
Shape-Based Clustering: k-Means
• Objective: Find the partition P* that minimizes the within-cluster sum of squared distances between time series and centroids
• k-Means finds a locally optimal solution by iteratively performing two steps:
  • Assignment step: assigns each time series to the cluster of its nearest centroid
  • Refinement step: updates centroids to reflect changes in cluster membership
• Centroid computation finds the time series that minimizes the sum of squared distances to all other time series in the cluster
• Requirements: a distance measure and a centroid computation method
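The assignment/refinement iteration above can be illustrated with plain k-means under Euclidean distance; this is a minimal sketch, and the deterministic initialization from the first k points is a simplification (real implementations initialize randomly):

```python
import numpy as np

def k_means(X, k, iters=100):
    """Minimal k-means sketch: iterate assignment and refinement steps."""
    X = np.asarray(X, float)
    centroids = X[:k].copy()  # first k points as initial centroids (simplification)
    for _ in range(iters):
        # assignment step: each series joins its nearest centroid's cluster
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # refinement step: the arithmetic mean minimizes the within-cluster
        # sum of squared distances, so it becomes the new centroid
        new = np.array([X[labels == j].mean(0) if (labels == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids
```

With two well-separated groups, the loop converges in a couple of iterations to one centroid per group.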
Shape-Based Clustering: Existing Approaches
• Choice of distance measures: Euclidean Distance (ED) and Dynamic Time Warping (DTW)
• Choice of centroid computation methods:
  • Arithmetic mean of the coordinates of time series (AVG)
  • Non-linear alignment and averaging filters (NLAAF)
  • Prioritized shape averaging (PSA)
  • Dynamic time warping barycenter averaging (DBA)
• Issues with existing approaches: cannot scale, as they rely on expensive methods (e.g., DTW)
(Figure: ED compares points one-to-one in time, whereas DTW aligns points non-linearly.)
k-Shape: A Novel Instantiation of k-Means for Efficient Shape-Based Clustering
k-Shape accounts for the shapes of time series during clustering
• k-Shape is a scale-, translate-, and shift-invariant clustering method
• Distance measure: A normalized version of the cross-correlation measure
• Centroid computation: a novel method based on that distance measure
(Figure: correlation of the input series across possible shifts; k-Shape's centroid captures the shared shape, unlike the k-means centroid.)
k-Shape’s Scale-, Translate-, and Shift-Invariant Distance Measure
• Cross-correlation measure: Keep one sequence static and slide the other to compute the correlation (i.e., inner product) for each shift
• Intuition: Determine shift that exhibits the maximum correlation
k-Shape’s Scale-, Translate-, and Shift-Invariant Distance Measure
• Key issue: Time-series normalizations and cross-correlation sequence
• Shape-Based Distance (SBD):
  • Z-normalize the time series (i.e., mean = 0 and std = 1)
  • Divide the cross-correlation sequence by the geometric mean of the autocorrelations of the individual time series
• A naïve implementation makes SBD appear as slow as DTW
• SBD becomes very efficient with the use of the Fast Fourier Transform
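A minimal NumPy sketch of SBD along these lines; the power-of-two FFT padding follows the slide's description, but the function name and details such as shift tie-breaking are simplifications:

```python
import numpy as np

def sbd(x, y):
    """Shape-Based Distance sketch: 1 - max normalized cross-correlation.

    Steps: z-normalize both series, compute the cross-correlation for
    every shift via FFT, and normalize by the geometric mean of the
    lag-zero autocorrelations (||x|| * ||y||)."""
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    n = len(x)
    # pad to a power of two >= 2n - 1 so circular correlation becomes linear
    fft_size = 1 << (2 * n - 1).bit_length()
    cc = np.fft.irfft(np.fft.rfft(x, fft_size) * np.conj(np.fft.rfft(y, fft_size)),
                      fft_size)
    # reorder to shifts -(n-1) .. (n-1)
    cc = np.concatenate((cc[-(n - 1):], cc[:n]))
    ncc = cc / (np.linalg.norm(x) * np.linalg.norm(y))
    shift = int(ncc.argmax()) - (n - 1)
    return 1 - float(ncc.max()), shift
```

Because of the z-normalization, a series and any amplitude-scaled or translated copy of it have distance 0.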
(Figure: two perfectly aligned time series of length 1024; their cross-correlation sequence should be maximized at shift 1024. Inadequate normalizations produce misleading results, whereas k-Shape's coefficient normalization peaks at the correct shift.)
k-Shape’s Time-Series Centroid Computation Method
• The centroid computation of k-means is problematic for misaligned time series
• k-Shape performs two steps in its centroid computation:
  • First step: align time series towards a reference sequence
  • Second step: compute the time series that maximizes the sum of squared correlations to all other time series in the cluster
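The two steps might be sketched as below; the brute-force alignment search (an FFT would be used in practice) and the eigenvector formulation of "maximize the sum of squared correlations" are a simplified reading of the method, and all function names are illustrative:

```python
import numpy as np

def zscore(a):
    sd = a.std()
    return (a - a.mean()) / sd if sd > 0 else a - a.mean()

def shift_pad(x, s):
    """Shift x by s positions, padding with zeros (no wraparound)."""
    out = np.zeros_like(x)
    if s >= 0:
        out[s:] = x[:len(x) - s]
    else:
        out[:s] = x[-s:]
    return out

def shape_extraction(series, reference):
    """Step 1: align each series to the reference.
    Step 2: the centroid maximizing the sum of squared correlations is the
    principal eigenvector of the aligned data's centered scatter matrix."""
    n = len(reference)
    ref = zscore(np.asarray(reference, float))
    aligned = []
    for x in series:
        x = zscore(np.asarray(x, float))
        best = max(range(-(n - 1), n), key=lambda s: np.dot(shift_pad(x, s), ref))
        aligned.append(shift_pad(x, best))
    X = np.array(aligned)
    Q = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    M = Q @ (X.T @ X) @ Q
    centroid = np.linalg.eigh(M)[1][:, -1]     # eigenvector of largest eigenvalue
    if np.dot(centroid, X[0]) < 0:             # eigenvector sign is arbitrary
        centroid = -centroid
    return zscore(centroid)
```

For a cluster of shifted copies of one shape, the extracted centroid correlates strongly with that shape.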
(Figure: the input time series before and after alignment; k-Shape's centroid captures the common shape, whereas the k-means centroid does not.)
k-Shape’s Algorithm
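Putting the pieces together, the overall iteration can be sketched as below. Note two simplifications labeled in the comments: the centroid update is a plain cluster mean over z-normalized series rather than the paper's shape-extraction step, and the initialization is deterministic for illustration only:

```python
import numpy as np

def zscore(a):
    sd = a.std()
    return (a - a.mean()) / sd if sd > 0 else a - a.mean()

def sbd_dist(x, y):
    """1 - max normalized cross-correlation over all shifts, via FFT."""
    n = len(x)
    fft_size = 1 << (2 * n - 1).bit_length()
    cc = np.fft.irfft(np.fft.rfft(x, fft_size) * np.conj(np.fft.rfft(y, fft_size)),
                      fft_size)
    cc = np.concatenate((cc[-(n - 1):], cc[:n]))
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return 1 - cc.max() / denom if denom > 0 else 1.0

def k_shape(X, k, iters=20):
    """k-Shape-style loop sketch: SBD assignment + centroid refinement."""
    X = np.array([zscore(np.asarray(x, float)) for x in X])
    centroids = X[:k].copy()          # deterministic init (simplification)
    labels = np.full(len(X), -1)
    for _ in range(iters):
        # assignment step: nearest centroid under SBD
        new = np.array([min(range(k), key=lambda j: sbd_dist(x, centroids[j]))
                        for x in X])
        if np.array_equal(new, labels):
            break
        labels = new
        # refinement step: cluster mean of z-normalized series
        # (the paper uses shape extraction here instead)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = zscore(members.mean(axis=0))
    return labels
```

Because SBD is shift-invariant, shifted copies of the same shape end up in the same cluster even under this simplified centroid update.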
Experimental Settings
Datasets: largest public collection of annotated time-series datasets (UCR Archive [Keogh et al. 2011]); 48 real and synthetic datasets

Baselines for distance measures (Dist):
• Euclidean Distance (ED): efficient yet accurate measure [Faloutsos et al., SIGMOD 1994]
• Dynamic Time Warping (DTW): the currently best-performing, but expensive, distance measure [Keogh, VLDB 2002]
• constrained Dynamic Time Warping (cDTW): constrained version of DTW with improved accuracy and efficiency [Keogh, VLDB 2002]

Baselines for scalable clustering methods:
• k-AVG+Dist: k-means with distance measure Dist
• KSC: k-means with pairwise scaling and shifting in distance and centroid
• k-DBA: k-means with DTW and a suitable centroid computation (i.e., DBA)

Evaluation metrics:
• For runtime: CPU time utilization
• For distance measures: one-nearest-neighbor classification accuracy
• For clustering methods: Rand Index
• For statistical analysis: Wilcoxon test and Friedman test
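For reference, the Rand Index used for clustering evaluation is the fraction of pairs of series on which the ground truth and the predicted clustering agree (both in the same cluster, or both in different clusters); a minimal sketch:

```python
from itertools import combinations

def rand_index(true, pred):
    """Fraction of pairs on which the two clusterings agree."""
    pairs = list(combinations(range(len(true)), 2))
    agree = sum((true[i] == true[j]) == (pred[i] == pred[j]) for i, j in pairs)
    return agree / len(pairs)
```

Note the index is invariant to cluster relabeling: a clustering identical up to label permutation scores 1.0.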
SBD Against State-of-the-Art Distance Measures
Results are relative to ED; runtimes are multiples of ED's runtime, and paired values correspond to the measure's variants (e.g., without / with lower bounding). >, =, <: number of datasets where the measure is more, equally, or less accurate than ED.

Distance Measure                >    =    <    Better   Avg. Accuracy   Runtime
DTW / DTW_LB                    29   2    17   ✔        0.788           15573x / 6040x
cDTW_opt                        31   15   2    ✔        0.814           2873x / 322x
cDTW_5                          34   3    11   ✔        0.809           1558x / 122x
cDTW_10                         33   1    14   ✔        0.804           2940x / 364x
SBD_NoFFT / SBD_NoPow2 / SBD    30   12   6    ✔        0.795           224x / 8.7x / 4.4x

LB subscript: Keogh's lower bounding; 5, 10, and opt superscripts: 5%, 10%, and optimal % of time-series length
• Accuracy:
  • SBD significantly outperforms ED
  • SBD is as competitive as DTW and cDTW
• Efficiency:
  • SBD is one and two orders of magnitude faster than cDTW and DTW, respectively
k-Shape Against Scalable Clustering Methods
Results are relative to k-AVG+ED
Algorithm > = < Better Worse Rand Index Runtime
k-AVG+SBD 32 1 15 ✖ ✖ 0.745 3.6x
k-AVG+DTW 10 0 38 ✖ ✔ 0.584 3444x
KSC 22 0 26 ✖ ✖ 0.636 448x
k-DBA 18 0 30 ✖ ✖ 0.733 3892x
k-Shape+DTW 19 1 28 ✖ ✖ 0.698 4175x
k-Shape 36 1 11 ✔ ✖ 0.772 12.4x
• Accuracy:
  • k-Shape is the only scalable method that outperforms k-AVG+ED
  • k-Shape outperforms both KSC and k-DBA
• Efficiency:
  • k-Shape is one and two orders of magnitude faster than KSC and k-DBA, respectively
Shape-Based Time-Series Clustering: Conclusion
• k-Shape outperforms all but one of the state-of-the-art partitional, hierarchical, and spectral clustering approaches
• That one method achieves similar performance, but it is two orders of magnitude slower than k-Shape, and its distance measure requires tuning, unlike k-Shape's
• Overall, k-Shape is a domain-independent, accurate, and scalable approach for time-series clustering.
Thank you!