SAX: a Novel Symbolic Representation of Time Series
Authors: Jessica Lin, Eamonn Keogh, Li Wei, Stefano Lonardi
Presenter: Arif Bin Hossain
Slides incorporate materials kindly provided by Prof. Eamonn Keogh
Time Series
A time series is a sequence of data points, measured typically at successive times spaced at uniform time intervals. [Wiki]
Examples: economic, sales, and stock market forecasting; EEG, ECG, and BCI analysis
Problems
Join: Given two data collections, link items occurring in each
Annotation: Obtain additional information from given data
Query by content: Given a large data collection, find the k most similar objects to an object of interest.
Clustering: Given an unlabeled dataset, arrange its objects into groups by their mutual similarity
Problems (Cont.)
Classification: Given a labeled training set, classify future unlabeled examples
Anomaly Detection: Given a large collection of objects, find the one that is most different from all the rest.
Motif Finding: Given a large collection of objects, find the pair that is most similar.
Data Mining Constraints
For example, suppose you have one gig of main memory and want to do K-means clustering…
Clustering ¼ gig of data, 100 sec
Clustering ½ gig of data, 200 sec
Clustering 1 gig of data, 400 sec
Clustering 1.1 gigs of data, few hours
Bradley, Fayyad, & Reina: Scaling Clustering Algorithms to Large Databases. KDD 1998: 9-15
Generic Data Mining
• Create an approximation of the data, which will fit in main memory, yet retains the essential features of interest
• Approximately solve the problem at hand in main memory
• Make (hopefully very few) accesses to the original data on disk to confirm the solution
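A minimal Python sketch of these three steps for a nearest-neighbor query, assuming frame means as the in-memory approximation and a lower-bounding distance on them. The names `frame_means`, `lower_bound`, `nearest_neighbor`, and `load_raw` are hypothetical, and `load_raw(i)` merely stands in for a disk read.

```python
from math import inf, sqrt


def frame_means(series, w):
    """In-memory approximation: the mean of each of w equal-sized frames.

    Assumes len(series) is a multiple of w.
    """
    size = len(series) // w
    return [sum(series[i * size:(i + 1) * size]) / size for i in range(w)]


def euclidean(q, c):
    return sqrt(sum((x - y) ** 2 for x, y in zip(q, c)))


def lower_bound(q_means, c_means, n):
    """Distance on the approximations; never exceeds the true Euclidean distance."""
    return sqrt(n / len(q_means)) * euclidean(q_means, c_means)


def nearest_neighbor(query, approximations, load_raw, w):
    """Return (index, distance, disk_reads) for the series closest to query."""
    n = len(query)
    q_means = frame_means(query, w)
    bounds = [lower_bound(q_means, approx, n) for approx in approximations]
    best, best_dist, disk_reads = None, inf, 0
    # Visit candidates from most to least promising according to the bound.
    for i in sorted(range(len(bounds)), key=bounds.__getitem__):
        if bounds[i] >= best_dist:
            break  # no remaining candidate can beat the confirmed best
        dist = euclidean(query, load_raw(i))  # confirm against the original data
        disk_reads += 1
        if dist < best_dist:
            best, best_dist = i, dist
    return best, best_dist, disk_reads


on_disk = [[0.0] * 8, [1.0] * 8, [5.0] * 8]
in_memory = [frame_means(s, 2) for s in on_disk]
result = nearest_neighbor([0.9] * 8, in_memory, on_disk.__getitem__, w=2)
print(result)
```

In this toy run only one of the three original series has to be fetched, because the bounds of the other two already exceed the confirmed distance.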
Some Common Approximations
Why Symbolic Representation?
• Reduce dimension
• Numerosity reduction
• Hashing
• Suffix Trees
• Markov Models
• Stealing ideas from text processing/bioinformatics community
Symbolic Aggregate ApproXimation (SAX)
• Lower bounding of Euclidean distance
• Lower bounding of the DTW distance
• Dimensionality Reduction
• Numerosity Reduction
baabccbc
SAX
Allows a time series of arbitrary length n to be reduced to a string of arbitrary length w (w<<n)
Notations
C: A time series C = c1, …, cn
Ć: A Piecewise Aggregate Approximation of a time series, Ć = ć1, …, ćw
Ĉ: A symbolic representation of a time series, Ĉ = ĉ1, …, ĉw
w: Number of PAA segments representing C
a: Alphabet size
How to obtain SAX?
Step 1: Reduce dimension by PAA
Time series C of length n can be represented in a w-dimensional space by a vector Ć = ć1, …, ćw
The ith element is calculated by
$$\acute{c}_i = \frac{w}{n} \sum_{j=\frac{n}{w}(i-1)+1}^{\frac{n}{w}i} c_j$$
Reduce dimension from 20 to 5. The 2nd element will be
$$\acute{c}_2 = \frac{5}{20} \sum_{j=5}^{8} c_j$$
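The PAA step can be sketched in a few lines of Python. This is an illustrative version that assumes n is a multiple of w, as in the 20-to-5 example; the function name `paa` is an assumption.

```python
def paa(series, w):
    """Piecewise Aggregate Approximation: the mean of each of w equal-sized frames."""
    n = len(series)
    if n % w != 0:
        raise ValueError("this sketch assumes n is a multiple of w")
    frame = n // w  # number of points per frame (n / w)
    return [sum(series[i * frame:(i + 1) * frame]) / frame for i in range(w)]


c = list(range(1, 21))  # c_1 .. c_20 take the values 1 .. 20
reduced = paa(c, 5)
print(reduced)  # the 2nd element is the mean of c_5 .. c_8
```

With these values the 2nd element is (5 + 6 + 7 + 8) / 4 = 6.5, matching the sum from j = 5 to 8 scaled by 5/20.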
How to obtain SAX?
Data is divided into w equal-sized frames.
The mean value of the data falling within each frame is calculated.
The vector of these values becomes the PAA.
[Figure: a time series C and its PAA approximation Ć]
How to obtain SAX?
Step 2: Discretization
Normalize Ć; normalized time series are assumed to follow a Gaussian distribution
Determine breakpoints that produce a equal-sized areas under the Gaussian curve, where a is the alphabet size
[Figure: PAA segments of the normalized series mapped to the symbols a, b, c, giving the word baabccbc]
Word length: 8, Alphabet size: 3
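A hedged Python sketch of the discretization step, assuming the PAA values are already normalized and that symbols are the lowercase letters starting at "a". The helper names `breakpoints` and `to_symbols` are assumptions; `statistics.NormalDist` is from the standard library (Python 3.8+).

```python
from bisect import bisect_right
from statistics import NormalDist


def breakpoints(a):
    """The a - 1 cut points that split N(0, 1) into a equal-probability regions."""
    return [NormalDist().inv_cdf(k / a) for k in range(1, a)]


def to_symbols(paa_values, a):
    """Map each (already normalized) PAA value to a letter: lowest region is 'a'."""
    cuts = breakpoints(a)
    # bisect_right counts the cut points <= v, so a value sitting exactly on a
    # breakpoint falls into the higher region (a choice made for this sketch).
    return "".join(chr(ord("a") + bisect_right(cuts, v)) for v in paa_values)


print(breakpoints(3))
print(to_symbols([-1.0, 0.0, 1.0], 3))
```

For an alphabet of size 3 the two breakpoints are roughly -0.43 and 0.43, so values below -0.43 map to a, values above 0.43 map to c, and the rest map to b.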
Distance Measure
Given 2 time series Q and C:
Euclidean distance between the original series
Distance after transforming the subsequences to PAA
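For reference, these two distances are commonly written as below. Here q_i and c_i are the points of Q and C, the accented symbols are their PAA coefficients in the notation used earlier in these slides, and the name DR for the PAA distance is taken from the SAX literature rather than from the slide text.

```latex
D(Q, C) = \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}

DR(\acute{Q}, \acute{C}) = \sqrt{\frac{n}{w}} \, \sqrt{\sum_{i=1}^{w} (\acute{q}_i - \acute{c}_i)^2}
```

The factor sqrt(n/w) compensates for each PAA coefficient standing in for n/w original points.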
Distance Measure
Define MINDIST after transforming to symbolic representation
MINDIST lower bounds the true distance between the original time series
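A sketch of MINDIST in Python, following the usual definition: adjacent or equal symbols contribute zero, and other pairs contribute the gap between the breakpoints that separate them. The function names are assumptions, and the value n = 128 in the usage line is chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def breakpoints(a):
    """The a - 1 cut points that split N(0, 1) into a equal-probability regions."""
    return [NormalDist().inv_cdf(k / a) for k in range(1, a)]


def symbol_dist(s, t, cuts):
    """Smallest possible gap between values that map to symbols s and t."""
    r, c = ord(s) - ord("a"), ord(t) - ord("a")
    if abs(r - c) <= 1:
        return 0.0  # equal or adjacent regions can be arbitrarily close
    return cuts[max(r, c) - 1] - cuts[min(r, c)]


def mindist(word_q, word_c, n, a):
    """Lower bound on the Euclidean distance between the original length-n series."""
    if len(word_q) != len(word_c):
        raise ValueError("words must have the same length")
    w = len(word_q)
    cuts = breakpoints(a)
    total = sum(symbol_dist(s, t, cuts) ** 2 for s, t in zip(word_q, word_c))
    return sqrt(n / w) * sqrt(total)


print(mindist("baabccbc", "babcacca", n=128, a=3))
```

Because only non-adjacent symbol pairs contribute, two words that differ by at most one symbol step at every position have a MINDIST of zero.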
Numerosity Reduction
Subsequences are extracted by a sliding window
Consecutive subsequences are mostly repetitive
Sliding window finds aabbcc
If the next sequence is also aabbcc, just store the position
This optimization depends on the data, but typically yields a reduction factor of 2 or 3
Space shuttle telemetry with subsequence length 32
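One possible reading of this idea in Python: keep a word, together with its window position, only when it differs from the word of the previous window. The function name and the sample words are made up for illustration.

```python
def numerosity_reduce(words):
    """Keep (position, word) only where the word differs from the previous one."""
    kept = []
    for pos, word in enumerate(words):
        if not kept or kept[-1][1] != word:
            kept.append((pos, word))
    return kept


# Words produced by consecutive sliding-window positions (made-up example).
words = ["aabbcc", "aabbcc", "aabbcc", "aabbca", "aabbca", "aabbcc"]
compact = numerosity_reduce(words)
print(compact)
```

Here six windows shrink to three stored entries, and the run lengths can still be recovered from the gaps between stored positions.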
Experimental Validation
Clustering: hierarchical, partitional
Classification: nearest neighbor, decision tree
Motif discovery
Hierarchical Clustering
Sample dataset consists of 3 series each from the decreasing trend, upward shift, and normal classes
Partitional Clustering (k-means)
Assign each point to the one of k clusters whose center is nearest
Each iteration tries to minimize the sum of squared intra-cluster error
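A generic sketch of this assign-then-update loop (Lloyd's algorithm) in plain Python. It is not the experimental setup from the paper; the initial centers are passed in explicitly, and the function name is an assumption.

```python
def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm on equal-length numeric tuples; returns (labels, centers)."""
    def sq_dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to the cluster with the nearest center.
        labels = [min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
                  for p in points]
        # Update step: each center moves to the mean of its assigned points.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centers.append(old)  # an empty cluster keeps its old center
        if new_centers == centers:
            break  # converged: nothing moved
        centers = new_centers
    return labels, centers


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centers = kmeans(points, centers=[points[0], points[2]])
print(labels, centers)
```

Both steps can only lower or keep the sum of squared distances to the assigned centers, which is why the loop settles.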
Nearest Neighbor Classification
SAX beats Euclidean distance due to the smoothing effect of dimensionality reduction
Decision Tree Classification
Since decision trees are expensive to use with high-dimensional datasets, the Regression Tree [Geurts.2001] is a better approach for data mining on time series
Motif Discovery
Implemented the random projection algorithm of Tompa and Buhler [ICMB2001]
Hashes subsequences into buckets using a random subset of their features as a key
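A sketch of a single hashing pass in Python, assuming the "features" are the symbol positions of equal-length SAX words. The function name, `mask_size`, and `seed` are illustrative; the full algorithm repeats this with different random subsets and accumulates the collisions, which is not shown here.

```python
import random
from collections import defaultdict


def random_projection_buckets(words, mask_size, seed=0):
    """Group word indices by the symbols found at a random subset of positions."""
    w = len(words[0])
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(w), mask_size))
    buckets = defaultdict(list)
    for idx, word in enumerate(words):
        key = "".join(word[p] for p in positions)  # the projected word is the hash key
        buckets[key].append(idx)
    return positions, dict(buckets)


words = ["abcd", "abcd", "dcba"]
positions, buckets = random_projection_buckets(words, mask_size=2)
print(positions, buckets)
```

Words that agree on the chosen positions land in the same bucket, so repeated collisions across passes point to candidate motifs.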
New Version: iSAX
Use binary numbers for labeling the words
Different alphabet sizes (cardinalities) within a word
Comparison of words with different cardinalities
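A simplified Python illustration of how binary labels make mixed cardinalities comparable, assuming each symbol is stored as a (value, bits) pair and that a lower-cardinality label is the leading bits of the higher-cardinality one. Dropping trailing bits to reach a shared cardinality is a shortcut chosen for this sketch, not necessarily the comparison procedure iSAX itself uses; all names are hypothetical.

```python
def demote(value, bits, target_bits):
    """Express a symbol of cardinality 2**bits at cardinality 2**target_bits."""
    if target_bits > bits:
        raise ValueError("cannot add resolution that was never stored")
    return value >> (bits - target_bits)  # drop the trailing bits


def same_region(sym_a, sym_b):
    """True if two (value, bits) symbols agree at their shared, lower cardinality."""
    (va, ba), (vb, bb) = sym_a, sym_b
    common = min(ba, bb)
    return demote(va, ba, common) == demote(vb, bb, common)


word_hi = [(0b110, 3), (0b001, 3)]  # both symbols at cardinality 8
word_lo = [(0b11, 2), (0b0, 1)]     # cardinalities 4 and 2 within one word
match = all(same_region(x, y) for x, y in zip(word_hi, word_lo))
print(match)
```

In this example 110 reduces to 11 and 001 reduces to 0, so the two words agree at the coarser resolution.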
Thank you
Questions?