subsequence matching in time series databases xiaojin xu 04-25-2006

Subsequence Matching in Time Series Databases

Xiaojin Xu

04-25-2006

2

Papers

• Online Event-driven Subsequence Matching over Financial Data Streams
– Huanmei Wu, Betty Salzberg, Donghui Zhang

• Fast Subsequence Matching in Time-Series Databases
– C. Faloutsos, M. Ranganathan, Y. Manolopoulos

3

Challenges of Subsequence Matching over Financial Data Streams

• Existing techniques of subsequence matching
– Mainly focus on discovering the similarity between an online querying subsequence and a traditional database
– Queried data are static

• Subsequence similarities of financial data streams
– Data change constantly; single-pass search is required
– Movement can be predicted by observing a repetitive pattern of waves (zigzag shapes)
– The relative position of the upper and lower end points is important in subsequence similarity
– Subsequence similarity should be flexible with regard to time shifting and scaling, amplitude rescaling, …

4

Our online event-driven subsequence matching meets the requirements of financial data analysis

• Database is a dynamic stream database which stores recent financial data.

• 3-tier online segmentation and pruning

• Similarity measure: distance function is defined based on a permutation of the subsequence

• Event-driven matching over an up-to-date database: a query is carried out only when there is a new end point

• A new definition of trend for financial data stream

5

Processing Online Data Stream

Translating massive data streams into manageable data for database before matching

• Aggregation and smoothing
• Piecewise linear representation
• Online segmentation and pruning

6

Aggregation and Smoothing

• One unique value for each time instance over a fixed time interval

• Use p-interval moving average to filter out noise and generate a clean trend signal

– X(i) is the value for i = 1, 2, ..., n

– n is the number of periods.
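The moving-average formula itself did not survive on this slide. As a minimal sketch, assuming a simple (unweighted) average over the last p values, the smoothing could be computed online like this; the function name `moving_average` is illustrative only:

```python
from collections import deque


def moving_average(values, p):
    """Yield the p-interval simple moving average once p values have arrived."""
    window = deque(maxlen=p)   # holds the most recent p values
    total = 0.0
    for x in values:
        if len(window) == p:
            total -= window[0]  # oldest value is about to be dropped
        window.append(x)
        total += x
        if len(window) == p:
            yield total / p


smoothed = list(moving_average([1, 2, 3, 4, 5], p=3))
```

The running total keeps the update constant-time per arriving value, which suits a single-pass stream.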

7

Piecewise Linear Representation (PLR)

• Segment over Bollinger Band Percent (%b)

• %b indicator
middle_band = p-period moving average
upper_band = middle_band + 2 * p-period standard deviation
lower_band = middle_band - 2 * p-period standard deviation
%b = (close price - lower_band) / (upper_band - lower_band)

• Advantages of %b indicator
– Smoothed moving trend similar to the price movement
– Normalized value of the real price
– Sensitive to price change
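The band definitions above can be sketched as follows. This is an illustration under assumptions: the population standard deviation is used (the slide does not say which variant), and a flat window, where both bands coincide, is mapped to 0.5 purely as this sketch's convention.

```python
from statistics import mean, pstdev


def percent_b(closes, p):
    """Return %b for every position that has a full p-period window behind it."""
    out = []
    for end in range(p, len(closes) + 1):
        window = closes[end - p:end]
        middle = mean(window)                 # middle_band
        width = 2 * pstdev(window)            # 2 * p-period standard deviation
        upper, lower = middle + width, middle - width
        if upper == lower:
            out.append(0.5)                   # assumption: flat window sits mid-band
        else:
            out.append((closes[end - 1] - lower) / (upper - lower))
    return out
```

A close equal to the moving average gives %b = 0.5; values above 1 or below 0 mean the price is outside the bands.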

8

Segmentation

• Use a sliding window which
– Can contain at most m points
– Begins after the last identified end point and ends right before the current point
– Contains only the last m points if there are more than m points

• Segmentation over %b finds possible upper or lower end points in the current sliding window

• Current point is Pj(Xj, tj); the upper point Pi(Xi, ti) is a point in the sliding window that satisfies:
1. Xi = max(X values of current sliding window)
2. Xi > Xj + δ (δ is the given error threshold)
3. Pi(Xi, ti) is the last one satisfying the above two conditions
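The three conditions for an upper end point can be sketched as below. The function name and the list-of-tuples window are assumptions made for illustration; a lower end point would be found symmetrically with min and `Xi < Xj - δ`.

```python
def find_upper_end_point(window, current_x, delta):
    """Return the upper end point (x, t) in the sliding window, or None.

    window: (x, t) pairs after the last end point, oldest first, at most m points.
    current_x: %b value of the current point Pj.
    delta: error threshold.
    """
    if not window:
        return None
    peak = max(x for x, _ in window)          # condition 1
    if peak <= current_x + delta:             # condition 2 fails
        return None
    for x, t in reversed(window):             # condition 3: last point at the peak
        if x == peak:
            return (x, t)
```

For example, with window values 0.5, 0.9, 0.9, 0.7 at times 1 to 4, a current value of 0.6 and δ = 0.2, the point at time 3 is reported.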

9

Segmentation (Cont’d)

10

Pruning

• Purpose: smoothing over recently identified end points

• Two steps
– Filter: pruning on %b
– Refinement: pruning on raw data stream

• Pruning rule: if the absolute %b or raw data values of two adjacent end points differ by less than a certain value, that line segment should be removed.

11

Pruning (Cont’d)

12

Online segmentation and pruning

• Whenever an upper/lower point is identified, the previous line segment is checked for pruning

• First check the need for pruning on %b

• If pruning on %b is done, no pruning on raw data is done. The system waits for the next stream data to come in

• If no pruning on %b is done, the same line segment is checked for pruning on raw data

• Which point is kept after pruning?
– Compare the last end point with the third last end point. If they are upper points, the one with the larger value is kept. Otherwise, the point with the smaller value is kept.
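One possible reading of this procedure is sketched below. Several details are assumptions rather than facts from the slides: end points are dicts with hypothetical keys `pb`, `raw` and `kind`; the segment checked is the one between the second last and the last end point; and the surviving point is chosen by comparing the same quantity (%b or raw value) that triggered the pruning.

```python
def prune_step(end_points, pb_threshold, raw_threshold):
    """Check the most recent line segment and return the (possibly pruned) list.

    end_points: dicts with 'pb', 'raw' and 'kind' ('upper' or 'lower'), oldest first.
    """
    if len(end_points) < 3:
        return end_points
    third_last, second_last, last = end_points[-3:]
    # Filter on %b first; refine on raw data only if %b did not prune.
    for key, threshold in (("pb", pb_threshold), ("raw", raw_threshold)):
        if abs(last[key] - second_last[key]) < threshold:
            # Segment removed: last and third_last have the same kind,
            # so keep the more extreme of the two.
            pick = max if last["kind"] == "upper" else min
            keep = pick(third_last, last, key=lambda e: e[key])
            return end_points[:-3] + [keep]
    return end_points
```

Only the last three end points are inspected, which matches the observation that at most three end points need to be kept in memory.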

13

Online segmentation and pruning

14

Online segmentation and pruning

• Strategy of identifying end points
– a smaller threshold δs for segmentation over %b, to ensure sensitivity and reduce delay
– a larger threshold δp^b for pruning over %b, to filter out noise
– a separate threshold δp^d for pruning over raw stream data

• The online segmentation and pruning are running simultaneously.

• At most three end points need to be kept for segmentation and pruning procedure

• All the fixed end points are updated into the database in real time

15

Permutation

• Subsequence matching
– Find the subsequences of end points that are similar to the query subsequence

• Permutation
– Stream of end points S = {(X1, t1), (X2, t2), …, (Xn, tn)}, divided into two subsets of upper and lower end points respectively, gives S’
– S’ = {[(X1, t1), (X3, t3), …, (Xn-1, tn-1)], [(X2, t2), (X4, t4), …, (Xn, tn)]}; sorting the X values of each subset gives S”
– S” = {[Xi1, Xi3, …, Xin-1], [Xi2, Xi4, …, Xin]}, where Xi1 ≤ Xi3 ≤ … ≤ Xin-1 and Xi2 ≤ Xi4 ≤ … ≤ Xin
– {i1, i3, …, in-1, i2, i4, …, in} is the permutation of S
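A minimal sketch of the permutation, assuming the end points alternate so that odd positions form one subset and even positions the other (function name is illustrative):

```python
def permutation(xs):
    """Return the 1-based index permutation of a sequence of end-point values.

    Odd positions (1, 3, ...) and even positions (2, 4, ...) are sorted
    separately by value; the two index lists are then concatenated.
    """
    odd = sorted(range(0, len(xs), 2), key=lambda i: xs[i])
    even = sorted(range(1, len(xs), 2), key=lambda i: xs[i])
    return tuple(i + 1 for i in odd + even)


original = [5, 1, 7, 2, 6, 0]
rescaled = [2 * x + 10 for x in original]   # amplitude rescaling and shift
```

Because only the relative order of the values matters, `original` and `rescaled` yield the same permutation, which is the flexibility the similarity measure relies on.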

16

Subsequence Similarity

• Definition:
S = {(X1, t1), (X2, t2), …, (Xn, tn)},
S’ = {(X1’, t1’), (X2’, t2’), …, (Xn’, tn’)},
S and S’ are similar if two conditions are satisfied:
(1) S and S’ have the same permutation
(2) d(S, S’) < γ, where d is the distance function

• α, β, and γ are user-defined parameters, all ≥ 0

• Permutation provides flexibility of time scaling and amplitude rescaling

17

Event driven subsequence match

• Stream data are massive and arrive in real time. Doing a similarity search only after a fixed time period may lose potentially important information

• Event: a new potential end point is identified and no pruning is needed.

• Event-driven subsequence match
– Performs subsequence similarity search automatically only when there is a new event
– Generated query subsequence is the most recent n fixed and potential end points

• Advantage: can reduce the huge computation burden while maintaining sensitivity to changes

18

Application: Trend Prediction

• Trend of an end point: tendency of the raw stream after k end points from the current end point E, where Ek is the k-th end point after E (ε is a user-defined parameter)
If Ek.X ≥ E.X + ε, E.trend = UP
If Ek.X ≤ E.X - ε, E.trend = DOWN
If E.X - ε < Ek.X < E.X + ε, E.trend = NOTREND
If Ek does not exist, E.trend = UNDEFINED

• Predict trend of query event
Subsequence similarity search returns a list of retrieved end points
F(D) = (# of retrieved end points with trend D) / (total # of retrieved end points) × 100%
if |F(UP) – F(DOWN)| < F(NOTREND) + λ, predict NOTREND;
else if F(UP) > F(DOWN), predict UP;
else predict DOWN
(λ is a user-defined threshold)
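The labelling and voting rules above can be sketched as follows. The function names are illustrative, λ is taken to be in percentage points to match F(D), and returning UNDEFINED for an empty result list is this sketch's own choice, not something stated on the slide.

```python
from collections import Counter


def label_trend(xs, i, k, eps):
    """Trend of the end point at index i, judged against the end point k later."""
    if i + k >= len(xs):
        return "UNDEFINED"
    if xs[i + k] >= xs[i] + eps:
        return "UP"
    if xs[i + k] <= xs[i] - eps:
        return "DOWN"
    return "NOTREND"


def predict(trends, lam):
    """Predict the query trend from the trend labels of the retrieved end points."""
    if not trends:
        return "UNDEFINED"                      # assumption: nothing retrieved
    counts = Counter(trends)
    f = {d: 100.0 * counts[d] / len(trends) for d in ("UP", "DOWN", "NOTREND")}
    if abs(f["UP"] - f["DOWN"]) < f["NOTREND"] + lam:
        return "NOTREND"
    return "UP" if f["UP"] > f["DOWN"] else "DOWN"
```

With 8 UP, 1 DOWN and 1 NOTREND among the retrieved end points and λ = 10, the gap of 70 points exceeds 10 + 10, so the prediction is UP.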

19

Conclusion

• The online simultaneous segmentation and pruning algorithm for PLR achieves quick identification of new end points yet maintains accurate segmentation

• The new similarity measure, a permutation combined with a distance function, performs better than measures based on Euclidean distance

• Experiments demonstrated that event-driven search outperformed the searches with any fixed time period.

20

Fast Subsequence Matching in Time-Series Databases

• Whole matching
– Given N data sequences S1, S2, …, SN and a query sequence Q, find those sequences that are within distance ε from Q. Si and Q have the same length.

• Subsequence matching
– Given N data sequences S1, S2, …, SN of arbitrary lengths, a query sequence Q and a tolerance ε, find the data sequences Si that contain matching subsequences (with distance < ε from Q)

21

Whole matching

• Use a distance-preserving transform (e.g. DFT) to extract f features from sequences

• Map the f features into points in the f-dimensional feature space.

• Use a spatial access method (e.g. R*-tree) to answer range/approximate queries.

• Precondition: data sequences and query sequences all have the same length
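A small sketch of why keeping only the first few DFT coefficients is safe: with the orthonormal DFT, the distance between truncated feature vectors never exceeds the distance between the original sequences, so a range query in feature space cannot miss a true match. The helper names are illustrative, and each feature here is one complex coefficient.

```python
import cmath
import math


def dft_features(seq, f):
    """First f coefficients of the orthonormal DFT of seq."""
    n = len(seq)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(seq))
        / math.sqrt(n)
        for k in range(f)
    ]


def distance(a, b):
    """Euclidean distance between two equal-length real or complex sequences."""
    return math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(a, b)))


s = [1, 2, 3, 4, 5, 6, 7, 8]
q = [2, 2, 4, 4, 5, 7, 7, 9]
d_time = distance(s, q)
d_feat = distance(dft_features(s, 2), dft_features(q, 2))
```

Here `d_feat <= d_time` holds, and using all 8 coefficients reproduces `d_time` up to rounding.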

22

Defined Subsequence Matching

• Given N data sequences of real numbers S1, S2, …, SN of potentially different lengths

• The user specifies a query subsequence Q of length Len(Q) and the tolerance ε (maximum distance)

• Find quickly all the sequences Si and the correct offsets k, such that the subsequence Si[k : k+Len(Q)-1] matches the query sequence:

D(Q, Si[k : k+Len(Q)-1]) ≤ ε

• Sequential scan is inefficient in terms of space/time overhead

23

ST-index

• Assume the minimum query length is w

• Use a sliding window of size w and place it at every possible position on every data sequence

• Extract the features of the subsequence inside the window for each placement

• A data sequence of length Len(S) is mapped to a trail in feature space

• The trail consists of Len(S)-w+1 points. Each point represents one possible offset of the sliding window
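The mapping from a sequence to its trail can be sketched as below, assuming the features are the first f orthonormal DFT coefficients of each window (the slides name DFT only as an example transform; the helper names are illustrative).

```python
import cmath
import math


def window_features(window, f):
    """First f orthonormal DFT coefficients of one window."""
    n = len(window)
    return tuple(
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(window))
        / math.sqrt(n)
        for k in range(f)
    )


def trail(seq, w, f):
    """One feature point per offset of a length-w sliding window over seq."""
    return [window_features(seq[k:k + w], f) for k in range(len(seq) - w + 1)]


points = trail([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], w=4, f=2)
```

A sequence of length 10 with w = 4 yields 10 - 4 + 1 = 7 trail points, one per window offset.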

24

How to index the trails

• A straightforward way: I-naive
– Keep track of the individual points of each trail and store them in a spatial access method

• Problem
– Storing the individual points of the trail in an R*-tree is inefficient in space and speed
– Almost every point in a data sequence corresponds to a point in the f-dimensional feature space, a 1:f increase in storage.

25

MBR

• Divide the trail into sub-trails. Each sub-trail is represented with its minimum bounding (hyper-)rectangle (MBR).

• Only a few MBRs need to be stored.

• When a query arrives, retrieve all the MBRs that intersect the query region.

• Some false alarms are included (their MBRs intersect the query region, but the sub-trails do not)

• MBRs belonging to the same trail may overlap

26

MBR (Cont’d)

• Information of MBR
– tstart, tend: offsets of first and last positionings
– sequence_id: unique identifier of the data sequence
– (F1low, F1high, F2low, F2high, …): extent of the MBR

27

MBR (Cont’d)

• Group the MBRs to form MBRs at a higher level

• Non-leaf nodes do not store sequence_id or offsets

28

Insertion – How to divide trails into sub-trails

• I-fixed method
– Sub-trail size is a fixed number or a simple function of Len(S)
– Resulting MBRs are not good.

29

I-adaptive method

• Goal: adapt to the distribution of points of the trail

• Cost function
– L = (L1, L2, …, Ln): sides of the n-dimensional MBR of a node in an R-tree
– Average number of disk accesses DA(L)

• Marginal cost of each point in a sub-trail of k points with MBR sides L
– mc = DA(L)/k

30

I-adaptive method: Algorithm

• Assign the first point of the trail to a trivial sub-trail

• FOR each successive point
– IF it increases the marginal cost of the current sub-trail
– THEN start another sub-trail
– ELSE include it in the current sub-trail
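The greedy loop can be sketched as follows. The slides do not give the form of DA(L); this sketch assumes the product form prod(L_i + 0.5) for feature values normalized to the unit hypercube, and that choice should be treated as an assumption. Trail points are plain tuples of real feature values.

```python
from math import prod


def marginal_cost(points):
    """DA(L)/k for the MBR of k points, with DA(L) assumed to be prod(L_i + 0.5)."""
    dims = range(len(points[0]))
    sides = [max(p[d] for p in points) - min(p[d] for p in points) for d in dims]
    return prod(side + 0.5 for side in sides) / len(points)


def split_trail(points):
    """Greedily divide a trail into sub-trails following the I-adaptive rule."""
    sub_trails = [[points[0]]]                 # first point forms a trivial sub-trail
    for point in points[1:]:
        current = sub_trails[-1]
        if marginal_cost(current + [point]) > marginal_cost(current):
            sub_trails.append([point])         # point would raise the marginal cost
        else:
            current.append(point)
    return sub_trails


pieces = split_trail([(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 5.0)])
```

In this toy trail the jump to (5.0, 5.0) would inflate the bounding rectangle, so a new sub-trail starts there and the result is two sub-trails of 3 and 2 points.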

31

Searching : Len(Q) = w

• Q is mapped to a point qf in feature space; the query corresponds to a sphere in feature space with center qf and radius ε

• Retrieve the sub-trails whose MBRs intersect the query region using our index

• Examine the corresponding subsequences of the data sequences to discard the false alarms

32

Searching : Len(Q) = pw

• If Q and S agree within tolerance ε, then at least one of the pairs (si, qi) of corresponding subsequences agrees within tolerance ε/√p

• Q is broken into p sub-queries, which correspond to p spheres in feature space, each with radius ε/√p

• Retrieve the sub-trails whose MBRs intersect at least one sub-query region using ST-index

• Examine the corresponding subsequences of the data sequences to discard the false alarms
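The ε/√p radius can be justified with a short derivation, sketched here under the assumption that D is the Euclidean distance and that Q and S are cut into p disjoint pieces q_i, s_i of length w:

```latex
D^2(Q, S) = \sum_{i=1}^{p} D^2(q_i, s_i) \le \varepsilon^2
\;\Longrightarrow\;
\min_{1 \le i \le p} D^2(q_i, s_i)
  \le \frac{1}{p} \sum_{i=1}^{p} D^2(q_i, s_i)
  \le \frac{\varepsilon^2}{p}
\;\Longrightarrow\;
\min_{1 \le i \le p} D(q_i, s_i) \le \frac{\varepsilon}{\sqrt{p}}
```

The smallest of p squared distances cannot exceed their average, so at least one piece must fall inside its sub-query sphere, which is why searching the p spheres causes no false dismissals.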

33

Conclusion

• Designed a method that efficiently handles approximate queries for subsequence matching

• It fulfills the following requirements:
– Fast: experimental results showed it achieves orders of magnitude savings over sequential scanning
– Requires small space overhead
– Dynamic
– Correct: no false dismissals

Thank you!
