time-series data analysis
DESCRIPTION
Time-series data analysis. Why deal with sequential data?. Because all data is sequential All data items arrive in the data store in some order Examples transaction data documents and words In some (or many) cases the order does not matter In many cases the order is of interest. - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Time-series data analysis](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816820550346895dddafc1/html5/thumbnails/1.jpg)
Time-series data analysis
![Page 2: Time-series data analysis](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816820550346895dddafc1/html5/thumbnails/2.jpg)
Why deal with sequential data?• Because all data is sequential
• All data items arrive in the data store in some order
• Examples– transaction data– documents and words
• In some (or many) cases the order does not matter
• In many cases the order is of interest
![Page 3: Time-series data analysis](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816820550346895dddafc1/html5/thumbnails/3.jpg)
Time-series data
Financial time series, process monitoring…
![Page 4: Time-series data analysis](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816820550346895dddafc1/html5/thumbnails/4.jpg)
Questions
• What is the structure of sequential data?
• Can we represent this structure compactly and accurately?
![Page 5: Time-series data analysis](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816820550346895dddafc1/html5/thumbnails/5.jpg)
Sequence segmentation• Gives an accurate representation of the structure of sequential data
• How?
– By trying to find homogeneous segments
• Segmentation question:
• Can a sequence T={t1,t2,…,tn} be described as a concatenation of subsequences S1,S2,…,Sk such that each Si is in some sense homogeneous?
• The corresponding notion of segmentation in unordered data is clustering
![Page 6: Time-series data analysis](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816820550346895dddafc1/html5/thumbnails/6.jpg)
Dynamic-programming algorithm• Sequence T, length n, k segments, cost function E(), table M• For i=1 to n
– Set M[1,i]=E(T[1…i]) //Everything in one cluster• For j=1 to k
– Set M[j,j] = 0 //each point in its own cluster• For j=2 to k
– For i=j+1 to n• Set M[j,i] = mini’<i{M[j-1,i]+E(T[i’+1…i])}
• To recover the actual segmentation (not just the optimal cost) store also the minimizing values i’
• Takes time O(n2k), space O(kn)
![Page 7: Time-series data analysis](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816820550346895dddafc1/html5/thumbnails/7.jpg)
Example
t
Rt
R
![Page 8: Time-series data analysis](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816820550346895dddafc1/html5/thumbnails/8.jpg)
Basic definitions
• Sequence T = {t1,t2,…,tn}: an ordered set of n d-dimensional real points tiЄRd
• A k-segmentation S: a partition of T into k contiguous segments {s1,s2,…,sk}
– Each segment sЄS is represented by a single value μsЄRd (the representative of the segment)
• Error Ep(S): The error of replacing individual points with representatives p
Ss st
psp tSE
1
||)(
![Page 9: Time-series data analysis](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816820550346895dddafc1/html5/thumbnails/9.jpg)
The k-segmentation problem
• Common cases for the error function
Ep: p = 1 and p = 2.
– When p = 1, the best μs corresponds the median of the points in segment s.
– When p = 2, the best μs corresponds to the mean of the points in segment s.
Given a sequence T of length n and a value k, find a k-segmentation S = {s1, s2, …,sk} of T such that the Ep error is minimized.
![Page 10: Time-series data analysis](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816820550346895dddafc1/html5/thumbnails/10.jpg)
Optimal solution for the k-segmentation problem
• Running time O(n2k)– Too expensive for large datasets!
[Bellman’61] The k-segmentation problem can be solved optimally using a standard dynamic-programming algorithm
![Page 11: Time-series data analysis](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816820550346895dddafc1/html5/thumbnails/11.jpg)
Heuristics• Bottom-up greedy (BU): O(nlogn)
– [Keogh and Smyth’97, Keogh and Pazzani’98]
• Top-down greedy (TD): O(nlogn)– [Douglas and Peucker’73, Shatkay and Zdonik’96, Lavrenko et. al’00]
• Global Iterative Replacement (GiR): O(nI)– [Himberg et. al ’01]
• Local Iterative Replacement (LiR): O(nI)– [Himberg et. al ’01]
![Page 12: Time-series data analysis](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816820550346895dddafc1/html5/thumbnails/12.jpg)
Approximation algorithm
[Theorem] The segmentation problem can be approximated within a constant factor of 3 for both E1 and E2 error measures. That is,
2,1 )(3)( pSESE OPTpDnSp
The running time of the approximation algorithm is:
)( 3/53/4 knO
![Page 13: Time-series data analysis](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816820550346895dddafc1/html5/thumbnails/13.jpg)
Divide ’n Segment (DnS) algorithm• Main idea
– Split the sequence arbitrarily into subsequences– Solve the k-segmentation problem in each subsequence– Combine the results
• Advantages
– Extremely simple– High quality results– Can be applied to other segmentation problems[Gionis’03,
Haiminen’04,Bingham’06]
![Page 14: Time-series data analysis](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816820550346895dddafc1/html5/thumbnails/14.jpg)
DnS algorithm - Details
Input: Sequence T, integer kOutput: a k-segmentation of T1. Partition sequence T arbitrarily into m disjoint intervals T1,T2,…,Tm
2. For each interval Ti solve optimally the k- segmentation problem using DP algorithm
3. Let T’ be the concatenation of mk representatives produced in Step 2. Each representative is weighted with the length of the segment it represents
4. Solve optimally the k-segmentation problem for T’ using the DP algorithm and output this segmentation as the final segmentation
![Page 15: Time-series data analysis](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816820550346895dddafc1/html5/thumbnails/15.jpg)
The DnS algorithm
Input sequence T consisting of n=20 points (k=2)
102030405060708090
100
![Page 16: Time-series data analysis](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816820550346895dddafc1/html5/thumbnails/16.jpg)
The DnS algorithm – Step 1Partition the sequence into m=3 disjoint intervals
102030405060708090
100
T1
T2
T3
![Page 17: Time-series data analysis](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816820550346895dddafc1/html5/thumbnails/17.jpg)
The DnS algorithm – Step 2
102030405060708090
100
T1
T2
T3
Solve optimally the k-segmentation problem into each partition (k=2)
![Page 18: Time-series data analysis](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816820550346895dddafc1/html5/thumbnails/18.jpg)
The DnS algorithm – Step 2
102030405060708090
100
T1
T2
T3
Solve optimally the k-segmentation problem into each partition (k=2)
![Page 19: Time-series data analysis](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816820550346895dddafc1/html5/thumbnails/19.jpg)
The DnS algorithm – Step 3
102030405060708090
100
Sequence T’ consisting of mk=6 representantives
w=8w=4
w=4
w=1
w=1
w=2
![Page 20: Time-series data analysis](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816820550346895dddafc1/html5/thumbnails/20.jpg)
The DnS algorithm – Step 4
102030405060708090
100
Solve k-segmentation on T’ (k=2)
w=8w=4
w=4
w=1
w=1
w=2
![Page 21: Time-series data analysis](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816820550346895dddafc1/html5/thumbnails/21.jpg)
Running time
• In the case of equipartition in Step 1, the running time of the algorithm as a function of m is:
Running time 3/53/40 2)( knmR
kmkkmnmmR 2
2
)()(
The function R(m) is minimized for32
0
knm
![Page 22: Time-series data analysis](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816820550346895dddafc1/html5/thumbnails/22.jpg)
The segmentation error
[Theorem] The segmentation error of the DnS algorithms is at most three times the error of the optimal (DP) algorithm for both E1 and E2 error measures.
2,1 )(3)( pSESE OPTpDnSp
![Page 23: Time-series data analysis](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816820550346895dddafc1/html5/thumbnails/23.jpg)
Proof for E1
– λt: the representative of point t in the optimal segmentation – τ: the representative of point t in the segmentation of Step 2
102030405060708090
100
Tt Tt
ttdtd ),(),( 11 Lemma :
![Page 24: Time-series data analysis](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816820550346895dddafc1/html5/thumbnails/24.jpg)
Proof
Tt
tDnS tdSE ),()( 11
Tt
tdtd ),(),( 11
Tt
tdtd ),(),( 11
Tt
ttdtdtd ),(),(),( 111
Tt
tTt
t tdtd ),(),(2 11
)(3 OPTSE
(triangle inequality)
(optimality of DP)
(triangle inequality)
(Lemma)
(triangle inequality)
(optimality of DP)
(triangle inequality)
(Lemma)
Tt Tt
ttdtd ),(),( 11 Lemma :
λt: the representative of point t in the optimal segmentation τ: the representative of point t in the segmentation of Step 2 μt: the representative of point t in the final segmentation in Step 4
![Page 25: Time-series data analysis](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816820550346895dddafc1/html5/thumbnails/25.jpg)
Trading speed for accuracy• Recursively divide (into m pieces) and segment
• If χ=(ni)1/2, where ni the length of the sequence in the i-th recursive level (n1=n) then
– running time of the algorithm is O(nloglogn)
– the segmentation error is at most O(logn) worse than the optimal
• If χ =const, the running time of the algorithm is O(n), but there are no guarantees for the segmentation error
![Page 26: Time-series data analysis](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816820550346895dddafc1/html5/thumbnails/26.jpg)
Real datasets – DnS algorithm
![Page 27: Time-series data analysis](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816820550346895dddafc1/html5/thumbnails/27.jpg)
Real datasets – DnS algorithm
![Page 28: Time-series data analysis](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816820550346895dddafc1/html5/thumbnails/28.jpg)
Speed vs. accuracy in practice