![Page 1: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/1.jpg)
High Performance Discovery from Time Series Streams
Dennis Shasha
Joint work with Yunyue Zhu
yunyue@cs.nyu.edu shasha@cs.nyu.edu
Courant Institute, New York University
![Page 2: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/2.jpg)
Overall Outline
• Data mining – both classical and activist
• Algorithmic tools for time series
• Surprise.
![Page 3: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/3.jpg)
Goal of this work
• Time series are important in so many applications – biology, medicine, finance, music, physics, …
• A few fundamental operations occur all the time: burst detection, correlation, pattern matching.
• Do them fast to make data exploration faster, real time, and more fun.
• Extend functionality for music and science.
![Page 4: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/4.jpg)
StatStream (VLDB,2002): Example
• Stock price streams
– The New York Stock Exchange (NYSE)
– 50,000 securities (streams); 100,000 ticks (trade and quote)
• Pairs Trading, a.k.a. Correlation Trading
• Query: "Which pairs of stocks were correlated with a value of over 0.9 for the last three hours?"
XYZ and ABC have been correlated with a correlation of 0.95 for the last three hours. Now XYZ and ABC become less correlated as XYZ goes up and ABC goes down. They should converge back later. I will sell XYZ and buy ABC …
![Page 5: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/5.jpg)
Online Detection of High Correlation
• Given tens of thousands of high-speed time series data streams, detect high-value correlation, including synchronized and time-lagged, over sliding windows in real time.
• Real time
– high update frequency of the data stream
– fixed response time, online
[Figure: streams annotated "Correlated!"]
![Page 6: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/6.jpg)
![Page 7: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/7.jpg)
![Page 8: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/8.jpg)
StatStream: Algorithm
• Naive algorithm
– N : number of streams
– w : size of sliding window
– space O(N) and time O(N^2 w) vs. space O(N^2) and time O(N^2).
• Suppose that the streams are updated every second.
– With a Pentium 4 PC, the exact computing method can only monitor 700 streams with a delay of 2 minutes.
• Our approach
– Use the Discrete Fourier Transform to approximate correlation
– Use a grid structure to filter out unlikely pairs
– Our approach can monitor 10,000 streams with a delay of 2 minutes.
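As a point of reference for the naive algorithm above, here is a minimal sketch of the exact all-pairs computation over the last w points of each stream. The function names, the dictionary-of-lists input and the 0.9 default threshold are illustrative assumptions, not code from StatStream.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Exact Pearson correlation of two equal-length sequences."""
    w = len(x)
    mean_x = sum(x) / w
    mean_y = sum(y) / w
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined for a constant window; treated as 0 in this sketch
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(streams, w, threshold=0.9):
    """Check every pair of streams over its last w points: O(N^2 w) work."""
    result = []
    for (name_a, a), (name_b, b) in combinations(streams.items(), 2):
        r = pearson(a[-w:], b[-w:])
        if r > threshold:
            result.append((name_a, name_b, r))
    return result
```

Every new time point would require re-running `correlated_pairs`, which is the cost the DFT digests and the grid structure are meant to avoid.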
![Page 9: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/9.jpg)
StatStream: Stream synoptic data structure
• Three-level time interval hierarchy
– Time point, Basic window, Sliding window
• Basic window (the key to our technique)
– The computation for basic window i must finish by the end of basic window i+1
– The basic window time is the system response time.
• Digests
[Figure: a sliding window divided into basic windows, each made of time points; each basic window has a digest (sum, DFT coefs)]
![Page 10: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/10.jpg)
![Page 11: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/11.jpg)
StatStream: Stream synoptic data structure
[Figure: sliding window digests (sum, DFT coefs) shown together with the basic window digests (sum, DFT coefs)]
![Page 12: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/12.jpg)
![Page 13: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/13.jpg)
![Page 14: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/14.jpg)
Synchronized Correlation Uses Basic Windows
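$$\mathrm{corr}(s, r) = \frac{\frac{1}{w}\sum_{i=1}^{w} s_i r_i - \bar{s}\,\bar{r}}{\sqrt{\frac{1}{w}\sum_{i=1}^{w} (s_i - \bar{s})^2}\,\sqrt{\frac{1}{w}\sum_{i=1}^{w} (r_i - \bar{r})^2}}$$
• Inner-product of aligned basic windows
[Figure: Stream x and Stream y, each with a sliding window divided into aligned basic windows]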
• Inner-product within a sliding window is the sum of the inner-products in all the basic windows in the sliding window.
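The additivity stated above can be sketched as follows: keep five sums per basic window (x, y, x², y², xy) and add them up across the basic windows that make up the sliding window. The class name `PairDigest` and its methods are hypothetical, and this sketch keeps exact sums rather than DFT coefficients.

```python
import math
from collections import deque


class PairDigest:
    """Sliding-window correlation of two streams from per-basic-window sums."""

    def __init__(self, basic_size, num_basic):
        self.basic_size = basic_size
        self.num_basic = num_basic
        # One tuple of sums per basic window; the oldest is dropped automatically.
        self.digests = deque(maxlen=num_basic)
        self.pending = []  # points of the basic window still being filled

    def update(self, x, y):
        self.pending.append((x, y))
        if len(self.pending) == self.basic_size:
            self.digests.append((
                sum(a for a, _ in self.pending),
                sum(b for _, b in self.pending),
                sum(a * a for a, _ in self.pending),
                sum(b * b for _, b in self.pending),
                sum(a * b for a, b in self.pending),
            ))
            self.pending = []

    def correlation(self):
        """Correlation over the sliding window, or None until it is full."""
        if len(self.digests) < self.num_basic:
            return None
        w = self.basic_size * self.num_basic
        sx, sy, sxx, syy, sxy = (sum(col) for col in zip(*self.digests))
        cov = sxy / w - (sx / w) * (sy / w)
        var_x = sxx / w - (sx / w) ** 2
        var_y = syy / w - (sy / w) ** 2
        if var_x <= 0 or var_y <= 0:
            return None  # constant window: correlation undefined
        return cov / math.sqrt(var_x * var_y)
```

The result is only refreshed when a basic window completes, which matches the slide's remark that the basic window time is the system response time.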
![Page 15: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/15.jpg)
Approximate Synchronized Correlation
• Approximate with an orthogonal function family (e.g. DFT)
[Figure: the time series x1 … x8 combined with basis functions f1, f2, f3 (each evaluated at points 1 … 8) to give coefficients c_1^x, c_2^x, c_3^x]
![Page 16: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/16.jpg)
![Page 17: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/17.jpg)
![Page 18: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/18.jpg)
Approximate Synchronized Correlation
• Approximate with an orthogonal function family (e.g. DFT):
$$\sum_{i} f_m(i)\, f_p(i) = \begin{cases} V(m), & m = p \\ 0, & m \neq p \end{cases}$$
• Inner product of the time series ≈ inner product of the digests
• The time and space complexity is reduced from O(b) to O(n).
– b : size of basic window
– n : size of the digests (n << b)
• e.g. 120 time points reduce to 4 digests
[Figure: x1 … x8 reduced to coefficients c_1^x, c_2^x, c_3^x; y1 … y8 reduced to c_1^y, c_2^y, c_3^y]
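To illustrate why the inner product of two digests can stand in for the inner product of the series, here is a small sketch using an orthonormal DFT and only the first n coefficients. The function names are illustrative, the sketch assumes real-valued series and n at most b/2, and the direct O(b·n) transform is written for clarity rather than speed.

```python
import cmath
import math


def dft_digest(x, n):
    """First n coefficients of the orthonormal DFT of a real series x."""
    b = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / b) for t in range(b)) / math.sqrt(b)
        for k in range(n)
    ]


def approx_inner_product(dx, dy):
    """Approximate sum(x*y) from two digests (Parseval, truncated).

    Coefficient 0 is counted once; each other kept coefficient is counted
    twice to account for its conjugate mirror in a real-valued series.
    """
    total = (dx[0] * dy[0].conjugate()).real
    for cx, cy in zip(dx[1:], dy[1:]):
        total += 2 * (cx * cy.conjugate()).real
    return total
```

When both series have all their energy in the kept frequencies the value is exact; otherwise the dropped coefficients account for the approximation error.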
![Page 19: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/19.jpg)
Approximate lagged Correlation
• Inner-product with unaligned windows
• The time complexity is reduced from O(b) to O(n^2), as opposed to O(n) for synchronized correlation. Reason: terms for different frequencies are non-zero in the lagged case.
[Figure: two unaligned sliding windows]
![Page 20: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/20.jpg)
Grid Structure (to avoid checking all pairs)
• The DFT coefficients yield a vector.
• High correlation => closeness in the vector space
– We can use a grid structure and look in the neighborhood; this will return a superset of the highly correlated pairs.
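The neighborhood lookup can be sketched as below: each digest vector is hashed to a grid cell whose side equals the search radius, and only vectors in the same or adjacent cells are reported as candidates. The function name, the dictionary input and the choice of cell side are assumptions made for illustration.

```python
import math
from collections import defaultdict
from itertools import product


def candidate_pairs(vectors, radius):
    """Return pairs of names whose vectors fall in the same or adjacent cells.

    Any two vectors within `radius` of each other differ by at most one cell
    index in every coordinate, so the result is a superset of the close pairs.
    """
    grid = defaultdict(list)
    for name, v in vectors.items():
        cell = tuple(math.floor(c / radius) for c in v)
        grid[cell].append(name)

    pairs = set()
    for cell, names in grid.items():
        for offset in product((-1, 0, 1), repeat=len(cell)):
            neighbor = tuple(c + o for c, o in zip(cell, offset))
            for a in names:
                for b in grid.get(neighbor, ()):
                    if a < b:  # report each unordered pair once
                        pairs.add((a, b))
    return pairs
```

Candidates still need an exact (or digest-based) correlation check afterwards, since the grid only filters out pairs that cannot be close.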
![Page 21: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/21.jpg)
Empirical Study: Speed
[Figure: Comparison of processing time. x-axis: Number of Streams (200 to 1600); y-axis: Wall Clock Time (seconds, 0 to 800); series: Exact, DFT]
Our algorithm is parallelizable.
![Page 22: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/22.jpg)
Empirical Study: Precision
• Approximation errors
– Larger size of digests, larger size of sliding window and smaller size of basic window give better approximation
– The approximation errors are small for the stock data.
[Figure: Average Approximation Error (0 to 0.005) versus Sliding windows (Hours) and Basic windows (Minutes)]
![Page 23: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/23.jpg)
Sketches : Random Projection
• Correlation between time series of the returns of stocks
– Since most stock price time series are close to random walks, their return time series are close to white noise
– DFT/DWT can’t capture approximate white noise series because there is no clear trend (too many frequency components).
• Solution: Sketches (a form of random landmark)
– Sketch pool: matrix of random variables drawn from a stable distribution
– Sketches: the random projection of all time series to lower dimensions by multiplication with the same matrix
– The Euclidean distance (correlation) between time series is approximated by the distance between their sketches with a probabilistic guarantee.
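A minimal sketch of the random projection idea, assuming a Gaussian sketch pool (the normal distribution is one stable distribution; the slide does not say which one was used). The function names, the seed and the 1/sqrt(k) scaling convention are illustrative choices.

```python
import math
import random


def make_sketch_pool(length, k, seed=0):
    """k random vectors of the given length with standard normal entries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(length)] for _ in range(k)]


def sketch(series, pool):
    """Project a series onto every random vector in the shared pool."""
    return [sum(r * v for r, v in zip(row, series)) for row in pool]


def sketch_distance(s1, s2):
    """Estimate the Euclidean distance of the original series from two sketches."""
    k = len(s1)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s1, s2)) / k)
```

All series must be projected with the same pool; the estimate tightens as k grows, which is the probabilistic guarantee the slide alludes to.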
![Page 24: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/24.jpg)
Burst Detection
![Page 25: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/25.jpg)
Burst Detection: Applications
• Discovering intervals with unusually large numbers of events.
– In astrophysics, the sky is constantly observed for high-energy particles. When a particular astrophysical event happens, a shower of high-energy particles arrives in addition to the background noise. Might last milliseconds or days…
– In telecommunications, if the number of packets lost within a certain time period exceeds some threshold, it might indicate some network anomaly. Exact duration is unknown.
– In finance, stocks with unusually high trading volumes should attract the notice of traders (or perhaps regulators).
![Page 26: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/26.jpg)
Bursts across different window sizes in Gamma Rays
• Challenge : to discover not only the time of the burst, but also the duration of the burst.
![Page 27: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/27.jpg)
Elastic Burst Detection: Problem Statement
• Problem: Given a time series of positive numbers x_1, x_2, ..., x_n, and a threshold function f(w), w = 1, 2, ..., n, find the subsequences of any size such that their sums are above the thresholds:
– all 0 < w < n, 0 < m < n-w, such that x_m + x_{m+1} + … + x_{m+w-1} ≥ f(w)
• Brute force search : O(n^2) time
• Our shifted wavelet tree (SWT): O(n+k) time.
– k is the size of the output, i.e. the number of windows with bursts
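For comparison, the brute-force search mentioned above can be sketched with prefix sums: every window size is checked at every start position. The function name, the dictionary of thresholds and the 0-based start indices are illustrative assumptions.

```python
def brute_force_bursts(x, thresholds):
    """Check every window of every listed size.

    thresholds maps a window size w to its threshold f(w).
    Returns (start, w, window_sum) for each window with sum >= f(w).
    """
    prefix = [0]
    for v in x:
        prefix.append(prefix[-1] + v)

    bursts = []
    for w, limit in sorted(thresholds.items()):
        for start in range(len(x) - w + 1):
            total = prefix[start + w] - prefix[start]
            if total >= limit:
                bursts.append((start, w, total))
    return bursts
```

With the series and thresholds from the example slide further on, the only qualifying window is the size-3 window 20 + 3 + 6 = 29 ≥ 26.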
![Page 28: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/28.jpg)
Burst Detection: Data Structure and Algorithm
– Define the threshold for a node of size 2^k to be the threshold for a window of size 1 + 2^(k-1)
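One possible reading of the shifted wavelet tree, as a sketch: level k holds half-overlapping windows of size 2^k, so every window of size at most 1 + 2^(k-1) lies inside some level-k node. A node whose sum reaches the smallest threshold among the window sizes assigned to its level raises an alarm, and only then are the windows inside it searched in detail. The function name, the assignment of sizes to levels and the use of the minimum threshold per level are assumptions of this sketch, and it relies on the values being non-negative.

```python
def swt_bursts(x, thresholds):
    """Elastic burst detection with a shifted-wavelet-tree style filter.

    thresholds maps a window size w to f(w). Returns sorted (start, w) pairs.
    """
    n = len(x)
    prefix = [0]
    for v in x:
        prefix.append(prefix[-1] + v)

    max_w = max(thresholds)
    found = set()
    k = 0
    while True:
        size = 2 ** k                # node size at this level
        shift = max(1, size // 2)    # nodes overlap by half their size
        if k < 2:
            lo = hi = size           # level 0 handles size 1, level 1 size 2
        else:
            lo, hi = size // 4 + 2, size // 2 + 1
        if lo > max_w:
            break
        sizes = [w for w in thresholds if lo <= w <= hi]
        if sizes:
            node_threshold = min(thresholds[w] for w in sizes)
            for offset in range(0, n, shift):
                end = min(n, offset + size)
                if prefix[end] - prefix[offset] < node_threshold:
                    continue         # no burst of these sizes can hide in this node
                # Alarm: search the windows inside this node in detail.
                for w in sizes:
                    for start in range(offset, end - w + 1):
                        if prefix[start + w] - prefix[start] >= thresholds[w]:
                            found.add((start, w))
        k += 1
    return sorted(found)
```

An alarm that finds nothing in the detailed search is a false alarm: it costs extra work but never produces a wrong answer.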
![Page 29: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/29.jpg)
Burst Detection: Example
Time series: 4 5 1 2 20 3 6 4 1 0 9 1 2 1 3 5
[Figure: shifted wavelet tree of window sums built level by level over the time series]
Window Size: 2 3 4 5
Threshold: 24 26 47 50
![Page 30: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/30.jpg)
Burst Detection: Example
Time series: 4 5 1 2 20 3 6 4 1 0 9 1 2 1 3 5
[Figure: the same shifted wavelet tree, with nodes that exceed their thresholds marked True Alarm or False Alarm]
Window Size: 2 3 4 5
Threshold: 24 26 47 50
![Page 31: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/31.jpg)
False Alarms (requires work, but no errors)
[Figure: False Alarm Rates (0 to 0.06) versus T = W/w (1 to 2), for p = 0.000001]
![Page 32: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/32.jpg)
Empirical Study: Gamma Ray Burst
[Figure: Processing time vs. Number of Windows. x-axis: Number of Windows (0 to 50); y-axis: Processing time (ms, 0 to 80000); series: SWT Algorithm, Direct Algorithm]
![Page 33: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/33.jpg)
Extension to other aggregates
• SWT can be used for any aggregate that is monotonic
– SUM, COUNT and MAX are monotonically increasing
• the alarm condition is aggregate > threshold
– MIN is monotonically decreasing
• the alarm condition is aggregate < threshold
– Spread =MAX-MIN
• Application in Finance
– Stock with burst of trading or quote(bid/ask) volume (Hammer!)
– Stock prices with high spread
![Page 34: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/34.jpg)
Empirical Study: Stock Price Spread Burst
[Figure: Processing time vs. Number of Windows. x-axis: Number of Windows (0 to 50); y-axis: Processing time (ms, logarithmic scale from 1 to 1000000); series: SWT Algorithm, Direct Algorithm]
![Page 35: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/35.jpg)
Extension to high dimensions
![Page 36: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/36.jpg)
Elastic Burst in two dimensions
• Population Distribution in the US
![Page 37: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/37.jpg)
How to find the threshold for Elastic Burst?
• Suppose that the moving sum of a time series is a random variable from a normal distribution.
• Let the number of bursts in the time series within sliding window size w be S_o(w) and its expectation be S_e(w).
– S_e(w) can be computed from the historical data.
• Given a threshold probability p, we set the threshold of burst f(w) for window size w such that Pr[S_o(w) ≥ f(w)] ≤ p.
![Page 38: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/38.jpg)
Find threshold for Elastic Bursts
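• Φ(x) is the normal cdf, so symmetric around 0:
$$\Pr[X \le \Phi^{-1}(p)] \le p, \qquad \Pr[X \ge -\Phi^{-1}(p)] \le p$$
• Therefore
$$\Pr\left[\frac{S_o(w) - S_e(w)}{S_e(w)} \ge -\Phi^{-1}(p)\right] \le p$$
$$f(w) = S_e(w)\,\bigl(1 - \Phi^{-1}(p)\bigr)$$
[Figure: the normal cdf Φ(x), with p and Φ^{-1}(p) marked]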
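The threshold formula on this slide, f(w) = S_e(w)(1 − Φ⁻¹(p)), can be evaluated directly with the standard library's inverse normal cdf. The function name and argument names are illustrative; the sketch applies the slide's formula literally and does not estimate S_e(w) from historical data.

```python
from statistics import NormalDist


def burst_threshold(expected_sum, p):
    """f(w) = S_e(w) * (1 - inverse_normal_cdf(p)), for 0 < p < 1.

    expected_sum plays the role of S_e(w); p is the burst probability.
    """
    return expected_sum * (1.0 - NormalDist().inv_cdf(p))
```

For small p the inverse cdf is strongly negative, so the threshold rises well above the expected sum; at p = 0.5 it equals the expected sum.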
![Page 39: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/39.jpg)
Summary
• Able to detect bursts of many different durations in essentially linear time.
• Can be used both for time series and for spatial searching.
• Can specify thresholds either with absolute numbers or with probability of hit.
• Algorithm is simple to implement and has low constants (code is available).
• Ok, it’s embarrassingly simple.
![Page 40: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/40.jpg)
With a Little Help From My Warped Correlation
• Karen’s humming Match:
• Dennis’s humming Match:
• "What would you do if I sang out of tune?"
• Yunyue’s humming Match:
![Page 41: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/41.jpg)
Related Work in Query by Humming
• Traditional method: String Matching [Ghias et al. 95, McNab et al. 97, Uitdenbogerd and Zobel 99]
– Music represented by string of pitch directions: U, D, S (degenerated interval)
– Hum query is segmented to discrete notes, then string of pitch directions
– Edit Distance between hum query and music score
• Problem
– Very hard to segment the hum query
– Partial solution: users are asked to hum articulately
• New Method: matching directly from audio [Mazzoni and Dannenberg 00]
• Problem
– slowed down by DTW
![Page 42: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/42.jpg)
Time Series Representation of Query
• An example hum query
• Note segmentation is hard!
[Figure: pitch values (0 to 70) versus time (0 to 11 seconds) for the hum query, annotated "Segment this!"]
![Page 43: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/43.jpg)
How to deal with poor hum queries?
• No absolute pitch
– Solution: the average pitch is subtracted
• Incorrect tempo
– Solution: Uniform Time Warping
• Inaccurate pitch intervals
– Solution: return the k-nearest neighbors
• Local timing variations
– Solution: Dynamic Time Warping
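The first two fixes above can be sketched in a few lines: subtracting the average pitch removes the dependence on absolute pitch, and stretching or shrinking the whole series to a fixed length removes the dependence on overall tempo. The function names are hypothetical, and the nearest-index resampling used here is just one simple way to do uniform time warping.

```python
def subtract_mean_pitch(pitches):
    """Shift a pitch series so that its average is zero."""
    mean = sum(pitches) / len(pitches)
    return [p - mean for p in pitches]


def uniform_time_warp(series, target_len):
    """Stretch or shrink a series uniformly to target_len points."""
    n = len(series)
    return [series[i * n // target_len] for i in range(target_len)]
```

After both steps a hum query and a candidate melody have the same length and the same average pitch, so they can be compared point by point or with DTW.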
![Page 44: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/44.jpg)
Dynamic Time Warping
• Euclidean distance: sum of point-by-point distance
• DTW distance: allowing stretching or squeezing the time axis locally
Time Series 1
Time Series 2
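The local stretching and squeezing described above is usually computed with a dynamic program over all alignments. The following is a textbook sketch, using absolute difference as the point-to-point cost and no warping constraint; both choices are assumptions, since the slide does not specify them.

```python
def dtw_distance(a, b):
    """Unconstrained DTW distance between two sequences, O(len(a) * len(b))."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a[i-1] aligned again with b[j-1] after b[j-1] matched a[i-2]
                cost[i][j - 1],      # b[j-1] aligned again with a[i-1] after a[i-1] matched b[j-2]
                cost[i - 1][j - 1],  # both series advance together
            )
    return cost[n][m]
```

The quadratic cost of this table is what makes DTW slow on large music databases and motivates the envelope-based lower bounds on the following slides.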
![Page 45: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/45.jpg)
Envelope Transform using Piecewise Aggregate Approximation (PAA) [Keogh VLDB 02]
[Figure: original time series with its upper and lower envelopes and their PAA versions U_Keogh and L_Keogh]
![Page 46: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/46.jpg)
Envelope Transform using Piecewise Aggregate Approximation (PAA)
[Figure: original time series with its upper and lower envelopes, comparing the PAA envelopes U_Keogh and L_Keogh with the tighter U_new and L_new]
• Advantage of tighter envelopes
– Still no false negatives, and fewer false positives
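To make the envelope idea concrete, here is a sketch of the plain (untransformed) envelope and the lower bound usually attributed to Keogh: the upper and lower envelopes of y within r time steps, and the distance from x to that band. It does not include the PAA step or the tighter U_new and L_new envelopes from the slide; the function names and the squared-difference cost are assumptions.

```python
import math


def envelope(y, r):
    """Upper and lower envelopes of y over a window of r steps on each side."""
    n = len(y)
    upper = [max(y[max(0, i - r): i + r + 1]) for i in range(n)]
    lower = [min(y[max(0, i - r): i + r + 1]) for i in range(n)]
    return upper, lower


def lb_keogh(x, y, r):
    """Distance from x to the envelope of y; points inside the band cost nothing."""
    upper, lower = envelope(y, r)
    total = 0.0
    for v, u, l in zip(x, upper, lower):
        if v > u:
            total += (v - u) ** 2
        elif v < l:
            total += (l - v) ** 2
    return math.sqrt(total)
```

A candidate whose bound already exceeds the best distance found so far can be skipped without running DTW; a tighter envelope raises the bound and so prunes more candidates, which is the "fewer false positives" point above.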
![Page 47: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/47.jpg)
Container Invariant Envelope Transform
• Container-invariant: a transformation T for envelope such that
• Theorem: if a transformation is container-invariant and lower-bounding, then the distance between the transformed time series x and the transformed envelope of y lower-bounds their DTW distance.
Feature Space
![Page 48: High Performance Discovery from Time Series Streams Dennis Shasha Joint work with Yunyue Zhu yunyue@cs.nyu.edu shasha@cs.nyu.edu Courant Institute, New](https://reader036.vdocuments.us/reader036/viewer/2022070306/551653045503469d698b4c53/html5/thumbnails/48.jpg)
The Vision
Ability to match time series quickly may open up entire new application areas, e.g. fast reaction to external events, music by humming and so on.
Main problems: accuracy, excessive specification.
Reference (advert): High Performance Discovery in Time Series (Springer 2004)