Efficient Elastic Burst Detection in Data Streams
Yunyue Zhu and Dennis Shasha
Department of Computer Science
Courant Institute of Mathematical Sciences
New York University
SIGKDD 2003
Abstract
Burst detection
  Find abnormal aggregates in data streams.
Sliding window
  In some applications, we want to monitor many sliding window sizes simultaneously.
  Brute force: O(n^2)
  Shifted Wavelet Tree: near-linear time
Problem Statement
For a time series x_1, x_2, …, x_n, we are given a set of window sizes w_1, w_2, …, w_m, an aggregate function F, and a threshold f(w_j), j = 1, 2, …, m, associated with each window size.
Monitoring elastic window aggregates of the time series means finding all subsequences of all the window sizes such that the aggregate applied to the subsequence crosses the threshold for its window size, i.e. all i and j such that F(x_i, …, x_{i+w_j-1}) ≥ f(w_j).
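As a baseline for the problem statement, here is a minimal brute-force sketch. It assumes the aggregate F is a sum and that an alarm means "sum ≥ f(w)"; the function name `brute_force_bursts` and the 0-based start positions are illustrative choices, not taken from the slides.

```python
def brute_force_bursts(x, thresholds):
    """Check every window of every size directly.

    x          : list of numbers (the time series)
    thresholds : dict mapping window size w -> threshold f(w)
    Returns a list of (start, w, total) with 0-based start positions.
    """
    alarms = []
    for w, f_w in sorted(thresholds.items()):
        total = sum(x[:w])  # sum of the first window of size w
        for start in range(len(x) - w + 1):
            if start > 0:
                # Slide the window by one item: add the new item, drop the old one.
                total += x[start + w - 1] - x[start - 1]
            if total >= f_w:
                alarms.append((start, w, total))
    return alarms


print(brute_force_bursts([1, 0, 4, 5, 0, 1], {2: 9, 3: 9}))
```

Every window size is scanned over the whole series, so the work grows with the number of window sizes times the series length. This is the cost that the Shifted Wavelet Tree aims to avoid.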
Wavelet Tree
Haar Wavelet Tree
  Level 0: original time series
  Level 1: pairwise averages and differences of the adjacent data items at level 0
  Level i: pairwise averages and differences of the averages at level i - 1
The wavelet coefficients can represent the trend of the time series.
Wavelet coefficient → aggregate
  Average and difference → sum
Problem: the windows at the same level are non-overlapping.
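The level-by-level definition can be sketched as follows. This is an illustration only: it assumes the series length is a power of two and takes the "difference" of a pair (a, b) to be (a - b) / 2, which is one common Haar convention; the slides do not fix the scaling.

```python
def haar_tree(x):
    """Build Haar averages and differences level by level.

    Assumes len(x) is a power of two.
    Returns (averages, details): averages[0] is the original series,
    averages[i] holds the pairwise averages at level i, and
    details[i - 1] holds the pairwise differences computed at level i.
    """
    averages, details = [list(x)], []
    current = list(x)
    while len(current) > 1:
        pairs = list(zip(current[0::2], current[1::2]))  # non-overlapping pairs
        details.append([(a - b) / 2 for a, b in pairs])
        current = [(a + b) / 2 for a, b in pairs]
        averages.append(current)
    return averages, details


averages, details = haar_tree([2, 4, 6, 8, 10, 12, 14, 16])
# An average at level i times the window size 2**i gives the window sum.
print(averages[3][0] * 8)
```

Because the pairs never overlap, a burst that straddles the boundary between two windows at the same level is split between them, which is the problem noted above.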
Wavelet Tree (cont.)
Shifted Wavelet Tree
Add an additional “line” of windows.
They can be maintained explicitly or implicitly.
Shifted Wavelet Tree (cont.)
Any subsequence of length w, w ≤ 2^i, is included in one of the windows at level i + 1 of the SWT.
We say that windows with size w, 2^(i-1) < w ≤ 2^i, are monitored by level i + 1 of the SWT.
[Figure: SWT windows at Level 3 and Level 4]
SWT Construction
For each level i (i ≥ 1):
  Compute the pairwise aggregate (sum) for each two consecutive data items at level i - 1.
  Downsampling: sample every second item in the series of aggregates → the input for the next higher level in the SWT.
O(n), n: time series length
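The construction steps above can be sketched as follows, assuming a sum aggregate. The function name `build_swt` and the list-of-lists layout are illustrative assumptions; the slides do not prescribe a storage format.

```python
def build_swt(x, num_levels):
    """Build the levels of a Shifted Wavelet Tree for the sum aggregate.

    levels[0] is the original series. For i >= 1, levels[i][k] is the sum of
    the window of size 2**i starting at position k * 2**(i - 1), so windows at
    the same level overlap their neighbours by half.
    """
    levels = [list(x)]
    base = list(x)  # input series for the level being built
    for _ in range(num_levels):
        # Pairwise sum of every two consecutive items of the input series.
        sums = [base[k] + base[k + 1] for k in range(len(base) - 1)]
        levels.append(sums)
        # Downsample: every second aggregate feeds the next higher level.
        base = sums[::2]
    return levels


for level, sums in enumerate(build_swt([1, 2, 3, 4, 5, 6, 7, 8], 3)):
    print(level, sums)
```

Each level does work proportional to the length of its input, and the input halves at every level, so the total stays linear in n.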
Search for a Burst
Given window size w ≤ 2^i and threshold f(w), search in two stages:
1. The potential burst is detected at level i + 1 in the SWT.
2. Detailed search in those subsequences of size 2^i with sum ≥ f(w).
O(k), k: #alarms (output size)
Streaming Algorithm
Assume that new data becomes available at every time unit.
The window sizes satisfy 2^L < w_1 < w_2 < … < w_m < 2^U.
Maintain the levels from L+2 to U+1 of the SWT that monitor those windows.
Two methods: the online algorithm and the batch algorithm.
Streaming Algorithm: Online Algorithm
Whenever a new data item becomes available:
  Update those 2(U - L) aggregates of the windows in the SWT.
  If the aggregate at level i exceeds δ_i, perform a detailed search on those windows monitored by level i.
For level i, threshold δ_i = min f(w_j), 2^(i-2) < w_j ≤ 2^(i-1)
Response time = one time unit
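The per-level threshold δ_i can be computed by grouping the window sizes by the level that monitors them. The sketch below assumes the thresholds are given as a dictionary; the function name `level_thresholds` is illustrative.

```python
def level_thresholds(thresholds):
    """Map each SWT level i to delta_i = min f(w_j) over the window sizes
    it monitors, i.e. those with 2**(i - 2) < w_j <= 2**(i - 1).

    thresholds : dict mapping window size w_j -> f(w_j)
    """
    deltas = {}
    for w, f_w in thresholds.items():
        # (w - 1).bit_length() is the smallest e with w <= 2**e,
        # so the monitoring level is e + 1.
        level = (w - 1).bit_length() + 1
        deltas[level] = min(f_w, deltas.get(level, float("inf")))
    return deltas


print(level_thresholds({5: 30.0, 7: 25.0, 10: 40.0, 16: 55.0, 17: 60.0}))
```

Using the minimum makes the coarse test at level i conservative: if a level's aggregate stays below δ_i, none of the window sizes it monitors can raise an alarm (for a monotonic aggregate such as a sum over non-negative data).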
Streaming Algorithm: Batch Algorithm
Maintain the aggregates at level L+1.
  The aggregate in the most recently completed window of level L+1 is updated every time unit.
An aggregate of a window at the upper levels will not be computed until all the data in that window are available.
Once an aggregate at a certain upper level is updated, we also check alarms for time intervals monitored by that level.
Higher throughput, longer response time.
Other Aggregates
The monitoring of many other aggregates based on elastic windows could benefit from our data structure, as long as the following conditions hold.
1. The aggregate F is monotonically increasing or decreasing with respect to the window, e.g. Max, Count → monotonically increasing; Min → monotonically decreasing.
2. The alarm domain is one-sided, that is, monotonically increasing → [threshold, ∞); monotonically decreasing → (-∞, threshold].
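To illustrate the two conditions, the pairwise step of the tree can take any combining function whose result over a window bounds the result over every window inside it. The sketch below is an assumption-laden illustration using the built-in `max` and `min`; `build_shifted_levels` is a hypothetical name.

```python
def build_shifted_levels(x, num_levels, combine):
    """Shifted-tree levels for a generic pairwise aggregate.

    combine : function of two values, e.g. max, min, or addition.
    levels[i][k] aggregates the window of size 2**i starting at k * 2**(i - 1).
    """
    levels, base = [list(x)], list(x)
    for _ in range(num_levels):
        merged = [combine(base[k], base[k + 1]) for k in range(len(base) - 1)]
        levels.append(merged)
        base = merged[::2]  # downsample for the next level
    return levels


data = [3, 1, 4, 1, 5, 9, 2, 6]
print(build_shifted_levels(data, 3, max))  # alarm domain [threshold, inf)
print(build_shifted_levels(data, 3, min))  # alarm domain (-inf, threshold]
```

With `max`, a covering window whose value is below the threshold rules out every window inside it; with `min`, the same holds for a covering window whose value is above the threshold.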
Extension to Two Dimensions
The problem is to report the positions of spatial sliding windows (rectangle regions) having different sizes, within which the density exceeds some predefined threshold.
Using the same techniques of SWT-1D.
[Figures: Wavelet Tree 2D; Shifted Wavelet Tree 2D]
Effectiveness Study
Bursts in the number of times that countries were mentioned in the presidential State of the Union addresses.
A predefined sliding window size is insufficient.
Bursts at large time scales are not necessarily reflected at smaller time scales; they may be composed of many consecutive “bumps”.
Effectiveness Study (cont.)
Bursts in population distribution data (1990)
Window sizes 1°x1°, 2°x2° and 5°x5° in Latitude/Longitude
Effectiveness Study (cont.)
Performance Study
Experiments on a 1.5GHz Pentium 4 PC with 512 MB of main memory running Windows 2000.
Datasets
The Gamma Ray data set
  12 hours of data from a small region of the sky, where Gamma Ray bursts were actually reported.
  The data are a time series of the number of photons observed (events) every 0.1 second.
  19,015 events in total in this time series.
The NYSE TAQ Stock data set
  Tick-by-tick trading activities of the IBM stock between July 1st, 1998 and July 1st, 2002.
  5,331,145 trading records (ticks)
  Each record contains trading time, trading price and trading volume.
Training threshold
  Use the first few hours of Gamma Ray data and the first year of Stock data as training data.
  For a window of size w, we compute the aggregates on the training data with a sliding window of size w, giving a vector y.
  f(w) = avg(y) + ξ · std(y)
Window sizes: 5, 10, …, 5 * N_w time units
  N_w: #windows, varies from 5 to 50
  Time units: 0.1 sec for the Gamma Ray data, and 1 min for the stock data.
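The threshold rule f(w) = avg(y) + ξ · std(y) can be sketched as below. Two assumptions are made here that the slides do not state: the aggregate is a sum, and std is the population standard deviation (`statistics.pstdev`). The function name `train_threshold` is illustrative.

```python
from statistics import mean, pstdev


def train_threshold(training, w, xi):
    """Compute f(w) = avg(y) + xi * std(y), where y holds the sums of all
    sliding windows of size w over the training data."""
    y = [sum(training[s:s + w]) for s in range(len(training) - w + 1)]
    return mean(y) + xi * pstdev(y)


def window_sizes(num_windows):
    """Window sizes 5, 10, ..., 5 * num_windows (in time units)."""
    return [5 * k for k in range(1, num_windows + 1)]


training = [1, 2, 3, 4, 5]
print(train_threshold(training, 2, 2.0))
print(window_sizes(5))
```

A larger ξ raises every threshold and so reduces the number of alarms, which matters because the processing time depends on the output size.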
Performance Study (cont.)
The processing time of our algorithm is output-dependent.
Performance Study (cont.)
Experiments on stock data
Performance Study (cont.)
Use spread as aggregate function
Performance Study (cont.)
Conclusion and Future Work
This paper introduces the elastic window model and demonstrates the desirability of the new model.
A novel data structure for efficient detection of elastic bursts and other aggregates.
Experiments show that our algorithm is faster than a brute force algorithm by several orders of magnitude.
Future work
  A robust way of setting the thresholds
  Non-monotonic aggregates