Efficient Elastic Burst Detection in Data Streams. Yunyue Zhu and Dennis Shasha, Department of Computer Science, Courant Institute of Mathematical Sciences, New York University. SIGKDD 2003.

Efficient Elastic Burst Detection in Data Streams

Yunyue Zhu and Dennis Shasha

Department of Computer Science

Courant Institute of Mathematical Sciences

New York University

SIGKDD 2003

Abstract

Burst detection: find abnormal aggregates in data streams, using sliding windows.

In some applications, we want to monitor many sliding window sizes simultaneously.

Brute force: $O(n^2)$

Shifted Wavelet Tree: near-linear time

Problem Statement

Given a time series $x_1, x_2, \ldots, x_n$, a set of window sizes $w_1, w_2, \ldots, w_m$, an aggregate function $F$, and a threshold $f(w_j)$ associated with each window size $w_j$, $j = 1, 2, \ldots, m$:

Monitoring elastic window aggregates of the time series means finding all subsequences, over all window sizes, whose aggregate crosses the threshold for their window size, i.e., all pairs $(i, j)$ with $1 \le i \le n - w_j + 1$ and $1 \le j \le m$ such that $F(x_i, \ldots, x_{i + w_j - 1}) \ge f(w_j)$.
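As a baseline for the problem statement, the brute force approach can be sketched as follows. This is a minimal illustration, not code from the paper: the function name `brute_force_bursts`, the use of sum as the aggregate, and the 0-based start positions are assumptions made for the example.

```python
def brute_force_bursts(series, thresholds):
    """Check every window of every size directly.

    `thresholds` maps a window size w to its threshold f(w).
    The aggregate is assumed to be the sum (an assumption for this sketch).
    Returns a list of (start, w) pairs, with 0-based start positions.
    """
    alarms = []
    for w, f_w in sorted(thresholds.items()):
        for start in range(len(series) - w + 1):
            # Recompute the aggregate from scratch for each window.
            if sum(series[start:start + w]) >= f_w:
                alarms.append((start, w))
    return alarms


print(brute_force_bursts([1, 0, 4, 5, 0, 1], {2: 9, 3: 9}))
```

Every window of every size is examined, which is the cost the Shifted Wavelet Tree is meant to avoid.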

Wavelet Tree

Haar Wavelet Tree

Level 0: the original time series

Level 1: pairwise averages and differences of the adjacent data items at level 0

Level i: pairwise averages and differences on the averages at level i - 1

The wavelet coefficients can represent the trend of the time series.
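The level-by-level construction above can be sketched in a few lines. This is an illustrative sketch only: the name `haar_levels`, the unnormalized difference convention `(a - b) / 2`, and the requirement that the input length be a power of two are assumptions, not details given on the slide.

```python
def haar_levels(series):
    """Return [(averages, differences), ...] for levels 1, 2, ...

    Assumes len(series) is a power of two. Differences use the
    unnormalized convention (a - b) / 2, chosen for this sketch.
    """
    levels = []
    current = list(series)  # level 0: the original time series
    while len(current) > 1:
        pairs = [(current[k], current[k + 1]) for k in range(0, len(current), 2)]
        averages = [(a + b) / 2 for a, b in pairs]
        differences = [(a - b) / 2 for a, b in pairs]
        levels.append((averages, differences))
        current = averages  # the next level works on the averages
    return levels


for level, (avg, diff) in enumerate(haar_levels([2, 4, 6, 8]), start=1):
    print(level, avg, diff)
```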

Wavelet Tree (cont.)

Wavelet coefficient → Aggregate

Average and difference → Sum

Problem: the windows at the same level are non-overlapping.

Shifted Wavelet Tree

Add an additional “line” of windows.

They can be maintained explicitly or implicitly.

Shifted Wavelet Tree (cont.)

Any subsequence of length $w$, $w \le 2^i$, is included in one of the windows at level $i + 1$ of the SWT.

We say that windows with size $w$, $2^{i-1} < w \le 2^i$, are monitored by level $i + 1$ of the SWT.

[Figure: SWT windows at Level 3 and Level 4]
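The containment property can be checked with a small sketch. It assumes, beyond what the slide states, that level $i + 1$ holds windows of size $2^{i+1}$ whose start positions are multiples of $2^i$ (each window overlapping the next by half); the function name `covering_window` and the 0-based positions are likewise hypothetical.

```python
def covering_window(start, w):
    """Return (level, window_start, window_size) of an SWT window that
    contains the subsequence [start, start + w).

    Assumption for this sketch: level i + 1 has windows of size 2**(i + 1)
    starting at every multiple of 2**i.
    """
    i = (w - 1).bit_length()   # smallest i with w <= 2**i
    shift = 2 ** i             # distance between consecutive window starts
    window_start = (start // shift) * shift
    return i + 1, window_start, 2 * shift


# A subsequence of length 3 starting at position 5 is monitored by level 3.
print(covering_window(5, 3))
```

Under that assumption, the subsequence always fits: it starts at most $2^i - 1$ positions into the window and is at most $2^i$ long, so it ends before the window does.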

SWT Construction

For each level $i$ ($i \ge 1$):

Compute the pairwise aggregate (sum) for each two consecutive data items at level $i - 1$.

Downsampling: sample every second item in the series of aggregates; this is the input for the next higher level in the SWT.

Total cost: $O(n)$, where $n$ is the length of the time series.
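The two construction steps (pairwise sums, then downsampling) can be sketched as below. The function name `build_swt`, the dictionary layout, and the 0-based indexing are assumptions for illustration; only the two steps themselves come from the slide.

```python
def build_swt(series, top_level):
    """Build the sums for levels 1..top_level.

    Returns {level: sums}, where sums[k] is the sum of the window of size
    2**level that starts at position k * 2**(level - 1).
    """
    levels = {}
    base = list(series)  # input to level 1 is the raw data (level 0)
    for level in range(1, top_level + 1):
        # Pairwise sums of every two consecutive items: overlapping windows.
        shifted = [base[k] + base[k + 1] for k in range(len(base) - 1)]
        levels[level] = shifted
        # Downsampling: every second sum becomes the input of the next level.
        base = shifted[::2]
    return levels


swt = build_swt([1, 2, 3, 4, 5, 6, 7, 8], 3)
print(swt)
```

Each level does about half the work of the one below it, so the total number of additions stays below $2n$, in line with the $O(n)$ bound.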

Search for a Burst

Given a window size $w \le 2^i$ and a threshold $f(w)$, search in two stages:

1. The potential burst is detected at level $i + 1$ in the SWT.

2. A detailed search is run in those subsequences of size $2^i$ with sum $\ge f(w)$.

Cost: $O(k)$, where $k$ is the number of alarms (the output size).

Streaming Algorithm

Assume that new data becomes available at every time unit.

The window sizes satisfy $2^L < w_1 < w_2 < \ldots < w_m < 2^U$.

Maintain the levels from $L + 2$ to $U + 1$ of the SWT, which monitor those windows.

Two methods: an online algorithm and a batch algorithm.

Streaming Algorithm: Online Algorithm

Whenever a new data item becomes available:

Update the $2(U - L)$ aggregates of the windows in the SWT.

If the aggregate at level $i$ exceeds $\delta_i$, perform a detailed search on the windows monitored by level $i$.

For level $i$, the threshold is $\delta_i = \min f(w_j)$ over all $w_j$ with $2^{i-2} < w_j \le 2^{i-1}$.

Response time = one time unit.
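The per-level thresholds can be computed once, before streaming starts. The sketch below is illustrative; the function name `level_thresholds` and the dictionary input format are assumptions.

```python
def level_thresholds(thresholds):
    """Map each SWT level to the smallest threshold among the window
    sizes it monitors.

    `thresholds` maps a window size w to f(w). Level i monitors the
    sizes w with 2**(i - 2) < w <= 2**(i - 1).
    """
    deltas = {}
    for w, f_w in thresholds.items():
        level = (w - 1).bit_length() + 1
        deltas[level] = min(f_w, deltas.get(level, float("inf")))
    return deltas


# Sizes 5 and 8 fall under level 4; sizes 9 and 16 fall under level 5.
print(level_thresholds({5: 30.0, 8: 42.0, 9: 50.0, 16: 45.0}))
```

Taking the minimum keeps the filter conservative: if a level's aggregate stays below its smallest threshold, none of the window sizes it monitors can raise an alarm.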

Streaming Algorithm: Batch Algorithm

Maintain the aggregates at level $L + 1$. The aggregate in the most recently completed window of level $L + 1$ is updated every time unit.

An aggregate of a window at the upper levels will not be computed until all the data in that window are available.

Once an aggregate at a certain upper level is updated, we also check alarms for time intervals monitored by that level.

Higher throughput, longer response time.

Other Aggregates

The monitoring of many other aggregates based on elastic windows could benefit from our data structure, as long as the following conditions hold.

1. The aggregate $F$ is monotonically increasing or decreasing with respect to the window, e.g., Max and Count are monotonically increasing; Min is monotonically decreasing.

2. The alarm domain is one-sided: $[\text{threshold}, \infty)$ for a monotonically increasing aggregate, $(-\infty, \text{threshold}]$ for a monotonically decreasing one.
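To make the first condition concrete, the sketch below checks empirically, on one sample series, whether growing a window by one item can ever move an aggregate in the wrong direction. The helper name `is_monotone` is hypothetical, and a passing check on sample data is only a sanity test, not a proof of monotonicity.

```python
def is_monotone(aggregate, series, increasing=True):
    """Return False if extending some window of `series` by one item
    (to the left or to the right) moves `aggregate` the wrong way."""
    n = len(series)
    for start in range(n):
        for end in range(start + 1, n + 1):
            inner = aggregate(series[start:end])
            grown = []
            if start > 0:
                grown.append(aggregate(series[start - 1:end]))
            if end < n:
                grown.append(aggregate(series[start:end + 1]))
            for outer in grown:
                if increasing and outer < inner:
                    return False
                if not increasing and outer > inner:
                    return False
    return True


print(is_monotone(max, [3, 1, 4, 1, 5]))                    # Max grows with the window
print(is_monotone(min, [3, 1, 4, 1, 5], increasing=False))  # Min shrinks with the window
print(is_monotone(sum, [1, -2, 3]))                         # Sum fails once values can be negative
```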

Extension to Two Dimensions

The problem is to report the positions of spatial sliding windows (rectangular regions) of different sizes within which the density exceeds some predefined threshold.

This uses the same techniques as the one-dimensional SWT.

[Figures: Wavelet Tree 2D; Shifted Wavelet Tree 2D]

Effectiveness Study

Bursts in the number of times that countries were mentioned in the presidential State of the Union addresses.

Effectiveness Study (cont.)

A predefined sliding window size is insufficient.

Bursts at large time scales are not necessarily reflected at smaller time scales; they may be composed of many consecutive “bumps”.

Effectiveness Study (cont.)

Bursts in population distribution data (1990)

Window sizes: 1°x1°, 2°x2° and 5°x5° in latitude/longitude

Performance Study

Experiments were run on a 1.5 GHz Pentium 4 PC with 512 MB of main memory running Windows 2000.

Datasets

The Gamma Ray data set:

12 hours of data from a small region of the sky where gamma ray bursts were actually reported.

The data are a time series of the number of photons observed (events) every 0.1 second.

19,015 events in total in this time series.

The NYSE TAQ Stock data set:

Tick-by-tick trading activities of the IBM stock between July 1st, 1998 and July 1st, 2002.

5,331,145 trading records (ticks).

Each record contains trading time, trading price and trading volume.

Performance Study (cont.)

Training the thresholds

Use the first few hours of the Gamma Ray data and the first year of the Stock data as training data.

For a window of size $w$, compute the aggregates on the training data with a sliding window of size $w$; call the resulting series $\vec{y}$.

$f(w) = \mathrm{avg}(\vec{y}) + \xi \cdot \mathrm{std}(\vec{y})$

Window sizes: $5, 10, \ldots, 5 N_w$ time units

$N_w$: the number of windows, which varies from 5 to 50

Time units: 0.1 sec for the Gamma Ray data, and 1 min for the Stock data
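The threshold rule $f(w) = \mathrm{avg}(\vec{y}) + \xi \cdot \mathrm{std}(\vec{y})$ can be sketched as follows. The slide does not say whether the standard deviation is the population or the sample form, so this sketch assumes the population form (`statistics.pstdev`); the function name `train_threshold` and the use of sum as the aggregate are also assumptions.

```python
from statistics import mean, pstdev


def train_threshold(training, w, xi):
    """Compute f(w) = avg(y) + xi * std(y), where y holds the sums of all
    sliding windows of size w over the training data.

    Uses the population standard deviation (an assumption of this sketch).
    """
    y = [sum(training[s:s + w]) for s in range(len(training) - w + 1)]
    return mean(y) + xi * pstdev(y)


# Window sums of [1, 2, 3, 4] with w = 2 are [3, 5, 7].
print(train_threshold([1, 2, 3, 4], 2, 2.0))
```

A larger $\xi$ raises the threshold and so reduces the number of alarms.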

Performance Study (cont.)

The processing time of our algorithm is output-dependent.

Performance Study (cont.)

Experiments on stock data

Performance Study (cont.)

Use spread as the aggregate function.

Conclusion and Future Work

This paper introduces the elastic window model and demonstrates the desirability of the new model.

It presents a novel data structure for efficient detection of elastic bursts and other aggregates.

Experiments show that our algorithm is faster than a brute force algorithm by several orders of magnitude.

Future work: a robust way of setting the thresholds; non-monotonic aggregates.