Efficient Elastic Burst Detection in Data Streams

Yunyue Zhu and Dennis Shasha

Department of Computer Science, Courant Institute of Mathematical Sciences, New York University

SIGKDD 2003

Abstract

Burst detection: find abnormal aggregates in data streams over a sliding window.

In some applications, we want to monitor many sliding window sizes simultaneously.

Brute force: $O(n^2)$

Shifted Wavelet Tree: near linear time

Problem Statement

For a time series $x_1, x_2, \ldots, x_n$, we are given a set of window sizes $w_1, w_2, \ldots, w_m$, an aggregate function $F$, and a threshold associated with each window size, $f(w_j)$, $j = 1, 2, \ldots, m$.

Monitoring elastic window aggregates of the time series means finding all subsequences, of all the window sizes, whose aggregate crosses the threshold for their window size, i.e., all $i$ and $j$ such that

$$F(x[i \,..\, i + w_j - 1]) \ge f(w_j), \quad 1 \le i \le n - w_j + 1, \quad 1 \le j \le m.$$
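As a baseline for the problem statement above, the brute-force approach can be sketched in Python. This is an illustrative sketch, not the authors' code: the name `brute_force_bursts`, the 0-based indexing, and the dictionary that maps each window size to its threshold are assumptions made here, and "crossing" a threshold is taken to mean `>=`.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def brute_force_bursts(
    x: Sequence[float],
    thresholds: Dict[int, float],
    aggregate: Callable[[Sequence[float]], float] = sum,
) -> List[Tuple[int, int]]:
    """Return (start, w) for every window x[start:start+w] whose
    aggregate is >= thresholds[w]. Hypothetical helper, 0-based starts."""
    alarms = []
    for w, f_w in sorted(thresholds.items()):
        # Slide a window of size w over the whole series.
        for start in range(len(x) - w + 1):
            if aggregate(x[start:start + w]) >= f_w:
                alarms.append((start, w))
    return alarms


x = [1, 0, 5, 4, 0, 1, 0, 0]
print(brute_force_bursts(x, {2: 9, 3: 9}))  # [(2, 2), (1, 3), (2, 3)]
```

Every window of every size is recomputed from scratch, which is the cost that the Shifted Wavelet Tree is designed to avoid.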

Wavelet Tree

Haar Wavelet Tree

Level 0: original time series

Level 1: pairwise averages and differences of the adjacent data items at level 0

Level i: pairwise averages and differences on averages at level i - 1

The wavelet coefficients can represent the trend of the time series.

Wavelet coefficient → Aggregate

Average and difference → Sum

Problem: the windows at the same level are non-overlapping.
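The level structure of the Haar wavelet tree can be illustrated with a short Python sketch. The function name `haar_levels`, the unnormalized convention (average `(a + b) / 2`, difference `(a - b) / 2`), and the power-of-two length requirement are assumptions chosen for this example.

```python
from typing import List, Tuple


def haar_levels(x: List[float]) -> List[Tuple[List[float], List[float]]]:
    """Return [(averages, differences)] for levels 1, 2, ...
    Assumes len(x) is a power of two (illustrative restriction)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    levels = []
    current = list(x)
    while len(current) > 1:
        pairs = range(0, len(current), 2)  # non-overlapping adjacent pairs
        avgs = [(current[j] + current[j + 1]) / 2 for j in pairs]
        diffs = [(current[j] - current[j + 1]) / 2 for j in pairs]
        levels.append((avgs, diffs))
        current = avgs  # the next level works on the averages
    return levels


levels = haar_levels([2, 4, 6, 8])
print(levels[0])  # ([3.0, 7.0], [-1.0, -1.0])
print(levels[1])  # ([5.0], [-2.0])
```

An average at level i times 2^i gives the sum of its window, so the tree yields window sums directly. The window `[4, 6]` in this example, however, straddles two level-1 windows and is not represented at level 1, which is the non-overlap problem noted above.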


Shifted Wavelet Tree

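Add an additional "line" of windows at each level, shifted by half a window size, so that adjacent windows at the same level half overlap.

They can be maintained explicitly or implicitly.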

Shifted Wavelet Tree (cont.)

Any subsequence of length $w$, $w \le 2^i$, is included in one of the windows at level $i + 1$ of the SWT.

We say that windows with size $w$, $2^{i-1} < w \le 2^i$, are monitored by level $i + 1$ of the SWT.
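The containment property can be checked with a few lines of Python. The sketch assumes 0-based positions and that level k of the SWT holds windows of size 2^k starting at every multiple of 2^(k-1); the name `covering_window` is hypothetical, and the end of a finite series is ignored here.

```python
from typing import Tuple


def covering_window(start: int, w: int, i: int) -> Tuple[int, int]:
    """Return the half-open range [begin, end) of a level-(i+1) window
    that contains x[start:start+w], for 1 <= w <= 2**i.
    Assumes level-(i+1) windows have size 2**(i+1) and start at
    multiples of 2**i (illustrative indexing)."""
    if i < 0 or not 1 <= w <= 2 ** i:
        raise ValueError("need 1 <= w <= 2**i")
    shift = 2 ** i            # distance between consecutive window starts
    size = 2 * shift          # window size at level i + 1
    begin = (start // shift) * shift
    return begin, begin + size


print(covering_window(5, 3, 2))  # (4, 12)
```

Because `start` is less than `begin + 2**i` and `w` is at most `2**i`, the subsequence ends no later than `begin + 2**(i+1)`, so it always fits inside the returned window.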

[Figure: windows at Level 3 and Level 4 of the SWT]

SWT Construction

For each level $i$ ($i \ge 1$):

Compute the pairwise aggregate (sum) for each two consecutive data items at level $i - 1$.

Downsampling: sample every second item in the series of aggregates → the input for the higher level in the SWT.

O(n), n: time series length
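The construction steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the name `build_swt`, 0-based indexing, and the requirement that the series length be a power of two are choices made for this example.

```python
from typing import List


def build_swt(x: List[float]) -> List[List[float]]:
    """Build a Shifted Wavelet Tree of sums.
    swt[0] is the series; for i >= 1, swt[i][j] is the sum of
    x[j * 2**(i-1) : j * 2**(i-1) + 2**i], i.e. windows of size 2**i
    that overlap their neighbours by half.
    Assumes len(x) is a power of two (illustrative restriction)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    swt = [list(x)]
    b = list(x)  # non-overlapping cover from the level below
    while len(b) > 1:
        # Pairwise sum of every two consecutive items.
        level = [b[j] + b[j + 1] for j in range(len(b) - 1)]
        swt.append(level)
        # Downsample: keep every second aggregate as input for the next level.
        b = level[::2]
    return swt


swt = build_swt([1, 2, 3, 4, 5, 6, 7, 8])
print(swt[1])  # [3, 5, 7, 9, 11, 13, 15]
print(swt[2])  # [10, 18, 26]
print(swt[3])  # [36]
```

Each level does about half the additions of the level below it, so the total work is roughly n + n/2 + n/4 + ..., which is O(n).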

Search for a Burst

Given window size $w \le 2^i$ and threshold $f(w)$

Search in two stages:

The potential burst is detected at level $i + 1$ in the SWT.

Detailed search in those subsequences of size $2^i$ with sum $\ge f(w)$.

$O(k)$, $k$: number of alarms (output size)
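The two-stage search can be sketched in Python as below. The sketch makes several assumptions that are not in the slides: the data are nonnegative (so a window's sum bounds the sum of anything inside it), the series length is a power of two, the level-(i+1) windows have size 2^(i+1) and start at multiples of 2^i, and the helper names `build_swt` and `search_bursts` are hypothetical. The detailed stage here uses prefix sums, which is one possible choice.

```python
from itertools import accumulate
from typing import List


def build_swt(x: List[float]) -> List[List[float]]:
    """swt[i][j] = sum of x[j * 2**(i-1) : j * 2**(i-1) + 2**i] for i >= 1."""
    swt, b = [list(x)], list(x)
    while len(b) > 1:
        level = [b[j] + b[j + 1] for j in range(len(b) - 1)]
        swt.append(level)
        b = level[::2]
    return swt


def search_bursts(x: List[float], w: int, f_w: float) -> List[int]:
    """Return all starts s with sum(x[s:s+w]) >= f_w.
    Assumes nonnegative x and len(x) a power of two."""
    n = len(x)
    if w < 1:
        raise ValueError("window size must be positive")
    i = (w - 1).bit_length()           # smallest i with w <= 2**i
    shift, size = 2 ** i, 2 ** (i + 1)
    if n & (n - 1) or size > n:
        raise ValueError("need power-of-two length and 2**(i+1) <= len(x)")
    swt = build_swt(x)
    prefix = [0] + list(accumulate(x))  # prefix[k] = sum(x[:k])
    hits = set()
    for j, window_sum in enumerate(swt[i + 1]):
        # Stage 1: a window below the threshold cannot contain a burst.
        if window_sum < f_w:
            continue
        # Stage 2: detailed search inside the suspicious window.
        begin = j * shift
        for s in range(begin, begin + size - w + 1):
            if prefix[s + w] - prefix[s] >= f_w:
                hits.add(s)
    return sorted(hits)


x = [1, 0, 5, 4, 0, 1, 0, 0]
print(search_bursts(x, 2, 9))  # [2]
print(search_bursts(x, 3, 9))  # [1, 2]
```

With integer counts the prefix-sum differences are exact; with floating-point data a small tolerance may be needed.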

Streaming Algorithm

Assume that new data becomes available at every time unit.

The window sizes satisfy $2^L < w_1 < w_2 < \ldots < w_m < 2^U$.

Maintain the levels from L+2 to U+1 of the SWT that monitor those windows.

Two methods: online algorithm and batch algorithm.

Streaming Algorithm: Online Algorithm

Whenever a new data item becomes available:

Update those $2(U - L)$ aggregates of the windows in the SWT.

If the aggregate at level $i$ exceeds $\delta_i$, perform a detailed search on those windows monitored by level $i$.

For level $i$, the threshold is $\delta_i = \min f(w_j)$ over all $w_j$ with $2^{i-2} < w_j \le 2^{i-1}$.
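The per-level thresholds can be computed with a small Python helper. This is a sketch: the name `level_thresholds` and the dictionary representation of $f$ are assumptions made for the example.

```python
from typing import Dict


def level_thresholds(f: Dict[int, float]) -> Dict[int, float]:
    """Map each SWT level i to delta_i = min f(w_j) over the window
    sizes monitored by that level, 2**(i-2) < w_j <= 2**(i-1).
    `f` maps window size -> threshold (hypothetical representation)."""
    deltas: Dict[int, float] = {}
    for w, f_w in f.items():
        if w < 1:
            raise ValueError("window sizes must be positive")
        # (w - 1).bit_length() is the smallest k with w <= 2**k,
        # and level k + 1 monitors that window size.
        level = (w - 1).bit_length() + 1
        deltas[level] = min(f_w, deltas.get(level, float("inf")))
    return deltas


f = {3: 10.0, 4: 12.0, 5: 20.0, 8: 18.0, 9: 30.0}
print(level_thresholds(f))  # {3: 10.0, 4: 18.0, 5: 30.0}
```

Taking the minimum makes the level-wide check conservative: if the aggregate of a level-i window stays below $\delta_i$, none of the window sizes monitored by that level can reach its own threshold inside it, assuming a monotonically increasing aggregate such as a sum over nonnegative data.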

Response time = one time unit

Streaming Algorithm: Batch Algorithm

Maintain the aggregates at level $L+1$.

The aggregate in the most recently completed window of level $L+1$ is updated every time unit.

An aggregate of a window at the upper levels will not be computed until all the data in that window are available.

Once an aggregate at a certain upper level is updated, we also check alarms for time intervals monitored by that level.

Higher throughput, longer response time.

Other Aggregates

The monitoring of many other aggregates based on elastic windows could benefit from our data structure, as long as the following conditions hold.

1. The aggregate F is monotonically increasing or decreasing with respect to the window, e.g.
Max, Count → monotonically increasing
Min → monotonically decreasing

2. The alarm domain is one-sided, that is,
monotonically increasing → [threshold, ∞)
monotonically decreasing → (-∞, threshold]
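As an illustration of the two conditions, the SWT construction can be parameterized by the function that combines two adjacent windows. The Python sketch below uses Max with the alarm domain [threshold, ∞); the names `build_swt_with` and `suspicious_windows`, the 0-based indexing, and the power-of-two length requirement are assumptions of this example.

```python
from typing import Callable, List, Tuple


def build_swt_with(
    x: List[float], combine: Callable[[float, float], float]
) -> List[List[float]]:
    """SWT for an aggregate that can be merged from two adjacent windows.
    Level i >= 1 holds windows of size 2**i starting at multiples of 2**(i-1)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    swt, b = [list(x)], list(x)
    while len(b) > 1:
        level = [combine(b[j], b[j + 1]) for j in range(len(b) - 1)]
        swt.append(level)
        b = level[::2]  # downsample to a non-overlapping cover
    return swt


def suspicious_windows(
    swt: List[List[float]], level: int, threshold: float
) -> List[Tuple[int, int]]:
    """Half-open ranges of level windows whose Max is >= threshold.
    Only these can contain a sub-window whose Max reaches the threshold."""
    shift, size = 2 ** (level - 1), 2 ** level
    return [
        (j * shift, j * shift + size)
        for j, value in enumerate(swt[level])
        if value >= threshold
    ]


x = [3, 1, 4, 1, 5, 9, 2, 6]
swt = build_swt_with(x, max)
print(swt[2])                             # [4, 9, 9]
print(suspicious_windows(swt, 2, 9))      # [(2, 6), (4, 8)]
```

The window covering positions 0 to 3 has Max 4, so no sub-window inside it can have a Max of 9 or more, and it is skipped without a detailed search.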

Extension to Two Dimensions

The problem is to report the positions of spatial sliding windows (rectangle regions) having different sizes, within which the density exceeds some predefined threshold.

This uses the same techniques as the one-dimensional SWT.

[Figures: Wavelet Tree 2D; Shifted Wavelet Tree 2D]

Effectiveness Study

Bursts in the number of times that countries were mentioned in the presidential State of the Union addresses.

A predefined sliding window size is insufficient.

Bursts at large time scales are not necessarily reflected at smaller time scales; they may be composed of many consecutive "bumps".

Effectiveness Study (cont.)

Bursts in population distribution data (1990)

Window sizes 1°x1°, 2°x2° and 5°x5° in Latitude/Longitude


Performance Study

Experiments on a 1.5GHz Pentium 4 PC with 512 MB of main memory running Windows 2000.

Datasets

The Gamma Ray data set

12 hours of data from a small region of the sky, where Gamma Ray bursts were actually reported.

The data are time series of the number of photons observed (events) every 0.1 second.

19,015 events in total in this time series.

The NYSE TAQ Stock data set

Tick-by-tick trading activities of the IBM stock between July 1st, 1998 and July 1st, 2002.

5,331,145 trading records (ticks).

Each record contains trading time, trading price and trading volume.

Training threshold

Use the first few hours of Gamma Ray data and the first year of Stock data as training data.

For a window of size $w$, we compute the aggregates on the training data with a sliding window of size $w$, which gives a vector $\vec{y}$.

$f(w) = \mathrm{avg}(\vec{y}) + \xi \, \mathrm{std}(\vec{y})$

Window sizes: $5, 10, \ldots, 5 N_w$ time units

$N_w$: number of windows, varies from 5 to 50

Time units: 0.1 sec for the Gamma Ray data, and 1 min for the stock data.
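The threshold rule can be sketched in Python as follows. The name `train_threshold`, the use of sum as the aggregate, and the choice of the population standard deviation (`statistics.pstdev`) are assumptions; the slides do not say which standard deviation estimator was used.

```python
from statistics import mean, pstdev
from typing import Sequence


def train_threshold(train: Sequence[float], w: int, xi: float) -> float:
    """Return f(w) = avg(y) + xi * std(y), where y holds the sums of all
    sliding windows of size w over the training data.
    Uses the population standard deviation (an assumption)."""
    if not 1 <= w <= len(train):
        raise ValueError("need 1 <= w <= len(train)")
    window_sum = sum(train[:w])
    y = [window_sum]
    for t in range(w, len(train)):
        # Slide by one: add the new item, drop the oldest one.
        window_sum += train[t] - train[t - w]
        y.append(window_sum)
    return mean(y) + xi * pstdev(y)


train = [1, 2, 3, 4]
print(train_threshold(train, 2, 0.0))  # 5.0
```

For the window sizes listed above, one threshold per size could be trained with `{w: train_threshold(train, w, xi) for w in range(5, 5 * n_w + 1, 5)}`, where `train`, `xi` and `n_w` are supplied by the caller.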

Performance Study (cont.)

The processing time of our algorithm is output-dependent.

Performance Study (cont.)

Experiments on stock data

Performance Study (cont.)

Use spread as aggregate function


Conclusion and Future Work

This paper introduces the elastic window model and demonstrates the desirability of the new model.

A novel data structure for efficient detection of elastic bursts and other aggregates.

Experiments show that our algorithm is faster than a brute force algorithm by several orders of magnitude.

Future work

A robust way of setting the thresholds

Non-monotonic aggregates
