Download - Mining Unusual Patterns in Data Streams in Multi-Dimensional

Mining Unusual Patterns in Data Streams in Multi-

Dimensional Space

Jiawei Han

Department of Computer Science

University of Illinois at Urbana-Champaign

www.cs.uiuc.edu/~hanj

http://www.cs.uiuc.edu/~hanj








May 10, 2010 Mining Unusual Patterns in Data Streams 2

Outline

Characteristics of data streams

Mining unusual patterns in data streams

Multi-dimensional regression analysis of data

streams

Stream cubing and stream OLAP methods

Mining other kinds of patterns in data streams

Research problems


Data Streams

Data Streams Data streams—continuous, ordered, changing, fast, huge amount

Traditional DBMS—data stored in finite, persistent data setsdata sets

Characteristics Huge volumes of continuous data, possibly infinite Fast changing and requires fast, real-time response Data stream captures nicely our data processing needs of today Random access is expensive—single linear scan algorithm (can

only have one look) Store only the summary of the data seen thus far Most stream data are at pretty low-level or multi-dimensional in

nature, needs multi-level and multi-dimensional processing


Stream Data Applications

Telecommunication calling records Business: credit card transaction flows Network monitoring and traffic engineering Financial market: stock exchange Engineering & industrial processes: power supply &

manufacturing Sensor, monitoring & surveillance: video streams Security monitoring Web logs and Web page click streams Massive data sets (even saved but random access is too

expensive)


Challenges of Stream Data Mining

Multiple, continuous, rapid, time-varying, ordered streams Main memory computation Mining queries are either continuous or ad-hoc Mining queries are often complex

Involving multiple streams, large amount of data, and history Finding patterns, models, anomaly, differences, …

Mining dynamics (changes, trends and evolutions) of data streams

Multi-level/multi-dimensional processing and data mining Most stream data are at pretty low-level or multi-dimensional in

nature


Stream Data Mining Tasks

Multi-dimensional (on-line) analysis of streams Clustering data streams Classification of data streams Mining frequent patterns in data streams Mining sequential patterns in data streams Mining partial periodicity in data streams Mining notable gradients in data streams Mining outliers and unusual patterns in data streams ……


Multi-Dimensional Stream Analysis: Examples

Analysis of Web click streams Raw data at low levels: seconds, web page addresses, user IP

addresses, … Analysts want: changes, trends, unusual patterns, at reasonable

levels of details E.g., Average clicking traffic in North America on sports in the last

15 minutes is 40% higher than that in the last 24 hours.”

Analysis of power consumption streams Raw data: power consumption flow for every household, every

minute Patterns one may find: average hourly power consumption

surges up 30% for manufacturing companies in Chicago in the last 2 hours today than that of the same day a week ago


A Key Step—Stream Data Reduction

Challenges of OLAPing stream data Raw data cannot be stored

Simple aggregates are not powerful enough

History shape and patterns at different levels are desirable: multi-

dimensional regression analysis

Proposal A scalable multi-dimensional stream “data cube” that can

aggregate regression model of stream data efficiently without

accessing the raw data

Stream data compression Compress the stream data to support memory- and time-efficient

multi-dimensional regression analysis


Regression Cube for Time-Series

Initially, one time-series per base cell Too costly to store all these time-series Too costly to compute regression at multi-dimensional space

Regression cube Base cube: only store regression parameters of base cells (e.g.,

2 points vs. 1000 points) All the upper level cuboids can be computed precisely for linear

regression on both standard dimensions and time dimensions For quadratic regression, we need 5 points In general, we need:

where k = 2 for quadratic.

,2

3

2

)1( 2 kkkkk

+=++


Basics of General Linear Regression

n tuples in one cell: (xi , yi), i =1..n, where yi is the measure attribute to be analyzed

For sample i , a vector of k user-defined predictors ui:

The linear regression model:

where η is a k × 1 vector of regression parameters

( )

( )

=

=

−− 1,

1

1

1

0

...

1

...

ki

i

ik

ii

u

u

u

u

u

x

xu

( ) 1,1110 ...| −−+++== kikiiT

ii uuyE ηηηη uu


Theory of General Linear Regression

Collect into the model matrix U

The ordinary least square (OLS) estimate of is the argument that minimizes the residue sum of squares function

Main theorem to determine the OLS regression parameters

iu kn×

=

−

−

−

1,21

1,22221

1,11211

...1.

.

.

.

.

.

.

.

.

....1

...1

knnn

k

k

uuu

uuu

uuu

U

∧η η

)()()( ηηRSS T UyUy −−=η

yUUU TTRSS 1)(0)( −∧

=⇒=∂∂ ηηη


Linearly Compressed Representation (LCR)

Stream data compression for multi-dimensional regression analysis

Define, for i, j = 0,…,k-1:

The linearly compressed representation (LCR) of one cell:

Size of LCR of one cell:

quadratic in k, independent of the number of tuples n in one cell

∑=

=n

hhjhiij uu

1

θ

},1,...,0,|{}1,...,0|{ jikjiki iji ≤−=−=∧

θη

,2

3

2

)1( 2 kkkkk

+=++


Matrix Form of LCR

LCR consists of and , where

and

where

provides OLS regression parameters essential for regression analysis

is an auxiliary matrix that facilitates aggregations of LCR in standard and regression dimensions in a data cube environment

LCR only stores the upper triangle of

),...,,( 110 −

∧∧∧∧= k

T

ηηηη∧η Θ

=

−−

−

−

−−− 1,1

1,1

1,0

2,11,10,1

121110

020100

.

.

...

...

...

.

.

.

.

.

....

...

kk

k

k

kkk θ

θθ

θθθ

θθθθθθ

Θ

∧ηΘ

⇒= ΘΘT Θ


Aggregation in Standard Dimensions

Given LCR of m cells that differ in one standard dimension, what is the LCR of the cell aggregated in that dimension? for m base cells

for an aggregated cell

The lossless aggregation formula

),(,...),,(),,( 222111 mmmLCRLCRLCR ΘΘΘ∧∧∧

=== ηηη

),( aaaLCR Θ∧

= η

1

1

ΘΘ =

=∑=

∧∧

a

m

iia ηη


Stock Price Example—Aggregation in Standard Dimensions

Simple linear regression on time series dataCells of two companies

After aggregation:


Aggregation in Regression Dimensions

Given LCR of m cells that differ in one regression dimension, what is the LCR of the cell aggregated in that dimension

for m base cells

for the aggregated cell

The lossless aggregation formula

),(,...),,(),,( 222111 mmmLCRLCRLCR ΘΘΘ∧∧∧

=== ηηη

),( aaaLCR Θ∧

= η

∑

∑∑

=

=

∧−

=

∧

=

=

m

iia

m

iii

m

iia

1

1

1

1

ΘΘ

ΘΘ ηη


Stock Price Example—Aggregation in Time Dimension

Cells of two adjacent time intervals:

After aggregation


Feasibility of Stream Regression Analysis

Efficient storage and scalable (independent of the

number of tuples in data cells)

Lossless aggregation without accessing the raw data

Fast aggregation: computationally efficient

Regression models of data cells at all levels

General results: covered a large and the most popular

class of regression

Including quadratic, polynomial, and nonlinear models


A Stream Cube Architecture

A tilted time frame Different time granularities

second, minute, quarter, hour, day, week, …

Critical layers Minimum interest layer (m-layer) Observation layer (o-layer) User: watches at o-layer and occasionally needs to drill-down

down to m-layer

Partial materialization of stream cubes Full materialization: too space and time consuming No materialization: slow response at query time Partial materialization: what do we mean “partial”?


A Tilted Time-Frame Model

31 days 24 hours 4 qtrs12 months

Time Now

24hrs 4qtrs 15minutes7 days

Time Now

25sec.Up to 7 days:

Up to a year:

Logarithmic (exponential) scale:4t 2t 1t

Time Now

4t8t16t


Benefits of Tilted Time-Frame Model

Each cell stores the measures according to tilt-time-frame

Limited memory space: Impossible to store the history in full scale

Emphasis more on recent data

Most applications emphasize on recent data (slide window)

Natural partition on different time granularities

Putting different weights on remote data

Useful even for uniform weight

Tilted time-frame forms a new time dimension

for mining changes and evolutions

Essential for mining unusual patterns or outliers

Finding those with dramatic changes

E.g., exceptional stocks—not following the trends


Two Critical Layers in the Stream Cube

(*, theme, quarter)

(user-group, URL-group, minute)

m-layer (minimal interest)

(individual-user, URL, second)

(primitive) stream data layer

o-layer (observation)


On-Line Materialization vs. On-Line Computation

On-line materialization

Materialization takes precious resources and time

Only incremental materialization (with slide window)

Only materialize “cuboids” of the critical layers?

Some intermediate cells that should be materialized

Popular path approach vs. exception cell approach

Materialize intermediate cells along the popular paths

Exception cells: how to set up exception thresholds?

Notice exceptions do not have monotonic behaviour

Computation problem

How to compute and store stream cubes efficiently?

How to discover unusual cells between the critical layer?


Stream Cube Structure: from m-layer to o-layer

(A1, *, C1)

(A1, *, C2) (A1, *, C2)

(A1, *, C2)

(A1, B1, C2) (A1, B2, C1)

(A2, *, C2)

(A2, B1, C1)

(A1, B2, C2)

(A2, B1, C2)

A2, B2, C1)

(A2, B2, C2)


Stream Cube Computation

Cube structure from m-layer to o-layer

Three approaches All cuboids approach

Materializing all cells (too much in both space and time)

Exceptional cells approach

Materializing only exceptional cells (saves space but not time

to compute and definition of exception is not flexible)

Popular path approach

Computing and materializing cells only along a popular path

Using H-tree structure to store computed cells (which form the

stream cube—a selectively materialized cube)


An H-Tree Cubing Structure

root

entertainment sports politics

uic uiuc uic uiuc

jeff mary jeff Jim

Q.I.Q.I. Q.I.

Regression:

Sum: xxxxCnt: yyyy

Quant-Info

Observation layer

Minimal int. layer


Partial Materialization Using H-Tree

H-tree: Introduced for computing data cubes and iceberg cubes

J. Han, J. Pei, G. Dong, and K. Wang, “Efficient Computation

of Iceberg Cubes with Complex Measures”, SIGMOD'01

Compressed database, fast cubing, and space preserving in cube

computation

Using H-tree for partial stream cubing Space preserving:

Intermediate aggregates can be computed incrementally and

saved in tree nodes

Facilitate computing other cells and multi-dimensional analysis

H-tree with computed cells can be viewed as “stream cube”


1

10

100

1000

25 50 100 200 400

Size (in K tuples)

Run

tim

e (i

n se

cond

s)

popular-pathexception-cellsall-cubing

0

200

400

600

25 50 100 200 400

Size (in K tuples)

Mem

ory

usag

e (i

n M

B)

popular-path

exception-cells

all-cubing

a) Time vs. m-layer size b) Space vs. m-layer size

Time and Space vs. Number of Tuples at the m-Layer (Dataset D3L3C10T400K)


1

10

100

1000

10000

3 4 5 6 7

Number of levels

Run

tim

e (i

n se

cond

s)


1

10

100

1000

10000

3 4 5 6 7

Number of levels

Mem

ory

usag

e (i

n M

B)


Time and Space vs. the Number of Levels

a) Time vs. # levels b) Space vs. # levels


Mining Other Unusual Patterns in Stream Data

Clustering and outlier analysis for stream mining

Clustering data streams (Guha, Motwani et al. 2000-2002)

History-sensitive, high-quality incremental clustering

Classification of stream data

Evolution of decision trees: Domingos et al. (2000, 2001)

Incremental integration of new streams in decision-tree induction

Frequent pattern analysis

Approximate frequent patterns (Manku & Motwani VLDB’02)

Evolution and dramatic changes of frequent patterns


Conclusions

Stream data mining: A rich and largely unexplored field Current research focus in database community:

DSMS system architecture, continuous query processing, supporting mechanisms

Stream data mining and stream OLAP analysis Powerful tools for finding general and unusual patterns Effectiveness, efficiency and scalability: lots of open problems

Our philosophy: A multi-dimensional stream analysis framework

Time is a special dimension: tilted time frame What to compute and what to save?—Critical layers Very partial materialization/precomputation: popular path approach Mining dynamics of stream data


References

B. Babcock, S. Babu, M. Datar, R. Motawani, and J. Widom, “Models and issues in data

stream systems”, PODS'02 (tutorial).

S. Babu and J. Widom, “Continuous queries over data streams”, SIGMOD Record,

30:109--120, 2001.

Y. Chen, G. Dong, J. Han, J. Pei, B. W. Wah, and J. Wang. “Online analytical processing

stream data: Is it feasible?”, DMKD'02.

Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang, “Multi-dimensional regression

analysis of time-series data streams”, VLDB'02.

P. Domingos and G. Hulten, “Mining high-speed data streams”, KDD'00.

M. Garofalakis, J. Gehrke, and R. Rastogi, “Querying and mining data streams: You only

get one look”, SIGMOD'02 (tutorial).

J. Gehrke, F. Korn, and D. Srivastava, “On computing correlated aggregates over

continuous data streams”, SIGMOD'01.

S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan, “Clustering data streams”,

FOCS'00.

G. Hulten, L. Spencer, and P. Domingos, “Mining time-changing data streams”, KDD'01.


www.cs.uiuc.edu/~hanj

Thank you !!!Thank you !!!









Download - Mining Unusual Patterns in Data Streams in Multi-Dimensional

Top Related