Mining Unusual Patterns in Data Streams in Multi-
Dimensional Space
Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
May 10, 2010 Mining Unusual Patterns in Data Streams 2
Outline
Characteristics of data streams
Mining unusual patterns in data streams
Multi-dimensional regression analysis of data
streams
Stream cubing and stream OLAP methods
Mining other kinds of patterns in data streams
Research problems
May 10, 2010 Mining Unusual Patterns in Data Streams 3
Data Streams
Data Streams Data streams—continuous, ordered, changing, fast, huge amount
Traditional DBMS—data stored in finite, persistent data setsdata sets
Characteristics Huge volumes of continuous data, possibly infinite Fast changing and requires fast, real-time response Data stream captures nicely our data processing needs of today Random access is expensive—single linear scan algorithm (can
only have one look) Store only the summary of the data seen thus far Most stream data are at pretty low-level or multi-dimensional in
nature, needs multi-level and multi-dimensional processing
May 10, 2010 Mining Unusual Patterns in Data Streams 4
Stream Data Applications
Telecommunication calling records Business: credit card transaction flows Network monitoring and traffic engineering Financial market: stock exchange Engineering & industrial processes: power supply &
manufacturing Sensor, monitoring & surveillance: video streams Security monitoring Web logs and Web page click streams Massive data sets (even saved but random access is too
expensive)
May 10, 2010 Mining Unusual Patterns in Data Streams 5
Challenges of Stream Data Mining
Multiple, continuous, rapid, time-varying, ordered streams Main memory computation Mining queries are either continuous or ad-hoc Mining queries are often complex
Involving multiple streams, large amount of data, and history Finding patterns, models, anomaly, differences, …
Mining dynamics (changes, trends and evolutions) of data streams
Multi-level/multi-dimensional processing and data mining Most stream data are at pretty low-level or multi-dimensional in
nature
May 10, 2010 Mining Unusual Patterns in Data Streams 6
Stream Data Mining Tasks
Multi-dimensional (on-line) analysis of streams Clustering data streams Classification of data streams Mining frequent patterns in data streams Mining sequential patterns in data streams Mining partial periodicity in data streams Mining notable gradients in data streams Mining outliers and unusual patterns in data streams ……
May 10, 2010 Mining Unusual Patterns in Data Streams 7
Multi-Dimensional Stream Analysis: Examples
Analysis of Web click streams Raw data at low levels: seconds, web page addresses, user IP
addresses, … Analysts want: changes, trends, unusual patterns, at reasonable
levels of details E.g., Average clicking traffic in North America on sports in the last
15 minutes is 40% higher than that in the last 24 hours.”
Analysis of power consumption streams Raw data: power consumption flow for every household, every
minute Patterns one may find: average hourly power consumption
surges up 30% for manufacturing companies in Chicago in the last 2 hours today than that of the same day a week ago
May 10, 2010 Mining Unusual Patterns in Data Streams 8
A Key Step—Stream Data Reduction
Challenges of OLAPing stream data Raw data cannot be stored
Simple aggregates are not powerful enough
History shape and patterns at different levels are desirable: multi-
dimensional regression analysis
Proposal A scalable multi-dimensional stream “data cube” that can
aggregate regression model of stream data efficiently without
accessing the raw data
Stream data compression Compress the stream data to support memory- and time-efficient
multi-dimensional regression analysis
May 10, 2010 Mining Unusual Patterns in Data Streams 9
Regression Cube for Time-Series
Initially, one time-series per base cell Too costly to store all these time-series Too costly to compute regression at multi-dimensional space
Regression cube Base cube: only store regression parameters of base cells (e.g.,
2 points vs. 1000 points) All the upper level cuboids can be computed precisely for linear
regression on both standard dimensions and time dimensions For quadratic regression, we need 5 points In general, we need:
where k = 2 for quadratic.
,2
3
2
)1( 2 kkkkk
+=++
May 10, 2010 Mining Unusual Patterns in Data Streams 10
Basics of General Linear Regression
n tuples in one cell: (xi , yi), i =1..n, where yi is the measure attribute to be analyzed
For sample i , a vector of k user-defined predictors ui:
The linear regression model:
where η is a k × 1 vector of regression parameters
( )
( )
=
=
−− 1,
1
1
1
0
...
1
...
ki
i
ik
ii
u
u
u
u
u
x
xu
( ) 1,1110 ...| −−+++== kikiiT
ii uuyE ηηηη uu
May 10, 2010 Mining Unusual Patterns in Data Streams 11
Theory of General Linear Regression
Collect into the model matrix U
The ordinary least square (OLS) estimate of is the argument that minimizes the residue sum of squares function
Main theorem to determine the OLS regression parameters
iu kn×
=
−
−
−
1,21
1,22221
1,11211
...1.
.
.
.
.
.
.
.
.
....1
...1
knnn
k
k
uuu
uuu
uuu
U
∧η η
)()()( ηηRSS T UyUy −−=η
yUUU TTRSS 1)(0)( −∧
=⇒=∂∂ ηηη
May 10, 2010 Mining Unusual Patterns in Data Streams 12
Linearly Compressed Representation (LCR)
Stream data compression for multi-dimensional regression analysis
Define, for i, j = 0,…,k-1:
The linearly compressed representation (LCR) of one cell:
Size of LCR of one cell:
quadratic in k, independent of the number of tuples n in one cell
∑=
=n
hhjhiij uu
1
θ
},1,...,0,|{}1,...,0|{ jikjiki iji ≤−=−=∧
θη
,2
3
2
)1( 2 kkkkk
+=++
May 10, 2010 Mining Unusual Patterns in Data Streams 13
Matrix Form of LCR
LCR consists of and , where
and
where
provides OLS regression parameters essential for regression analysis
is an auxiliary matrix that facilitates aggregations of LCR in standard and regression dimensions in a data cube environment
LCR only stores the upper triangle of
),...,,( 110 −
∧∧∧∧= k
T
ηηηη∧η Θ
=
−−
−
−
−−− 1,1
1,1
1,0
2,11,10,1
121110
020100
.
.
...
...
...
.
.
.
.
.
....
...
kk
k
k
kkk θ
θθ
θθθ
θθθθθθ
Θ
∧ηΘ
⇒= ΘΘT Θ
May 10, 2010 Mining Unusual Patterns in Data Streams 14
Aggregation in Standard Dimensions
Given LCR of m cells that differ in one standard dimension, what is the LCR of the cell aggregated in that dimension? for m base cells
for an aggregated cell
The lossless aggregation formula
),(,...),,(),,( 222111 mmmLCRLCRLCR ΘΘΘ∧∧∧
=== ηηη
),( aaaLCR Θ∧
= η
1
1
ΘΘ =
=∑=
∧∧
a
m
iia ηη
May 10, 2010 Mining Unusual Patterns in Data Streams 15
Stock Price Example—Aggregation in Standard Dimensions
Simple linear regression on time series dataCells of two companies
After aggregation:
May 10, 2010 Mining Unusual Patterns in Data Streams 16
Aggregation in Regression Dimensions
Given LCR of m cells that differ in one regression dimension, what is the LCR of the cell aggregated in that dimension
for m base cells
for the aggregated cell
The lossless aggregation formula
),(,...),,(),,( 222111 mmmLCRLCRLCR ΘΘΘ∧∧∧
=== ηηη
),( aaaLCR Θ∧
= η
∑
∑∑
=
=
∧−
=
∧
=
=
m
iia
m
iii
m
iia
1
1
1
1
ΘΘ
ΘΘ ηη
May 10, 2010 Mining Unusual Patterns in Data Streams 17
Stock Price Example—Aggregation in Time Dimension
Cells of two adjacent time intervals:
After aggregation
May 10, 2010 Mining Unusual Patterns in Data Streams 18
Feasibility of Stream Regression Analysis
Efficient storage and scalable (independent of the
number of tuples in data cells)
Lossless aggregation without accessing the raw data
Fast aggregation: computationally efficient
Regression models of data cells at all levels
General results: covered a large and the most popular
class of regression
Including quadratic, polynomial, and nonlinear models
May 10, 2010 Mining Unusual Patterns in Data Streams 19
A Stream Cube Architecture
A tilted time frame Different time granularities
second, minute, quarter, hour, day, week, …
Critical layers Minimum interest layer (m-layer) Observation layer (o-layer) User: watches at o-layer and occasionally needs to drill-down
down to m-layer
Partial materialization of stream cubes Full materialization: too space and time consuming No materialization: slow response at query time Partial materialization: what do we mean “partial”?
May 10, 2010 Mining Unusual Patterns in Data Streams 20
A Tilted Time-Frame Model
31 days 24 hours 4 qtrs12 months
Time Now
24hrs 4qtrs 15minutes7 days
Time Now
25sec.Up to 7 days:
Up to a year:
Logarithmic (exponential) scale:4t 2t 1t
Time Now
4t8t16t
May 10, 2010 Mining Unusual Patterns in Data Streams 21
Benefits of Tilted Time-Frame Model
Each cell stores the measures according to tilt-time-frame
Limited memory space: Impossible to store the history in full scale
Emphasis more on recent data
Most applications emphasize on recent data (slide window)
Natural partition on different time granularities
Putting different weights on remote data
Useful even for uniform weight
Tilted time-frame forms a new time dimension
for mining changes and evolutions
Essential for mining unusual patterns or outliers
Finding those with dramatic changes
E.g., exceptional stocks—not following the trends
May 10, 2010 Mining Unusual Patterns in Data Streams 22
Two Critical Layers in the Stream Cube
(*, theme, quarter)
(user-group, URL-group, minute)
m-layer (minimal interest)
(individual-user, URL, second)
(primitive) stream data layer
o-layer (observation)
May 10, 2010 Mining Unusual Patterns in Data Streams 23
On-Line Materialization vs. On-Line Computation
On-line materialization
Materialization takes precious resources and time
Only incremental materialization (with slide window)
Only materialize “cuboids” of the critical layers?
Some intermediate cells that should be materialized
Popular path approach vs. exception cell approach
Materialize intermediate cells along the popular paths
Exception cells: how to set up exception thresholds?
Notice exceptions do not have monotonic behaviour
Computation problem
How to compute and store stream cubes efficiently?
How to discover unusual cells between the critical layer?
May 10, 2010 Mining Unusual Patterns in Data Streams 24
Stream Cube Structure: from m-layer to o-layer
(A1, *, C1)
(A1, *, C2) (A1, *, C2)
(A1, *, C2)
(A1, B1, C2) (A1, B2, C1)
(A2, *, C2)
(A2, B1, C1)
(A1, B2, C2)
(A2, B1, C2)
A2, B2, C1)
(A2, B2, C2)
May 10, 2010 Mining Unusual Patterns in Data Streams 25
Stream Cube Computation
Cube structure from m-layer to o-layer
Three approaches All cuboids approach
Materializing all cells (too much in both space and time)
Exceptional cells approach
Materializing only exceptional cells (saves space but not time
to compute and definition of exception is not flexible)
Popular path approach
Computing and materializing cells only along a popular path
Using H-tree structure to store computed cells (which form the
stream cube—a selectively materialized cube)
May 10, 2010 Mining Unusual Patterns in Data Streams 26
An H-Tree Cubing Structure
root
entertainment sports politics
uic uiuc uic uiuc
jeff mary jeff Jim
Q.I.Q.I. Q.I.
Regression:
Sum: xxxxCnt: yyyy
Quant-Info
Observation layer
Minimal int. layer
May 10, 2010 Mining Unusual Patterns in Data Streams 27
Partial Materialization Using H-Tree
H-tree: Introduced for computing data cubes and iceberg cubes
J. Han, J. Pei, G. Dong, and K. Wang, “Efficient Computation
of Iceberg Cubes with Complex Measures”, SIGMOD'01
Compressed database, fast cubing, and space preserving in cube
computation
Using H-tree for partial stream cubing Space preserving:
Intermediate aggregates can be computed incrementally and
saved in tree nodes
Facilitate computing other cells and multi-dimensional analysis
H-tree with computed cells can be viewed as “stream cube”
May 10, 2010 Mining Unusual Patterns in Data Streams 28
1
10
100
1000
25 50 100 200 400
Size (in K tuples)
Run
tim
e (i
n se
cond
s)
popular-pathexception-cellsall-cubing
0
200
400
600
25 50 100 200 400
Size (in K tuples)
Mem
ory
usag
e (i
n M
B)
popular-path
exception-cells
all-cubing
a) Time vs. m-layer size b) Space vs. m-layer size
Time and Space vs. Number of Tuples at the m-Layer (Dataset D3L3C10T400K)
May 10, 2010 Mining Unusual Patterns in Data Streams 29
1
10
100
1000
10000
3 4 5 6 7
Number of levels
Run
tim
e (i
n se
cond
s)
popular-pathexception-cellsall-cubing
1
10
100
1000
10000
3 4 5 6 7
Number of levels
Mem
ory
usag
e (i
n M
B)
popular-pathexception-cellsall-cubing
Time and Space vs. the Number of Levels
a) Time vs. # levels b) Space vs. # levels
May 10, 2010 Mining Unusual Patterns in Data Streams 30
Mining Other Unusual Patterns in Stream Data
Clustering and outlier analysis for stream mining
Clustering data streams (Guha, Motwani et al. 2000-2002)
History-sensitive, high-quality incremental clustering
Classification of stream data
Evolution of decision trees: Domingos et al. (2000, 2001)
Incremental integration of new streams in decision-tree induction
Frequent pattern analysis
Approximate frequent patterns (Manku & Motwani VLDB’02)
Evolution and dramatic changes of frequent patterns
May 10, 2010 Mining Unusual Patterns in Data Streams 31
Conclusions
Stream data mining: A rich and largely unexplored field Current research focus in database community:
DSMS system architecture, continuous query processing, supporting mechanisms
Stream data mining and stream OLAP analysis Powerful tools for finding general and unusual patterns Effectiveness, efficiency and scalability: lots of open problems
Our philosophy: A multi-dimensional stream analysis framework
Time is a special dimension: tilted time frame What to compute and what to save?—Critical layers Very partial materialization/precomputation: popular path approach Mining dynamics of stream data
May 10, 2010 Mining Unusual Patterns in Data Streams 32
References
B. Babcock, S. Babu, M. Datar, R. Motawani, and J. Widom, “Models and issues in data
stream systems”, PODS'02 (tutorial).
S. Babu and J. Widom, “Continuous queries over data streams”, SIGMOD Record,
30:109--120, 2001.
Y. Chen, G. Dong, J. Han, J. Pei, B. W. Wah, and J. Wang. “Online analytical processing
stream data: Is it feasible?”, DMKD'02.
Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang, “Multi-dimensional regression
analysis of time-series data streams”, VLDB'02.
P. Domingos and G. Hulten, “Mining high-speed data streams”, KDD'00.
M. Garofalakis, J. Gehrke, and R. Rastogi, “Querying and mining data streams: You only
get one look”, SIGMOD'02 (tutorial).
J. Gehrke, F. Korn, and D. Srivastava, “On computing correlated aggregates over
continuous data streams”, SIGMOD'01.
S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan, “Clustering data streams”,
FOCS'00.
G. Hulten, L. Spencer, and P. Domingos, “Mining time-changing data streams”, KDD'01.
May 10, 2010 Mining Unusual Patterns in Data Streams 33
www.cs.uiuc.edu/~hanj
Thank you !!!Thank you !!!