
Page 1: defense_aj.ppt

Statistical Mining in Data Streams

Ankur Jain
Dissertation Defense

Computer Science, UC Santa Barbara

Committee:
Edward Y. Chang (chair)
Divyakant Agrawal
Yuan-Fang Wang

Page 2: defense_aj.ppt

04/13/23 Statistical Mining in Data Streams 2

Roadmap

The Data Stream Model
• Introduction and research issues
• Related work

Data Stream Mining
• Stream data clustering
• Bayesian reasoning for sensor stream processing

Contribution Summary

Future work

Page 3: defense_aj.ppt


Data Streams

“A data stream is an unbounded and continuous sequence of tuples.”

Tuples arrive online and can be multi-dimensional
A tuple seen once cannot be easily retrieved later
No control over the tuple arrival order

Page 4: defense_aj.ppt


Applications – Sensor Networks
• "Find the mean temperature of the lagoon in the last 3 hours"

Applications – Network Monitoring
• Detect anomalies and intrusions (DoS, PROBE, U2R?) in traffic from the INTERNET

Applications – Text Processing
• Blogs
• Email
• Click-stream clustering

Applications
• Video surveillance
• Stock ticker monitoring
• Process control & manufacturing
• Traffic monitoring & analysis
• Transaction log processing

A traditional DBMS does not work!

Page 5: defense_aj.ppt


Data Stream Projects

STREAM (Stanford)
• A general-purpose Data Stream Management System (DSMS)

Telegraph (Berkeley)
• Adaptive query processing
• TinyDB: general-purpose sensor database

Aurora Project (Brown/MIT)
• Distributed stream processing
• Introduces new operators (map, drop, etc.)

The Cougar Project (Cornell)
• Sensors form a distributed database system
• Cross-layer optimizations (data management layer and the routing layer)

MAIDS (UIUC)
• Mining Alarming Incidents in Data Streams
• Streaminer: data stream mining

Page 6: defense_aj.ppt


Data Stream Processing – Key Ingredients

Adaptivity
• Incorporate evolutionary changes in the stream

Approximation
• Exact results are hard to compute fast with limited memory

Page 7: defense_aj.ppt


A Data Stream Management System (DSMS)

[Architecture figure] The central stream processing system performs query processing, resource management, and adaptive stream mining. It accepts user queries, receives data from streaming sources/sensors, maintains a stream synopsis, and emits streaming query results. Feedback to the sources tunes query precision, sampling rate, and sliding-window size; the sensor side performs data filtering, data sampling, sensor calibration, and data acquisition.

Page 8: defense_aj.ppt


Thesis Outline

"Develop fast, online, statistical methods for mining data streams."

• Adaptive non-linear clustering in multi-dimensional streams
• Bayesian reasoning for sensor stream processing
• Filtering methods for resource conservation
• Change detection in data streams
• Video sensor data stream processing

Page 9: defense_aj.ppt


Roadmap

The Data Stream Model
• Introduction and research issues
• Related work

Data Stream Mining
• Stream data clustering
• Bayesian reasoning for sensor stream processing

Contribution Summary

Future work

Page 10: defense_aj.ppt


Clustering in High-Dimensional Streams

“Given a continuous sequence of points, group them into some number of clusters, such that the members of a cluster are geometrically close to each other.”

Page 11: defense_aj.ppt


Example Application – Network Monitoring

[Figure] High-dimensional connection tuples arrive from the INTERNET and must be labeled online: DoS, Probe, or Normal?

Page 12: defense_aj.ppt


Stream Clustering – New Challenges

One-pass restriction and limited memory constraint
• Fading cluster technique proposed by Aggarwal et al.

Non-linear separation boundaries
• We propose using the kernel trick to deal with the non-linearity issue

Data dimensionality
• We propose an effective incremental dimension-reduction technique

Page 13: defense_aj.ppt


The 2-Tier Framework

[Figure: Adaptive Non-linear Clustering] The latest point x received from the stream enters the 2-tier clustering module (which uses the kernel trick). Tier 1 performs stream segmentation on the d-dimensional input space; Tier 2 handles LDS projection and update, maintaining the fading clusters (C1–C9) in a q-dimensional low-dimensional space (LDS), q < d.

Page 14: defense_aj.ppt


The Fading Cluster Methodology

Each cluster Ci has a recency value Ri s.t. Ri = f(t − tlast), where t is the current time and tlast is the last time Ci was updated, with f(t) = e^(−λt), λ being the fading factor. A cluster is erased from memory (faded) when Ri ≤ h, where h is a user parameter; λ controls the influence of historical data. The total number of clusters is bounded.
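The recency rule above can be sketched in a few lines of Python; the class and parameter names (FadingCluster, lam for λ, h) are illustrative stand-ins, not the dissertation's actual implementation:

```python
import math

class FadingCluster:
    """Minimal cluster carrying only what the fading rule needs."""
    def __init__(self, center, t):
        self.center = center
        self.t_last = t                      # last time this cluster was updated

    def recency(self, t, lam):
        # R_i = f(t - t_last) with f(t) = e^(-lam * t); lam is the fading factor
        return math.exp(-lam * (t - self.t_last))

def prune_faded(clusters, t, lam, h):
    """Erase (fade) every cluster whose recency has dropped to h or below."""
    return [c for c in clusters if c.recency(t, lam) > h]
```

Because a cluster must be updated at least once every ln(1/h)/λ time units to survive, the number of live clusters stays bounded, which is the point of the methodology.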

Page 15: defense_aj.ppt


Non-linearity in Data

Traditional clustering techniques (k-means) do not perform well; spectral clustering methods are likely to perform better.

Feature-space mapping: [Figure: Input Space → Feature Space]

Page 16: defense_aj.ppt


Non-linearity in Network Intrusion Data

[Figure: Input Space vs. Feature Space, "ipsweep" attack data] The feature-space view shows a geometrically well-behaved trend. Use the kernel trick?

Page 17: defense_aj.ppt


The Kernel Trick

Actual projection Φ into a higher dimension is computationally expensive. The kernel trick does the non-linear projection implicitly!

Given two input-space vectors x, y, the kernel function is
k(x, y) = ⟨Φ(x), Φ(y)⟩

Gaussian kernel function: k(x, y) = exp(−‖x − y‖²), used in the previous example!

Page 18: defense_aj.ppt


Kernel Trick – Working Example

Φ: x = (x1, x2) → Φ(x) = (x1², x2², √2·x1x2)

⟨Φ(x), Φ(z)⟩ = ⟨(x1², x2², √2·x1x2), (z1², z2², √2·z1z2)⟩
             = x1²z1² + x2²z2² + 2·x1x2z1z2
             = (x1z1 + x2z2)²
             = ⟨x, z⟩².

k(x, z) = ⟨x, z⟩² – Φ is not required explicitly!

"The kernel trick allows us to perform operations in a high-dimensional feature space using a kernel function, but without explicitly representing Φ."
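The identity above is easy to check numerically; a minimal sketch with the explicit map Φ and the kernel k(x, z) = ⟨x, z⟩²:

```python
import math

def phi(x):
    # Explicit feature map: (x1, x2) -> (x1^2, x2^2, sqrt(2)*x1*x2)
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

def k(x, z):
    # Kernel trick: the same inner product, without ever computing phi
    return (x[0] * z[0] + x[1] * z[1]) ** 2

x, z = (1.0, 2.0), (3.0, 4.0)
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))
assert abs(explicit - k(x, z)) < 1e-9    # both give <x,z>^2 = 121
```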

Page 19: defense_aj.ppt


Stream Clustering – New Challenges

One-pass restriction and limited memory constraint
• We use the fading cluster technique proposed by Aggarwal et al.

Non-linear separation boundaries
• We propose using kernel methods to deal with the non-linearity issue

Data dimensionality
• We propose an effective incremental dimension-reduction technique

Page 20: defense_aj.ppt


Dimensionality Reduction

• A PCA-like kernel method is desirable; for explicit representation, EVD is preferred
• KPCA is computationally prohibitive – O(n³)
• The principal components evolve with time – frequent EVD updates may be necessary
• We propose to perform EVD on grouped data instead of point data; this requires a novel kernel method

Page 21: defense_aj.ppt


The 2-Tier Framework

[Figure: Adaptive Non-linear Clustering, repeated from Page 13] The latest point x received from the stream enters the 2-tier clustering module (which uses the kernel trick). Tier 1 performs stream segmentation on the d-dimensional input space; Tier 2 handles LDS projection and update, maintaining the fading clusters (C1–C9) in a q-dimensional low-dimensional space (LDS), q < d.

Page 22: defense_aj.ppt


The 2-Tier Framework …

Tier 1 captures the temporal locality in a segment
• A segment is a group of contiguous points in the stream, geometrically packed closely in the feature space

Tier 2 adaptively selects segments to project data into the LDS
• Selected segments are called representative segments
• Implicit data in the feature space is projected explicitly into the LDS such that the feature-space distances are preserved

Page 23: defense_aj.ppt


The 2-Tier Framework …

[Flowchart]

TIER 1:
1. Obtain a point x from the stream and add x to S.
2. If (Φ(x) is novel w.r.t. S and s > smin) or s = smax, hand S to Tier 2; otherwise return to step 1.

TIER 2:
3. If S is a representative segment, add S to memory and update the LDS.
4. Obtain x̃ in the LDS.
5. If x̃ is close to an active cluster, assign x̃ to its nearest cluster; otherwise create a new cluster with x̃.
6. Update cluster centers and recency values, delete faded clusters, clear the contents of S, and return to step 1.
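One plausible reading of the flowchart as code; every helper (is_novel, is_representative, project_to_lds, update_lds, assign_or_create) is a hypothetical stand-in for the kernel-based machinery described on the surrounding slides, not the dissertation's actual API:

```python
def two_tier(stream, s_min, s_max, is_novel, is_representative,
             project_to_lds, update_lds, clusters, assign_or_create):
    """Sketch of the 2-tier loop: Tier 1 cuts the stream into segments,
    Tier 2 projects completed segments into the LDS and clusters there."""
    segment = []                                   # S
    for x in stream:                               # Tier 1
        segment.append(x)
        s = len(segment)
        if (is_novel(x, segment) and s > s_min) or s == s_max:
            if is_representative(segment):         # Tier 2
                update_lds(segment)                # add S to memory, update LDS
            for y in segment:
                y_lds = project_to_lds(y)          # obtain x~ in the LDS
                assign_or_create(clusters, y_lds)  # nearest cluster or new one
            segment = []                           # clear contents of S
    return clusters
```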

Page 24: defense_aj.ppt


Network Intrusion Stream

• Simulated data from MIT Lincoln Labs
• 34 continuous attributes (features)
• 10.5K records
• 22 types of intrusion attacks + 1 normal class

Page 25: defense_aj.ppt


Network Intrusion Stream

Clustering accuracy at LDS dimensionality u=10

Page 26: defense_aj.ppt


Efficiency - EVD Computations

Newswire data: 3.8K records, 16.5K features, 10 news topics
Image data: 5K records, 576 features, 10 digits

Page 27: defense_aj.ppt


In Retrospect…

• We proposed an effective stream clustering framework
• We use the kernel trick to delineate non-linear boundaries efficiently
• We use a stream-segmentation approach to continuously project data into a low-dimensional space

Page 28: defense_aj.ppt


Roadmap

The Data Stream Model
• Introduction and research issues
• Related work

Contributions Towards Stream Mining
• Stream data clustering
• Bayesian reasoning for sensor stream processing

Contribution Summary

Future work

Page 29: defense_aj.ppt


Bayesian Reasoning for Sensor Data Processing

Users submit queries with precision constraints, e.g. "Find the temperature with 80% confidence".

Resource conservation is of prime concern to prolong system life:
• Data acquisition
• Data communication

Use probabilistic models at the central site for approximate predictions, preventing actual acquisitions.

Page 30: defense_aj.ppt


Dependencies in Sensor Attributes

[Figure: Bayesian Networks] Query: "Get Temperature"

Attribute      Acquisition Cost
Temperature    50 J
Voltage        5 J

Dependency model: Temperature and Voltage are dependent, so issue "Get Voltage" instead. Acquire Voltage (5 J) and use the model to report Temperature!

Page 31: defense_aj.ppt


Using Correlation Models [Deshpande et al., VLDB'04]

Correlation models ignore conditional dependency.

[Figure: Intel Lab (real sensor-network data); attributes: Voltage (V), Temperature (T), Humidity (H); humidity bin [35–40)]
• "Voltage" is correlated with "temperature"
• "Voltage" is conditionally independent of "temperature", given "humidity"!

Page 32: defense_aj.ppt


BN vs. Correlations

[Plots: NDBC Buoy dataset, Intel Lab dataset]

Bayesian Network:
• Maintains vital dependencies only
• Lower search complexity: O(n)
• Storage O(nd), d: avg. node degree
• Intuitive dependency structure

Correlation model [Deshpande et al.]:
• Maintains all dependencies
• Search space for finding the best alternative sensor attribute is large
• Joint probability is represented in O(n²) cells

Page 33: defense_aj.ppt


Bayesian Networks (BN)

Qualitative part – Directed Acyclic Graph (DAG)
• Nodes – sensor attributes
• Edges – attribute influence relationships

Quantitative part – Conditional Probability Table (CPT)
• Each node X has its own CPT, P(X | parents(X))

Together, the BN represents the joint probability in factored form:
P(T,H,V,L) = P(T) P(H|T) P(V|H) P(L|T)

The "influence relationship" is quantified by the entropy function H:
H(Xi) = − Σ_{l=1..k} P(Xi = xil) log P(Xi = xil)

We learn the BN by minimizing H(Xi | parents(Xi)).
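The entropy quantities used for structure learning can be estimated from data; a minimal sketch, assuming a simple count-based estimator (the dissertation may estimate CPTs differently):

```python
import math
from collections import Counter

def entropy(samples):
    """H(X) = -sum_l P(X = x_l) log P(X = x_l), with P estimated by counting."""
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in Counter(samples).values())

def conditional_entropy(xs, parents):
    """H(X | parents) = H(X, parents) - H(parents) (chain rule)."""
    return entropy(list(zip(xs, parents))) - entropy(parents)
```

Structure learning then amounts to choosing, for each Xi, the parent set that minimizes conditional_entropy(Xi, parents).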

Page 34: defense_aj.ppt


System Architecture

[Figure] Group queries Q, e.g. {(Wind Speed, 75%)}, {(Temperature, 95%), (Wind Speed, 85%)}, {(Air Pressure, 90%), (Wind Speed, 90%)}, {(Temperature, 80%)}, enter the Query Processor. The Group-query Plan Generation module consults the Bayesian Inference Engine and storage (the BN over attributes X1–X6, its CPTs, and per-attribute acquisition costs) to produce an acquisition plan; acquired values return from the sensor network.

Page 35: defense_aj.ppt


Finding the Candidate Attributes

For any attribute in the group query Q, analyze candidate attributes in its Markov blanket recursively.

Selection criterion – select candidates in a greedy fashion based on:
• Acquisition cost
• Information gain (conditional entropy)

Goals:
• Meet precision constraints
• Maximize resource conservation
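The greedy criterion can be sketched as ranking candidates by information gain per unit acquisition cost; the meets_precision callback and all numbers in the example are hypothetical, not the dissertation's actual algorithm:

```python
def greedy_candidates(candidates, cost, gain, meets_precision):
    """Acquire attributes in decreasing gain/cost order until the
    group query's precision constraints are met."""
    acquired = []
    for a in sorted(candidates, key=lambda a: gain[a] / cost[a], reverse=True):
        if meets_precision(acquired):
            break
        acquired.append(a)
    return acquired

# Hypothetical example: Voltage is far cheaper per unit of information,
# so it is acquired first (cf. the Temperature/Voltage slide).
plan = greedy_candidates(
    ["Temperature", "Voltage"],
    cost={"Temperature": 50.0, "Voltage": 5.0},   # Joules
    gain={"Temperature": 1.0, "Voltage": 0.9},    # made-up information gains
    meets_precision=lambda acq: len(acq) >= 1,
)
assert plan == ["Voltage"]
```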

Page 36: defense_aj.ppt


Experiments – Resource Conservation

NDBC dataset, 7 attributes

[Plots] Effect of using group queries (|Q|: group-query size); effect of using the MB property with δmin = 0.90

Page 37: defense_aj.ppt


Results – Selectivity

[Figure: BN over the buoy attributes Wave Period (WP), Wind Speed (SP), Water Temperature (WT), Wind Direction (DR), Wave Height (WH), Air Temperature (AT), Air Pressure (AP)]

Page 38: defense_aj.ppt


In Retrospect…

• Bayesian networks can encode sensor dependencies effectively
• Our method provides significant resource conservation for group queries

Page 39: defense_aj.ppt


Contribution Summary

• "Adaptive Stream resource management using Kalman Filters." [SIGMOD'04]
• "Adaptive sampling for sensor networks." [DMSN'04]
• "Adaptive non-linear clustering for Data Streams." [CIKM'06]
• "Using stationary-dynamic camera assemblies for wide-area video surveillance and selective attention." [CVPR'06]
• "Filtering the data streams." [in submission]
• "Efficient diagnostic and aggregate queries on sensor networks." [in submission]
• "OCODDS: An On-line Change-Over Detection framework for tracking evolutionary changes in Data Streams." [in submission]

Page 40: defense_aj.ppt


Future Work

• Develop non-linear techniques for capturing temporal correlations in data streams
• The Bayesian framework can be extended to address "what-if" queries with counterfactual evidence
• The clustering framework can be extended for developing stream visualization systems
• Incremental EVD techniques can improve performance further

Page 41: defense_aj.ppt


Thank You !

Page 42: defense_aj.ppt


BACKUP SLIDES!

Page 43: defense_aj.ppt


Back to Stream Clustering

We propose a 2-tier stream clustering framework:
• Tier 1: a kernel method that continuously divides the stream into segments
• Tier 2: a kernel method that uses the segments to project data into a low-dimensional space (LDS)

The fading clusters reside in the LDS.

Page 44: defense_aj.ppt


Clustering – LDS Projection

Page 45: defense_aj.ppt


Clustering – LDS Update

Page 46: defense_aj.ppt


Network Intrusion Stream

Clustering accuracy at LDS dimensionality u=10

Cluster strengths at LDS dimensionality u=10

Page 47: defense_aj.ppt


Effect of dimensionality

Page 48: defense_aj.ppt


Query Plan Generation

Given a group query, the query plan computes the "candidate attributes" that will actually be acquired to successfully address the query. We exploit the Markov Blanket (MB) property to select candidate attributes.

Given a BN G, the Markov blanket MB(Xi) of a node Xi comprises its immediate parents and children, and

P(Xi | G) = P(Xi | MB(Xi)) = P(Xi, MB(Xi)) / P(MB(Xi))

Page 49: defense_aj.ppt


Exploiting the MB Property

"Given a node Xi and an arbitrary set of nodes Y in a BN with Xi ∉ Y, the conditional entropy of Xi given Y is at least as high as that given its Markov blanket: H(Xi | Y) ≥ H(Xi | MB(Xi))."

Proof: Split MB(Xi) into MB1 = MB(Xi) ∩ Y and MB2 = MB(Xi) − MB1, and let Z = Y − MB(Xi). Then:

H(Xi | Y) = H(Xi | Z, MB1)        [Y = Z ∪ MB1]
          ≥ H(Xi | Z, MB1, MB2)   [additional information cannot increase entropy]
          = H(Xi | Z, MB(Xi))     [MB(Xi) = MB1 ∪ MB2]
          = H(Xi | MB(Xi))        [Markov-blanket definition: Xi ⊥ Z | MB(Xi)]
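The inequality also holds for empirical distributions, so it can be sanity-checked on simulated data from a chain X1 → X2 → X3, where MB(X2) = {X1, X3}; the simulation and noise levels are my own illustration, not from the dissertation:

```python
import math
import random
from collections import Counter

def H(samples):
    # Empirical entropy of a sample list
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in Counter(samples).values())

def cond_H(xs, ys):
    # H(X|Y) = H(X, Y) - H(Y)
    return H(list(zip(xs, ys))) - H(ys)

random.seed(0)
x1 = [random.random() < 0.5 for _ in range(20000)]
x2 = [a ^ (random.random() < 0.1) for a in x1]   # noisy copy of X1
x3 = [b ^ (random.random() < 0.1) for b in x2]   # noisy copy of X2

# H(X2 | Y) >= H(X2 | MB(X2)) with Y = {X1} and MB(X2) = {X1, X3}
assert cond_H(x2, x1) >= cond_H(x2, list(zip(x1, x3))) - 1e-9
```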

Page 50: defense_aj.ppt


Bayesian Reasoning – More Results…

[Plots] Effect of using the MB property with δmin = 0.90; query-answer quality loss (50-node synthetic-data BN)

Page 51: defense_aj.ppt


Bayesian Reasoning for Group Queries

More accurate in addressing group queries:

Q = { (Xi, δi) | Xi ∈ X ∧ (0 < δi ≤ 1) ∧ (1 ≤ i ≤ n) }, answerable when δi < max_l P(Xi = xil)

X = {X1, X2, X3, …, Xn}: sensor attributes
δi: confidence parameters
P(Xi = xil): probability with which Xi assumes the value xil

Bayesian reasoning is helpful in detecting abnormalities.
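Answering such a query from the inference engine's posteriors can be sketched as follows; the posterior structure, attribute names, and probabilities are hypothetical illustrations:

```python
def answer_or_acquire(query, posterior):
    """For each (attribute, delta) pair: report the most probable value when
    its probability exceeds delta, otherwise flag the attribute for acquisition."""
    answers, to_acquire = {}, []
    for attr, delta in query:
        value, p = max(posterior[attr].items(), key=lambda kv: kv[1])
        if p > delta:
            answers[attr] = (value, p)   # confident enough: no acquisition
        else:
            to_acquire.append(attr)      # model too uncertain: acquire
    return answers, to_acquire

answers, to_acquire = answer_or_acquire(
    [("Temperature", 0.80), ("Wind Speed", 0.90)],
    posterior={
        "Temperature": {"20-25C": 0.85, "25-30C": 0.15},
        "Wind Speed": {"low": 0.60, "high": 0.40},
    },
)
assert answers == {"Temperature": ("20-25C", 0.85)}
assert to_acquire == ["Wind Speed"]
```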

Page 52: defense_aj.ppt


Bayesian Reasoning – Candidate attribute selection algorithm