
Statistical Mining in Data Streams

Ankur Jain
Dissertation Defense

Computer Science, UC Santa Barbara

Committee:
Edward Y. Chang (chair)
Divyakant Agrawal
Yuan-Fang Wang

04/13/23 Statistical Mining in Data Streams 2

Roadmap
The Data Stream Model
  Introduction and research issues
  Related work
Data Stream Mining
  Stream data clustering
  Bayesian reasoning for sensor stream processing
Contribution Summary
Future work


Data Streams

“A data stream is an unbounded and continuous sequence of tuples.”

Tuples arrive online and can be multi-dimensional
A tuple seen once cannot be easily retrieved later
No control over the tuple arrival order


Applications – Sensor Networks
[Figure: a lagoon sensor network answering the query “Find the mean temperature of the lagoon in the last 3 hours.”]

Applications – Network Monitoring
[Figure: Internet traffic inspected online for anomalies and intrusions: DoS, PROBE, U2R?]

Applications – Text Processing
[Figure: clustering of blogs, email, and click streams.]

Applications
• Video surveillance
• Stock ticker monitoring
• Process control & manufacturing
• Traffic monitoring & analysis
• Transaction log processing

Traditional DBMS does not work!
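The sensor-network example above (“mean temperature of the lagoon in the last 3 hours”) illustrates why a one-pass, bounded-memory operator is needed instead of a traditional DBMS query. A minimal generic sketch of such a sliding-window aggregate (the `(timestamp, value)` tuple format and class name are illustrative assumptions, not part of any DSMS API):

```python
from collections import deque

class WindowedMean:
    """Mean over the last `width` time units of an online stream."""
    def __init__(self, width):
        self.width, self.window, self.total = width, deque(), 0.0

    def add(self, t, value):
        self.window.append((t, value))
        self.total += value
        # Evict tuples that have fallen out of the window: one pass, no replay.
        while self.window and self.window[0][0] <= t - self.width:
            self.total -= self.window.popleft()[1]

    def mean(self):
        return self.total / len(self.window) if self.window else None

w = WindowedMean(width=3 * 3600)            # 3 hours, in seconds
for t, temp in [(0, 20.0), (3600, 22.0), (7200, 24.0), (12600, 26.0)]:
    w.add(t, temp)
print(w.mean())  # only readings within (t - 3h, t] remain: (22+24+26)/3 = 24.0
```

Memory stays proportional to the window contents, never to the full unbounded stream.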


Data Stream Projects
STREAM (Stanford)
  A general-purpose Data Stream Management System (DSMS)
Telegraph (Berkeley)
  Adaptive query processing
  TinyDB: general-purpose sensor database
Aurora Project (Brown/MIT)
  Distributed stream processing
  Introduces new operators (map, drop, etc.)
The Cougar Project (Cornell)
  Sensors form a distributed database system
  Cross-layer optimizations (data management layer and the routing layer)
MAIDS (UIUC)
  Mining Alarming Incidents in Data Streams
  Streaminer: data stream mining


Data Stream Processing – Key Ingredients

Adaptivity
  Incorporate evolutionary changes in the stream
Approximation
  Exact results are hard to compute fast with limited memory


A Data Stream Management System (DSMS)

The Central Stream Processing System

[Figure: the DSMS performs query processing, resource management, and adaptive stream mining. User queries arrive with a query-precision constraint; streaming data sources/sensors feed the system through data acquisition, sensor calibration, data filtering, and data sampling; the system maintains stream synopses, tunes the sampling rate and the sliding-window size, and emits streaming query results.]


Thesis Outline
“Develop fast, online, statistical methods for mining data streams.”
  Adaptive non-linear clustering in multi-dimensional streams
  Bayesian reasoning for sensor stream processing
  Filtering methods for resource conservation
  Change detection in data streams
  Video sensor data stream processing


Roadmap
The Data Stream Model
  Introduction and research issues
  Related work
Data Stream Mining
  Stream data clustering
  Bayesian reasoning for sensor stream processing
Contribution Summary
Future work


Clustering in High-Dimensional Streams

“Given a continuous sequence of points, group them into some number of clusters, such that the members of a cluster are geometrically close to each other.”


Example Application – Network Monitoring

[Figure: high-dimensional connection tuples stream in from the Internet and must be classified online: DoS, Probe, or Normal?]


Stream Clustering – New Challenges
One-pass restriction and limited memory constraint
  Fading cluster technique proposed by Aggarwal et al.
Non-linear separation boundaries
  We propose using the kernel trick to deal with the non-linearity issue
Data dimensionality
  We propose an effective incremental dimension-reduction technique


Adaptive Non-linear Clustering – The 2-Tier Framework

[Figure: the latest point x received from the stream enters the 2-tier clustering module, which uses the kernel trick. Tier 1 performs stream segmentation on the d-dimensional input space; Tier 2 projects x̃ into a q-dimensional low-dimensional space (LDS), q < d, and updates the fading clusters (C1…C9) that reside there.]

The Fading Cluster Methodology
Each cluster Ci has a recency value Ri s.t. Ri = f(t − tlast), where
  t: current time
  tlast: last time Ci was updated
  f(t) = e^(−λt), λ: fading factor
A cluster is erased from memory (faded) when Ri ≤ h,
  where h is a user parameter that controls the influence of historical data
The total number of clusters is bounded
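The recency bookkeeping above can be sketched in a few lines. This is a minimal sketch: the cluster statistics are reduced to a center, and `lam` and `h` stand for the fading factor λ and the pruning threshold from the slide:

```python
import math

class FadingCluster:
    def __init__(self, center, t):
        self.center = center
        self.t_last = t  # last time this cluster was updated

    def recency(self, t, lam):
        # R_i = f(t - t_last) = e^{-lambda * (t - t_last)}
        return math.exp(-lam * (t - self.t_last))

def prune_faded(clusters, t, lam, h):
    # Erase (fade) clusters whose recency has dropped to h or below.
    return [c for c in clusters if c.recency(t, lam) > h]
```

Because recency decays exponentially and faded clusters are pruned, the number of live clusters stays bounded.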


Non-linearity in Data

Traditional clustering techniques (k-means) do not perform well
Spectral clustering methods are likely to perform better
Feature-space mapping φ: Input Space → Feature Space

[Figure: data that is non-linearly separable in the input space becomes separable in the feature space.]


Non-linearity in Network Intrusion Data

[Figure: “ipsweep” attack data. In the input space the trend is non-linear; in the feature space it is geometrically well-behaved.]

Use the kernel trick?


The Kernel Trick
Actual projection into the higher-dimensional space is computationally expensive
The kernel trick does the non-linear projection implicitly!

Given two input-space vectors x, y, the kernel function is
  k(x,y) = <φ(x), φ(y)>

Gaussian kernel function: k(x,y) = exp(−γ‖x−y‖²), used in the previous example!


Kernel Trick – Working Example
φ: x = (x1, x2) → φ(x) = (x1², x2², √2·x1x2)    (not required explicitly!)

<φ(x), φ(z)> = <(x1², x2², √2·x1x2), (z1², z2², √2·z1z2)>
             = x1²z1² + x2²z2² + 2·x1x2z1z2
             = (x1z1 + x2z2)²
             = <x, z>².

Hence k(x,z) = <x,z>².

“The kernel trick allows us to perform operations in the high-dimensional feature space using a kernel function, but without explicitly representing φ.”
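The identity in the worked example is easy to check numerically; a small self-contained sketch of the explicit map versus the implicit kernel:

```python
import math

def phi(x):
    # Explicit feature map: (x1, x2) -> (x1^2, x2^2, sqrt(2)*x1*x2)
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def k(x, z):
    # Polynomial kernel: k(x, z) = <x, z>^2, no explicit phi needed
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, 4.0)
# <phi(x), phi(z)> and k(x, z) agree: both equal (1*3 + 2*4)^2 = 121
assert abs(dot(phi(x), phi(z)) - k(x, z)) < 1e-9
```

The implicit side never materializes the three-dimensional feature vectors, which is exactly what makes kernel methods attractive for high-dimensional streams.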


Stream Clustering – New Challenges
One-pass restriction and limited memory constraint
  We use the fading cluster technique proposed by Aggarwal et al.
Non-linear separation boundaries
  We propose using kernel methods to deal with the non-linearity issue
Data dimensionality
  We propose an effective incremental dimension-reduction technique


Dimensionality Reduction
A PCA-like kernel method is desirable
  Explicit representation – EVD preferred
  KPCA is computationally prohibitive – O(n³)
  The principal components evolve with time – frequent EVD updates may be necessary
We propose to perform EVD on grouped data instead of point data
  Requires a novel kernel method



The 2-Tier Framework …
Tier 1 captures the temporal locality in a segment
  A segment is a group of contiguous points in the stream, geometrically packed closely in the feature space
Tier 2 adaptively selects segments to project data into the LDS
  Selected segments are called representative segments
  Implicit data in the feature space is projected explicitly into the LDS such that the feature-space distances are preserved


The 2-Tier Framework …

TIER 1:
1. Obtain a point x from the stream and add x to the current segment S.
2. If φ(x) is novel w.r.t. S and |S| > smin, or if |S| = smax, pass S to Tier 2; otherwise return to step 1.

TIER 2:
3. If S is a representative segment, add S to memory and update the LDS.
4. Obtain x̃, the projection of x in the LDS.
5. If x̃ is close to an active cluster, assign x̃ to its nearest cluster; otherwise create a new cluster with x̃.
6. Update cluster centers and recency values; delete faded clusters.
7. Clear the contents of S and return to step 1.
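The Tier 1 / Tier 2 flow above can be sketched as a single control loop. This is a schematic only: the novelty test, the LDS projection, and the cluster handling are plain-Euclidean stand-ins for the kernel-based machinery of the framework, and `SMIN`, `SMAX`, and the radii are illustrative values:

```python
import math

SMIN, SMAX = 3, 8          # segment-size bounds (illustrative values)
NOVELTY_RADIUS = 1.0       # stand-in for the feature-space novelty test
CLUSTER_RADIUS = 2.0       # stand-in for "close to an active cluster"

def is_novel(x, segment):
    # Stand-in novelty test on phi(x): Euclidean distance to the segment mean.
    mean = [sum(c) / len(segment) for c in zip(*segment)]
    return math.dist(x, mean) > NOVELTY_RADIUS

def process(stream):
    segment, clusters = [], []            # clusters: list of centers
    for x in stream:
        segment.append(x)                                       # Tier 1
        if (len(segment) > SMIN and is_novel(x, segment)) or len(segment) == SMAX:
            x_tilde = x                   # Tier 2: identity stand-in for the LDS projection
            near = min(clusters, key=lambda c: math.dist(c, x_tilde), default=None)
            if near is not None and math.dist(near, x_tilde) <= CLUSTER_RADIUS:
                pass                      # assign to nearest cluster (center update omitted)
            else:
                clusters.append(x_tilde)  # create a new cluster with x_tilde
            segment = []                  # clear S
    return clusters
```

Fading and recency maintenance (step 6) are omitted here; the point is only the segment-then-project control structure.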


Network Intrusion Stream

• Simulated data from MIT Lincoln Labs
• 34 continuous attributes (features)
• 10.5K records
• 22 types of intrusion attacks + 1 normal class


Network Intrusion Stream

Clustering accuracy at LDS dimensionality u=10


Efficiency - EVD Computations

Newswire data: 3.8K records, 16.5K features, 10 news topics
Image data: 5K records, 576 features, 10 digits


In Retrospect…
We proposed an effective stream clustering framework
We use the kernel trick to delineate non-linear boundaries efficiently
We use a stream-segmentation approach to continuously project data into a low-dimensional space


Roadmap
The Data Stream Model
  Introduction and research issues
  Related work
Contributions Towards Stream Mining
  Stream data clustering
  Bayesian reasoning for sensor stream processing
Contribution Summary
Future work


Bayesian Reasoning for Sensor Data Processing
Users submit queries with precision constraints, e.g. “Find the temperature with 80% confidence.”
Resource conservation is of prime concern to prolong system life
  Data acquisition
  Data communication
Use probabilistic models at the central site for approximate predictions, preventing actual acquisitions


Dependencies in Sensor Attributes

Attribute       Acquisition Cost
Temperature     50 J
Voltage         5 J

[Figure: a Bayesian-network dependency model links Temperature and Voltage. For the query “Get Temperature,” the system instead issues “Get Voltage” (5 J rather than 50 J) and reports Temperature inferred through the model.]


Using Correlation Models [Deshpande et al., VLDB’04]

Correlation models ignore conditional dependency
Intel Lab (real sensor-network data); attributes: Voltage (V), Temperature (T), Humidity (H)
  “voltage” is correlated with “temperature”
  yet “voltage” is conditionally independent of “temperature,” given “humidity”!

[Figure: scatter plots for the humidity bin [35–40).]


BN vs. Correlations

Bayesian Network
• Maintains vital dependencies only
• Lower search complexity O(n)
• Storage O(nd), d: avg. node degree
• Intuitive dependency structure

Correlation model [Deshpande et al.]
• Maintains all dependencies
• The search space for finding the best alternative sensor attribute is large
• Joint probability is represented in O(n²) cells

[Datasets: NDBC buoy and Intel Lab.]


Bayesian Networks (BN)
Qualitative part – Directed Acyclic Graph (DAG)
• Nodes – sensor attributes
• Edges – attribute influence relationships

Quantitative part – Conditional Probability Tables (CPTs)
• Each node X has its own CPT, P(X | parents(X))

Together, the BN represents the joint probability in factored form:
  P(T,H,V,L) = P(T) P(H|T) P(V|H) P(L|T)

The “influence relationship” is measured by the entropy function H:
  H(Xi) = −Σ_{l=1..k} P(Xi = xil) log P(Xi = xil)

We learn the BN by minimizing H(Xi | Parents(Xi)).
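The entropy quantities used for structure learning can be computed directly from empirical counts. A small sketch over toy samples (the attribute values are invented for illustration; natural-log entropy):

```python
import math
from collections import Counter

def entropy(samples):
    # H(X) = -sum_l P(X = x_l) log P(X = x_l), estimated from counts
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in Counter(samples).values())

def conditional_entropy(xs, parents):
    # Chain rule: H(X | Parents(X)) = H(X, Parents(X)) - H(Parents(X))
    joint = list(zip(xs, parents))
    return entropy(joint) - entropy(parents)

# Toy check: X fully determined by its parent -> H(X | parent) = 0
x      = ['hot', 'hot', 'cold', 'cold']
parent = ['high', 'high', 'low', 'low']
assert abs(conditional_entropy(x, parent)) < 1e-12
# An independent candidate parent removes no uncertainty: H(X | Y) = H(X)
y = ['a', 'b', 'a', 'b']
assert abs(conditional_entropy(x, y) - entropy(x)) < 1e-12
```

Minimizing H(Xi | Parents(Xi)) thus favors parents that actually determine Xi over irrelevant ones.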


System Architecture

[Figure: group queries Q with precision constraints, e.g. {(Temperature, 80%)}, {(Wind Speed, 75%)}, {(Temperature, 95%), (Wind Speed, 85%)}, {(Air Pressure, 90%), (Wind Speed, 90%)}, enter the query processor. Group-query plan generation and the Bayesian inference engine consult storage (CPTs, acquisition costs, and the BN over nodes X1…X6) to produce an acquisition plan; the sensor network returns the acquired values.]


Finding the Candidate Attributes
For any attribute in the group-query Q, analyze candidate attributes in the Markov blanket recursively
Selection criterion – select candidates in a greedy fashion, trading off:
  Acquisition cost
  Information gain (conditional entropy)
Goals:
  Meet precision constraints
  Maximize resource conservation
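The greedy selection step can be sketched as follows. The scoring (information gain per unit acquisition cost) and the attribute names and numbers are illustrative assumptions, not the dissertation's exact criterion:

```python
def select_candidates(query_attr, candidates, cost, info_gain, budget_entropy):
    """Greedily pick cheap, informative attributes until the remaining
    uncertainty about query_attr is driven down to the target."""
    remaining = budget_entropy            # entropy still to be removed
    chosen = []
    # Rank by information gain per unit acquisition cost (assumed criterion).
    for a in sorted(candidates, key=lambda a: info_gain[a] / cost[a], reverse=True):
        if remaining <= 0:
            break
        chosen.append(a)
        remaining -= info_gain[a]
    return chosen

cost = {'voltage': 5.0, 'humidity': 20.0, 'temperature': 50.0}
gain = {'voltage': 0.6, 'humidity': 0.9, 'temperature': 1.5}
# Ratios: voltage 0.12/J, humidity 0.045/J -> voltage first, then humidity
print(select_candidates('temperature', ['voltage', 'humidity'], cost, gain, 1.0))
```

Cheap proxies (here, voltage) are preferred whenever they carry enough information about the queried attribute, which is exactly how acquisitions are avoided.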


Experiments – Resource Conservation

NDBC dataset, 7 attributes

Effect of using group-queries (|Q|: group-query size)
Effect of using the MB property with δmin = 0.90


Results – Selectivity

[Figure: learned BN over the NDBC buoy attributes Wave Period (WP), Wind Speed (SP), Water Temperature (WT), Wind Direction (DR), Wave Height (WH), Air Temperature (AT), and Air Pressure (AP).]


In Retrospect…
Bayesian networks can encode the sensor dependencies effectively
Our method provides significant resource conservation for group-queries


Contribution Summary

“Adaptive Stream resource management using Kalman Filters.” [SIGMOD’04]

“Adaptive sampling for sensor networks.” [DMSN’04]

“Adaptive non-linear clustering for Data Streams.” [CIKM’06]

“Using stationary-dynamic camera assemblies for wide-area video surveillance and selective attention.” [CVPR’06]

“Filtering the data streams.” [in submission]

“Efficient diagnostic and aggregate queries on sensor networks.” [in submission]

“OCODDS: An On-line Change-Over Detection framework for tracking evolutionary changes in Data Streams.” [in submission]


Future Work
Develop non-linear techniques for capturing temporal correlations in data streams
The Bayesian framework can be extended to address “what-if” queries with counterfactual evidence
The clustering framework can be extended for developing stream visualization systems
Incremental EVD techniques can improve the performance further


Thank You !


BACKUP SLIDES!


Back to Stream Clustering
We propose a 2-tier stream clustering framework
  Tier 1: kernel method that continuously divides the stream into segments
  Tier 2: kernel method that uses the segments to project data into a low-dimensional space (LDS)
The fading clusters reside in the LDS


Clustering – LDS Projection


Clustering – LDS Update


Network Intrusion Stream

Clustering accuracy at LDS dimensionality u=10

Cluster strengths at LDS dimensionality u=10


Effect of dimensionality


Query Plan Generation
Given a group query, the query plan computes the “candidate attributes” that will actually be acquired to successfully address the query.
We exploit the “Markov Blanket (MB)” property to select candidate attributes.
Given a BN G, the Markov blanket MB(Xi) of a node Xi comprises its parents, its children, and its children's other parents.

P(Xi | G) = P(Xi | MB(Xi)) = P(Xi, MB(Xi)) / P(MB(Xi))
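For a DAG given as a parent map, the Markov blanket follows directly from its definition; a small sketch using the T/H/V/L attributes of the factored form on the earlier Bayesian-network slide (the dict encoding is an illustrative choice):

```python
def markov_blanket(node, parents):
    """parents: dict mapping each node to the set of its parents."""
    children = {c for c, ps in parents.items() if node in ps}
    # Co-parents: other parents of the node's children.
    co_parents = {p for c in children for p in parents[c]} - {node}
    return set(parents[node]) | children | co_parents

# DAG for P(T,H,V,L) = P(T) P(H|T) P(V|H) P(L|T):  T -> H -> V,  T -> L
dag = {'T': set(), 'H': {'T'}, 'V': {'H'}, 'L': {'T'}}
assert markov_blanket('H', dag) == {'T', 'V'}
assert markov_blanket('T', dag) == {'H', 'L'}
```

Restricting candidate attributes to this set is what keeps the plan-generation search space at O(n) rather than all-pairs.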


Exploiting the MB Property
“Given a node Xi and a set of arbitrary nodes Y in a BN s.t. MB(Xi) ⊆ Y ∪ {Xi}, the conditional entropy of Xi given Y is at least as high as that given its Markov blanket, i.e., H(Xi|Y) ≥ H(Xi|MB(Xi)).”

Proof: Separate MB(Xi) into two parts, MB1 = MB(Xi) ∩ Y and MB2 = MB(Xi) − MB1, and denote Z = Y − MB(Xi):

H(Xi|Y) = H(Xi|Z, MB1)         [Y = Z ∪ MB1]
        ≥ H(Xi|Z, MB1, MB2)    [additional information cannot increase entropy]
        = H(Xi|Z, MB(Xi))      [MB(Xi) = MB1 ∪ MB2]
        = H(Xi|MB(Xi))         [Markov-blanket definition]


Bayesian Reasoning – More Results…

Effect of using the MB property with δmin = 0.90
[Figure: query-answer quality loss; 50-node synthetic-data BN.]


Bayesian Reasoning for Group Queries

More accurate in addressing group queries:
  Q = { (Xi, δi) | Xi ∈ X ∧ (0 < δi ≤ 1) ∧ (1 ≤ i ≤ n) } s.t. δi < max_l P(Xi = xil)
  X = {X1, X2, X3, …, Xn}: sensor attributes
  δi: confidence parameters
  P(Xi = xil): probability with which Xi assumes the value xil

Bayesian reasoning is helpful in detecting abnormalities


Bayesian Reasoning – Candidate attribute selection algorithm
