defense_aj.ppt
TRANSCRIPT
Statistical Mining in Data Streams
Ankur Jain
Dissertation Defense
Computer Science, UC Santa Barbara
Committee: Edward Y. Chang (chair), Divyakant Agrawal, Yuan-Fang Wang
Roadmap
• The Data Stream Model: introduction and research issues; related work
• Data Stream Mining: stream data clustering; Bayesian reasoning for sensor stream processing
• Contribution Summary
• Future Work
Data Streams
“A data stream is an unbounded and continuous sequence of tuples.”
• Tuples arrive online and can be multi-dimensional
• A tuple seen once cannot be easily retrieved later
• There is no control over the tuple arrival order
Applications – Sensor Networks
Example query: “Find the mean temperature of the lagoon in the last 3 hours.”

Applications – Network Monitoring
[Figure: connection streams from the Internet are monitored for anomalies and intrusions (DoS, PROBE, U2R).]

Applications – Text Processing
[Figure: blogs and click-stream clustering.]

Applications
• Video surveillance
• Stock ticker monitoring
• Process control and manufacturing
• Traffic monitoring and analysis
• Transaction log processing

A traditional DBMS does not work for these workloads!
Data Stream Projects
• STREAM (Stanford): a general-purpose Data Stream Management System (DSMS)
• Telegraph (Berkeley): adaptive query processing; TinyDB, a general-purpose sensor database
• Aurora (Brown/MIT): distributed stream processing; introduces new operators (map, drop, etc.)
• The Cougar Project (Cornell): sensors form a distributed database system; cross-layer optimizations across the data management and routing layers
• MAIDS (UIUC): Mining Alarming Incidents in Data Streams; Streaminer for data stream mining
Data Stream Processing – Key Ingredients
• Adaptivity: incorporate evolutionary changes in the stream
• Approximation: exact results are hard to compute fast with limited memory (see the sketch below)
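A minimal sketch (not from the dissertation) of these two ingredients in Python: a fixed-size reservoir sample approximates the stream within bounded memory, and an exponentially weighted mean adapts to drift. All names and parameters are illustrative.

```python
import random

class StreamSummary:
    def __init__(self, reservoir_size=100, alpha=0.05):
        self.reservoir = []          # bounded-memory sample (approximation)
        self.size = reservoir_size
        self.alpha = alpha           # weight on new data (adaptivity)
        self.ewma = None             # evolving estimate of the mean
        self.n = 0

    def update(self, x):
        self.n += 1
        # Reservoir sampling: each tuple is seen once, kept with prob size/n.
        if len(self.reservoir) < self.size:
            self.reservoir.append(x)
        elif random.random() < self.size / self.n:
            self.reservoir[random.randrange(self.size)] = x
        # EWMA: recent tuples dominate, so the estimate adapts to change.
        self.ewma = x if self.ewma is None else (
            self.alpha * x + (1 - self.alpha) * self.ewma)

s = StreamSummary()
for t in range(10_000):
    s.update(random.gauss(0.0 if t < 5_000 else 3.0, 1.0))
print(round(s.ewma, 2))   # tracks the shifted mean (~3.0)
```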
A Data Stream Management System (DSMS)
[Architecture figure: streaming data sources/sensors feed the central stream processing system, which performs query processing, resource management, and adaptive stream mining over a stream synopsis. Users submit queries with precision constraints and receive streaming query results; the system tunes the sampling rate, sliding-window size, data filtering, data sampling, sensor calibration, and data acquisition.]
Thesis Outline
“Develop fast, online, statistical methods for mining data streams.”
• Adaptive non-linear clustering in multi-dimensional streams
• Bayesian reasoning for sensor stream processing
• Filtering methods for resource conservation
• Change detection in data streams
• Video sensor data stream processing
Roadmap
• The Data Stream Model: introduction and research issues; related work
• Data Stream Mining: stream data clustering; Bayesian reasoning for sensor stream processing
• Contribution Summary
• Future Work
Clustering in High-Dimensional Streams
“Given a continuous sequence of points, group them into some number of clusters, such that the members of a cluster are geometrically close to each other.”
Example Application – Network Monitoring
[Figure: high-dimensional connection tuples streaming in from the Internet must be labeled online: DoS, Probe, or Normal?]
Stream Clustering – New Challenges
• One-pass restriction and limited memory: we use the fading cluster technique proposed by Aggarwal et al.
• Non-linear separation boundaries: we propose using the kernel trick to deal with the non-linearity issue
• Data dimensionality: we propose an effective incremental dimension-reduction technique
The 2-Tier Framework
[Figure: the latest point x received from the stream enters the 2-tier clustering module, which uses the kernel trick. Tier 1 performs stream segmentation in the d-dimensional input space; Tier 2 projects x to x~ and updates the q-dimensional low-dimensional space (LDS), q < d, where the fading clusters C1 through C9 reside.]
The Fading Cluster Methodology
Each cluster Ci has a recency value Ri such that Ri = f(t - tlast), where t is the current time and tlast is the last time Ci was updated, and f(t) = e^(-λt), with λ the fading factor. A cluster is erased from memory (faded) when Ri ≤ h, where h is a user parameter that controls the influence of historical data. Consequently, the total number of clusters is bounded.
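A small sketch of this recency bookkeeping, assuming a cluster stores only its last-update time; lam and h correspond to λ and h above, and the cluster contents themselves are omitted.

```python
import math

def recency(t, t_last, lam):
    """R_i = f(t - t_last) with f(t) = exp(-lam * t)."""
    return math.exp(-lam * (t - t_last))

def prune_faded(clusters, t, lam, h):
    """Erase every cluster whose recency has dropped to h or below."""
    return {cid: t_last for cid, t_last in clusters.items()
            if recency(t, t_last, lam) > h}

clusters = {"C1": 0.0, "C2": 9.0}     # cluster id -> last update time
print(prune_faded(clusters, t=10.0, lam=0.5, h=0.05))  # C1 fades, C2 stays
```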
Non-linearity in Data
• Traditional clustering techniques (e.g., k-means) do not perform well on non-linearly separable data
• Spectral clustering methods are likely to perform better
• Feature-space mapping Φ: input space → feature space
Non-linearity in Network Intrusion Data
[Figure: “ipsweep” attack data shown in the input space and in the feature space; the feature-space view exhibits a geometrically well-behaved trend. Use the kernel trick?]
The Kernel Trick
• Actual projection into the higher-dimensional space is computationally expensive
• The kernel trick does the non-linear projection implicitly!
• Given two input-space vectors x, y: k(x,y) = <Φ(x), Φ(y)>, where k is the kernel function
• The Gaussian kernel function k(x,y) = exp(-||x-y||²) was used in the previous example!
Kernel Trick – Working Example
Φ: x = (x1, x2) → Φ(x) = (x1², x2², √2·x1x2)

<Φ(x), Φ(z)> = <(x1², x2², √2·x1x2), (z1², z2², √2·z1z2)>
             = x1²z1² + x2²z2² + 2·x1x2z1z2
             = (x1z1 + x2z2)²
             = <x, z>²

So k(x,z) = <x,z>², and Φ is not required explicitly!

“The kernel trick allows us to perform operations in the high-dimensional feature space using a kernel function, without explicitly representing Φ.”
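A quick numerical check of the identity above; the vectors are arbitrary illustrative values.

```python
import math

def phi(x):
    """The explicit feature map from the worked example."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))   # work in the 3-d feature space
implicit = dot(x, z) ** 2        # kernel trick: stay in the input space
print(explicit, implicit)        # both print 1.0
```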
Stream Clustering – New Challenges (recap)
• One-pass restriction and limited memory: we use the fading cluster technique proposed by Aggarwal et al.
• Non-linear separation boundaries: we propose using kernel methods to deal with the non-linearity issue
• Data dimensionality: we propose an effective incremental dimension-reduction technique
Dimensionality Reduction
• A PCA-like kernel method is desirable; an explicit representation via eigenvalue decomposition (EVD) is preferred
• Kernel PCA (KPCA) is computationally prohibitive: O(n³)
• The principal components evolve with time, so frequent EVD updates may be necessary
• We propose performing EVD on grouped data instead of point data, which requires a novel kernel method (see the sketch below)
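A back-of-the-envelope sketch of why grouping pays off: eigendecomposition of a Gram matrix costs O(n³), so an EVD over g segment representatives (g << n) is far cheaper than one over n raw points. The RBF kernel, random data, and segment-mean grouping here are stand-ins for the dissertation's segment kernel, not its actual method.

```python
import numpy as np

def rbf_gram(X, gamma=0.5):
    """Gaussian-kernel Gram matrix K[i,j] = exp(-gamma * ||xi - xj||^2)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
points = rng.normal(size=(2000, 10))          # n = 2000 stream points
groups = points.reshape(100, 20, 10).mean(1)  # g = 100 segment means (toy grouping)

evals, evecs = np.linalg.eigh(rbf_gram(groups))  # ~100^3 work, not ~2000^3
top_q = evecs[:, -10:]                           # basis for a q = 10 LDS
print(top_q.shape)                               # (100, 10)
```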
The 2-Tier Framework (revisited)
[Figure repeated from the earlier 2-tier framework slide.]
The 2-Tier Framework (continued)
• Tier 1 captures the temporal locality in a segment; a segment is a group of contiguous points in the stream that are geometrically packed closely in the feature space
• Tier 2 adaptively selects segments, called representative segments, to project data into the LDS
• Implicit data in the feature space is projected explicitly into the LDS such that feature-space distances are preserved
The 2-Tier Framework (flowchart)
TIER 1:
1. Obtain a point x from the stream and add x to the current segment S.
2. If (Φ(x) is novel w.r.t. S and s > smin) or s = smax, hand S to Tier 2; otherwise return to step 1.
TIER 2:
3. If S is a representative segment, add S to memory and update the LDS; clear the contents of S either way.
4. Obtain x~, the projection of x in the LDS.
5. If x~ is close to an active cluster, assign x~ to its nearest cluster; otherwise create a new cluster with x~.
6. Update cluster centers and recency values, and delete faded clusters.
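A control-flow skeleton of the loop above, as I read the flowchart. The helper callables (is_novel, is_representative, project, nearest) stand in for the kernel-based tests and the LDS machinery, none of which are shown here, and clusters is assumed to expose the fading-cluster operations.

```python
S_MIN, S_MAX = 8, 64   # illustrative segment-size bounds

def two_tier(stream, is_novel, is_representative, project, nearest, clusters):
    segment = []                                   # current segment S
    for x in stream:
        segment.append(x)                          # Tier 1: add x to S
        boundary = (is_novel(x, segment) and len(segment) > S_MIN) \
                   or len(segment) == S_MAX
        if not boundary:
            continue                               # keep collecting points
        if is_representative(segment):             # Tier 2
            clusters.add_segment(segment)          # keep S, update the LDS
        segment = []                               # clear contents of S
        x_lds = project(x)                         # obtain x~ in the LDS
        c = nearest(x_lds, clusters)
        if c is not None:
            clusters.assign(x_lds, c)              # nearest active cluster
        else:
            clusters.create(x_lds)                 # new cluster with x~
        clusters.prune_faded()                     # delete faded clusters
```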
Network Intrusion Stream
• Simulated data from MIT Lincoln Labs
• 34 continuous attributes (features)
• 10.5K records
• 22 types of intrusion attacks + 1 normal class
Network Intrusion Stream
[Figure: clustering accuracy at LDS dimensionality u = 10.]
Efficiency – EVD Computations
[Figures comparing EVD computation costs on two datasets.]
• Newswire data: 3.8K records, 16.5K features, 10 news topics
• Image data: 5K records, 576 features, 10 digits
In Retrospect…
• We proposed an effective stream clustering framework
• We use the kernel trick to delineate non-linear boundaries efficiently
• We use a stream-segmentation approach to continuously project data into a low-dimensional space
Roadmap
• The Data Stream Model: introduction and research issues; related work
• Contributions Towards Stream Mining: stream data clustering; Bayesian reasoning for sensor stream processing
• Contribution Summary
• Future Work
Bayesian Reasoning for Sensor Data Processing
• Users submit queries with precision constraints, e.g., “Find the temperature with 80% confidence.”
• Resource conservation is of prime concern to prolong system life: both data acquisition and data communication cost energy
• Use probabilistic models at the central site for approximate predictions, preventing actual acquisitions (see the sketch below)
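A toy sketch of this answer-from-the-model-when-confident loop; the posterior table and the acquisition fallback are invented stand-ins for the BN inference and the real sensor read.

```python
def answer(query_conf, posterior, acquire):
    """posterior: dict value -> probability; acquire: fallback sensor read."""
    best_value = max(posterior, key=posterior.get)
    if posterior[best_value] >= query_conf:
        return best_value, "predicted (no acquisition)"
    return acquire(), "acquired from sensor"

posterior = {"70-75F": 0.85, "75-80F": 0.10, "80-85F": 0.05}
print(answer(0.80, posterior, acquire=lambda: "72F"))  # model suffices
print(answer(0.95, posterior, acquire=lambda: "72F"))  # must acquire
```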
Dependencies in Sensor Attributes
[Figure: for a “Get Temperature” query, the dependency model (a Bayesian network linking Temperature and Voltage) lets the system acquire the cheap attribute instead: acquire voltage, then report temperature.]

Attribute    | Acquisition Cost
Temperature  | 50 J
Voltage      | 5 J
Using Correlation Models [Deshpande et al., VLDB’04]
• Correlation models ignore conditional dependency
• Intel Lab (real sensor-network data); attributes: Voltage (V), Temperature (T), Humidity (H)
• “Voltage” is correlated with “temperature”, yet “voltage” is conditionally independent of “temperature” given “humidity”!
[Figure: scatter plots for humidity in [35, 40).]
BN vs. Correlations
Bayesian network:
• Maintains vital dependencies only
• Lower search complexity: O(n)
• Storage: O(nd), where d is the average node degree
• Intuitive dependency structure
Correlation model [Deshpande et al.]:
• Maintains all dependencies
• The search space for finding the best alternative sensor attribute is large
• The joint probability is represented in O(n²) cells
[Figures: NDBC Buoy dataset and Intel Lab dataset.]
Bayesian Networks (BN)
Qualitative part – a Directed Acyclic Graph (DAG):
• Nodes – sensor attributes
• Edges – attribute-influence relationships
Quantitative part – Conditional Probability Tables (CPTs):
• Each node X has its own CPT, P(X | parents(X))
Together, the BN represents the joint probability in factored form:
P(T,H,V,L) = P(T) P(H|T) P(V|H) P(L|T)
The “influence relationship” is quantified by the entropy function H:
H(Xi) = -Σ_{l=1..k} P(Xi = xil) log P(Xi = xil)
We learn the BN by minimizing H(Xi | parents(Xi)), sketched below.
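A small sketch of this criterion with made-up CPT-style numbers: the conditional entropy of a node given a well-chosen parent is lower than its marginal entropy, which is what the structure learning minimizes.

```python
import math

def entropy(p):
    """H = -sum p log2 p over a distribution given as a list."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def conditional_entropy(p_parent, p_child_given_parent):
    """H(X | Pa) = sum_pa P(pa) * H(X | Pa = pa)."""
    return sum(p_pa * entropy(row)
               for p_pa, row in zip(p_parent, p_child_given_parent))

p_h = [0.3, 0.7]                         # P(Humidity)      (illustrative)
p_v_given_h = [[0.9, 0.1], [0.2, 0.8]]   # P(Voltage | Humidity)
p_v = [0.3 * 0.9 + 0.7 * 0.2, 0.3 * 0.1 + 0.7 * 0.8]  # marginal P(Voltage)
print(entropy(p_v), conditional_entropy(p_h, p_v_given_h))  # ~0.98 vs ~0.65
```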
System Architecture
[Figure: users submit group queries, e.g., {(Temperature, 80%)}, {(Wind Speed, 75%)}, {(Temperature, 95%), (Wind Speed, 85%)}, {(Air Pressure, 90%), (Wind Speed, 90%)}, to the query processor. Group-query plan generation consults the Bayesian inference engine and the stored BN, CPTs, and acquisition costs to produce an acquisition plan; acquired values return from the sensor network.]
Finding the Candidate Attributes
• For any attribute in the group query Q, analyze candidate attributes in its Markov blanket recursively
• Selection criterion: choose candidates in a greedy fashion, trading off acquisition cost against information gain (conditional entropy)
• Goals: meet the precision constraints while maximizing resource conservation (a greedy sketch follows)
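A greedy sketch of this selection loop. The candidate names, gains, and costs are invented, and the stopping rule (accumulate information gain until a target is met) is a simplification of the actual precision-constraint test.

```python
def select_candidates(candidates, target_gain):
    """candidates: list of (name, info_gain_bits, cost_joules)."""
    chosen, gained = [], 0.0
    # Rank candidates by information gain per joule, best first.
    pool = sorted(candidates, key=lambda c: c[1] / c[2], reverse=True)
    for name, gain, cost in pool:
        if gained >= target_gain:
            break                     # precision target met; stop acquiring
        chosen.append(name)
        gained += gain
    return chosen

cands = [("Voltage", 0.30, 5.0), ("Humidity", 0.50, 20.0),
         ("Temperature", 0.90, 50.0)]
print(select_candidates(cands, target_gain=0.7))  # ['Voltage', 'Humidity']
```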
Experiments – Resource Conservation
NDBC dataset, 7 attributes
[Figures: effect of using group queries, where |Q| is the group-query size; effect of using the MB property with δmin = 0.90.]
Results – Selectivity
[Figure: learned BN over the NDBC buoy attributes Wave Period (WP), Wind Speed (SP), Water Temperature (WT), Wind Direction (DR), Wave Height (WH), Air Temperature (AT), and Air Pressure (AP).]
In Retrospect…
• Bayesian networks can encode sensor dependencies effectively
• Our method provides significant resource conservation for group queries
Contribution Summary
• “Adaptive stream resource management using Kalman Filters.” [SIGMOD’04]
• “Adaptive sampling for sensor networks.” [DMSN’04]
• “Adaptive non-linear clustering for data streams.” [CIKM’06]
• “Using stationary-dynamic camera assemblies for wide-area video surveillance and selective attention.” [CVPR’06]
• “Filtering the data streams.” [in submission]
• “Efficient diagnostic and aggregate queries on sensor networks.” [in submission]
• “OCODDS: An on-line change-over detection framework for tracking evolutionary changes in data streams.” [in submission]
Future Work
• Develop non-linear techniques for capturing temporal correlations in data streams
• Extend the Bayesian framework to address “what-if” queries with counterfactual evidence
• Extend the clustering framework to build stream visualization systems
• Use incremental EVD techniques to further improve performance
Thank You !
BACKUP SLIDES!
Back to Stream Clustering
We propose a 2-tier stream clustering framework:
• Tier 1: a kernel method that continuously divides the stream into segments
• Tier 2: a kernel method that uses the segments to project data into a low-dimensional space (LDS)
The fading clusters reside in the LDS.
Clustering – LDS Projection
Clustering – LDS Update
Network Intrusion Stream
[Figures: clustering accuracy and cluster strengths at LDS dimensionality u = 10.]
Effect of dimensionality
Query Plan Generation
Given a group query, the query plan computes the “candidate attributes” that will actually be acquired to successfully address the query. We exploit the Markov Blanket (MB) property to select candidate attributes. Given a BN G, the Markov blanket of a node Xi comprises its parents, its children, and its children’s other parents, and

P(Xi | G) = P(Xi | MB(Xi)) = P(Xi, MB(Xi)) / P(MB(Xi))
Exploiting the MB Property
“Given a node Xi and a set of arbitrary nodes Y in a BN such that MB(Xi) ⊆ Y ∪ {Xi}, the conditional entropy of Xi given Y is at least as high as that given its Markov blanket: H(Xi|Y) ≥ H(Xi|MB(Xi)).”

Proof: Separate MB(Xi) into two parts, MB1 = MB(Xi) ∩ Y and MB2 = MB(Xi) - MB1, and let Z = Y - MB(Xi). Then:
H(Xi|Y) = H(Xi|Z, MB1)          [Y = Z ∪ MB1]
        ≥ H(Xi|Z, MB1, MB2)     [additional information cannot increase entropy]
        = H(Xi|Z, MB(Xi))       [MB(Xi) = MB1 ∪ MB2]
        = H(Xi|MB(Xi))          [Markov-blanket definition]
Bayesian Reasoning – More Results…
[Figures: effect of using the MB property with δmin = 0.90; query-answer quality loss on a 50-node synthetic-data BN.]
Bayesian Reasoning for Group Queries
More accurate in addressing group queries:
Q = { (Xi, δi) | Xi ∈ X ∧ 0 < δi ≤ 1, 1 ≤ i ≤ n } such that δi < max_l P(Xi = xil)
where
• X = {X1, X2, X3, …, Xn} are the sensor attributes
• δi are the confidence parameters
• P(Xi = xil) is the probability with which Xi assumes the value xil
Bayesian reasoning is also helpful in detecting abnormalities.
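A sketch of the answerability test implied by the condition δi < max_l P(Xi = xil): a pair (Xi, δi) can be served by the model only if some value of Xi has posterior probability exceeding δi. Attribute names and posteriors are illustrative.

```python
def answerable(group_query, posteriors):
    """group_query: {attr: delta}; posteriors: {attr: {value: prob}}."""
    return {attr: max(posteriors[attr].values()) > delta
            for attr, delta in group_query.items()}

q = {"Temperature": 0.80, "WindSpeed": 0.75}
post = {"Temperature": {"hot": 0.85, "mild": 0.15},
        "WindSpeed": {"low": 0.6, "high": 0.4}}
print(answerable(q, post))  # {'Temperature': True, 'WindSpeed': False}
```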
Bayesian Reasoning – Candidate attribute selection algorithm