University of Texas at Dallas
Data Stream Classification and Novel Class Detection
Mehedy Masud, Latifur Khan, Qing Chen and Bhavani Thuraisingham
Department of Computer Science, University of Texas at Dallas
Jing Gao, Jiawei Han
Department of Computer Science, University of Illinois at Urbana-Champaign
Charu Aggarwal
IBM T. J. Watson
This work was funded in part by …
Aug 10, 2011, Masud et al.
Outline of The Presentation
Background
Data Stream Classification
Novel Class Detection
Introduction
Characteristics of data streams:
◦ Continuous flow of data
◦ Examples: network traffic, sensor data, call center records
Data Stream Classification
Uses past labeled data to build a classification model
Predicts the labels of future instances using the model
Helps decision making
Figure: network traffic is fed to the classification model; attack traffic is blocked and quarantined by the firewall, while benign traffic reaches the server; expert analysis and labeling of the traffic drives the model update.
Data Stream Classification (cont.)
What are the applications?
◦ Security monitoring
◦ Network monitoring and traffic engineering
◦ Business: credit card transaction flows
◦ Telecommunication calling records
◦ Web logs and web page click streams
Challenges
Infinite length
Concept-drift
Concept-evolution
Feature-evolution
Infinite Length
Impractical to store and use all historical data
◦ Requires infinite storage
◦ And infinite running time
Concept-Drift
Figure: a data chunk of positive and negative instances; the current hyperplane has shifted away from the previous hyperplane, and the instances lying between the two are victims of concept-drift.
Concept-Evolution
Figure: a feature space (x, y) split by thresholds x1, y1, y2 into regions A, B, C, D containing the existing + and - classes; a novel class (denoted by x) arrives in a region not occupied by the existing classes.
Classification rules:
R1. if (x > x1 and y < y2) or (x < x1 and y < y1) then class = +
R2. if (x > x1 and y > y2) or (x < x1 and y > y1) then class = -
Existing classification models misclassify novel class instances.
Dynamic Features
Why do new features evolve?
◦ Infinite data stream
 Normally, the global feature set is unknown
 New features may appear
◦ Concept-drift
 As the concept drifts, new features may appear
◦ Concept-evolution
 A new type of class normally carries a new set of features
Different chunks may have different feature sets
Dynamic Features
Figure: each chunk passes through feature extraction & selection; the ith chunk (feature set: runway, climb) trains the current model, while the i+1st chunk (feature sets: runway, clear, ramp / runway, ground, ramp) goes through feature space conversion before classification & novel class detection and training of the new model. The ith chunk, the i+1st chunk, and the models have different feature sets.
Existing classification models need a complete, fixed feature set that applies to all chunks. The global feature set is difficult to predict. One solution is to use all English words and generate a vector, but the dimension of that vector would be far too high.
Outline of The Presentation
Introduction
Data Stream Classification
Novel Class Detection
Data Stream Classification (cont.)
Single model incremental classification
Ensemble (model-based) classification
◦ Supervised
◦ Semi-supervised
◦ Active learning
Overview
Single model incremental classification
Ensemble (model-based) classification
◦ Data selection
◦ Semi-supervised
◦ Skewed data
Ensemble of Classifiers
Figure: an unlabeled input (x, ?) is classified by each classifier C1, C2, C3 individually (outputs +, +, -); the ensemble output (+) is obtained by voting.
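The voting step above can be sketched as follows; `ensemble_predict` and the constant classifiers `c1`–`c3` are hypothetical stand-ins for the figure's C1–C3, not the authors' implementation.

```python
from collections import Counter

def ensemble_predict(classifiers, x):
    """Classify x with each ensemble member, then take a majority vote."""
    votes = [clf(x) for clf in classifiers]      # individual outputs
    return Counter(votes).most_common(1)[0][0]   # ensemble output

# Toy stand-ins for C1, C2, C3 from the figure:
c1 = lambda x: '+'
c2 = lambda x: '+'
c3 = lambda x: '-'

print(ensemble_predict([c1, c2, c3], x=None))  # '+'
```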
Ensemble Classification of Data Streams
Divide the data stream into equal-sized chunks
◦ Train a classifier from each data chunk
◦ Keep the best L such classifiers as the ensemble
◦ Example: for L = 3
Figure: labeled chunks D1, D2, D3 train classifiers C1, C2, C3, which form the ensemble that predicts the unlabeled chunk D4; once D4 is labeled it trains C4, D5 trains C5, and so on, with the best L classifiers retained at each step.
Addresses infinite length and concept-drift
Note: Di may contain data points from different classes
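A minimal sketch of the chunk-by-chunk loop above. The "classifier" here is a trivial majority-label model and candidates are scored on the newest labeled chunk; the actual method trains a real learner per chunk, but the keep-the-best-L bookkeeping is the same.

```python
from collections import Counter

L = 3  # ensemble size

def train(chunk):
    """Toy model: always predict the majority label of its training chunk."""
    majority = Counter(label for _, label in chunk).most_common(1)[0][0]
    return lambda x, m=majority: m

def accuracy(model, chunk):
    return sum(model(x) == y for x, y in chunk) / len(chunk)

# Each chunk is a list of (instance, label) pairs (toy data):
stream_of_labeled_chunks = [
    [(0, 'a'), (1, 'a'), (2, 'b')],
    [(3, 'b'), (4, 'b'), (5, 'a')],
    [(6, 'b'), (7, 'b'), (8, 'b')],
    [(9, 'a'), (10, 'b'), (11, 'b')],
]

ensemble = []
for chunk in stream_of_labeled_chunks:
    candidate = train(chunk)
    # Keep only the best L models, scored on the newest labeled chunk
    ensemble = sorted(ensemble + [candidate],
                      key=lambda m: accuracy(m, chunk),
                      reverse=True)[:L]

print(len(ensemble))  # 3
```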
Concept-Evolution Problem
A completely new class of data arrives in the stream.
(a) A decision tree splitting first on x < x1 and then on y < y1 and y < y2, with leaves corresponding to regions A, B, C, D; (b) the corresponding feature-space partitioning of the + and - instances into regions A, B, C, D; (c) a novel class (denoted by x) arrives in the stream, inside a region the tree assigns to an existing class.
ECSMiner
ECSMiner: Overview
Figure: the data stream is split into older instances (labeled) and newer instances (unlabeled). The last labeled chunk trains a new model, which updates the ensemble of L models M1, M2, ..., ML. Each just-arrived instance xnow goes through outlier detection: if it is not an outlier it is classified immediately; otherwise it is buffered for novel class detection.
ECSMiner
Based on: Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham. "Integrating Novel Class Detection with Classification for Concept-Drifting Data Streams". In Proceedings of the 2009 European Conf. on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD'09), Bled, Slovenia, 7-11 Sept 2009, pp. 79-94 (extended version appeared in IEEE Transactions on Knowledge and Data Engineering (TKDE)).
Algorithm
Training; novel class detection and classification
ECSMiner
Novel Class Detection
Non-parametric:
◦ does not assume any underlying model of the existing classes
Steps:
1. Creating and saving a decision boundary during training
2. Detecting and filtering outliers
3. Measuring cohesion and separation among test and training instances
ECSMiner
Training: Creating Decision Boundary
Figure: raw training data (+ and - instances) are clustered; each cluster is summarized as a pseudopoint, and the pseudopoints (in regions A, B, C, D) form the decision boundary.
Addresses the infinite length problem
ECSMiner
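One way to sketch the pseudopoint construction above, using a tiny hand-rolled K-means (the data, K, and the "radius = farthest member" summary are toy assumptions for illustration):

```python
import math
import random

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm on tuples of floats."""
    random.seed(0)
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

def pseudopoints(points, k):
    """Summarize each cluster by (centroid, radius = max member distance)."""
    centroids, clusters = kmeans(points, k)
    return [(c, max((math.dist(c, p) for p in cl), default=0.0))
            for c, cl in zip(centroids, clusters)]

def inside_boundary(x, pps):
    """x lies inside the decision boundary if any pseudopoint hypersphere covers it."""
    return any(math.dist(x, c) <= r for c, r in pps)

pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
pps = pseudopoints(pts, 2)
print(inside_boundary((50.0, 50.0), pps))  # False: far outside every hypersphere
```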
Outlier Detection and Filtering
Figure: a test instance x is checked against each model M1, M2, ..., ML of the ensemble. A test instance inside a model's decision boundary is not an outlier; one outside it is a raw outlier (Routlier). Only if x is an Routlier with respect to ALL models (logical AND) is it declared a filtered outlier (Foutlier), a potential novel class instance; otherwise it is an existing class instance.
Routliers may appear as a result of a novel class, concept-drift, or noise. Therefore, they are filtered to reduce noise as much as possible.
ECSMiner
Novel Class Detection
(Step 1) A test instance x is checked against the ensemble of L models M1, M2, ..., ML.
(Step 2) If x is an Routlier with respect to all models (AND), it is a filtered outlier (Foutlier), a potential novel class instance; otherwise it is an existing class instance.
(Step 3) Compute q-NSC with all models and the other Foutliers.
(Step 4) If q-NSC > 0 for more than q Foutliers with all models, a novel class is found; otherwise treat them as existing class instances.
ECSMiner
Computing Cohesion & Separation
a(x) = mean distance from an Foutlier x to the instances in λo,q(x), its q nearest Foutlier neighbors
bc(x) = mean distance from x to its q nearest neighbors in existing class c; bmin(x) = minimum among all bc(x) (e.g., b+(x) in the figure)
q-Neighborhood Silhouette Coefficient (q-NSC):
q-NSC(x) = (bmin(x) − a(x)) / max(bmin(x), a(x))
If q-NSC(x) is positive, it means x is closer to the Foutliers than to any existing class.
Figure: the 5-nearest neighborhoods λo,5(x), λ+,5(x), and λ-,5(x) of an Foutlier x, with the corresponding mean distances a(x), b+(x), and b-(x).
ECSMiner
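The q-NSC definition above can be computed directly; in this sketch the instances are toy 1-D points and `existing_by_class` is a hypothetical per-class sample, purely to illustrate the sign of the score.

```python
def mean_dist_to_q_nearest(x, points, q):
    """Mean distance from x to its q nearest neighbors among `points`."""
    ds = sorted(abs(x - p) for p in points if p != x)
    return sum(ds[:q]) / min(q, len(ds))

def q_nsc(x, foutliers, existing_by_class, q):
    a = mean_dist_to_q_nearest(x, foutliers, q)            # cohesion with Foutliers
    b_min = min(mean_dist_to_q_nearest(x, pts, q)          # separation: closest class
                for pts in existing_by_class.values())
    return (b_min - a) / max(b_min, a)

foutliers = [10.0, 10.2, 10.4, 9.8]                  # a tight group far from both classes
existing = {'+': [0.0, 0.5, 1.0], '-': [20.0, 20.5, 21.0]}
print(q_nsc(10.0, foutliers, existing, q=3))  # positive: cohesive with the Foutliers
```

A point sitting among an existing class instead (e.g. x = 0.5) gets a negative q-NSC, which is exactly the Step 4 criterion.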
Speeding Up
Computing q-NSC for every Foutlier instance x takes quadratic time in the number of Foutliers. To make the computation faster:
◦ We create Ko pseudopoints (Fpseudopoints) from the Foutliers using K-means clustering, where Ko = (No/S) × K. Here S is the chunk size and No is the number of Foutliers.
◦ We perform the computations on the Fpseudopoints.
Thus, the time complexity
◦ to compute the q-NSC of all the Fpseudopoints is O(Ko(Ko + K)),
◦ which is constant, since both Ko and K are independent of the input size.
◦ However, by gaining speed we lose some precision, although the loss is negligible (to be analyzed shortly).
Algorithm To Detect Novel Class
ECSMiner
“Speedup” Penalty
As discussed earlier:
◦ by speeding up the computation in Step 3, we lose some precision, since the result deviates from the exact result
◦ this analysis shows that the deviation is negligible
Figure 6. Illustrating the computation of deviation: i is an Fpseudopoint, i.e., a cluster of Foutliers, and j is an existing class pseudopoint, i.e., a cluster of existing class instances; the quantities involved are the squared distances (i−j)², (x−j)², and (x−i)² for an Foutlier x in i. In this particular example, all instances in i belong to a novel class.
“Speedup” Penalty
Approximate:
Exact:
Deviation:
Experiments - Datasets
We evaluated our approach on two synthetic and two real datasets:
◦ SynC - synthetic data with only concept-drift, generated using a hyperplane equation. 2 classes, 10 attributes, 250K instances
◦ SynCN - synthetic data with concept-drift and novel classes, generated using Gaussian distributions. 20 classes, 40 attributes, 400K instances
◦ KDD Cup 1999 intrusion detection (10% version) - real dataset. 23 classes, 34 attributes, 490K instances
◦ Forest Cover - real dataset. 7 classes, 54 attributes, 581K instances
Experiments - Setup
Development:
◦ Language: Java
H/W:
◦ Intel Pentium IV, 3 GHz dual-processor CPU, 2 GB memory
Parameter settings:
◦ K (number of pseudopoints per chunk) = 50
◦ N (minimum number of instances required to declare a novel class) = 50
◦ M (ensemble size) = 6
◦ S (chunk size) = 2,000
Experiments - Baseline
Competing approaches:
◦ i) MineClass (MC): our approach
◦ ii) WCE-OLINDDA_Parallel (W-OP)
◦ iii) WCE-OLINDDA_Single (W-OS)
where WCE-OLINDDA is a combination of the Weighted Classifier Ensemble (WCE) and the novel class detector OLINDDA, with default parameter settings for both.
We use this combination since, to the best of our knowledge, there is no other approach that can classify and detect novel classes simultaneously.
OLINDDA assumes there is only one normal class, and all other classes are novel.
◦ Therefore, we apply two variations:
 W-OP keeps parallel OLINDDA models, one for each class
 W-OS keeps a single model that absorbs a novel class when encountered
Experiments - Results
Evaluation metrics:
◦ Mnew = % of novel class instances Misclassified as existing class = Fn∗100/Nc
◦ Fnew = % of existing class instances Falsely identified as novel class = Fp∗100/(N−Nc)
◦ ERR = total misclassification error (%) (including Mnew and Fnew) = (Fp+Fn+Fe)∗100/N
where
◦ Fn = total novel class instances misclassified as existing class,
◦ Fp = total existing class instances misclassified as novel class,
◦ Fe = total existing class instances misclassified (other than Fp),
◦ Nc = total novel class instances in the stream,
◦ N = total instances in the stream.
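The three metrics above reduce to simple arithmetic on the raw counts; the counts below are toy values chosen only to make the percentages round.

```python
def stream_metrics(Fn, Fp, Fe, Nc, N):
    m_new = Fn * 100.0 / Nc            # novel instances misclassified as existing
    f_new = Fp * 100.0 / (N - Nc)      # existing instances falsely flagged as novel
    err = (Fp + Fn + Fe) * 100.0 / N   # total misclassification error
    return m_new, f_new, err

m_new, f_new, err = stream_metrics(Fn=50, Fp=40, Fe=110, Nc=1000, N=5000)
print(m_new, f_new, err)  # 5.0 1.0 4.0
```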
Experiments - Results
Figure: error rates on Forest Cover, KDD Cup, and SynCN.
Experiments - Results
Experiments - Parameter Sensitivity
Experiments - Runtime
Dynamic Features
Solution:
◦ Global features
◦ Local features
◦ Union
Mohammad Masud, Qing Chen, Latifur Khan, Jing Gao, Jiawei Han, and Bhavani Thuraisingham, "Classification and Novel Class Detection of Data Streams in a Dynamic Feature Space," in Proc. of Machine Learning and Knowledge Discovery in Databases, European Conference, ECML PKDD 2010, Barcelona, Spain, Sept 2010, Springer, pp. 337-352.
Feature Mapping Across Models and Test Data Points
The feature set varies across chunks. In particular, when a new class appears, new features must be selected and added to the feature set.
Strategy 1 - Lossy fixed (Lossy-F) conversion / Global
◦ Use the same fixed feature set for the entire stream.
 We call this a lossy conversion because future models and instances may lose important features due to this mapping.
Strategy 2 - Lossy local (Lossy-L) conversion / Local
◦ We call this a lossy conversion because it may lose feature values during the mapping.
Strategy 3 - Dimension preserving (D-Preserving) mapping / Union
Feature Space Conversion - Lossy-L Mapping (Local)
Assume that each data chunk has a different feature vector.
When a classification model is trained, we save the feature vector with the model.
When an instance is tested, its feature vector is mapped (i.e., projected) onto the model's feature vector.
Feature Space Conversion - Lossy-L Mapping
For example:
◦ Suppose the model has two features (x, y)
◦ The instance has two features (y, z)
◦ When testing, the instance is treated as having the two features (x, y),
◦ where x = 0 and the y value is kept as it is
Conversion Strategy II - Lossy-L Mapping
Graphically:
Conversion Strategy III - D-Preserving Mapping
When an instance is tested, both the model's feature vector and the instance's feature vector are mapped (i.e., projected) onto the union of their feature vectors.
◦ The feature dimension is increased.
◦ In the mapping, the features of both the test instance and the model are preserved. The extra features are filled with 0s.
Conversion Strategy III - D-Preserving Mapping
For example:
◦ Suppose the model has three features (a, b, c)
◦ The instance has four features (b, c, d, e)
◦ When testing, we project both the model's feature vector and the instance's feature vector onto (a, b, c, d, e)
◦ Therefore, in the model, d and e will be considered 0s, and in the instance, a will be considered 0
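Both conversions can be sketched with instances represented as {feature: value} dicts, reusing the slides' example of model features (a, b, c) and instance features (b, c, d, e); the dict representation itself is an assumption for illustration.

```python
def lossy_l(instance, model_features):
    """Lossy-L: project the instance onto the model's feature set; missing -> 0."""
    return {f: instance.get(f, 0) for f in model_features}

def d_preserving(instance, model_features):
    """D-Preserving: project onto the union of both feature sets; extras -> 0."""
    union = sorted(set(model_features) | set(instance))
    return {f: instance.get(f, 0) for f in union}

inst = {'b': 1, 'c': 2, 'd': 3, 'e': 4}
print(lossy_l(inst, ['a', 'b', 'c']))       # {'a': 0, 'b': 1, 'c': 2} -- d, e lost
print(d_preserving(inst, ['a', 'b', 'c']))  # all of a..e kept, with a = 0
```

The printout makes the difference concrete: Lossy-L silently drops the instance's d and e values, while D-Preserving keeps them at the cost of a larger dimension.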
Conversion Strategy III - D-Preserving Mapping
Previous example:
Discussion
Local does not favor the novel class; it favors the existing classes.
◦ Local features will be enough to model the existing classes.
Union favors the novel class.
◦ New features may be discriminating for the novel class, hence Union works.
Comparison
Which strategy is better?
Assumption: the lossless conversion (Union) preserves the properties of a novel class. In other words, if an instance belongs to a novel class, it remains outside the decision boundary of every model Mi of the ensemble M in the converted feature space.
Lemma: If a test point x belongs to a novel class, it will be misclassified by the ensemble M as an existing class instance under certain conditions when the Lossy-L conversion is used.
Comparison
Proof:
Let X1,…,XL,XL+1,…,XM be the dimensions of the model, and
let X1,…,XL,XM+1,…,XN be the dimensions of the test point.
Suppose the radius of the closest cluster (in the higher dimension) is R.
Also, let the test point be a novel class instance.
Combined feature space = X1,…,XL,XL+1,…,XM,XM+1,…,XN
Comparison
Proof (continued):
Combined feature space = X1,…,XL,XL+1,…,XM,XM+1,…,XN
Centroid of the cluster (original space): X1=x1,…,XL=xL,XL+1=xL+1,…,XM=xM, i.e., x1,…,xL,xL+1,…,xM
Centroid of the cluster (combined space): x1,…,xL,xL+1,…,xM,0,…,0
Test point (original space): X1=x'1,…,XL=x'L,XM+1=x'M+1,…,XN=x'N, i.e., x'1,…,x'L,x'M+1,…,x'N
Test point (combined space): x'1,…,x'L,0,…,0,x'M+1,…,x'N
Comparison
Proof (continued):
Centroid (combined space): x1,…,xL,xL+1,…,xM,0,…,0
Test point (combined space): x'1,…,x'L,0,…,0,x'M+1,…,x'N
Let a² = (x1−x'1)² + … + (xL−x'L)² + x²L+1 + … + x²M, the squared distance between the test point and the centroid in the Lossy-L (model) space, and let b² = x'²M+1 + … + x'²N.
Since the test point is a novel class instance, in the combined space it lies outside the cluster radius:
R² < ((x1−x'1)² + … + (xL−x'L)² + x²L+1 + … + x²M) + (x'²M+1 + … + x'²N)
R² < a² + b²
R² = a² + b² − e² (for some e² > 0)
a² = R² + (e² − b²)
a² < R² (provided that e² < b²)
Therefore, in the Lossy-L conversion, the test point will not be an outlier.
Baseline Approaches
WCE is the Weighted Classifier Ensemble [1], which provides a multi-class ensemble classifier.
OLINDDA is a novel class detector [2] that works only for the binary-class case.
FAE is an ensemble classifier that addresses feature-evolution [3] and concept-drift.
ECSMiner is a multi-class ensemble classifier that addresses concept-drift and concept-evolution [4].
Approaches Comparison
Technique  | Infinite length | Concept-drift | Concept-evolution | Dynamic features
OLINDDA    |                 |               | ✓                 |
WCE        | ✓               | ✓             |                   |
FAE        | ✓               | ✓             |                   | ✓
ECSMiner   | ✓               | ✓             | ✓                 |
DXMiner    | ✓               | ✓             | ✓                 | ✓
Experiments: Datasets
We evaluated our approach on different datasets:
Data set     | Concept drift | Concept evolution | Dynamic feature | # of instances | # of classes
KDD          | ✓             | ✓                 |                 | 492K           | 7
Forest Cover | ✓             | ✓                 |                 | 387K           | 7
NASA         | ✓             | ✓                 | ✓               | 140K           | 21
Twitter      | ✓             | ✓                 | ✓               | 335K           | 21
Experiments: Results
Evaluation metrics: let
◦ Fn = total novel class instances misclassified as existing class,
◦ Fp = total existing class instances misclassified as novel class,
◦ Fe = total existing class instances misclassified (other than Fp),
◦ Nc = total novel class instances in the stream,
◦ N = total instances in the stream
Experiments: Results
We use the following performance metrics to evaluate our technique:
◦ Mnew = % of novel class instances Misclassified as existing class, i.e., Mnew = Fn∗100/Nc
◦ Fnew = % of existing class instances Falsely identified as novel class, i.e., Fnew = Fp∗100/(N−Nc)
◦ ERR = total misclassification error (%) (including Mnew and Fnew), i.e., ERR = (Fp+Fn+Fe)∗100/N
Experiments: Setup
Development:
◦ Language: Java
H/W:
◦ Intel Pentium IV, 3 GHz dual-processor CPU, 3 GB memory
Parameter settings:
◦ K (number of pseudopoints per chunk) = 50
◦ q (minimum number of instances required to declare a novel class) = 50
◦ L (ensemble size) = 6
◦ S (chunk size) = 1,000
Experiments: Baseline
Competing approaches:
◦ i) DXMiner (DXM): our approach, in several variations:
 Lossy-F conversion
 Lossy-L conversion
 D-Preserving conversion
◦ ii) FAE-WCE-OLINDDA_Parallel (W-OP)
 Assumes there is only one normal class and all other classes are novel; W-OP keeps parallel OLINDDA models, one for each class.
 We use this combination since, to the best of our knowledge, there is no other approach that can classify and detect novel classes simultaneously under feature-evolution.
◦ iii) FAE-ECSMiner
Twitter Results
Twitter Results
     | D-Preserving | Lossy-Local | Lossy-Global | O-F
AUC  | 0.88         | 0.83        | 0.76         | 0.56
NASA Dataset
     | Deviation | Info Gain | O-F
AUC  | 0.996     | 0.967     | 0.876
Forest Cover Results
Forest Cover Results
     | D-Preserving | O-F
AUC  | 0.97         | 0.74
KDD Results
KDD Results
     | D-Preserving | FAE-OLINDDA
AUC  | 0.98         | 0.96
Summary Results
Improved Outlier Detection and Multiple Novel Class Detection
Challenges
◦ High false positive (FP) rates (existing classes detected as novel) and false negative (FN) rates (missed novel classes)
◦ Two or more novel classes may arrive at the same time
Solutions [1]
◦ Dynamic decision boundary, based on previous mistakes
 Inflate the decision boundary if the FP rate is high; deflate it if the FN rate is high
◦ Build a statistical model to filter noise and concept-drift out of the outliers
◦ Multiple novel classes are detected by:
 constructing a graph in which each outlier cluster is a vertex,
 merging the vertices based on the silhouette coefficient, and
 counting the number of connected components in the resultant (i.e., merged) graph
[1] Mohammad M. Masud, Qing Chen, Jing Gao, Latifur Khan, Charu Aggarwal, Jiawei Han, and Bhavani Thuraisingham, "Addressing Concept-Evolution in Concept-Drifting Data Streams," in Proc. ICDM '10, Sydney, Australia, Dec 14-17, 2010.
Proposed Methods
Outlier Threshold (OUTTH)
To declare a test instance an outlier, using the cluster radius r alone is not enough, because of noise in the data.
Figure: an instance x outside a cluster of + instances, with its 5-neighborhoods λo,5(x) and λ+,5(x) and the distances a(x) and b+(x).
◦ So, beyond the radius r, a threshold (OUTTH) is set up so that most noisy data around a model cluster is classified immediately.
Proposed Methods
Outlier Threshold (OUTTH)
Every instance outside the cluster radius has a weight:
wt(x) = exp(r − b(x))
◦ If wt(x) ≥ OUTTH, the instance is considered an existing class instance.
◦ If wt(x) < OUTTH, the instance is an outlier.
Pros:
◦ Noisy data are classified immediately
Cons:
◦ OUTTH is hard to determine
 Noisy data and novel class instances may occur simultaneously
 Different datasets may need different OUTTH values
Proposed Methods
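The weight-then-threshold test above is a one-liner; here r, b(x), and OUTTH are toy values, with b(x) taken as the instance's distance from the cluster centroid (an assumption consistent with the figure).

```python
import math

def wt(b_x, r):
    """Weight decays exponentially beyond the cluster radius: 1 at b(x) = r."""
    return math.exp(r - b_x)

def is_outlier(b_x, r, outth):
    return wt(b_x, r) < outth

r, outth = 1.0, 0.7
print(is_outlier(b_x=1.1, r=r, outth=outth))  # False: just outside r, high weight
print(is_outlier(b_x=3.0, r=r, outth=outth))  # True: far outside r, low weight
```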
Outlier Threshold (OUTTH)
If the threshold is too high, noisy data may become outliers
◦ the FP rate will go up
If the threshold is too low, novel class instances will be labeled as existing class
◦ the FN rate will go up
OUTTH = ? We need to balance these two.
Figure: an instance x near a cluster of + instances, with a(x) and b+(x), illustrating the trade-off in placing OUTTH.
Proposed Methods
Introduction
Data Stream Classification
Clustering
Novel Class Detection
• Finer Grain Novel Class Detection
• Dynamic Novel Class Detection
• Multiple Novel Class Detection
Dynamic Threshold Setting
◦ Defer approach
 After a test chunk has been labeled, update OUTTH based on the marginal FP and FN rates of that chunk, and then apply the new OUTTH to the next test chunk
◦ Eager approach
 Once a marginal FP or marginal FN instance is detected, update OUTTH with a step function and apply the updated OUTTH to the next test instance
Figure: marginal FPs and marginal FNs are instances that are misclassified because they fall just on the wrong side of the current OUTTH.
Proposed Methods
Dynamic Threshold Setting
Proposed Methods
Defer Approach and Eager Approach Comparison
In the Defer approach, OUTTH updates after a data chunk is labeled
◦ Too late - within the test chunk, many marginal FPs or FNs may occur due to an improper OUTTH
◦ Overreact - if there are many marginal FP or FN instances in the labeled test chunk, the OUTTH update may overreact for the next test chunk
In the Eager approach, OUTTH updates aggressively whenever a marginal FP or FN happens
◦ The model is more tolerant of noisy data and concept-drift
◦ The model is more sensitive to novel class instances
Proposed Methods
Outlier Statistics
For each outlier instance, we calculate the novelty probability Pnov
◦ A large Pnov (close to 1) indicates that the outlier has a high probability of being a novel class instance.
Pnov has two parts:
◦ The first part measures how far the outlier is from the model cluster.
◦ The second part, Psc, is the silhouette coefficient, which measures the cohesion and separation of the q-neighbors of the outlier with respect to the model cluster.
Pnov(x) = ((1 − wt(x)) / (1 − min(wt))) ∗ Psc
Proposed Methods
Outlier Statistics
Three scenarios may occur simultaneously:
• Noise data
• Concept drift
• Novel class
Proposed Methods
Outlier Statistics: Gini Analysis
The Gini coefficient is a measure of statistical inequality. The discrete Gini coefficient is:
G(s) = (1/n) ∗ ( n + 1 − 2 ∗ ( Σi=1..n (n+1−i) yi ) / ( Σi=1..n yi ) )
If we divide [0, 1] into n equal-sized bins and put every outlier's Pnov into its corresponding bin, we obtain the cumulative distribution yi:
◦ If all Pnov are very low, then in the extreme yi = 1 for all i, and G(s) = 0.
◦ If all Pnov are very high, then in the extreme yi = 0 for all i < n and yn = 1, and G(s) = (n−1)/n.
Proposed Methods
Outlier Statistics: Gini Analysis
◦ If the outlier Pnov values are distributed evenly, yi = i/n, and G(s) = (n−1)/(3n).
After obtaining the outlier Pnov distribution, calculate G(s):
◦ If G(s) > (1/3)∗(n−1)/n, declare a novel class.
◦ If G(s) ≤ (1/3)∗(n−1)/n, classify the outliers as existing class instances.
As n → ∞, the threshold (1/3)∗(n−1)/n → 0.33.
Proposed Methods
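The binning, cumulative distribution, and threshold test above can be sketched end to end; n = 10 and the Pnov samples are illustrative values, not the authors' data.

```python
def gini_novelty(pnov_values, n=10):
    """Bin Pnov values in [0, 1], form the cumulative distribution y_i,
    compute the discrete Gini G(s), and compare it to (1/3)*(n-1)/n."""
    counts = [0] * n
    for p in pnov_values:
        counts[min(int(p * n), n - 1)] += 1
    total, cum, ys = sum(counts), 0, []
    for c in counts:                       # cumulative distribution y_i
        cum += c
        ys.append(cum / total)
    num = sum((n + 1 - i) * y for i, y in enumerate(ys, start=1))
    g = (n + 1 - 2 * num / sum(ys)) / n
    threshold = (n - 1) / (3 * n)
    return g, g > threshold                # True -> declare a novel class

# All Pnov high -> strong inequality, novel class declared:
print(gini_novelty([0.95, 0.97, 0.99, 0.96]))
# All Pnov low -> G(s) = 0, treated as existing class:
print(gini_novelty([0.05, 0.02, 0.01, 0.03]))
```

The two extremes reproduce the slide's analysis: the all-high case gives G(s) = (n−1)/n = 0.9 for n = 10, and the all-low case gives G(s) = 0.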
Outlier Statistics: Gini Analysis Limitation
◦ In the extreme, it is impossible to differentiate concept-drift from concept-evolution by the Gini coefficient, when the concept-drift simply "looks like" concept-evolution.
Proposed Methods
Introduction
Data Stream Classification
Clustering
Novel Class Detection
• Finer Grain Novel Class Detection
• Dynamic Novel Class Detection
• Multiple Novel Class Detection
Multi Novel Class Detection
Figure: a data stream in which two novel classes, A and B, appear in different regions of the feature space among the positive and negative instances.
If we always assume novel instances belong to one novel type, one type of novel instance, either A or B, will be misclassified.
Proposed Methods
Multi Novel Class Detection
The main idea in detecting multiple novel classes is to construct a graph and identify the connected components in the graph.
The number of connected components determines the number of novel classes.
Proposed Methods
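Counting novel classes as connected components is standard graph traversal; in this sketch the vertices (outlier clusters) and edges are assumed given, whereas the actual method derives the edges from nearest neighbors and silhouette-based merging.

```python
def count_components(vertices, edges):
    """Number of connected components, treating edges as undirected."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), 0
    for v in vertices:
        if v in seen:
            continue
        components += 1
        stack = [v]
        while stack:                 # DFS over one component
            u = stack.pop()
            if u not in seen:
                seen.add(u)
                stack.extend(adj[u] - seen)
    return components

# Outlier clusters 0-4; two linked groups -> two novel classes:
print(count_components([0, 1, 2, 3, 4], [(0, 1), (1, 2), (3, 4)]))  # 2
```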
Multi Novel Class Detection
Two phases:
◦ Building the connected graph
 Build a directed nearest-neighbor graph: from each vertex (outlier cluster), add an edge from the vertex to its nearest neighbor.
 If the silhouette coefficient from a vertex to its nearest neighbor is larger than some threshold, the edge is removed.
 Problem: linkage circles
◦ Component merging phase
 Gaussian distribution-centric decision
Proposed Methods
Multi Novel Class Detection
◦ Component merging phase
 In probability theory, "the normal (or Gaussian) distribution is a continuous probability distribution that is often used as a first approximation to describe real-valued random variables that tend to cluster around a single mean value" [1]
 If two Gaussian-distributed variables (components) g1 and g2 are separable, the following condition holds:
 d = centroid_dist(g1, g2) ≥ σ1 + σ2
 (integrating the Gaussian density shows that the mean distance of a component's points from its centroid is proportional to its σ)
 Since the mean deviation is proportional to σ, the two components remain separated if d ≥ c∗(σ1 + σ2); otherwise, the two components are merged.
1. Shun'ichi Amari and Hiroshi Nagaoka. Methods of Information Geometry. Oxford University Press. ISBN 0-8218-0531-2, 2000.
Proposed Methods
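The merge test above reduces to one comparison per component pair; each component is summarized by a centroid and a standard deviation, and the constant c (here 1.0) is an assumption for illustration.

```python
import math

def should_merge(c1, s1, c2, s2, c=1.0):
    """Merge two components when their centroids are too close
    relative to their spreads: d < c * (sigma1 + sigma2)."""
    d = math.dist(c1, c2)
    return d < c * (s1 + s2)

print(should_merge((0.0, 0.0), 1.0, (10.0, 0.0), 1.0))  # False: well separated
print(should_merge((0.0, 0.0), 1.0, (1.0, 0.0), 1.0))   # True: overlapping
```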
Experiments: DatasetsExperiments: Datasets We evaluated our approach on different datasets:
Experiment Results
Data Set       # of Instances   # of Classes
KDD            492K             7
Forest Cover   387K             7
NASA           140K             21
Twitter        335K             21
SynED          400K             20

The slide also marks, per dataset, which of concept drift, concept evolution, and dynamic features it exhibits.
Experiments: Setup
Development:
◦ Language: Java
H/W:
◦ Intel Pentium IV, 3 GHz dual-processor CPU, 3 GB memory
Parameter settings:
◦ K (number of pseudopoints per chunk) = 50
◦ q (minimum number of instances required to declare a novel class) = 50
◦ L (ensemble size) = 6
◦ S (chunk size) = 1,000
Experiments: Baseline
Competing approaches:
◦ i) DEMminer (our approach), with 5 variations:
 Lossy-F conversion
 Lossy-L conversion
 Lossless conversion (DEMminer)
 Dynamic OUTTH + lossless conversion (DEMminer-Ex without Gini)
 Dynamic OUTTH + Gini + lossless conversion (DEMminer-Ex)
◦ ii) WCE-OLINDDA (O-W)
◦ iii) FAE-WCE-OLINDDA_Parallel (O-F)
We use these combinations because, to the best of our knowledge, no existing approach can simultaneously classify streams and detect novel classes under feature evolution.
Experiments: Results
Evaluation metrics:
◦ Fn = total novel class instances misclassified as existing classes
◦ Fp = total existing class instances misclassified as novel class
◦ Fe = total existing class instances misclassified (other than Fp)
◦ Nc = total novel class instances in the stream
◦ N = total instances in the stream
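From these counts, the percentages reported in the result tables can be computed. This small helper assumes the standard definitions from the authors' papers (ERR = (Fp+Fn+Fe)*100/N, Mnew = Fn*100/Nc, Fnew = Fp*100/(N-Nc)); the function name is illustrative.

```python
def stream_metrics(Fn, Fp, Fe, Nc, N):
    """Return (ERR, Mnew, Fnew), all as percentages."""
    Mnew = 100.0 * Fn / Nc            # % of novel instances missed
    Fnew = 100.0 * Fp / (N - Nc)      # % of existing instances flagged novel
    ERR = 100.0 * (Fn + Fp + Fe) / N  # overall error rate
    return ERR, Mnew, Fnew
```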
Twitter Results
     DEMminer  Lossy-L  Lossy-F  O-F
AUC  0.88      0.83     0.76     0.56
     DEMminer-Ex  DEMminer  O-W
AUC  0.94         0.88      0.56
Forest Cover Results
     DEMminer  DEMminer-Ex (without Gini)  DEMminer-Ex  O-W
AUC  0.97      0.99                        0.97         0.74
NASA Dataset
     Deviation  Info Gain  FAE
AUC  0.996      0.967      0.876
KDD Results
     DEMminer  O-F
AUC  0.98      0.96
Result Summary
Dataset       Method               ERR   Mnew  Fnew  AUC    FP    FN
Twitter       DEMminer             4.2   30.5  0.8   0.877  -     -
              Lossy-F              32.5  0.0   32.6  0.834  -     -
              Lossy-L              1.6   82.0  0.0   0.764  -     -
              O-F                  3.4   96.7  1.6   0.557  -     -
ASRS          DEMminer             0.02  -     -     0.996  0.00  0.1
              DEMminer(info-gain)  1.4   -     -     0.967  0.04  10.3
              O-F                  3.4   -     -     0.876  0.00  24.7
Forest Cover  DEMminer             3.6   8.4   1.3   0.973  -     -
              O-F                  5.9   20.6  1.1   0.743  -     -
KDD           DEMminer             1.2   5.9   0.9   0.986  -     -
              O-F                  4.7   9.6   4.4   0.967  -     -
Result Summary (contd.)
Dataset       Method       ERR  Mnew  Fnew  AUC
Twitter       DEMminer     4.2  30.5  0.8   0.877
              DEMminer-Ex  1.8  0.7   0.6   0.944
              O-W          3.4  96.7  1.6   0.557
Forest Cover  DEMminer     3.6  8.4   1.3   0.974
              DEMminer-Ex  3.1  4.0   0.68  0.990
              O-W          5.9  20.6  1.1   0.743
Running Time Comparison
Dataset       Time (sec)/1K instances   Points/sec                Speed gain of DEMminer over O-F
              DEMminer  Lossy-F  O-F    DEMminer  Lossy-F  O-F
Twitter       23        3.5      66.7   43        289      15     2.9
ASRS          21        4.3      38.5   47        233      26     1.8
Forest Cover  1.0       1.0      4.7    967       1003     212    4.7
KDD           1.2       1.2      3.3    858       812      334    2.5
Multi Novel Detection Results
Conclusion
•Our data stream classification technique addresses
•Infinite length
•Concept-drift
•Concept-evolution
•Feature-evolution
•Existing approaches address only the first two issues
•Applicable to many domains, such as
•Intrusion/malware detection
•Text categorization
•Fault detection, etc.
University of Texas at Dallas
References
J. Gehrke, V. Ganti, R. Ramakrishnan, and W. Loh. "BOAT: Optimistic Decision Tree Construction." In Proc. SIGMOD, 1999.
P. Domingos and G. Hulten, “Mining high-speed data streams”. In Proc.
SIGKDD, pages 71-80, 2000.
Wenerstrom, B., Giraud-Carrier, C., "Temporal data mining in dynamic feature spaces." In: Perner, P. (ed.) ICDM 2006. LNCS (LNAI), vol. 4065, pp. 1141-1145. Springer, Heidelberg (2006)
E. J. Spinosa, A. P. de Leon F. de Carvalho, and J. Gama. “Cluster-based novel
concept detection in data streams applied to intrusion detection in computer
networks”. In Proc. 2008 ACM symposium on Applied computing, pages 976–
980, (2008).
M. Scholz and R. Klinkenberg. “An ensemble classifier for drifting concepts.” In
Proc. ICML/PKDD Workshop in Knowledge Discovery in Data Streams., 2005.
References (contd.)
Brutlag, J. (2000). "Aberrant behavior detection in time series for network monitoring." In: Proc. Usenix Fourteenth System Admin. Conf. LISA XIV, New Orleans, LA. (Dec 2000)
Eskin, E., Arnold, A., Prerau, M., Portnoy, L., Stolfo, S.: "A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data." Applications of Data Mining in Computer Security, Kluwer (2002).
Fan, W. “Systematic data selection to mine concept-drifting data streams.” In Proc.
KDD 04
Gao, J, Wei Fan, and Jiawei Han. (2007a). "On Appropriate Assumptions to Mine Data
Streams”
Gao, J. Wei Fan, Jiawei Han, Philip S. Yu. (2007b). “A General Framework for Mining
Concept-Drifting Data Streams with Skewed Distributions.” SDM 2007
Goebel, J. and T. Holz. "Rishi: Identify bot contaminated hosts by IRC nickname evaluation." In Usenix/Hotbots ’07 Workshop, 2007.
Grizzard, J. B., V. Sharma, C. Nunnery, B. B. Kang, and D. Dagon (2007). "Peer-to-peer botnets: Overview and case study." In Usenix/Hotbots ’07 Workshop.
References (contd.)
Keogh, E.J. and Pazzani, M.J. (2000). "Scaling up dynamic time warping for data mining applications." In: ACM SIGKDD (2000).
Lemos, R. (2006): Bot software looks to improve peerage. SecurityFocus.
http://www.securityfocus.com/news/11390 (2006).
Livadas, C., B.Walsh, D. Lapsley, and T. Strayer (2006) “Using machine
learning techniques to identify botnet traffic.” In 2nd IEEE LCN Workshop
on Network Security (WoNS’2006), November 2006.
LURHQ Threat Intelligence Group (2004). Sinit p2p trojan analysis.
http://www.lurhq.com/sinit.html (2004)
Rajab, M. A. J. Zarfoss, F. Monrose, and A. Terzis (2006) “A multifaceted
approach to understanding the botnet phenomenon.” In Proceedings of
the 6th ACM SIGCOMM on Internet Measurement Conference (IMC), 2006.
Kagan Tumer and Joydeep Ghosh (1996). "Error correlation and error reduction in ensemble classifiers." Connection Science, 8(3-4):385-403.
References (contd.)
Mohammad Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani
Thuraisingham, “A Multi-Partition Multi-Chunk Ensemble Technique to
Classify Concept-Drifting Data Streams." In Proc. of 13th Pacific-Asia
Conference on Knowledge Discovery and Data Mining (PAKDD-09), Page:
363-375, Bangkok, Thailand, April 2009.
Mohammad Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani
Thuraisingham, “A Practical Approach to Classify Evolving Data Streams:
Training with Limited Amount of Labeled Data.” In Proc. of 2008 IEEE
International Conference on Data Mining (ICDM 2008), Pisa, Italy, Page
929-934, December, 2008.
Clay Woolam, Mohammad Masud, and Latifur Khan, "Lacking Labels In
The Stream: Classifying Evolving Stream Data With Few Labels”. In Proc.
of 18th International Symposium on Methodologies for Intelligent
Systems (ISMIS), Page 552-562, September 2009 Prague, Czech Republic
References (contd.)
Mohammad Masud, Qing Chen, Latifur Khan, Charu Aggarwal, Jing Gao, Jiawei Han, and
Bhavani Thuraisingham, “Addressing Concept-Evolution in Concept-Drifting Data
Streams”. In Proc. of 2010 10th IEEE International Conference on Data Mining (ICDM
2010), Sydney, Australia, Dec 2010.
Mohammad M. Masud, Qing Chen, Jing Gao, Latifur Khan, Jiawei Han, Bhavani
Thuraisingham , “Classification and Novel Class Detection of Data Streams in a Dynamic
Feature Space”. In Proc. of European Conference on Machine Learning and Knowledge
Discovery in Databases, ECML PKDD 2010, Barcelona, Spain, September 20- 24, 2010,
Springer 2010, ISBN 978-3-642-15882-7, Page: 337-352.
Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham,
“Classification and Novel Class Detection in Data Streams with Active Mining”. In Proc of
14th Pacific-Asia Conference on Knowledge Discovery and Data Mining, 21-24 June,
2010, Page 311-324, - Hyderabad, India.
References (contd.)
Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham, "Classification and Novel
Class Detection in Concept-Drifting Data Streams under Time Constraints" , IEEE Transactions on Knowledge &
Data Engineering (TKDE), 2011, IEEE Computer Society, June 2011, Vol. 23, No. 6, Page 859-874.
Charu C. Aggarwal, Jiawei Han, Jianyong Wang, Philip S. Yu, “A Framework for Clustering Evolving Data streams”
Published in Proceedings VLDB ’03 proceedings of the 29th international conference on Very Large Data Bases-
Volume 29
H. Wang, W. Fan, P. S. Yu, and J. Han. “Mining concept-drifting data streams using ensemble classifiers”. In Proc.
ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 226–235,
Washington, DC, USA, Aug, 2003. ACM.
Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham. “Integrating Novel Class
Detection with Classification for Concept-Drifting Data Streams”. In Proceedings of 2009 European Conf. on
Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD’09), Bled,
Slovenia, 7-11 Sept, 2009.
Questions