TRANSCRIPT
Systematic Data Selection to Mine Concept Drifting Data Streams
Wei Fan, IBM T.J. Watson
About …

Data streams: a continuous stream of new data, generated either in real time or periodically. Examples: credit card transactions, stock trades, insurance claims, phone call records.

Our notation.
Data Streams
[Timeline figure: data arriving over periods t1–t5; earlier periods are old data, the latest period is the new data]
Data Stream Mining
Characteristics of the stream may change over time.

Main goal of stream mining: make sure the constructed model is the most accurate and up-to-date.
Data Sufficiency

Definition:
A dataset is considered “sufficient” if adding more data items will not increase the final accuracy of a trained model significantly.
We normally do not know if a dataset is sufficient or not.
Sufficiency detection requires an expensive “progressive sampling” experiment: keep adding data and stop when accuracy no longer increases significantly.

Sufficiency depends on both the dataset and the algorithm, so it is difficult to make a general claim.
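The progressive-sampling loop described above can be sketched as follows. The `train_and_score` callable, the doubling schedule, and the tolerance are illustrative assumptions, not details from the paper:

```python
import random

def progressive_sampling(data, train_and_score, tol=0.005, start=100):
    """Grow the training set until accuracy stops improving significantly.

    data: list of labeled examples (placeholder representation).
    train_and_score: trains a model on a sample and returns held-out
    accuracy (placeholder for whatever learner is being probed).
    """
    random.shuffle(data)
    n, prev_acc = start, 0.0
    while n <= len(data):
        acc = train_and_score(data[:n])
        if acc - prev_acc < tol:       # no significant gain: call it sufficient
            return n, acc
        prev_acc, n = acc, n * 2       # a common doubling schedule
    return len(data), prev_acc         # ran out of data before saturating

# toy usage: simulated accuracy saturates at 0.9
def fake_score(sample):
    return min(0.9, 0.5 + 0.001 * len(sample))

n_sufficient, acc = progressive_sampling(list(range(2000)), fake_score)
```

Note the result depends on both the data and the learner, which is exactly why no general sufficiency claim can be made.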
Possible changes of data streams

Possible “concept drift”: for the same feature vector, different class labels are generated at some later time, either deterministically or stochastically (with different probabilities).

Possible changes in “data sufficiency”.

Other possible changes are not addressed in our paper.

Most important of all: these are “possibilities”. There is no “oracle” out there to tell us the truth, so it is dangerous to make assumptions.
How many combinations?
Four combinations: sufficient and no drift; insufficient and no drift; sufficient and drift; insufficient and drift.
Question: Does the “most accurate model” remain the same under all four situations?
Case 1: Sufficient and no drift
Solution one: throw away the old models and data, and re-train a new model from the new data. This works by the definition of data sufficiency.

Solution two: if the old model was trained from “sufficient data”, just keep using the old model.
Case 2: Sufficient and drift
Solution one: train a new model from the new data; the same “sufficiency definition” applies.
Case 3: Insufficient and no drift
Possibility I: if the old model was trained from sufficient data, keep the old model.

Possibility II: otherwise, combine the new data with the old data and train a new model.
Case 4: Insufficient and drift
Obviously, the new data alone is not enough, by definition.

What are our options? Use old data? But how?
A moving hyperplane

[Figures, slides 12–13: points labeled by a hyperplane whose position shifts over time]
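The moving hyperplane is a common synthetic generator for drifting streams. A minimal sketch, with all parameters illustrative rather than the paper's exact setup: points are labeled by which side of a hyperplane they fall on, and the boundary shifts a little each step:

```python
import random

def hyperplane_stream(n_points, dim=2, drift_per_step=0.001, seed=0):
    """Yield (x, label, t): label = 1 iff sum(w * x) >= threshold.

    The threshold moves by drift_per_step each step, so later examples
    near the old boundary may contradict earlier ones (concept drift).
    """
    rng = random.Random(seed)
    w = [1.0] * dim
    threshold = 0.5 * dim              # boundary through the unit cube's center
    for t in range(n_points):
        x = [rng.random() for _ in range(dim)]
        label = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= threshold else 0
        yield x, label, t
        threshold += drift_per_step    # the hyperplane moves

stream = list(hyperplane_stream(1000))
```

Examples generated early in the stream that sit between the old and new boundary positions are exactly the “inconsistent” old data discussed below.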
See any problems?
Which old data items can we use?
We need to be picky
Inconsistent Examples
Consistent Examples
See more problems?
We normally never know which of the four cases a real data stream belongs to, and it may change from case to case over time. Normally, no ground truth is known a priori, or even later.
Solution Requirements:
The right solution should not be “one size fits all.”
Should not make any assumptions. Any assumptions can be wrong.
It should be adaptive. Let the data speak for itself.
We prefer model A over model B if A is likely to be more accurate than B on the evolving data stream.
No assumptions!
An “Un-biased” Selection Framework

- Train FN from the new data.
- Train FN+ from the new data plus selected consistent old data.
- Let FO be the previous most accurate model.
- Update FO using the new data; call the result FO+.
- Use cross-validation to choose among the four candidate models {FN, FN+, FO, FO+}.
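The final selection step could look like the sketch below. How each candidate was trained and how cross-validated accuracy is computed are left as placeholders; only the comparison logic is shown:

```python
def choose_best(candidates, cv_accuracy, new_data):
    """Pick the most accurate of the four candidate models.

    candidates: dict name -> already-trained model (placeholder objects).
    cv_accuracy: callable estimating a model's accuracy on new_data by
    cross-validation (placeholder for the paper's CV procedure).
    """
    scores = {name: cv_accuracy(model, new_data) for name, model in candidates.items()}
    return max(scores, key=scores.get)

# toy usage: stand in fake fixed accuracies for the four candidates
fake_models = {"FN": 0.81, "FN+": 0.87, "FO": 0.74, "FO+": 0.79}
best = choose_best(fake_models, lambda model, data: model, new_data=None)
```

Because the winner is decided by estimated accuracy on the new data rather than by any assumption about drift or sufficiency, the framework adapts to whichever of the four cases actually holds.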
Consistent old data
Theoretically, if we knew the true models, we could use them to choose consistent data. But we don’t.

Practically, we have to rely on “optimal models.”

Go back to the hyperplane example.
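One way to read “consistent old data”: old examples whose recorded labels agree with the model fit to the new data. A sketch, with `predict` standing in for that model (an assumption for illustration, not necessarily the paper's exact criterion):

```python
def select_consistent(old_data, predict):
    """Keep old (x, y) pairs whose label matches what the new-concept
    model predicts for them; predict(x) is a placeholder."""
    return [(x, y) for x, y in old_data if predict(x) == y]

# toy usage: suppose the new concept labels x >= 5 as positive
old = [(2, 0), (3, 1), (7, 1), (9, 0)]
kept = select_consistent(old, lambda x: 1 if x >= 5 else 0)
```

Old examples near the moved boundary disagree with the new concept and are filtered out; the rest can safely augment the insufficient new data.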
A moving hyperplane
Their optimal models
True model and optimal models
True model: a perfect model that never makes mistakes. Not always possible, due to the stochastic nature of the problem, noise in the training data, or insufficient data.

Optimal model: defined with respect to a given loss function.
Optimal Model

A loss function L(t, y) evaluates performance, where t is the true label and y is the prediction.

The optimal decision y* is the label that minimizes the expected loss when x is sampled many times:
- 0-1 loss: y* is the label that appears most often, i.e., if P(fraud|x) > 0.5, predict fraud.
- cost-sensitive loss: y* is the label that minimizes the “empirical risk”, e.g., if P(fraud|x) × $1000 > $90, equivalently P(fraud|x) > 0.09, predict fraud.
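The two decision rules above, written out directly with the dollar figures from the slide:

```python
def predict_fraud_01(p_fraud):
    """0-1 loss: predict the label that is most likely."""
    return p_fraud > 0.5

def predict_fraud_cost(p_fraud, gain=1000.0, cost=90.0):
    """Cost-sensitive rule from the slide: flag the transaction when the
    expected recovery p * $1000 exceeds the $90 investigation cost,
    which is equivalent to p > 0.09."""
    return p_fraud * gain > cost
```

For example, p(fraud|x) = 0.12 is below the 0-1 threshold but well above the cost-sensitive one, so the two loss functions disagree on the same probability estimate.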
Random decision trees

Train multiple trees (details to follow). Each tree outputs a posterior probability when classifying an example x. The probability outputs of the many trees are averaged as the final probability estimate. The loss function and this probability are then used to make the best prediction.
Training

At each node, an unused feature is chosen at random.
- A discrete feature is unused if it has never been chosen previously on the decision path from the root to the current node.
- A continuous feature can be chosen multiple times on the same decision path, but each time with a different threshold value.
Example

Gender?
├─ M → Age > 30?
│      ├─ yes → P: 100, N: 150
│      └─ no  → P: 1, N: 9
└─ F → Age > 25?
       └─ … …
Training, continued

We stop when one of the following happens: a node becomes empty, or the total height of the tree exceeds a threshold, currently set to the total number of features.

Each node of the tree keeps the number of examples belonging to each class.
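A minimal sketch of this training procedure for binary discrete features. The data representation is assumed for illustration, and continuous features with random thresholds are omitted; for discrete features, using each feature at most once per path caps the height at the number of features, matching the stated threshold:

```python
import random

def build_tree(data, features, rng=None):
    """Sketch of random-decision-tree training for binary discrete features.

    data: list of (x, y) pairs where x maps feature name -> 0/1.
    Each node stores per-class example counts; the split feature is
    chosen at random from the features not yet used on this path.
    """
    rng = rng or random.Random(0)
    counts = {}
    for _, y in data:
        counts[y] = counts.get(y, 0) + 1
    node = {"counts": counts}
    # stop when the node is empty or every feature on this path is used
    if not data or not features:
        return node
    f = rng.choice(features)                 # an unused feature, at random
    remaining = [g for g in features if g != f]
    node["feature"] = f
    node["children"] = {
        v: build_tree([(x, y) for x, y in data if x[f] == v], remaining, rng)
        for v in (0, 1)
    }
    return node

data = [({"a": 0, "b": 1}, 1), ({"a": 1, "b": 0}, 0)]
tree = build_tree(data, ["a", "b"])
```

Note that no information gain is computed anywhere, which is the source of the training efficiency claimed later.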
Classification

Each tree outputs a membership probability, e.g., p(fraud|x) = n_fraud / (n_fraud + n_normal).

If a leaf node is empty (very likely when a discrete feature is tested near the bottom of the tree), use the parent node’s probability estimate; never output 0 or NaN.

The membership probabilities from the multiple random trees are averaged to form the final output.

A loss function is required to make a decision: under 0-1 loss, predict fraud if p(fraud|x) > 0.5; under cost-sensitive loss, predict fraud if p(fraud|x) × $1000 > $90.
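The classification rules above, sketched for two classes. The `path_counts` representation, holding the per-class counts along a root-to-leaf path, is an assumption made for illustration:

```python
def leaf_probability(path_counts):
    """path_counts: (n_fraud, n_normal) at each node from root to leaf.
    If the leaf is empty, back off to the nearest non-empty ancestor so
    we never output 0/0 (NaN)."""
    for n_fraud, n_normal in reversed(path_counts):
        if n_fraud + n_normal > 0:
            return n_fraud / (n_fraud + n_normal)
    return 0.5  # no information anywhere on the path: uninformative estimate

def ensemble_probability(paths):
    """Average the per-tree probabilities into the final estimate."""
    probs = [leaf_probability(p) for p in paths]
    return sum(probs) / len(probs)

# two trees: the first has an empty leaf and backs off to its parent (3, 7)
p = ensemble_probability([
    [(10, 10), (3, 7), (0, 0)],   # tree 1: leaf empty -> use 3/10
    [(10, 10), (6, 4)],           # tree 2: leaf gives 6/10
])
```

The averaged p then feeds whichever loss-based decision rule is in use.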
N-fold Cross-validation with Random Decision Trees
The tree structure is independent of the data.

Probabilities are computed with a compensation for the held-out examples.
Key advantage
n-fold cross-validation comes easy: it has the same cost as testing the model once on the training data.

Training is efficient, since we do not compute information gain. It is actually also very accurate.
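One reading of the “compensation” idea: because the tree structure never depends on the data, leaving an example out changes nothing but the counts, so a leave-one-out estimate only requires subtracting the test example’s own contribution from its leaf before computing the probability. A sketch under that assumption:

```python
def loo_probability(n_fraud, n_normal, example_is_fraud):
    """Leave-one-out estimate at a leaf: remove the test example's own
    count before computing p(fraud|x). No retraining is needed because
    the tree structure is fixed regardless of the data."""
    if example_is_fraud:
        n_fraud -= 1
    else:
        n_normal -= 1
    total = n_fraud + n_normal
    return n_fraud / total if total > 0 else 0.5

# a fraud example falling in a leaf with counts (4, 6): evaluate it
# against the remaining 3 fraud and 6 normal examples
p = loo_probability(4, 6, example_is_fraud=True)
```

Doing this for every example costs one pass over the training data, which is why n-fold cross-validation has the same cost as testing the model once.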
Experiments
A demo is available; please contact me.

The paper includes the following experiments: synthetic datasets, a credit card fraud dataset, and a donation dataset.
Compare

- The new selective framework proposed in this paper.
- Last year’s hard-coded ensemble framework, which uses k weighted classifiers:
  - k = 1: train only on the new data.
  - k = 8: use the new data plus the models from the previous 7 periods; each classifier is weighted by its performance on the new data.

Settings: data sufficient and insufficient; drift is always present.
Data insufficient: new method
Last year’s method
Avg Result
Data sufficient: new method
Data sufficient: last year’s method
Avg Result
Independent study and implementation of random decision tree
Kai Ming Ting and Tony Liu from Monash University, Australia, on UCI datasets.

Edward Greengrass from the DOD, on their own datasets: 100 to 300 features, both categorical and continuous, some features with many values; 2000 to 3000 examples; both binary and multi-class problems (16 and 25 classes).
Related publications on random trees
“Is random model better? On its accuracy and efficiency,” ICDM 2003.

“On the optimality of probability estimation by random decision trees,” AAAI 2004.

“Mining concept-drifting data streams using ensemble classifiers,” SIGKDD 2003.