Systematic Data Selection to Mine Concept-Drifting Data Streams
Wei Fan
Proceedings of the 2004 ACM SIGKDD international conference on Knowledge discovery
2005/01/21 董原賓


TRANSCRIPT

Page 1:

Systematic Data Selection to Mine Concept-Drifting Data Streams

Wei Fan
Proceedings of the 2004 ACM SIGKDD international conference on Knowledge discovery

2005/01/21 董原賓

Page 2:

Outline

Issues with data streams
Will old data really help?
Optimal models
Computing optimal models
Cross validation decision tree ensemble
Experiment
Conclusion

Page 3:

Issues with data streams

Concept drift
FOi(x): the optimal model at time stamp i. There is concept drift from time stamp i-1 to time stamp i if there are inconsistencies between FOi-1(x) and FOi(x).

Data sufficiency
A data set is considered sufficient if adding more data will not increase the generalization accuracy. Determining the sufficient amount can be formidably expensive. Even if the dataset is insufficient, we still need to train a model that best fits the changing data.
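To make the drift test concrete, here is a minimal sketch, assuming the two optimal models are callables returning class labels and that a reference sample X_ref is available; the function name and the disagreement-rate criterion are illustrative, not from the paper.

    # Illustrative only: flag concept drift between time stamps i-1 and i by
    # measuring how often the two optimal models disagree on a reference sample.
    def drift_detected(f_opt_prev, f_opt_curr, X_ref, tol=0.0):
        """Return True if the models disagree on more than a `tol`
        fraction of the reference examples (tol is an assumption)."""
        disagreements = sum(1 for x in X_ref if f_opt_prev(x) != f_opt_curr(x))
        return disagreements / len(X_ref) > tol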

Page 4:

Will old data really help?

Definitions:
Si: the most recent data chunk
SP: the previous data chunks, SP = S1 ∪ … ∪ Si-1
y = f(x): the underlying true model that we aim to learn

Page 5:

Will old data really help? (Cont.)

Three types of SP:
1. FOi(x) ≠ FOi-1(x): obviously this portion will only cancel out the changing concept.
2. FOi(x) = FOi-1(x) ≠ y: both models agree, but their predictions are wrong; we cannot determine whether this portion will help.
3. FOi(x) = FOi-1(x) = y: both models make the correct prediction; this is the only portion that may help.
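A small sketch of this three-way split, assuming the two models are callables and the old data is a list of (x, y) pairs; all names are illustrative:

    # Partition the previous data S_P into the three types described above;
    # only type 3 (both models correct) may help the new model.
    def partition_old_data(f_opt_curr, f_opt_prev, old_data):
        type1, type2, type3 = [], [], []
        for x, y in old_data:
            if f_opt_curr(x) != f_opt_prev(x):      # type 1: models disagree
                type1.append((x, y))
            elif f_opt_curr(x) != y:                # type 2: agree but wrong
                type2.append((x, y))
            else:                                   # type 3: agree and correct
                type3.append((x, y))
        return type1, type2, type3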

Page 6:

Will old data really help? (Cont.)

Page 7:

Optimal Models

The two main themes of our comparison are possible data insufficiency and concept drift.

Four situations:
1. New data is sufficient by itself and there is no concept drift: the optimal model should be the one trained from the new data itself.

Page 8:

Optimal Models (cont.)

2. New data is sufficient by itself and there is concept drift: the optimal model should be the one trained from the new data itself.

3. New data is insufficient by itself and there is no concept drift: if the previous data is sufficient, the optimal model should be the existing model; otherwise, train a new model from the new data plus the existing data.

Page 9:

Optimal Models (cont.)

4. New data is insufficient by itself and there is concept drift: choose previous data chunks whose concept is consistent with the new data chunk and combine them with the new data.

In practice we can rarely know whether the data is indeed sufficient. Ideally, we should compare a few sensible choices when the training cost is affordable; the sketch below summarizes the four cases.
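For illustration only, the four situations reduce to a small decision rule; the boolean inputs are assumptions, since sufficiency is not observable in practice, which is exactly why the framework that follows compares trained candidates empirically:

    # Illustrative summary of the four situations; in practice these flags
    # are unknown and the choice is made by cross validation instead.
    def choose_training_data(new_sufficient, drift, prev_sufficient):
        if new_sufficient:                          # cases 1 and 2
            return "new data only"
        if not drift:                               # case 3
            return "existing model" if prev_sufficient else "new + all old data"
        return "new data + consistent old chunks"   # case 4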

Page 10:

Computing optimal models

Definitions:
Di-1: the dataset that trained the most recent optimal model FOi-1(x); it is collected iteratively throughout the stream mining process
si-1: examples selected from Di-1

Page 11:

Computing optimal models (cont.)

Steps:
1. Train a model FNi(x) from the new data chunk Si only.
2. Select the examples from Di-1 on which both the newly trained model FNi(x) and the recent optimal model FOi-1(x) make the correct prediction:

si-1 = { (x, y) ∈ Di-1 : (FNi(x) = y) ∧ (FOi-1(x) = y) }
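Step 2 amounts to a one-line filter; a minimal sketch, assuming both models are callables and Di-1 is a list of (x, y) pairs:

    # Keep only the old examples on which both F_Ni and F_Oi-1 are correct.
    def select_consistent_examples(f_new, f_opt_prev, D_prev):
        return [(x, y) for x, y in D_prev
                if f_new(x) == y and f_opt_prev(x) == y]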

Page 12:

Computing optimal models (cont.)

3. Train a model FNi+(x) from the new data plus the data selected in the last step (Si ∪ si-1).

4. Update the most recent model FOi-1(x) with Si and call the result FOi-1+(x). When updating the model, keep its structure and only update its internal statistics, as in the sketch below.
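A sketch of this structure-preserving update, assuming a tree whose leaves store class counts; route_to_leaf is a hypothetical helper that follows the existing splits without altering them:

    # Hypothetical sketch of step 4: route the new chunk S_i through the
    # frozen tree and refresh only the per-leaf class counts; no split is
    # added, moved, or removed.
    def update_statistics(tree, S_i, route_to_leaf):
        for x, y in S_i:
            leaf = route_to_leaf(tree, x)
            leaf.class_counts[y] = leaf.class_counts.get(y, 0) + 1
        return tree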

Page 13:

Computing optimal models (cont.)

5. Use cross validation to compare the accuracy of FNi(x), FOi-1(x), FNi+(x), and FOi-1+(x). Choose the most accurate one and name it FOi(x).

6. Di is the training set that produced FOi(x); it is one of Si, Di-1, Si ∪ si-1, and Si ∪ Di-1.
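Putting steps 1-6 together, one pass over a new chunk might look like the sketch below; train, update_statistics, and cv_accuracy stand in for the paper's cross-validation decision tree ensemble, and all names are illustrative:

    # Hypothetical end-to-end sketch of steps 1-6 for one new chunk S_i.
    # S_i and D_prev are lists of (x, y) pairs.
    def process_chunk(S_i, D_prev, f_opt_prev, train, update_statistics, cv_accuracy):
        f_new = train(S_i)                                    # step 1
        s_prev = [(x, y) for x, y in D_prev                   # step 2
                  if f_new(x) == y and f_opt_prev(x) == y]
        f_new_plus = train(S_i + s_prev)                      # step 3
        f_prev_plus = update_statistics(f_opt_prev, S_i)      # step 4 (on a copy in practice)
        candidates = [                                        # step 5: the four choices
            (f_new, S_i), (f_opt_prev, D_prev),
            (f_new_plus, S_i + s_prev), (f_prev_plus, S_i + D_prev),
        ]
        f_opt, D_i = max(candidates, key=lambda c: cv_accuracy(c[0]))
        return f_opt, D_i                                     # step 6: D_i trains F_Oi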

Page 14:

Computing optimal models (cont.)

How the above framework finds the optimal model under all four previously discussed situations:

1. New data is sufficient by itself and there is no concept drift: conceptually FNi(x) should be the optimal model; FOi-1(x), FNi+(x), and FOi-1+(x) could be close matches.

Page 15:

Computing optimal models (cont.)

2. New data is sufficient by itself and there is concept drift: FNi(x) should be the optimal model; FNi+(x) could be very similar in performance to FNi(x).

3. New data is insufficient by itself and there is no concept drift: the optimal model should be either FOi-1(x) or FOi-1+(x).

4. New data is insufficient by itself and there is concept drift: the optimal model should be either FNi(x) or FNi+(x).

Page 16:

Cross Validation Decision Tree Ensemble

Decision tree ensemble:
Step 1: Sequentially scan the complete dataset once and find all features with information gain.
Step 2: When building trees, at each node choose one of the remaining features at random.
Step 3: A branch stops growing when no more examples pass through it.

Page 17:

Cross Validation Decision Tree Ensemble (cont.)

Rule 1: Features without information gain are never used.
Rule 2: Each discrete feature can be used at most once on a particular decision path.
Rule 3: Each continuous feature can be chosen multiple times on the same decision path.

Page 18:

Cross Validation Decision Tree Ensemble (cont.)

Rule 4: The splitting threshold for a continuous feature is a random value between that feature's minimum and maximum.

Rule 5: In the training data, each example x is assigned an initial weight of 1.0. When a missing feature value is encountered, the current weight of x is distributed across the children nodes: if the prior distribution of the known values is given, the weight is distributed in proportion to that distribution; otherwise, it is divided equally among the children. A sketch of tree construction under these rules follows.
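As a rough illustration of Rules 1-4, here is a sketch of growing one random tree. Rule 1 is assumed handled by the caller, which passes only features that showed information gain; the dict-based tree representation and the max_depth guard are practical assumptions, not from the paper, and Rule 5's weight handling is omitted for brevity.

    import random

    def count_labels(data):
        counts = {}
        for _, y in data:
            counts[y] = counts.get(y, 0) + 1
        return counts

    def build_tree(data, features, continuous, depth=0, max_depth=20):
        """data: list of (x, y) with x a dict mapping feature -> value."""
        if not data or not features or depth == max_depth:
            return {"counts": count_labels(data)}       # branch stops growing
        f = random.choice(features)                     # pick a feature at random
        if f in continuous:                             # Rule 3: f stays reusable
            vals = [x[f] for x, _ in data]
            lo, hi = min(vals), max(vals)
            if lo == hi:                                # cannot split on f here
                return {"counts": count_labels(data)}
            t = random.uniform(lo, hi)                  # Rule 4: random threshold
            return {"feature": f, "threshold": t,
                    "left": build_tree([d for d in data if d[0][f] <= t],
                                       features, continuous, depth + 1, max_depth),
                    "right": build_tree([d for d in data if d[0][f] > t],
                                        features, continuous, depth + 1, max_depth)}
        remaining = [g for g in features if g != f]     # Rule 2: discrete used once
        return {"feature": f,
                "children": {v: build_tree([d for d in data if d[0][f] == v],
                                           remaining, continuous, depth + 1, max_depth)
                             for v in {x[f] for x, _ in data}}}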

Page 19:

Cross Validation Decision Tree Ensemble (cont.)

Cross validation:
n: the size of the training set
n-fold cross validation leaves one example x out, uses the remaining n - 1 examples to train a model, and classifies the left-out example x.

Page 20:

Cross Validation Decision Tree Ensemble (cont.)

Compute the probability of x: assume that we have two class labels, fraud and non-fraud; the probability of x being fraudulent is then:

[Formula on the original slide not preserved in the transcript]
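What the surrounding slides imply is that the ensemble is trained once on all n examples and that, when scoring a left-out example, its own weight is removed from the statistics of the leaf it reaches; a hedged sketch under that assumption:

    # Hedged sketch: estimate P(fraud | x) from the leaf x falls into,
    # subtracting x's own weight to obtain the leave-one-out estimate.
    # The leaf layout (a dict of summed class weights) is an assumption.
    def loo_fraud_probability(leaf, x_weight, x_is_fraud):
        fraud = leaf["fraud"] - (x_weight if x_is_fraud else 0.0)
        total = leaf["fraud"] + leaf["non_fraud"] - x_weight
        return fraud / total if total > 0 else 0.0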

Page 21:

Experiment (streaming data)

Synthetic data: data with drifting concepts based on a moving hyperplane; roughly half of the examples are positive and the other half negative.

Credit card fraud data: real-life credit card transactions for cost-sensitive mining, covering a one-year period and 5 million transactions; investigating a fraudulent transaction costs $90.

Donation dataset: the well-known donation dataset that first appeared in the KDDCUP'98 competition; 95,412 known records and 96,367 unknown records for testing.

Page 22:

Experiment (synthetic dataset)

Page 23:

Experiment (credit card dataset)

Page 24:

Experiment (Donation dataset)

Page 25:

Experiment (Accuracy of Cross validation)

Page 26:

Experiment (Optimal model counts)

Page 27:

Experiment (Training time)

Page 28:

Experiment (Data sufficiency test)

Page 29:

Conclusion

Using old data unselectively is like gambling.
Proposed a cross-validation-based framework to choose data and compare sensible choices.
Proposed an implementation of this framework using a cross-validation decision tree ensemble.