A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao† Wei Fan‡ Jiawei Han† Philip S. Yu‡ †University of Illinois at Urbana-Champaign ‡IBM T. J. Watson Research Center

Upload: danielle-wood

Post on 27-Mar-2015


Page 1

A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions

Jing Gao† Wei Fan‡ Jiawei Han† Philip S. Yu‡

†University of Illinois at Urbana-Champaign
‡IBM T. J. Watson Research Center

Page 2

Introduction (1)

• Data Stream
– Continuously arriving data flow
– Applications: network traffic, credit card transaction flow, phone calling records, etc.

[Figure: a data stream illustrated as a flow of binary digits]

Page 3

Introduction (2)

• Stream Classification
– Construct a classification model based on past records
– Use the model to predict labels for new data
– Help decision making

[Figure: a classification model labeling an incoming record; the query "Fraud?" receives the label "Fraud"]

Page 4

Framework

[Figure: framework diagram; a classification model predicts the unlabeled ("?") part of the stream]

Page 5

Concept Drifts

• Changes in P(x,y)
– P(x,y) = P(y|x)P(x), where x is the feature vector and y is the class label
– No Change, Feature Change, Conditional Change, Dual Change
– Expected error is not a good indicator of concept drifts
– Training on the most recent data could help reduce expected error

[Figure: panels at Time Stamp 1, Time Stamp 11, and Time Stamp 21]
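The four cases named on this slide follow from which factor of the decomposition changes between two time stamps. A compact way to write this out is given below; the subscripts t and t+1 are assumed notation for consecutive time stamps and do not appear in the slides.

```latex
\begin{align*}
P_t(x,y) &= P_t(y \mid x)\,P_t(x) \\
\text{No change:} &\quad P_{t+1}(x) = P_t(x), \quad P_{t+1}(y \mid x) = P_t(y \mid x) \\
\text{Feature change:} &\quad P_{t+1}(x) \neq P_t(x), \quad P_{t+1}(y \mid x) = P_t(y \mid x) \\
\text{Conditional change:} &\quad P_{t+1}(x) = P_t(x), \quad P_{t+1}(y \mid x) \neq P_t(y \mid x) \\
\text{Dual change:} &\quad P_{t+1}(x) \neq P_t(x), \quad P_{t+1}(y \mid x) \neq P_t(y \mid x)
\end{align*}
```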

Page 6

Issues in Stream Classification (1)

• Generative Model
– P(y|x) follows some distribution
• Descriptive Model
– Let the data decide
• Stream Data
– Distribution unknown and evolving

Page 7

Issues in Stream Classification (2)

• Label Prediction
– Classify x into one class
• Probability Estimation
– x is assigned to all classes with different probabilities
• Stream Applications
– Stochastic; prediction confidence information is needed
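The difference between label prediction and probability estimation can be illustrated with a minimal sketch. The 0.5 threshold, the function name `predict_label`, and the two example estimates are assumptions made for illustration; they are not taken from the slides.

```python
def predict_label(p_positive, threshold=0.5):
    """Label prediction: collapse an estimate of P(+|x) into a single class."""
    return "+" if p_positive >= threshold else "-"


# Hypothetical probability estimates P(+|x) for two records.
estimates = {"record_a": 0.51, "record_b": 0.99}

# Both records receive the same label, although the estimates
# express very different levels of confidence.
labels = {name: predict_label(p) for name, p in estimates.items()}
```

Both records are labeled positive, yet only the probability estimates show that the second prediction is far more confident than the first, which is the information the slide says stream applications need.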

Page 8

Mining Skewed Data Stream

• Skewed Distribution
– Credit card frauds, network intrusions
• Existing Stream Classification Algorithms
– Evaluated on balanced data
• Problems
– Ignore minority examples
– The cost of misclassifying minority examples is usually huge

[Figure: positive (+) and negative (-) examples; a model that classifies every leaf node as negative]
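A small sketch shows why ignoring minority examples is easy to miss when only accuracy is reported. The 1% positive rate and the helper names are assumptions chosen for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of true positive examples that were predicted positive."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in predictions_for_positives) / len(predictions_for_positives)


# Hypothetical skewed chunk: 10 positives (1) among 1000 examples.
y_true = [1] * 10 + [0] * 990
# A model that predicts the negative class for every example.
y_pred = [0] * 1000

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
```

Under these assumptions the all-negative model is correct on 990 of 1000 examples while detecting none of the positives, which is exactly the costly case for fraud or intrusion detection.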

Page 9

Stream Ensemble Approach (1)

[Figure: the data stream with an unlabeled ("?") part to be predicted]

Training set? Insufficient positive examples!

Step 1: Sampling
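The sampling step can be sketched as follows, based on what the deck states elsewhere: old positive examples are incorporated, and the classifiers are trained on disjoint sets of negative examples. This is a minimal sketch, not the authors' implementation. The function name `build_training_sets`, the `(features, label)` tuple format, and the choice to partition all negatives of the current chunk (rather than first undersampling them at some ratio) are assumptions.

```python
import random


def build_training_sets(old_positives, chunk, k, seed=0):
    """Build k training sets from the current chunk and stored positives.

    old_positives: (features, label) pairs with label 1 kept from earlier chunks.
    chunk: (features, label) pairs from the most recent chunk, label in {0, 1}.
    Every set gets all positives plus one disjoint share of the negatives.
    """
    rng = random.Random(seed)
    positives = list(old_positives) + [ex for ex in chunk if ex[1] == 1]
    negatives = [ex for ex in chunk if ex[1] == 0]
    rng.shuffle(negatives)
    # Striding by k yields k disjoint subsets that together cover all negatives.
    negative_parts = [negatives[i::k] for i in range(k)]
    return [positives + part for part in negative_parts]
```

Each of the k sets is far less skewed than the raw chunk, because the same pool of positives is paired with only a fraction of the negatives.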

Page 10

Stream Ensemble Approach (2)

Step 2: Ensemble

[Figure: training sets 1, 2, ..., k produce classifiers C1, C2, ..., Ck]

$$f^E(x) = \frac{1}{k}\sum_{i=1}^{k} f^i(x)$$
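The ensemble step on this slide averages the outputs of the k classifiers. A minimal sketch of that averaging is shown below; the name `ensemble_probability` and the constant stand-in models are assumptions used only to make the example self-contained.

```python
def ensemble_probability(models, x):
    """Average the positive-class probability estimates of k models for input x."""
    if not models:
        raise ValueError("at least one model is required")
    return sum(model(x) for model in models) / len(models)


# Hypothetical stand-ins for trained classifiers C1..Ck: each maps an
# input to an estimate of P(+|x). Here they simply return constants.
models = [lambda x: 0.2, lambda x: 0.4, lambda x: 0.9]

p = ensemble_probability(models, x=None)
```

Any callable that returns a probability estimate can be plugged in, so the averaging is independent of the base learner.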

Page 11

Why does this approach work?

• Incorporation of old positive examples
– Increases the training size and reduces variance
– Negative examples reflect current concepts, so the increase in boundary bias is small
• Ensemble
– Reduces variance caused by a single model
– Disjoint sets of negative examples: the classifiers will make uncorrelated errors
• Bagging & Boosting
– Running cost is much higher
– Cannot generate reliable probability estimates for skewed distributions
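The variance argument for the ensemble can be checked with a small simulation. It rests on a simplifying assumption that is not in the slides: each classifier's error is modeled as independent zero-mean Gaussian noise with the same standard deviation. Under that assumption, averaging k errors should shrink the variance by roughly a factor of k.

```python
import random
import statistics


def simulate_variances(k, trials=20000, sigma=1.0, seed=0):
    """Compare the error variance of one model with that of a k-model average.

    Errors are drawn as independent Gaussian noise (an illustrative assumption).
    Returns (variance of a single error, variance of the averaged error).
    """
    rng = random.Random(seed)
    single = [rng.gauss(0.0, sigma) for _ in range(trials)]
    averaged = [
        sum(rng.gauss(0.0, sigma) for _ in range(k)) / k
        for _ in range(trials)
    ]
    return statistics.pvariance(single), statistics.pvariance(averaged)


single_var, averaged_var = simulate_variances(k=10)
```

If the errors were strongly correlated the reduction would be much smaller, which is why training on disjoint sets of negative examples matters.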

Page 12

Analysis

$$f_c(x) = P(c|x) + \beta_c + \eta_c(x)$$

• Error Reduction
– Sampling: $\sigma_b^2 = (\sigma_{\eta_+}^2 + \sigma_{\eta_-}^2)/s^2$
– Ensemble: $f_c^E(x) = P(c|x) + \bar{\beta}_c + \bar{\eta}_c(x)$, with $\sigma_{b^E}^2 = \frac{1}{k^2}\sum_{i=1}^{k}\sigma_{b_i}^2$
• Efficiency Analysis
– Single model: $O(d(n_p + k n_q)\log(n_p + k n_q))$
– Ensemble: $O(dk(n_p + n_q)\log(n_p + n_q))$
– Ensemble is more efficient
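The efficiency comparison can be made concrete with a short calculation. It reads the two cost expressions on this slide as d(n_p + k n_q) log(n_p + k n_q) for the single model and d k (n_p + n_q) log(n_p + n_q) for the ensemble. That reading, the interpretation of n_p and n_q as the numbers of positive and negative training examples, and the example values are all assumptions.

```python
import math


def single_model_cost(d, n_p, n_q, k):
    """Cost of one model trained on n_p positives and k * n_q negatives."""
    n = n_p + k * n_q
    return d * n * math.log(n)


def ensemble_cost(d, n_p, n_q, k):
    """Cost of k models, each trained on n_p positives and n_q negatives."""
    n = n_p + n_q
    return d * k * n * math.log(n)


# Hypothetical skewed setting: few positives, many negatives.
skewed = dict(d=10, n_p=100, n_q=10_000, k=10)
single_skewed = single_model_cost(**skewed)
ensemble_skewed = ensemble_cost(**skewed)

# Hypothetical balanced setting, for contrast.
balanced = dict(d=10, n_p=1_000, n_q=1_000, k=10)
single_balanced = single_model_cost(**balanced)
ensemble_balanced = ensemble_cost(**balanced)
```

Under these expressions the ensemble is cheaper in the skewed setting, where n_p is small relative to n_q, but not in the balanced one, so the claimed advantage depends on the skew.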

Page 13

Experiments

• Measures
– Mean Squared Error: $L = \frac{1}{n}\sum_{i=1}^{n}\left(f(x_i) - P(+|x_i)\right)^2$
– ROC Curve
– Recall-Precision Curve
• Baseline Methods
– NS: No Sampling + Single Model
– SS: Sampling + Single Model
– SE: Sampling + Ensemble
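The mean squared error measure compares each estimated probability with the true posterior probability of the positive class and averages the squared differences. A minimal sketch is given below; the function name and the three example values are assumptions for illustration.

```python
def mean_squared_error(estimates, true_probs):
    """Average squared gap between estimates f(x_i) and true posteriors P(+|x_i)."""
    if not estimates or len(estimates) != len(true_probs):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_gaps = [(f - p) ** 2 for f, p in zip(estimates, true_probs)]
    return sum(squared_gaps) / len(squared_gaps)


# Hypothetical estimates and true posterior probabilities for three examples.
mse = mean_squared_error([0.9, 0.2, 0.4], [1.0, 0.0, 0.5])
```

Here the squared gaps are 0.01, 0.04, and 0.01, so the measure is their mean, 0.02. Unlike accuracy, this penalizes poorly calibrated probabilities even when the predicted label is correct.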

Page 14

Experimental Results (1)

[Figure: bar chart of Mean Squared Error on Synthetic Data; y-axis from 0 to 0.9; groups: Feature, Conditional, Dual; series: SE, NS, SS]

• Feature Change: only P(x) changes
• Conditional Change: only P(y|x) changes
• Dual Change: both P(x) and P(y|x) change

Page 15

Experimental Results (2)

[Figure: bar chart of Mean Squared Error on Real Data; y-axis from 0 to 0.25; groups: Thyroid1, Thyroid2, Opt, Letter, Covtype; series: SE, NS, SS]

Page 16

Experimental Results (3)

[Figure: ROC Curve and Recall-Precision Plot on Synthetic Data]

Page 17

Experimental Results (4)

[Figure: ROC Curve and Recall-Precision Plot on Real Data]

Page 18

Experimental Results (5)

[Figure: Training Time]

Page 19

Conclusions

• General issues in stream classification
– concept drifts
– descriptive model
– probability estimation
• Mining skewed data streams
– sampling and ensemble techniques
– accurate and efficient
• Wide applications
– graph data
– airforce data

Page 20

Thanks!

• Any questions?