Download - Inductive Learning from Imbalanced Data Sets
1
Inductive Learning from Imbalanced Data Sets
Nathalie Japkowicz, Ph.D.
School of Information Technology and Engineering
University of Ottawa
2
Inductive Learning: Definition Given a sequence of input/output pairs of the form
<xi, yi>, where xi is a possible input, and yi is the output associated with xi:
Learn a function f such that: f(xi)=yi for all i’s, f makes a good guess for the outputs of inputs
that it has not previously seen.
[If f has only 2 possible outputs, f is called a concept and learning is called concept-learning.]
3
Inductive Learning: Example
Patient Attributes Class Temperature Cough Sore Throat Sinus Pain 1 37 yes no no no flu 2 39 no yes yes flu 3 38.4 no no no no flu 4 36.8 no yes no no flu 5 38.5 yes no yes flu 6 39.2 no no yes flu
Goal: Learn how to predict whether a new patient with a given set of symptoms does or does not have the flu.
4
Standard Assumption
The data sets are balanced: i.e., there are as many positive examples of the concept as there are negative ones.
Example: Our database of sick and healthy patients contains as many examples of sick patients as it does of healthy ones.
5
The Standard Assumption is not Always Correct There exist many domains that do not
have a balanced data set. Examples:
Helicopter Gearbox Fault Monitoring Discrimination between Earthquakes and
Nuclear Explosions Document Filtering Detection of Oil Spills Detection of Fraudulent Telephone Calls
6
But What is the Problem?
Standard learners are often biased towards the majority class.
That is because these classifiers attempt to reduce global quantities such as the error rate, not taking the data distribution into consideration.
As a result examples from the overwhelming class are well-classified whereas examples from the minority class tend to be misclassified.
7
Significance of the problem for Machine Learners/Data Miners Two Workshops:
AAAI’2000 Workshop, organizers:R. Holte, N. Japkowicz, C. Ling, S. Matwin, 13 contributions
ICML’2003 Workshop, organizers: N. Chawla, N. Japkowicz, A. Kolcz, 16 contributions
Bibliography on Class Imbalance, maintained by N. Japkowicz, 37 entries
Special Issue ACM KDDSIGMOD Explorations Newsletter, editors: N.
Chawla, N. Japkowicz, A. Kolcz (call for papers just came out) Profile of people involved in this research:
Well-known researchers: e.g., F. Provost, C. Elkan, R. Holte, T. Fawcett, C. Ling, etc.
8
Several Common Approaches
At the data Level: Re-Sampling Oversampling (Random or Directed) Undersampling (Random or Directed) Active Sampling
At the Algorithmic Level: Adjusting the Costs Adjusting the decision threshold /
probabilistic estimate at the tree leaf
9
My Contributions Fundamental What domain chara-
cteristics aggravate the problem?
Class imbalances or small disjuncts?
Are all classifiers sensitive to class imbalances?
Which proposed solutions to the class imbalance problem are more appropriate?
New Approaches Specialized
Resampling: within-class versus between-class imbalances
One class versus two-class learning
Multiple Resampling
10
Part I: Fundamentals
I. What domain characteristics aggravate the problem?
II. Class Imbalances or Small Disjuncts?III. Are all classifiers sensitive to class
imbalances?IV. Which proposed solutions to the class
imbalance problem are more appropriate?
11
I. I What domain characteristics aggravate the Problem?
To answer this question, I generated artificial domains that vary along three different axes:• The degree of concept complexity• The size of the training set• The degree of imbalance between the two
classes.
0 1
+ - + - + - + -
12
I. I What domain characteristics aggravate the Problem? I created 125 domains, each representing a
different type of class imbalance, by varying the concept complexity (C), the size of the training set (S) and the degree of imbalance (I) at different rates (5 settings were used per domain characteristics).
I ran C5.0 [a decision tree learning algorithm] on these various imbalanced domains and plotted its error rate on each domain.
Each experiment was repeated 5 times and the results averaged.
13
I. I What domain characteristics aggravate the Problem?
0.0
10.0
20.0
30.0
40.0
50.0
60.0
C=1 C=2 C=3 C=4 C=5
C
I=1
I=2
I=3
I=4
I=5
S= 1
Errorrate
LargeImbal.
Fullbalance
14
0.010.020.030.040.050.060.0
C=1 C=2 C=3 C=4 C=5
I=1
I=2
I=3
I=4
I=5
I.I What domain characteristics aggravate the Problem?
Errorrate
S= 5
LargeImbal.
Fullbalance
15
I. I What domain characteristics aggravate the Problem? The problem is aggravated by two factors:
An increase in the degree of class imbalance An increase in problem complexity – class
imbalances do not hinder the classification of simple problems (e.g., linearly separable ones)
However, the problem is simultaneously mitigated by one factor: The size of the training set – large training sets
yield low sensitivity to class imbalances
16
I.II: Clas Imbalances or Small Disjuncts? Studying the training sets from the previous
experiments, it can be inferred that when i and c are large, and s, small, the domain contains many very small subclusters.
These were also the conditions under which C5.0 performed the worst.
To test whether it is these small subclusters that cause performance degradation, we disregarded the value of s and set the size of all subclusters to 50 examples.
17
05
101520253035404550
s=1 s=5 All Subs =50
I=1
I=2
I=3
I=4
I=5
PreviousExperiment This
Experiment
High Concept Complexity: c=5
I.II: Clas Imbalances or Small Disjuncts?
Error rate
LargeImbal.
Fullbalance
18
I.II: Class Imbalances or Small Disjuncts?
When all the subclusters are of size 50, even at the highest degree of concept complexity, no matter what the class imbalance is, the error is below 1% It is negligible.
This suggests that it is not the class imbalance per se that causes a performance decrease, but rather, that it is the small disjunct problem created by the class imbalance (in highly complex and small-sized domains) that cause that loss of performance.
19
I.III Are all classifiers sensitive to class imbalances?
Neural Net (MLPs)
A5
A2 A7
+ A5 - - +
+ -
T F
> 5 5< 5 T
F
FT
Decision Tree (C5.0)
++ +
++ -
--
- - -
Support Vector Machines (SVMs)
20
I.III Are all classifiers sensitive to class imbalances?
05
101520253035404550
C5.0 MLP SVM
I=1
I=2
I=3
I=4
I=5
Error rate
LargeImbal.
Fullbalance
S = 1; C = 3
21
I.III Are all classifiers sensitive to class imbalances?
Decision Tree (C5.0) C5.0 is the most sensitive to class imbalances. This is because C5.0 works globally, not paying attention to specific data points.
Multi-Layer perceptrons (MLPs) MLPs are less prone to the class imbalance problem than C5.0. This is because of their flexibility: their solution gets adjusted by each data point in a bottom-up manner as well as by the overall data set in a top-down manner.
Support Vector Machines (SVMs) SVMs are even less prone to the class imbalance problem than MLPs because they are only concerned with a few support vectors, the data points located close to the boundaries.
22
I.IV Which Solution is Best?
Random Oversampling Directed Oversampling Random Undersampling Directed Undersampling Adjusting the Costs
23
0
510152025303540
4550
Imb. R.Ov D.Ov R.Un D.Un Cost
I=1
I=2
I=3
I=4
I=5
I.IV Which Solution is Best?
LargeImbal.
FullBal.
Err. rate
S = 1; C = 3
24
I.IV Which Solution is Best? Three of the five methods considered present
an improvement over C5.0 at S=1 and C=3: Random oversampling, Directed oversampling and Cost-modifying.
Undersampling (random and directed) is not effective and can even hurt the performance.
Random oversampling helps quite dramatically at all complexity. Directed oversampling makes a bit of a difference by helping slightly more.
On the graph of the previous slide, Cost-adjusting is about as effective as Directed oversampling. Generally, however, it is found to be slightly more useful.
25
Part II: New Approaches
I. Specialized Resampling: within-class versus between-class imbalances
II. One class versus two-class learning
III. Multiple Resampling
26
II.I: Within-class versus Between-class Imbalances
Idea: Use unsupervised learning to identify
subclusters in each class separately. Re-sample the subclusters of each
class until no within-class imbalance and no between-class imbalance are present (although the subclusters of each class can have different sizes)
27
II.I: Within-class versus Between-class Imbalances
5
SymmetricCase Asymmetric
Case
28
II.I: Within-class vs Between- class Imbalances: Experiments Imbalances Random Oversampling
Between class imbalance eliminated Guided Oversampling I (# Clusters Known)
Use prior knowledge of classes to guide clustering Guided Oversampling II (# Clusters
Unknown) Let clustering algorithm determine the number of
clusters
29
II.I: Within-class vs Between- class Imbalances: Letters Subset of the Letters
dataset found at the UCI Repository
Positive class contains the vowels a and u
Negative class contains the consonants m, s, t and w.
All letters are distributed according to their frequency in English texts.
30
II.I: Within-class vs Between- class Imbalances: Letters
Method Precision Recall F-Measure
Imbalanced 0.905 0.818 0.859
Random Oversampling 0.905 0.818 0.859
Guided Oversampling I (# Clusters Unknown)
0.923 0.914 0.919
Guided Oversampling II (Using Known Clusters)
0.935 0.877 0.905
31
II.I: Within-class vs Between- class Imbalances: Text Classification
Reuters-21578 Dataset
Classifying a document according to its topic
Positive class is a particular topic
Negative class is every other topic
32
II.I: Within-class vs Between- class Imbalances: Text Classification
Method Precision Recall F-Measure
Imbalanced 0.617 0.394 0.455
Random Oversampling
0.580 0.545 0.560
Guided Oversampling I (# Clusters Unknown)
0.650 0.510 0.544
Guided OversamplingII (Using Known Clusters)
0.601 0.751 0.665
33
II.I: Within-class versus Between-class Imbalances
Results: On letter and text categorization tasks,
this strategy worked better than the random over-sampling strategy.
Noise in the small subclusters, however, caused problems since it got too magnified.
This promising strategy requires more study..
34
II.II One-Class versus Two-Class Learning
DMLP RMLP
A5
A2 A7
+ A5 - - +
+ -
T F
> 5 5 < 5 TF
FT
Decision Tree (C5.0)
35
II.II One-Class versus Two-Class Learning
0
5
10
15
20
25
30
35
Helicop. Promoters Sonars
RMLP
DMLP
C4.5
ErrorRate
36
II.II One-Class versus Two-Class Learning One-Class learning is more accurate
than two-class learning on two of our three domains considered and as accurate on the third.
It can thus be quite useful in class imbalanced situations.
Further comparisons with other proposed methods are required.
37
II.III Multiple Resampling
Idea: Although the results reported here suggest that
undersampling is not as useful as oversampling, other studies of ours and others (on different data sets) suggest that it can be It shouldn’t be abandoned
Further experiments of ours (not reported here) suggest that rather than oversampling or undersampling until a full balance is achieved may not always be optimal A different re-sampling rate should be used
38
II.III Multiple Resampling
Idea (Continued): It is not possible to know, a-priori, whether a
given domain favours oversampling or undersampling and what resampling rate is best.
Therefore, we decided to create a self-adaptive combination scheme that considers both strategies at various rates.
39
II.III Multiple Resampling
… …
Output
Oversampling Expert Undersampl. Expert
Oversampling Classifiers(sampled at different rates)
Undersampling Classifiers(sampled at diff.rates)
40
II.III Multiple Resampling The combination scheme was compared to C4.5-
Adaboost (with 20 classifiers) with respect to the FB-measures on a text classification task (Reuters-21578, Top 10 categories)
The FB-measure combines precision (the proportion of examples classified as positive that are truly positive) and recall (the proportion of truly positive examples that are classified as positive) in the following way: F1 precision = recall F2 2 * precision = recall F0.5 precision = 2 * recall
41
II.3 Testing the Combination Scheme Results
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
B=1 B=2 B=0.5
Adaboost
Mixture-of-Experts
F-Measure
In all cases, the mixture scheme is superior to Adaboost. However, though it helps both recall and precision, ithelps recall more.
42
Summary/Conclusions: Overall Goals of the research
This talk presented some of the work I conducted in the recent past years. In particular, I focused on the class imbalanced problem aiming at: Establishing some fundamental results
regarding the nature of the problem, the behaviour of different types of classifiers,and the relative performance of various previously proposed schemes for dealing with the problem.
Designing new methods for attacking the problem.
43
Summary/Conclusions: Results Fundamentals The sensitivity of Decision Trees and Neural Networks to
class imbalance increases with the domain complexity and the degree of imbalance. Training set size mitigates this pattern. SVMs are not sensitive to class imbalances [up to 1/16 imbalance].
Cost-Adjusting is slightly more effective than random or directed over- or under- sampling although all approaches are helpful, and directed oversampling is close to cost-adjusting.
The class imbalance problem may not be a problem in itself. Rather, the small disjunct problem it causes is responsible for the decay.
44
Summary/Conclusions: Results, New Approaches I presented three new methods very different
from each other and from previously proposed schemes. They all showed promise over previously proposed approaches.
Approach 1: Oversampling with respect to within-class and between-class imbalances
Approach 2: One-class Learning Approach 3: Adaptive combination scheme
which combines over- and under-sampling at 10 different rates each.
45
Summary/Conclusions: Future Work Expand on and study in more depth all the
new proposed approaches I have described. Adapt the idea of Boosting to the class
imbalanced problem (with the National Institute of Health (NIH) in Washington, D.C.and Master’s Student Benjamin Wang)
Design novel Oversampling schemes and feature selection schemes fo Text Classification (with Ph.D. Student: Taeho Jo)
46
Partial Bibliography
"A Multiple Resampling Method for Learning from Imbalances Data Sets" , Estabrooks, A., Jo, T. and Japkowicz, N., Computational Intelligence, Volume 20, Number 1, 2004. (in press)
"The Class Imbalance Problem: A Systematic Study" , Japkowicz N. and Stephen, S., Intelligent Data Analysis, Volume 6, Number 5, pp. 429-450, November 2002.
"Supervised versus Unsupervised Binary-Learning by Feedforward Neural Networks" , Japkowicz, N., Machine Learning Volume 42, Issue 1/2, pp. 97-122, January 2001.
"A Mixture-of-Experts Framework for Concept-Learning from Imbalanced Data Sets" , Estabrooks A, and Japkowicz, N., Proceedings of the 2001 Intelligent Data Analysis Conference .
"Concept-Learning in the Presence of Between-Class and Within-Class Imbalances" , Japkowicz N., Proceedings of the Fourteenth Conference of the Canadian Society for Computational Studies of Intelligence, 2001.
"The Class Imbalance Problem: Significance and Strategies" , Japkowicz, N. in the Proceedings of the 2000 International Conference on Artificial Intelligence (IC-AI'2000), Volume 1, pp. 111-117
47
A Summary of the Various Measures Used
Error Rate:( b + c ) / ( a + b + c + d) Accuracy: ( a + d ) / ( a + b + c + d) Precision: P = a / ( a + c ) Recall: R = a / ( a + b ) FB-Measure: ( B2 + 1 ) P R / (B2 P + R )
dbNeg
caPos
NegPos
Classi-Fied as
Actual Class