Download - Andrew Thesis
-
8/12/2019 Andrew Thesis
1/125
A COMBINATION SCHEME FOR INDUCTIVE
LEARNING FROM IMBALANCED DATA
SETS
by
Andrew Estabrooks
A Thesis Submitted to the
Faculty of Computer Science
in Partial Fulfillment of the Requirementsfor the degree of
ASTER !F C!P"TER SC#E$CE
a%or Sub%ect& Computer Science
APPR!'E(&
)))))))))))))))))))))))))))))))))))))))))
$athalie *apkowic+, Super-isor
)))))))))))))))))))))))))))))))))))))))))
.igang /ao
)))))))))))))))))))))))))))))))))))))))))
0ouise Spiteri
(A01!"S#E "$#'ERS#T2 3 (A0TEC1
-
8/12/2019 Andrew Thesis
2/125
1alifa4, $o-a Scotia5666
ii
-
8/12/2019 Andrew Thesis
3/125
(A0TEC1 0#7RAR2
"AUTHORITY TO DISTRIBUTE MANUSCRIPT THESIS"
T#T0E&
A Combination Scheme for 0earning From #mbalanced (ata Sets
The abo-e library may make a-ailable or authori+e another library to make
a-ailable indi-idual photo8microfilm copies of this thesis without restrictions9
Full $ame of Author& Andrew Estabrooks
Signature of Author& )))))))))))))))))))))))))))))))))
(ate& :85;85666
iii
-
8/12/2019 Andrew Thesis
4/125
TA70E !F C!$TE$TS
1. Introducton999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999;; #nducti-e 0earning9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999;
5 Class #mbalance999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999995< oti-ation 999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999=
= Chapter !-er-iew99999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999=
> 0earners99999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999?
>9; 7ayesian 0earning99999999999999999999999999999999999999999999999999999999999999999999999999999999?>95 $eural $etworks99999999999999999999999999999999999999999999999999999999999999999999999999999999999:
>9< $earest $eighbor9999999999999999999999999999999999999999999999999999999999999999999999999999999999@
>9= (ecision Trees99999999999999999999999999999999999999999999999999999999999999999999999999999999999999? (ecision Tree 0earning Algorithms and C>969999999999999999999999999999999999999999999999999999999999999999999
?9; (ecision Trees and the #(< algorithm 99999999999999999999999999999999999999999999999;6?95 #nformation /ain and the Entropy easure9999999999999999999999999999999999999999;;?9< !-erfitting and (ecision Trees99999999999999999999999999999999999999999999999999999999999;96 !ptions999999999999999999999999999999999999999999999999999999999999999999999999999999999999999;>
: Performance easures9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999;::9; Confusion atri499999999999999999999999999999999999999999999999999999999999999999999999999999999;:
:95 g3ean99999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999;@
:9< R!C cur-es 999999999999999999999999999999999999999999999999999999999999999999999999999999999999999;@
@ A Re-iew of Current 0iterature999999999999999999999999999999999999999999999999999999999999999999999999999999999999999;@9; isclassification Costs9999999999999999999999999999999999999999999999999999999999999999999999956
@95 Sampling Techniques9999999999999999999999999999999999999999999999999999999999999999999999999955
@959; 1eterogeneous "ncertainty Sampling99999999999999999999999999999999999999999999999999999999999955@9595 !ne sided #ntelligent Selection99999999999999999999999999999999999999999999999999999999999999999999995=
@959< $ai-e Sampling Techniques999999999999999999999999999999999999999999999999999999999999999999999999995>
@9< Classifiers Bhich Co-er !ne Class9999999999999999999999999999999999999999999999999999
-
8/12/2019 Andrew Thesis
5/125
;695 Architecture99999999999999999999999999999999999999999999999999999999999999999999999999999999999999?:
;6959; Classifier 0e-el99999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999?;69595 E4pert 0e-el999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999:6
;6959< Beighting Scheme999999999999999999999999999999999999999999999999999999999999999999999999999999999999999:;
;6959= !utput 0e-el999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999:5;; Testing the Combination scheme on the Artificial (omain9999999999999999999999999999999999999999999:5
;5 Te4t Classification999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999::
;59; Te4t Classification as an #nducti-e Process99999999999999999999999999999999999999:;< Reuters35;>:@999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999@6
;9; Precision and Recall9999999999999999999999999999999999999999999999999999999999999999999999999@@
;>95 F3 measure9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999@;>9< 7reake-en Point99999999999999999999999999999999999999999999999999999999999999999999999999999999@
;>9= A-eraging Techniques9999999999999999999999999999999999999999999999999999999999999999999999;
;? Statistics used in this study999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999
, Mot()ton
Currently the ma%ority of research in the machine learning community has based the
performance of learning algorithms on how well they function on data sets that are
reasonably balanced9 This has lead to the design of many algorithms that do not adapt well
to imbalanced data sets9 Bhen faced with an imbalanced data set, researchers ha-e
generally de-ised methods to deal with the data imbalance that are specific to the
application at hand9 Recently howe-er there has been a thrust towards generali+ing
techniques that deal with data imbalances9
The focus of this thesis is directed towards inducti-e learning on imbalanced data sets9 The
goal of the work presented is to introduce a combination scheme that uses two of the
pre-iously mentioned balancing techniques, downsi+ing and o-er3sampling, in an attempt
to impro-e learning on imbalanced data sets9 ore specifically, # will present a system that
combines classifiers in a hierarchical structure according to their sampling technique9 This
combination scheme will be designed using an artificial domain and tested on the real
world application of te4t classification9 #t will be shown that the combination scheme is an
effecti-e method of increasing a standard classifiers performance on imbalanced data sets9
- C)/t$r O($r($
The remainder of this thesis is broken down into four chapters9 Chapter 5 gi-es backgroundinformation and a re-iew of the current literature pertaining to data set imbalance9 Chapter
< is di-ided into se-eral sections9 The first section describes an artificial domain and a set
of e4periments, which lead to the moti-ation behind a general scheme to handle
imbalanced data sets9 The second section describes the architecture behind a system
=
-
8/12/2019 Andrew Thesis
16/125
designed to lend itself to domains that ha-e imbalanced data9 The third section tests the
de-eloped system on the artificial domain and presents the results9 Chapter = presents the
real world application of te4t classification and is di-ided into two parts9 The first part
gi-es needed background information and introduces the data set that the system will be
tested on9 The second part presents the results of testing the system on the te4t
classification task and discusses it effecti-eness9 The thesis concludes with Chapter >,
which contains a summary and suggested directions for further research9
>
-
8/12/2019 Andrew Thesis
17/125
C h a p t e r T w o
5 7AC/R!"$(
# will begin this chapter by gi-ing a brief o-er-iew of some of the more common learning
algorithms and e4plaining the underlying concepts behind the decision tree learning
algorithm C>96, which will be used for the purposes of this study9 There will then be a
discussion of -arious performance measures that are commonly used in machine learning9
Following that, # will gi-e an o-er-iew of the current literature pertaining to data
imbalance9
0 L$)rn$r'
There are a large number of learning algorithms, which can be di-ided into a broad range
of categories9 This section gi-es a brief o-er-iew of the more common learning algorithms9
0.1 B)$')n L$)rnn%
#nducti-e learning centers on finding the best hypothesis h, in a hypothesis space 1, gi-en aset of training data (9 Bhat is meant by the best hypothesis is that it is the most probable
hypothesis gi-en a data set ( and any initial knowledge about the prior probabilities of
-arious hypothesis in 19 achine learning problems can therefore be -iewed as attempting
to determine the probabilities of -arious hypothesis and choosing the hypothesis which has
the highest probability gi-en (9
ore formally, we define the posterior probability PGh(H, to be the probability of an
hypothesis h after seeing a data set (9 7ayes theorem GEq9 ;H pro-ides a means to calculate
posterior probabilities and is the basis of 7ayesian learning9
?
-
8/12/2019 Andrew Thesis
18/125
( ) ( ) ( )
( )DPhPhDP
DhPS
S = GEq9 ;H
A simple method of learning based on 7ayes theorem is called the nai-e 7ayes classifier9
$ai-e 7ayes classifiers operate on data sets where each e4ample 4 consists of attribute
-alues a;, a5 999 aiQ and the target function fG4H can take on any -alue from a pre3defined
finite set 'G-;, -5 999 -%H9 Classifying unseen e4amples in-ol-es calculating the most
probable target -alue vma4and is defined as&
"sing 7ayes theorem GEq9 ;H vmaxcan be rewritten as&
"nder the assumption that attribute -alues are conditionally independent gi-en the target
-alue9 The formula used by the nai-e 7ayes classifier is&
where - is the target output of the classifier and PGai-%H and PG-iH can be calculated based on
their frequency in the training data9
0.* N$ur)# N$tor!'
$eural $etworks are considered -ery robust learners that perform well on a wide range of
applications such as, optical character recognition 0e Cun et al9, ;@ and autonomous
na-igation Pomerleau, ;
-
8/12/2019 Andrew Thesis
19/125
a4ons9 The basic unit of an artificial neural network is the perceptron, which takes as input
a number of -alues and calculates the linear combination of these -alues9 The combined
-alue of the input is then transformed by a threshold unit such as the sigmoid function 59
Each input to a perceptron is associated with a weight that determines the contribution of
the input9 0earning for a neural network essentially in-ol-es determining -alues for the
weights9 A pictorial representation of a perceptron is gi-en in Figure 59;9;9
w1
w2
wn
x1
x2
xn
Threshold unit
w0
Figure& 59;959 A perceptron9
0., N$)r$'t N$%+or
$earest $eighbor learning algorithms are instance3based learning methods, which store
e4amples and classify newly encountered e4amples by looking at the stored instances
considered similar9 #n its simplest form all instances correspond to points in an n
dimensional space9 An unseen e4ample is classified by choosing the ma%ority class of the
closest e4amples9 An ad-antage nearest neighbor algorithms ha-e is that they can
appro4imate -ery comple4 target functions, by making simple local appro4imations based
on data, which is close to the e4ample to be classified9 An e4cellent e4ample of an
application, which uses a nearest neighbor algorithm, is that of te4t retrie-al in which
documents are represented as -ectors and a cosine similarity metric is used to measure the
distance of queries to documents9
5 The sigmoid function is defined as oGyH ; 8 G; I e3yH and is referred to as a squashing function because it maps a -ery
wide range of -alues onto the inter-al G6, ;H9
@
-
8/12/2019 Andrew Thesis
20/125
0.- D$c'on Tr$$'
(ecision trees classify e4amples according to the -alues of their attributes9 They are
constructed by recursi-ely partitioning training e4amples based each time on the remaining
attribute that has the highest information gain9 Attributes become nodes in the constructed
tree and their possible -alues determine the paths of the tree9 The process of partitioning the
data continues until the data is di-ided into subsets that contain a single class, or until some
stopping condition is met Gthis corresponds to a leaf in the treeH9 Typically, decision trees
are pruned after construction by merging children of nodes and gi-ing the parent node the
ma%ority class9 Section 595 describes in detail how decision trees, in particular C>96, operate
and are constructed9
2 D$c'on Tr$$ L$)rnn% A#%ort&' )nd C0.3
C>96 is a decision tree learning algorithm that is a later -ersion of the widely used C=9>
algorithm .uinlan, ; and C>969 The following
section consists of two parts9 The first part is a brief summary of itchells description of
the #(< algorithm and the e4tensions leading to typical decision tree learners9 A brief
operational o-er-iew of C>96 is then gi-en as it relates to this work9
7efore # begin the discussion of decision tree algorithms, it should be noted that a decision
tree is not the only learning algorithm that could ha-e been used in this study9 As described
in Chapter ;, there are many different learning algorithms9 For the purposes of this study a
decision tree algorithm was chosen for three reasons9 The first is the understandability of
the classifier created by the learner9 7y looking at the comple4ity of a decision tree in terms
of the number and si+e of e4tracted rules, we can describe the beha-ior of the learner9
Choosing a learner such as $ai-e 7ayes, which classifies e4amples based on probabilities,
would make an analysis of this type nearly impossible9 The second reason a decision tree
learner was chosen was because of its computational speed9 Although, not as cheap to
operate as $ai-e 7ayes, decision tree learners ha-e significantly shorter training times than
do neural networks9 Finally, a decision tree was chosen because it operates well on tasks
-
8/12/2019 Andrew Thesis
21/125
that classify e4amples into a discrete number of classes9 This lends itself well to the real
world application of te4t classification9 Te4t classification is the domain that the
combination scheme designed in Chapter < will be tested on9
2.1 D$c'on Tr$$' )nd t$ ID, )#%ort&
(ecision trees classify e4amples by sorting them based on attribute -alues9 Each node in a
decision tree represents an attribute in an e4ample to be classified, and each branch in a
decision tree represents a -alue that the node can take9 E4amples are classified starting at
the root node and sorting them based on their attribute -alues9 Figure 5959; is an e4ample of
a decision tree that could be used to classify whether it is a good day for a dri-e or not9
Road Conditions
Clear Snow Covered Icy
Forecast
Temperature Accumulation
RainClear
HeavyFreein! "i!ht#arm
Snow
$%S
&'
&' &'
&'
$%S $%S
Figure 5959
-
8/12/2019 Andrew Thesis
22/125
would sort to the nodes& Road Conditions, Forecast, and finally Temperature, which would
classify the instance as being positi-e GyesH, that is, it is a good day to dri-e9 Con-ersely an
instance containing the attribute Road Conditions assigned Snow Co-ered would be
classified as not a good day to dri-e no matter what the Forecast, Temperature, or
Accumulation are9
(ecision tress are constructed using a top down greedy search algorithm which recursi-ely
subdi-ides the training data based on the attribute that best classifies the training e4amples9
The basic algorithm #(< begins by di-iding the data according to the -alue of the attribute
that is most useful in classifying the data9 The attribute that best di-ides the training data
would be the root node of the tree9 The algorithm is then repeated on each partition of the
di-ided data, creating sub trees until the training data is di-ided into subsets of the same
class9 At each le-el in the partitioning process a statistical property known as information
gainis used to determine which attribute best di-ides the training e4amples9
2.* In4or&)ton G)n )nd t$ Entro/ M$)'ur$
#nformation gain is used to determine how well an attribute separates the training data
according to the target concept9 #t is based on a measure commonly used in information
theory known as entropy9 (efined o-er a collection of training data, S, with a 7ooleantarget concept, the entropy of Sis defined as&
wherep(+)is the proportion of positi-e e4amples in S andp()the proportion of negati-e
e4amples9 The function of the entropy measure is easily described with an e4ample9
Assume that there is a set of data Scontaining ten e4amples9 Se-en of the e4amples ha-e a
positi-e class and three of the e4amples ha-e a negati-e class :I, 5;96
-
8/12/2019 Andrew Thesis
23/125
$ote that if the number of positi-e and negati-e e4amples in the set were e-en Gp(+)p()
69>H, then the entropy function would equal ;9 #f all the e4amples in the set were of the
same class, then the entropy of the set would be 69 #f the set being measured contains an
unequal number of positi-e and negati-e e4amples then the entropy measure will be
between 6 and ;9
Entropy can be interpreted as the minimum number of bits needed to encode the
classification of an arbitrary member of S9 Consider two people passing messages back and
forth that are either positi-e or negati-e9 #f the recei-er of the message knows that the
message being sent is always going to be positi-e, then no message needs to be sent9
Therefore, there needs to be no encoding and no bits are sent9 #f on the other hand, half the
messages are negati-e, then one bit needs to be used to indicate that the message being sent
is either positi-e or negati-e9 For cases where there are more e4amples of one class than the
other, on a-erage, less than one bit needs to be sent by assigning shorter codes to more
likely collections of e4amples and longer codes to less likely collections of e4amples9 #n a
case wherep(+) 69 shorter codes could be assigned to collections of positi-e messages
being sent, with longer codes being assigned to collections of negati-e messages being
sent9
#nformation gain is the e4pected reduction in entropy when partitioning the e4amples of a
set S"according to an attribute#9 #t is defined as&
where Va$ue%(#)is the set of all possible -alues for an attribute#and Svis the subset of
e4amples in Swhich ha-e the -alue vfor attribute#9 !n a 7oolean data set ha-ing only
positi-e and negati-e e4amples, Va$ue%(#)would be defined o-er I,39 The first term in
the equation is the entropy of the original data set9 The second term describes the entropy
of the data set after it is partitioned using the attribute #9 #t is nothing more than a sum of
( ) ( )( )
( )v#Va$ue%v
vS!ntropy
S
SS!ntropy#S&ain
=,
;5
-
8/12/2019 Andrew Thesis
24/125
the entropies of each subset Svweighted by the number of e4amples that belong to the
subset9 The following is an e4ample of how &ain(S" #)would be calculated on a fictitious
data set9 /i-en a data set Swith ten e4amples G: positi-e and < negati-eH, each containing
an attribute Temperature, &ain(S"#) where #Temperature and Va$ue%GTemperatureH
KBarm, Free+ingL would be calculated as follows&
S :I,
-
8/12/2019 Andrew Thesis
25/125
There are two common approaches that decision tree induction algorithms can use to a-oid
o-erfitting training data9 They are&
Stop the training algorithm before it reaches a point in which it perfectly fits the
training data, and,
Prune the induced decision tree9
The most commonly used is the latter approach itchell, ;:9 (ecision tree learners
normally employ post3pruning techniques that e-aluate the performance of decision trees as
they are pruned using a -alidation set of e4amples that are not used during training9 The
goal of pruning is to impro-e a learners accuracy on the -alidation set of data9
#n its simplest form post3pruning operates by considering each node in the decision tree as
a candidate for pruning9 Any node can be remo-ed and assigned the most common class of
the training e4amples that are sorted to the node in question9 A node is pruned if remo-ing
it does not make the decision tree perform any worse on the -alidation set than before the
node was remo-ed9 7y using a -alidation set of e4amples it is hoped that the regularities in
the data used for training do not occur in the -alidation set9 #n this way pruning nodes
created on regularities occurring in the training data will not hurt the performance of the
decision tree o-er the -alidation set9
Pruning techniques do not always use additional data such as the following pruning
technique used by C=9>9
C=9> begins pruning by taking a decision tree to be and con-erting it into a set of rulesN one
for each path from the root node to a leaf9 Each rule is then generali+ed by remo-ing any of
its conditions that will impro-e the estimated accuracy of the rule9 The rules are then sorted
by this estimated accuracy and are considered in the sorted sequence when classifying
newly encountered e4amples9 The estimated accuracy of each rule is calculated on the
training data used to create the classifier Gi9e9, it is a measure of how well the rule classifies
the training e4amplesH9 The estimate is a pessimistic one and is calculated by taking the
;=
-
8/12/2019 Andrew Thesis
26/125
accuracy of the rule o-er the training e4amples it co-ers and then calculating the standard
de-iation assuming a binomial distribution9 For a gi-en confidence le-el, the lower3bound
estimate is taken as a measure of the rules performance9 A more detailed discussion of
C=9>s pruning technique can be found in .uinlan, ;96 offers adapti-e boosting Schapire and Freund, ;:9 The general idea behind
adapti-e boosting is to generate se-eral classifiers on the training data9 Bhen an unseen
e4ample is encountered to be classified, the predicted class of the e4ample is a weighted
count of -otes from indi-idually trained classifiers9 C>96 creates a number of classifiers by
first constructing a single classifier9 A second classifier is then constructed by re3training
on the e4amples used to create the first classifier, but paying more attention to the cases of
the training set in which the first classifier, classified incorrectly9 As a result the secondclassifier is generally different than the first9 The basic algorithm behind .uinlans
implementation of adapti-e boosting is described as follows9
Choose e4amples from the training set of $ e4amples each being assigned a
probability of ;8$ of being chosen to train a classifier9
Classify the chosen e4amples with the trained classifier9
Replace the e4amples by multiplying the probability of the misclassified e4amples
by a weight 79
Repeat the pre-ious three steps O times with the generated probabilities9
Combine the O classifiers gi-ing a weight logG7OH to each trained classifier9
;>
-
8/12/2019 Andrew Thesis
27/125
Adapti-e boosting can be in-oked by C>96 and the number of classifiers generated
specified9
Prunn% O/ton'C>96 constructs decision trees in two phases9 First it constructs a classifier that fits the
training data, and then it prunes the classifier to a-oid o-er3fitting the data9 Two options
can be used to affect the way in which the tree is pruned9
The first option specifies the degree in which the tree can initially fit the training data9 #t
specifies the minimum number of training e4amples that must follow at least two of the
branches at any node in the decision tree9 This is a method of a-oiding o-er3fitting data by
stopping the training algorithm before it o-er3fits the data9
A second pruning option that C>96 has affects the se-erity in which the algorithm will post3
prune constructed decision trees and rule sets9 Pruning is performed by remo-ing parts of
the constructed decision trees or rule sets that ha-e a high predicted error rate on new
e4amples9
Ru#$ S$t'
C>96 can also con-ert decision trees into rule sets9 For the purposes of this study rule sets
were generated using C>969 This is due to the fact that rule sets are easier to understand
than decision trees and can easily be described in terms of comple4ity9 That is, rules sets
can be looked at in terms of the a-erage si+e of the rules and the number of rules in the set9
The pre-ious description of C>96s operation is by no means complete9 #t is merely an
attempt to pro-ide the reader with enough information to understand the options that were
primarily used in this study9 C>96 has many other options that can be used to affect its
operation9 They include options to in-oke k3fold cross -alidation, enable differential
misclassification costs, and speed up training times by randomly sampling from large data
sets9
;?
-
8/12/2019 Andrew Thesis
28/125
5 P$r4or&)nc$ M$)'ur$'
E-aluating a classifierMs performance is a -ery important aspect of machine learning9
Bithout an e-aluation method it is impossible to compare learners, or e-en know whether
or not a hypothesis should be used9 For e4ample, learning to classify mushrooms as being
poisonous or not, one would want to be able to -ery precisely measure the accuracy of a
learned hypothesis in this domain9 The following section introduces the 'onfu%ion matrix
that identifies the type of errors a classifier makes, as well as two more sophisticated
e-aluation methods9 They are thegmean, which combines the performance of a classifier
o-er two classes, and OC 'urve%, which pro-ide a -isual representation of a classifiers
performance9
5.1 Con4u'on M)tr6
A classifiers performance is commonly broken down into what is known as a 'onfu%ion
matrix9 A confusion matri4 basically shows the type of classification errors a classifier
makes9 Figure 59
-
8/12/2019 Andrew Thesis
29/125
A classifierMs performance can also be separately calculated for its performance o-er the
positi-e e4amples Gdenoted as aIH and o-er the negati-e e4amples Gdenoted as a3H9 Each arecalculated as&
5.* %7M$)n
ubat, 1olte, and atwin ;@ use the geometric mean of the accuracies measured
separately on each class&
The basic idea behind this measure is to ma4imi+e the accuracy on both classes9 #n this
study the geometric mean will be used as a check to see how balanced the combination
scheme is9 For e4ample, if we consider an imbalanced data set that has 5=6 positi-e
e4amples and ?666 negati-e e4amples and stubbornly classify each e4ample as negati-e,
we could see, as in many imbalanced domains, a -ery high accuracy Gacc ?UH9 "sing
the geometric mean, howe-er, would quickly show that this line of thinking is flawed9 #t
would be calculated as sqrtG6 V ;H 69
5., ROC cur($'
OC 'urve% GRecei-ing !perator CharacteristicH pro-ide a -isual representation of the
trade off between true positi-es and false positi-es9 They are plots of the percentage ofcorrectly classified positi-e e4amples aI with respect to the percentage of incorrectly
classified negati-e e4amples a39
dc
ca
ba
aa
+=
+=
+
+
= aag
;@
-
8/12/2019 Andrew Thesis
30/125
ROC curves
0
20
(0)0
*0
100
0 20 (0 )0 *0 100
False Positive (%)
TruePositive(%)
Series1Series2
Figure 59& A fictitious e4ample of two R!C cur-es9
Point G6, 6H along a cur-e would represent a classifier that by default classifies all e4amples
as being negati-e, whereas a point G6, ;66H represents a classifier that correctly classifies all
e4amples9
any learning algorithms allow induced classifiers to mo-e along the cur-e by -arying
their learning parameters9 For e4ample, decision tree learning algorithms pro-ide options
allowing induced classifiers to mo-e along the cur-e by way of pruning parameters
Gpruning options for C>96 are discussed in Section 5959=H9 Swets ;@@ proposes that
classifiers performances can be compared by calculating the area under the cur-es
generated by the algorithms on identical data sets9 #n Figure 59
-
8/12/2019 Andrew Thesis
31/125
sampling techniques, discusses data set balancing techniques that sample training
e4amples, both in nai-e and intelligent fashions9 The third category, classifiers that co-er
one class, describes learning algorithms that create rules to co-er only one class9 The last
category, recognition based learning, discusses a learning method that ignores or makes
little use of one class all together9
8.1 M'c#)''4c)ton Co't'
Typically a classifiers performance is e-aluated using the proportion of e4amples that are
incorrectly classified9 Pa++ani, er+, urphy, Ali, 1ume, and 7runk ;= look at errors
made by a classifier in terms of their cost9 For e4ample, take an application such as the
detection of poisonous mushrooms9 The cost of misclassifying a poisonous mushroom as
being safe to eat may ha-e serious consequences and therefore should be assigned a high
costN con-ersely, misclassifying a mushroom that is safe to eat may ha-e no serious
consequences and should be assigned a low cost9 Pa++ani et al9 ;= use algorithms that
attempt to sol-e the problem of imbalanced data sets by way of introducing a cost matri49
The algorithm that is of interest here is called eu'e Co%t Orering GRC!H, which
attempts to order a decision list Gset of rulesH so as to minimi+e the cost of making incorrect
classifications9
RC! is a post3processing algorithm that can complement any rule learner such as C=9>9 #t
essentially orders a set of rules to minimi+e misclassification costs9 The algorithm works as
follows&
The algorithm takes as input a set of rules Grule listH, a cost matri4, and a set of e4amples
Ge4ample listH and returns an ordered set of rules Gdecision listH9 An e4ample of a cost
matri4 Gfor the mushroom e4ampleH is depicted in Figure 59=9;9
1ypothesis
Safe Poisonous Actual Class
6 ; Safe
;6 6 PoisonousFigure 59=9?& A cost matri4 for a poisonous mushroom
application9
56
-
8/12/2019 Andrew Thesis
32/125
$ote that the costs in the matri4 are the costs associated with the prediction in light of the
actual class9
The algorithm begins by initiali+ing a decision list to a default class which yields the leaste4pected cost if all e4amples were tagged as being that class9 #t then attempts to iterati-ely
replace the default class with a new rule 8 default class pair, by choosing a rule from the
rule list that co-ers as many e4amples as possible and a default class which minimi+es the
cost of the e4amples not co-ered by the rule chosen9 $ote that when an e4ample in the
e4ample list is co-ered by a chosen rule it is remo-ed9 The process continues until no new
rule 8 default class pair can be found to replace the default class in the decision list Gi9e9, the
default class minimi+es cost o-er the remaining e4amplesH9
An algorithm such as the one described abo-e can be used to tackle imbalanced data sets
by way of assigning high misclassification costs to the underrepresented class9 (ecision
lists can then be biased, or ordered to classify e4amples as the underrepresented class, as
they would ha-e the least e4pected cost if classified incorrectly9
#ncorporating costs into decision tree algorithms can be done by replacing the information
gain metric used with a new measure that bases partitions not on information gain, but on
the cost of misclassification9 This was studied by Pa++ani et al9 ;= by modifying #(< to
use a metric that chooses partitions that minimi+e misclassification cost9 The results of their
e4perimentation indicate that their greedy test selection method, attempting to minimi+e
cost, did not perform as well as using an information gain heuristic9 They attribute this to
the fact that their selection technique attempts to solely fit training data and not minimi+e
the comple4ity of the learned concept9
A more -iable alternati-e to incorporating misclassification costs into the creation of a
decision trees, is to modify pruning techniques9 Typically, decision trees are pruned by
merging lea-es of the tree to classify e4amples as the ma%ority class9 #n effect, this is
calculating the probability that an e4ample belongs to a gi-en class by looking at training
e4amples that ha-e filtered down to the lea-es being merged9 7y assigning the ma%ority
5;
-
8/12/2019 Andrew Thesis
33/125
class to the node of the merged lea-es, decision trees are assigning the class with the lowest
e4pected error9 /i-en a cost matri4, pruning can be modified to assign the class that has the
lowest e4pect cost instead of the lowest e4pected error9 Pa++ani et al9 ;= state that cost
pruning techniques ha-e an ad-antage o-er replacing the information gain heuristic with a
minimal cost heuristic, in that a change in the cost matri4 does not affect the learned
concept description9 This allows different cost matrices to be used for different e4amples9
8.* S)&/#n% T$cn9u$'
*,- .eterogeneou% /n'ertainty Samp$ing
0ewis and Catlett ;= describe a heterogeneou%0 approach to selecting training
e4amples from a large data set by using uncertainty sampling9 The algorithm they use
operates under an information filtering paradigmN uncertainty sampling is used to select
training e4amples to be presented to an e4pert9 #t can be simply described as a process
where a cheap classifier chooses a subset of training e4amples for which it is unsure of the
class from a large pool and presents them to an e4pert to be classified9 The classified
e4amples are then used to help the cheap classifier choose more e4amples for which it is
uncertain9 The e4amples that the classifier is unsure of are used to create a more e4pensi-e
classifier9
The uncertainty sampling algorithm used is an iterati-e process by which an ine4pensi-e
probabilistic classifier is initially trained on three randomly chosen positi-e e4amples from
the training data9 The classifier is based on an estimate of the probability that an instance
belongs to a class C&
< Their method is considered heterogeneous because a classifier of one type chooses e4amples to present to a classifier of
another type9
( )
( )
( )( )( )
++
+
=
=
=
)
i i
i
)
i i
i
CwP
CwPba
CwP
CwPba
1CP
;
;
S
Sloge4p;
S
Sloge4p
S
55
-
8/12/2019 Andrew Thesis
34/125
where C indicates class membership and wiis the ith attribute of d attributes in e4ample wN
a and b are calculated using logistic regression9 This model is described in detail in 0ewis
and 1ayes, ;=9 All we are concerned with here is that the classifier returns a number P
between 6 and ; indicating its confidence in whether or not an unseen e4ample belongs to a
class9 The threshold chosen to indicate a positi-e instance is 69>9 #f the classifier returns a P
higher than 69> for an unknown e4ample, it is considered to belong to the class C9 The
classifiers confidence in its prediction is proportional to the distance its prediction is away
from the threshold9 For e4ample, the classifier is less confident in a P of 69? belonging to C
than it is a P of 69 belong to C9
At each iteration of the sampling loop, the probabilistic classifier chooses four e4amples
from the training setN the two which are closest and below the threshold and the two which
are closest and abo-e the threshold9 The e4amples that are closest to the threshold are those
that it is least sure of the class9 The classifier is then retrained at each iteration of the
uncertainty sampling and reapplied to the training data to select four more instances that it
is unsure of9 $ote that after the four e4amples are chosen at each loop, their class is known
for retraining purposes Gthis is analogous to ha-ing an e4pert label e4amplesH9
The training set presented to the e4pert classifier can essentially be described as a pool ofe4amples that the probabilistic classifier is unsure of9 The pool of e4amples, chosen using a
threshold, will be biased towards ha-ing too many positi-e e4amples if the training data set
is imbalanced9 This is because the e4amples are chosen from a window that is centered
o-er the borderline where the positi-e and negati-e e4amples meet9 To correct for this, the
classifier chosen to train on the pool of e4amples, C=9>, was modified to include a loss ratio
parameter, which allows pruning to be based on e4pected loss instead of e4pected error
Gthis is analogous to cost pruning, Section 59=9;H9 The default rule for the classifier was also
modified to be chosen based on e4pected loss instead of e4pected error9
0ewis and Catlett ;= show by testing their sampling technique on a te4t classification
task that uncertainty sampling reduces the number of training e4amples required by an
e4pensi-e learner such as C=9> by a factor of ;69 They did this by comparing results of
5
-
8/12/2019 Andrew Thesis
35/125
induced decision trees on uncertainty samples from a large pool of training e4amples with
pools of e4amples that were randomly selected, but ten times larger9
*,, One %ie 2nte$$igent Se$e'tionubat and atwin ;: propose an intelligent one sided sampling technique that reduces
the number of negati-e e4amples in an imbalanced data set9 The underlying concept in their
algorithm is that positi-e e4amples are considered rare and must all be kept9 This is in
contrast to 0ewis and Catletts technique in that uncertainty sampling does not guarantee
that a large number of positi-e e4amples will be kept9 ubat and atwin ;: balance
data sets by remo-ing negati-e e4amples9 They categori+e negati-e e4amples as belonging
to one of four groups9 They are&
Those that suffer from class label noiseN
7orderline e4amples Gthey are e4amples which are close to the boundaries of
positi-e e4amplesHN
Redundant e4amples Gtheir part can be taken o-er by other e4amplesHN and
Safe e4amples that are considered suitable for learning9
#n their selection technique all negati-e e4amples, e4cept those which are safe, areconsidered to be harmful to learning and thus ha-e the potential of being remo-ed from the
training set9 Redundant e4amples do not directly harm correct classification, but increase
classification costs9 7orderline negati-e e4amples can cause learning algorithms to o-erfit
positi-e e4amples9
ubat and atwinMs ;: selection technique begins by first remo-ing redundant
e4amples from the training set9 To do this a subset C of the training e4amples, S, is created
by taking e-ery positi-e e4ample from S and randomly choosing one negati-e e4ample9
The remaining e4amples in S are then classified using the ;3$earest $eighbor G;3$$H rule
with C9 Any misclassified e4ample is added to C9 $ote that this technique does not make
the smallest C possible, it %ust shrinks S9 After redundant e4amples are remo-ed, e4amples
considered borderline or class noisy are remo-ed9
5=
-
8/12/2019 Andrew Thesis
36/125
7orderline, or class noisy e4amples are detected using the concept of Tomek 0inks
Tomek, ;:? that are defined by the distance between different class labeled e4amples9
Take for instance, two e4amples 4 and y with different classes9 The pair G4, yH is considered
to be a Tomek link if there e4ists no e4ample +, such that G4, +H G4, yH or Gy, +H Gy,
4H, where Ga, bH is defined as the distance between e4ample a and e4ample b9 E4amples
are considered borderline or class noisy if they participate in a Tomek link9
ubat and atwins selection technique was shown to be successful in impro-ing the
performance using the g3mean on two of three benchmark domains& -ehicles G-eh;H, glass
Gg:H, and -owels G-woH9 The domain in which no impro-ement was seen, g:, was
e4amined and it was found that in that particular domain the original data set did not
produce disproportionate -alues for gI and g39
*,0 Naive Samp$ing Te'hni3ue%
The pre-iously described selection algorithms balance data sets by significantly reducing
the number of training e4amples9 7oth are intelligent methods that filter out e4amples
using uncertainty sampling, or by remo-ing e4amples that are considered harmful to
learning9 0ing and 0i ;@ approach the problem of data imbalance using methods that
nai-ely downsi+e or o-er3sample data sets classifying e4amples with a confidence
measurement9 The domain of interest is data mining for direct marketing9 (ata sets in this
field are typically two class problems and are se-erely imbalanced, only containing a few
e4amples of people who ha-e bought the product and many e4amples of people who ha-e
not9 The three data sets studied by 0ing and 0i ;@ are a bank data set from a loan
product promotion G7ankH, a RRSP campaign from a life insurance company G0ife
#nsuranceH, and a bonus point program where customers accumulate points to redeem for
merchandise G7onusH9 As will be e4plained later, all three of the data sets are imbalanced9
(irect marketing is used by the consumer industry to target customers who are likely to
buy products9 Typically, if mass marketing is used to promote products Ge9g9, including
flyers in a newspaper with a large distributionH the response rate Gthe percent of people who
buy a product after being e4posed to the promotionH is -ery low and the cost of mass
5>
-
8/12/2019 Andrew Thesis
37/125
marketing -ery high9 For the three data sets studied by 0ing and 0i the response rates were
;95U of 6,66 responding in the 7ank data set, :U of @6,666 responding in the 0ife
#nsurance data set, and ;95U of ;6=,666 for the 7onus Program9
(ata mining can be -iewed as a two class domain9 /i-en a set of customers and their
characteristics, determine a set of rules that can accurately predict a customer as being a
buyer or a non3 buyer, ad-ertising only to buyers9 0ing and 0i ;@ howe-er, state that a
binary classification is not -ery useful for direct marketing9 For e4ample, a company may
ha-e a database of customers to which it wants to ad-ertise the sale of a new product to the
-
8/12/2019 Andrew Thesis
38/125
The e-aluation method used by 0ing and 0i ;@ is known as the lift inde49 This inde4
has been widely used in database marketing9 The moti-ation behind using the lift inde4 is
that it reflects the re3distribution of testing e4amples after a learner has ranked them9 For
e4ample, in this domain the learning algorithms rank e4amples in order of the most likely
to respond to the least likely to respond9 0ing and 0i ;@ di-ide the ranked list into ;6
deciles9 Bhen e-aluating the ranked list, regularities should be found in the distribution of
the responders Gi9e9, there should be a high percentage of the responders in the first few
decilesH9 Table 59=9; is a reproduction of the e4ample that 0ing and 0i ;@ present to
demonstrate this9
0ift Table
;6U ;6U ;6U ;6U ;6U ;6U ;6U ;6U ;6U ;6U
=;6 ;6 ;
-
8/12/2019 Andrew Thesis
39/125
"sing their lift inde4 as the sole measure of performance, 0ing and 0i ;@ report results
for o-er3sampling and downsi+ing on the three data sets of interest G7ank, 0ife #nsurance,
and 7onusH9
0ing and 0i ;@ report results that show the best lift inde4 is obtained when the ratio of
positi-e and negati-e e4amples in the training data is equal9 "sing 7oosted3$aW-e 7ayes
with a downsi+ed data set resulted in a lift inde4 of :69>U for 7ank, :>95U for 0ife
#nsurance, and @;99=U for 0ife #nsurance, and @69=U for the 7onus program when the data sets were
imbalanced at a ratio of ; positi-e e4ample to e-ery @ negati-e e4amples9 1owe-er, using
7oosted37ayes with o-er3sampling did not show any significant impro-ement o-er the
imbalanced data set9 0ing and 0i ;@ state that one method to o-ercome this limitation
may be to retain all the negati-e e4amples in the data set and re3sample the positi-e
e4amples=9
Bhen tested using their boosted -ersion of C=9>, o-er3sampling saw a performance gain as
the positi-e e4amples were re3sampled at higher rates9 Bith a positi-e sampling rate of
564, 7ank saw an increase of 59U Gfrom ?>9?U to ?@9>UH, 0ife #nsurance an increase of
59U Gfrom :=9
-
8/12/2019 Andrew Thesis
40/125
which techniques are appropriate in dealing with class imbalancesX To in-estigate these
questions *apkowic+ created a number of artificial domains which were made to -ary in
concept comple4ity, si+e of the training data and ratio of the under3represented class to the
o-er3represented class9
The target concept to be learned in her study was a one dimensional set of continuous
alternating equal si+ed inter-als in the range 6, ;, each associated with a class -alue of 6
or ;9 For e4ample, a linear domain generated using her model would be the inter-als 6,
69>H and G69>, ;9 #f the first inter-al was gi-en the class ;, the second inter-al would ha-e
class 69 E4amples for the domain would be generated by randomly sampling points from
each inter-al Ge9g9, a point 4 sampled in 6, 69> would be a G4, IH e4ample, and likewise a
point y sampled in G69>, ; would be an Gy, 3H e4ampleH9
*apkowic+ 5666 -aried the comple4ity of the domains by -arying the number of inter-als
in the target concept9 (ata set si+es and balances were easily -aried by uniformly sampling
different numbers of points from each inter-al9
The two balancing techniques that *apkowic+ 5666 used in her study that are of interest
here are o-er3sampling and downsi+ing9 The o-er3sampling technique used was one in
which the small class was randomly re3sampled and added to the training set until the
number of e4amples of each class was equal9 The downsi+ing technique used was one in
which random e4amples were remo-ed from the larger class until the si+e of the classes
was equal9 The domains and balancing techniques described abo-e were implemented
using -arious discrimination based neural networks G(0PH9
*apkowic+ found that both re3sampling and downsi+ing helped impro-e (0P, especially
as the target concept became -ery comple49 (ownsi+ing, howe-er, outperformed o-er3
sampling as the si+e of the training set increased9
5
-
8/12/2019 Andrew Thesis
41/125
8., C#)''4$r' :c Co($r On$ C#)''
*0- 5/T!
Riddle, Segal, and Et+ioni ;= propose an induction technique called 7R"TE9 The goal
of 7R"TE is not classification, but the detection of rules that predict a class9 The domain
of interest which leads to the creation of 7R"TE is the detection of manufactured airplane
parts that are likely to fail9 Any rule that detects anomalies, e-en if they are rare, is
considered important9 Rules which predict that a part will not fail, on the other hand are not
considered -aluable, no matter how large the co-erage may be9
7R"TE operates on the premise that standard decision trees test functions such as #(
-
8/12/2019 Andrew Thesis
42/125
#t can be seen that T5 will be chosen o-er T; using #(
-
8/12/2019 Andrew Thesis
43/125
CART and 59U for C=>9 !ne drawback is that the computational comple4ity of 7R"TES
depth bound search is much higher than that of typical decision tree algorithms9 They do
report, howe-er, that it only took CP" minutes of computation on a SPARC3;69
*0, 7O24
F!#0 .uinlan, ;6 is an algorithm designed to learn a set of first order rules to predict a
target predicate to be true9 #t differs from learners such as C>96 in that it learns relations
among attributes that are described with -ariables9 For e4ample, using a set of training
e4amples where each e4ample is a description of people and their relations&
$ame; *ack, /irlfriend; *ill,
$ame5 *ill, 7oyfriend5 *ack, Couple;5 True Q
C>96 may learn the rule&
#F G$ame; *ackH Y G7oyfriend5 *ackH T1E$ Couple;5 True9
This rule of course is correct, but will ha-e a -ery limited use9 F!#0 on the other hand can
learn the rule&
#F 7oyfriendG4, yH T1E$ CoupleG4, yH True
where 4 and y are -ariables which can be bound to any person described in the data set9 A
positi-e binding is one in which a predicate binds to a positi-e assertion in the training
data9 A negati-e binding is one in which there is no assertion found in the training data9 For
e4ample, the predicate 7oyfriendG4, yH has four possible bindings in the e4ample abo-e9
The only positi-e assertion found in the data is for the binding 7oyfriendG*ill, *ackH Gread
> The accuracy being referred to here is not how well a rule set performs o-er the testing data9 Bhat is being referred to
is the percentage of testing e4amples which are co-ered by a rule and correctly classified9 The e4ample Riddle et al9
;= gi-e is that if a rule matches ;6 e4amples in the testing data, and = of them are positi-e, then the predicti-e
accuracy of the rule is =6U9 The figures gi-en are a-erages o-er the entire rule set created by each algorithm9 Riddle et
al9 ;= use this measure of performance in their domain because their primary interest is in finding a few accurate
rules that can be interpreted by factory workers in order to impro-e the production process9 #n fact, they state that they
would be happy with a poor tree with one really good branch from which an accurate rule could be e4tracted9
-
8/12/2019 Andrew Thesis
44/125
the boyfriend of *ill is *ackH9 The other three possible bindings Ge9g9, 7oyfriendG*ack, *illHH
are negati-e bindings, because there are no positi-e assertions in the training data9
The following is a brief description of the F!#0 algorithm adapted from itchell, ;:9
F!#0 takes as input a target predicate Ge9g9, CoupleG4, yHH, a list of predicates that will be
used to describe the target predicate and a set of e4amples9 At a high le-el, the algorithm
operates by learning a set of rules that co-ers the positi-e e4amples in the training set9 The
rules are learned using an iterati-e process that remo-es positi-e training e4amples from
the training set when they are co-ered by a rule9 The process of learning rules continues
until there are enough rules to co-er all the positi-e training e4amples9 This way, F!#0 can
be -iewed as a specific to general search through a hypothesis space, which begins with an
empty set of rules that co-ers no positi-e e4amples and ends with a set of rules general
enough to co-er all the positi-e e4amples in the training data Gthe default rule in a learned
set is negati-eH9
Creating a rule to co-er positi-e e4amples is a process by which a general to specific search
is performed starting with an empty condition that co-ers all e4amples9 The rule is then
made specific enough to co-er only positi-e e4amples by adding literals to the rule Ga
literal is defined as a predicate or its negati-eH9 For e4ample, a rule predicting the predicate
FemaleG4H may be made more specific by adding the literals long)hairG4H and ZbeardG4H9
The function used to e-aluate which literal, 0, to add to a rule, R, at each step is&
where p6and n6are the number of positi-e GpH and negati-e GnH bindings of the rule R, p ;
and n;are the number of positi-e and negati-e binding of the rule which will be created by
adding 0 to R and t is the number of positi-e bindings of the rule R which are still co-ered
by R when 0 is added Gi9e9, t p6 3 p;H9
( )
+
+=
66
65
;;
;5 loglog,)
np
p
np
pt(4&ain7oi$
-
8/12/2019 Andrew Thesis
45/125
The function 7oi$6&aindetermines the utility of adding 0 to R9 #t prefers adding literals
with more positi-e bindings than negati-e bindings9 As can be seen in the equation, the
measure is based on the proportion of positi-e bindings before and after the literal in
question is added9
*00 S.2N8
ubat, 1olte, and atwin ;@ discuss the design of the S1R#$ algorithm that follows
the same principles as 7R"TE9 S1R#$ operates by finding rules that co-er positi-e
e4amples9 #n doing this, it learns from both positi-e and negati-e e4amples using the g3
mean to take into account rule accuracy o-er negati-e e4amples9 There are three principles
behind the design of S1R#$9 They are&
(o not subdi-ide the positi-e e4amples when learningN
Create a classifier that is low in comple4ityN and
Focus on regions in space where positi-e e4amples occur9
A S1R#$ classifier is made up of a network of tests9 Each test is of the form& 4 imin aiN
ma4 ai where i inde4es the attributes9 0et hirepresent the output of the ith test9 #f the test
suggests a positi-e test, the output is ;, else it is 3;9 E4amples are classified as being
positi-e if ihiwiQ where wiis a weight assigned to the test h i9
S1R#$ creates the tests and weights in the following way9 #t begins by taking the inter-al
for each attribute that co-ers all the positi-e e4amples9 The inter-al is then reduced in si+e
by remo-ing either the left or right point based on whiche-er produces the best g3mean9
This process is repeated iterati-ely and the inter-al found to ha-e the best g3mean is
considered the test for the attribute9 Any test that has a g3mean less than 69>6 is discarded9
The weight assigned to each test is wi log Ggi8;3giH where giis the g3mean associated with
the ith attribute test9
-
8/12/2019 Andrew Thesis
46/125
The results reported by ubat et al9 ;@ demonstrate that the S1R#$ algorithm
performs better than ;3$earest $eighbor with one sided selection?9 Pitting S1R#$ against
C=9> with one sided selection the results became less clear9 "sing one sided selection
resulted in a performance gain o-er the positi-e e4amples but a significant loss o-er the
negati-e e4amples9 This loss of performance o-er the negati-e e4amples results in the g3
mean being lowered by about ;6U9
Accuracies Achie-ed by C=9>, ;3$$ and Shrink
Classifier aI a3 g3mean
C=9> @;9; @?9? @;9:
;3$$ ?:95 @ ?69 :69Table 59=95& This table is adapted from ubat et al9, ;@9 #tgi-es the accuracies achie-ed by C=9> ;3$$ and S1R#$9
8.- R$co%nton B)'$d L$)rnn%
(iscrimination based learning techniques, such as C>969 create rules which describe both
the positi-e GconceptualH class, as well as the negati-e Gcounter conceptualH class9
Algorithms such as, 7R"TE, and F!#0 differ from algorithms such as C>96, in that they
create rules that only co-er positi-e e4amples9 1owe-er, they are still discrimination based
techniques because they create positi-e rules using negati-e e4amples in their search
through the hypothesis space9 For e4ample, F!#0 creates rules to co-er the positi-e class
by adding literals until they do not co-er any of the negati-e class e4amples9 !ther learning
methods, such as back propagation applied to a feed forward neural network and 3nearest
neighbor, do not e4plicitly create rules, but they are discrimination based techniques that
learn from both positi-e and negati-e e4amples9
*apkowic+, yers, and /luck ;> describe 1#PP!, a system that learns to recogni+e atarget concept in the absence of counter e4amples9 ore specifically, it is a neural network
Gcalled an autoencoderH that is trained to take positi-e e4amples as input, map them to a
small hidden layer, and then attempt to reconstruct the e4amples at the output layer9
? !ne sided selection is discussed in Section 59>95959 #t is essentially a method by which negati-e e4amples considered
harmful to learning are remo-ed from the data set9
-
8/12/2019 Andrew Thesis
47/125
7ecause the network has a narrow hidden layer it is forced to compress redundancies found
in the input e4amples9
An ad-antage of recognition based learners is that they can operate in en-ironments inwhich negati-e e4amples are -ery hard or e4pensi-e to obtain9 An e4ample *apkowic+ et
al9 ;> gi-e is the application of machine fault diagnosis where a system is designed to
detect the likely failure of hardware Ge9g9, helicopter gear bo4esH9 #n domains such as this,
statistics on functioning hardware are plentiful, while statistics of failed hardware may be
nearly impossible to acquire9 !btaining positi-e e4amples in-ol-es monitoring functioning
hardware, while obtaining negati-e e4amples in-ol-es monitoring hardware that fails9
Acquiring enough e4amples of failed hardware for training a discrimination based learner,
can be -ery costly if the de-ice has to be broken a number of different ways to reflect all
the conditions in which it may fail9
#n learning a target concept, recognition based classifiers such as that described by
*apkowic+ et al9 ;> do not try to partition a hypothesis space with boundaries that
separate positi-e and negati-e e4amples, but they attempt to make boundaries which
surround the target concept9 The following is an o-er-iew of how 1#PP!, a one hidden
layer autoencoder, is used for recognition based learning9
A one hidden layer autoencoder consists of three layers, the input layer, the hidden layer
and the output layer9 Training an autoencoder takes place in two stages9 #n the first stage
the system is trained on positi-e instances using back3propagation :to be able to compress
the training e4amples at the hidden layer and reconstruct them at the output layer9 The
second stage of training in-ol-es determining a threshold that can be used to determine the
reconstruction error between positi-e and negati-e e4amples9
The second stage of training is a semi3automated process that can be one of two cases9 The
first noiseless case is one in which a lower bound is calculated on the reconstruction error
of either the negati-e or positi-e instances9 The second noisy case is one that uses both
: $ote that back propagation is not the only training function that can be used9 E-ans and *apkowic+ 5666 report results
using an auto3encoder trained with the !ne Step Secant function9
-
8/12/2019 Andrew Thesis
48/125
positi-e and negati-e training e4amples to calculate the threshold ignoring the e4amples
considered to be noisy or e4ceptional9
After training and threshold determination, unseen e4amples can be gi-en to theautoencoder that can compress and then reconstruct them at the output layer, measuring the
accuracy at which the e4ample was reconstructed9 For a two class domain this is -ery
powerful9 Training an autoencoder to be able to sufficiently reconstruct the positi-e class,
means that unseen e4amples that can be reconstructed at the output layer contain features
that were in the e4amples used to train the system9 "nseen e4amples that can be
generali+ed with a low reconstruction error can therefore be deemed to be of the same
conceptual class as the e4amples used for training9 Any e4ample which cannot be
reconstructed with a low reconstruction error is deemed to be unrecogni+ed by the system
and can be classified as the counter conceptual class9
*apkowic+ et al9 ;> compared 1#PP! to two other standard classifiers that are designed
to operate with both positi-e and negati-e e4amples9 They are C=9> and applying back
propagation to a feed forward neural network GFF ClassificationH9 The data sets studied
were&
The C1=? 1elicopter /earbo4 data set olesar and $Ra(, ;=9 This domain
consists of discriminating between faulty and non3faulty helicopter gearbo4es
during operation9 The faulty gearbo4es are the positi-e class9
The Sonar Target Recognition data set9 This data was obtained from the "9C9 #r-ine
Repository of achine 0earning9 This domain consists of taking sonar signals as
input and determining which signals constitute rocks and which are mines Gmine
signals were considered the positi-e class in the studyH9
The Promoter data set9 This data consists of input segments of ($A strings9 The
problem consists of recogni+ing which strings represent promoters that are the
positi-e class9
-
8/12/2019 Andrew Thesis
49/125
Testing 1#PP! showed that it performed much better than C=9> and FF Classifier on the
1elicopters and Sonar Targets domains9 #t performed equally with FF classifier on the
promoters domain but much better than C=9> on the same data9
(ata Set Results
(ata Set 1#PP! C=9> FF Classifier
1elicopters 69 ;>9?5>;9 ;69;9:
Promoters 5669: ;9= 56;9=
SonarTargets
5659: 5;9@
-
8/12/2019 Andrew Thesis
50/125
-
8/12/2019 Andrew Thesis
51/125
C h a p t e r T h r e e
< ART#F#C#A0 (!A#$
Chapter < is di-ided into three sections9 Section 969 The
purpose of the e4periments is to in-estigate the nature of imbalanced data sets and pro-ide
a moti-ation behind the design of a system intended to impro-e a standard classifiers
performance on imbalanced data sets9 Section
-
8/12/2019 Andrew Thesis
52/125
where k is the number of dis%uncts, n is the number of con%unctions in each dis%unct, and 4 n
is defined o-er the alphabet 4;, 45,[, 4%9 Z4;, Z45, [,Z4%9 An e4ample of a k3($F
e4pression, being 5, gi-en as GE4p9 ;H9
4;Y4Z4 GE4p9 ;H
$ote that if 4k is a member of a dis%unct Z4kcannot be9 Also note, GE4p9 ;H would be
referred to as an e4pression of , the following four e4amples would ha-e
classes indicated by I839
4; 45 4< 4= 4> Class
;H ; 6 ; ; 6 I
5H 6 ; 6 ; ; I
-
8/12/2019 Andrew Thesis
53/125
Figure
-
8/12/2019 Andrew Thesis
54/125
The other similarity between te4t classification and k3($F e4pressions is the ability to
affect the comple4ity of the target e4pression in a k3($F e4pression9 7y -arying the
number of dis%uncts in an e4pression we can -ary the difficulty of the target concept to be
learned9@This ability to control concept comple4ity can map itself onto te4t classification
tasks where not all classification tasks are equal in difficulty9 This may not be ob-ious at
first9 Consider a te4t classification task where one needs to classify documents as being
about a particular consumer product9 The comple4ity of the rule set needed to distinguish
documents of this type, may be as simple as a single rule indicating the name of the product
and the name of the company that produces it9 This task would probably map itself to a
-ery simple k3($F e4pression with perhaps only one dis%unct9 $ow consider training
another classifier intended to be used to classify documents as being computer softwarerelated or not9 The number of rules needed to describe this category is probably much
greater9 For e4ample, the terms JcomputerJ and JsoftwareJ in a document may be good
indicators that a document is computer software related, but so might be the term
JwindowsJ, if it appears in a document not containing the term JcleanerJ9 #n fact, the terms
JoperatingJ and JsystemJ or JwordJ and JprocessorJ appearing together in a document are
also good indicators that it is software related9 The comple4ity of a rule set needed to be
constructed by a learner to recogni+e computer software related documents is, therefore,
greater and would probably map onto a k3($F e4pression with more dis%uncts than that of
the first consumer product e4ample9
The biggest difference between the two domains is that the artificial domain was created
without introducing any noise9 $o negati-e e4amples were created and labeled as being
positi-e9 0ikewise, there were no positi-e e4amples labeled as negati-e9 For te4t domains
in general there is often label noise in which documents are gi-en labels that do not
accurately indicate their content9
@ As the number of dis%uncts GkH in an e4pression increases, more partitions in the hypothesis space are need to be
reali+ed by a learner to separate the positi-e e4amples from the negati-e e4amples9
=
-
8/12/2019 Andrew Thesis
55/125
;.* E6)&/#$ Cr$)ton
For the described tests, training e4amples were always created independently of the testing
e4amples9 The training and testing e4amples were created in the following manner&
A Random k3($F e4pression is created on a gi-en alphabet si+e Gin this study the
alphabet si+e is >6H9
An arbitrary set of e4amples was generated as a random sequence of attributes
equal to the si+e of the alphabet the k3($F e4pression was created o-er9 All the
attributes were gi-en an equal probability of being either 6 or ;9
Each e4ample was then classified as being either a member of the e4pression or not
and tagged appropriately9 Figure 666 negati-e e4amples and ;566 positi-e e4amples was used9 This
represented a class imbalance of >&; in fa-or of the negati-e class9 As the tests, howe-er,
lead to the creation of a combination scheme, the data sets tested were further imbalanced
to a 5>&; G?666 negati-e & 5=6 positi-eH ratio in fa-or of the negati-e class9 This greater
imbalance more closely resembled the real world domain of te4t classification on which the
system was ultimately tested9 #n each case the e4act ratio of positi-e and negati-e e4amples
in both the training and testing set will be indicated9
==
-
8/12/2019 Andrew Thesis
56/125
;., D$'cr/ton o4 T$'t' )nd R$'u#t'
The description of each test will consist of se-eral sections9 The first section will state the
moti-ation behind performing the test and gi-e the particulars of its design9 The results of
the e4periment will then be gi-en followed by a discussion9
90- Te%t : - Varying the Target Con'ept% Comp$exity
'arying the number of dis%uncts in an e4pression -aries the comple4ity of the target
concept9 As the number of dis%uncts increases, the following two things occur in a data set
where the positi-e e4amples are e-enly distributed o-er the target e4pression and their
number is held constant&
The target concept becomes more comple4, and
The number of positi-e e4amples becomes sparser relati-e to the target concept9
A -isual representation of the preceding statements is gi-en in Figure
-
8/12/2019 Andrew Thesis
57/125
The moti-ation behind this e4periment comes from Schaffer ;96 learns target concepts of increasing comple4ity on balanced and imbalanced data sets9
S$tu/
#n order to in-estigate the performance of induced decision trees on balanced and
imbalanced data sets, eight sets of training and testing data of increasing target concept
comple4ities were created9 The target concepts in the data sets were made to -ary in
concept comple4ity by increasing the number of dis%uncts in the e4pression to be learned,
while keeping the number of con%unctions in each dis%unct constant9 The following
algorithm was used to produce the results gi-en below9
Repeat 4 times
o Create a training set TGc, ?666I, ?6663Ho Create a test set EGc, ;566I, ;5663H
o Train Con T
o Test Con Eand record its performance P;&;
o Randomly remo-e =@66 positi-e e4amples from T
o Train Con T
o Test Con Eand record its performance P;&>
o Randomly remo-e ?6 positi-e e4amples from T
o Train Con To Test Con Eand record its performance P;&5>
$ote that throughout Chapter < the testing sets used to measure the performance of the induced classifiers are balanced9
That is, there is an equal number of both positi-e and negati-e e4amples used for testing9 The test sets are artificially
balanced in order to increase the cost of misclassifying positi-e e4amples9 "sing a balanced testing set to measure a
classifiers performance gi-es each class equal weight9
=?
-
8/12/2019 Andrew Thesis
58/125
A-erage each Ps o-er each 49
For this test e4pressions of comple4ity ' =45, =469 The results for each e4pression were a-eraged o-er 4 ;6 runs9
R$'u#t'
The results of the e4periment are shown in Figures
-
8/12/2019 Andrew Thesis
59/125
Error Over All Examples
0
041
042
04/
04(
(x2
(x/
(x(
(x-
(x)
(x.
(x*
(x10
Degree of Com plexity
Erro
r 151
15-
152-
Figure
-
8/12/2019 Andrew Thesis
60/125
D'cu''on
As pre-iously stated, the purpose of this e4periment was to test the classifiers performance
on both balanced and imbalanced data sets while -arying the comple4ity of the target
e4pression9 #t can be seen in Figure
-
8/12/2019 Andrew Thesis
61/125
Table
-
8/12/2019 Andrew Thesis
62/125
#n terms of the o-erall si+e of the data set, downsi+ing significantly reduces the number of
o-erall e4amples made a-ailable for training9 7y lea-ing negati-e e4amples out of the data
set, information about the negati-e Gor counter conceptualH class is being remo-ed9
!-er3sampling has the opposite effect in terms of the si+e of the data set9 Adding e4amples
by re3sampling the positi-e Gor conceptualH class, howe-er, does not add any additional
information to the data set9 #t %ust balances the data set by increasing the number of positi-e
e4amples in the data set9
S$tu/
This test was designed to determine if randomly remo-ing e4amples of the o-er
represented negati-e class, or uniformly o-er3sampling e4amples of the under represented
class to balance the data set, would impro-e the performance of the induced classifier o-er
the test data9 To do this, data sets imbalanced at a ratio of ;I&5>3 were created, -arying the
comple4ity of target e4pression in terms of the number of dis%uncts9 The idea behind the
testing procedure was to start with an imbalanced data set and measure the performance of
an induced classifier as either negati-e e4amples are remo-ed, or positi-e e4amples are re3
sampled and added to the training data9 The procedure gi-en below was followed to
produce the presented results9
Repeat 4 times
o Create a training set TG', 5=6I, ?6663H
o Create a test set EG', ;566I, ;5663H
o Train Con T
o Test Con Eand record its performance Poriginalo Repeat for n ; to ;6
Create TdG5=6I, G?666 3 nV>:?H3H by randomly remo-ing >:?Vne4amples from T
Train Con Td Test Con Eand record its performance Pdownsi+e
o Repeat for n ; to ;6
Create ToGG5=6 I nV>:?HI, ?6663H by uniformly o-er3sampling the
positi-e e4amples from T9
Train Con Td
Test Con Eand record its performance Po-ersample
>;
-
8/12/2019 Andrew Thesis
63/125
-
8/12/2019 Andrew Thesis
64/125
For downsi+ing the numbers represent the rate at which negati-e e4amples were remo-ed
from the training data9 The point 6 represents no negati-e e4amples being remo-ed, while
;66 represents all the negati-e e4amples being remo-ed9 The point 6 represents the
training data being balanced G5=6I, 5=63H9 Essentially, what is being said is that the
negati-e e4amples were remo-ed at >:? increments9
For o-er3sampling, the labels on the 43a4is are simply the rate at which the positi-e
e4amples were re3sampled, ;66 being the point at which the training data set is balanced
G?666I, ?6663H9 The positi-e e4amples were therefore re3sampled at >:? increments9
#t can be seen from Figure :?H or =6:?H positi-e e4amples9 That is, the lowest error rate
achie-ed for o-er3sampling is around the ?6 or :6 mark in Figure
-
8/12/2019 Andrew Thesis
65/125
x$ Accuracy Over All Examples
0
041
042
04/
04(
0 20 (0 )0 *0 100
"ampli#g Rate
Error 6ow nsiin!
'verSamplin!
Figure & This graph demonstrates that the optimal le-el at
which a data set should be balanced does not always occur at thesame point9 To see this, compare this graph with Figure
-
8/12/2019 Andrew Thesis
66/125
x% Accuracy Over Negative Examples
0
04002
0400(
0400)
0400*
0401
0 20 (0 )0 *0 100
rror
6ownsiin!
'verSamplin!
Figure
-
8/12/2019 Andrew Thesis
67/125
The results in Figure
-
8/12/2019 Andrew Thesis
68/125
Figure :
-
8/12/2019 Andrew Thesis
69/125
There are competing factors when each balancing technique is used9 Achie-ing a
higher aI comes at the e4pense of a3 Gthis is a common point in the literature for
domains such as te4t classificationH9
900 Te%t :0 # u$e Count for 5a$an'e Data Set%
"ltimately, the goal of the e4periments described in this section is to pro-ide moti-ation
behind the design of a system that combines multiple classifiers that use different sampling
techniques9 The ad-antage of combining classifiers that use different sampling techniques
only comes if there is a -ariance in their predictions9 Combining classifiers that always
make the same predictions is of no -alue if one hopes that their combination will increase
predicti-e accuracy9 #deally, one would like to combine classifiers that agree on correctpredictions, but disagree on incorrect predictions9
ethods that combine classifiers such as Adapti-e37oosting attempt to -ary learners
predictions by -arying the training e4amples in which successi-e classifiers are presented
to learn on9 As we saw in Section 5959=, Adapti-e37oosting increases the sampling
probability of e4amples that are incorrectly classified by already constructed classifiers9 7y
placing this higher weight on incorrectly classified e4amples, the induction process at each
iteration is biased towards creating a classifier that performs well on pre-iously
misclassified e4amples9 This is done in an attempt to create a number of classifiers that can
be combined to increase predicti-e accuracy9 #n doing this, Adapti-e37oosting ideally
di-ersifies the large rule sets of the classifiers9
S$tu/
Rules can be described in terms of their comple4ity9 0arger rules sets are considered more
comple4 than smaller rule sets9 This e4periment was designed to get a feel for the
comple4ities of the rule sets produced by C>96, when induced on imbalanced data sets that
ha-e been balanced by either o-er3sampling or downsi+ing9 7y looking at the comple4ity
of the rule sets created, we can get a feel for the differences between the rule sets created
>@
-
8/12/2019 Andrew Thesis
70/125
using each sampling technique9 The following algorithm was used to produce the results
gi-en below9
Repeat 4 timeso Create a training set TGc, 5=6I, ?6663H
o Create ToG?666I,?6663H by uniformly re3sampling the positi-e e4amples
from Tand adding the negati-e e4amples from T9o Train Con Too Record rule counts RoI and Ro3 for positi-e and negati-e rule sets
o Create TdG5=6I, 5=63H by randomly remo-ing >?:6 negati-e e4amples from
T9o Train Con Td
o Record rule counts RdI and Rd3 for positi-e and negati-e rule sets
A-erage rule counts o-er 49
For this test e4pressions of si+es c =45, =46 were tested and a-eraged o-er 4
-
8/12/2019 Andrew Thesis
71/125
Positi-e Rule Counts
Don S96=4? =9: ; =96 ?96
=4: =9@ ;>9< =9< :95
=4@ =9 ;>9= @9<
-
8/12/2019 Andrew Thesis
72/125
7efore # begin the discussion of these results it should be noted that these numbers must
only be used to indicate general trends towards rule set comple4ity9 Bhen being a-eraged
for e4pressions of comple4ities =4? and greater the numbers -aried considerably9 The
discussion will be in four parts9 #t will begin by attempting to e4plain the factors in-ol-ed
in creating rule sets o-er imbalanced data sets and then lead into an attempt to e4plain the
characteristics of rules sets created by downsi+ed data sets, followed by o-er3sampled rule
sets9 # will then conclude with a general discussion about some of the characteristics of the
artificial domain and how they create the results that ha-e been presented9 Throughout this
section one should remember that the positi-e rule set contains the target concept, that is,
the underrepresented class9
.ow oe% a $a'= of po%itive training examp$e% hurt $earning>
ubat et al9 ;@ gi-e an intuiti-e e4planation of why a lack of positi-e e4amples hurts
learning9 0ooking at the decision surface of a two dimensional plane, they e4plain the
beha-ior of the ;3$earest $eighbor G;3$$H rule9 #t is a simple e4planation that is
generali+ed as& J[as the number of negati-e e4amples in a noisy domain grows Gthe
number of positi-es being constantH, so does the likelihood that the nearest neighbor of any
e4ample will be negati-e9J Therefore, as more negati-e e4amples are introduced to the data
set, the more likely a positi-e e4ample is to be classified as negati-e using the ;3$$ rule9
!f course, as the number of negati-e e4amples approaches infinity, the accuracy of a
learner that classifies all e4amples as negati-e approaches ;66U o-er negati-e data and 6U
o-er the positi-e data9 This is unacceptable if one e4pects to be able to recogni+e positi-e
e4amples9
They then e4tend the argument to decision trees, drawing a connection to the common
problem of o-erfitting9 Each leaf of a decision tree represents a decision as being positi-e
or negati-e9 #n a noisy training set that is unbalanced in terms of the number of negati-e
e4amples, it is stated that an induced decision tree will be large enough to create regions
arbitrarily small enough to partition the positi-e regions9 That is, the decision tree will ha-e
rules comple4 enough to co-er -ery small regions of the decision surface9 This is a result of
?;
-
8/12/2019 Andrew Thesis
73/125
a classifier being induced to partition positi-e regions of the decision surface small enough
to contain on$y positi-e e4amples9 #f there are many negati-e e4amples nearby, the
partitions will be made -ery small to e4clude them from the positi-e regions9 #n this way,
the tree o-erfits the data with a similar effect as the ;3$$ rule9
any approaches ha-e been de-eloped to a-oid o-er fitting data, the most successful being
post pruning9 ubat et al9 ;@, howe-er, state that this does not address the main
problem9 #f a region in an imbalanced data set by definition contains many more negati-e
e4amples than positi-e e4amples, post pruning is -ery likely to result in all of the pruned
branches being classified as negati-e9
C?@ an u$e Set%
C>96 attempts to partition data sets into regions that contain only positi-e e4amples and
regions that contain only negati-e e4amples9 #t does this by attempting to find features in
the data that are good to partition the training data around Gi9e9, ha-e a high information
gainH9 !ne can look at the partitions it creates by analy+ing the rules that are generated
which create the boundaries9 Each rule generated creates a partition in the data9 Rules can
appear to o-erlap, but when -iewed as partitions in an entire set of rules, the partitions
created in the data by the rule sets do not o-erlap9 'iewed as an entire set of rules, thepartitions in the data can be -iewed has ha-ing highly irregular shapes9 This is due to the
fact that C>96 assigns a confidence le-el to each rule9 #f a region of space is o-erlapped by
multiple rules, the confidence le-el for each rule class that co-ers the space is summed9 The
class with the highest summed confidence le-el is determined to be the correct class9 The
confidence le-el gi-en to each rule can be -iewed as being the number of e4amples the rule
co-ers correctly o-er the training data9 Therefore, rule sets that contain higher numbers of
rules are generally less confident in their estimated accuracy because each rule co-ers
fewer e4amples9 Figure
-
8/12/2019 Andrew Thesis
74/125
Rule 1 Rule 2
Figure 96 adds rules to createcomple4 decision surfaces9 #t is done by summing the confidencele-el of rules that co-er o-erlapping regions9 A region co-ered by
more than one rule is assigned the class with the highest summed
confidence le-el of all the rules that co-er it9 1ere we assumeRule ; has a higher confidence le-el than Rule 59
Down%i
-
8/12/2019 Andrew Thesis
75/125
!-er3sampling has different effects than downsi+ing9 !ne ob-ious difference is the
comple4ity of the rule sets indicating negati-e partitions9 Rule sets that classify negati-e
e4amples when o-er3sampling is used are much larger than those created using
downsi+ing9 This is because there is still the large number of negati-e e4amples in the data
set, resulting in a large number of rules created to classify them9
The rule sets created for the negati-e e4amples are gi-en much less confidence than those
created when downsi+ing is used9 This effect occurs due to the fact that the learning
algorithm attempts to partition the data using features contained in the negati-e e4amples9
7ecause there is no target concept contained in the negati-e e4amples;5Gi9e9, no features to
indicate an e4ample to be negati-eH, the learning algorithm is faced with the dubious task,
in this domain, of attempting to find features that do not e4ist e4cept by mere chance9
!-er sampling the positi-e class can be -iewed as adding weight to the e4amples that are
re3sampled9 "sing an information gain heuristic when searching through the hypothesis
space, features which partition more e4amples correctly are fa-ored o-er those that do not9
The effect of multiplying the number of e4amples a feature will classify correctly when
found gi-es the feature weight9 !-er sampling the positi-e e4amples in the training data
therefore has the effect of gi-ing weight to features contained in the target concept, but italso adds weight to random features which occur in the data that is being o-er3sampled9
The effect of o-er3sampling therefore has two competing factors9 The factors are&
!ne that adds weight to features containing the target concept9
!ne that adds weight to features notcontaining the target concept
The effect of features not rele-ant to the target concept being gi-en a disproportionate
weight can be seen for e4pressions of comple4ity =4@ and =4;69 This can be seen in lower
right hand corner of Table
-
8/12/2019 Andrew Thesis
76/125
sparse compared to the number of positi-e e4amples9 Bhen the positi-e data is o-er3
sampled, irrele-ant features are gi-en enough weight relati-e to the features containing the
target conceptN as a result the learning algorithm se-erely o-erfits the training data by
creating garbage rules that partition the data on features not containing the target concept,
but that appear in the positi-e e4amples9
;.- C)r)ct$r'tc' o4 t$ Do&)n )nd o t$ A44$ct t$ R$'u#t'
The characteristics of the artificial domain greatly affect the way in which rule sets are
created9 The ma%or determining factor in the creation of the rule sets is the fact that the
target concept is hidden in the underrepresented class and that the negati-e e4amples in the
domain ha-e no rele-ant features9 That is, the underrepresented class contains the target
concept and the o-er represented class contains e-erything else9 #n fact, if o-er3sampling is
used to balance the data sets, e4pressions of comple4ity =45 to =4? could still, on a-erage,
attain ;66U accuracy on the testing set, if only the positi-e rule sets were used to classify
e4amples with a default negati-e rule9 #n this respect, the artificial domain can be -iewed as
lending itself to bein