
Techniques for Improvement of Classification - Bagging & Boosting

One of the most important issues for classification processes is their accuracy. Sometimes a poor classifier is obtained and there is a great need for improvement. One frequent approach is to make use of the particularities of the application domain or of the available data. The most common situations in which poor classification occurs are when the classifier needs to categorize data among a large number of classes (e.g., over 200,000 categories) or when we have several poor classifiers.

The typical solution for obtaining an improvement in classification accuracy is to combine the outputs of the available poor classifiers into a single one. The combination provides better results in the situation when the misclassified instances are at least somewhat independent. The most common techniques that rely on this approach are known in the literature as bagging and boosting. Other similar techniques from the same area are voting and stacking. Ultimately, a hybrid automatic or manual solution may sometimes need to be used in order to obtain a sufficient increase in classification accuracy.

The common way to accomplish the above approach in the case of classification is to take a vote (perhaps a weighted vote) for each classification. Both bagging and boosting adopt this approach, but they derive the individual classifiers in different ways. In bagging, the involved classifiers receive equal weight, whereas in boosting, weighting is used to give more influence to the more successful classifiers. This situation occurs in day-to-day life when we receive advice from several people but the final decision is taken by placing a different weight on each piece of advice, depending upon how much trust we have in each person's opinion.
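To make the weighted-vote idea concrete, here is a minimal Python sketch; the class labels, the trust weights, and the helper name are illustrative assumptions rather than anything prescribed by the text.

from collections import defaultdict

def weighted_vote(predictions, weights):
    """Combine classifier outputs by a (possibly weighted) vote.

    predictions: list of class labels, one per classifier
    weights: list of non-negative trust weights, one per classifier
    Returns the class with the largest total weight.
    """
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)

# With equal weights (bagging-style) class "B" wins two votes to one;
# with unequal weights (boosting-style) the most trusted classifier prevails.
print(weighted_vote(["A", "B", "B"], [1.0, 1.0, 1.0]))  # -> B
print(weighted_vote(["A", "B", "B"], [5.0, 1.0, 1.0]))  # -> A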

Bagging

To introduce bagging, let us suppose that we have at our disposal several input datasets of similar size. The datasets may be small and were obtained from the same application domain. The first idea may be to put together all the available data and to build one classifier. This approach may lead to a classifier that does not have the required accuracy. Thus, a second approach is to build more classifiers, for instance one for each dataset. This approach may lead to classifiers that perform a little worse than the classifier obtained with the first approach. An interesting observation may be to see how these classifiers work on a test dataset. At first glance, our intuition may make us think that all the classifiers (we suppose that we have built one classifier for each input dataset) perform pretty much identically on a provided test dataset. If we are in this situation it means that all our classifiers are equivalent. Still, if this is not the situation (i.e., the classifiers have different opinions about items from the test dataset) it means that the bagging procedure may lead to better results.

Another important aspect regards the size of the test dataset. If the size of the test dataset is large and the available classifiers give different classifications, then bagging is clearly the right way to go. Intuitively, giving different answers is more likely to happen for small test datasets, and even in this situation bagging is still recommended.

Returning to the previously presented analogy with advice offered by several people, we can think of these people as distinct classifiers. In this setup, when none of them has enough accuracy, a combination among them is advisable. The trivial and first option is to apply a voting strategy. Each item in the test dataset will be placed in the class which receives the most votes from the existing classifiers. Basically, in this setup each classifier offers its opinion regarding the class of the item from the test dataset and the bagging algorithm chooses the class with the greatest number of votes. Prediction by voting proves in general to be quite reliable, especially when the classifiers have quite contradictory opinions regarding the class to which the test items belong. It is also true that the finally obtained classifier will lose or gain accuracy if new input datasets are used for adding more classifiers to the bagging scheme. Still, bagging does not always succeed in producing an increase in accuracy. Sometimes the combined decisions are worse, but it is the task of the data analyst and of the domain knowledge person to properly debug the machine learning system and to alter its design in such a way that better results are obtained.

From the bias-variance point of view, bagging does not alter the bias but decreases the variance. Since the bias measures the persistent error of the learning algorithm, it cannot be decreased by using a larger number of classifiers. On the other hand, the variance tends to decrease with the increase in the number of classifiers, because any of the input datasets is finite and therefore not fully representative of the actual input data.

The pseudocode of the bagging algorithm is presented below:

// IDS is the set of input datasets
// C is the set of classifiers
procedure generateClassifiersForBagging(InputDatasets IDS) return C
    // ids is an input dataset from the set IDS
    for each ids in IDS {
        c = buildClassifier(ids);   // c is the obtained classifier
        C.add(c);                   // add the obtained classifier c to the set of classifiers C
    }
    return C;

// I is the instance to be classified
// C is the set of classifiers
// da is the obtained class
procedure classifyInstance(Instance I, Classifiers C) return da
    for each classifier c from C {
        predict the class of instance I using classifier c;
    }
    return the class that has been predicted most often;
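As an illustration, the following is a minimal Python sketch of the same scheme. It assumes scikit-learn-style base classifiers (DecisionTreeClassifier is used here only as an example) and that each available input dataset is given as an (X, y) pair; these choices are assumptions made for the sketch, not part of the original pseudocode.

from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def generate_classifiers_for_bagging(datasets):
    """Build one classifier per available input dataset."""
    classifiers = []
    for X, y in datasets:                       # each dataset is an (X, y) pair
        c = DecisionTreeClassifier().fit(X, y)  # c is the obtained classifier
        classifiers.append(c)                   # add it to the set of classifiers
    return classifiers

def classify_instance(instance, classifiers):
    """Place the instance in the class voted for by most classifiers (equal weights)."""
    votes = [c.predict([instance])[0] for c in classifiers]
    return Counter(votes).most_common(1)[0][0]

As in the pseudocode, every classifier receives an equal vote; only the way the individual classifiers are obtained distinguishes this from the boosting scheme discussed next.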

Boosting

In the previous section we have seen that the instability provided by different datasets may be used by the bagging procedure in order to produce better classifiers. In conclusion, the above procedure is advisable only when the classifiers differ from one another and each of them treats in a correct fashion a reasonably high percentage of the items. In an ideal situation, each classifier should have high accuracy on a specific part of the data domain in which all other classifiers perform quite poorly. Referring to the advice seekers scenario, this situation is similar to the one in which somebody asks advice from persons that are experts in different areas of the same domain, areas which complement each other.

Boosting works in the same scenario as bagging, which implies combining classifiers that complement one another, using voting for reaching a final decision, and combining the same type of classifier (e.g., decision trees). The main difference from bagging is the iterative approach taken to the available classifiers. In bagging, each classifier is built completely independently of all other classifiers, possibly in a parallel fashion. On the other hand, boosting builds each new classifier by being influenced by the performance of the previously built classifiers. This approach makes the newly obtained classifier gather expertise precisely in the "areas" in which the former classifiers performed worst. Finally, boosting weights a classifier's contribution according to its performance, instead of assigning the same weight to all classifiers in a voting strategy. Of course, these are just the main guidelines that describe the boosting algorithm. Here is the general boosting algorithm:

// t is the maximum number of iterations
procedure buildClassifiersForBoosting(int t)
    # assign equal weight to each training instance
    for each of t iterations {
        # build a classifier using the weighted dataset
        # compute the error e of the classifier on the weighted dataset
        if (e = 0 or e >= 0.5)
            # stop building classifiers
        for each instance in the dataset
            if (instance classified correctly by the model)
                # multiply the weight of the instance by e / (1 - e)
        # normalize the weights of all instances
    } // end for

procedure classifyInstance
    # assign a weight of zero to all classes
    for each of the t (or fewer) classifiers
        # add -log(e / (1 - e)) to the weight of the class predicted by the model
    return the class with the largest weight
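The heart of the model-generation loop is the reweighting step. Below is a minimal Python sketch of that single step under the update rule quoted above; the function name and the representation of the classifier's output as a boolean list are assumptions made for illustration.

def reweight_instances(weights, correctly_classified, error):
    """One boosting reweighting step.

    weights: current instance weights
    correctly_classified: list of booleans, one per instance
    error: weighted error e of the current classifier, with 0 < e < 0.5
    """
    # decrease the weight of correctly classified instances by the factor e / (1 - e)
    factor = error / (1.0 - error)
    updated = [w * factor if ok else w
               for w, ok in zip(weights, correctly_classified)]
    # normalize so that the weights again sum to 1
    total = sum(updated)
    return [w / total for w in updated]

# With e = 0.25 the correctly classified instances are multiplied by 1/3, so after
# normalization the misclassified instance carries half of the total weight.
print(reweight_instances([0.25, 0.25, 0.25, 0.25], [True, True, True, False], 0.25))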

The boosting algorithm summarized above begins by assigning equal weight to all instances in the input dataset. This value is usually equal to 1/m, where m is the size of the input dataset. A classifier is built for this data and the weights of the instances are recomputed according to the classifier's output. In this way, the weight of the correctly classified instances is decreased and the weight of the incorrectly classified instances is increased. Thus, the instances with greater weight are the ones that are harder to classify and the ones with lower weight are the ones that are easier to classify. Once this difference among instances is transformed into smaller and larger weights, the next iteration will build a classifier that takes the reweighted data into consideration. The consequence is that it will focus on classifying the hard instances in a correct fashion. Again, the weights are recomputed and, as a consequence, some items will become harder to classify and other ones will become easier to classify. Intuitively, the algorithm keeps track of a measure of how hard (or easy) it is to classify each instance and gradually generates a set of classifiers that complement one another.

One issue regards the way in which the weights are altered. In general, the added or subtracted value depends upon the overall classifier's error. The generally accepted update rule for the correctly classified instances is

$$w \leftarrow w \times \frac{e}{1-e}$$

where e is the classifier's error on the weighted data and has a value between 0 and 1. At this moment the weights of the misclassified instances remain unchanged. Finally, all the weights are normalized such that the weights of the correctly classified instances are decreased and the weights of the incorrectly classified instances are increased. If the error on the weighted input dataset exceeds or equals 0.5, the boosting procedure stops building classifiers and discards the last classifier. The boosting procedure also stops if the error is 0, because classifiers that perform perfectly on the input dataset must be deleted. This happens because the weight that a classifier receives in the final vote,

$$-\log\frac{e}{1-e},$$

is a positive number between 0 and infinity but is not defined for e equal to 0. Finally, to make a prediction, the weights of all classifiers that vote for a particular class are summed, and the class with the greatest total is chosen. The normalization procedure is classical. Suppose you have a range or scale from A to B and you want to convert it to a scale from C to D, where A maps to C and B maps to D. Then the following (linear) equation can be applied to any number i on the A-B scale:

$$i_{new} = C + (i - A)\,\frac{D - C}{B - A}$$

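A one-line Python version of this rescaling (the function and parameter names are hypothetical, chosen only for the example) could look like:

def rescale(i, a, b, c, d):
    """Linearly map a value i from the scale [a, b] onto the scale [c, d]."""
    return c + (i - a) * (d - c) / (b - a)

print(rescale(0.5, 0.0, 1.0, 0.0, 10.0))  # 0.5 on a 0-1 scale becomes 5.0 on a 0-10 scale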

Around these core ideas many variants of boosting have been built, among which one of the most popular is AdaBoost, short for Adaptive Boosting. Here is the detailed AdaBoost algorithm:

procedure AdaBoost({(x_1, y_1), ..., (x_m, y_m)}) where x_i ∈ X, y_i ∈ {-1, +1}
    # initialize the weight of each tuple in D to 1/m
    D_1(i) = 1/m
    for t = 1, 2, ..., T
        # find the classifier h_t : X -> {-1, +1} that minimizes the error
        # with respect to the distribution D_t:
        h_t = argmin_{h_j ∈ H} e_j, where e_j = Σ_i D_t(i) · I[y_i ≠ h_j(x_i)]
              and I is the indicator function
        if (e_t >= 0.5) STOP
        else {
            # choose α_t ∈ R, typically α_t = (1/2) · ln((1 - e_t) / e_t),
            # where e_t is the weighted error rate of classifier h_t
            # update:
            D_{t+1}(i) = (D_t(i) / Z_t) · e^{-α_t}   if h_t(x_i) = y_i
            D_{t+1}(i) = (D_t(i) / Z_t) · e^{+α_t}   if h_t(x_i) ≠ y_i
                where Z_t is a normalization factor (chosen so that D_{t+1} is a distribution)
        }
    # output the final classifier:
    H(x) = sign(Σ_{t=1..T} α_t · h_t(x))
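To make the pseudocode concrete, here is a small NumPy sketch of the same loop. The use of one-feature decision stumps as base classifiers, the error clamp that avoids division by zero, and all function names are assumptions made for this example rather than details from the algorithm above.

import numpy as np

def stump_predict(X, feature, threshold, polarity):
    """A decision stump: predicts +1 on one side of the threshold and -1 on the other."""
    return polarity * np.where(X[:, feature] <= threshold, 1, -1)

def best_stump(X, y, D):
    """Find the stump h_t that minimizes the weighted error with respect to D."""
    best = None
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            for polarity in (1, -1):
                pred = stump_predict(X, feature, threshold, polarity)
                error = np.sum(D[y != pred])          # e_t = sum of weights of misclassified tuples
                if best is None or error < best[0]:
                    best = (error, feature, threshold, polarity)
    return best

def adaboost(X, y, T=10):
    m = len(y)
    D = np.full(m, 1.0 / m)                           # D_1(i) = 1/m
    ensemble = []                                     # list of (alpha_t, stump parameters)
    for _ in range(T):
        error, feature, threshold, polarity = best_stump(X, y, D)
        if error >= 0.5:                              # stop if the best classifier is too weak
            break
        error = max(error, 1e-10)                     # clamp to avoid division by zero when e_t = 0
        alpha = 0.5 * np.log((1 - error) / error)     # alpha_t = (1/2) ln((1 - e_t) / e_t)
        pred = stump_predict(X, feature, threshold, polarity)
        D = D * np.exp(-alpha * y * pred)             # e^{-alpha} if correct, e^{+alpha} if not
        D = D / D.sum()                               # Z_t: renormalize to a distribution
        ensemble.append((alpha, feature, threshold, polarity))
    return ensemble

def adaboost_predict(ensemble, X):
    """H(x) = sign(sum_t alpha_t * h_t(x))."""
    total = np.zeros(len(X))
    for alpha, feature, threshold, polarity in ensemble:
        total += alpha * stump_predict(X, feature, threshold, polarity)
    return np.sign(total)

# Tiny usage example with labels in {-1, +1}.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([1, 1, 1, -1, -1, -1])
model = adaboost(X, y, T=5)
print(adaboost_predict(model, X))                     # expected: [ 1.  1.  1. -1. -1. -1.]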

Detailed example of running AdaBoost

Suppose we have a set D of 10 students and we want to classify them according to the level of their knowledge: good students and weak students. For this, 3 classifiers denoted A, B and C are used.

Since m = 10 (we have 10 students), we first initialize the weight (error) of each tuple to 1/10. The initial weight (error) distribution is:

Student   Weight (error)
S1        0.1
S2        0.1
S3        0.1
S4        0.1
S5        0.1
S6        0.1
S7        0.1
S8        0.1
S9        0.1
S10       0.1

The initial classification realized by all the classifiers is represented below. A value of "1" means that the student is correctly classified, while a value of "-1" means that the student is misclassified.

Student   A (h_1)   B (h_2)   C (h_3)
S1         1         1         1
S2         1         1        -1
S3         1         1         1
S4         1        -1         1
S5        -1         1         1
S6        -1         1         1
S7        -1         1         1
S8         1         1        -1
S9         1        -1         1
S10       -1         1        -1

Iteration 1

Firstly, classifier h_1 (i.e., A) is taken into consideration. The above data shows that it manages to identify correctly only 6 out of the 10 students, therefore its error is 0.1 × 4 = 0.4.

The second classifier, h_2 (i.e., B), manages to identify correctly 8 out of the 10 students, therefore its error is 0.1 × 2 = 0.2.

The last classifier, h_3 (i.e., C), manages to identify correctly 7 out of the 10 students, therefore its error is 0.1 × 3 = 0.3.

The following table presents the error of each initial classifier.

Classifier   Error e_t

A 0.4

B 0.2

C 0.3

At this moment all the errors respect the condition that e_t < 0.5, so the algorithm can continue.

The general task is to find the classifier h_t that minimizes the error with respect to the distribution D. AdaBoost chooses the second classifier, h_2, because it has the smallest error.

Next, the parameter α_2 is computed as:

$$\alpha_t = \frac{1}{2}\ln\frac{1-e_t}{e_t} \;\Rightarrow\; \alpha_2 = \frac{1}{2}\ln\frac{1-0.2}{0.2} = \frac{1}{2}\ln 4 \approx 0.6931$$

Then we recalculate (update) the weight (error) of each student, obtaining the distribution D_2(i):

$$D_2(i) = \frac{D_1(i)}{Z_1} \times \begin{cases} e^{-\alpha_2} & \text{if } h_2(x_i) = y_i \\ e^{+\alpha_2} & \text{if } h_2(x_i) \neq y_i \end{cases}$$

In the formula above we choose the first branch if the student is correctly classified (represented by a value of 1), and the second branch if the student is misclassified (represented by a value of -1).

For the correctly classified students, the value 1 in the initial classification table corresponds to a new value of the weight (error) of 0.05, computed by:


$$D_2(i) = \frac{D_1(i)}{Z_1} \times e^{-\alpha_2} = 0.1 \times e^{-0.6931} = 0.05$$

For the incorrectly classified students, the value -1 in the initial classification table corresponds to a new value of the weight (error) of 0.199 (approximately 0.2), computed by:

$$D_2(i) = \frac{D_1(i)}{Z_1} \times e^{+\alpha_2} = 0.1 \times e^{0.6931} = 0.199$$

After applying the same formula to every student, we derive the new weights (errors) and, accordingly, their normalized values.

Student   Weight   Normalized
S1        0.05     0.05
S2        0.05     0.05
S3        0.05     0.05
S4        0.199    0.2
S5        0.05     0.05
S6        0.05     0.05
S7        0.05     0.05
S8        0.05     0.05
S9        0.199    0.2
S10       0.05     0.05

The normalized weights are computed by using the previously presented method, so that D becomes a distribution. The result is displayed in the third column of the table above. We observe that the weight (error) of the misclassified students increases while the weight (error) of the correctly classified students decreases.

Having the normalized weights computed, we are ready to update the error of each classifier by adding up the already normalized weights (errors), taking into account only the incorrectly classified students from the initial student classification:

For example, the first classifier, A, misclassified 4 students (S5, S6, S7 and S10), so its error will be e_t = 0.05 × 4 = 0.2.

The second classifier, B, misclassified 2 students (S4 and S9), so its error will be e_t = 0.2 × 2 = 0.4.

Finally, the last classifier, C, misclassified 3 students (S3, S8 and S10), therefore its error will be e_t = 0.05 × 3 = 0.15.

The recomputed error for each classifier is:

Classifier   Initial error e_t   Error e_t after step 1
A            0.4                 0.2
B            0.2                 0.4
C            0.3                 0.15
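As a check, the recomputation above can be reproduced with a few lines of Python; the dictionaries below simply encode the normalized weights and the misclassification lists from the tables (the variable names are chosen only for this example).

# normalized weights after iteration 1 (from the table above)
weights = {"S4": 0.2, "S9": 0.2}  # the students misclassified by h_2
weights.update({s: 0.05 for s in ["S1", "S2", "S3", "S5", "S6", "S7", "S8", "S10"]})

# students misclassified by each classifier in the initial classification
misclassified = {
    "A": ["S5", "S6", "S7", "S10"],
    "B": ["S4", "S9"],
    "C": ["S3", "S8", "S10"],
}

# the updated error of a classifier is the sum of the weights of the students it misclassifies
for name, students in misclassified.items():
    print(name, round(sum(weights[s] for s in students), 2))
# prints: A 0.2, B 0.4, C 0.15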
