[ieee 2013 ieee applied imagery pattern recognition workshop: sensing for control and augmentation...

Combining the Advice of Experts with RandomizedBoosting for Robust Pattern Recognition

Jing PengComputer Science Department

Montclair State UniversityMontclair, NJ 07043

Email: [email protected]

Guna SeetharamanInformation Directorate

AFRL/RITBRome, NY 13441

Email: [email protected]

Abstract—We have developed an algorithm, called ShareBoost,for combining mulitple classifiers from multiple informationsources. The algorithm offer a number of advantages, such asincreased confidence in decision-making, resulting from combinedcomplementary data, good performance against noise, and theability to exploit interplay between sensor subspaces. We have alsodeveloped a randomized version of ShareBoost, called rShare-Boost, by casting ShareBoost within an adversarial multi-armedbandit framework. This in turn allows us to show rShareBoostis efficient and convergent. Both algorithms have shown promisein a number of applications.

The hallmark of these algorithms is a set of strategies formining and exploiting the most informative sensor sources for agiven situation. These strategies are computations performed bythe algorithms. In this paper, we propose to consider strategiesas advice given to an algorithm by “experts” or “Oracle.” Inthe context of pattern recognition, there can be several patternrecognition strategies. Each strategy makes different assumptionsregarding the fidelity of each sensor source and uses differentdata to arrive at its estimates. Each strategy may place differenttrust in a sensor at different times, and each may be betterin different situations. In this paper, we introduce a novelalgorithm for combining the advice of the experts to achieverobust pattern recognition performance. We show that with highprobability the algorithm seeks out the advice of the expertsfrom decision relevant information sources for making optimalprediction. Finally, we provide experimental results using face andinfrared image data that corroborate our theoretical analysis.

I. INTRODUCTION

Classifiers must deal with various adversities such asnoise in sensors and intra-class variations [1]. Thus, it isuseful to develop classifiers taking input from various sources(views) for pattern recognition. It requires an effective wayof combining the various sources of information. Resultingclassifiers can offer a number of advantages, such as in-creased confidence in decision-making, robust performanceagainst noise, and improved performance in adverse externalconditions. For example, smoke or fog can cause bad visiblecontrast and various weather conditions can cause low thermalcontrast (for infrared imaging); however, combining visibleand infrared band sensors gives rise to significantly betteroverall performance. Data fusion finds its applications in manydomains such as defense, robotics, medicine, sciences andSpace [2], [3].

We have developed an algorithm, called ShareBoost,for combining mulitple classifiers from multiple information

sources. The algorithm offer a number of advantages, such asincreased confidence in decision-making, resulting from com-bined complementary data, good performance against noise,and the ability to exploit interplay between sensor subspaces.We have also developed a randomized version of ShareBoost,called rShareBoost, by casting ShareBoost within an adversar-ial multi-armed bandit framework. This in turn allows us toshow rShareBoost is efficient and convergent. Both algorithmshave shown promise in a number of applications.

The hallmark of these algorithms is a set of strategies formining and exploiting the most informative sensor sources fora given situation. These strategies are computations performedby the algorithms. In this paper, we propose to considerstrategies as advice given to an algorithm by “experts” or“Oracle.” In the context of pattern recognition, there can beseveral pattern recognition strategies. Each strategy makesdifferent assumptions regarding the fidelity of each sensorsource and uses different data to arrive at its estimates. Eachstrategy may place different trust in a sensor at different times,and each may be better in different situations. In this paper,we introduce a novel algorithm for combining the advice ofthe experts to achieve robust pattern recognition performance.We show that with high probability the algorithm seeks outthe advice of the experts from decision relevant informationsources for making optimal predictions. Finally, we provideexperimental results using face and infrared image data thatcorroborate our theoretical analysis.

The rest of the paper is organized as follows. SectionII discusses related work. Section III describes the proposedShareBoost algorithm in detail. Section IV introduces therandomized version of ShareBoost that addresses potentialproblems facing ShareBoost. Section V introduces the pro-posed algorithm for combining the advice of experts, calledeShareBoost. Section VI presents experimental evaluation,followed by discussions in Section VII. Finally, Section VIIIsummarizes our contributions and points out future researchdirections.

II. RELATED WORK

In multi-view learning, a co-training procedure for classi-fication problems was developed [4]. The idea is that betterclassifiers can be learned at the individual view level, ratherthan constructed directly on all the available views. Co-traininghas been extensively investigated in the context of semi-supervised learning [5], [6], [7]. In this work, we are mainly

interested in creating classifiers that fuse information frommultiple views for better generalization.

Comprehensive surveys of various classifier fusion studiesand approaches can be found in [8], [9]. More recently,Lanckriet et al. [3] introduce a kernel-based data fusion (multi-view learning) approach to protein function prediction in yeast.The method combines multiple kernel representations in anoptimal fashion by formulating the problem as a convexoptimization problem that can be solved using semi-definiteprogramming.

In [10] stacked generalization from multiple views wasproposed. It is a general technique for construction of multi-level learning systems. In the context of multi-view learning,it yields unbiased, full-size training sets for the trainablecombiner. In some cases stacked generalization is equivalent tocross-validation, in other cases it is equivalent to forming a lin-ear combination of the classification results of the constituentclassifiers. In [11], a local learning technique was proposedthat combines multi-view information for better classification.

Boosting has been investigated in multi-view learningrecently [12]. In particular, there is a close relationship betweenour technique and that proposed in [12]. If we have a singleview and base classifiers are allowed to include features aswell, then both techniques reduce to AdaBoost. When noiseexists, however, the two techniques diverge. The technique in[12] behaves exactly like AdaBoost. Noise forces the boostingalgorithm to focus on noisy examples, thereby distorting theoptimal decision boundary. On the other hand, our approachrestricts noise to individual views, which has a similar effectto that of placing less mass of sampling probability on thesenoisy examples. This is the key difference between the twotechniques.

Considerable research in the pattern recognition field isfocused on fusion rules that aggregate the outputs of the firstlevel experts and make a final decision. Various techniques forfusion of expert observations such as linear weighted voting,the Naive Bayes classifiers, the kernel function approach, po-tential functions, decision trees or multilayer perceptrons havebeen proposed in recent years [13], [14], [9]. Other approachesare based on bagging, boosting, and arching classifiers [15],[16], [17]. Comprehensive surveys of various classifier fusionstudies and approaches can be found in [9], [8], [18], [19].

In [19] various classifier fusion strategies such as min-imum, maximum, average, majority vote and oracle are dis-cussed and empirical results are compared. Kuncheva et al.[20] discuss the effect of dependence between individualclassifiers in classifier fusion. They study the limits on majorityvote accuracy when combining dependent classifiers. A Qstatistic based measure has been proposed to quantify thedependence between the classifiers. It is shown that dependentclassifiers can offer a dramatic improvement over individualclassifier’s accuracy. A synthetic data experiment demonstratesthe intuitive result that, in general, negative dependence ispreferable.

Sonnenburg et al. considered in [21] a similar framework asthat of Lankriet [22] for multiple kernel learning (MKL). Un-like Lankriet, Sonnenburg et al. followed a different direction,by reformulating the MKL problem as a semi-infinite linearprogram, which could be efficiently solved using an LP solver

and an SVMs implementation. Furthermore, they generalizedthe MKL formulation and their algorithm to a larger class ofproblems, including regression and one-class classification.

A large number of AdaBoost like algorithms has been de-veloped, including LPBoost [23], MadaBoost [24], AdaBoostwith soft margins [25], SmoothBoost [26], and others [27].All these AdaBoost variants registered superior performanceon noisy data over AdaBoost. The hallmark of these variantsis that they gain an overall large margin at the expense ofmargin errors. They do so by placing less mass of re-samplingprobability on difficult examples. By restricting noisy, thus“difficult,” examples to individual views, the mass of samplingprobability on these examples will be restricted as well in ourtechnique. This is possible because probability mass will bedetermined by those views having less noise.

While these AdaBoost variants were developed to addressthe sensitivity of AdaBoost to noise, our technique is devel-oped to combine multiple sources of information for achievingbetter prediction, whose robustness against noise is a resultof the proposed shared sampling strategy. That is, while themotivations might have been different, they share the samebenefit.

Multi-armed bandits have been studied in a number ofapplications [28], [29]. We state that multi-armed banditsdescribed in [30] is stochastic by nature. There are manyapplications where the stochastic setting can be applied tononstationary environments, such as performance tuning prob-lems [31] and the SAT problem [32]. Algorithms such as UCB[28] and UCBV [33] work well for making AdaBoost moreefficient. Given that AdaBoost is adversarial by nature, it isdifficult to use stochastic bandits to derive strong performanceguarantees on AdaBoost. Many arguments made in [30] remainheuristic to an extent. However, this has been addressed in [34].

III. SHAREBOOST ALGORITHM

In this section, we first describe the shared sampling(ShareBoost) algorithm. Given a set of training examples rep-resented by M views, the ShareBoost algorithm builds weakclassifiers independently from each view (feature source).However, all data types share the same sampling distributioncomputed from the view having the smallest error rate. Thekey steps of the algorithm are shown in Algorithm III, whereI(·) is the indicator function.

ShareBoost ({(xji , yi)}ni=1)

1) Initialization: w1(i) =1n , i = 1, · · · , n.

2) For t = 1 to T

a) Compute base classifier hjt using distribution

wt

b) Calculate: εjt =∑

iwt(i)I(hjt (x

ji ) �= yi) and

α∗t = 1

2 ln(1−ε∗tε∗t

), where ε∗t = minj{εjt} withcorresponding h∗

t

c) Update wt+1(i) = wt(i)Z∗

t×

exp(−yih∗t (x

∗i )α

∗t ), where Z∗

t is anormalization factor.

3) Output: H(x) = sign(∑T

t=1 α∗th

∗t (x))

Input to the ShareBoost algorithm is the jth view of n train-ing examples. The algorithm produces as output a classifier thatcombines data from all the views. In the initialization step, allthe views for a given training example are initialized with thesame weight. The final decision function H(x) is computedas a weighted sum of base classifiers h∗

t (x∗), selected at each

iteration from the views that had the smallest training error orlargest α value. In this sense, ShareBoost possesses the abilityto decide at each iteration which view to influence its finaldecision. This ability goes beyond simple subspace selection. Itempowers ShareBoost not only to exploit the interplay betweensubspaces, but also to be more robust against noise.

The final decision function F (x) is computed at step 4 ofAlgorithm 1 as a weighted sum of weak hypothesis h∗

k(x∗),

selected at each iteration from the views that had the smallesttraining error ε∗k. Given a testing example x, we use the generalnotation x∗ at step 4 to indicate the view “*” of x alongwhich the weak hypothesis h∗

k was selected. Suppose data arecaptured by Radar, IR and Visible sensors. The dimensions ofRadar, IR and Visible views are not necessarily the same. Onthe other hand, the numbers of dimensions of the testing vectorx is the sum of the dimensions of the Radar, IR and Visiblerepresentations of x. Hence the weak hypothesis correspondingto each view is trained on the data with fewer dimensionsthan testing vector x. At each iteration k, the winning weakhypothesis h∗

k employed at step 4 of Algorithm 1 providesthe classification result on the corresponding view x∗ of thetesting vector x.

Notice that our algorithm works in multimodal fusionwhere data types might not be compatible or of fixed sizevectors. We only require that the number of training examplesfrom each modality be the same. We also note that sincewe are mainly interested in asymptotic margins, we are lessconcerned with distorted class probabilities associated withboosting predictions [35], especially in the two class case.

IV. RANDOMIZED SHAREBOOST

The ShareBoost algorithm described above is greedy in thatresampling weights for all views are determined solely by thewinning view. That is, it employs a winner take-all strategy.One of the benefits associated with this algorithm is that noisewill be restricted to individual views. In other words, noisewill be compartmentalized, which has a similar effect to that ofplacing less mass of sampling probability on noisy examples.This, however, needs not to be the case in approaches such asthose described in [12].

A potential drawback of this “impatience” or greedy strat-egy is that examples may not have enouogh opportunities to“express” themselves, resulting in a sub-optimal solution. Wetherefore appeal to a randomized version of the ShareBoostalgorithm, where a winning view is determined probabilisti-cally. This “patient” strategy of search-then-converge allowsrandomized ShareBoost to sufficiently explore the solutionspace, resulting in more robust performance, as we shall seelater in the experimental section.

From the viewpoint of convergence analysis, the random-ized version of the ShareBoost algorithm can be cast withina multi-armed bandit framework [29]. This in turn allows usto show that with high probability the algorithm chooses a

set of best (large edges to be detailed later) views for makingpredictions.

We first specify a reward function for each informationsource. We define the training error

Err =1

n|{i : H(xi) �= yi}|. (1)

If we write

EH(H,W1) =

n∑i=1

w1(i)exp(−H(xi)yi), (2)

then EH(H,W1) upper bounds Err [36]. Furthermore, letEh(h,Wt) =

∑ni=1 wt(i)exp(−h(xi)yi). It can be shown that

EH(H,W1) =T∏

t=1

Eh(ht,Wt). (3)

It is shown that at each boosting round, the base learner triesto find a weak classifier ht that minimizes Eh(h,Wt) =∑n

i=1 wt(i)exp(−h(xi)yi) (Algorithm 1). Thus, minimizingEh(h,Wt) at each boosting round minimizes the training errorEq. (1) in an iterative greedy fashion.

Now let

βt =∑i

wi(t)yiht(xi) = Ei∼W (t)[yiht(xi)], (4)

be the edge [37] of the base hypothesis ht chosen by thebase learner at time step t. Here the edge helps define rewardfunctions in the proposed algorithm.

One can show that [36]

Eh(h,W ) =√

1− β2t . (5)

This implies that the training error of the final classifier isat most

∏Tt=1

√1− β2

t . This upper bound suggests severalpossible reward functions. For example, we can define thereward function as

rt(j) = 1−√1− β2

t (Vj), (6)

where βt(Vj) is the edge (4) of the classifier chosen by thebase learner from source Vj at the tth boosting round. Sinceβ2t (Vj) ∈ [0, 1], this reward is between 0 and 1.

It is important to notice that a reward function in logarithmis proposed in [34] that restricts values the edge (4) can take. Incontrast, our reward function does not have such a restriction.

The algorithm, called Randomized ShareBoost or rShare-Boost, is shown in Algorithm 2. Here, wt denotes the distribu-tion for sampling examples that is shared by all sensors, whiledt represents the weight for determining the distribution forsampling views.

rShareBoost seems to have departed significantly from theShareBoost sampling algorithm. In ShareBoost, boosting isexecuted in parallel by all the information sources. In contrast,rShareBoost performs boosting along the view chosen by thebandit algorithm only. From the viewpoint of computation,rShareBoost is much more efficient, i.e., it is a fraction (1/M )of the time required for ShareBoost. In addition, ShareBoostis greedy, while rShareBoost is not. That is, the probability

rShareBoost (σ > 0, γ ∈ (0, 1], {(xi, yi)}ni=1)1) w1(i) =

1n , i = 1, · · · , n. d1(j) = exp(σγ3

√T/M),

j = 1, · · · ,M .2) For t = 1 to T

a) pt(j) = (1 − γ) dt(j)∑M

k=1dt(k)

+ γM

b) Let j be the view chosen using ptc) Obtain base classifier hj

t using distributionwt.

d) Calculate: εjt =∑

iwt(i)I(hjt (x

ji ) �= yi).

and rt(j) ∈ [0, 1] (6).e) For k = 1, · · · ,M set

i) rt(k) = rt(k)/pt(k) (k = j), and 0(k �= j)

ii) dt+1(k) = dt(k)eγ

3M (rt(k)+σ

pt(k)√

MT)

f) Let α∗t = 1


), where ε∗t = εjt , h∗t =

hjt

g) Update wt+1(i) = wt(i)Zt

×exp(−yih

∗t (x

∗i )α

∗t ), where Zt is a

normalization factor.3) Output: H(x) = sign(

∑Tt=1 α

∗th

∗t (x))

distribution for sampling training examples for all views isdetermined solely by the winning source. In rShareBoost,however, the information source selected may not be thewinning view. This provides rShareBoost with an opportunityto examine potential information sources that may prove to beuseful.

V. COMBINING THE ADVICE OF EXPERTS

The hallmark of the ShareBoost and rShareBoost algo-rithms is a set of strategies for mining and exploiting themost informative sensor sources for a given situation. Thesestrategies are computations performed by the algorithms. Wepropose to consider strategies as advice given to an algorithmby “experts” or “Oracle.” In the context of pattern recognition,there can be several pattern recognition strategies. Each strat-egy makes different assumptions regarding the fidelity of eachsensor source and uses different data to arrive at its estimates.Each strategy may place different trust in a sensor at differenttimes, and each may be better in different situations. In thispaper, we introduce an algorithm for combining the adviceof the experts in the framework of adversarial multi-armedbandits to achieve better pattern recognition performance [28],[29].

A. Adversarial Multi-Armed Bandit Approach

In the multi-armed bandit problem [38], a gambler choosesone of M slot machines to play. Formally, a player algorithmpulls one out of M arms at each time t. Pulling an armjt at time t results in a reward rjt(t) ∈ [0, 1], from astationary distribution. The goal of the player algorithm isto maximize the expected sum of the rewards over the pulls.More precisely, let GA(T ) =

∑Tt=1 rjt(t) be the total reward

that algorithm A receives over T pulls. Then the performanceof algorithm A can be evaluated in terms of regret withrespect to the average return of the optimal strategy (pulling

consistently the best arm) Reg = GO − GA(T ), whereGO =

∑Tt=1 maxi∈{1,···,M} R(it). Here R(i) represents the

expected return of the ith arm.

Notice that in this setup, no statistical assumptions aremade about the generation of rewards. Only the reward rjtof the chosen arm jt is revealed to the player algorithm. Sincethe rewards are not drawn from a stationary distribution, anykind of regret can only be defined with respect to a particularsequence of actions. One such regret is the worst case regretG(j1,···,jT )−GA(T ), where G(j1,···,jT ) =

∑Tj=1 rjt . Thus, the

worst case regret measures how much the player algorithmlost (or gained) by following algorithm A instead of choosingactions (ii, · · · , iT ).

A special case of this is the regret of strategy A for thebest single action

RegA(T ) = Gmax(T )−GA(T ) (7)

where Gmax(T ) = maxi∑T

t=1 ri(t). That is, strategy A iscompared to the best fixed arm, retrospectively. Notice thatwhen player algorithm A that achieves limT→∞

RegA(T )T ≤ 0

is called a no-regret algorithm.

B. Exp4: Combining Advice of Experts

The adversarial multi-armed bandit problem can be treatedwithin the class of Exponentially Weighted Average Forecasteralgorithms [39]. Typically these algorithms maintain a proba-bility distribution over the arms and draws a random arm fromthis distribution at each step. The probability for pulling an armincreases exponentially with the average of past rewards thearm receives. In particular, we chose the Exp4 algorithm forcombining the advice of experts [29], because the particularform of the probability bound on the weak regret (7).

Exp4 (σ > 0,γ ∈ (0, 1])Gi(0) = 0 for i = 1, · · · ,MFor t = 1, 2, · · · , T

1) Let pi(t) = exp(σGi(t−1))/∑M

j=1 exp(σGj(t−1))

2) Obtain advice πj(t) ∈ [0, 1]M , and set p(t) =∑Nj=1 qj(t)π

j(t)3) Choose it to be j randomly according to pj(t) =

(1 − γ)pj(t) + γ/M4) Receive reward rit(t) ∈ [0, 1]

5) Let xj(t) =

{rit(t)/pit(t) if j = it;0 otherwise.

6) Let yj(t) = πj(t) · x(t)7) Gi(t) = Gi(t− 1) + y(t)

In the Exp4 algorithm, the probability distribution forchoosing arms is a mixture (weighted by γ) of the uniformdistribution and a distribution that allocates a probability massexponential in the estimated cumulative reward to each arm.This mixture ensures that the algorithm tries out all M arms.When arm it is selected (line 2(d)(1)), the estimated reward xitfor the arm is set to rit(i)/pit(t). This choice compensates thereward of arms that are unlikely to be chosen. For the purposeof our analysis, we state the following theorem (Theorem 8.1in [29]).

Theorem 1: For σ > 0 and γ ∈ (0, ], and for any family ofexperts which inlucdes the uniform expert, the expected gainof the Exp4 algorithm is at least

E[GExp4] ≥ EGmax−(γ+MΦM/γ(σ)

σ)EGmax−1− γ

σlnN.

It can be seen that σ and γ are “smoothing” parameters:the larger they are, the more uniform the probability distri-bution for choosing arms pt. In addition, Exp4 is a no-regretalgorithm with probability 1 [29].

C. eShareBoost: Combining rShareBoost and Exp4

We now describe our algorithm for combining the adviceof experts. It is a combination of rShareBoost and Exp4.

eShareBoost (σ > 0, γ ∈ (0, 1], {(xi, yi)}ni=1)Gi(0) = 0 for i = 1, · · · ,Mw1(i) =

1n , i = 1, · · · , n

For t = 1 to T

1) Let qk(t) = exp(σGk(t − 1))/∑M

j=1 exp(σGj(t −1)), k = 1, · · · , N

2) Obtain advice πk(t) ∈ [0, 1]M , and set p(t) =∑Nk=1 qk(t)π

k(t)3) Choose view it to be j randomly according to pj(t) =

(1− γ)pj(t) + γ/M4) Obtain base classifier hj

t using distribution wt.5) Calculate: εjt =

∑ni wt(i)I(h

jt (x

ji ) �= yi). and

rit(t) ∈ [0, 1] (6).

6) Let xj(t) =

{rit(t)/pit(t) if j = it;0 otherwise.

7) Let yj(t) = πj(t) · x(t)8) Gi(t) = Gi(t− 1) + y(t)

9) Let α∗t = 1


), where ε∗t = εjt , h∗t = hj

t

10) Update wt+1(i) =wt(i)Zt

×exp(−yih∗t (x

∗i )α

∗t ), where

Zt is a normalization factor.Output: H(x) = sign(

∑Tt=1 α

∗th

∗t (x))

The combined algorithm, called Randomized ShareBoostor rShareBoost, is shown in Algorithm 4. Here, wt denotesthe distribution for sampling examples that is shared by allsensors.

VI. EXPERIMENTS

A. Competing Methods

We have carried out empirical study evaluating the per-formance of the proposed algorithm. As comparison, thefollowing methods are evaluated.

1) InVar–eShareBoost combined with the inverse vari-ance expert, where the expert selects a view (rep-resentation) according to the probability that is in-versely proportional to variance. Specifically, let fcbe the consensus amoung the views. For example, fcmay be the average prediction

fc(x) =1

M

M∑i=1

fj(xj)

where fj represents the prediction of the jth view.The variance for the jth view can be estimated

varj =1

n

n∑i=1

(fj(xji )− fc(xi))

2.

Thus, the InVar expert selects the jth view withprobability

πj = (1/varj)/

M∑i=1

(1/vari).

2) UniformE–eShareBoost combined with the uniformexpert, where the expert selects each view with equalprobability.

The number of base classifiers for both algorithms is 50.

B. Data Sets

The following data sets are used to evaluate the per-formance of the proposed techniques against the competingmethods.

• The Gender data consist of the FERET facial imagesof 101 male and female subjects. Each subject hasthree poses or views: (1) frontal pose, 2) half left, and3) half right profiles. Sample images are shown in Fig.1. each view has 101 dimensions after applying PCA.The task is to determine the gender of a given subject.

Fig. 1. Sample FERET images

• The IR data set consists of a sequence of 289 framesof two moving objects: T72 tank and an armouredpersonnel carrier (APC). The task is to classify target(T72) against clutter (APC). Figure 2 shows sampleframes of the IR sequence, where the APC is on theleft, and the T72 is on the right. One of the challengesthe IR data present is the changing number of pixelson the moving objects. Specifically, the number ofpixels on the objects varies from 20 to over 300.To address this problem, object chips are extractedand resized to have 300 pixels. For each object chip,three representations (views) are constructed. (1) pixelintensity, (2) histograms, and (3) Canny edges. Foreach representation, PCA is applied to compute the

principal component space that captures 95% variancein the data, resulting in a 83 dimensional space forthe intensity view, a 41 dimensional space for thehistogram view, and a 100 dimensional space for theedge view.

20 40 60 80 100 120

20

40

60

80

100

120

20 40 60 80 100 120

20

40

60

80

100

120

20 40 60 80 100 120

20

40

60

80

100

120

20 40 60 80 100 120

20

40

60

80

100

120

20 40 60 80 100 120

20

40

60

80

100

120

20 40 60 80 100 120

20

40

60

80

100

120

20 40 60 80 100 120

20

40

60

80

100

120

20 40 60 80 100 120

20

40

60

80

100

120

20 40 60 80 100 120

20

40

60

80

100

120

20 40 60 80 100 120

20

40

60

80

100

120

20 40 60 80 100 120

20

40

60

80

100

120

20 40 60 80 100 120

20

40

60

80

100

120

Fig. 2. Sample IR data.

For all the experiments reported here, we randomly splitthe data into 60% as training and remaining 40% as testing.This process is repeated 20 times and the average results over20 runs are reported. In the experiments, σ and γ were set to0.15 and 0.3, respectively, as suggested in [34].

C. Experimental Results

Table I shows the average accuracy over 20 runs bythe competing methods. The two methods registered similarperformance.

TABLE I. AVERAGE ACCURACY–NOISE FREE

Gender IRInVar 0.92 0.94Uniform 0.91 0.94

Robustness against noise is a key feature for any data fusionalgorithm that must operate across the full range of conditionsand scenarios the algorithm is anticipated to encounter. Oneway to generate noise is by flipping class labels. Flipping labelsto generate noise produces similar effect to that produced bypoor representations. Label noise poses challenges due to thefact that label noise will most likely cause any two classes tooverlap (e.g., non zero Bayesian error)

Table II shows the average accuracy over 20 runs by thecompeting methods, where we added noise (70%) to the class

TABLE II. AVERAGE ACCURACY–ONE VIEW NOISE (70%)


label of the training data a randomly chosen view by “flipping”the label from one class to another. Note that label noise seemsto create harder problems. It most likely creates problems withoverlapping classes, while feature noise (white noise) may not.

The results show that InVar outperformed Uniform on bothdata sets Similar performace (Table III) was observed whentwo randomly chosen views were added noise 70% and 30%,respectively.

TABLE III. AVERAGE ACCURACY–TWO VIEW NOISE (70% AND 30%)


VII. DISCUSSION

Compared to the uniform expert, the inverse variance ex-pert registered better performance across problems, especiallyagainst noise. This strategy is consistent with strategies studiedin Bayesian Co-training [40], where views whose predictiondeviates the most from the consensus should contribute less tothe overall prediction.

There are two procedural parameters input to eShareBoost(Algorithm V-C): σ and γ. They were set to 0.15 and 0.3, re-spectively, throughout the experiment without expensive cross-validation to determine their values. This shows the advantageof the proposed technique.

Note that in the experiments, decision trees are taken as thebase learner. Compared to other base learners such as NaiveBayes, decision trees or stumps are known to better exploitfeatures, resulting in a significantly better baseline, hence abetter classifier.

VIII. SUMMARY

We have developed the eShareBoost algorithm for com-bining the advice of experts for pattern recognition in theframework of adversarial multi-armed bandits. We have alsointroduced a variance inverse expert for selecting views. Theexperimental results show that the variance inverse expert ismost robust against noise in the problems we have experi-mented with. Our future work includes investigating differentexpert strategies for robust pattern recognition.

REFERENCES

[1] A. Ross and A. K. Jain, “Multimodal biometrics: an overview,” Proceed-ings of 12th European Signal Processing Conference, pp. 1221–1224,2004.

[2] J. Gao, W. Fan, Y. Sun, and J. Han, “Heterogeneous source consen-sus learningvia decision propagation and negotiation,” in 15th ACMSIGKDD Conference on Knowledge Discovery and Data Mining, 2009.

[3] G. R. G. Lanckriet, M. H. Deng, N. Cristianini, M. I. Jordan, andW. S. Noble, “Kernel-based data fusion and its application to proteinfunction prediction in yeast,” Proceedings of the Pacific Symposium onBiocomputing, vol. 9, pp. 300–311, 2004.

[4] A. Blum and T. Mitchell, “Combining labeled and unlabeled data withco-training,” in In Proceedings of the Eleventh Annual Conference inComputational Learning Theory.

[5] W. Wang and Z. hua Zhou, “On multi-view active learning and thecombination with semi-supervised learning,” in Proceedings of the 25thinternational conference on Machine learning, 2008.

[6] ——, “A new analysis of co-training,” in Proceedings of the 25thinternational conference on Machine learning, 2010.

[7] M. Culp, G. Michailidis, and K. Johnson, “Tri-training: Exploitingunlabeled data using three classifiers,” IEEE Transactions on Knowledgeand Data Engineering, vol. 17, no. 11, pp. 1529–1541, 2005.

[8] J. Kittler, “Combining classifiers: A theoretical framework,” PatternAnalysis and Applications, vol. 1, pp. 18–27, 1998.

[9] L. I. Kuncheva, J. C. Bezdek, and R. P. W. Duin, “Decision templatesfor multiple classifier fusion: An experimental comparison,” PatternRecognition, vol. 34, pp. 299–314, 2001.

[10] D. H. Wolpert, “Stacked generalization,” Neural Networks, vol. 5, pp.241–259, 1992.

[11] D. Zhang, F. Wang, C. Zhang, and T. Li, “Multi-view local learning,”Proceedings of the 23rd National Conference on Artificial Intelligence(AAAI), pp. 752–757, 2008.

[12] P. Viola and M. Jones, “Fast and robust classification using asymmetricadaboost and a detector cascade,” Advances in Neural InformationProcessing Systems, vol. 14, 2002.

[13] S. Hashem, “Optimal linear combination of neural networks,” NeuralNetworks, vol. 19, pp. 599–614, 1997.

[14] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, “On combining classi-fiers,” IEEE Transactions on Pattern Analysis and Machine Intelligence,vol. 20, pp. 226–239, 1998.

[15] Y. Freund and R. E. Schapire, “A decision-theoretic generalization ofon-line learning and an application to boosting,” Journal of Computerand Systems Science, vol. 55, pp. 119–139, 1997.

[16] L. Breiman, “Bagging predictors,” Machine Learning Journal, vol. 24,pp. 123–140, 1996.

[17] ——, “Arching classifiers,” Annals of Statistics, vol. 26, pp. 801–849,1998.

[18] J. Kittler, “A framework for classifier fusion: Is still needed?” LectureNotes in Computer Science, vol. 1876, pp. 45–56, 2000.

[19] L. I. Kuncheva, “A theoretical study on six classifier fusion strategies,”IEEE Transactions on Pattern Analysis and Machine Intelligence,vol. 24, no. 2, pp. 281–286, 1997.

[20] L. I. Kuncheva, C. J. Whitaker, C. A. Shipp, and R. P. W. Duin, “Isindependence good for combining classifiers?” Proceedings of the 15thInternational Conference on Pattern Recognition, vol. 2, pp. 168–171,2000.

[21] S. Sonnenburg, G. Ratsch, C. Schafer, and B. Scholkopf, “Largescale multiple kernel learning,” Journal of Machine Learning Research,vol. 7, pp. 1531–1565, 2006.

[22] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. E. Ghaoui, and M. I.Jordan, “Learning the kernel matrix with semi-definite programming,”Journal of Machine learning Research, vol. 5, pp. 27–72, 2004.

[23] A. Demiriz, K. Bennett, and J. Shawe-Taylor, “Linear programmingboosting via column generation,” Machine Learning, vol. 46, no. 1-3,pp. 225–254, 2002.

[24] C. Domingo and O. Watanabe, “Madaboost: A modification of ad-aboost,” in Proc. COLT, 2000, pp. 180–189.

[25] G. Ratsch, T. Onoda, and K.-R. Muller, “Soft margins for adaboost,”Machine Learning, vol. 42, no. 3, pp. 287–320, 2001.

[26] R. A. Servedio, “Smooth boosting and learning with malicious noise,”Journal of Machine Learning Research, vol. 4, pp. 1557–1595, 2003.

[27] R. Jin, Y. Liu, L. Si, J. Carbonell, and A. G. Hauptmann, “A newboosting algorithm using input-dependent regularizer,” Proceedings ofthe Twentieth International Conference on Machine Learning, 2003.

[28] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time analysis of themultiarmed bandit problem,” Machine Learning, vol. 47, pp. 235–256,2002.

[29] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire, “The non-stochastic multi-armed bandit problem,” SIAM Journal on Computing,vol. 32, no. 1, pp. 48–77, 2002.

[30] R. Busa-Fekete and B. Kegl, “Accelerating adaboost using ucb,” inKDDCup (JMLR W&CP), 2009, pp. 111–122.

[31] F. D. Mesmay, A. Rimmel, Y. Voronenko, and M. Puschel, “Bandit-based optimization on graphs with application to library performancetuning,” in Proceedings of International Conference on Machine Learn-ing, 2009, pp. 729–736.

[32] J. Maturana, A. Fialho, F. Saubion, M. Schoenauer, and M. Sebag,“Extreme compass and dynamic multi-armed bandits for adaptiveoperator selection,” in In Proceedings of IEEE ICEC, 2009, pp. 365–372.

[33] J.-Y. Audibert, R. Munos, and C. Szepesvari, “Exploration-exploitationtradeoff using variance estimates in multi-armed bandits,” Theor. Com-put. Sci., vol. 410, no. 19, pp. 1876–1902, 2009.

[34] R. Busa-Fekete and B. Kegl, “Fast boosting using adversarial bandits,”in Proceedings of International Conference on Machine Learning, 2010.

[35] T. Fawcett and A. Niculescu-mizil, “Technical note: Pav and the rocconvex hull,” Machine Learning, vol. 68, no. 1, pp. 97–106, 2007.

[36] R. E. Schapire and Y. Singer, “Improved boosting algorithms usingconfidence rated predictions,” Machine Learning, vol. 3, no. 37, pp.297–336, 1999.

[37] C. Rudin, R. Schapire, and I. Daubechies, “Precise statements ofconvergence for adaboost and arc-gv,” Contemporary Mathematics, vol.443, 2007.

[38] H. Robbins, “Some aspects of the sequential design of experiments,”Bulletin American Mathematical Society, vol. 55, pp. 527–535, 1952.

[39] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games.Cambridge University Press, 2006.

[40] S. Yu, B. Krishnapuram, R. Rosales, and R. Rao, “Bayesian co-training,” Journal of Machine Learning Research, vol. 12, pp. 2649–2680, 2011.

[ieee 2013 ieee applied imagery pattern recognition workshop: sensing for control and augmentation...

Documents