Smooth Boosting By Using An Information-Based Criterion
Kohei Hatano, Kyushu University, JAPAN


Page 1: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Smooth Boosting By Using An Information-Based Criterion

Kohei Hatano
Kyushu University, JAPAN

Page 2: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Organization of this talk

1. Introduction
2. Preliminaries
3. Our booster
4. Experiments
5. Summary

Page 3: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Boosting
• Methodology to combine prediction rules into a more accurate one.

E.g., learning a rule to classify web pages about "Drew Barrymore":
• Set of pred. rules = words; labeled training data = web pages.
• Each single-word rule, such as "Barrymore?" (yes/no), "Drew?" (yes/no), or "Charlie's Angels?" (yes/no), is only weakly accurate (accuracy 51%): a page mentioning "Barrymore" may be about any member of "The Barrymore family" of Hollywood, e.g., John Drew Barrymore (her father), Jaid Barrymore (her mother), John Barrymore (her grandpa), Lionel Barrymore (her granduncle), or Diana Barrymore (her aunt).
• A combination of prediction rules (say, a majority vote) reaches accuracy 80%.

Page 4: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Boosting by filtering [Schapire 90], [Freund 95]
• Boosting scheme that uses random sampling from the data: the boosting algorithm samples randomly from the (huge) data and accepts or rejects each example.
• Advantage 1: can determine the sample size adaptively.
• Advantage 2: smaller space complexity (for the sample):
  batch learning: O(1/ε);  boosting by filtering: polylog(1/ε)   (ε: desired error).

Page 5: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Some known results
• Boosting algorithms by filtering
  – Schapire's first boosting alg. [Schapire 90], Boost-by-Majority [Freund 95], MadaBoost [Domingo&Watanabe 00], AdaFlat [Gavinsky 03]
  – Criterion for choosing prediction rules: accuracy

Are there any better criteria?

• A candidate: information-based criterion
  – Real AdaBoost [Schapire&Singer 99], InfoBoost [Aslam 00] (a simple version of Real AdaBoost)
  – Criterion for choosing prediction rules: mutual information
  – Sometimes faster than boosters using an accuracy-based criterion. Experimental: [Schapire&Singer 99]; theoretical: [Hatano&Warmuth 03], [Hatano&Watanabe 04]
  – However, no boosting-by-filtering algorithm with an information-based criterion is known.

Page 6: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Our work

• Boosting by filtering: lower space complexity.
• Information-based criterion: faster convergence.
• Our work: efficient boosting by filtering using an information-based criterion.

Page 7: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

1. Introduction
2. Preliminaries
3. Our booster
4. Experiments
5. Summary

Page 8: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Illustration of general boosting

Train. data:      (x1,+1)  (x2,+1)  (x3,-1)  (x4,-1)  (x5,+1)
Distribution D1:   0.2      0.2      0.2      0.2      0.2
Pred. of h1:      +1       +1       -1       +1       -1

1. Choose a pred. rule h1 maximizing some criterion w.r.t. D1.
2. Assign a coefficient to h1 based on its quality (here 0.25).
3. Update the distribution: weights of correctly classified examples become lower, weights of wrongly classified examples become higher.

Page 9: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Illustration of general boosting (2)

Train. data:      (x1,+1)  (x2,+1)  (x3,-1)  (x4,-1)  (x5,+1)
Distribution D2:   0.16     0.16     0.21     0.21     0.26
Pred. of h2:      -1       -1       -1       -1       +1

1. Choose a pred. rule h2 maximizing some criterion w.r.t. D2.
2. Assign a coefficient to h2 based on its weighted error (here 0.28).
3. Update the distribution: lower weights for correct examples, higher weights for wrong ones.

Repeat this procedure T times.

Page 10: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Illustration of general boosting (3)

Final pred. rule = weighted majority vote of the chosen pred. rules:

  H(x) = 0.25 h1(x) + 0.28 h2(x) + 0.05 h3(x) + ⋯

For an instance x: predict +1 if H(x) > 0, predict -1 otherwise.
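
As a concrete illustration of this generic loop (not the paper's pseudocode), here is a minimal Python sketch; choose_rule, coefficient, and update_distribution are placeholder hooks that each concrete booster (AdaBoost, MadaBoost, GiniBoost) would fill in differently.

```python
# Minimal sketch of the generic boosting loop from the three slides above.
# `choose_rule`, `coefficient`, and `update_distribution` are placeholders;
# each concrete booster defines them differently.

def boost(examples, labels, rules, T, choose_rule, coefficient, update_distribution):
    m = len(examples)
    D = [1.0 / m] * m                  # D1: uniform distribution over the training data
    chosen = []                        # list of (coefficient, rule) pairs
    for _ in range(T):
        h = choose_rule(rules, examples, labels, D)       # 1. maximize the criterion w.r.t. D_t
        a = coefficient(h, examples, labels, D)           # 2. coefficient based on the rule's quality
        chosen.append((a, h))
        D = update_distribution(D, chosen, examples, labels)  # 3. lower on correct, higher on wrong
    def H(x):
        # Final rule: weighted majority vote of the chosen rules
        score = sum(a * h(x) for a, h in chosen)
        return +1 if score > 0 else -1
    return H
```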

Page 11: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Example: AdaBoost [Freund&Schapire 97]

Criterion for choosing pred. rules (edge):
  h_t = argmax_{h ∈ W} edge_{D_t}(h),  where edge_{D_t}(h) = Σ_{i=1..m} y_i h(x_i) D_t(x_i).

Coefficient:
  α_t = (1/2) ln( (1 + r_t) / (1 - r_t) ),  where r_t = Σ_i y_i h_t(x_i) D_t(x_i).

Update:
  H_t = H_{t-1} + α_t h_t;
  D_{t+1}(x_i) = exp(-y_i H_t(x_i)) / Σ_{j=1..m} exp(-y_j H_t(x_j)).

The weight of each example grows exponentially in -y_i H_t(x_i) (large when wrong, small when correct), so difficult examples (possibly noisy) may get too much weight.

Page 12: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Smooth boosting
• Keep the distribution "smooth": sup_x D_t(x)/D_1(x) is poly-bounded, where D_1 is the original distribution (e.g., uniform) and D_t is the distribution constructed by the booster.
• Smoothness makes boosting algorithms noise-tolerant:
  – (statistical query model) MadaBoost [Domingo&Watanabe 00]
  – (malicious noise model) SmoothBoost [Servedio 01]
  – (agnostic boosting model) AdaFlat [Gavinsky 03]
• Sampling from D_t can be simulated efficiently via sampling from D_1 (e.g., by rejection sampling), so smooth boosters are applicable in the boosting-by-filtering framework (see the sketch below).
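
As a sketch of how that simulation works (hypothetical helper names, not from the slides): draw_from_D1 draws one example from the original distribution, weight(x) is assumed proportional to D_t(x)/D_1(x), and cap is the smoothness bound.

```python
import random

def sample_from_Dt(draw_from_D1, weight, cap):
    """Simulate one draw from D_t by rejection sampling from D_1.

    Assumes weight(x) is proportional to D_t(x)/D_1(x) and never exceeds cap
    (the smoothness guarantee).  On average at most `cap` draws from D_1 are
    needed per accepted example.
    """
    while True:
        x = draw_from_D1()                       # sample from the original distribution D_1
        if random.random() <= weight(x) / cap:   # accept with prob. proportional to D_t(x)/D_1(x)
            return x
        # otherwise reject and draw again
```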

Page 13: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Example: MadaBoost [Domingo & Watanabe 00]

Criterion for choosing pred. rules (edge):
  h_t = argmax_{h ∈ W} edge_{D_t}(h) = argmax_{h ∈ W} Σ_{i=1..m} y_i h(x_i) D_t(x_i).

Coefficient:
  α_t = (1/2) ln( (1 + r_t) / (1 - r_t) ),  where r_t = Σ_i y_i h_t(x_i) D_t(x_i).

Update:
  H_t = H_{t-1} + α_t h_t;
  D_{t+1}(x_i) = l(-y_i H_t(x_i)) / Σ_{j=1..m} l(-y_j H_t(x_j)),
  where l is AdaBoost's exponential weight capped at 1, l(z) = min(1, e^z).

D_t is 1/ε-bounded (ε: error of H_t).
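
To make the contrast with AdaBoost's update concrete, a small illustrative sketch (not from the slides) of the two unnormalized per-example weights; the cap at 1 is what keeps D_t smooth.

```python
import math

def adaboost_weight(margin):
    # AdaBoost: exp(-y_i H_t(x_i)); grows without bound on badly misclassified
    # (possibly noisy) examples.
    return math.exp(-margin)

def madaboost_weight(margin):
    # MadaBoost: l(-y_i H_t(x_i)) with l(z) = min(1, e^z); weights of wrongly
    # classified examples are capped at 1.
    return min(1.0, math.exp(-margin))

# margin = y_i * H_t(x_i); negative means the current combined rule is wrong on x_i.
for margin in (2.0, 0.0, -2.0, -10.0):
    print(margin, adaboost_weight(margin), madaboost_weight(margin))
```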

Page 14: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Examples of other smooth boosters

• LogitBoost [Friedman et al. 00]: logistic weighting function
• AdaFlat [Gavinsky 03]: stepwise linear weighting function

Page 15: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

1. Introduction
2. Preliminaries
3. Our booster
4. Experiments
5. Summary

Page 16: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Our new booster

Criterion for choosing pred. rules (pseudo gain):
  h_t = argmax_{h ∈ W} γ_t(h)   (the pseudo gain γ_t is defined on the next slide).

Coefficient:
  α_t(z) = α_t[+1]/2 if z ≥ 0,  α_t[-1]/2 if z < 0,
  where α_t[±1] = ( Σ_{i: h_t(x_i)=±1} y_i h_t(x_i) D_t(x_i) ) / ( Σ_{i: h_t(x_i)=±1} D_t(x_i) ).

Update:
  H_t(x) = H_{t-1}(x) + α_t(h_t(x)) h_t(x);
  D_{t+1}(x_i) = l(-y_i H_t(x_i)) / Σ_{j=1..m} l(-y_j H_t(x_j))   (the same capped weight l as in MadaBoost).

Still, D_t is 1/ε-bounded (ε: error of H_t).
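
A minimal sketch of one round of this update, assuming the reconstruction above; the function name giniboost_round and the score array H_scores are illustrative, not the paper's pseudocode.

```python
import math

def giniboost_round(examples, labels, D, h, H_scores):
    """One round of the update sketched above (labels in {-1,+1}, h(x) in {-1,+1}).

    D is the current distribution D_t; H_scores holds H_{t-1}(x_i) and is
    updated in place to H_t(x_i).  Returns (D_{t+1}, alpha).
    """
    # Per-branch coefficients alpha_t[b], b in {+1, -1}
    alpha = {}
    for b in (+1, -1):
        idx = [i for i, x in enumerate(examples) if h(x) == b]
        d_b = sum(D[i] for i in idx)
        num = sum(labels[i] * h(examples[i]) * D[i] for i in idx)
        alpha[b] = num / d_b if d_b > 0 else 0.0
    # H_t(x) = H_{t-1}(x) + (alpha[h(x)] / 2) * h(x)
    for i, x in enumerate(examples):
        H_scores[i] += (alpha[h(x)] / 2.0) * h(x)
    # Capped (MadaBoost-style) weights, then normalize to get D_{t+1}
    w = [min(1.0, math.exp(-labels[i] * H_scores[i])) for i in range(len(examples))]
    Z = sum(w)
    return [wi / Z for wi in w], alpha
```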

Page 17: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Pseudo gain

  γ_t(h) = p_t δ_t[+1]² + (1 - p_t) δ_t[-1]²,

where
  p_t = Pr_{D_t}{ h(x) = +1 },
  δ_t[±1] = ( Σ_{i: h(x_i)=±1} y_i h(x_i) D_t(x_i) ) / ( Σ_{i: h(x_i)=±1} D_t(x_i) ).

Relation to edge:
  edge_{D_t}(h) = Σ_{i=1..m} y_i h(x_i) D_t(x_i) = p_t δ_t[+1] + (1 - p_t) δ_t[-1].

Property: γ_t(h) ≥ edge_{D_t}(h)²   (by convexity of the square function).
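
A small numeric sketch of these quantities on made-up toy data, checking the property γ_t(h) ≥ edge² (names and data are illustrative only).

```python
def edge_and_pseudo_gain(examples, labels, D, h):
    # p = Pr_D[h(x) = +1]
    p = sum(D[i] for i, x in enumerate(examples) if h(x) == +1)
    delta = {}
    for b in (+1, -1):
        idx = [i for i, x in enumerate(examples) if h(x) == b]
        d_b = sum(D[i] for i in idx)
        num = sum(labels[i] * h(examples[i]) * D[i] for i in idx)
        delta[b] = num / d_b if d_b > 0 else 0.0
    edge = p * delta[+1] + (1 - p) * delta[-1]
    gain = p * delta[+1] ** 2 + (1 - p) * delta[-1] ** 2    # pseudo gain
    assert gain >= edge ** 2 - 1e-12                        # convexity: gamma >= edge^2
    return edge, gain

# Toy data: 5 points, rule h(x) = +1 iff x >= 0
examples = [-2, -1, 0, 1, 2]
labels   = [-1, +1, +1, +1, -1]
D = [0.2] * 5
h = lambda x: +1 if x >= 0 else -1
print(edge_and_pseudo_gain(examples, labels, D, h))   # edge = 0.2, pseudo gain ~ 0.067
```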

Page 18: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Interpretation of pseudo gain

Maximizing the pseudo gain over h is equivalent to minimizing the conditional entropy of the labels given h, i.e., to maximizing the mutual information between h and the labels.

But ... the entropy function here is NOT Shannon's entropy; it is defined with the Gini index.

Page 19: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Information-based criteria

  E_Gini(p) = 4 p (1 - p)
  E_KM(p) = 2 √( p (1 - p) )   [Kearns & Mansour 98]
  E_Shannon(p) = -p log p - (1 - p) log(1 - p)

Cf. Real AdaBoost and InfoBoost choose a pred. rule that maximizes the mutual information defined with the KM entropy.

Good news: the Gini index can be estimated efficiently via sampling!
Our booster chooses a pred. rule maximizing the mutual information defined by the Gini index (GiniBoost).
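
For reference, the three entropy functions above in code (each peaks at 1 for p = 1/2, using log base 2 for the Shannon case); a plain illustration of the formulas, nothing more.

```python
import math

def entropy_gini(p):     # Gini index: E_Gini(p) = 4 p (1 - p)
    return 4 * p * (1 - p)

def entropy_km(p):       # Kearns-Mansour criterion: E_KM(p) = 2 sqrt(p (1 - p))
    return 2 * math.sqrt(p * (1 - p))

def entropy_shannon(p):  # Shannon entropy: -p log p - (1 - p) log(1 - p)
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.1, 0.3, 0.5):
    print(p, entropy_gini(p), entropy_km(p), entropy_shannon(p))
```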

Page 20: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Convergence of train. error (GiniBoost)

Thm. Suppose that (train. error of H_t) > ε for t = 1, ..., T. Then

  train.err(H_T) ≤ 1 - (ε/4) Σ_{t=1..T} γ_t(h_t).

Coro. Further, if γ_t(h_t) ≥ γ̃ for every t, then train.err(H_T) ≤ ε within T = O(1/(ε γ̃)) steps
(solve 1 - (ε/4) T γ̃ ≤ ε for T).

Page 21: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Comparison on convergence speed

booster | # of iterations to get a final rule with error ε | comments
MadaBoost [Domingo&Watanabe 00] | O(1/(ε γ²)) | ○ boost by filtering, ○ adaptive (no need to know γ), × needs technical assumptions
SmoothBoost [Servedio 01] | O(1/(ε γ²)) | ○ boost by filtering, × not adaptive
AdaFlat [Gavinsky 03] | O(1/(ε² γ²)) | ○ boost by filtering, ○ adaptive
GiniBoost (our result) | O(1/(ε γ̃)) | ○ boost by filtering, ○ adaptive
AdaBoost [Freund&Schapire 97] | O(log(1/ε) / γ²) | ○ adaptive, × not boost by filtering

(γ̃: minimum pseudo gain, γ: minimum edge; since γ̃ ≥ γ², O(1/(ε γ̃)) is at most O(1/(ε γ²)).)

Page 22: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Boosting-by-filtering version of GiniBoost (outline)

• Multiplicative bounds for the pseudo gain (and more practical bounds using the central limit approximation).
• Adaptive pred. rule selector.
• Boosting alg. in the PAC learning sense.

Page 23: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

1. Introduction
2. Preliminaries
3. Our booster
4. Experiments
5. Summary

Page 24: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Experiments

• Topic classification of Reuters news (Reuters-21578).
• Binary classification for each of 5 topics (results are averaged).
• 10,000 examples.
• 30,000 words used as base pred. rules.
• Run algorithms until they sample 1,000,000 examples in total.
• 10-fold CV.

Page 25: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Test error over Reuters

Note: GiniBoost2 doubles the coefficients α_t[+1] and α_t[-1] used in GiniBoost.

Page 26: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Execution time

booster | test error (%) | time (sec.)
AdaBoost (w/o sampling, run in 100 steps) | 5.6 | 1349
MadaBoost | 6.7 | 493
GiniBoost | 5.8 | 408
GiniBoost2 | 5.5 | 359

About 4 times faster than AdaBoost!
(Cf. a similar result without sampling for Real AdaBoost [Schapire & Singer 99].)

Page 27: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

1. Introduction
2. Preliminaries
3. Our booster
4. Experiments
5. Summary

Page 28: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Summary / Open problem

Summary: GiniBoost
• uses the pseudo gain (Gini index) to choose base prediction rules;
• shows faster convergence in the filtering scheme.

Open problem
• Theoretical analysis of noise tolerance.

Page 29: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Comparison on sample size

booster | # of sampled examples | # of accepted examples | time (sec.)
AdaBoost (w/o sampling, run in 100 steps) | N/A | N/A | 1349
MadaBoost | 1,032,219 | 157,320 | 493
GiniBoost1 | 1,039,943 | 156,856 | 408
GiniBoost2 | 1,027,874 | 140,916 | 359

Observation: fewer accepted examples → faster selection of pred. rules.