Smooth Boosting By Using An Information-Based Criterion
Kohei Hatano, Kyushu University, JAPAN


Page 1: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Smooth Boosting By Using An Information-Based Criterion

Kohei Hatano
Kyushu University, JAPAN

Page 2: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Organization of this talk

1. Introduction
2. Preliminaries
3. Our booster
4. Experiments
5. Summary

Page 3: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Boosting
• Methodology to combine prediction rules into a more accurate one.

E.g., learning a rule to classify web pages about "Drew Barrymore":
• Set of pred. rules = words; labeled training data = web pages.
• Each single-word rule, such as "Barrymore?" (yes/no), "Drew?" (yes/no), or "Charlie's Angels?" (yes/no), is only weakly accurate (accuracy 51%): a page mentioning "Barrymore" may be about any member of "The Barrymore family" of Hollywood, e.g., John Drew Barrymore (her father), Jaid Barrymore (her mother), John Barrymore (her grandpa), Lionel Barrymore (her granduncle), or Diana Barrymore (her aunt).
• A combination of prediction rules (say, a majority vote) reaches accuracy 80%.

Page 4: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Boosting by filtering [Schapire 90], [Freund 95]
• Boosting scheme that uses random sampling from the data: the boosting algorithm samples randomly from the (huge) data and accepts or rejects each example.
• Advantage 1: can determine the sample size adaptively.
• Advantage 2: smaller space complexity (for the sample):
  batch learning: O(1/ε);  boosting by filtering: polylog(1/ε)   (ε: desired error).

Page 5: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Some known results
• Boosting algorithms by filtering
  – Schapire's first boosting alg. [Schapire 90], Boost-by-Majority [Freund 95], MadaBoost [Domingo&Watanabe 00], AdaFlat [Gavinsky 03]
  – Criterion for choosing prediction rules: accuracy

Are there any better criteria?

• A candidate: information-based criterion
  – Real AdaBoost [Schapire&Singer 99], InfoBoost [Aslam 00] (a simple version of Real AdaBoost)
  – Criterion for choosing prediction rules: mutual information
  – Sometimes faster than boosters using an accuracy-based criterion. Experimental: [Schapire&Singer 99]; theoretical: [Hatano&Warmuth 03], [Hatano&Watanabe 04]
  – However, no boosting-by-filtering algorithm with an information-based criterion is known.

Page 6: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Our work

• Boosting by filtering: lower space complexity.
• Information-based criterion: faster convergence.
• Our work: efficient boosting by filtering using an information-based criterion.

Page 7: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

1. Introduction
2. Preliminaries
3. Our booster
4. Experiments
5. Summary

Page 8: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Illustration of general boosting

Train. data:      (x1,+1)  (x2,+1)  (x3,-1)  (x4,-1)  (x5,+1)
Distribution D1:   0.2      0.2      0.2      0.2      0.2
Pred. of h1:      +1       +1       -1       +1       -1

1. Choose a pred. rule h1 maximizing some criterion w.r.t. D1.
2. Assign a coefficient to h1 based on its quality (here 0.25).
3. Update the distribution: weights of correctly classified examples become lower, weights of wrongly classified examples become higher.

Page 9: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Illustration of general boosting (2)

Train. data:      (x1,+1)  (x2,+1)  (x3,-1)  (x4,-1)  (x5,+1)
Distribution D2:   0.16     0.16     0.21     0.21     0.26
Pred. of h2:      -1       -1       -1       -1       +1

1. Choose a pred. rule h2 maximizing some criterion w.r.t. D2.
2. Assign a coefficient to h2 based on its weighted error (here 0.28).
3. Update the distribution: lower weights for correct examples, higher weights for wrong ones.

Repeat this procedure T times.

Page 10: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Illustration of general boosting (3)

Final pred. rule = weighted majority vote of the chosen pred. rules:

  H(x) = 0.25 h1(x) + 0.28 h2(x) + 0.05 h3(x) + ⋯

For an instance x: predict +1 if H(x) > 0, predict -1 otherwise.
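
As a concrete illustration of this generic loop (not the paper's pseudocode), here is a minimal Python sketch; choose_rule, coefficient, and update_distribution are placeholder hooks that each concrete booster (AdaBoost, MadaBoost, GiniBoost) would fill in differently.

```python
# Minimal sketch of the generic boosting loop from the three slides above.
# `choose_rule`, `coefficient`, and `update_distribution` are placeholders;
# each concrete booster defines them differently.

def boost(examples, labels, rules, T, choose_rule, coefficient, update_distribution):
    m = len(examples)
    D = [1.0 / m] * m                  # D1: uniform distribution over the training data
    chosen = []                        # list of (coefficient, rule) pairs
    for _ in range(T):
        h = choose_rule(rules, examples, labels, D)       # 1. maximize the criterion w.r.t. D_t
        a = coefficient(h, examples, labels, D)           # 2. coefficient based on the rule's quality
        chosen.append((a, h))
        D = update_distribution(D, chosen, examples, labels)  # 3. lower on correct, higher on wrong
    def H(x):
        # Final rule: weighted majority vote of the chosen rules
        score = sum(a * h(x) for a, h in chosen)
        return +1 if score > 0 else -1
    return H
```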

Page 11: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Example: AdaBoost [Freund&Schapire 97]

Criterion for choosing pred. rules (edge):
  h_t = argmax_{h ∈ W} edge_{D_t}(h),  where edge_{D_t}(h) = Σ_{i=1..m} y_i h(x_i) D_t(x_i).

Coefficient:
  α_t = (1/2) ln( (1 + r_t) / (1 - r_t) ),  where r_t = Σ_i y_i h_t(x_i) D_t(x_i).

Update:
  H_t = H_{t-1} + α_t h_t;
  D_{t+1}(x_i) = exp(-y_i H_t(x_i)) / Σ_{j=1..m} exp(-y_j H_t(x_j)).

The weight of each example grows exponentially in -y_i H_t(x_i) (large when wrong, small when correct), so difficult examples (possibly noisy) may get too much weight.

Page 12: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Smooth boosting
• Keep the distribution "smooth": sup_x D_t(x)/D_1(x) is poly-bounded, where D_1 is the original distribution (e.g., uniform) and D_t is the distribution constructed by the booster.
• Smoothness makes boosting algorithms noise-tolerant:
  – (statistical query model) MadaBoost [Domingo&Watanabe 00]
  – (malicious noise model) SmoothBoost [Servedio 01]
  – (agnostic boosting model) AdaFlat [Gavinsky 03]
• Sampling from D_t can be simulated efficiently via sampling from D_1 (e.g., by rejection sampling), so smooth boosters are applicable in the boosting-by-filtering framework (see the sketch below).
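
As a sketch of how that simulation works (hypothetical helper names, not from the slides): draw_from_D1 draws one example from the original distribution, weight(x) is assumed proportional to D_t(x)/D_1(x), and cap is the smoothness bound.

```python
import random

def sample_from_Dt(draw_from_D1, weight, cap):
    """Simulate one draw from D_t by rejection sampling from D_1.

    Assumes weight(x) is proportional to D_t(x)/D_1(x) and never exceeds cap
    (the smoothness guarantee).  On average at most `cap` draws from D_1 are
    needed per accepted example.
    """
    while True:
        x = draw_from_D1()                       # sample from the original distribution D_1
        if random.random() <= weight(x) / cap:   # accept with prob. proportional to D_t(x)/D_1(x)
            return x
        # otherwise reject and draw again
```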

Page 13: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Example: MadaBoost [Domingo & Watanabe 00]

Criterion for choosing pred. rules (edge):
  h_t = argmax_{h ∈ W} edge_{D_t}(h) = argmax_{h ∈ W} Σ_{i=1..m} y_i h(x_i) D_t(x_i).

Coefficient:
  α_t = (1/2) ln( (1 + r_t) / (1 - r_t) ),  where r_t = Σ_i y_i h_t(x_i) D_t(x_i).

Update:
  H_t = H_{t-1} + α_t h_t;
  D_{t+1}(x_i) = l(-y_i H_t(x_i)) / Σ_{j=1..m} l(-y_j H_t(x_j)),
  where l is AdaBoost's exponential weight capped at 1, l(z) = min(1, e^z).

D_t is 1/ε-bounded (ε: error of H_t).
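
To make the contrast with AdaBoost's update concrete, a small illustrative sketch (not from the slides) of the two unnormalized per-example weights; the cap at 1 is what keeps D_t smooth.

```python
import math

def adaboost_weight(margin):
    # AdaBoost: exp(-y_i H_t(x_i)); grows without bound on badly misclassified
    # (possibly noisy) examples.
    return math.exp(-margin)

def madaboost_weight(margin):
    # MadaBoost: l(-y_i H_t(x_i)) with l(z) = min(1, e^z); weights of wrongly
    # classified examples are capped at 1.
    return min(1.0, math.exp(-margin))

# margin = y_i * H_t(x_i); negative means the current combined rule is wrong on x_i.
for margin in (2.0, 0.0, -2.0, -10.0):
    print(margin, adaboost_weight(margin), madaboost_weight(margin))
```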

Page 14: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Examples of other smooth boosters

• LogitBoost [Friedman et al. 00]: logistic weighting function
• AdaFlat [Gavinsky 03]: stepwise linear weighting function

Page 15: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

1. Introduction
2. Preliminaries
3. Our booster
4. Experiments
5. Summary

Page 16: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Our new booster

Criterion for choosing pred. rules (pseudo gain):
  h_t = argmax_{h ∈ W} γ_t(h)   (the pseudo gain γ_t is defined on the next slide).

Coefficient:
  α_t(z) = α_t[+1]/2 if z ≥ 0,  α_t[-1]/2 if z < 0,
  where α_t[±1] = ( Σ_{i: h_t(x_i)=±1} y_i h_t(x_i) D_t(x_i) ) / ( Σ_{i: h_t(x_i)=±1} D_t(x_i) ).

Update:
  H_t(x) = H_{t-1}(x) + α_t(h_t(x)) h_t(x);
  D_{t+1}(x_i) = l(-y_i H_t(x_i)) / Σ_{j=1..m} l(-y_j H_t(x_j))   (the same capped weight l as in MadaBoost).

Still, D_t is 1/ε-bounded (ε: error of H_t).
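
A minimal sketch of one round of this update, assuming the reconstruction above; the function name giniboost_round and the score array H_scores are illustrative, not the paper's pseudocode.

```python
import math

def giniboost_round(examples, labels, D, h, H_scores):
    """One round of the update sketched above (labels in {-1,+1}, h(x) in {-1,+1}).

    D is the current distribution D_t; H_scores holds H_{t-1}(x_i) and is
    updated in place to H_t(x_i).  Returns (D_{t+1}, alpha).
    """
    # Per-branch coefficients alpha_t[b], b in {+1, -1}
    alpha = {}
    for b in (+1, -1):
        idx = [i for i, x in enumerate(examples) if h(x) == b]
        d_b = sum(D[i] for i in idx)
        num = sum(labels[i] * h(examples[i]) * D[i] for i in idx)
        alpha[b] = num / d_b if d_b > 0 else 0.0
    # H_t(x) = H_{t-1}(x) + (alpha[h(x)] / 2) * h(x)
    for i, x in enumerate(examples):
        H_scores[i] += (alpha[h(x)] / 2.0) * h(x)
    # Capped (MadaBoost-style) weights, then normalize to get D_{t+1}
    w = [min(1.0, math.exp(-labels[i] * H_scores[i])) for i in range(len(examples))]
    Z = sum(w)
    return [wi / Z for wi in w], alpha
```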

Page 17: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Pseudo gain

  γ_t(h) = p_t δ_t[+1]² + (1 - p_t) δ_t[-1]²,

where
  p_t = Pr_{D_t}{ h(x) = +1 },
  δ_t[±1] = ( Σ_{i: h(x_i)=±1} y_i h(x_i) D_t(x_i) ) / ( Σ_{i: h(x_i)=±1} D_t(x_i) ).

Relation to edge:
  edge_{D_t}(h) = Σ_{i=1..m} y_i h(x_i) D_t(x_i) = p_t δ_t[+1] + (1 - p_t) δ_t[-1].

Property: γ_t(h) ≥ edge_{D_t}(h)²   (by convexity of the square function).
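
A small numeric sketch of these quantities on made-up toy data, checking the property γ_t(h) ≥ edge² (names and data are illustrative only).

```python
def edge_and_pseudo_gain(examples, labels, D, h):
    # p = Pr_D[h(x) = +1]
    p = sum(D[i] for i, x in enumerate(examples) if h(x) == +1)
    delta = {}
    for b in (+1, -1):
        idx = [i for i, x in enumerate(examples) if h(x) == b]
        d_b = sum(D[i] for i in idx)
        num = sum(labels[i] * h(examples[i]) * D[i] for i in idx)
        delta[b] = num / d_b if d_b > 0 else 0.0
    edge = p * delta[+1] + (1 - p) * delta[-1]
    gain = p * delta[+1] ** 2 + (1 - p) * delta[-1] ** 2    # pseudo gain
    assert gain >= edge ** 2 - 1e-12                        # convexity: gamma >= edge^2
    return edge, gain

# Toy data: 5 points, rule h(x) = +1 iff x >= 0
examples = [-2, -1, 0, 1, 2]
labels   = [-1, +1, +1, +1, -1]
D = [0.2] * 5
h = lambda x: +1 if x >= 0 else -1
print(edge_and_pseudo_gain(examples, labels, D, h))   # edge = 0.2, pseudo gain ~ 0.067
```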

Page 18: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Interpretation of pseudo gain

Maximizing the pseudo gain over h is equivalent to minimizing the conditional entropy of the labels given h, i.e., to maximizing the mutual information between h and the labels.

But ... the entropy function here is NOT Shannon's entropy; it is defined with the Gini index.

Page 19: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Information-based criteria

  E_Gini(p) = 4 p (1 - p)
  E_KM(p) = 2 √( p (1 - p) )   [Kearns & Mansour 98]
  E_Shannon(p) = -p log p - (1 - p) log(1 - p)

Cf. Real AdaBoost and InfoBoost choose a pred. rule that maximizes the mutual information defined with the KM entropy.

Good news: the Gini index can be estimated efficiently via sampling!
Our booster chooses a pred. rule maximizing the mutual information defined by the Gini index (GiniBoost).
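
For reference, the three entropy functions above in code (each peaks at 1 for p = 1/2, using log base 2 for the Shannon case); a plain illustration of the formulas, nothing more.

```python
import math

def entropy_gini(p):     # Gini index: E_Gini(p) = 4 p (1 - p)
    return 4 * p * (1 - p)

def entropy_km(p):       # Kearns-Mansour criterion: E_KM(p) = 2 sqrt(p (1 - p))
    return 2 * math.sqrt(p * (1 - p))

def entropy_shannon(p):  # Shannon entropy: -p log p - (1 - p) log(1 - p)
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.1, 0.3, 0.5):
    print(p, entropy_gini(p), entropy_km(p), entropy_shannon(p))
```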

Page 20: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Convergence of train. error (GiniBoost)

Thm. Suppose that (train. error of H_t) > ε for t = 1, ..., T. Then

  train.err(H_T) ≤ 1 - (ε/4) Σ_{t=1..T} γ_t(h_t).

Coro. Further, if γ_t(h_t) ≥ γ̃ for every t, then train.err(H_T) ≤ ε within T = O(1/(ε γ̃)) steps
(solve 1 - (ε/4) T γ̃ ≤ ε for T).

Page 21: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Comparison on convergence speed

booster | # of iterations to get a final rule with error ε | comments
MadaBoost [Domingo&Watanabe 00] | O(1/(ε γ²)) | ○ boost by filtering, ○ adaptive (no need to know γ), × needs technical assumptions
SmoothBoost [Servedio 01] | O(1/(ε γ²)) | ○ boost by filtering, × not adaptive
AdaFlat [Gavinsky 03] | O(1/(ε² γ²)) | ○ boost by filtering, ○ adaptive
GiniBoost (our result) | O(1/(ε γ̃)) | ○ boost by filtering, ○ adaptive
AdaBoost [Freund&Schapire 97] | O(log(1/ε) / γ²) | ○ adaptive, × not boost by filtering

(γ̃: minimum pseudo gain, γ: minimum edge; since γ̃ ≥ γ², O(1/(ε γ̃)) is at most O(1/(ε γ²)).)

Page 22: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Boosting-by-filtering version of GiniBoost (outline)

• Multiplicative bounds for the pseudo gain (and more practical bounds using the central limit approximation).
• Adaptive pred. rule selector.
• Boosting alg. in the PAC learning sense.

Page 23: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

1. Introduction
2. Preliminaries
3. Our booster
4. Experiments
5. Summary

Page 24: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Experiments

• Topic classification of Reuters news (Reuters-21578).
• Binary classification for each of 5 topics (results are averaged).
• 10,000 examples.
• 30,000 words used as base pred. rules.
• Run algorithms until they sample 1,000,000 examples in total.
• 10-fold CV.

Page 25: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Test error over Reuters

Note: GiniBoost2 doubles the coefficients α_t[+1] and α_t[-1] used in GiniBoost.

Page 26: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Execution time

booster | test error (%) | time (sec.)
AdaBoost (w/o sampling, run in 100 steps) | 5.6 | 1349
MadaBoost | 6.7 | 493
GiniBoost | 5.8 | 408
GiniBoost2 | 5.5 | 359

About 4 times faster than AdaBoost!
(Cf. a similar result without sampling for Real AdaBoost [Schapire & Singer 99].)

Page 27: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

1. Introduction
2. Preliminaries
3. Our booster
4. Experiments
5. Summary

Page 28: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Summary / Open problem

Summary: GiniBoost
• uses the pseudo gain (Gini index) to choose base prediction rules;
• shows faster convergence in the filtering scheme.

Open problem
• Theoretical analysis of noise tolerance.

Page 29: Smooth Boosting By Using An Information-Based Criterion Kohei Hatano Kyushu University, JAPAN

Comparison on sample size

booster | # of sampled examples | # of accepted examples | time (sec.)
AdaBoost (w/o sampling, run in 100 steps) | N/A | N/A | 1349
MadaBoost | 1,032,219 | 157,320 | 493
GiniBoost1 | 1,039,943 | 156,856 | 408
GiniBoost2 | 1,027,874 | 140,916 | 359

Observation: fewer accepted examples → faster selection of pred. rules.