FilterBoost: Regression and Classification on Large Datasets


Page 1:

FilterBoost: Regression and Classification on Large Datasets

NIPS’07 paper by Joseph K. Bradley and Robert E. Schapire

* some slides reused from the NIPS ’07 presentation

Pages 2-5:

Typical Boosting Framework

• Batch Framework

[Diagram, built up across these slides: Dataset (sampled i.i.d. from target distribution D) → Booster → Final Hypothesis. The booster must always have access to the entire dataset.]

Page 6:

Motivation for a New Framework

• In the typical framework, the boosting algorithm must have access to the entire data set.

• This makes boosting hard to apply when the data set is very large.

• It is also computationally expensive: in each round, the distribution weight of every data point must be updated.

• New ideas:

• use a data stream instead of the entire (fixed) data set.

• train on new subsets of data in each round.

Page 7:

Filtering Framework

• Boost for 1000 rounds: only store ~1/1000 of data at a time.

[Diagram: Data Oracle (~D) → Booster → Final Hypothesis. The booster stores only a tiny fraction of the data.]

Page 8:

Paper’s Results

• A new boosting-by-filtering algorithm.

• Provable guarantees.

• Applicable to both classification and conditional probability estimation.

• Good empirical performance.

Page 9:

AdaBoost (batch boosting)

• Given: Fixed data set S.

• In each round t,

• choose distribution Dt over S.

• choose hypothesis ht.

• estimate the error $\varepsilon_t$ of ht with respect to Dt.

• give ht a weight of $\alpha_t = \frac{1}{2}\log\frac{1-\varepsilon_t}{\varepsilon_t}$.

• Output: final hypothesis $H(x) := \sum_{t=1}^{T} \alpha_t h_t(x)$.

Higher weight is given to misclassified examples: Dt forces the algorithm to correctly classify "harder" examples in later rounds.
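To make the batch loop concrete, here is a minimal AdaBoost sketch in Python (illustrative only, not taken from the paper; train_weak_learner(X, y, weights) is an assumed helper that returns a classifier whose .predict(X) outputs labels in {-1, +1}):

import numpy as np

def adaboost(X, y, train_weak_learner, T):
    """Minimal batch AdaBoost sketch; needs the entire data set in memory."""
    X, y = np.asarray(X), np.asarray(y)
    n = len(y)
    D = np.full(n, 1.0 / n)              # distribution D_t over the fixed data set S
    hypotheses, alphas = [], []
    for t in range(T):
        h = train_weak_learner(X, y, D)  # weak hypothesis h_t chosen under D_t
        pred = h.predict(X)
        eps = D[pred != y].sum()         # weighted error eps_t of h_t under D_t
        if eps <= 0.0 or eps >= 0.5:     # no usable edge (or a perfect hypothesis)
            break
        alpha = 0.5 * np.log((1.0 - eps) / eps)
        hypotheses.append(h)
        alphas.append(alpha)
        D = D * np.exp(-alpha * y * pred)   # misclassified points get larger weight in D_{t+1}
        D = D / D.sum()

    def H(X_new):                        # final hypothesis: sign of the weighted vote
        scores = sum(a * h.predict(X_new) for a, h in zip(alphas, hypotheses))
        return np.sign(scores)

    return H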

Page 10:

AdaBoost (batch boosting)

(Recap of the previous slide: higher weight to misclassified examples; Dt forces the algorithm to correctly classify "harder" examples in later rounds.)

• In filtering, there is no "fixed" data set S. What takes the place of Dt?

Page 11:

Filtering Idea

• Key idea: Simulate Dt using the filter (accept only "hard" examples).

[Diagram: Data Oracle (D) → Filter → accept / reject. Accepted examples are used in the boosting algorithm.]

Page 12:

FilterBoost: Main Algorithm

• Given: Oracle to distribution D.

• In each round t,

• use Filtert to obtain a sample St (which simulates Dt).

• choose hypothesis ht that does well on St.

• estimate the error of ht by using the oracle again.

• give ht a weight of $\alpha_t = \frac{1}{2}\log\frac{1-\varepsilon_t}{\varepsilon_t}$.

• Output: final hypothesis $H(x) := \sum_{t=1}^{T} \alpha_t h_t(x)$.

Page 13:

Filtering: Details

• Simulate Dt using rejection sampling.

• Higher weight to misclassified examples:

• If $yH(x) = 1$, the label and hypothesis agree: low probability of being accepted.

• Otherwise, if $yH(x) = -1$, the example is misclassified: high probability of being accepted.

• So, try $D(x, y) = \exp(-yH(x))$.

• Difficulty: too much weight on too few examples.

[Diagram: Data Oracle (D) → Filter → accept / reject. Accepted examples are used in the boosting algorithm.]

Page 14:

Filtering: Details

• Truncated exponential weights work for filtering [M-AdaBoost, Domingo & Watanabe, 00].

[Diagram: Data Oracle (D) → Filter → accept / reject. Plot of the truncated exponential weight as a function of yH(x).]

Page 15:

Filtering: Details

• FilterBoost is based on AdaBoost for logistic regression: minimizing the logistic loss leads to logistic weights.

• FilterBoost: $\Pr[\mathrm{accept}(x, y)] = \frac{1}{1 + e^{yH(x)}}$.

• M-AdaBoost: $\Pr[\mathrm{accept}(x, y)] = \min\{1, e^{-yH(x)}\}$.

[Diagram: Data Oracle (D) → Filter → accept / reject. Plot comparing the two acceptance probabilities as functions of yH(x).]
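A minimal sketch of the rejection-sampling filter with these two acceptance rules, in Python (illustrative only; oracle() is an assumed function that draws one labeled example (x, y) from D, and H is the current combined hypothesis returning a real-valued score):

import math
import random

def accept_prob_filterboost(margin):
    """FilterBoost acceptance probability 1 / (1 + e^(y H(x))); margin = y * H(x)."""
    return 1.0 / (1.0 + math.exp(min(margin, 500.0)))   # clamp to avoid overflow

def accept_prob_madaboost(margin):
    """M-AdaBoost acceptance probability min{1, e^(-y H(x))}."""
    return 1.0 if margin <= 0.0 else math.exp(-margin)

def filter_example(oracle, H, accept_prob=accept_prob_filterboost):
    """Rejection sampling: draw from the oracle until an example is accepted.
    Misclassified examples (negative margin) are kept with high probability,
    so the stream of accepted examples simulates the reweighted distribution."""
    while True:
        x, y = oracle()
        if random.random() < accept_prob(y * H(x)):
            return x, y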

Page 16:

FilterBoost: Main Algorithm

(The main algorithm slide from Page 12, shown again.)

Page 17:

FilterBoost: Main Algorithm

(The main algorithm from Page 12, shown again, with three questions:)

Q1. How much time does Filtert take?

Q2. How many boosting rounds are needed?

Q3. How can we estimate hypothesis errors?

A1. If the filter takes too long to accept an example, the current hypothesis is already accurate enough, so boosting can stop.

A2. If each weak hypothesis' error is bounded away from 1/2, we make good progress in every round.

A3. Adaptive sampling [Watanabe, 00].
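Pulling the pieces together, here is a compact FilterBoost-style loop in Python, including the stopping rule from A1 (an illustrative sketch under assumptions, not the paper's exact pseudocode: oracle() draws (x, y) from D, train_weak_learner fits a weak hypothesis to the filtered sample, and estimate_edge stands in for the adaptive edge estimation of A3):

import math
import random

def filter_with_budget(oracle, H, max_draws):
    """Rejection-sampling filter with a draw budget. Returns an accepted (x, y),
    or None if the budget is exhausted, signalling that H already classifies
    most oracle examples correctly (answer A1)."""
    for _ in range(max_draws):
        x, y = oracle()
        if random.random() < 1.0 / (1.0 + math.exp(min(y * H(x), 500.0))):
            return x, y
    return None

def filterboost(oracle, train_weak_learner, estimate_edge, T, m, max_draws):
    """Compact FilterBoost-style loop (illustrative sketch)."""
    hypotheses, alphas = [], []
    H = lambda x: sum(a * h(x) for a, h in zip(alphas, hypotheses))
    for t in range(T):
        sample = []
        while len(sample) < m:                 # build S_t via the filter (simulates D_t)
            example = filter_with_budget(oracle, H, max_draws)
            if example is None:
                return H                       # filter took too long: stop boosting early
            sample.append(example)
        h = train_weak_learner(sample)         # weak hypothesis h_t trained on S_t
        gamma = estimate_edge(h, oracle, H)    # edge of h_t, estimated with fresh filtered draws
        alphas.append(0.5 * math.log((0.5 + gamma) / (0.5 - gamma)))  # assumes 0 < gamma < 1/2
        hypotheses.append(h)
    return H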

Page 18:

FilterBoost: Analysis

• Interpreted as an additive logistic regression model. Suppose

$\log\frac{\Pr[y = 1 \mid x]}{\Pr[y = -1 \mid x]} = \sum_t f_t(x) = F(x)$

• which implies $\Pr[y = 1 \mid x] = \frac{1}{1 + e^{-F(x)}}$.

• In the case of FilterBoost, $f_t(x) = \alpha_t h_t(x)$.

• The expected negative log-likelihood of an example is

$\ell(F) = E\left[-\ln\frac{1}{1 + e^{-yF(x)}}\right]$

• FilterBoost minimizes this function. Like AdaBoost, a gradient-descent-style step is used to determine the weak learner $h_t$ and the step size $\alpha_t$, so as to reduce $\ell(F + \alpha_t h_t)$.
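Because the boosted score F(x) is interpreted as a log-odds, conditional probability estimates follow directly from the logistic link. A small illustrative snippet (F is assumed to be the learned score function):

import math

def conditional_probability(F, x):
    """Interpret the boosted score F(x) as log-odds and return Pr[y = +1 | x]."""
    return 1.0 / (1.0 + math.exp(-F(x)))

def classify(F, x):
    """Hard classification: the sign of the boosted score."""
    return 1 if F(x) >= 0.0 else -1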

Page 19:

FilterBoost: Details

• Second-order expansion of $\ell(F + \alpha h)$ around $\alpha h = 0$:

$\ell(F + \alpha h) \approx \ell(F) + \alpha h\,\ell'(F) + \frac{\alpha^2 h^2}{2}\,\ell''(F)$

$= E\left[\ln(1 + e^{-yF(x)}) - \frac{y\,\alpha\,h(x)}{1 + e^{yF(x)}} + \frac{1}{2}\,\frac{y^2 \alpha^2 h(x)^2\, e^{yF(x)}}{(1 + e^{yF(x)})^2}\right]$

• For positive $\alpha$, this expression is minimized when we maximize (using that $y^2$ and $h(x)^2$ are both +1):

$E\left[\frac{y\,h(x)}{1 + e^{yF(x)}}\right] \propto E_q[y\,h(x)]$

• This is maximized for $f(x) = \mathrm{sign}(E_q[y \mid x])$.

• Once $h(x)$ is fixed, $\alpha$ is determined (as a function of the edge $\gamma$) by minimizing the upper bound $\ell(F + \alpha h) \le E\left[e^{-y(F(x) + \alpha h(x))}\right]$, which gives

$\alpha = \frac{1}{2}\log\left(\frac{1/2 + \gamma}{1/2 - \gamma}\right).$
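In code, the two per-round quantities look roughly like this (a sketch; it assumes labels y and outputs h(x) are in {-1, +1}, the edge is defined as gamma = 1/2 minus the error, and the sample was drawn through the filter so plain averages approximate expectations under q):

import math

def edge_on_sample(h, sample):
    """Empirical edge of h on a filtered sample S_t: the mean of y * h(x) equals 2 * gamma."""
    return 0.5 * sum(y * h(x) for x, y in sample) / len(sample)

def step_size(gamma):
    """Step size alpha chosen once the edge gamma is fixed (0 < gamma < 1/2)."""
    return 0.5 * math.log((0.5 + gamma) / (0.5 - gamma))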

Page 20:

FilterBoost: Details (1)

Page 21:

FilterBoost: Details (2)

Page 22:

FilterBoost: Details (3)

Page 23:

FilterBoost: Theory

• Theorem: Assume that the weak hypotheses have edge at least $\gamma$. Let $\varepsilon$ be the target error rate. FilterBoost produces a final hypothesis with error less than $\varepsilon$ within T rounds, where

$T = \tilde{O}\left(\frac{1}{\varepsilon \gamma^2}\right)$

• The real bound: it suffices to take

$T > \frac{2\ln(2)}{\varepsilon\left(1 - 2\sqrt{1/4 - \gamma^2}\right)}$

• Proof elements:

• Step 1: $\mathrm{err}_t \le 2 p_t$, where $\mathrm{err}_t$ is the hypothesis error in round t and $p_t$ is the probability of accepting an example in round t.

• Step 2: $\ell_t - \ell_{t+1} \ge p_t\left(1 - 2\sqrt{1/4 - \gamma_t^2}\right)$.

• Assume that $\mathrm{err}_t \ge \varepsilon$ for all $t \in \{1, \ldots, T\}$. Contradiction.
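For intuition, here is one way the two proof steps combine into the round bound (a sketch; it uses the fact that the initial expected negative log-likelihood is $\ell_1 = \ln 2$ for $H \equiv 0$, and that $\ell_t \ge 0$ throughout):

% If err_t >= eps for every t <= T, then Step 1 gives p_t >= err_t / 2 >= eps / 2.
% Plugging into Step 2 and telescoping:
\ln 2 \;\ge\; \ell_1 - \ell_{T+1} \;=\; \sum_{t=1}^{T} (\ell_t - \ell_{t+1})
\;\ge\; \sum_{t=1}^{T} p_t \left(1 - 2\sqrt{1/4 - \gamma^2}\right)
\;\ge\; T \cdot \frac{\varepsilon}{2}\left(1 - 2\sqrt{1/4 - \gamma^2}\right)
% which is impossible once T > 2 ln(2) / (eps (1 - 2 sqrt(1/4 - gamma^2))),
% so some round t <= T must already have err_t < eps.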

Page 24:

FilterBoost: Theory

• Step 1: $\mathrm{err}_t \le 2 p_t$.

• Recall that $\Pr[\mathrm{accept}(x, y)] = \frac{1}{1 + e^{yH(x)}}$. Call it $q_t(x, y)$.

[Plot of $q_t(x, y)$ as a function of yH(x).]

Page 25:

FilterBoost: Theory

• Step 2: $\ell_t - \ell_{t+1} \ge p_t\left(1 - 2\sqrt{1/4 - \gamma_t^2}\right)$.

• Recall that $\ell(F) = E\left[-\ln\frac{1}{1 + e^{-yF(x)}}\right]$.

• Expanding the expectation, and using $1 - q_t(x, y) = \frac{1}{1 + e^{-yH(x)}}$:

$\ell_t = \sum_{(x,y)} D(x, y)\,\ln\frac{1}{1 - q_t(x, y)}$

$\ell_t - \ell_{t+1} = \sum_{(x,y)} D(x, y)\,\ln\frac{1 - q_{t+1}(x, y)}{1 - q_t(x, y)}$

Page 26:

FilterBoost: Theory

• Let

• Recall that

Page 27:

FilterBoost vs. Rest

• Comparison (as reported in the paper):

                                      Need edges     Need bound      Inf. weak        # rounds
                                      decreasing?    on min edge?    learner space
M-AdaBoost [Domingo & Watanabe, 00]        Y              N                Y            1/ε
AdaFlat [Gavinsky, 02]                     N              N                Y            1/ε²
GiniBoost [Hatano, 06]                     N              N                N            1/ε
FilterBoost                                N              N                Y            1/ε

Page 28:

FilterBoost: Details

• The previous analysis overlooked the probability of failure introduced by three steps:

• training the weak learner

• deciding when to stop boosting

• estimating the edges

Page 29:

Experiments

• Tested FilterBoost against M-AdaBoost, AdaBoost, Logistic AdaBoost.

• Synthetic and real data sets.

• Tested: classification and conditional probability estimation.

Page 30:

Experiments: Classification

• Noisy Majority Vote (synthetic).

• Decision stumps are the weak learners.

• 500,000 examples.

FilterBoost achieves high accuracy fast.

Page 31:

Experiments: CPE

• Conditional Probability Estimation

Page 32:

Conclusion

• FilterBoost is well suited to boosting over large data sets.

• Fewer assumptions, better guarantees.

• Validated empirically on both classification and conditional probability estimation.
