TRANSCRIPT
FilterBoost: Regression and Classification on Large Datasets
NIPS’07 paper by Joseph K. Bradley and Robert E. Schapire
* some slides reused from the NIPS ’07 presentation
1
Typical Boosting Framework
• Batch Framework:
Dataset → Booster → Final Hypothesis
• The dataset is sampled i.i.d. from the target distribution D.
• The booster must always have access to the entire dataset.
2
Motivation for a New Framework
• In the typical framework, the boosting algorithm must have access to the entire data set.
• This prevents its application in scenarios with very large data sets: it is computationally infeasible because, in each round, the distribution weight of every point must be updated.
• New ideas:
• use a data stream instead of the entire (fixed) data set.
• train on a new subset of the data in each round.
3
Filtering Framework
• Boost for 1000 rounds while storing only ~1/1000 of the data at a time.
Data Oracle (~D) → Booster (stores a tiny fraction of the data!) → Final Hypothesis
4
Paper’s Results
• A new boosting-by-filtering algorithm.
• Provable guarantees.
• Applicable to both classification and conditional probability estimation.
• Good empirical performance.
5
AdaBoost (batch boosting)
• Given: fixed data set S.
• In each round t,
• choose distribution D_t over S.
• choose hypothesis h_t.
• estimate the error ε_t of h_t under D_t.
• give h_t a weight of α_t = ½ log((1 − ε_t)/ε_t).
• Output hypothesis: H(x) := Σ_{t=1}^{T} α_t h_t(x)
• D_t puts higher weight on misclassified examples, forcing the algorithm to correctly classify "harder" examples in later rounds.
6
AdaBoost (batch boosting)
• In filtering, there is no "fixed" data set S. What about D_t?
7
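The batch AdaBoost loop above can be sketched in a few lines. This is an illustrative sketch, not the paper's code; `weak_learner` is any routine that fits a hypothesis to a weighted sample.

```python
import math

def adaboost(S, weak_learner, T):
    """Minimal AdaBoost sketch. S is a list of (x, y) pairs with y in {-1, +1};
    weak_learner(S, D) returns a hypothesis h(x) -> {-1, +1} trained on the
    weighted sample. Returns the weighted vote H(x) = sign(sum_t alpha_t h_t(x))."""
    n = len(S)
    D = [1.0 / n] * n                      # D_1: uniform over S
    ensemble = []                          # list of (alpha_t, h_t)
    for t in range(T):
        h = weak_learner(S, D)
        # error of h under D_t
        eps = sum(D[i] for i, (x, y) in enumerate(S) if h(x) != y)
        eps = min(max(eps, 1e-10), 1 - 1e-10)      # avoid log(0)
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, h))
        # reweight: higher weight to misclassified examples, then normalize
        D = [d * math.exp(-alpha * y * h(x)) for d, (x, y) in zip(D, S)]
        Z = sum(D)
        D = [d / Z for d in D]
    def H(x):
        return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
    return H
```

Note that the booster touches every element of S in every round, which is exactly the batch requirement the filtering framework removes.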
Filtering Idea
• Key idea: simulate D_t using the filter (accept only "hard" examples).
Data Oracle (~D) → Filter (accept / reject) → accepted examples are used in the boosting algorithm.
8
FilterBoost: Main Algorithm
• Given: oracle access to distribution D.
• In each round t,
• use Filter_t to obtain a sample S_t (simulating D_t).
• choose hypothesis h_t that does well on S_t.
• estimate the error ε_t of h_t by using the oracle again.
• give h_t a weight of α_t = ½ log((1 − ε_t)/ε_t).
• Output hypothesis: H(x) := Σ_{t=1}^{T} α_t h_t(x)
9
Filtering: Details
• Simulate D_t using rejection sampling:
Data Oracle (~D) → Filter (accept / reject) → accepted examples are used in the boosting algorithm.
• Give higher weight to misclassified examples:
• if yH(x) = 1, the label and hypothesis agree: low probability of being accepted.
• if yH(x) = −1, the example is misclassified: high probability of being accepted.
• So, try D(x, y) = exp(−yH(x)).
• Difficulty: too much weight on too few examples.
10
Filtering: Details
• Truncated exponential weights work for filtering [M-AdaBoost, Domingo & Watanabe, 00].
[Plot: acceptance weight as a function of yH(x)]
11
Filtering: Details
• FilterBoost is based on AdaBoost for logistic regression: minimizing the logistic loss leads to logistic weights.
• FilterBoost:  Pr[accept(x, y)] = 1 / (1 + e^{yH(x)})
• M-AdaBoost:  Pr[accept(x, y)] = min{1, e^{−yH(x)}}
[Plot: both acceptance probabilities as functions of yH(x)]
12
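The rejection-sampling filter with FilterBoost's logistic acceptance rule can be sketched as follows. This is a sketch under stated assumptions: `oracle()` draws one labeled example i.i.d. from D, and `H` is the current real-valued combined hypothesis; neither name comes from the paper.

```python
import math
import random

def make_filter(oracle, H, rng=random.Random(0)):
    """Rejection-sampling filter sketch. oracle() yields (x, y) i.i.d. from D
    with y in {-1, +1}; H(x) is the current combined hypothesis (a margin
    score). An example is accepted with the logistic probability
    q(x, y) = 1 / (1 + exp(y * H(x))), so misclassified examples
    (negative y*H(x)) are kept with high probability."""
    def filtered():
        while True:
            x, y = oracle()
            q = 1.0 / (1.0 + math.exp(y * H(x)))
            if rng.random() < q:
                return x, y
    return filtered
```

Early in boosting H ≈ 0, so q ≈ 1/2 and most examples pass; as H improves, q shrinks and the filter takes longer to accept, which FilterBoost later exploits as a stopping signal.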
FilterBoost: Main Algorithm
• Given: oracle access to distribution D.
• In each round t,
• use Filter_t to obtain a sample S_t (simulating D_t).
• choose hypothesis h_t that does well on S_t.
• estimate the error ε_t of h_t by using the oracle again.
• give h_t a weight of α_t = ½ log((1 − ε_t)/ε_t).
• Output hypothesis: H(x) := Σ_{t=1}^{T} α_t h_t(x)
Q1. How much time does Filter_t take?
A1. If the filter takes too long, then the hypothesis is already accurate enough.
Q2. How many boosting rounds are needed?
A2. If each weak hypothesis' error is bounded away from 1/2, we make good progress.
Q3. How can we estimate hypothesis errors?
A3. Adaptive sampling [Watanabe, 00].
14
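Putting the pieces together, the main loop can be sketched end to end. This is a simplified illustration, not the paper's algorithm: `n_per_round` and `max_draws` are arbitrary knobs standing in for the paper's adaptive-sampling schedule [Watanabe, 00], and the error estimate is a plain fixed-size filtered sample.

```python
import math
import random

def filterboost(oracle, weak_learner, T, n_per_round=200, max_draws=100000,
                rng=random.Random(0)):
    """FilterBoost-style loop sketch. oracle() draws (x, y) i.i.d. from D
    with y in {-1, +1}; weak_learner(sample) fits h(x) -> {-1, +1}."""
    ensemble = []                                    # (alpha_t, h_t) pairs
    def margin(x):
        return sum(a * h(x) for a, h in ensemble)
    def draw_filtered():
        # rejection sampling: accept with logistic probability 1/(1+e^{yH(x)})
        for _ in range(max_draws):
            x, y = oracle()
            if rng.random() < 1.0 / (1.0 + math.exp(y * margin(x))):
                return (x, y)
        return None                                  # filter too slow: H is accurate enough
    def final(x):
        return 1 if margin(x) >= 0 else -1
    for t in range(T):
        sample = []
        while len(sample) < n_per_round:
            ex = draw_filtered()
            if ex is None:                           # early stop (answer A1 above)
                return final
            sample.append(ex)
        h = weak_learner(sample)
        # estimate h's error under D_t on a fresh filtered sample
        fresh = [e for e in (draw_filtered() for _ in range(n_per_round)) if e is not None]
        eps = sum(1 for x, y in fresh if h(x) != y) / max(len(fresh), 1)
        eps = min(max(eps, 1e-6), 1 - 1e-6)
        ensemble.append((0.5 * math.log((1 - eps) / eps), h))
    return final
```

The booster never stores more than one round's sample at a time, which is the point of the filtering framework.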
FilterBoost: Analysis
• Interpret the output as an additive logistic regression model. Suppose
log( Pr[y = 1|x] / Pr[y = −1|x] ) = Σ_t f_t(x) = F(x),
• which implies Pr[y = 1|x] = 1 / (1 + e^{−F(x)}).
• In the case of FilterBoost, f_t(x) = α_t h_t(x).
• The expected negative log-likelihood of an example is
ℓ(F) = E[ −ln( 1 / (1 + e^{−yF(x)}) ) ].
• FilterBoost minimizes this function. As in AdaBoost, gradient descent is used: minimize ℓ(F + α_t h_t) over the weak learner h_t and the step size α_t.
15
FilterBoost: Details
• Second-order expansion of ℓ(F + αh) around α = 0:
ℓ(F + αh) = ℓ(F) + αh·ℓ′(F) + (α²h²/2)·ℓ″(F)
= E[ ln(1 + e^{−yF(x)}) − yαh / (1 + e^{yF(x)}) + ½ · y²α²h²·e^{yF(x)} / (1 + e^{yF(x)})² ]
• For positive α, this expression is minimized when we maximize
E[ yh(x) / (1 + e^{yF(x)}) ] ∝ E_q[yh(x)]
(note y² = h² = 1, since both are ±1).
• This is maximized for h(x) = sign(E_q[y|x]).
• Once h(x) is fixed, α is determined to minimize the upper bound ℓ(F + αh) ≤ E[e^{−y(F(x)+αh(x))}], giving
α = ½ log( (1/2 + γ) / (1/2 − γ) ).
16
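The step size above is the familiar AdaBoost weight in disguise: with edge γ = 1/2 − ε, the two formulas coincide. A quick numeric check (illustrative values only):

```python
import math

# With edge gamma = 1/2 - eps, the step size from the upper-bound minimization,
# alpha = (1/2) log((1/2 + gamma)/(1/2 - gamma)),
# coincides with AdaBoost's alpha = (1/2) log((1 - eps)/eps).
def alpha_from_edge(gamma):
    return 0.5 * math.log((0.5 + gamma) / (0.5 - gamma))

def alpha_from_error(eps):
    return 0.5 * math.log((1 - eps) / eps)

for eps in (0.1, 0.25, 0.4):
    gamma = 0.5 - eps
    assert abs(alpha_from_edge(gamma) - alpha_from_error(eps)) < 1e-12
```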
FilterBoost: Details (1)
17
FilterBoost: Details (2)
18
FilterBoost: Details (3)
19
FilterBoost: Theory
• Theorem: Assume each weak hypothesis has edge at least γ. Let ε be the target error rate. FilterBoost produces a final hypothesis with error less than ε in T rounds, where
T = Õ( 1 / (εγ²) ).
• The precise bound is
T > 2 ln(2) / ( ε(1 − 2√(1/4 − γ²)) ).
• Proof elements:
• Step 1: err_t ≤ 2p_t, where err_t is the hypothesis error in round t and p_t is the probability of accepting an example in round t.
• Step 2: ℓ_t − ℓ_{t+1} ≥ p_t ( 1 − 2√(1/4 − γ_t²) ).
• Assume that err_t > ε for all t ∈ {1, …, T}; derive a contradiction.
20
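For small γ, 1 − 2√(1/4 − γ²) ≈ 2γ², which is how the precise bound reduces to the Õ(1/(εγ²)) statement. The bound can be evaluated directly (the ε and γ values below are chosen arbitrarily for illustration):

```python
import math

# Round bound T > 2 ln 2 / (eps * (1 - 2*sqrt(1/4 - gamma^2))).
def rounds_bound(eps, gamma):
    return 2 * math.log(2) / (eps * (1 - 2 * math.sqrt(0.25 - gamma**2)))

# The bound grows like 1/eps for a fixed edge gamma ...
assert rounds_bound(0.01, 0.1) > rounds_bound(0.02, 0.1)
# ... and shrinks as the edge gamma grows.
assert rounds_bound(0.01, 0.2) < rounds_bound(0.01, 0.1)
```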
FilterBoost: Theory
• Step 1:
• Recall that Pr[accept(x, y)] = 1 / (1 + e^{yH(x)}). Call it q_t(x, y).
[Plot: the acceptance probability as a function of yH(x)]
21
FilterBoost: Theory
• Step 2:
• Recall that ℓ(F) = E[ −ln( 1 / (1 + e^{−yF(x)}) ) ], so
ℓ_t = Σ_{(x,y)} D(x, y) ln( 1 / (1 − q_t(x, y)) ).
• Expanding the expectation,
ℓ_t − ℓ_{t+1} = Σ_{(x,y)} D(x, y) ln( (1 − q_{t+1}(x, y)) / (1 − q_t(x, y)) ).
22
FilterBoost: Theory
• Let
• Recall that
23
FilterBoost vs. Rest
• Comparison (as reported in the paper):

Algorithm                            Need edges    Need bound     Inf. weak       # rounds
                                     decreasing?   on min edge?   learner space
M-AdaBoost [Domingo & Watanabe, 00]  Y             N              Y               1/ε
AdaFlat [Gavinsky, 02]               N             N              Y               1/ε²
GiniBoost [Hatano, 06]               N             N              N               1/ε
FilterBoost                          N             N              Y               1/ε
24
FilterBoost: Details
• The preceding analysis overlooked the probability of failure introduced by three steps:
• training the weak learner,
• deciding when to stop boosting,
• estimating the edges.
25
Experiments
• Tested FilterBoost against M-AdaBoost, AdaBoost, Logistic AdaBoost.
• Synthetic and real data sets.
• Tested: classification and conditional probability estimation.
26
Experiments: Classification
• Noisy Majority Vote (synthetic).
• Decision stumps are the weak learners.
• 500,000 examples.
27
FilterBoost achieves high accuracy fast.
Experiments: CPE
• Conditional Probability Estimation
28
Conclusion
• FilterBoost is well suited to boosting over large data sets.
• Fewer assumptions, better guarantees.
• Validated empirically on both classification and conditional probability estimation.
29