Online Learning of Maximum Margin Classifiers: p-Norm with Bias
Kohei Hatano, Kyushu University
(Joint work with K. Ishibashi and M. Takeda)
COLT 2008
Plan of this talk
1. Introduction
2. Preliminaries
   – ROMMA
3. Our result
   – Our new algorithm: PUMMA
   – Our implicit reduction
4. Experiments
Maximum Margin Classification
[Figure: a separating hyperplane and its margin]
• SVMs [Boser et al. 92] – 2-norm margin
• Boosting [Freund & Schapire 97] – ∞-norm margin (approximately)
• Why maximum (or large) margin?
  – Good generalization [Schapire et al. 98] [Shawe-Taylor et al. 98]
  – Formulated as convex optimization problems (QP, LP)
Scaling up Max. Margin Classification
1. Decomposition Methods (for SVMs)
   – Break the original QP into smaller QPs
   – SMO [Platt 99], SVMlight [Joachims 99], LIBSVM [Chang & Lin 01]
   – state-of-the-art implementations
2. Online Learning (our approach)
Online Learning
Advantages of online learning:
• Simple & easy to implement
• Uses less memory
• Adaptive to changing concepts
Online Learning Algorithm
For t = 1 to T
  1. Receive an instance x_t ∈ R^n
  2. Guess a label ŷ_t = sign(w_t·x_t + b_t)
  3. Receive the label y_t ∈ {−1, +1}
  4. Update (w_{t+1}, b_{t+1}) = UPDATE_RULE(w_t, b_t, x_t, y_t)
end
[Figure: the online protocol — the learner receives x_t, predicts +1 or −1 using (w_t, b_t), then updates to (w_{t+1}, b_{t+1}).]
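To make the protocol concrete, here is a minimal Python sketch of the loop above; the perceptron-style UPDATE_RULE is only a stand-in (ROMMA's and PUMMA's rules come later), and all names are our own illustrations, not from the talk.

```python
import numpy as np

def perceptron_update(w, b, x, y, eta=1.0):
    """Stand-in UPDATE_RULE: take a perceptron step on a mistake."""
    if y * (np.dot(w, x) + b) <= 0:
        w = w + eta * y * x
        b = b + eta * y
    return w, b

def online_learn(stream, n, update_rule=perceptron_update):
    """Run the online protocol over a stream of (x_t, y_t) pairs."""
    w, b = np.zeros(n), 0.0
    mistakes = 0
    for x, y in stream:
        y_hat = np.sign(np.dot(w, x) + b)     # 2. guess a label
        if y_hat != y:                        # 3. receive the true label
            mistakes += 1
        w, b = update_rule(w, b, x, y)        # 4. update
    return w, b, mistakes
```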
Online Learning Algorithms for maximum margin classification
• Max Margin Perceptron [Kowalczyk 00]
• ROMMA [Li & Long 02]
• ALMA [Gentile 01]
• LASVM [Bordes et al. 05]
• MICRA [Tsampouka & Shawe-Taylor 07]
• Pegasos [Shalev-Shwartz et al. 07]
• Etc.
Most online algorithms cannot learn a hyperplane with bias!
[Figure: a hyperplane with bias vs. a hyperplane without bias (through the origin 0); the bias shifts the hyperplane away from the origin.]
Typical Reduction to deal with bias [Cf. Cristianini & Shawe-Taylor 00]
Add an extra dimension corresponding to the bias.
Original space                    Augmented space
instance:    x_j ∈ R^n       ↔    x̃_j = (x_j, R) ∈ R^{n+1}
hyperplane:  (u, b)          ↔    ũ = (u, b/R)

where R := max_j ‖x_j‖ and R̃ := max_j ‖x̃_j‖.

margin (over normalized instances):

  γ = min_j y_j (u·x_j + b) / (‖u‖ R)   ↔   γ̃ = min_j y_j (ũ·x̃_j) / (‖ũ‖ R̃)

NOTE: ũ·x̃_j = u·x_j + b, so ũ is equivalent to (u, b).
This reduction weakens the guarantee of the margin: γ/2 ≤ γ̃ ≤ γ.
→ It might cause a significant difference in generalization!
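A small sketch of the augmentation itself (function names are ours, not from the talk):

```python
import numpy as np

def augment(X):
    """Map each instance x_j in R^n to x~_j = (x_j, R) in R^{n+1},
    where R = max_j ||x_j||_2."""
    R = np.max(np.linalg.norm(X, axis=1))
    col = np.full((X.shape[0], 1), R)
    return np.hstack([X, col]), R

# A hyperplane (u, b) corresponds to u~ = (u, b/R):
# u~ . x~_j = u . x_j + b, so labels are preserved,
# but the normalized margin can shrink by up to a factor of 2.
```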
Our New Online Learning Algorithm
PUMMA (P-norm Utilizing Maximum Margin Algorithm)
• PUMMA can learn maximum margin classifiers with bias directly (without using the typical reduction!).
• The margin is defined by the p-norm (p ≥ 2).
  – For p = 2, similar to the Perceptron.
  – For p = O(ln n) [Gentile '03], similar to Winnow [Littlestone '89]: fast when the target is sparse.
• Extended to the linearly inseparable case (omitted).
  – Soft margin with 2-norm slack variables.
Problem of finding the p-norm maximum margin hyperplane [Cf. Mangasarian 99]
Given: (linearly separable) S = ((x_1, y_1), …, (x_T, y_T)),

  (w*, b*) = argmin_{w,b} (1/2)‖w‖_q²
  sub. to: y_j (w·x_j + b) ≥ 1   (j = 1, …, T),

where q is the dual norm of p: 1/p + 1/q = 1 (e.g., p = 2 ⇒ q = 2; p = ∞ ⇒ q = 1).

Goal: Find an approximate solution of (w*, b*).
We want an online algorithm solving the problem with a small # of updates.
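As a batch baseline, this problem can be handed to a generic convex solver; below is a minimal sketch with cvxpy (an illustration on our part, not a tool used in the talk):

```python
import cvxpy as cp
import numpy as np

def pnorm_max_margin(X, y, p=2.0):
    """Solve  min (1/2)||w||_q^2  s.t.  y_j (w . x_j + b) >= 1,
    with q the dual norm of p (1/p + 1/q = 1)."""
    q = 1.0 if p == np.inf else p / (p - 1.0)
    n = X.shape[1]
    w, b = cp.Variable(n), cp.Variable()
    cons = [cp.multiply(y, X @ w + b) >= 1]    # margin constraints
    cp.Problem(cp.Minimize(0.5 * cp.norm(w, q) ** 2), cons).solve()
    return w.value, b.value
```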
ROMMA (Relaxed Online Maximum Margin Algorithm) [Li & Long '02]

Given: S = ((x_1, y_1), …, (x_{t−1}, y_{t−1})), x_t,
1. Predict ŷ_t = sign(w_t·x_t), and receive y_t.
2. If y_t (w_t·x_t) < 1 − δ (margin is "insufficient"), update:

  w_{t+1} = argmin_w (1/2)‖w‖₂²
  sub. to: y_t (w·x_t) ≥ 1        (constraint over the last example which caused an update),
           w·w_t ≥ ‖w_t‖₂²        (constraint over the last hyperplane).

3. Otherwise, w_{t+1} = w_t.

Only 2 constraints!
NOTE: the bias is fixed at 0.
ROMMA [Li & Long '02]

[Figure: weight space — the iterates w_1, w_2, w_3, … approach w_SVM, the solution inside the feasible region of SVM.]

SVM (without bias):
  min_w (1/2)‖w‖₂²  sub. to: y_j (w·x_j) ≥ 1  (j = 1, …, 4).

ROMMA:
  min_w (1/2)‖w‖₂²  sub. to: y_t (w·x_t) ≥ 1,  w·w_t ≥ ‖w_t‖₂².
Solution of ROMMA

The solution of ROMMA is an additive update:

(i) If y_t (w_t·x_t) ≥ ‖w_t‖₂² ‖x_t‖₂², then
      w_{t+1} = α_t y_t x_t,  where α_t = 1 / ‖x_t‖₂².

(ii) Otherwise,
      w_{t+1} = α_t w_t + β_t y_t x_t,  where
      α_t = (‖x_t‖₂² ‖w_t‖₂² − y_t (w_t·x_t)) / (‖x_t‖₂² ‖w_t‖₂² − (w_t·x_t)²),
      β_t = ‖w_t‖₂² (1 − y_t (w_t·x_t)) / (‖x_t‖₂² ‖w_t‖₂² − (w_t·x_t)²).
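A direct transcription of this update into Python (a sketch based on the reconstruction above; it assumes w_t ≠ 0 and x_t not parallel to w_t, so the denominator in case (ii) is nonzero):

```python
import numpy as np

def romma_update(w, x, y, delta=0.1):
    """One ROMMA step: update only when the margin is insufficient."""
    if y * np.dot(w, x) >= 1 - delta:
        return w                           # margin sufficient: keep w
    xx, ww, wx = np.dot(x, x), np.dot(w, w), np.dot(w, x)
    if y * wx >= ww * xx:                  # case (i): only the example
        return (y / xx) * x                #   constraint is active
    denom = xx * ww - wx ** 2              # case (ii): both constraints active
    alpha = (xx * ww - y * wx) / denom
    beta = ww * (1.0 - y * wx) / denom
    return alpha * w + beta * y * x
```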
PUMMA
Given: S = ((x_1, y_1), …, (x_{t−1}, y_{t−1})), x_t,
1. Predict ŷ_t = sign(w_t·x_t + b_t), and receive y_t.
2. If y_t (w_t·x_t + b_t) < 1 − δ, update:

  (w_{t+1}, b_{t+1}) = argmin_{w,b} (1/2)‖w‖_q²    (q-norm: 1/p + 1/q = 1)
  sub. to: w·x_t^pos + b ≥ 1,
           −(w·x_t^neg + b) ≥ 1,
           f(w)·w_t ≥ ‖w_t‖_q²,

  where x_t^pos, x_t^neg are the last positive and negative examples which incurred updates, the bias b is optimized directly, and f is the link function [Grove et al. 97]:

    f_i(w) = sign(w_i) |w_i|^{q−1} / ‖w‖_q^{q−2}.

3. Otherwise, w_{t+1} = w_t.
[Figure: feasible regions — ROMMA keeps the margin constraint ≥ 1 for the last example (○) and the half-space w·w_t ≥ ‖w_t‖₂²; PUMMA keeps margin constraints ≥ 1 for the last positive (○) and negative (●) examples and the half-space f(w)·w_t ≥ ‖w_t‖_q².]
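The link function is one line of numpy; the sketch below also checks the standard duality facts (for p = q = 2, f is the identity, so PUMMA's last constraint reduces to ROMMA's w·w_t ≥ ‖w_t‖₂²):

```python
import numpy as np

def link(w, q):
    """f_i(w) = sign(w_i) |w_i|^{q-1} / ||w||_q^{q-2}  [Grove et al. 97]."""
    norm_q = np.linalg.norm(w, q)
    return np.sign(w) * np.abs(w) ** (q - 1) / norm_q ** (q - 2)

# f_p and f_q are inverses of each other when 1/p + 1/q = 1:
p = 2.5
q = p / (p - 1)
w = np.array([0.3, -1.2, 0.7])
assert np.allclose(link(link(w, q), p), w)          # f_p(f_q(w)) == w
assert np.isclose(np.linalg.norm(link(w, q), p),    # ||f_q(w)||_p == ||w||_q
                  np.linalg.norm(w, q))
```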
Solution of PUMMA

Observation: for p = 2, the solution is the same as that of ROMMA with z_t = x_t^pos − x_t^neg.

The solution of PUMMA is found numerically. Let z_t = x_t^pos − x_t^neg, where x_t^pos, x_t^neg are the last positive and negative examples which incurred updates.

(i) If w_t·z_t ≥ ‖w_t‖_q² ‖z_t‖_p² / 2, then
      w_{t+1} = α_t f_p(z_t),  where α_t = 2 / ‖z_t‖_p²
      (f_p is the link function with q replaced by p).

(ii) Otherwise,
      w_{t+1} = f_p(α_t f(w_t) + β_t z_t),  where
      (α_t, β_t) = argmin_{α,β ≥ 0} (1/2)‖α f(w_t) + β z_t‖_p² − α‖w_t‖_q² − 2β,
      which is solved by the Newton method.

In either case, b_{t+1} = −w_{t+1}·(x_t^pos + x_t^neg) / 2.
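The two-variable problem in case (ii) is smooth and convex, so any second-order solver works; below is a sketch that uses scipy's bounded minimizer in place of the talk's Newton method (the dual objective is our reconstruction, so treat it as an assumption):

```python
import numpy as np
from scipy.optimize import minimize

def f_link(v, r):
    """Link function f_r(v)_i = sign(v_i) |v_i|^{r-1} / ||v||_r^{r-2}."""
    return np.sign(v) * np.abs(v) ** (r - 1) / np.linalg.norm(v, r) ** (r - 2)

def pumma_case_ii(w_t, z_t, p):
    """Case (ii): w_{t+1} = f_p(alpha f_q(w_t) + beta z_t), with
    (alpha, beta) minimizing the reconstructed dual objective."""
    q = p / (p - 1.0)
    fw = f_link(w_t, q)
    wq2 = np.linalg.norm(w_t, q) ** 2

    def dual(ab):
        a, b = ab
        theta = a * fw + b * z_t
        return 0.5 * np.linalg.norm(theta, p) ** 2 - a * wq2 - 2.0 * b

    res = minimize(dual, x0=np.ones(2), bounds=[(0.0, None), (0.0, None)])
    a, b = res.x
    return f_link(a * fw + b * z_t, p)     # w_{t+1} = f_p(theta)
```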
Our (implicit) reduction which preserves the margin
hyperplane with bias:

  (w*, b*) = argmin_{w,b} (1/2)‖w‖_q²
  sub. to: w·x_i^pos + b ≥ 1      (i = 1, …, P),
           −(w·x_j^neg + b) ≥ 1   (j = 1, …, N).

hyperplane without bias, over pairs of positive and negative instances:

  w̃ = argmin_w (1/2)‖w‖_q²
  sub. to: w·(x_i^pos − x_j^neg) ≥ 2   (i = 1, …, P, j = 1, …, N).

Thm. w* = w̃.
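One direction of the theorem is immediate: summing a positive constraint and a negative constraint eliminates the bias b. A short derivation:

```latex
w \cdot x_i^{pos} + b \ge 1
\quad\text{and}\quad
-(w \cdot x_j^{neg} + b) \ge 1
\;\Longrightarrow\;
w \cdot (x_i^{pos} - x_j^{neg}) \ge 2 .
```

Conversely, if w satisfies all pairwise constraints, then min_i w·x_i^pos − max_j w·x_j^neg ≥ 2, so any b with 1 − min_i w·x_i^pos ≤ b ≤ −1 − max_j w·x_j^neg makes (w, b) feasible for the problem with bias.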
Main Result
Thm. Suppose that, given S = ((x_1, y_1), …, (x_T, y_T)), there exists a linear classifier (u, b) s.t. y_t (u·x_t + b) ≥ 1 for t = 1, …, T. Then:

• (# of updates of PUMMA_p(δ)) ≤ (p − 1)‖u‖_q² R² / δ².
• After (p − 1)‖u‖_q² R² / δ² updates, PUMMA_p(δ) outputs a hypothesis with p-norm margin ≥ (1 − δ)γ (γ: the margin of (u, b)),

where R = max_{t=1,…,T} ‖x_t‖_p.

These bounds are similar to those of previous algorithms.
Experiment over artificial data

• Example (x, y): x is an n(=100)-dimensional {−1, +1}-valued vector, and y = f(x), where

  f(x) = sign(x_1 + x_2 + ⋯ + x_16 − b).

• Generate 1000 examples randomly.
• 3 datasets: b = 1 (small), 9 (medium), 15 (large).
• Compare with ROMMA (p = 2) and ALMA (p = 2 ln n).
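A sketch of this setup (the slide does not specify the sampling distribution; we assume uniform over {−1, +1}^n):

```python
import numpy as np

def make_dataset(m=1000, n=100, b=15, k=16, seed=0):
    """Generate m examples: x uniform in {-1,+1}^n,
    y = sign(x_1 + ... + x_k - b)."""
    rng = np.random.default_rng(seed)
    X = rng.choice([-1, 1], size=(m, n))
    y = np.sign(X[:, :k].sum(axis=1) - b)
    y[y == 0] = 1   # break ties arbitrarily (none occur for odd b)
    return X, y
```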
Results over Artificial Data
NOTE 1: the margin is defined over the original space (w/o reduction). NOTE 2: results for b = 9 are omitted for clarity.

[Figure: two panels plotting margin vs. # of updates. Left panel (p = 2): curves PUMMA(15), PUMMA(1), ROMMA(15), ALMA(1); marked margin level 0.0225. Right panel (p = 2 ln N, log-scale # of updates): curves ALMA(15), ALMA(9), ALMA(1), PUMMA(15), PUMMA(9), PUMMA(1); marked margin level 0.0461.]
Computation Time
For p = 2, PUMMA is faster than ROMMA. For p = 2 ln n, PUMMA is faster than ALMA, even though PUMMA uses the Newton method.

[Figure: computation time (sec.) vs. bias (large ← bias → small), two panels: p = 2 (PUMMA vs. ROMMA) and p = 2 ln n (PUMMA vs. ALMA).]
Results over UCI Adult data
adult: # of data = 32561

algorithm       sec.     margin rate (%)
SVMlight        5893     100
ROMMA (99%)     71296    99.03
PUMMA (99%)     44480    99.14

• Fix p = 2.
• 2-norm soft margin formulation for linearly inseparable data.
• Run ROMMA and PUMMA until they achieve 99% of the maximum margin.
Results over MNIST data
MNIST
# of data

algorithm       sec.       margin rate (%)
SVMlight        401.36     100
ROMMA (99%)     1715.57    93.5
PUMMA (99%)     1971.30    99.2

• Fix p = 2.
• Use polynomial kernels.
• 2-norm soft margin formulation for linearly inseparable data.
• Run ROMMA and PUMMA until they achieve 99% of the maximum margin.
Summary
• PUMMA can learn p-norm maximum margin classifiers with bias directly.
  – The # of updates is similar to those of previous algorithms.
  – It achieves (1 − δ) times the maximum p-norm margin.
• PUMMA outperforms other online algorithms when the underlying hyperplane has a large bias.
Future work
• Maximizing the ∞-norm margin directly.
• Tighter bounds on the # of updates:
  – In our experiments, PUMMA is faster especially when the bias is large (like Winnow).
  – Our current bound does not reflect this fact.