Online Learning of Maximum Margin Classifiers: p-Norm with Bias
Kohei Hatano, Kyushu University
(Joint work with K. Ishibashi and M. Takeda)
COLT 2008
Plan of this talk
1. Introduction
2. Preliminaries
   – ROMMA
3. Our result
   – Our new algorithm: PUMMA
   – Our implicit reduction
4. Experiments
Maximum Margin Classification
[Figure: a separating hyperplane and its margin]
• SVMs [Boser et al. 92] – 2-norm margin
• Boosting [Freund & Schapire 97] – ∞-norm margin (approximately)
• Why maximum (or large) margin?
  – Good generalization [Schapire et al. 98] [Shawe-Taylor et al. 98]
  – Formulated as convex optimization problems (QP, LP)
Scaling up Max. Margin Classification
1. Decomposition Methods (for SVMs)
   – Break the original QP into smaller QPs
   – SMO [Platt 99], SVMlight [Joachims 99], LIBSVM [Chang & Lin 01]
   – state-of-the-art implementations
2. Online Learning (our approach)
Online Learning
Advantages of online learning:
• Simple & easy to implement
• Uses less memory
• Adaptive to changing concepts
Online Learning Algorithm
For t = 1 to T
  1. Receive an instance x_t ∈ R^n
  2. Guess a label ŷ_t = sign(w_t·x_t + b_t)
  3. Receive the label y_t ∈ {−1, +1}
  4. Update (w_{t+1}, b_{t+1}) = UPDATE_RULE(w_t, b_t, x_t, y_t)
end
[Figure: the online protocol — the learner receives x_t, predicts +1 or −1 using (w_t, b_t), then updates to (w_{t+1}, b_{t+1}).]
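To make the protocol concrete, here is a minimal Python sketch of the loop above; the perceptron-style UPDATE_RULE is only a stand-in (ROMMA's and PUMMA's rules come later), and all names are our own illustrations, not from the talk.

```python
import numpy as np

def perceptron_update(w, b, x, y, eta=1.0):
    """Stand-in UPDATE_RULE: take a perceptron step on a mistake."""
    if y * (np.dot(w, x) + b) <= 0:
        w = w + eta * y * x
        b = b + eta * y
    return w, b

def online_learn(stream, n, update_rule=perceptron_update):
    """Run the online protocol over a stream of (x_t, y_t) pairs."""
    w, b = np.zeros(n), 0.0
    mistakes = 0
    for x, y in stream:
        y_hat = np.sign(np.dot(w, x) + b)     # 2. guess a label
        if y_hat != y:                        # 3. receive the true label
            mistakes += 1
        w, b = update_rule(w, b, x, y)        # 4. update
    return w, b, mistakes
```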
Online Learning Algorithms for maximum margin classification
• Max Margin Perceptron [Kowalczyk 00]
• ROMMA [Li & Long 02]
• ALMA [Gentile 01]
• LASVM [Bordes et al. 05]
• MICRA [Tsampouka & Shawe-Taylor 07]
• Pegasos [Shalev-Shwartz et al. 07]
• Etc.
Most online algorithms cannot learn a hyperplane with bias!
[Figure: a hyperplane with bias vs. a hyperplane without bias (through the origin 0); the bias shifts the hyperplane away from the origin.]
Typical Reduction to deal with bias [Cf. Cristianini & Shawe-Taylor 00]
Add an extra dimension corresponding to the bias.
Original space                    Augmented space
instance:    x_j ∈ R^n       ↔    x̃_j = (x_j, R) ∈ R^{n+1}
hyperplane:  (u, b)          ↔    ũ = (u, b/R)

where R := max_j ‖x_j‖ and R̃ := max_j ‖x̃_j‖.

margin (over normalized instances):

  γ = min_j y_j (u·x_j + b) / (‖u‖ R)   ↔   γ̃ = min_j y_j (ũ·x̃_j) / (‖ũ‖ R̃)

NOTE: ũ·x̃_j = u·x_j + b, so ũ is equivalent to (u, b).
This reduction weakens the guarantee of the margin: γ/2 ≤ γ̃ ≤ γ.
→ It might cause a significant difference in generalization!
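A small sketch of the augmentation itself (function names are ours, not from the talk):

```python
import numpy as np

def augment(X):
    """Map each instance x_j in R^n to x~_j = (x_j, R) in R^{n+1},
    where R = max_j ||x_j||_2."""
    R = np.max(np.linalg.norm(X, axis=1))
    col = np.full((X.shape[0], 1), R)
    return np.hstack([X, col]), R

# A hyperplane (u, b) corresponds to u~ = (u, b/R):
# u~ . x~_j = u . x_j + b, so labels are preserved,
# but the normalized margin can shrink by up to a factor of 2.
```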
Our New Online Learning Algorithm
PUMMA (P-norm Utilizing Maximum Margin Algorithm)
• PUMMA can learn maximum margin classifiers with bias directly (without using the typical reduction!).
• The margin is defined by the p-norm (p ≥ 2).
  – For p = 2, similar to the Perceptron.
  – For p = O(ln n) [Gentile '03], similar to Winnow [Littlestone '89]: fast when the target is sparse.
• Extended to the linearly inseparable case (omitted).
  – Soft margin with 2-norm slack variables.
Problem of finding the p-norm maximum margin hyperplane [Cf. Mangasarian 99]
Given: (linearly separable) S = ((x_1, y_1), …, (x_T, y_T)),

  (w*, b*) = argmin_{w,b} (1/2)‖w‖_q²
  sub. to: y_j (w·x_j + b) ≥ 1   (j = 1, …, T),

where q is the dual norm of p: 1/p + 1/q = 1 (e.g., p = 2 ⇒ q = 2; p = ∞ ⇒ q = 1).

Goal: Find an approximate solution of (w*, b*).
We want an online algorithm solving the problem with a small # of updates.
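As a batch baseline, this problem can be handed to a generic convex solver; below is a minimal sketch with cvxpy (an illustration on our part, not a tool used in the talk):

```python
import cvxpy as cp
import numpy as np

def pnorm_max_margin(X, y, p=2.0):
    """Solve  min (1/2)||w||_q^2  s.t.  y_j (w . x_j + b) >= 1,
    with q the dual norm of p (1/p + 1/q = 1)."""
    q = 1.0 if p == np.inf else p / (p - 1.0)
    n = X.shape[1]
    w, b = cp.Variable(n), cp.Variable()
    cons = [cp.multiply(y, X @ w + b) >= 1]    # margin constraints
    cp.Problem(cp.Minimize(0.5 * cp.norm(w, q) ** 2), cons).solve()
    return w.value, b.value
```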
ROMMA (Relaxed Online Maximum Margin Algorithm) [Li & Long '02]

Given: S = ((x_1, y_1), …, (x_{t−1}, y_{t−1})), x_t,
1. Predict ŷ_t = sign(w_t·x_t), and receive y_t.
2. If y_t (w_t·x_t) < 1 − δ (margin is "insufficient"), update:

  w_{t+1} = argmin_w (1/2)‖w‖₂²
  sub. to: y_t (w·x_t) ≥ 1        (constraint over the last example which caused an update),
           w·w_t ≥ ‖w_t‖₂²        (constraint over the last hyperplane).

3. Otherwise, w_{t+1} = w_t.

Only 2 constraints!
NOTE: the bias is fixed at 0.
ROMMA [Li & Long '02]

[Figure: weight space — the iterates w_1, w_2, w_3, … approach w_SVM, the solution inside the feasible region of SVM.]

SVM (without bias):
  min_w (1/2)‖w‖₂²  sub. to: y_j (w·x_j) ≥ 1  (j = 1, …, 4).

ROMMA:
  min_w (1/2)‖w‖₂²  sub. to: y_t (w·x_t) ≥ 1,  w·w_t ≥ ‖w_t‖₂².
Solution of ROMMA

The solution of ROMMA is an additive update:

(i) If y_t (w_t·x_t) ≥ ‖w_t‖₂² ‖x_t‖₂², then
      w_{t+1} = α_t y_t x_t,  where α_t = 1 / ‖x_t‖₂².

(ii) Otherwise,
      w_{t+1} = α_t w_t + β_t y_t x_t,  where
      α_t = (‖x_t‖₂² ‖w_t‖₂² − y_t (w_t·x_t)) / (‖x_t‖₂² ‖w_t‖₂² − (w_t·x_t)²),
      β_t = ‖w_t‖₂² (1 − y_t (w_t·x_t)) / (‖x_t‖₂² ‖w_t‖₂² − (w_t·x_t)²).
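A direct transcription of this update into Python (a sketch based on the reconstruction above; it assumes w_t ≠ 0 and x_t not parallel to w_t, so the denominator in case (ii) is nonzero):

```python
import numpy as np

def romma_update(w, x, y, delta=0.1):
    """One ROMMA step: update only when the margin is insufficient."""
    if y * np.dot(w, x) >= 1 - delta:
        return w                           # margin sufficient: keep w
    xx, ww, wx = np.dot(x, x), np.dot(w, w), np.dot(w, x)
    if y * wx >= ww * xx:                  # case (i): only the example
        return (y / xx) * x                #   constraint is active
    denom = xx * ww - wx ** 2              # case (ii): both constraints active
    alpha = (xx * ww - y * wx) / denom
    beta = ww * (1.0 - y * wx) / denom
    return alpha * w + beta * y * x
```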
PUMMA
Given: S = ((x_1, y_1), …, (x_{t−1}, y_{t−1})), x_t,
1. Predict ŷ_t = sign(w_t·x_t + b_t), and receive y_t.
2. If y_t (w_t·x_t + b_t) < 1 − δ, update:

  (w_{t+1}, b_{t+1}) = argmin_{w,b} (1/2)‖w‖_q²    (q-norm: 1/p + 1/q = 1)
  sub. to: w·x_t^pos + b ≥ 1,
           −(w·x_t^neg + b) ≥ 1,
           f(w)·w_t ≥ ‖w_t‖_q²,

  where x_t^pos, x_t^neg are the last positive and negative examples which incurred updates, the bias b is optimized directly, and f is the link function [Grove et al. 97]:

    f_i(w) = sign(w_i) |w_i|^{q−1} / ‖w‖_q^{q−2}.

3. Otherwise, w_{t+1} = w_t.
[Figure: feasible regions — ROMMA keeps the margin constraint ≥ 1 for the last example (○) and the half-space w·w_t ≥ ‖w_t‖₂²; PUMMA keeps margin constraints ≥ 1 for the last positive (○) and negative (●) examples and the half-space f(w)·w_t ≥ ‖w_t‖_q².]
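The link function is one line of numpy; the sketch below also checks the standard duality facts (for p = q = 2, f is the identity, so PUMMA's last constraint reduces to ROMMA's w·w_t ≥ ‖w_t‖₂²):

```python
import numpy as np

def link(w, q):
    """f_i(w) = sign(w_i) |w_i|^{q-1} / ||w||_q^{q-2}  [Grove et al. 97]."""
    norm_q = np.linalg.norm(w, q)
    return np.sign(w) * np.abs(w) ** (q - 1) / norm_q ** (q - 2)

# f_p and f_q are inverses of each other when 1/p + 1/q = 1:
p = 2.5
q = p / (p - 1)
w = np.array([0.3, -1.2, 0.7])
assert np.allclose(link(link(w, q), p), w)          # f_p(f_q(w)) == w
assert np.isclose(np.linalg.norm(link(w, q), p),    # ||f_q(w)||_p == ||w||_q
                  np.linalg.norm(w, q))
```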
Solution of PUMMA

Observation: for p = 2, the solution is the same as that of ROMMA with z_t = x_t^pos − x_t^neg.

The solution of PUMMA is found numerically. Let z_t = x_t^pos − x_t^neg, where x_t^pos, x_t^neg are the last positive and negative examples which incurred updates.

(i) If w_t·z_t ≥ ‖w_t‖_q² ‖z_t‖_p² / 2, then
      w_{t+1} = α_t f_p(z_t),  where α_t = 2 / ‖z_t‖_p²
      (f_p is the link function with q replaced by p).

(ii) Otherwise,
      w_{t+1} = f_p(α_t f(w_t) + β_t z_t),  where
      (α_t, β_t) = argmin_{α,β ≥ 0} (1/2)‖α f(w_t) + β z_t‖_p² − α‖w_t‖_q² − 2β,
      which is solved by the Newton method.

In either case, b_{t+1} = −w_{t+1}·(x_t^pos + x_t^neg) / 2.
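The two-variable problem in case (ii) is smooth and convex, so any second-order solver works; below is a sketch that uses scipy's bounded minimizer in place of the talk's Newton method (the dual objective is our reconstruction, so treat it as an assumption):

```python
import numpy as np
from scipy.optimize import minimize

def f_link(v, r):
    """Link function f_r(v)_i = sign(v_i) |v_i|^{r-1} / ||v||_r^{r-2}."""
    return np.sign(v) * np.abs(v) ** (r - 1) / np.linalg.norm(v, r) ** (r - 2)

def pumma_case_ii(w_t, z_t, p):
    """Case (ii): w_{t+1} = f_p(alpha f_q(w_t) + beta z_t), with
    (alpha, beta) minimizing the reconstructed dual objective."""
    q = p / (p - 1.0)
    fw = f_link(w_t, q)
    wq2 = np.linalg.norm(w_t, q) ** 2

    def dual(ab):
        a, b = ab
        theta = a * fw + b * z_t
        return 0.5 * np.linalg.norm(theta, p) ** 2 - a * wq2 - 2.0 * b

    res = minimize(dual, x0=np.ones(2), bounds=[(0.0, None), (0.0, None)])
    a, b = res.x
    return f_link(a * fw + b * z_t, p)     # w_{t+1} = f_p(theta)
```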
Our (implicit) reduction which preserves the margin
hyperplane with bias:

  (w*, b*) = argmin_{w,b} (1/2)‖w‖_q²
  sub. to: w·x_i^pos + b ≥ 1      (i = 1, …, P),
           −(w·x_j^neg + b) ≥ 1   (j = 1, …, N).

hyperplane without bias, over pairs of positive and negative instances:

  w̃ = argmin_w (1/2)‖w‖_q²
  sub. to: w·(x_i^pos − x_j^neg) ≥ 2   (i = 1, …, P, j = 1, …, N).

Thm. w* = w̃.
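One direction of the theorem is immediate: summing a positive constraint and a negative constraint eliminates the bias b. A short derivation:

```latex
w \cdot x_i^{pos} + b \ge 1
\quad\text{and}\quad
-(w \cdot x_j^{neg} + b) \ge 1
\;\Longrightarrow\;
w \cdot (x_i^{pos} - x_j^{neg}) \ge 2 .
```

Conversely, if w satisfies all pairwise constraints, then min_i w·x_i^pos − max_j w·x_j^neg ≥ 2, so any b with 1 − min_i w·x_i^pos ≤ b ≤ −1 − max_j w·x_j^neg makes (w, b) feasible for the problem with bias.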
Main Result
Thm. Suppose that, given S = ((x_1, y_1), …, (x_T, y_T)), there exists a linear classifier (u, b) s.t. y_t (u·x_t + b) ≥ 1 for t = 1, …, T. Then:

• (# of updates of PUMMA_p(δ)) ≤ (p − 1)‖u‖_q² R² / δ².
• After (p − 1)‖u‖_q² R² / δ² updates, PUMMA_p(δ) outputs a hypothesis with p-norm margin ≥ (1 − δ)γ (γ: the margin of (u, b)),

where R = max_{t=1,…,T} ‖x_t‖_p.

These bounds are similar to those of previous algorithms.
Experiment over artificial data

• Example (x, y): x is an n(=100)-dimensional {−1, +1}-valued vector, and y = f(x), where

  f(x) = sign(x_1 + x_2 + ⋯ + x_16 − b).

• Generate 1000 examples randomly.
• 3 datasets: b = 1 (small), 9 (medium), 15 (large).
• Compare with ROMMA (p = 2) and ALMA (p = 2 ln n).
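A sketch of this setup (the slide does not specify the sampling distribution; we assume uniform over {−1, +1}^n):

```python
import numpy as np

def make_dataset(m=1000, n=100, b=15, k=16, seed=0):
    """Generate m examples: x uniform in {-1,+1}^n,
    y = sign(x_1 + ... + x_k - b)."""
    rng = np.random.default_rng(seed)
    X = rng.choice([-1, 1], size=(m, n))
    y = np.sign(X[:, :k].sum(axis=1) - b)
    y[y == 0] = 1   # break ties arbitrarily (none occur for odd b)
    return X, y
```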
Results over Artificial Data
NOTE 1: the margin is defined over the original space (w/o reduction). NOTE 2: results for b = 9 are omitted for clarity.

[Figure: two panels plotting margin vs. # of updates. Left panel (p = 2): curves PUMMA(15), PUMMA(1), ROMMA(15), ALMA(1); marked margin level 0.0225. Right panel (p = 2 ln N, log-scale # of updates): curves ALMA(15), ALMA(9), ALMA(1), PUMMA(15), PUMMA(9), PUMMA(1); marked margin level 0.0461.]
Computation Time
For p = 2, PUMMA is faster than ROMMA. For p = 2 ln n, PUMMA is faster than ALMA, even though PUMMA uses the Newton method.

[Figure: computation time (sec.) vs. bias (large ← bias → small), two panels: p = 2 (PUMMA vs. ROMMA) and p = 2 ln n (PUMMA vs. ALMA).]
Results over UCI Adult data
adult: # of data = 32561

algorithm       sec.     margin rate (%)
SVMlight        5893     100
ROMMA (99%)     71296    99.03
PUMMA (99%)     44480    99.14

• Fix p = 2.
• 2-norm soft margin formulation for linearly inseparable data.
• Run ROMMA and PUMMA until they achieve 99% of the maximum margin.
Results over MNIST data
MNIST
# of data

algorithm       sec.       margin rate (%)
SVMlight        401.36     100
ROMMA (99%)     1715.57    93.5
PUMMA (99%)     1971.30    99.2

• Fix p = 2.
• Use polynomial kernels.
• 2-norm soft margin formulation for linearly inseparable data.
• Run ROMMA and PUMMA until they achieve 99% of the maximum margin.
Summary
• PUMMA can learn p-norm maximum margin classifiers with bias directly.
  – The # of updates is similar to those of previous algorithms.
  – It achieves (1 − δ) times the maximum p-norm margin.
• PUMMA outperforms other online algorithms when the underlying hyperplane has a large bias.
Future work
• Maximizing the ∞-norm margin directly.
• Tighter bounds on the # of updates:
  – In our experiments, PUMMA is faster especially when the bias is large (like Winnow).
  – Our current bound does not reflect this fact.