
Page 1

Online Learning of Maximum Margin Classifiers

Kohei HATANO
Kyusyu University

(Joint work with K. Ishibashi and M. Takeda)

p-Norm with Bias

COLT 2008

Page 2

Plan of this talk

1. Introduction
2. Preliminaries
   – ROMMA
3. Our result
   – Our new algorithm (PUMMA)
   – Our implicit reduction
4. Experiments

Page 3

Maximum Margin Classification

• SVMs [Boser et al. 92]
   – 2-norm margin
• Boosting [Freund & Schapire 97]
   – ∞-norm margin (approximately)
• Why maximum (or large) margin?
   – Good generalization [Schapire et al. 98][Shawe-Taylor et al. 98]
   – Formulated as convex optimization problems (QP, LP)

Page 4

Scaling up Max. Margin Classification

1. Decomposition methods (for SVMs)
   – Break the original QP into smaller QPs
   – SMO [Platt 99], SVMlight [Joachims 99], LIBSVM [Chang & Lin 01]
   – state-of-the-art implementations
2. Online learning (our approach)

Page 5

Online Learning

Advantages of online learning:
• Simple & easy to implement
• Uses less memory
• Adaptive for changing concepts

Online Learning Algorithm:
For t = 1 to T
  1. Receive an instance x_t in R^n.
  2. Guess a label ŷ_t = sign(w_t·x_t + b_t).
  3. Receive the label y_t in {-1, +1}.
  4. Update (w_{t+1}, b_{t+1}) = UPDATE_RULE(w_t, b_t, x_t, y_t).
end
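As a concrete illustration of this protocol, here is a minimal Python sketch; the perceptron-style UPDATE_RULE is only a placeholder, not one of the maximum-margin rules discussed next.

```python
import numpy as np

def perceptron_update(w, b, x, y):
    # Placeholder UPDATE_RULE: a plain perceptron step (not the paper's rule).
    if y * (np.dot(w, x) + b) <= 0:
        w = w + y * x
        b = b + y
    return w, b

def online_learning(stream, n, update_rule=perceptron_update):
    """Run the online protocol: predict, receive the label, update."""
    w, b = np.zeros(n), 0.0
    mistakes = 0
    for x_t, y_t in stream:                      # x_t in R^n, y_t in {-1, +1}
        y_hat = np.sign(np.dot(w, x_t) + b)      # guess a label
        if y_hat != y_t:
            mistakes += 1
        w, b = update_rule(w, b, x_t, y_t)       # UPDATE_RULE(w_t, b_t, x_t, y_t)
    return (w, b), mistakes
```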


Page 6

Online Learning Algorithms for maximum margin classification

• Max Margin Perceptron [Kowalczyk 00]
• ROMMA [Li & Long 02]
• ALMA [Gentile 01]
• LASVM [Bordes et al. 05]
• MICRA [Tsampouka & Shawe-Taylor 07]
• Pegasos [Shalev-Shwartz et al. 07]
• etc.

Most of these online algorithms cannot learn a hyperplane with bias!

(figure: a hyperplane with bias vs. a hyperplane without bias, which must pass through the origin)

Page 7

Typical Reduction to deal with bias [Cf. Cristianini & Shawe-Taylor 00]

Add an extra dimension corresponding to the bias.

Original space:
   instance: x_j ∈ R^n,  R := max_j ||x_j||
   hyperplane: (u, b), with ||u|| = 1
   margin (over normalized instances): γ = min_j y_j(u·x_j + b) / (||u|| R)

Augmented space:
   instance: x~_j := (x_j, R) ∈ R^{n+1},  R~ := max_j ||x~_j||
   hyperplane: u~ := (u, b/R)
   margin: γ~ = min_j y_j(u~·x~_j) / (||u~|| R~)

NOTE: u~·x~_j = u·x_j + b, so u~ is equivalent with (u, b).

This reduction weakens the guarantee on the margin: γ/2 ≤ γ~ ≤ γ, so it might cause a significant difference in generalization!
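A small numpy sketch of this reduction on toy data (the dimension, u, and b are assumed values), checking that u~·x~_j = u·x_j + b and comparing the two normalized margins.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.choice([-1.0, 1.0], size=(50, 10))       # instances x_j (toy values)
u = np.ones(10) / np.sqrt(10)                    # unit-norm hyperplane direction
b = 1.0                                          # bias
y = np.sign(X @ u + b)

R = np.max(np.linalg.norm(X, axis=1))            # R = max_j ||x_j||
X_aug = np.hstack([X, np.full((len(X), 1), R)])  # x~_j = (x_j, R)
u_aug = np.append(u, b / R)                      # u~ = (u, b/R)

assert np.allclose(X_aug @ u_aug, X @ u + b)     # u~ . x~_j = u . x_j + b

gamma = np.min(y * (X @ u + b)) / (np.linalg.norm(u) * R)
R_aug = np.max(np.linalg.norm(X_aug, axis=1))
gamma_aug = np.min(y * (X_aug @ u_aug)) / (np.linalg.norm(u_aug) * R_aug)
print(gamma, gamma_aug)                          # gamma/2 <= gamma_aug <= gamma
```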

Page 8

Our New Online Learning Algorithm

PUMMA (P-norm Utilizing Maximum Margin Algorithm)

• PUMMA can learn maximum margin classifiers with bias directly (without using the typical reduction!).
• Margin is defined as the p-norm margin (p ≥ 2).
   – For p = 2, similar to the Perceptron.
   – For p = O(ln n) [Gentile '03], similar to Winnow [Littlestone '89]: fast when the target is sparse.
• Extended to the linearly inseparable case (omitted).
   – Soft margin with 2-norm slack variables.

Page 9

Problem of finding the p-norm maximum margin hyperplane [Cf. Mangasarian 99]

Given: (linearly separable) S = ((x_1, y_1), …, (x_T, y_T)),

   (w*, b*) = argmin_{w,b} (1/2)||w||_q^2
   sub. to: y_j(w·x_j + b) ≥ 1  (j = 1, …, T),

where q is the dual norm of p (1/p + 1/q = 1); e.g., p = 2 gives q = 2, and p = ∞ gives q = 1.

Goal: Find an approximate solution of (w*, b*).
We want an online algorithm solving the problem with a small # of updates.
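For reference, the offline problem can also be handed to a generic convex solver; a minimal sketch with cvxpy (an illustrative assumption, not the implementation used in the paper).

```python
import cvxpy as cp
import numpy as np

def pnorm_max_margin(X, y, p=2.0):
    """argmin_{w,b} (1/2)||w||_q^2  s.t.  y_j (w . x_j + b) >= 1,
    where q = p / (p - 1) is the dual norm of p (assumes p >= 2)."""
    q = p / (p - 1.0)
    n = X.shape[1]
    w, b = cp.Variable(n), cp.Variable()
    objective = cp.Minimize(0.5 * cp.square(cp.norm(w, q)))
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value
```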

Page 10

ROMMA (Relaxed Online Maximum Margin Algorithm) [Li & Long '02]

Given: S = ((x_1, y_1), …, (x_{t-1}, y_{t-1})), x_t,
1. Predict ŷ_t = sign(w_t·x_t), and receive y_t.
2. If y_t(w_t·x_t) < 1 − δ (margin is "insufficient"), update:

   w_{t+1} = argmin_w (1/2)||w||_2^2
   sub. to: y_t(w·x_t) ≥ 1         (constraint over the last example which causes an update),
            w·w_t ≥ ||w_t||_2^2    (constraint over the last hyperplane).

3. Otherwise, w_{t+1} = w_t.

Only 2 constraints!
NOTE: the bias is fixed to 0.
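One relaxed update can be solved directly as the two-constraint problem above; a sketch with cvxpy (illustrative only; ROMMA itself uses the closed form on Page 12).

```python
import cvxpy as cp
import numpy as np

def romma_update_qp(w_t, x_t, y_t):
    """One ROMMA update: keep only the last-example and last-hyperplane constraints."""
    w = cp.Variable(len(w_t))
    constraints = [
        y_t * (w @ x_t) >= 1,              # last example which causes an update
        w @ w_t >= np.dot(w_t, w_t),       # last hyperplane
    ]
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints).solve()
    return w.value
```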

Page 11

ROMMA [Li & Long '02]

(figure: weight space, showing the iterates w_1, w_2, w_3, the constraints from examples 1–4, w_SVM, and the feasible region of SVM)

SVM (without bias):
   min_w (1/2)||w||_2^2
   sub. to: y_j(w·x_j) ≥ 1  (j = 1, …, 4).

ROMMA:
   min_w (1/2)||w||_2^2
   sub. to: y_t(w·x_t) ≥ 1,  w·w_t ≥ ||w_t||_2^2.

Page 12

Solution of ROMMA

Solution of ROMMA is an additive update:

(i) If y_t(w_t·x_t) ≥ ||w_t||_2^2 ||x_t||_2^2, then w_{t+1} = α_t y_t x_t, where α_t = 1 / ||x_t||_2^2.

(ii) Otherwise, w_{t+1} = α_t w_t + β_t y_t x_t, where
   α_t = (||w_t||_2^2 ||x_t||_2^2 − y_t(w_t·x_t)) / (||w_t||_2^2 ||x_t||_2^2 − (w_t·x_t)^2),
   β_t = ||w_t||_2^2 (1 − y_t(w_t·x_t)) / (||w_t||_2^2 ||x_t||_2^2 − (w_t·x_t)^2).
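A minimal numpy sketch of this additive update, following the case split above (the degenerate-case guard is an added safeguard, not from the slide).

```python
import numpy as np

def romma_update_closed_form(w_t, x_t, y_t):
    """Additive ROMMA update  w_{t+1} = alpha_t * w_t + beta_t * y_t * x_t."""
    wx = float(np.dot(w_t, x_t))
    w2 = float(np.dot(w_t, w_t))          # ||w_t||_2^2
    x2 = float(np.dot(x_t, x_t))          # ||x_t||_2^2
    if y_t * wx >= w2 * x2:               # case (i): old-hyperplane constraint inactive
        return (y_t / x2) * x_t
    denom = w2 * x2 - wx ** 2             # > 0 unless x_t is parallel to w_t
    if denom <= 1e-12:                    # degenerate case: added safeguard
        return (y_t / x2) * x_t
    alpha = (w2 * x2 - y_t * wx) / denom  # case (ii): both constraints active
    beta = w2 * (1.0 - y_t * wx) / denom
    return alpha * w_t + beta * y_t * x_t
```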

Page 13

PUMMA

Given: S = ((x_1, y_1), …, (x_{t-1}, y_{t-1})), x_t,
1. Predict ŷ_t = sign(w_t·x_t + b_t), and receive y_t.
2. If y_t(w_t·x_t + b_t) < 1 − δ, update:

   (w_{t+1}, b_{t+1}) = argmin_{w,b} (1/2)||w||_q^2
   sub. to: w·x_t^pos + b ≥ 1,
            −(w·x_t^neg + b) ≥ 1,
            w·f_q(w_t) ≥ ||w_t||_q^2,

   where the bias b is optimized, q is the dual norm (1/p + 1/q = 1),
   x_t^pos, x_t^neg are the last positive and negative examples which incur updates,
   and f_q is the link function [Grove et al. 97]: f_q(w)_i = sign(w_i)|w_i|^{q-1} / ||w||_q^{q-2}.

3. Otherwise, (w_{t+1}, b_{t+1}) = (w_t, b_t).

Relaxed hyperplane constraint:
   ROMMA: w·w_t ≥ ||w_t||_2^2
   PUMMA: w·f_q(w_t) ≥ ||w_t||_q^2
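A small numpy sketch of the link function f_q above (the helper name is mine); for q = 2 it is the identity, so the PUMMA constraint reduces to ROMMA's.

```python
import numpy as np

def f_q(w, q):
    """Link function [Grove et al. 97]: f_q(w)_i = sign(w_i) |w_i|^(q-1) / ||w||_q^(q-2)."""
    norm = np.linalg.norm(w, ord=q)
    if norm == 0.0:
        return np.zeros_like(w)           # convention for w = 0
    return np.sign(w) * np.abs(w) ** (q - 1) / norm ** (q - 2)

# Sanity checks: f_q(w) . w = ||w||_q^2, and f_2 is the identity.
w = np.array([0.5, -2.0, 1.0])
assert np.isclose(np.dot(f_q(w, 1.5), w), np.linalg.norm(w, ord=1.5) ** 2)
assert np.allclose(f_q(w, 2.0), w)
```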

Page 14

Solution of PUMMA

Observation: For p = 2, the solution is the same as that of ROMMA for z_t = x_t^pos − x_t^neg.

Solution of PUMMA is found numerically. Let z_t := x_t^pos − x_t^neg, where x_t^pos, x_t^neg are the last positive and negative examples which incur updates.

(i) If the relaxed hyperplane constraint is inactive at the optimum, then w_{t+1} = α_t f_p(z_t), where α_t = 2 / ||z_t||_p^2.

(ii) Otherwise, w_{t+1} = f_p(α_t f_q(w_t) + β_t z_t), where (α_t, β_t) solves a two-variable subproblem, which is solved by the Newton method.

In either case, b_{t+1} = −(w_{t+1}·x_t^pos + w_{t+1}·x_t^neg) / 2.
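For p = 2 the observation above gives a closed form: apply the ROMMA-style update to z_t = x_t^pos − x_t^neg with target margin 2 (the 2 comes from combining the two margin constraints), then recover the bias by making both constraints tight. A numpy sketch (the function name is mine):

```python
import numpy as np

def pumma_update_p2(w_t, x_pos, x_neg):
    """p = 2 PUMMA update: ROMMA-style closed form on z = x_pos - x_neg, then recover the bias."""
    z = x_pos - x_neg
    wz = float(np.dot(w_t, z))
    w2 = float(np.dot(w_t, w_t))
    z2 = float(np.dot(z, z))
    if 2.0 * wz >= w2 * z2:               # old-hyperplane constraint inactive
        w_new = (2.0 / z2) * z
    else:                                 # both constraints active
        denom = w2 * z2 - wz ** 2         # > 0 unless z is parallel to w_t (degenerate case not handled)
        alpha = (w2 * z2 - 2.0 * wz) / denom
        beta = w2 * (2.0 - wz) / denom
        w_new = alpha * w_t + beta * z
    b_new = -0.5 * (np.dot(w_new, x_pos) + np.dot(w_new, x_neg))
    return w_new, b_new
```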

Page 15

Our (implicit) reduction which preserves the margin

Hyperplane with bias:
   (w*, b*) = argmin_{w,b} (1/2)||w||^2
   sub. to: w·x_i^pos + b ≥ 1      (i = 1, …, P),
            −(w·x_j^neg + b) ≥ 1   (j = 1, …, N).

Hyperplane without bias, over pairs of positive and negative instances:
   w~* = argmin_w (1/2)||w||^2
   sub. to: w·(x_i^pos − x_j^neg) ≥ 2  (i = 1, …, P, j = 1, …, N).

Thm. w* = w~*.
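A quick numerical check of the theorem on toy separable data with cvxpy (sizes and the data distribution are assumptions).

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
P, N, n = 15, 15, 5
pos = rng.normal(size=(P, n)) + 3.0          # positive instances
neg = rng.normal(size=(N, n)) - 3.0          # negative instances

# Hyperplane with bias.
w1, b = cp.Variable(n), cp.Variable()
cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w1)),
           [pos @ w1 + b >= 1, -(neg @ w1 + b) >= 1]).solve()

# Hyperplane without bias, over all pairs of a positive and a negative instance.
diffs = (pos[:, None, :] - neg[None, :, :]).reshape(-1, n)   # x_i^pos - x_j^neg
w2 = cp.Variable(n)
cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w2)), [diffs @ w2 >= 2]).solve()

print(np.allclose(w1.value, w2.value, atol=1e-4))            # Thm: w* = w~*
```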

Page 16

Main Result

Thm. Suppose that, given S = ((x_1, y_1), …, (x_T, y_T)), there exists a linear classifier (u, b) s.t. y_t(u·x_t + b) ≥ 1 for t = 1, …, T. Then:

• (# of updates of PUMMA_p(δ)) ≤ (p − 1)||u||_q^2 R^2 / δ^2.
• After (p − 1)||u||_q^2 R^2 / δ^2 updates, PUMMA_p(δ) outputs a hypothesis with p-norm margin ≥ (1 − δ)γ (γ: margin of (u, b)),

where R = max_{t=1,…,T} ||x_t||_p.

(The bound on the # of updates is similar to those of previous algorithms.)
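Plugging illustrative numbers into the update bound, just to get a feel for its scale (all values assumed):

```python
# Update bound of PUMMA_p(delta): (p - 1) * ||u||_q^2 * R^2 / delta^2
p, u_norm_q, R, delta = 2, 1.0, 1.0, 0.1
print((p - 1) * u_norm_q ** 2 * R ** 2 / delta ** 2)   # -> 100.0 updates
```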

Page 17

Experiment over artificial data

• example (x, y)
   – x: n(=100)-dimensional {-1,+1}-valued vector
   – y = f(x), where f(x) = sign(x_1 + x_2 + … + x_16 + b)
• generate 1000 examples randomly
• 3 datasets (b = 1 (small), 9 (medium), 15 (large))
• Compare with ROMMA (p = 2) and ALMA (p = 2 ln n).
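A sketch of this data-generation setup in numpy (the helper name and the random seed are mine):

```python
import numpy as np

def make_dataset(n_examples=1000, n=100, k=16, b=1, seed=0):
    """x: n-dimensional {-1,+1}-valued vector;  y = f(x) = sign(x_1 + ... + x_k + b)."""
    rng = np.random.default_rng(seed)
    X = rng.choice([-1.0, 1.0], size=(n_examples, n))
    y = np.sign(X[:, :k].sum(axis=1) + b)
    return X, y

# Three datasets with small / medium / large bias, as in the experiment.
datasets = {b: make_dataset(b=b) for b in (1, 9, 15)}
```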

Page 18

Results over Artificial Data

NOTE 1: margin is defined over the original space (w/o the reduction).
NOTE 2: We omit the results for b = 9 for clarity.

(figures: margin vs. # of updates. Left panel: p = 2, curves for PUMMA(15), PUMMA(1), ROMMA(15), ALMA(1). Right panel: p = 2 ln n, curves for ALMA(15), ALMA(9), ALMA(1), PUMMA(15), PUMMA(9), PUMMA(1).)

Page 19

Computation Time

For p = 2, PUMMA is faster than ROMMA. For p = 2 ln n, PUMMA is faster than ALMA even though PUMMA uses the Newton method.

(figures: computation time (sec.) vs. bias, from large to small. Left panel: p = 2, PUMMA vs. ROMMA; right panel: p = 2 ln n, PUMMA vs. ALMA.)

Page 20

Results over UCI Adult data

adult (# of data: 32561)

algorithm      sec.    margin rate (%)
SVMlight       5893    100
ROMMA (99%)    71296   99.03
PUMMA (99%)    44480   99.14

• Fix p = 2.
• 2-norm soft margin formulation for linearly inseparable data.
• Run ROMMA and PUMMA until they achieve 99% of the maximum margin.

Page 21

Results over MNIST data

MNIST
# of data:

algorithm      sec.      margin rate (%)
SVMlight       401.36    100
ROMMA (99%)    1715.57   93.5
PUMMA (99%)    1971.30   99.2

• Fix p = 2.
• Use polynomial kernels.
• 2-norm soft margin formulation for linearly inseparable data.
• Run ROMMA and PUMMA until they achieve 99% of the maximum margin.

Page 22

Summary

• PUMMA can learn p-norm maximum margin classifiers with bias directly.
   – # of updates is similar to those of previous algorithms.
   – achieves (1 − δ) times the maximum p-norm margin.
• PUMMA outperforms other online algorithms when the underlying hyperplane has a large bias.

Page 23

Future work

• Maximizing the ∞-norm margin directly.
• Tighter bounds on the # of updates:
   – In our experiments, PUMMA is faster especially when the bias is large (like Winnow).
   – Our current bound does not reflect this fact.