Introduction to Machine Learning
Lecture 6

Mehryar Mohri
Courant Institute and Google Research
[email protected]


Perceptron and Winnow


This Lecture

On-Line linear classification: two algorithms.

• Perceptron algorithm.

• Winnow algorithm.



Linear Classification

Definition: a linear classifier is an algorithm that returns a hypothesis of the form

$$x \mapsto \mathrm{sgn}(w \cdot x + b), \quad \text{with } w \in \mathbb{R}^N,\ b \in \mathbb{R}.$$

[Figure: the separating hyperplane $w \cdot x + b = 0$, with the positive and negative half-spaces marked + and -.]


Margin Definitions

Definition: the (geometric) margin of a point $x$ with label $y$ for a linear classifier $h\colon x \mapsto w \cdot x + b$ is its algebraic distance to the hyperplane $w \cdot x + b = 0$:

$$\rho(x) = \frac{y(w \cdot x + b)}{\|w\|}.$$

Definition: the margin of a linear classifier $h$ for a sample $S = (x_1, \ldots, x_m)$ is the minimum margin of the points in that sample:

$$\rho = \min_{1 \le i \le m} \frac{y_i(w \cdot x_i + b)}{\|w\|}.$$
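For intuition, here is a small numeric check of these two definitions; the hyperplane and sample below are made up for illustration and are not from the lecture.

```python
import numpy as np

# Illustrative hyperplane and labeled sample (made up for this sketch).
w, b = np.array([3.0, 4.0]), -1.0                # ||w|| = 5
X = np.array([[2.0, 1.0], [0.0, 0.0], [0.0, 1.0]])
y = np.array([+1, -1, +1])

margins = y * (X @ w + b) / np.linalg.norm(w)    # geometric margin of each point
print(margins)                                    # [1.8, 0.2, 0.6]
print(margins.min())                              # margin of the classifier on S: 0.2
```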


Perceptron Algorithm


(Rosenblatt, 1958)

Perceptron(w0)
 1  w1 ← w0                         ▷ typically w0 = 0
 2  for t ← 1 to T do
 3      Receive(xt)
 4      ŷt ← sgn(wt · xt)
 5      Receive(yt)
 6      if (ŷt ≠ yt) then
 7          wt+1 ← wt + yt xt       ▷ more generally wt + η yt xt, with η > 0
 8      else wt+1 ← wt
 9  return wT+1
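A minimal Python sketch of this update rule (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def perceptron(X, y, T=1, eta=1.0):
    """On-line perceptron over T passes; labels y_t are in {-1, +1}."""
    w = np.zeros(X.shape[1])                   # w_0 = 0, as in the pseudocode
    for _ in range(T):
        for x_t, y_t in zip(X, y):
            y_hat = 1.0 if w @ x_t >= 0 else -1.0
            if y_hat != y_t:                   # mistake: move w toward y_t x_t
                w = w + eta * y_t * x_t
    return w
```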


Perceptron - Notes

Update: if $y_t(w_t \cdot x_t) < 0$, then

$$y_t(w_{t+1} \cdot x_t) = y_t(w_t \cdot x_t) + \underbrace{\eta \|x_t\|^2}_{\ge 0},$$

i.e., the change is in the desired direction.

Different modes of application:

• repeated passes over a sample of size $m$ drawn according to some distribution $D$.

• infinite sample drawn according to $D$.

• no distributional assumption.


Separating Hyperplane

[Figure: margin and errors. Two panels show a separating hyperplane $w \cdot x = 0$; for a misclassified point $x_i$, its distance to the hyperplane is $-y_i(w \cdot x_i)/\|w\|$.]


Perceptron = Stochastic Gradient Descent

Objective function: convex but not differentiable,

$$F(w) = \frac{1}{T} \sum_{t=1}^{T} \max\bigl(0,\, -y_t(w \cdot x_t)\bigr) = \mathop{\mathbb{E}}_{x \sim \widehat{D}}[f(w, x)], \qquad \text{with } f(w, x) = \max\bigl(0,\, -y(w \cdot x)\bigr).$$

Stochastic gradient: for each $x_t$, the update is

$$w_{t+1} \leftarrow \begin{cases} w_t - \eta \nabla_w f(w_t, x_t) & \text{if } f(\cdot, x_t) \text{ is differentiable at } w_t,\\ w_t & \text{otherwise,} \end{cases}$$

where $\eta > 0$ is a learning rate parameter. Here this gives:

$$w_{t+1} \leftarrow \begin{cases} w_t + \eta\, y_t x_t & \text{if } y_t(w_t \cdot x_t) < 0,\\ w_t & \text{otherwise.} \end{cases}$$
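A short sketch of this correspondence, assuming (as is conventional) that no update is made at the kink where $y_t(w_t \cdot x_t) = 0$:

```python
import numpy as np

def sgd_perceptron_step(w, x_t, y_t, eta=1.0):
    """One stochastic (sub)gradient step on f(w, x) = max(0, -y (w . x))."""
    if y_t * (w @ x_t) < 0:            # f is differentiable here; grad_w f = -y_t x_t
        return w + eta * y_t * x_t     # w - eta * grad  ==  the perceptron update
    return w                            # zero-loss region (or the kink at 0): no update
```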


Perceptron Algorithm - Bound

(Novikoff, 1962)

Theorem: Assume that $\|x_t\| \le R$ for all $t \in [1, T]$, and that for some $\rho > 0$ and $v \in \mathbb{R}^N$, for all $t \in [1, T]$,

$$\rho \le \frac{y_t(v \cdot x_t)}{\|v\|}.$$

Then, the number of mistakes made by the perceptron algorithm is bounded by $R^2/\rho^2$.

Proof: Let $I$ be the set of rounds $t$ at which there is an update and let $M$ be the total number of updates.


• Summing up the assumption inequalities gives:

$$
\begin{aligned}
M\rho &\le \frac{v \cdot \sum_{t \in I} y_t x_t}{\|v\|}
      = \frac{v \cdot \sum_{t \in I} (w_{t+1} - w_t)}{\|v\|} && \text{(definition of the updates)}\\
&= \frac{v \cdot w_{T+1}}{\|v\|}
  \le \|w_{T+1}\| && \text{(Cauchy-Schwarz inequality)}\\
&= \|w_{t_m} + y_{t_m} x_{t_m}\| && (t_m \text{ largest } t \text{ in } I)\\
&= \bigl(\|w_{t_m}\|^2 + \|x_{t_m}\|^2 + \underbrace{2\, y_{t_m} w_{t_m} \cdot x_{t_m}}_{\le 0}\bigr)^{1/2}\\
&\le \bigl(\|w_{t_m}\|^2 + R^2\bigr)^{1/2}
  \le \bigl(M R^2\bigr)^{1/2} = \sqrt{M}\, R. && \text{(applying the same steps to the previous } t\text{'s in } I)
\end{aligned}
$$

Thus $M\rho \le \sqrt{M}\, R$, which gives $M \le R^2/\rho^2$.


• Notes:

• the bound is independent of the dimension and tight.

• convergence can be slow for a small margin; it can be in $\Omega(2^N)$.

• among the many variants: the voted perceptron algorithm. Predict according to

$$\mathrm{sgn}\Bigl(\bigl(\textstyle\sum_{t \in I} c_t w_t\bigr) \cdot x\Bigr),$$

where $c_t$ is the number of iterations $w_t$ survives (a sketch follows below).

• $\{x_t : t \in I\}$ are the support vectors for the perceptron algorithm.

• non-separable case: the algorithm does not converge.
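A rough sketch of that variant, assuming we store every intermediate weight vector together with the number of rounds it survives (all names are illustrative):

```python
import numpy as np

def voted_perceptron_train(X, y, T=1, eta=1.0):
    """Return the list of (w_t, c_t) pairs produced during training."""
    w, c, survivors = np.zeros(X.shape[1]), 0, []
    for _ in range(T):
        for x_t, y_t in zip(X, y):
            if y_t * (w @ x_t) <= 0:          # mistake: retire w with its count, update
                survivors.append((w.copy(), c))
                w, c = w + eta * y_t * x_t, 1
            else:
                c += 1
    survivors.append((w.copy(), c))
    return survivors

def voted_perceptron_predict(survivors, x):
    """Prediction rule from the slide: sgn((sum_t c_t w_t) . x)."""
    combined = sum(c * w for w, c in survivors)
    return 1 if combined @ x >= 0 else -1
```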


Leave-One-Out Error

Definition: let $h_S$ be the hypothesis output by a learning algorithm $L$ after receiving a sample $S$ of size $m$. Then, the leave-one-out error of $L$ over $S$ is:

$$\widehat{R}_{\mathrm{loo}}(L) = \frac{1}{m} \sum_{i=1}^{m} 1_{h_{S - \{x_i\}}(x_i) \ne f(x_i)}.$$

Property: unbiased estimate of the expected error of a hypothesis trained on a sample of size $m - 1$:

$$
\begin{aligned}
\mathop{\mathbb{E}}_{S \sim D^m}\bigl[\widehat{R}_{\mathrm{loo}}(L)\bigr]
&= \frac{1}{m} \sum_{i=1}^{m} \mathop{\mathbb{E}}_{S}\bigl[1_{h_{S - \{x_i\}}(x_i) \ne f(x_i)}\bigr]
 = \mathop{\mathbb{E}}_{S}\bigl[1_{h_{S - \{x\}}(x) \ne f(x)}\bigr]\\
&= \mathop{\mathbb{E}}_{S \sim D^{m-1}}\Bigl[\mathop{\mathbb{E}}_{x \sim D}\bigl[1_{h_S(x) \ne f(x)}\bigr]\Bigr]
 = \mathop{\mathbb{E}}_{S \sim D^{m-1}}\bigl[R(h_S)\bigr].
\end{aligned}
$$


Perceptron - Leave-One-Out Analysis

Theorem: Assume that the data is separable. Let $h_S$ be the hypothesis returned by the perceptron algorithm after training on a sample $S$ (repeated passes), let $M(S)$ be the number of updates made, and let $R(h_S)$ be the error of $h_S$. Then,

$$\mathop{\mathbb{E}}_{S \sim D^m}[R(h_S)] \le \mathop{\mathbb{E}}_{S \sim D^{m+1}}\left[\frac{\min\bigl(M(S),\ R_{m+1}^2/\rho_{m+1}^2\bigr)}{m + 1}\right].$$

Proof: Let $S \sim D^{m+1}$ and let $x$ be a point in the sample $S$. If $h_{S - \{x\}}$ misclassifies $x$, then there must have been an update at $x$ during training over $S$ to obtain $h_S$. Thus,

$$\widehat{R}_{\mathrm{loo}}(\text{perceptron}) \le \frac{M(S)}{m + 1}.$$


Dual Perceptron Algorithm


Dual-Perceptron(α0)
 1  α1 ← α0                         ▷ typically α0 = 0
 2  for t ← 1 to T do
 3      Receive(xt)
 4      ŷt ← sgn(Σ_{s=1}^{T} αs ys (xs · xt))
 5      Receive(yt)
 6      if (ŷt ≠ yt) then
 7          αt+1 ← αt + 1
 8      else αt+1 ← αt
 9  return α


Kernel Perceptron Algorithm


(Aizerman et al., 1964)

K is a PDS kernel.

Kernel-Perceptron(α0)
 1  α1 ← α0                         ▷ typically α0 = 0
 2  for t ← 1 to T do
 3      Receive(xt)
 4      ŷt ← sgn(Σ_{s=1}^{T} αs ys K(xs, xt))
 5      Receive(yt)
 6      if (ŷt ≠ yt) then
 7          αt+1 ← αt + 1
 8      else αt+1 ← αt
 9  return α
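A minimal Python sketch of this kernelized form; the Gaussian kernel below is just one illustrative choice of PDS kernel, and the names are not from the slides.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """An example PDS kernel (illustrative choice)."""
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))

def kernel_perceptron(X, y, K=gaussian_kernel, T=1):
    """alpha[s] counts the updates triggered at x_s; prediction uses only kernel values."""
    m = len(X)
    alpha = np.zeros(m)
    for _ in range(T):
        for t in range(m):
            score = sum(alpha[s] * y[s] * K(X[s], X[t]) for s in range(m))
            y_hat = 1 if score >= 0 else -1
            if y_hat != y[t]:
                alpha[t] += 1
    return alpha
```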


XOR Problem

Use the second-degree polynomial kernel with $c = 1$. The four points $(1, 1)$, $(-1, -1)$, $(-1, 1)$, $(1, -1)$ with XOR labels are linearly non-separable in the original space $(x_1, x_2)$, but linearly separable in feature space, by the hyperplane $x_1 x_2 = 0$:

$$
\begin{aligned}
(1, 1) &\mapsto (1,\ 1,\ +\sqrt{2},\ +\sqrt{2},\ +\sqrt{2},\ 1), &
(-1, -1) &\mapsto (1,\ 1,\ +\sqrt{2},\ -\sqrt{2},\ -\sqrt{2},\ 1),\\
(1, -1) &\mapsto (1,\ 1,\ -\sqrt{2},\ +\sqrt{2},\ -\sqrt{2},\ 1), &
(-1, 1) &\mapsto (1,\ 1,\ -\sqrt{2},\ -\sqrt{2},\ +\sqrt{2},\ 1).
\end{aligned}
$$

[Figure: left panel, the four points in the $(x_1, x_2)$ plane, linearly non-separable; right panel, the same points in the $(\sqrt{2}\, x_1 x_2,\ \sqrt{2}\, x_1)$ plane, linearly separable by $x_1 x_2 = 0$.]
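A small check of this example; the explicit feature map of the degree-2 polynomial kernel with $c = 1$ is written out for illustration (the coordinate ordering is one common convention, not necessarily the slides').

```python
import numpy as np

def phi(x):
    """Explicit feature map of the degree-2 polynomial kernel (x . x' + 1)^2 in 2D."""
    x1, x2 = x
    r2 = np.sqrt(2)
    return np.array([x1**2, x2**2, r2 * x1 * x2, r2 * x1, r2 * x2, 1.0])

X = np.array([(1, 1), (-1, -1), (1, -1), (-1, 1)], dtype=float)

# The single coordinate sqrt(2) * x1 * x2 already separates the two XOR classes:
for x in X:
    print(x, phi(x))   # third coordinate is +sqrt(2) for (1,1),(-1,-1), -sqrt(2) otherwise
```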


Non-Separable Case
(Freund and Schapire, 1998)

Theorem: Let $v$ be any vector with $\|v\| = 1$ and let $\rho > 0$. Define the deviation of $x_t$ by

$$d_t = \max\bigl(0,\ \rho - y_t(v \cdot x_t)\bigr),$$

and let $D = \Bigl(\sum_{t=1}^{T} d_t^2\Bigr)^{1/2}$. Then, the number of perceptron updates after processing $x_1, \ldots, x_T$ is bounded by

$$\left(\frac{R + D}{\rho}\right)^2.$$


• Proof: reduce the problem to the separable case in a higher dimension.

• Mapping (similar to the trivial mapping), with $\Delta$ placed in the $(N + t)$th component of $x_t'$:

$$
x_t = \begin{pmatrix} x_{t,1} \\ \vdots \\ x_{t,N} \end{pmatrix}
\ \mapsto\
x_t' = \begin{pmatrix} x_{t,1} \\ \vdots \\ x_{t,N} \\ 0 \\ \vdots \\ 0 \\ \Delta \\ 0 \\ \vdots \\ 0 \end{pmatrix},
\qquad
v \ \mapsto\
v' = \begin{pmatrix} v_1/Z \\ \vdots \\ v_N/Z \\ y_1 d_1/(\Delta Z) \\ \vdots \\ y_T d_T/(\Delta Z) \end{pmatrix},
\qquad
\|v'\| = 1 \ \Rightarrow\ Z = \sqrt{1 + \frac{D^2}{\Delta^2}}.
$$


• Now,

$$
y_t(v' \cdot x_t')
= y_t \frac{v \cdot x_t}{Z} + \Delta \frac{y_t d_t}{\Delta Z}
= \frac{y_t(v \cdot x_t)}{Z} + \frac{d_t}{Z}
\ \ge\ \frac{y_t(v \cdot x_t)}{Z} + \frac{\rho - y_t(v \cdot x_t)}{Z}
= \frac{\rho}{Z}.
$$

• Since $\|x_t'\|^2 \le R^2 + \Delta^2$, the bound of the separable case applies:

$$\frac{(R^2 + \Delta^2)\,(1 + D^2/\Delta^2)}{\rho^2}.$$

• With $\Delta = \sqrt{RD}$, this bound is minimized and equal to $\dfrac{(R + D)^2}{\rho^2}$.

• The predictions made by the perceptron in the higher-dimensional space coincide with those of the perceptron in the original space.


Winnow Algorithm


Winnow(η)
 1   w1 ← 1/N
 2   for t ← 1 to T do
 3       Receive(xt)
 4       ŷt ← sgn(wt · xt)          ▷ yt ∈ {−1, +1}
 5       Receive(yt)
 6       if (ŷt ≠ yt) then
 7           Zt ← Σ_{i=1}^{N} wt,i exp(η yt xt,i)
 8           for i ← 1 to N do
 9               wt+1,i ← wt,i exp(η yt xt,i) / Zt
 10      else wt+1 ← wt
 11  return wT+1

(Littlestone, 1988)
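A minimal Python sketch of this multiplicative update (variable names are illustrative):

```python
import numpy as np

def winnow(X, y, eta=0.1, T=1):
    """Winnow with normalized multiplicative updates on mistakes; labels in {-1, +1}."""
    N = X.shape[1]
    w = np.full(N, 1.0 / N)                    # w_1 = 1/N (uniform weights)
    for _ in range(T):
        for x_t, y_t in zip(X, y):
            y_hat = 1.0 if w @ x_t >= 0 else -1.0
            if y_hat != y_t:                   # mistake: exponential reweighting
                w = w * np.exp(eta * y_t * x_t)
                w = w / w.sum()                # normalize by Z_t
    return w
```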


Winnow - Notes

Winnow = weighted majority:

• for $y_{t,i} = x_{t,i} \in \{-1, +1\}$, $\mathrm{sgn}(w_t \cdot x_t)$ coincides with the majority vote.

• multiplying by $e^{\eta}$ or $e^{-\eta}$ the weight of correct or incorrect experts is equivalent to multiplying by $\beta = e^{-2\eta}$ the weight of the incorrect ones.

Relationships with other algorithms: e.g., boosting and Perceptron (Winnow and Perceptron can be viewed as special instances of a general family).

Motivation: large number of irrelevant features.


Winnow Algorithm - Bound

Theorem: Assume that $\|x_t\|_\infty \le R_\infty$ for all $t \in [1, T]$, and that for some $\rho_\infty > 0$ and $v \in \mathbb{R}^N$, $v \ge 0$, for all $t \in [1, T]$,

$$\rho_\infty \le \frac{y_t(v \cdot x_t)}{\|v\|_1}.$$

Then, the number of mistakes made by the Winnow algorithm is bounded by $2\,(R_\infty^2/\rho_\infty^2) \log N$.

Proof: Let $I$ be the set of rounds $t$ at which there is an update and let $M$ be the total number of updates.


Winnow Algorithm - Bound

Potential (relative entropy):

$$\Phi_t = \sum_{i=1}^{N} \frac{v_i}{\|v\|_1} \log \frac{v_i/\|v\|_1}{w_{t,i}}.$$

Upper bound: for each $t$ in $I$,

$$
\begin{aligned}
\Phi_{t+1} - \Phi_t
&= \sum_{i=1}^{N} \frac{v_i}{\|v\|_1} \log \frac{w_{t,i}}{w_{t+1,i}}
 = \sum_{i=1}^{N} \frac{v_i}{\|v\|_1} \log \frac{Z_t}{\exp(\eta y_t x_{t,i})}\\
&= \log Z_t - \eta \sum_{i=1}^{N} \frac{v_i}{\|v\|_1}\, y_t x_{t,i}
 \le \log\Bigl(\sum_{i=1}^{N} w_{t,i} \exp(\eta y_t x_{t,i})\Bigr) - \eta \rho_\infty\\
&= \log \mathop{\mathbb{E}}_{w_t}\bigl[\exp(\eta y_t x_t)\bigr] - \eta \rho_\infty\\
&\le \log\bigl(\exp(\eta^2 (2R_\infty)^2/8)\bigr) + \eta\, y_t w_t \cdot x_t - \eta \rho_\infty && \text{(Hoeffding)}\\
&\le \eta^2 R_\infty^2/2 - \eta \rho_\infty,
\end{aligned}
$$

where the last step uses $y_t(w_t \cdot x_t) \le 0$, which holds since there is an update at round $t$.


Winnow Algorithm - Bound

Upper bound: summing up these inequalities yields

$$\Phi_{T+1} - \Phi_1 \le M\bigl(\eta^2 R_\infty^2/2 - \eta \rho_\infty\bigr).$$

Lower bound: note that

$$\Phi_1 = \sum_{i=1}^{N} \frac{v_i}{\|v\|_1} \log \frac{v_i/\|v\|_1}{1/N} = \log N + \sum_{i=1}^{N} \frac{v_i}{\|v\|_1} \log \frac{v_i}{\|v\|_1} \le \log N,$$

and, for all $t$, $\Phi_t \ge 0$ (property of the relative entropy). Thus,

$$\Phi_{T+1} - \Phi_1 \ge 0 - \log N = -\log N.$$

Comparison: combining the two bounds gives $-\log N \le M(\eta^2 R_\infty^2/2 - \eta \rho_\infty)$. For $\eta = \rho_\infty / R_\infty^2$ we obtain

$$M \le \frac{2\, R_\infty^2 \log N}{\rho_\infty^2}.$$


Notes

Comparison with the perceptron bound:

• dual norms: $\|\cdot\|_\infty$ and $\|\cdot\|_1$ norms for $x_t$ and $v$.

• similar bounds with different norms.

• each is advantageous in different cases:

• the Winnow bound is favorable when a sparse set of experts can predict well. For example, if $x_t \in \{\pm 1\}^N$ and $v = e_1$, the bounds compare as $\log N$ vs $N$ (see the numeric check below).

• the perceptron bound is favorable in the opposite situation.
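As an illustrative numeric check of that comparison (this computation is not from the slides; it simply plugs the stated quantities into the two mistake bounds):

```python
import numpy as np

# x_t in {-1, +1}^N and v = e_1: a single relevant feature whose sign equals the label.
N = 10_000
R_inf, rho_inf = 1.0, 1.0            # ||x_t||_inf = 1,  y_t (v . x_t) / ||v||_1 = 1
R, rho = np.sqrt(N), 1.0             # ||x_t||_2 = sqrt(N),  y_t (v . x_t) / ||v||_2 = 1

winnow_bound = 2 * (R_inf / rho_inf) ** 2 * np.log(N)   # about 18.4
perceptron_bound = (R / rho) ** 2                        # 10000.0
print(winnow_bound, perceptron_bound)
```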