Introduction to Machine Learning
Lecture 6

Mehryar Mohri
Courant Institute and Google Research
[email protected]


Perceptron and Winnow


This Lecture

On-Line linear classification: two algorithms.

• Perceptron algorithm.

• Winnow algorithm.



Linear Classification

Definition: a linear classifier is an algorithm that returns a hypothesis of the form

$$x \mapsto \mathrm{sgn}(w \cdot x + b), \quad \text{with } w \in \mathbb{R}^N,\ b \in \mathbb{R}.$$

[Figure: the separating hyperplane $w \cdot x + b = 0$, with the positive and negative half-spaces marked + and -.]


Margin Definitions

Definition: the (geometric) margin of a point $x$ with label $y$ for a linear classifier $h\colon x \mapsto w \cdot x + b$ is its algebraic distance to the hyperplane $w \cdot x + b = 0$:

$$\rho(x) = \frac{y(w \cdot x + b)}{\|w\|}.$$

Definition: the margin of a linear classifier $h$ for a sample $S = (x_1, \ldots, x_m)$ is the minimum margin of the points in that sample:

$$\rho = \min_{1 \le i \le m} \frac{y_i(w \cdot x_i + b)}{\|w\|}.$$
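For intuition, here is a small numeric check of these two definitions; the hyperplane and sample below are made up for illustration and are not from the lecture.

```python
import numpy as np

# Illustrative hyperplane and labeled sample (made up for this sketch).
w, b = np.array([3.0, 4.0]), -1.0                # ||w|| = 5
X = np.array([[2.0, 1.0], [0.0, 0.0], [0.0, 1.0]])
y = np.array([+1, -1, +1])

margins = y * (X @ w + b) / np.linalg.norm(w)    # geometric margin of each point
print(margins)                                    # [1.8, 0.2, 0.6]
print(margins.min())                              # margin of the classifier on S: 0.2
```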


Perceptron Algorithm


(Rosenblatt, 1958)

Perceptron(w0)
 1  w1 ← w0                         ▷ typically w0 = 0
 2  for t ← 1 to T do
 3      Receive(xt)
 4      ŷt ← sgn(wt · xt)
 5      Receive(yt)
 6      if (ŷt ≠ yt) then
 7          wt+1 ← wt + yt xt       ▷ more generally wt + η yt xt, with η > 0
 8      else wt+1 ← wt
 9  return wT+1
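A minimal Python sketch of this update rule (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def perceptron(X, y, T=1, eta=1.0):
    """On-line perceptron over T passes; labels y_t are in {-1, +1}."""
    w = np.zeros(X.shape[1])                   # w_0 = 0, as in the pseudocode
    for _ in range(T):
        for x_t, y_t in zip(X, y):
            y_hat = 1.0 if w @ x_t >= 0 else -1.0
            if y_hat != y_t:                   # mistake: move w toward y_t x_t
                w = w + eta * y_t * x_t
    return w
```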


Perceptron - Notes

Update: if $y_t(w_t \cdot x_t) < 0$, then

$$y_t(w_{t+1} \cdot x_t) = y_t(w_t \cdot x_t) + \underbrace{\eta \|x_t\|^2}_{\ge 0},$$

i.e., the change is in the desired direction.

Different modes of application:

• repeated passes over a sample of size $m$ drawn according to some distribution $D$.

• infinite sample drawn according to $D$.

• no distributional assumption.


Separating Hyperplane

[Figure: margin and errors. Two panels show a separating hyperplane $w \cdot x = 0$; for a misclassified point $x_i$, its distance to the hyperplane is $-y_i(w \cdot x_i)/\|w\|$.]


Perceptron = Stochastic Gradient Descent

Objective function: convex but not differentiable,

$$F(w) = \frac{1}{T} \sum_{t=1}^{T} \max\bigl(0,\, -y_t(w \cdot x_t)\bigr) = \mathop{\mathbb{E}}_{x \sim \widehat{D}}[f(w, x)], \qquad \text{with } f(w, x) = \max\bigl(0,\, -y(w \cdot x)\bigr).$$

Stochastic gradient: for each $x_t$, the update is

$$w_{t+1} \leftarrow \begin{cases} w_t - \eta \nabla_w f(w_t, x_t) & \text{if } f(\cdot, x_t) \text{ is differentiable at } w_t,\\ w_t & \text{otherwise,} \end{cases}$$

where $\eta > 0$ is a learning rate parameter. Here this gives:

$$w_{t+1} \leftarrow \begin{cases} w_t + \eta\, y_t x_t & \text{if } y_t(w_t \cdot x_t) < 0,\\ w_t & \text{otherwise.} \end{cases}$$
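A short sketch of this correspondence, assuming (as is conventional) that no update is made at the kink where $y_t(w_t \cdot x_t) = 0$:

```python
import numpy as np

def sgd_perceptron_step(w, x_t, y_t, eta=1.0):
    """One stochastic (sub)gradient step on f(w, x) = max(0, -y (w . x))."""
    if y_t * (w @ x_t) < 0:            # f is differentiable here; grad_w f = -y_t x_t
        return w + eta * y_t * x_t     # w - eta * grad  ==  the perceptron update
    return w                            # zero-loss region (or the kink at 0): no update
```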


Perceptron Algorithm - Bound

(Novikoff, 1962)

Theorem: Assume that $\|x_t\| \le R$ for all $t \in [1, T]$, and that for some $\rho > 0$ and $v \in \mathbb{R}^N$, for all $t \in [1, T]$,

$$\rho \le \frac{y_t(v \cdot x_t)}{\|v\|}.$$

Then, the number of mistakes made by the perceptron algorithm is bounded by $R^2/\rho^2$.

Proof: Let $I$ be the set of rounds $t$ at which there is an update and let $M$ be the total number of updates.


• Summing up the assumption inequalities gives:

$$
\begin{aligned}
M\rho &\le \frac{v \cdot \sum_{t \in I} y_t x_t}{\|v\|}
      = \frac{v \cdot \sum_{t \in I} (w_{t+1} - w_t)}{\|v\|} && \text{(definition of the updates)}\\
&= \frac{v \cdot w_{T+1}}{\|v\|}
  \le \|w_{T+1}\| && \text{(Cauchy-Schwarz inequality)}\\
&= \|w_{t_m} + y_{t_m} x_{t_m}\| && (t_m \text{ largest } t \text{ in } I)\\
&= \bigl(\|w_{t_m}\|^2 + \|x_{t_m}\|^2 + \underbrace{2\, y_{t_m} w_{t_m} \cdot x_{t_m}}_{\le 0}\bigr)^{1/2}\\
&\le \bigl(\|w_{t_m}\|^2 + R^2\bigr)^{1/2}
  \le \bigl(M R^2\bigr)^{1/2} = \sqrt{M}\, R. && \text{(applying the same steps to the previous } t\text{'s in } I)
\end{aligned}
$$

Thus $M\rho \le \sqrt{M}\, R$, which gives $M \le R^2/\rho^2$.


• Notes:

• the bound is independent of the dimension and tight.

• convergence can be slow for a small margin; it can be in $\Omega(2^N)$.

• among the many variants: the voted perceptron algorithm. Predict according to

$$\mathrm{sgn}\Bigl(\bigl(\textstyle\sum_{t \in I} c_t w_t\bigr) \cdot x\Bigr),$$

where $c_t$ is the number of iterations $w_t$ survives (a sketch follows below).

• $\{x_t : t \in I\}$ are the support vectors for the perceptron algorithm.

• non-separable case: the algorithm does not converge.
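A rough sketch of that variant, assuming we store every intermediate weight vector together with the number of rounds it survives (all names are illustrative):

```python
import numpy as np

def voted_perceptron_train(X, y, T=1, eta=1.0):
    """Return the list of (w_t, c_t) pairs produced during training."""
    w, c, survivors = np.zeros(X.shape[1]), 0, []
    for _ in range(T):
        for x_t, y_t in zip(X, y):
            if y_t * (w @ x_t) <= 0:          # mistake: retire w with its count, update
                survivors.append((w.copy(), c))
                w, c = w + eta * y_t * x_t, 1
            else:
                c += 1
    survivors.append((w.copy(), c))
    return survivors

def voted_perceptron_predict(survivors, x):
    """Prediction rule from the slide: sgn((sum_t c_t w_t) . x)."""
    combined = sum(c * w for w, c in survivors)
    return 1 if combined @ x >= 0 else -1
```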


Leave-One-Out Error

Definition: let $h_S$ be the hypothesis output by a learning algorithm $L$ after receiving a sample $S$ of size $m$. Then, the leave-one-out error of $L$ over $S$ is:

$$\widehat{R}_{\mathrm{loo}}(L) = \frac{1}{m} \sum_{i=1}^{m} 1_{h_{S - \{x_i\}}(x_i) \ne f(x_i)}.$$

Property: unbiased estimate of the expected error of a hypothesis trained on a sample of size $m - 1$:

$$
\begin{aligned}
\mathop{\mathbb{E}}_{S \sim D^m}\bigl[\widehat{R}_{\mathrm{loo}}(L)\bigr]
&= \frac{1}{m} \sum_{i=1}^{m} \mathop{\mathbb{E}}_{S}\bigl[1_{h_{S - \{x_i\}}(x_i) \ne f(x_i)}\bigr]
 = \mathop{\mathbb{E}}_{S}\bigl[1_{h_{S - \{x\}}(x) \ne f(x)}\bigr]\\
&= \mathop{\mathbb{E}}_{S \sim D^{m-1}}\Bigl[\mathop{\mathbb{E}}_{x \sim D}\bigl[1_{h_S(x) \ne f(x)}\bigr]\Bigr]
 = \mathop{\mathbb{E}}_{S \sim D^{m-1}}\bigl[R(h_S)\bigr].
\end{aligned}
$$


Perceptron - Leave-One-Out Analysis

Theorem: Assume that the data is separable. Let $h_S$ be the hypothesis returned by the perceptron algorithm after training on a sample $S$ (repeated passes), let $M(S)$ be the number of updates made, and let $R(h_S)$ be the error of $h_S$. Then,

$$\mathop{\mathbb{E}}_{S \sim D^m}[R(h_S)] \le \mathop{\mathbb{E}}_{S \sim D^{m+1}}\left[\frac{\min\bigl(M(S),\ R_{m+1}^2/\rho_{m+1}^2\bigr)}{m + 1}\right].$$

Proof: Let $S \sim D^{m+1}$ and let $x$ be a point in the sample $S$. If $h_{S - \{x\}}$ misclassifies $x$, then there must have been an update at $x$ during training over $S$ to obtain $h_S$. Thus,

$$\widehat{R}_{\mathrm{loo}}(\text{perceptron}) \le \frac{M(S)}{m + 1}.$$


Dual Perceptron Algorithm


Dual-Perceptron(α0)
 1  α1 ← α0                         ▷ typically α0 = 0
 2  for t ← 1 to T do
 3      Receive(xt)
 4      ŷt ← sgn(Σ_{s=1}^{T} αs ys (xs · xt))
 5      Receive(yt)
 6      if (ŷt ≠ yt) then
 7          αt+1 ← αt + 1
 8      else αt+1 ← αt
 9  return α


Kernel Perceptron Algorithm


(Aizerman et al., 1964)

K is a PDS kernel.

Kernel-Perceptron(α0)
 1  α1 ← α0                         ▷ typically α0 = 0
 2  for t ← 1 to T do
 3      Receive(xt)
 4      ŷt ← sgn(Σ_{s=1}^{T} αs ys K(xs, xt))
 5      Receive(yt)
 6      if (ŷt ≠ yt) then
 7          αt+1 ← αt + 1
 8      else αt+1 ← αt
 9  return α
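A minimal Python sketch of this kernelized form; the Gaussian kernel below is just one illustrative choice of PDS kernel, and the names are not from the slides.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """An example PDS kernel (illustrative choice)."""
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))

def kernel_perceptron(X, y, K=gaussian_kernel, T=1):
    """alpha[s] counts the updates triggered at x_s; prediction uses only kernel values."""
    m = len(X)
    alpha = np.zeros(m)
    for _ in range(T):
        for t in range(m):
            score = sum(alpha[s] * y[s] * K(X[s], X[t]) for s in range(m))
            y_hat = 1 if score >= 0 else -1
            if y_hat != y[t]:
                alpha[t] += 1
    return alpha
```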


XOR Problem

Use the second-degree polynomial kernel with $c = 1$. The four points $(1, 1)$, $(-1, -1)$, $(-1, 1)$, $(1, -1)$ with XOR labels are linearly non-separable in the original space $(x_1, x_2)$, but linearly separable in feature space, by the hyperplane $x_1 x_2 = 0$:

$$
\begin{aligned}
(1, 1) &\mapsto (1,\ 1,\ +\sqrt{2},\ +\sqrt{2},\ +\sqrt{2},\ 1), &
(-1, -1) &\mapsto (1,\ 1,\ +\sqrt{2},\ -\sqrt{2},\ -\sqrt{2},\ 1),\\
(1, -1) &\mapsto (1,\ 1,\ -\sqrt{2},\ +\sqrt{2},\ -\sqrt{2},\ 1), &
(-1, 1) &\mapsto (1,\ 1,\ -\sqrt{2},\ -\sqrt{2},\ +\sqrt{2},\ 1).
\end{aligned}
$$

[Figure: left panel, the four points in the $(x_1, x_2)$ plane, linearly non-separable; right panel, the same points in the $(\sqrt{2}\, x_1 x_2,\ \sqrt{2}\, x_1)$ plane, linearly separable by $x_1 x_2 = 0$.]
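A small check of this example; the explicit feature map of the degree-2 polynomial kernel with $c = 1$ is written out for illustration (the coordinate ordering is one common convention, not necessarily the slides').

```python
import numpy as np

def phi(x):
    """Explicit feature map of the degree-2 polynomial kernel (x . x' + 1)^2 in 2D."""
    x1, x2 = x
    r2 = np.sqrt(2)
    return np.array([x1**2, x2**2, r2 * x1 * x2, r2 * x1, r2 * x2, 1.0])

X = np.array([(1, 1), (-1, -1), (1, -1), (-1, 1)], dtype=float)

# The single coordinate sqrt(2) * x1 * x2 already separates the two XOR classes:
for x in X:
    print(x, phi(x))   # third coordinate is +sqrt(2) for (1,1),(-1,-1), -sqrt(2) otherwise
```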


Non-Separable Case
(Freund and Schapire, 1998)

Theorem: Let $v$ be any vector with $\|v\| = 1$ and let $\rho > 0$. Define the deviation of $x_t$ by

$$d_t = \max\bigl(0,\ \rho - y_t(v \cdot x_t)\bigr),$$

and let $D = \Bigl(\sum_{t=1}^{T} d_t^2\Bigr)^{1/2}$. Then, the number of perceptron updates after processing $x_1, \ldots, x_T$ is bounded by

$$\left(\frac{R + D}{\rho}\right)^2.$$


• Proof: reduce the problem to the separable case in a higher dimension.

• Mapping (similar to the trivial mapping), with $\Delta$ placed in the $(N + t)$th component of $x_t'$:

$$
x_t = \begin{pmatrix} x_{t,1} \\ \vdots \\ x_{t,N} \end{pmatrix}
\ \mapsto\
x_t' = \begin{pmatrix} x_{t,1} \\ \vdots \\ x_{t,N} \\ 0 \\ \vdots \\ 0 \\ \Delta \\ 0 \\ \vdots \\ 0 \end{pmatrix},
\qquad
v \ \mapsto\
v' = \begin{pmatrix} v_1/Z \\ \vdots \\ v_N/Z \\ y_1 d_1/(\Delta Z) \\ \vdots \\ y_T d_T/(\Delta Z) \end{pmatrix},
\qquad
\|v'\| = 1 \ \Rightarrow\ Z = \sqrt{1 + \frac{D^2}{\Delta^2}}.
$$


• Now,

$$
y_t(v' \cdot x_t')
= y_t \frac{v \cdot x_t}{Z} + \Delta \frac{y_t d_t}{\Delta Z}
= \frac{y_t(v \cdot x_t)}{Z} + \frac{d_t}{Z}
\ \ge\ \frac{y_t(v \cdot x_t)}{Z} + \frac{\rho - y_t(v \cdot x_t)}{Z}
= \frac{\rho}{Z}.
$$

• Since $\|x_t'\|^2 \le R^2 + \Delta^2$, the bound of the separable case applies:

$$\frac{(R^2 + \Delta^2)\,(1 + D^2/\Delta^2)}{\rho^2}.$$

• With $\Delta = \sqrt{RD}$, this bound is minimized and equal to $\dfrac{(R + D)^2}{\rho^2}$.

• The predictions made by the perceptron in the higher-dimensional space coincide with those of the perceptron in the original space.


Winnow Algorithm


Winnow(η)
 1   w1 ← 1/N
 2   for t ← 1 to T do
 3       Receive(xt)
 4       ŷt ← sgn(wt · xt)          ▷ yt ∈ {−1, +1}
 5       Receive(yt)
 6       if (ŷt ≠ yt) then
 7           Zt ← Σ_{i=1}^{N} wt,i exp(η yt xt,i)
 8           for i ← 1 to N do
 9               wt+1,i ← wt,i exp(η yt xt,i) / Zt
 10      else wt+1 ← wt
 11  return wT+1

(Littlestone, 1988)
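A minimal Python sketch of this multiplicative update (variable names are illustrative):

```python
import numpy as np

def winnow(X, y, eta=0.1, T=1):
    """Winnow with normalized multiplicative updates on mistakes; labels in {-1, +1}."""
    N = X.shape[1]
    w = np.full(N, 1.0 / N)                    # w_1 = 1/N (uniform weights)
    for _ in range(T):
        for x_t, y_t in zip(X, y):
            y_hat = 1.0 if w @ x_t >= 0 else -1.0
            if y_hat != y_t:                   # mistake: exponential reweighting
                w = w * np.exp(eta * y_t * x_t)
                w = w / w.sum()                # normalize by Z_t
    return w
```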


Winnow - Notes

Winnow = weighted majority:

• for $y_{t,i} = x_{t,i} \in \{-1, +1\}$, $\mathrm{sgn}(w_t \cdot x_t)$ coincides with the majority vote.

• multiplying by $e^{\eta}$ or $e^{-\eta}$ the weight of correct or incorrect experts is equivalent to multiplying by $\beta = e^{-2\eta}$ the weight of the incorrect ones.

Relationships with other algorithms: e.g., boosting and Perceptron (Winnow and Perceptron can be viewed as special instances of a general family).

Motivation: large number of irrelevant features.


Winnow Algorithm - Bound

Theorem: Assume that $\|x_t\|_\infty \le R_\infty$ for all $t \in [1, T]$, and that for some $\rho_\infty > 0$ and $v \in \mathbb{R}^N$, $v \ge 0$, for all $t \in [1, T]$,

$$\rho_\infty \le \frac{y_t(v \cdot x_t)}{\|v\|_1}.$$

Then, the number of mistakes made by the Winnow algorithm is bounded by $2\,(R_\infty^2/\rho_\infty^2) \log N$.

Proof: Let $I$ be the set of rounds $t$ at which there is an update and let $M$ be the total number of updates.


Winnow Algorithm - Bound

Potential (relative entropy):

$$\Phi_t = \sum_{i=1}^{N} \frac{v_i}{\|v\|_1} \log \frac{v_i/\|v\|_1}{w_{t,i}}.$$

Upper bound: for each $t$ in $I$,

$$
\begin{aligned}
\Phi_{t+1} - \Phi_t
&= \sum_{i=1}^{N} \frac{v_i}{\|v\|_1} \log \frac{w_{t,i}}{w_{t+1,i}}
 = \sum_{i=1}^{N} \frac{v_i}{\|v\|_1} \log \frac{Z_t}{\exp(\eta y_t x_{t,i})}\\
&= \log Z_t - \eta \sum_{i=1}^{N} \frac{v_i}{\|v\|_1}\, y_t x_{t,i}
 \le \log\Bigl(\sum_{i=1}^{N} w_{t,i} \exp(\eta y_t x_{t,i})\Bigr) - \eta \rho_\infty\\
&= \log \mathop{\mathbb{E}}_{w_t}\bigl[\exp(\eta y_t x_t)\bigr] - \eta \rho_\infty\\
&\le \log\bigl(\exp(\eta^2 (2R_\infty)^2/8)\bigr) + \eta\, y_t w_t \cdot x_t - \eta \rho_\infty && \text{(Hoeffding)}\\
&\le \eta^2 R_\infty^2/2 - \eta \rho_\infty,
\end{aligned}
$$

where the last step uses $y_t(w_t \cdot x_t) \le 0$, which holds since there is an update at round $t$.


Winnow Algorithm - Bound

Upper bound: summing up these inequalities yields

$$\Phi_{T+1} - \Phi_1 \le M\bigl(\eta^2 R_\infty^2/2 - \eta \rho_\infty\bigr).$$

Lower bound: note that

$$\Phi_1 = \sum_{i=1}^{N} \frac{v_i}{\|v\|_1} \log \frac{v_i/\|v\|_1}{1/N} = \log N + \sum_{i=1}^{N} \frac{v_i}{\|v\|_1} \log \frac{v_i}{\|v\|_1} \le \log N,$$

and, for all $t$, $\Phi_t \ge 0$ (property of the relative entropy). Thus,

$$\Phi_{T+1} - \Phi_1 \ge 0 - \log N = -\log N.$$

Comparison: combining the two bounds gives $-\log N \le M(\eta^2 R_\infty^2/2 - \eta \rho_\infty)$. For $\eta = \rho_\infty / R_\infty^2$ we obtain

$$M \le \frac{2\, R_\infty^2 \log N}{\rho_\infty^2}.$$


Notes

Comparison with the perceptron bound:

• dual norms: $\|\cdot\|_\infty$ and $\|\cdot\|_1$ norms for $x_t$ and $v$.

• similar bounds with different norms.

• each is advantageous in different cases:

• the Winnow bound is favorable when a sparse set of experts can predict well. For example, if $x_t \in \{\pm 1\}^N$ and $v = e_1$, the bounds compare as $\log N$ vs $N$ (see the numeric check below).

• the perceptron bound is favorable in the opposite situation.
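As an illustrative numeric check of that comparison (this computation is not from the slides; it simply plugs the stated quantities into the two mistake bounds):

```python
import numpy as np

# x_t in {-1, +1}^N and v = e_1: a single relevant feature whose sign equals the label.
N = 10_000
R_inf, rho_inf = 1.0, 1.0            # ||x_t||_inf = 1,  y_t (v . x_t) / ||v||_1 = 1
R, rho = np.sqrt(N), 1.0             # ||x_t||_2 = sqrt(N),  y_t (v . x_t) / ||v||_2 = 1

winnow_bound = 2 * (R_inf / rho_inf) ** 2 * np.log(N)   # about 18.4
perceptron_bound = (R / rho) ** 2                        # 10000.0
print(winnow_bound, perceptron_bound)
```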