The Perceptron
Nuno Vasconcelos, ECE Department, UCSD
Classification
a classification problem has two types of variables
• X - vector of observations (features) in the world
• Y - state (class) of the world
e.g.
• x ∈ X ⊂ R2 = (fever, blood pressure)
• y ∈ Y = {disease, no disease}
X, Y related by an (unknown) function

  y = f(x)

goal: design a classifier h: X → Y such that h(x) = f(x) ∀x
Linear discriminant
the classifier implements the linear decision rule

  h*(x) = sgn[g(x)]  with  g(x) = wTx + b,
  i.e. h*(x) = 1 if g(x) > 0 and h*(x) = -1 if g(x) < 0

it has the properties
• it divides X into two "half-spaces"
• the boundary is the plane with:
  • normal w
  • distance to the origin b/||w||
• g(x)/||w|| is the distance from point x to the boundary
• g(x) = 0 for points on the plane
• g(x) > 0 on the side w points to ("positive side")
• g(x) < 0 on the "negative side"
Linear discriminant
the classifier implements the linear decision rule

  h*(x) = sgn[g(x)]  with  g(x) = wTx + b

given a linearly separable training set

  D = {(x1,y1), ..., (xn,yn)}

there are no errors if and only if, ∀ i
• yi = 1 and g(xi) > 0, or
• yi = -1 and g(xi) < 0
• i.e. yi g(xi) > 0
this allows a very concise expression for the situation of "no training error" or "zero empirical risk"
Learning as optimization
necessary and sufficient condition for zero empirical risk:

  yi(wTxi + b) > 0, ∀ i

this is interesting because it allows the formulation of the learning problem as one of function optimization
• starting from a random guess for the parameters w and b
• we maximize the reward function

    Σi=1..n yi(wTxi + b)

• or, equivalently, minimize the cost function

    J(w,b) = - Σi=1..n yi(wTxi + b)
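as a rough illustration (not part of the original slides), a minimal numpy sketch of this cost; the function name is mine, and X is assumed to be an (n, d) array of feature vectors with y an (n,) array of ±1 labels:

```python
import numpy as np

def cost_J(w, b, X, y):
    """Cost J(w,b) = -sum_i y_i (w^T x_i + b) over the whole training set."""
    margins = y * (X @ w + b)   # y_i (w^T x_i + b) for every point
    return -np.sum(margins)     # negative when every point is correctly classified
```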
The gradient
we have seen that the gradient of a function f(w) at z is

  ∇f(z) = ( ∂f/∂w0 (z), ..., ∂f/∂wn-1 (z) )T

Theorem: the gradient points in the direction of maximum growth
the gradient is
• the direction of greatest increase of f(x) at z
• normal to the iso-contours of f(.)
Critical point conditions
let f(x) be continuously differentiable
x* is a local minimum of f(x) if and only if
• f has zero gradient at x*

    ∇f(x*) = 0

• and the Hessian of f at x* is positive definite

    dT ∇2f(x*) d ≥ 0, ∀ d ∈ Rn

• where the Hessian is the matrix of second derivatives

    ∇2f(x) = [ ∂2f/∂x0∂x0     ...  ∂2f/∂x0∂xn-1
                    ...
               ∂2f/∂xn-1∂x0   ...  ∂2f/∂xn-1∂xn-1 ]
Gradient descent
this suggests a simple minimization technique
• pick an initial estimate x(0)
• follow the negative gradient

    x(n+1) = x(n) - η ∇f(x(n))

this is gradient descent
η is the learning rate and needs to be carefully chosen
• if η is too large, the descent may diverge
many extensions are possible
main point:
• once framed as optimization, we can (in general) solve it
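as a sketch of the generic procedure (not from the slides; the function names and the quadratic test function are my assumptions):

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, n_steps=100):
    """Generic gradient descent: x(n+1) = x(n) - eta * grad_f(x(n))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - eta * grad_f(x)
    return x

# e.g. minimizing f(x) = ||x||^2, whose gradient is 2x
x_min = gradient_descent(lambda x: 2 * x, x0=[3.0, -2.0])
```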
The perceptron
this was the main insight of Rosenblatt, which led to the Perceptron
the basic idea is to do gradient descent on our cost

  J(w,b) = - Σi=1..n yi(wTxi + b)

we know that:
• if the training set is linearly separable, there is at least a pair (w,b) such that J(w,b) < 0
• any minimum that is equal to or better than this will do
Q: can we find one such minimum?
Perceptron learning
the gradient is straightforward to compute

  ∂J/∂w = - Σi yi xi        ∂J/∂b = - Σi yi

and gradient descent is trivial
there is, however, one problem:
• J(w,b) is not bounded below
• if J(w,b) < 0, we can make J → −∞ by multiplying w and b by λ > 0
• the minimum is always at −∞, which is quite bad, numerically
this is really just the normalization problem that we already talked about
Rosenblatt's idea
restrict attention to the points incorrectly classified
at each iteration, define the set of errors

  E = { xi | yi(wTxi + b) < 0 }

and make the cost

  Jp(w,b) = - Σxi∈E yi(wTxi + b)

note that
• Jp cannot be negative since, in E, all yi(wTxi + b) are negative
• if we get to zero, we know we have the best possible solution (E empty)
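a minimal numpy sketch of this restricted cost (not from the slides; same assumed array shapes as before):

```python
import numpy as np

def perceptron_cost(w, b, X, y):
    """Rosenblatt's cost J_p: sum only over the currently misclassified points."""
    margins = y * (X @ w + b)          # y_i (w^T x_i + b)
    errors = margins < 0               # boolean mask of the error set E
    return -np.sum(margins[errors])    # >= 0, and == 0 iff E is empty
```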
Perceptron learning
is trivial, just do gradient descent on Jp(w,b)

  w(n+1) = w(n) + η Σxi∈E yi xi
  b(n+1) = b(n) + η Σxi∈E yi

this turns out not to be very effective if D is large
• we loop over the entire training set to take a small step at the end
one alternative that frequently is better is "stochastic gradient descent"
• take the step immediately after each point
• no guarantee this is a descent step but, on average, you follow the same direction after processing the entire D
• very popular in learning, where D is usually large
Perceptron learning
the algorithm is as follows (we will talk about R shortly!):
set k = 0, wk = 0, bk = 0
set R = maxi ||xi||
do {
  for i = 1:n {
    if yi(wkTxi + bk) < 0 then {
      wk+1 = wk + η yi xi
      bk+1 = bk + η yi R2
      k = k+1
    }
  }
} until yi(wkTxi + bk) ≥ 0, ∀ i (no errors)
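a minimal Python sketch of this pseudocode (not from the slides): X is assumed to be an (n, d) numpy array and y an (n,) array of ±1 labels; the max_epochs safeguard is my addition, and the misclassification test uses ≤ 0 so that updates are triggered from the all-zero initialization:

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, max_epochs=1000):
    """Stochastic perceptron learning, following the pseudocode above."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    R = np.max(np.linalg.norm(X, axis=1))     # R = max_i ||x_i||
    for _ in range(max_epochs):                # guard against non-separable data
        errors = 0
        for i in range(n):
            if y[i] * (w @ X[i] + b) <= 0:     # misclassified (or on the boundary)
                w = w + eta * y[i] * X[i]
                b = b + eta * y[i] * R**2
                errors += 1
        if errors == 0:                        # a full pass with no errors: done
            return w, b
    return w, b
```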
Perceptron learning
does this make sense? consider the example below
(figure, shown over several slides: a 2D training set with class y=1 marked "x" and class y=-1 marked "o", and the current boundary defined by wk and bk; when a misclassified point xi is processed, the update rotates the normal from wk to wk+1 = wk + η yi xi and shifts the offset from bk to bk+1 = bk + η yi R2, moving the boundary toward a position that classifies xi correctly)
Perceptron learning
OK, makes intuitive sense
how do we know it will not get stuck on a local minimum?
this was Rosenblatt's seminal contribution
Theorem: Let D = {(x1,y1), ..., (xn,yn)} and

  R = maxi ||xi||     (*)

If there is (w*,b*) such that ||w*|| = 1 and

  yi(w*Txi + b*) > γ, ∀ i     (**)

then the Perceptron will find an error-free hyper-plane in at most

  (2R/γ)2

iterations
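as a purely hypothetical numerical illustration (these values are not from the lecture): with R = 1 and margin γ = 0.1, the theorem guarantees convergence after at most (2·1/0.1)2 = 400 mistake-driven updates, independently of the number of training points n.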
Proof
not that hard
denote the iteration by t, and assume the point processed at iteration t-1 is (xi,yi)
for simplicity, use homogeneous coordinates. Defining

  at = [ wt ; bt/R ]     zi = [ xi ; R ]

allows the compact notation

  y (wTx + b) = y aT z

since only misclassified points are processed, we have

  yi (wt-1T xi + bt-1) = yi at-1T zi < 0     (***)
Proof
how does at evolve?

  at = [ wt ; bt/R ] = [ wt-1 ; bt-1/R ] + η yi [ xi ; R ] = at-1 + η yi zi

denoting the optimal solution by a* = (w*, b*/R)T

  atT a* = at-1T a* + η yi ziT a* = at-1T a* + η yi (w*Txi + b*)

and, from (**),

  atT a* > at-1T a* + ηγ

solving the recursion

  atT a* > at-1T a* + ηγ > at-2T a* + 2ηγ > ... > t ηγ
Proof
this means convergence to a* if we can bound the magnitude of at. What is this magnitude? Since

  at = at-1 + η yi zi

we have

  ||at||2 = atT at = ||at-1||2 + 2η yi at-1T zi + η2 ||zi||2
          < ||at-1||2 + η2 ||zi||2            from (***)
          = ||at-1||2 + η2 (||xi||2 + R2)     from the definition of z
          ≤ ||at-1||2 + 2η2 R2                from (*)

solving the recursion

  ||at||2 < 2t η2 R2
Proof
combining the two

  t ηγ < atT a* ≤ ||at|| ||a*|| < ||a*|| η √(2t) R

and

  t < 2R2 ||a*||2 / γ2
    = (2R2/γ2) ( ||w*||2 + b*2/R2 )      from the definition of a*
    = (2R2/γ2) ( 1 + b*2/R2 )            from ||w*|| = 1

since b*/||w*|| = b* is the distance from the plane to the origin, we have b* < R = maxi ||xi|| and

  t < (2R/γ)2
Note
this is not the "standard proof" (e.g. Duda, Hart, Stork)
standard proof:
• regular algorithm (no R in the update equations)
• tighter bound t < (R/γ)2
this appears better, but requires choosing η = R2/γ, which requires knowledge of γ, which we don't have until we find a*
i.e. the proof is non-constructive, we cannot design the algorithm that way
the algorithm above just works!
hence, I like this proof better despite the looser bound.
Perceptron learning
Theorem: Let D = {(x1,y1), ..., (xn,yn)} and

  R = maxi ||xi||

If there is (w*,b*) such that ||w*|| = 1 and

  yi(w*Txi + b*) > γ, ∀ i

then the Perceptron will find an error-free hyper-plane in at most

  (2R/γ)2

iterations
this result was the start of learning theory
for the first time there was a proof that a learning machine could actually learn something!
The margin
note that

  yi(w*Txi + b*) > γ, ∀ i

will hold if and only if

  γ ≤ mini yi(w*Txi + b*) = mini yi(w*Txi + b*)/||w*||

which is how we defined the margin (recall that ||w*|| = 1)
this says that the bound on the time to convergence, (2R/γ)2, is inversely proportional to the margin
even in this early result, the margin appears as a measure of the difficulty of the learning problem
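a small numpy sketch of this quantity (not from the slides; same assumed array shapes as before, and the value is only a true margin when every point is correctly classified):

```python
import numpy as np

def margin(w, b, X, y):
    """Smallest signed distance y_i (w^T x_i + b) / ||w|| over all training points."""
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)
```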
The role of R
scaling the space should not make a difference as to whether the problem is solvable
R accounts for this
if the xi are re-scaled, both R and γ are re-scaled and the bound (2R/γ)2 remains the same
once again, just a question of normalization
• illustrates the fact that the normalization ||w|| = 1 is usually not sufficient
Some history
Rosenblatt's result generated a lot of excitement about learning in the 50s
later, Minsky and Papert identified serious problems with the Perceptron
• there are very simple logic problems that it cannot solve
• more on this on the homework
this killed off the enthusiasm until an old result by Kolmogorov saved the day
Theorem: any continuous function g(x) defined on [0,1]n can be represented in the form

  g(x) = Σj=1..2n+1 Γj ( Σi=1..d Ψij(xi) )
Some history
noting that the Perceptron can be written as

  h(x) = sgn[ wTx + b ] = sgn[ Σi=1..n wi xi + w0 ]

this looks like having two Perceptron layers
layer 1: J hyper-planes wj

  hj(x) = sgn[ Σi=1..n wji xi + wj0 ],  j = 1,...,J

layer 2: hyperplane v

  u(x) = sgn[ Σj=1..J vj hj(x) + v0 ]
       = sgn[ Σj=1..J vj sgn( Σi=1..n wji xi + wj0 ) + v0 ]
Some history
which can be written as

  u(x) = sgn[ g(x) ]  with  g(x) = Σj=1..J vj sgn( Σi wji xi + wj0 ) + v0

and resembles

  g(x) = Σj=1..2n+1 Γj ( Σi=1..d Ψij(xi) )

it suggested the idea that
• while one Perceptron is not good enough
• maybe a multi-layered Perceptron (MLP) will work
a lot of work on MLPs ensued under the name of neural networks
eventually, it was shown that most functions can be approximated by MLPs
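a minimal sketch of the two-layer form above (not from the slides; the parameter layout is my assumption: W is a (J, d) matrix of first-layer weights, w0 a (J,) vector of biases, v a (J,) vector of second-layer weights, v0 a scalar bias):

```python
import numpy as np

def two_layer_perceptron(x, W, w0, v, v0):
    """u(x) = sgn[ sum_j v_j sgn(W_j . x + w0_j) + v0 ]."""
    h = np.sign(W @ x + w0)        # layer 1: J hyper-plane decisions
    return np.sign(v @ h + v0)     # layer 2: combine them with one more hyper-plane
```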
Graphical representation
the Perceptron is usually represented as a network diagram with
• input units: the coordinates of x
• weights: the coordinates of w
• a bias term w0
with homogeneous coordinates x = (x,1)T, the bias is absorbed into the weight vector:

  h(x) = sgn( Σi wi xi + w0 ) = sgn( wTx )
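a tiny sketch of this homogeneous-coordinate trick (not from the slides; names are mine, and w_aug is assumed to stack (w, w0)):

```python
import numpy as np

def perceptron_output(w_aug, x):
    """Perceptron decision with the bias treated as just another weight."""
    x_aug = np.append(x, 1.0)          # x = (x, 1)^T
    return np.sign(w_aug @ x_aug)      # h(x) = sgn(w^T x)
```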
Sigmoids
the sgn[.] function is problematic in two ways:
• no derivative at 0
• non-smooth
approximations: it can be approximated in various ways, for example by the hyperbolic tangent

  f(x) = tanh(σx) = (e^(σx) - e^(-σx)) / (e^(σx) + e^(-σx))

σ controls the approximation error, but, unlike sgn[.], the tanh
• has a derivative everywhere
• is smooth
neural networks are implemented with these functions
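a small sketch of this smooth approximation (not from the slides; the function name and the example values are mine):

```python
import numpy as np

def smooth_sgn(x, sigma=5.0):
    """Approximate sgn[.] by tanh(sigma * x); larger sigma gives a sharper transition."""
    return np.tanh(sigma * x)

# as sigma grows, the output approaches the hard +/-1 threshold
print(smooth_sgn(np.array([-1.0, -0.1, 0.0, 0.1, 1.0]), sigma=5.0))
```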
Neural network
the MLP as function approximation