The Perceptron
Nuno Vasconcelos, ECE Department, UCSD
Classification
a classification problem has two types of variables
• X - vector of observations (features) in the world
• Y - state (class) of the world
e.g.
• x ∈ X ⊂ R2 = (fever, blood pressure)
• y ∈ Y = {disease, no disease}
X, Y related by an (unknown) function

  y = f(x)

goal: design a classifier h: X → Y such that h(x) = f(x) ∀x
Linear discriminant
the classifier implements the linear decision rule

  h*(x) = sgn[g(x)]  with  g(x) = wTx + b,
  i.e. h*(x) = 1 if g(x) > 0 and h*(x) = -1 if g(x) < 0

it has the properties
• it divides X into two "half-spaces"
• the boundary is the plane with:
  • normal w
  • distance to the origin b/||w||
• g(x)/||w|| is the distance from point x to the boundary
• g(x) = 0 for points on the plane
• g(x) > 0 on the side w points to ("positive side")
• g(x) < 0 on the "negative side"
Linear discriminant
the classifier implements the linear decision rule

  h*(x) = sgn[g(x)]  with  g(x) = wTx + b

given a linearly separable training set

  D = {(x1,y1), ..., (xn,yn)}

there are no errors if and only if, ∀ i
• yi = 1 and g(xi) > 0, or
• yi = -1 and g(xi) < 0
• i.e. yi g(xi) > 0
this allows a very concise expression for the situation of "no training error" or "zero empirical risk"
Learning as optimization
necessary and sufficient condition for zero empirical risk:

  yi(wTxi + b) > 0, ∀ i

this is interesting because it allows the formulation of the learning problem as one of function optimization
• starting from a random guess for the parameters w and b
• we maximize the reward function

    Σi=1..n yi(wTxi + b)

• or, equivalently, minimize the cost function

    J(w,b) = - Σi=1..n yi(wTxi + b)
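as a rough illustration (not part of the original slides), a minimal numpy sketch of this cost; the function name is mine, and X is assumed to be an (n, d) array of feature vectors with y an (n,) array of ±1 labels:

```python
import numpy as np

def cost_J(w, b, X, y):
    """Cost J(w,b) = -sum_i y_i (w^T x_i + b) over the whole training set."""
    margins = y * (X @ w + b)   # y_i (w^T x_i + b) for every point
    return -np.sum(margins)     # negative when every point is correctly classified
```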
The gradient
we have seen that the gradient of a function f(w) at z is

  ∇f(z) = ( ∂f/∂w0 (z), ..., ∂f/∂wn-1 (z) )T

Theorem: the gradient points in the direction of maximum growth
the gradient is
• the direction of greatest increase of f(x) at z
• normal to the iso-contours of f(.)
Critical point conditions
let f(x) be continuously differentiable
x* is a local minimum of f(x) if and only if
• f has zero gradient at x*

    ∇f(x*) = 0

• and the Hessian of f at x* is positive definite

    dT ∇2f(x*) d ≥ 0, ∀ d ∈ Rn

• where the Hessian is the matrix of second derivatives

    ∇2f(x) = [ ∂2f/∂x0∂x0     ...  ∂2f/∂x0∂xn-1
                    ...
               ∂2f/∂xn-1∂x0   ...  ∂2f/∂xn-1∂xn-1 ]
Gradient descent
this suggests a simple minimization technique
• pick an initial estimate x(0)
• follow the negative gradient

    x(n+1) = x(n) - η ∇f(x(n))

this is gradient descent
η is the learning rate and needs to be carefully chosen
• if η is too large, the descent may diverge
many extensions are possible
main point:
• once framed as optimization, we can (in general) solve it
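as a sketch of the generic procedure (not from the slides; the function names and the quadratic test function are my assumptions):

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, n_steps=100):
    """Generic gradient descent: x(n+1) = x(n) - eta * grad_f(x(n))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - eta * grad_f(x)
    return x

# e.g. minimizing f(x) = ||x||^2, whose gradient is 2x
x_min = gradient_descent(lambda x: 2 * x, x0=[3.0, -2.0])
```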
The perceptron
this was the main insight of Rosenblatt, which led to the Perceptron
the basic idea is to do gradient descent on our cost

  J(w,b) = - Σi=1..n yi(wTxi + b)

we know that:
• if the training set is linearly separable, there is at least a pair (w,b) such that J(w,b) < 0
• any minimum that is equal to or better than this will do
Q: can we find one such minimum?
Perceptron learning
the gradient is straightforward to compute

  ∂J/∂w = - Σi yi xi        ∂J/∂b = - Σi yi

and gradient descent is trivial
there is, however, one problem:
• J(w,b) is not bounded below
• if J(w,b) < 0, we can make J → −∞ by multiplying w and b by λ > 0
• the minimum is always at −∞, which is quite bad, numerically
this is really just the normalization problem that we already talked about
Rosenblatt's idea
restrict attention to the points incorrectly classified
at each iteration, define the set of errors

  E = { xi | yi(wTxi + b) < 0 }

and make the cost

  Jp(w,b) = - Σxi∈E yi(wTxi + b)

note that
• Jp cannot be negative since, in E, all yi(wTxi + b) are negative
• if we get to zero, we know we have the best possible solution (E empty)
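a minimal numpy sketch of this restricted cost (not from the slides; same assumed array shapes as before):

```python
import numpy as np

def perceptron_cost(w, b, X, y):
    """Rosenblatt's cost J_p: sum only over the currently misclassified points."""
    margins = y * (X @ w + b)          # y_i (w^T x_i + b)
    errors = margins < 0               # boolean mask of the error set E
    return -np.sum(margins[errors])    # >= 0, and == 0 iff E is empty
```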
Perceptron learning
is trivial, just do gradient descent on Jp(w,b)

  w(n+1) = w(n) + η Σxi∈E yi xi
  b(n+1) = b(n) + η Σxi∈E yi

this turns out not to be very effective if D is large
• we loop over the entire training set to take a small step at the end
one alternative that frequently is better is "stochastic gradient descent"
• take the step immediately after each point
• no guarantee this is a descent step but, on average, you follow the same direction after processing the entire D
• very popular in learning, where D is usually large
Perceptron learning
the algorithm is as follows (we will talk about R shortly!):
set k = 0, wk = 0, bk = 0
set R = maxi ||xi||
do {
  for i = 1:n {
    if yi(wkTxi + bk) < 0 then {
      wk+1 = wk + η yi xi
      bk+1 = bk + η yi R2
      k = k+1
    }
  }
} until yi(wkTxi + bk) ≥ 0, ∀ i (no errors)
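a minimal Python sketch of this pseudocode (not from the slides): X is assumed to be an (n, d) numpy array and y an (n,) array of ±1 labels; the max_epochs safeguard is my addition, and the misclassification test uses ≤ 0 so that updates are triggered from the all-zero initialization:

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, max_epochs=1000):
    """Stochastic perceptron learning, following the pseudocode above."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    R = np.max(np.linalg.norm(X, axis=1))     # R = max_i ||x_i||
    for _ in range(max_epochs):                # guard against non-separable data
        errors = 0
        for i in range(n):
            if y[i] * (w @ X[i] + b) <= 0:     # misclassified (or on the boundary)
                w = w + eta * y[i] * X[i]
                b = b + eta * y[i] * R**2
                errors += 1
        if errors == 0:                        # a full pass with no errors: done
            return w, b
    return w, b
```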
Perceptron learning
does this make sense? consider the example below
(figure, shown over several slides: a 2D training set with class y=1 marked "x" and class y=-1 marked "o", and the current boundary defined by wk and bk; when a misclassified point xi is processed, the update rotates the normal from wk to wk+1 = wk + η yi xi and shifts the offset from bk to bk+1 = bk + η yi R2, moving the boundary toward a position that classifies xi correctly)
Perceptron learning
OK, makes intuitive sense
how do we know it will not get stuck on a local minimum?
this was Rosenblatt's seminal contribution
Theorem: Let D = {(x1,y1), ..., (xn,yn)} and

  R = maxi ||xi||     (*)

If there is (w*,b*) such that ||w*|| = 1 and

  yi(w*Txi + b*) > γ, ∀ i     (**)

then the Perceptron will find an error-free hyper-plane in at most

  (2R/γ)2

iterations
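as a purely hypothetical numerical illustration (these values are not from the lecture): with R = 1 and margin γ = 0.1, the theorem guarantees convergence after at most (2·1/0.1)2 = 400 mistake-driven updates, independently of the number of training points n.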
Proof
not that hard
denote the iteration by t, and assume the point processed at iteration t-1 is (xi,yi)
for simplicity, use homogeneous coordinates. Defining

  at = [ wt ; bt/R ]     zi = [ xi ; R ]

allows the compact notation

  y (wTx + b) = y aT z

since only misclassified points are processed, we have

  yi (wt-1T xi + bt-1) = yi at-1T zi < 0     (***)
Proof
how does at evolve?

  at = [ wt ; bt/R ] = [ wt-1 ; bt-1/R ] + η yi [ xi ; R ] = at-1 + η yi zi

denoting the optimal solution by a* = (w*, b*/R)T

  atT a* = at-1T a* + η yi ziT a* = at-1T a* + η yi (w*Txi + b*)

and, from (**),

  atT a* > at-1T a* + ηγ

solving the recursion

  atT a* > at-1T a* + ηγ > at-2T a* + 2ηγ > ... > t ηγ
Proof
this means convergence to a* if we can bound the magnitude of at. What is this magnitude? Since

  at = at-1 + η yi zi

we have

  ||at||2 = atT at = ||at-1||2 + 2η yi at-1T zi + η2 ||zi||2
          < ||at-1||2 + η2 ||zi||2            from (***)
          = ||at-1||2 + η2 (||xi||2 + R2)     from the definition of z
          ≤ ||at-1||2 + 2η2 R2                from (*)

solving the recursion

  ||at||2 < 2t η2 R2
Proof
combining the two

  t ηγ < atT a* ≤ ||at|| ||a*|| < ||a*|| η √(2t) R

and

  t < 2R2 ||a*||2 / γ2
    = (2R2/γ2) ( ||w*||2 + b*2/R2 )      from the definition of a*
    = (2R2/γ2) ( 1 + b*2/R2 )            from ||w*|| = 1

since b*/||w*|| = b* is the distance from the plane to the origin, we have b* < R = maxi ||xi|| and

  t < (2R/γ)2
Note
this is not the "standard proof" (e.g. Duda, Hart, Stork)
standard proof:
• regular algorithm (no R in the update equations)
• tighter bound t < (R/γ)2
this appears better, but requires choosing η = R2/γ, which requires knowledge of γ, which we don't have until we find a*
i.e. the proof is non-constructive, we cannot design the algorithm that way
the algorithm above just works!
hence, I like this proof better despite the looser bound.
Perceptron learning
Theorem: Let D = {(x1,y1), ..., (xn,yn)} and

  R = maxi ||xi||

If there is (w*,b*) such that ||w*|| = 1 and

  yi(w*Txi + b*) > γ, ∀ i

then the Perceptron will find an error-free hyper-plane in at most

  (2R/γ)2

iterations
this result was the start of learning theory
for the first time there was a proof that a learning machine could actually learn something!
The margin
note that

  yi(w*Txi + b*) > γ, ∀ i

will hold if and only if

  γ ≤ mini yi(w*Txi + b*) = mini yi(w*Txi + b*)/||w*||

which is how we defined the margin (recall that ||w*|| = 1)
this says that the bound on the time to convergence, (2R/γ)2, is inversely proportional to the margin
even in this early result, the margin appears as a measure of the difficulty of the learning problem
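a small numpy sketch of this quantity (not from the slides; same assumed array shapes as before, and the value is only a true margin when every point is correctly classified):

```python
import numpy as np

def margin(w, b, X, y):
    """Smallest signed distance y_i (w^T x_i + b) / ||w|| over all training points."""
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)
```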
The role of R
scaling the space should not make a difference as to whether the problem is solvable
R accounts for this
if the xi are re-scaled, both R and γ are re-scaled and the bound (2R/γ)2 remains the same
once again, just a question of normalization
• illustrates the fact that the normalization ||w|| = 1 is usually not sufficient
Some history
Rosenblatt's result generated a lot of excitement about learning in the 50s
later, Minsky and Papert identified serious problems with the Perceptron
• there are very simple logic problems that it cannot solve
• more on this on the homework
this killed off the enthusiasm until an old result by Kolmogorov saved the day
Theorem: any continuous function g(x) defined on [0,1]n can be represented in the form

  g(x) = Σj=1..2n+1 Γj ( Σi=1..d Ψij(xi) )
Some history
noting that the Perceptron can be written as

  h(x) = sgn[ wTx + b ] = sgn[ Σi=1..n wi xi + w0 ]

this looks like having two Perceptron layers
layer 1: J hyper-planes wj

  hj(x) = sgn[ Σi=1..n wji xi + wj0 ],  j = 1,...,J

layer 2: hyperplane v

  u(x) = sgn[ Σj=1..J vj hj(x) + v0 ]
       = sgn[ Σj=1..J vj sgn( Σi=1..n wji xi + wj0 ) + v0 ]
Some history
which can be written as

  u(x) = sgn[ g(x) ]  with  g(x) = Σj=1..J vj sgn( Σi wji xi + wj0 ) + v0

and resembles

  g(x) = Σj=1..2n+1 Γj ( Σi=1..d Ψij(xi) )

it suggested the idea that
• while one Perceptron is not good enough
• maybe a multi-layered Perceptron (MLP) will work
a lot of work on MLPs ensued under the name of neural networks
eventually, it was shown that most functions can be approximated by MLPs
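a minimal sketch of the two-layer form above (not from the slides; the parameter layout is my assumption: W is a (J, d) matrix of first-layer weights, w0 a (J,) vector of biases, v a (J,) vector of second-layer weights, v0 a scalar bias):

```python
import numpy as np

def two_layer_perceptron(x, W, w0, v, v0):
    """u(x) = sgn[ sum_j v_j sgn(W_j . x + w0_j) + v0 ]."""
    h = np.sign(W @ x + w0)        # layer 1: J hyper-plane decisions
    return np.sign(v @ h + v0)     # layer 2: combine them with one more hyper-plane
```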
Graphical representation
the Perceptron is usually represented as a network diagram with
• input units: the coordinates of x
• weights: the coordinates of w
• a bias term w0
with homogeneous coordinates x = (x,1)T, the bias is absorbed into the weight vector:

  h(x) = sgn( Σi wi xi + w0 ) = sgn( wTx )
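a tiny sketch of this homogeneous-coordinate trick (not from the slides; names are mine, and w_aug is assumed to stack (w, w0)):

```python
import numpy as np

def perceptron_output(w_aug, x):
    """Perceptron decision with the bias treated as just another weight."""
    x_aug = np.append(x, 1.0)          # x = (x, 1)^T
    return np.sign(w_aug @ x_aug)      # h(x) = sgn(w^T x)
```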
Sigmoids
the sgn[.] function is problematic in two ways:
• no derivative at 0
• non-smooth
approximations: it can be approximated in various ways, for example by the hyperbolic tangent

  f(x) = tanh(σx) = (e^(σx) - e^(-σx)) / (e^(σx) + e^(-σx))

σ controls the approximation error, but, unlike sgn[.], the tanh
• has a derivative everywhere
• is smooth
neural networks are implemented with these functions
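a small sketch of this smooth approximation (not from the slides; the function name and the example values are mine):

```python
import numpy as np

def smooth_sgn(x, sigma=5.0):
    """Approximate sgn[.] by tanh(sigma * x); larger sigma gives a sharper transition."""
    return np.tanh(sigma * x)

# as sigma grows, the output approaches the hard +/-1 threshold
print(smooth_sgn(np.array([-1.0, -0.1, 0.0, 0.1, 1.0]), sigma=5.0))
```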
Neural network
the MLP as function approximation