The Perceptron
Nuno Vasconcelos, ECE Department, UCSD
svcl.ucsd.edu/courses/ece271b-f09/handouts/perceptron.pdf


  • The Perceptron

    Nuno Vasconcelos, ECE Department, UCSD

  • Classification
    a classification problem has two types of variables
    • X - vector of observations (features) in the world
    • Y - state (class) of the world
    e.g.
    • x ∈ X ⊂ R² = (fever, blood pressure)
    • y ∈ Y = {disease, no disease}
    X, Y related by an (unknown) function

        y = f(x)

    goal: design a classifier h: X → Y such that h(x) = f(x) ∀x

  • Linear discriminant
    the classifier implements the linear decision rule

        h(x) = sgn[g(x)] = {  1  if g(x) > 0       with g(x) = wTx + b
                           { -1  if g(x) < 0

    has the properties
    • it divides X into two “half-spaces”
    • boundary is the plane with:
      • normal w
      • distance to the origin b/||w||
    • g(x)/||w|| is the distance from point x to the boundary
    • g(x) = 0 for points on the plane
    • g(x) > 0 on the side w points to (“positive side”)
    • g(x) < 0 on the “negative side”
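A minimal sketch of this decision rule in Python/numpy, with a hypothetical w, b and query point (none of these values come from the slides): it evaluates g(x), the sign decision, and the signed distance g(x)/||w|| to the boundary.

```python
import numpy as np

def g(x, w, b):
    """Linear discriminant g(x) = w^T x + b."""
    return np.dot(w, x) + b

def h(x, w, b):
    """Decision rule h(x) = sgn(g(x)): +1 on the side w points to, -1 otherwise."""
    return 1 if g(x, w, b) > 0 else -1

# hypothetical parameters and query point (illustration only)
w = np.array([2.0, 1.0])
b = -3.0
x = np.array([1.0, 2.0])

print(h(x, w, b))                      # class label in {+1, -1}
print(g(x, w, b) / np.linalg.norm(w))  # signed distance from x to the boundary
print(abs(b) / np.linalg.norm(w))      # distance of the boundary to the origin
```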

  • Linear discriminant
    the classifier implements the linear decision rule

        h(x) = sgn[g(x)] = {  1  if g(x) > 0       with g(x) = wTx + b
                           { -1  if g(x) < 0

    given a linearly separable training set

        D = {(x1,y1), ..., (xn,yn)}

    there are no errors if and only if, ∀i
    • yi = 1 and g(xi) > 0, or
    • yi = -1 and g(xi) < 0
    • i.e. yi·g(xi) > 0
    this allows a very concise expression for the situation of
    “no training error” or “zero empirical risk”
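As a quick check of the “no training error” condition, a small sketch with a made-up separable toy set (the data are an assumption, not from the slides): empirical risk is zero exactly when yi·g(xi) > 0 for every training pair.

```python
import numpy as np

def zero_empirical_risk(X, y, w, b):
    """True iff y_i (w^T x_i + b) > 0 for all i, i.e. no training errors."""
    margins = y * (X @ w + b)
    return bool(np.all(margins > 0))

# made-up linearly separable toy set (illustration only)
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])
print(zero_empirical_risk(X, y, w=np.array([1.0, 1.0]), b=0.0))  # True
```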

  • Learning as optimization
    necessary and sufficient condition for zero empirical risk

        yi(wTxi + b) > 0, ∀i

    this is interesting because it allows the formulation of the
    learning problem as one of function optimization
    • starting from a random guess for the parameters w and b
    • we maximize the reward function

          Σi=1..n yi(wTxi + b)

    • or, equivalently, minimize the cost function

          J(w,b) = − Σi=1..n yi(wTxi + b)
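The reward/cost pair from this slide, written out under the same assumptions (the same toy data as above, hypothetical parameters): maximizing the reward Σi yi(wTxi + b) is the same as minimizing J(w,b) = −Σi yi(wTxi + b).

```python
import numpy as np

def J(w, b, X, y):
    """Cost J(w, b) = -sum_i y_i (w^T x_i + b); negative when all points are correct."""
    return -np.sum(y * (X @ w + b))

X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])  # toy data
y = np.array([1, 1, -1, -1])
print(J(np.array([1.0, 1.0]), 0.0, X, y))  # < 0 here, since every y_i g(x_i) > 0
```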

  • The gradient
    we have seen that the gradient of a function f(w) at z is

        ∇f(z) = ( ∂f/∂w0 (z), ..., ∂f/∂wn-1 (z) )T

    Theorem: the gradient points in the direction of maximum growth
    the gradient is
    • the direction of greatest increase of f(x) at z
    • normal to the iso-contours of f(.)
    [figure: surface f(x,y) with gradients ∇f(x0,y0), ∇f(x1,y1) drawn
    normal to the iso-contours]
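A small numerical illustration of the theorem on an assumed example function (not from the slides): finite differences approximate ∇f, and a step along ∇f increases f more than a step of the same size in another direction.

```python
import numpy as np

def f(z):                      # example function (assumption): f(w0, w1) = w0^2 + 3*w1^2
    return z[0]**2 + 3 * z[1]**2

def num_grad(f, z, eps=1e-6):  # central finite-difference gradient
    grad = np.zeros_like(z)
    for i in range(len(z)):
        e = np.zeros_like(z); e[i] = eps
        grad[i] = (f(z + e) - f(z - e)) / (2 * eps)
    return grad

z = np.array([1.0, 2.0])
grad = num_grad(f, z)                       # ~ [2, 12]
u = grad / np.linalg.norm(grad)             # unit step along the gradient
v = np.array([1.0, 0.0])                    # some other unit direction
step = 1e-3
print(f(z + step * u) - f(z))               # largest first-order increase
print(f(z + step * v) - f(z))               # smaller increase
```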

  • Critical point conditions
    let f(x) be continuously differentiable
    x* is a local minimum of f(x) if and only if
    • f has zero gradient at x*

          ∇f(x*) = 0

    • and the Hessian of f at x* is positive definite

          dT ∇²f(x*) d ≥ 0, ∀d ∈ Rn

    • where

          ∇²f(x) = [ ∂²f/∂x0²       ...  ∂²f/∂x0∂xn-1 ]
                   [     ⋮                    ⋮       ]
                   [ ∂²f/∂xn-1∂x0   ...  ∂²f/∂xn-1²   ]

  • Gradient descent
    this suggests a simple minimization technique
    • pick an initial estimate x(0)
    • follow the negative gradient

          x(n+1) = x(n) − η ∇f(x(n))

    this is gradient descent
    η is the learning rate and needs to be carefully chosen
    • if η is too large, descent may diverge
    many extensions are possible
    main point:
    • once framed as optimization, we can (in general) solve it
    [figure: f(x) with the step −η∇f(x(n)) taken from x(n)]
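A bare-bones gradient descent sketch on an assumed quadratic, just to make the update x(n+1) = x(n) − η∇f(x(n)) concrete; the cost function and the learning rate η here are hand-picked assumptions, not values from the slides.

```python
import numpy as np

def grad_f(x):             # gradient of the example cost f(x) = x0^2 + 3*x1^2 (assumption)
    return np.array([2 * x[0], 6 * x[1]])

x = np.array([4.0, -2.0])  # initial estimate x(0)
eta = 0.1                  # learning rate; too large and the iteration diverges
for n in range(100):
    x = x - eta * grad_f(x)   # x(n+1) = x(n) - eta * grad f(x(n))
print(x)                      # close to the minimizer [0, 0]
```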

  • The perceptron
    this was the main insight of Rosenblatt, which led to the Perceptron
    the basic idea is to do gradient descent on our cost

        J(w,b) = − Σi=1..n yi(wTxi + b)

    we know that:
    • if the training set is linearly separable, there is at least a pair (w,b)
      such that J(w,b) < 0
    • any minimum that is equal to or better than this will do
    Q: can we find one such minimum?

  • Perceptron learning
    the gradient is straightforward to compute

        ∂J/∂w = − Σi yi xi        ∂J/∂b = − Σi yi

    and gradient descent is trivial
    there is, however, one problem:
    • J(w,b) is not bounded below
    • if J(w,b) < 0, we can make J → −∞ by multiplying w and b by λ > 0
    • the minimum is always at −∞, which is quite bad, numerically
    this is really just the normalization problem that we already
    talked about
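The unboundedness can be seen numerically; this sketch (same toy data as above, hypothetical (w, b) that already separates it) scales a correct solution by λ and watches J(λw, λb) = λ·J(w, b) run off toward −∞.

```python
import numpy as np

def J(w, b, X, y):
    return -np.sum(y * (X @ w + b))

X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])  # toy data
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0        # separates the toy data, so J < 0
for lam in [1, 10, 100, 1000]:
    print(lam, J(lam * w, lam * b, X, y))   # J scales linearly with lambda -> -infinity
```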

  • Rosenblatt’s idea
    restrict attention to the points incorrectly classified
    at each iteration, define the set of errors

        E = { i | yi(wTxi + b) < 0 }

    and consider the cost restricted to these points

        Jp(w,b) = − Σi∈E yi(wTxi + b)
  • Perceptron learning
    is trivial, just do gradient descent on Jp(w,b)

        w(n+1) = w(n) + η Σi∈E yi xi
        b(n+1) = b(n) + η Σi∈E yi

    this turns out not to be very effective if D is large
    • you loop over the entire training set to take a small step at the end
    one alternative that frequently is better is “stochastic
    gradient descent”
    • take the step immediately after each point
    • no guarantee this is a descent step but, on average, you follow
      the same direction after processing the entire D
    • very popular in learning, where D is usually large
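A sketch of the two update styles on Jp (only misclassified points contribute), with assumed toy data and η = 1: the batch version sums over the whole error set before stepping, the stochastic version steps immediately after each misclassified point.

```python
import numpy as np

def batch_update(w, b, X, y, eta=1.0):
    """One batch step: sum the gradient contributions of all current errors, then step."""
    errors = y * (X @ w + b) <= 0
    w = w + eta * np.sum(y[errors, None] * X[errors], axis=0)
    b = b + eta * np.sum(y[errors])
    return w, b

def stochastic_pass(w, b, X, y, eta=1.0):
    """One pass of stochastic updates: step right after each misclassified point."""
    for xi, yi in zip(X, y):
        if yi * (np.dot(w, xi) + b) <= 0:
            w = w + eta * yi * xi
            b = b + eta * yi
    return w, b

X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])  # toy data
y = np.array([1, 1, -1, -1])
print(batch_update(np.zeros(2), 0.0, X, y))      # one batch step from (0, 0)
print(stochastic_pass(np.zeros(2), 0.0, X, y))   # one stochastic pass; separates this toy set
```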

  • Perceptron learning
    the algorithm is as follows:

        set k = 0, wk = 0, bk = 0
        set R = maxi ||xi||            (we will talk about R shortly!)
        do {
          for i = 1:n {
            if yi(wkTxi + bk) ≤ 0 then {
              wk+1 = wk + η yi xi
              bk+1 = bk + η yi R²
              k = k+1
            }
          }
        } until yi(wkTxi + bk) > 0, ∀i   (no errors)
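A runnable version of the algorithm above (Python/numpy), keeping the R² factor in the bias update; η and the toy data are assumptions, and the outer loop is capped in case the data are not linearly separable.

```python
import numpy as np

def perceptron(X, y, eta=1.0, max_epochs=1000):
    """Rosenblatt's Perceptron as on the slide: w, b start at 0, R = max_i ||x_i||,
    and each mistake triggers w += eta*y_i*x_i, b += eta*y_i*R^2."""
    n, d = X.shape
    w, b, k = np.zeros(d), 0.0, 0
    R = np.max(np.linalg.norm(X, axis=1))
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * (np.dot(w, X[i]) + b) <= 0:   # misclassified (or on the boundary)
                w = w + eta * y[i] * X[i]
                b = b + eta * y[i] * R**2
                k += 1
                mistakes += 1
        if mistakes == 0:                           # no errors: y_i(w^T x_i + b) > 0 for all i
            return w, b, k
    return w, b, k                                  # cap reached (non-separable or bad eta)

# toy linearly separable data (assumption, not from the slides)
X = np.array([[2.0, 2.0], [3.0, 1.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0], [-3.0, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])
w, b, k = perceptron(X, y)
print(w, b, k, np.all(y * (X @ w + b) > 0))   # final hyperplane, number of updates, True
```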

  • Perceptron learning
    does this make sense? consider the example below
    [figure sequence: a 2D training set (axes x1, x2) with classes y=1 (x’s)
    and y=-1 (o’s) and the current hyperplane with normal wk and offset bk;
    a misclassified point xi is picked, the update wk+1 = wk + η yi xi rotates
    the normal toward xi, and bk+1 = bk + η yi R² shifts the boundary, so the
    new plane (wk+1, bk+1) moves toward classifying xi correctly]

  • Perceptron learning
    OK, makes intuitive sense
    how do we know it will not get stuck in a local minimum?
    this was Rosenblatt’s seminal contribution
    Theorem: Let D = {(x1,y1), ..., (xn,yn)} and

        R = maxi ||xi||                          (*)

    If there is (w*,b*) such that ||w*|| = 1 and

        yi(w*Txi + b*) > γ, ∀i                   (**)

    then the Perceptron will find an error-free hyperplane in at most

        (2R/γ)²

    iterations

  • Proof
    not that hard
    denote the iteration by t, and assume the point processed at
    iteration t-1 is (xi, yi)
    for simplicity, use homogeneous coordinates. Defining

        at = [ wt   ]         zi = [ xi ]
             [ bt/R ]              [ R  ]

    allows the compact notation

        yi(wTxi + b) = yi aTzi

    since only misclassified points are processed, we have

        yi at-1Tzi ≤ 0                           (***)

  • Proof
    how does at evolve?

        at = [ wt   ] = [ wt-1   ] + η yi [ xi ] = at-1 + η yi zi
             [ bt/R ]   [ bt-1/R ]        [ R  ]

    denoting the optimal solution by a* = (w*, b*/R)T

        a*Tat = a*Tat-1 + η yi a*Tzi
              = a*Tat-1 + η yi (w*Txi + b*)

    and, from (**),

        a*Tat > a*Tat-1 + ηγ

    solving the recursion

        a*Tat > a*Tat-1 + ηγ > a*Tat-2 + 2ηγ > ... > t ηγ

  • Proof
    this means convergence to a* if we can bound the magnitude of at.
    What is this magnitude? Since

        at = at-1 + η yi zi

    we have

        ||at||² = atTat = ||at-1||² + 2η yi at-1Tzi + η² ||zi||²
                ≤ ||at-1||² + η² ||zi||²                  from (***)
                = ||at-1||² + η² (||xi||² + R²)           from the definition of zi
                ≤ ||at-1||² + 2η² R²                      from (*)

    solving the recursion

        ||at||² ≤ 2 t η² R²

  • Proof
    combining the two

        t ηγ < a*Tat ≤ ||a*|| ||at|| ≤ ||a*|| η R √(2t)

    and, since ||w*|| = 1 and |b*| ≤ R (which any hyperplane separating data
    inside a ball of radius R must satisfy), ||a*||² = ||w*||² + (b*/R)² ≤ 2, so

        t < 2 (||a*|| R/γ)² ≤ (2R/γ)²

  • Note
    this is not the “standard proof” (e.g. Duda, Hart, Stork)
    standard proof:
    • regular algorithm (no R in the update equations)
    • tighter bound t < (R/γ)²
    this appears better, but requires choosing η = R²/γ,
    which requires knowledge of γ, which we don’t have until
    we find a*
    i.e. the proof is non-constructive, we cannot design the
    algorithm that way
    the algorithm above just works!
    hence, I like this proof better despite the looser bound.

  • Perceptron learning
    Theorem: Let D = {(x1,y1), ..., (xn,yn)} and

        R = maxi ||xi||

    If there is (w*,b*) such that ||w*|| = 1 and

        yi(w*Txi + b*) > γ, ∀i

    then the Perceptron will find an error-free hyperplane in at most

        (2R/γ)²

    iterations
    this result was the start of learning theory
    for the first time there was a proof that a learning
    machine could actually learn something!
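An empirical sanity check of the theorem on assumed toy data (the perceptron() sketch from above is restated so the block is self-contained, and (w*, b*) is a hand-picked unit-norm separator): the observed number of updates k stays below (2R/γ)².

```python
import numpy as np

def perceptron(X, y, eta=1.0, max_epochs=1000):
    n, d = X.shape
    w, b, k = np.zeros(d), 0.0, 0
    R = np.max(np.linalg.norm(X, axis=1))
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * (np.dot(w, X[i]) + b) <= 0:
                w, b, k = w + eta * y[i] * X[i], b + eta * y[i] * R**2, k + 1
                mistakes += 1
        if mistakes == 0:
            break
    return w, b, k

# assumed toy data; (w*, b*) below is one separating direction with ||w*|| = 1
X = np.array([[2.0, 2.0], [3.0, 1.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0], [-3.0, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])
w_star, b_star = np.array([1.0, 1.0]) / np.sqrt(2), 0.0
gamma = np.min(y * (X @ w_star + b_star))       # margin achieved by (w*, b*)
R = np.max(np.linalg.norm(X, axis=1))
_, _, k = perceptron(X, y)
print(k, (2 * R / gamma) ** 2)                  # observed updates <= theoretical bound
```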

  • The margin
    note that

        yi(w*Txi + b*) ≥ γ, ∀i

    will hold if and only if

        γ ≤ mini yi(w*Txi + b*) = mini |w*Txi + b*|

    which is how we defined the margin (since ||w*|| = 1, this is the
    distance from the closest point to the boundary)
    this says that the bound on the time to convergence

        (2R/γ)²

    is inversely proportional to the margin
    even in this early result, the margin appears as a
    measure of the difficulty of the learning problem
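The margin as defined here, as a tiny helper (assumed data and hyperplane, nothing slide-specific): γ = mini yi(wTxi + b) after normalizing w to unit norm, and the convergence bound (2R/γ)² shrinks as γ grows.

```python
import numpy as np

def margin(X, y, w, b):
    """gamma = min_i y_i (w^T x_i + b), with (w, b) normalized so ||w|| = 1."""
    norm = np.linalg.norm(w)
    return np.min(y * (X @ (w / norm) + b / norm))

X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])   # toy data
y = np.array([1, 1, -1, -1])
gamma = margin(X, y, w=np.array([1.0, 1.0]), b=0.0)
R = np.max(np.linalg.norm(X, axis=1))
print(gamma, (2 * R / gamma) ** 2)   # larger margin -> smaller bound on iterations
```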

  • The role of R
    scaling the space should not make a difference as to whether
    the problem is solvable
    R accounts for this
    if the xi are re-scaled, both R and γ are re-scaled and the bound

        (2R/γ)²

    remains the same
    once again, this is just a question of normalization
    • it illustrates the fact that the normalization ||w|| = 1 is usually
      not sufficient
    [figure: separating hyperplane with normal w, margin γ, and data
    radius R for classes y=1 and y=-1]
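A quick numerical illustration of the scaling argument (assumed data, reusing the margin computation above): rescaling every xi by a factor c rescales both R and γ by c, so (2R/γ)² is unchanged.

```python
import numpy as np

def bound(X, y, w, b):
    """(2R/gamma)^2 for the given data and unit-normalized (w, b)."""
    norm = np.linalg.norm(w)
    gamma = np.min(y * (X @ (w / norm) + b / norm))
    R = np.max(np.linalg.norm(X, axis=1))
    return (2 * R / gamma) ** 2

X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0
for c in [1.0, 10.0, 0.1]:
    print(bound(c * X, y, w, c * b))   # identical for every scale factor c
```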

  • Some history
    Rosenblatt’s result generated a lot of excitement about
    learning in the 50s
    later, Minsky and Papert identified serious problems with
    the Perceptron
    • there are very simple logic problems that it cannot solve
    • more on this in the homework
    this killed off the enthusiasm until an old result by
    Kolmogorov saved the day
    Theorem: any continuous function g(x) defined on [0,1]ⁿ
    can be represented in the form

        g(x) = Σj=1..2n+1 Γj ( Σi=1..d Ψij(xi) )

  • Some history
    noting that the Perceptron can be written as

        h(x) = sgn[ wTx + b ] = sgn[ Σi=1..n wi xi + w0 ]

    this looks like having two Perceptron layers
    layer 1: J hyper-planes wj

        hj(x) = sgn[ Σi=1..n wij xi + w0j ],   j = 1, ..., J

    layer 2: hyperplane v

        u(x) = sgn[ Σj=1..J vj hj(x) + v0 ]
             = sgn[ Σj=1..J vj sgn( Σi=1..n wij xi + w0j ) + v0 ]

  • Some history
    which can be written as

        u(x) = sgn[ g(x) ]   with   g(x) = Σj=1..J vj sgn( Σi=1..n wij xi + w0j ) + v0

    and resembles

        g(x) = Σj=1..2n+1 Γj ( Σi=1..d Ψij(xi) )

    it suggested the idea that
    • while one Perceptron is not good enough
    • maybe a multi-layered Perceptron (MLP) will work
    a lot of work on MLPs ensued under the name of neural
    networks
    eventually, it was shown that most functions can be
    approximated by MLPs

  • Graphical representation
    the Perceptron is usually represented graphically as a single unit:
    inputs feed through weights into a sgn element
    • input units: the coordinates of x
    • weights: the coordinates of w
    • homogeneous coordinates: x = (x,1)T, so that w0 plays the role of
      the bias term

        h(x) = sgn( Σi wi xi + w0 ) = sgn( wTx )

  • Sigmoids
    the sgn[.] function is problematic in two ways:
    • no derivative at 0
    • non-smooth
    it can be approximated in various ways,
    for example by the hyperbolic tangent

        f(x) = tanh(σx) = (e^(σx) − e^(−σx)) / (e^(σx) + e^(−σx))

    σ controls the approximation error, but, unlike sgn, the tanh
    • has a derivative everywhere
    • is smooth
    neural networks are implemented with these functions
    [figure: plots of f(x), f'(x), f''(x)]
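A small sketch of this smooth surrogate, printing values rather than plotting: f(x) = tanh(σx) and its derivative f'(x) = σ(1 − tanh²(σx)); the sample points and σ values are assumptions for illustration, and larger σ makes the approximation to sgn sharper.

```python
import numpy as np

def f(x, sigma=1.0):
    """Smooth surrogate for sgn: f(x) = tanh(sigma * x)."""
    return np.tanh(sigma * x)

def f_prime(x, sigma=1.0):
    """Derivative, defined everywhere: f'(x) = sigma * (1 - tanh(sigma*x)^2)."""
    return sigma * (1.0 - np.tanh(sigma * x) ** 2)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for sigma in [1.0, 5.0]:
    print(sigma, np.round(f(x, sigma), 3), np.round(f_prime(x, sigma), 3))
# as sigma grows, f(x) approaches sgn(x) but keeps a derivative at every point
```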

  • Neural network
    the MLP as function approximation
    [figure: an MLP used to approximate a target function]
