
Page 1: The Perceptron

Nuno Vasconcelos, ECE Department, UCSD

Page 2: Classification

a classification problem has two types of variables
• X - vector of observations (features) in the world
• Y - state (class) of the world

e.g.
• x ∈ X ⊂ R² = (fever, blood pressure)
• y ∈ Y = {disease, no disease}

X and Y are related by an (unknown) function

y = f(x)

goal: design a classifier h: X → Y such that h(x) = f(x) ∀x

Page 3: Linear discriminant

the classifier implements the linear decision rule

h(x) = sgn[g(x)], with g(x) = wT x + b and sgn[g(x)] = 1 if g(x) > 0, -1 if g(x) < 0

has the properties
• it divides X into two “half-spaces”
• the boundary is the plane with:
  • normal w
  • distance to the origin b/||w||
• g(x)/||w|| is the distance from point x to the boundary
• g(x) = 0 for points on the plane
• g(x) > 0 on the side w points to (“positive side”)
• g(x) < 0 on the “negative side”

(figure: the hyperplane g(x) = 0, its normal w, the distance b/||w|| to the origin, and the distance g(x)/||w|| from a point x)
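To make these properties concrete, here is a minimal NumPy sketch (not from the slides; w, b, and x are made-up illustrative values) that evaluates g(x), the decision h(x) = sgn[g(x)], and the signed distance g(x)/||w|| from x to the boundary:

import numpy as np

w = np.array([2.0, 1.0])          # normal to the boundary (illustrative values)
b = -4.0                          # offset; the plane lies at distance |b|/||w|| from the origin
x = np.array([3.0, 1.0])          # a test point

g = w @ x + b                     # g(x) = wT x + b
h = np.sign(g)                    # decision rule h(x) = sgn[g(x)]
dist = g / np.linalg.norm(w)      # signed distance from x to the plane

print(h, dist)                    # h = +1: x is on the side w points to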

Page 4: Linear discriminant

the classifier implements the linear decision rule

h(x) = sgn[g(x)], with g(x) = wT x + b

given a linearly separable training set

D = {(x1,y1), ..., (xn,yn)}

there are no errors if and only if, ∀ i
• yi = 1 and g(xi) > 0, or
• yi = -1 and g(xi) < 0
• i.e. yi g(xi) > 0

this allows a very concise expression for the situation of “no training error” or “zero empirical risk”

(figure: a linearly separable training set, class y=1 on the positive side and class y=-1 on the negative side of the plane with normal w)
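The compact condition yi g(xi) > 0 translates directly into a one-line check; a small sketch with hypothetical data (the arrays below are illustrative, not from the slides):

import numpy as np

X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0]])   # rows are the training points xi
y = np.array([1, 1, -1])                                # labels yi in {+1, -1}
w, b = np.array([1.0, 1.0]), -1.0                       # a candidate hyperplane

# zero empirical risk  <=>  yi (wT xi + b) > 0 for every i
zero_risk = np.all(y * (X @ w + b) > 0)
print(zero_risk)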

Page 5: Learning as optimization

necessary and sufficient condition for zero empirical risk:

yi (wT xi + b) > 0, ∀ i

this is interesting because it allows the formulation of the learning problem as one of function optimization
• starting from a random guess for the parameters w and b
• we maximize the reward function

Σ_{i=1}^{n} yi (wT xi + b)

• or, equivalently, minimize the cost function

J(w,b) = - Σ_{i=1}^{n} yi (wT xi + b)
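A minimal sketch of this cost in NumPy (the names perceptron_cost, X, y are illustrative, not from the slides); the reward function is simply its negative:

import numpy as np

def perceptron_cost(w, b, X, y):
    # J(w, b) = - sum_i yi (wT xi + b)
    return -np.sum(y * (X @ w + b))

X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0]])
y = np.array([1, 1, -1])
print(perceptron_cost(np.array([1.0, 1.0]), -1.0, X, y))   # lower is better under this cost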

Page 6: The gradient

we have seen that the gradient of a function f(w) at z is

∇f(z) = ( ∂f/∂w0 (z), ..., ∂f/∂wn-1 (z) )T

Theorem: the gradient points in the direction of maximum growth

the gradient is
• the direction of greatest increase of f(x) at z
• normal to the iso-contours of f(.)

(figure: iso-contours of f(x,y) with the gradients ∇f(x0,y0) and ∇f(x1,y1) normal to them)

Page 7: Critical point conditions

let f(x) be continuously differentiable
x* is a local minimum of f(x) if and only if
• f has zero gradient at x*

∇f(x*) = 0

• and the Hessian of f at x* is positive definite

dT ∇²f(x*) d ≥ 0, ∀ d ∈ Rn

• where

∇²f(x) = [ ∂²f/∂x0²       ...  ∂²f/∂x0∂xn-1
              ⋮                    ⋮
           ∂²f/∂xn-1∂x0   ...  ∂²f/∂xn-1²   ]

Page 8: Gradient descent

this suggests a simple minimization technique
• pick an initial estimate x(0)
• follow the negative gradient

x(n+1) = x(n) - η ∇f(x(n))

this is gradient descent
η is the learning rate and needs to be carefully chosen
• if η is too large, descent may diverge

many extensions are possible
main point:
• once framed as optimization, we can (in general) solve it

(figure: f(x), the iterate x(n), and the step -η ∇f(x(n)))
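A generic sketch of the update x(n+1) = x(n) - η ∇f(x(n)), applied to a toy quadratic of my own choosing (nothing here is prescribed by the slides):

import numpy as np

def grad_descent(grad, x0, eta=0.1, n_steps=100):
    # follow the negative gradient from the initial estimate x0
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - eta * grad(x)
    return x

# toy example: f(x) = ||x - c||^2 has gradient 2 (x - c) and minimum at c
c = np.array([1.0, -2.0])
print(grad_descent(lambda x: 2 * (x - c), x0=[0.0, 0.0]))   # converges near c for small eta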

Page 9: The perceptron

this was the main insight of Rosenblatt, which led to the Perceptron
the basic idea is to do gradient descent on our cost

J(w,b) = - Σ_{i=1}^{n} yi (wT xi + b)

we know that:
• if the training set is linearly separable there is at least a pair (w,b) such that J(w,b) < 0
• any minimum that is equal to or better than this will do

Q: can we find one such minimum?

Page 10: Perceptron learning

the gradient is straightforward to compute

∂J/∂w = - Σi yi xi        ∂J/∂b = - Σi yi

and gradient descent is trivial
there is, however, one problem:
• J(w,b) is not bounded below
• if J(w,b) < 0, we can make J → −∞ by multiplying w and b by λ > 0
• the minimum is always at −∞, which is quite bad numerically

this is really just the normalization problem that we already talked about
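The two partial derivatives, and the unboundedness problem, in a short sketch (illustrative data and names):

import numpy as np

def grad_J(w, b, X, y):
    # dJ/dw = - sum_i yi xi,   dJ/db = - sum_i yi
    dw = -np.sum(y[:, None] * X, axis=0)
    db = -np.sum(y)
    return dw, db

X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0]])
y = np.array([1, 1, -1])
w, b = np.array([1.0, 1.0]), -1.0
print(grad_J(w, b, X, y))

J = lambda w, b: -np.sum(y * (X @ w + b))
print(J(w, b), J(10 * w, 10 * b))   # scaling (w, b) by 10 scales J by 10: the cost is unbounded below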

Page 11: Rosenblatt’s idea

restrict attention to the points incorrectly classified
at each iteration, define the set of errors

E = { xi | yi (wT xi + b) < 0 }

and make the cost

Jp(w,b) = - Σ_{xi ∈ E} yi (wT xi + b)

note that
• Jp cannot be negative since, in E, all yi (wT xi + b) are negative
• if we get to zero, we know we have the best possible solution (E empty)
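A sketch of the error set E and of Jp restricted to it (names and data are illustrative):

import numpy as np

def perceptron_criterion(w, b, X, y):
    # E = { xi : yi (wT xi + b) < 0 };  Jp(w, b) = - sum_{xi in E} yi (wT xi + b)
    margins = y * (X @ w + b)
    E = margins < 0
    return -np.sum(margins[E])          # always >= 0, and exactly 0 when E is empty

X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0]])
y = np.array([1, -1, -1])               # the second point is misclassified by the (w, b) below
print(perceptron_criterion(np.array([1.0, 1.0]), -1.0, X, y))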

Page 12: Perceptron learning

is trivial, just do gradient descent on Jp(w,b)

w(n+1) = w(n) + η Σ_{xi ∈ E} yi xi
b(n+1) = b(n) + η Σ_{xi ∈ E} yi

this turns out not to be very effective if D is large
• loop over the entire training set to take a small step at the end

one alternative that is frequently better is “stochastic gradient descent”
• take the step immediately after each point
• no guarantee this is a descent step but, on average, you follow the same direction after processing the entire D
• very popular in learning, where D is usually large
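A sketch contrasting the two update styles over one pass through the data (function names, η, and the data handling are my own choices, not from the slides):

import numpy as np

def batch_step(w, b, X, y, eta=0.1):
    # one batch step: sum the updates over all current errors, then move once
    err = y * (X @ w + b) < 0
    return w + eta * np.sum(y[err][:, None] * X[err], axis=0), b + eta * np.sum(y[err])

def sgd_pass(w, b, X, y, eta=0.1):
    # stochastic version: update immediately after each misclassified point
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) < 0:
            w, b = w + eta * yi * xi, b + eta * yi
    return w, b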

Page 13: Perceptron learning

the algorithm is as follows:

set k = 0, wk = 0, bk = 0
set R = maxi ||xi||
do {
  for i = 1:n {
    if yi(wkT xi + bk) < 0 then {
      wk+1 = wk + η yi xi
      bk+1 = bk + η yi R²
      k = k+1
    }
  }
} until yi(wkT xi + bk) ≥ 0, ∀ i (no errors)

(we will talk about R shortly!)
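A fairly direct transcription of this algorithm into Python (a sketch, not the official implementation). Two small deviations: the mistake test uses ≤ 0 rather than < 0 so that the all-zeros initialization still triggers updates, and an epoch cap is added so the loop terminates even if the data are not linearly separable:

import numpy as np

def perceptron_train(X, y, eta=1.0, max_epochs=1000):
    # X: n x d array of points xi; y: labels in {+1, -1}
    w, b, k = np.zeros(X.shape[1]), 0.0, 0
    R = np.max(np.linalg.norm(X, axis=1))        # R = max_i ||xi||
    for _ in range(max_epochs):                  # safety cap; the theorem bounds the number of corrections k
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:           # <= 0 (not the slide's < 0) so the zero start triggers updates
                w = w + eta * yi * xi            # wk+1 = wk + eta yi xi
                b = b + eta * yi * R ** 2        # bk+1 = bk + eta yi R^2
                k += 1
                mistakes += 1
        if mistakes == 0:                        # yi (wT xi + b) > 0 for all i: no errors
            break
    return w, b, k

# tiny usage example on a separable toy set
X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b, k = perceptron_train(X, y)
print(w, b, k, np.all(y * (X @ w + b) > 0))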

Page 14: Perceptron learning

does this make sense? consider the example below
(figure: a linearly separable training set in (x1, x2), class y=1 marked “x” and class y=-1 marked “o”, with the current hyperplane defined by wk and bk; the algorithm from the previous slide is shown alongside)

Page 15: Perceptron learning

(figure: same example; a misclassified point xi is highlighted)

Page 16: Perceptron learning

(figure: same example; the weight update wk + η yi xi is applied to the normal)

Page 17: Perceptron learning

(figure: same example; the new normal wk+1)

Page 18: Perceptron learning

(figure: same example; the bias update bk + η yi R²)

Page 19: Perceptron learning

(figure: same example; the new hyperplane defined by wk+1 and bk+1)

Page 20: Perceptron learning

OK, this makes intuitive sense
how do we know it will not get stuck in a local minimum?
this was Rosenblatt’s seminal contribution

Theorem: Let D = {(x1,y1), ..., (xn,yn)} and

R = maxi ||xi||     (*)

If there is (w*,b*) such that ||w*|| = 1 and

yi (w*T xi + b*) > γ, ∀ i     (**)

then the Perceptron will find an error-free hyper-plane in at most

(2R/γ)²  iterations

Page 21: Proof

not that hard
denote the iteration by t, and assume the point processed at iteration t-1 is (xi,yi)
for simplicity, use homogeneous coordinates. Defining

at = (wt, bt/R)T        zi = (xi, R)T

allows the compact notation

yi (wT xi + b) = yi aT zi

since only misclassified points are processed, we have

yi at-1T zi < 0     (***)

Page 22: Proof

how does at evolve?

at = (wt, bt/R)T = (wt-1, bt-1/R)T + η (yi xi, yi R)T = at-1 + η yi zi

denoting the optimal solution by a* = (w*, b*/R)T

atT a* = at-1T a* + η yi ziT a*
       = at-1T a* + η yi (w*T xi + b*)

and, from (**),

atT a* > at-1T a* + η γ

solving the recursion

atT a* > at-1T a* + η γ > at-2T a* + 2 η γ > ... > t η γ

Page 23: Proof

this means convergence to a* if we can bound the magnitude of at. What is this magnitude? Since

at = at-1 + η yi zi

we have

||at||² = atT at = ||at-1||² + 2 η yi at-1T zi + η² ||zi||²

and, from (***) and the definition of zi,

||at||² < ||at-1||² + η² ||zi||²
        = ||at-1||² + η² (||xi||² + R²)
        ≤ ||at-1||² + 2 η² R²        (from (*))

solving the recursion

||at||² < 2 t η² R²

Page 24: Proof

combining the two

t η γ < atT a* ≤ ||at|| ||a*|| < √(2t) η R ||a*||

and

t < (2R²/γ²) ||a*||²
  = (2R²/γ²) ( ||w*||² + b*²/R² )     (from the definition of a*)
  = (2R²/γ²) ( 1 + b*²/R² )          (from ||w*|| = 1)

since b*/||w*|| = b* is the distance from the plane to the origin, we have b* < R = maxi ||xi||, and

t < (2R/γ)²

(figure: the plane, its normal w, and the distance b/||w|| to the origin)

Page 25: Note

this is not the “standard proof” (e.g. Duda, Hart, Stork)
standard proof:
• regular algorithm (no R in the update equations)
• tighter bound t < (R/γ)²

this appears better, but requires choosing η = R²/γ, which requires knowledge of γ, which we don’t have until we find a*
i.e. the proof is non-constructive, we cannot design an algorithm that way
the algorithm above just works!
hence, I like this proof better despite the looser bound.

Page 26: Perceptron learning

Theorem: Let D = {(x1,y1), ..., (xn,yn)} and

R = maxi ||xi||

If there is (w*,b*) such that ||w*|| = 1 and

yi (w*T xi + b*) > γ, ∀ i

then the Perceptron will find an error-free hyper-plane in at most (2R/γ)² iterations

this result was the start of learning theory
for the first time there was a proof that a learning machine could actually learn something!

Page 27: The margin

note that

yi (w*T xi + b*) ≥ γ, ∀ i

will hold if and only if

γ ≤ mini yi (w*T xi + b*) = mini yi (w*T xi + b*) / ||w*||

which (since ||w*|| = 1) is how we defined the margin

this says that the bound on the time to convergence, (2R/γ)², is inversely proportional to the margin
even in this early result, the margin appears as a measure of the difficulty of the learning problem

(figure: the two classes y=1 and y=-1 and the separating plane; the margin is the distance from the plane to the closest points)
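A small sketch that, for a given separating hyperplane with ||w*|| = 1, computes the margin γ and the resulting iteration bound (2R/γ)² (the data and w* are illustrative):

import numpy as np

X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w_star, b_star = np.array([1.0, 1.0]) / np.sqrt(2), 0.0     # a separating plane with ||w*|| = 1

gamma = np.min(y * (X @ w_star + b_star))                   # margin: smallest yi (w*T xi + b*)
R = np.max(np.linalg.norm(X, axis=1))                       # R = max_i ||xi||
print(gamma, (2 * R / gamma) ** 2)                          # bound on the number of corrections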

Page 28: The role of R

scaling the space should not make a difference as to whether the problem is solvable
R accounts for this
if the xi are re-scaled, both R and γ are re-scaled and the bound

(2R/γ)²

remains the same
once again, just a question of normalization
• illustrates the fact that the normalization ||w|| = 1 is usually not sufficient

(figure: the two classes y=1 and y=-1, the separating plane with normal w, the margin γ, and the radius R)

Page 29: Some history

Rosenblatt’s result generated a lot of excitement about learning in the 50s
later, Minsky and Papert identified serious problems with the Perceptron
• there are very simple logic problems that it cannot solve
• more on this in the homework

this killed off the enthusiasm until an old result by Kolmogorov saved the day
Theorem: any continuous function g(x) defined on [0,1]n can be represented in the form

g(x) = Σ_{j=1}^{2n+1} Γj ( Σ_{i=1}^{d} Ψij(xi) )

Page 30: Some history

noting that the Perceptron can be written as

h(x) = sgn[ wT x + b ] = sgn[ Σ_{i=1}^{n} wi xi + w0 ]

this looks like having two Perceptron layers

layer 1: J hyper-planes wj

hj(x) = sgn[ Σ_{i=1}^{n} wij xi + wj0 ],   j = 1, ..., J

layer 2: hyperplane v

u(x) = sgn[ Σ_{j=1}^{J} vj hj(x) + v0 ] = sgn[ Σ_{j=1}^{J} vj sgn( Σ_{i=1}^{n} wij xi + wj0 ) + v0 ]

Page 31: Some history

which can be written as

u(x) = sgn[ g(x) ],  with  g(x) = Σ_{j=1}^{J} vj sgn( Σ_{i=1}^{n} wij xi + wj0 ) + v0

and resembles

g(x) = Σ_{j=1}^{2n+1} Γj ( Σ_{i=1}^{d} Ψij(xi) )

it suggested the idea that
• while one Perceptron is not good enough
• maybe a multi-layered Perceptron (MLP) will work

a lot of work on MLPs ensued, under the name of neural networks
eventually, it was shown that most functions can be approximated by MLPs
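A sketch of the two-layer form u(x) = sgn[Σj vj sgn(Σi wij xi + wj0) + v0]; the weights below are my own choice and make the network compute XOR, the classic example of a simple logic problem that a single Perceptron cannot solve:

import numpy as np

sgn = lambda t: np.where(t > 0, 1, -1)              # sgn with outputs in {+1, -1}

# layer 1: J = 2 hyper-planes (rows of W, biases w0); layer 2: hyperplane v, bias v0
W  = np.array([[1.0, 1.0], [1.0, 1.0]])
w0 = np.array([-0.5, -1.5])
v  = np.array([1.0, -1.0])
v0 = -0.5

def u(x):
    h = sgn(W @ x + w0)                             # hj(x) = sgn(sum_i wij xi + wj0)
    return sgn(v @ h + v0)                          # u(x) = sgn(sum_j vj hj(x) + v0)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, u(np.array(x, dtype=float)))           # outputs +1 exactly when one input is 1 (XOR)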

Page 32: Graphical representation

the Perceptron is usually represented as

(figure: a network diagram with input units feeding, through weights, into a single output unit with a bias term)

input units: coordinates of x
weights: coordinates of w
homogeneous coordinates: x = (x,1)T, so the bias term becomes the weight w0

h(x) = sgn( Σi wi xi + w0 ) = sgn( wT x )

Page 33: Sigmoids

the sgn[.] function is problematic in two ways:
• no derivative at 0
• non-smooth

it can be approximated in various ways, for example by the hyperbolic tangent

f(x) = tanh(σx) = (e^{σx} - e^{-σx}) / (e^{σx} + e^{-σx})

σ controls the approximation error; unlike sgn, tanh
• has a derivative everywhere
• is smooth

neural networks are implemented with these functions

(figure: plots of f(x), f’(x), and f’’(x))
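A sketch of this smooth approximation and of its derivative, using the standard identity d/dx tanh(σx) = σ(1 - tanh²(σx)) (the value of σ and the test points are arbitrary):

import numpy as np

def f(x, sigma=1.0):
    # f(x) = tanh(sigma * x) = (e^{sigma x} - e^{-sigma x}) / (e^{sigma x} + e^{-sigma x})
    return np.tanh(sigma * x)

def df(x, sigma=1.0):
    # the derivative exists everywhere: f'(x) = sigma * (1 - tanh(sigma x)^2)
    return sigma * (1.0 - np.tanh(sigma * x) ** 2)

x = np.linspace(-3, 3, 7)
print(f(x, sigma=4.0))     # larger sigma: closer to sgn(x), but still smooth
print(df(x, sigma=4.0))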

Page 34: Neural network

the MLP as function approximation

(figure: an MLP used as a function approximator)
