Last Lecture Summary
- Introduction to Neural Networks
- Biological Neurons
- Artificial Neurons
- McCulloch and Pitts TLU
- Rosenblatt's Perceptron
MACHINE LEARNING 09/10
Neural Networks: The ADALINE
Alexandre Bernardino, [email protected], Machine Learning 2009/2010
Perceptron Limitations
- The Perceptron's learning rule is not guaranteed to converge if the data is not linearly separable.
- Widrow-Hoff (1960): minimize the error at the output of the linear unit (e) rather than at the output of the threshold unit (e').
ADALINE – Adaptive Linear Element
- The separating hyperplane is equivalent to the perceptron's:

  $w_0 + w_1 x_1 + \dots + w_N x_N = 0$
ADALINE – Adaptive Linear Element
- The learning rule is different from the perceptron's.
- Given the training set: $\{(\vec{x}^p, d^p)\},\ p = 1, \dots, P$
- Minimize the cost function:

  $E(\vec{w}) = \frac{1}{P}\sum_{p=1}^{P} (e^p)^2 = \frac{1}{P}\sum_{p=1}^{P} (s^p - d^p)^2$
ADALINE - Simplification
- Let us consider that, for every pattern: $x_0^p = 1$
- Thus we can write:

  $s^p = \sum_{l=0}^{N} w_l x_l^p$
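
The bias trick above is easy to express in code. Below is a minimal NumPy sketch (variable names such as `X_raw`, `X`, and `w` are illustrative, not from the slides): a column of ones is prepended to the pattern matrix so that $w_0$ plays the role of the bias and the linear output $s^p$ becomes a single matrix-vector product.

```python
import numpy as np

# Illustrative data: P patterns with N raw features each.
P, N = 4, 3
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(P, N))

# Bias trick: prepend x_0^p = 1 to every pattern, so w = [w_0, w_1, ..., w_N].
X = np.hstack([np.ones((P, 1)), X_raw])   # shape (P, N+1)
w = np.zeros(N + 1)

# Linear output for every pattern: s^p = sum_{l=0}^{N} w_l x_l^p.
s = X @ w                                  # shape (P,)
```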
ADALINE – Analytic Solution
- Optimize the cost function:

  $E(\vec{w}) = \frac{1}{P}\sum_{p=1}^{P} (e^p)^2 \qquad (1)$

- Given that:

  $s^p = \sum_{l=0}^{N} w_l x_l^p \qquad (2)$

  $e^p = s^p - d^p \qquad (3)$

  $\vec{w} = [w_0\ w_1\ \dots\ w_N]^T$

- The minimum satisfies:

  $\frac{\partial E}{\partial w_k} = 0, \quad \forall k = 0, \dots, N \qquad (4)$
ADALINE – Analytic Solution
- Compute the gradient of the cost function:

  $E(\vec{w}) = \frac{1}{P}\sum_{p=1}^{P} (e^p)^2$

  $\frac{\partial E}{\partial w_k} = \frac{1}{P}\sum_{p=1}^{P} 2\, e^p \frac{\partial e^p}{\partial w_k} = \frac{1}{P}\sum_{p=1}^{P} 2\, e^p \frac{\partial}{\partial w_k}(s^p - d^p) = \frac{1}{P}\sum_{p=1}^{P} 2\, e^p \frac{\partial s^p}{\partial w_k}$
ADALINE – Analytic Solution
  $\frac{\partial E}{\partial w_k} = \frac{1}{P}\sum_{p=1}^{P} 2\, e^p \frac{\partial s^p}{\partial w_k}$

- Since $s^p = \sum_{l=0}^{N} w_l x_l^p$, we have $\frac{\partial s^p}{\partial w_k} = x_k^p$, and therefore:

  $\frac{\partial E}{\partial w_k} = \frac{1}{P}\sum_{p=1}^{P} 2\, e^p x_k^p$
ADALINE – Analytic Solution
- Very important!
- The partial derivative of the error function with respect to a weight is proportional to the sum, over all patterns, of the input on that weight multiplied by the error:

  $\frac{\partial E}{\partial w_k} = \frac{2}{P}\sum_{p=1}^{P} x_k^p e^p$
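
As a quick sanity check of this gradient expression, the sketch below (illustrative data and names, not from the slides) compares the analytic formula $\frac{\partial E}{\partial w_k} = \frac{2}{P}\sum_p x_k^p e^p$ against a central finite-difference estimate of the same derivative.

```python
import numpy as np

rng = np.random.default_rng(0)
P, N = 20, 3
X = np.hstack([np.ones((P, 1)), rng.normal(size=(P, N))])  # patterns as rows, x_0^p = 1
d = rng.normal(size=P)
w = rng.normal(size=N + 1)

def cost(w):
    e = X @ w - d                     # e^p = s^p - d^p
    return np.mean(e ** 2)            # E(w) = (1/P) * sum_p (e^p)^2

# Analytic gradient: dE/dw_k = (2/P) * sum_p x_k^p e^p (all k at once).
grad = 2.0 / P * X.T @ (X @ w - d)

# Central finite-difference estimate of one component, for comparison.
eps, k = 1e-6, 1
w_plus, w_minus = w.copy(), w.copy()
w_plus[k] += eps
w_minus[k] -= eps
fd = (cost(w_plus) - cost(w_minus)) / (2 * eps)
print(grad[k], fd)                    # the two values agree to several decimals
```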
ADALINE – Analytic Solution
- Given that $e^p = s^p - d^p$:

  $\frac{\partial E}{\partial w_k} = \frac{2}{P}\sum_{p=1}^{P} x_k^p e^p = \frac{2}{P}\sum_{p=1}^{P} (s^p - d^p)\, x_k^p = \frac{2}{P}\sum_{p=1}^{P} s^p x_k^p - \frac{2}{P}\sum_{p=1}^{P} d^p x_k^p$
ADALINE – Analytic Solution
- Substituting $s^p = \sum_{l=0}^{N} w_l x_l^p$:

  $\frac{\partial E}{\partial w_k} = \frac{2}{P}\sum_{p=1}^{P} s^p x_k^p - \frac{2}{P}\sum_{p=1}^{P} d^p x_k^p = \frac{2}{P}\sum_{p=1}^{P} \sum_{l=0}^{N} w_l x_l^p x_k^p - \frac{2}{P}\sum_{p=1}^{P} d^p x_k^p$
ADALINE – Analytic Solution
- Setting $\frac{\partial E}{\partial w_k} = 0, \ \forall k = 0, \dots, N$:

  $\frac{2}{P}\sum_{p=1}^{P} \sum_{l=0}^{N} w_l x_l^p x_k^p - \frac{2}{P}\sum_{p=1}^{P} d^p x_k^p = 0 \;\Leftrightarrow\; \sum_{p=1}^{P} \sum_{l=0}^{N} w_l x_l^p x_k^p = \sum_{p=1}^{P} d^p x_k^p$
ADALINE – Analytic Solution
  $\sum_{p=1}^{P} \sum_{l=0}^{N} w_l x_l^p x_k^p = \sum_{p=1}^{P} d^p x_k^p, \quad \forall k = 0, \dots, N$

- This is a linear system of N+1 equations in N+1 unknowns. How can it be solved?
ADALINE – Matrix Notation
  $\vec{w} = [w_0\ w_1\ \dots\ w_N]^T$

  $\vec{x}^p = [x_0^p\ x_1^p\ \dots\ x_N^p]^T, \quad x_0^p = 1$

  $s^p = \sum_{l=0}^{N} w_l x_l^p = \vec{w}^T \vec{x}^p$

  $e^p = \vec{w}^T \vec{x}^p - d^p$
ADALINE – Matrix Notation
  $E(\vec{w}) = \frac{1}{P}\sum_{p=1}^{P} (e^p)^2 = \frac{1}{P}\sum_{p=1}^{P} \left(\vec{w}^T \vec{x}^p - d^p\right)^2$

  $E(\vec{w}) = \frac{1}{P}\sum_{p=1}^{P} \left(\vec{w}^T \vec{x}^p - d^p\right)\left(\vec{x}^{pT} \vec{w} - d^p\right)$

  $E(\vec{w}) = \frac{1}{P}\sum_{p=1}^{P} \left(\vec{w}^T \vec{x}^p \vec{x}^{pT} \vec{w} - \vec{w}^T \vec{x}^p d^p - d^p \vec{x}^{pT} \vec{w} + (d^p)^2\right)$
ADALINE – Matrix Notation
  $E(\vec{w}) = \frac{1}{P}\sum_{p=1}^{P} \left(\vec{w}^T \vec{x}^p \vec{x}^{pT} \vec{w} - \vec{w}^T \vec{x}^p d^p - d^p \vec{x}^{pT} \vec{w} + (d^p)^2\right)$

  $E(\vec{w}) = \frac{1}{P}\sum_{p=1}^{P} \vec{w}^T \vec{x}^p \vec{x}^{pT} \vec{w} - \frac{2}{P}\sum_{p=1}^{P} \vec{w}^T \vec{x}^p d^p + \frac{1}{P}\sum_{p=1}^{P} (d^p)^2$

  $E(\vec{w}) = \vec{w}^T \left(\frac{1}{P}\sum_{p=1}^{P} \vec{x}^p \vec{x}^{pT}\right) \vec{w} - 2\,\vec{w}^T \left(\frac{1}{P}\sum_{p=1}^{P} \vec{x}^p d^p\right) + \frac{1}{P}\sum_{p=1}^{P} (d^p)^2$
ADALINE – Matrix Notation
- Let us introduce the average operator $\langle \cdot \rangle$:

  $\langle \cdot \rangle = \frac{1}{P}\sum_{p=1}^{P} (\cdot)$

- The cost function is written as:

  $E(\vec{w}) = \vec{w}^T \langle \vec{x}^p \vec{x}^{pT} \rangle \vec{w} - 2\,\vec{w}^T \langle \vec{x}^p d^p \rangle + \langle (d^p)^2 \rangle$
ADALINE – Matrix Notation
- Defining:

  $R_{xx} = \langle \vec{x}^p \vec{x}^{pT} \rangle = \frac{1}{P}\sum_{p=1}^{P} \vec{x}^p \vec{x}^{pT}$

  $\vec{p} = \langle \vec{x}^p d^p \rangle = \frac{1}{P}\sum_{p=1}^{P} \vec{x}^p d^p$

  $\sigma_d^2 = \langle (d^p)^2 \rangle = \frac{1}{P}\sum_{p=1}^{P} (d^p)^2$

- the cost function is:

  $E(\vec{w}) = \vec{w}^T R_{xx} \vec{w} - 2\,\vec{w}^T \vec{p} + \sigma_d^2$
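
A small NumPy sketch (illustrative data and names, not from the slides) that checks this quadratic form numerically: the value $\vec{w}^T R_{xx} \vec{w} - 2\,\vec{w}^T\vec{p} + \sigma_d^2$ computed from the sample averages coincides with the mean squared error evaluated directly over the patterns.

```python
import numpy as np

rng = np.random.default_rng(1)
P, N = 50, 2
X = np.hstack([np.ones((P, 1)), rng.normal(size=(P, N))])   # rows are x^p, with x_0^p = 1
d = rng.normal(size=P)
w = rng.normal(size=N + 1)

# Sample averages over the training set (the < . > operator).
R_xx = X.T @ X / P           # < x^p x^pT >
p_vec = X.T @ d / P          # < x^p d^p >
sigma_d2 = np.mean(d ** 2)   # < (d^p)^2 >

# The quadratic form of the cost ...
E_quad = w @ R_xx @ w - 2 * w @ p_vec + sigma_d2
# ... matches the mean squared error computed directly over the patterns.
E_direct = np.mean((X @ w - d) ** 2)
print(np.isclose(E_quad, E_direct))   # True
```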
ADALINE – Quadratic Cost
- $R_{xx}$ is a covariance matrix, hence positive semi-definite.

  $E(\vec{w}) = \vec{w}^T R_{xx} \vec{w} - 2\,\vec{w}^T \vec{p} + \sigma_d^2$

- The error function surface is a paraboloid (a parabola along each weight direction).
ADALINE – Gradient Vector
  $E(\vec{w}) = \vec{w}^T R_{xx} \vec{w} - 2\,\vec{w}^T \vec{p} + \sigma_d^2$

  $\nabla E(\vec{w}) = 2\, R_{xx} \vec{w} - 2\,\vec{p}$
ADALINE – Closed Form Solution
  $\nabla E(\vec{w}^*) = 0 \;\Leftrightarrow\; R_{xx} \vec{w}^* = \vec{p}$

- If $R_{xx}$ is positive definite, the minimum is unique:

  $\vec{w}^* = R_{xx}^{-1}\, \vec{p}$
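
A minimal sketch of the closed-form solution on synthetic data (names and data are illustrative): rather than forming $R_{xx}^{-1}$ explicitly, the linear system $R_{xx}\vec{w}^* = \vec{p}$ is solved directly, and the result is cross-checked against NumPy's least-squares solver, which minimizes the same squared error.

```python
import numpy as np

rng = np.random.default_rng(2)
P, N = 200, 3
X = np.hstack([np.ones((P, 1)), rng.normal(size=(P, N))])          # x_0^p = 1
d = X @ np.array([0.5, -1.0, 2.0, 0.3]) + 0.1 * rng.normal(size=P)

R_xx = X.T @ X / P
p_vec = X.T @ d / P

# Solve R_xx w* = p directly instead of forming the explicit inverse.
w_star = np.linalg.solve(R_xx, p_vec)

# Cross-check against NumPy's least-squares solver, which minimizes the same error.
w_ls, *_ = np.linalg.lstsq(X, d, rcond=None)
print(np.allclose(w_star, w_ls))   # True
```

In high dimensions, or when $R_{xx}$ is ill-conditioned, this direct solve is exactly the step the next slide warns about, which motivates the gradient methods that follow.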
ADALINE – Closed Form Solution
- The closed-form solution requires the inversion of the covariance matrix, which can be problematic in high dimensions.
- Gradient methods are simpler and have proven convergence properties for quadratic functions:

  $w_k^{(t+1)} = w_k^{(t)} - \eta \frac{\partial E}{\partial w_k}$
ADALINE – Gradient Based Solution
- Remember, from the analytic solution a few slides back:

  $E(\vec{w}) = \frac{1}{P}\sum_{p=1}^{P} (e^p)^2$

  $\frac{\partial E}{\partial w_k} = \frac{2}{P}\sum_{p=1}^{P} x_k^p e^p$
ADALINE – Gradient Based Solution
  $\frac{\partial E}{\partial w_k} = \frac{2}{P}\sum_{p=1}^{P} x_k^p e^p$

  $w_k^{(t+1)} = w_k^{(t)} - \eta \frac{\partial E}{\partial w_k}$

- Combining the two:

  $w_k^{(t+1)} = w_k^{(t)} - \frac{2\eta}{P}\sum_{p=1}^{P} x_k^p e^p$
ADALINE – Batch Algorithm
- Initialize the weights at arbitrary values.
- Define a learning rate η.
- Repeat:
  - For each pattern in the training set:
    - Apply x^p to the ADALINE input.
    - Observe the output s^p and compute the error e^p = s^p − d^p.
    - For each weight k, accumulate the product x_k^p e^p.
  - After processing all patterns, update each weight k by:

    $w_k^{(t+1)} = w_k^{(t)} - \frac{2\eta}{P}\sum_{p=1}^{P} x_k^p e^p$
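
A compact NumPy sketch of the batch algorithm above (function and variable names are mine, not from the slides); the accumulation over patterns is expressed as a single matrix product.

```python
import numpy as np

def adaline_batch(X, d, eta=0.05, epochs=500):
    """Batch ADALINE: one weight update per pass over all P patterns.
    X is (P, N+1) with x_0^p = 1 in the first column; d is (P,)."""
    P = X.shape[0]
    w = np.zeros(X.shape[1])            # arbitrary initial weights
    for _ in range(epochs):
        e = X @ w - d                   # e^p = s^p - d^p, for all patterns at once
        grad = 2.0 / P * X.T @ e        # dE/dw_k = (2/P) * sum_p x_k^p e^p
        w = w - eta * grad              # w_k <- w_k - eta * dE/dw_k
    return w

# Usage on synthetic data: the weights approach [1.0, -2.0, 0.5].
rng = np.random.default_rng(3)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])
d = X @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.normal(size=100)
print(adaline_batch(X, d))
```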
ADALINE – Batch Algorithm
ADALINE's batch algorithm properties:
- Guaranteed to converge to the weight set with minimum squared error:
  - given a sufficiently small learning rate η;
  - even when the training data contains noise;
  - even when the training data is not separable.
ADALINE – Batch Algorithm
- ADALINE's batch algorithm requires the availability of all the training data from the beginning.
- The weights are updated only after presenting the whole training data.
- But humans learn continuously!
- In some applications we may want to update the weights immediately after each training pattern becomes available.
ADALINE – Incremental Algorithm
- Incremental algorithm: approximate the complete gradient by its estimate for each pattern.

- Complete (exact) gradient:

  $\frac{\partial E}{\partial w_k} = \frac{2}{P}\sum_{p=1}^{P} x_k^p e^p$

- Stochastic (approximate) gradient:

  $\widehat{\frac{\partial E}{\partial w_k}} = x_k^p e^p$
ADALINE – Incremental Algorithm
- Incremental mode gradient descent:

  $w_k^{(t+1)} = w_k^{(t)} - 2\eta\, x_k^p e^p$

- Batch mode gradient descent:

  $w_k^{(t+1)} = w_k^{(t)} - \frac{2\eta}{P}\sum_{p=1}^{P} x_k^p e^p$

- Incremental gradient descent can approximate batch gradient descent arbitrarily closely if η is made small enough.
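
For contrast with the batch loop above, here is a minimal sketch of the incremental variant, assuming the same data layout (bias column of ones; function and variable names are illustrative): the weights are updated immediately after every single pattern, with the patterns visited in random order.

```python
import numpy as np

def adaline_incremental(X, d, eta=0.01, epochs=50, seed=0):
    """Incremental ADALINE: one weight update per pattern,
    w_k <- w_k - 2*eta*x_k^p*e^p, with e^p = s^p - d^p."""
    rng = np.random.default_rng(seed)
    P, M = X.shape
    w = np.zeros(M)
    for _ in range(epochs):
        for p in rng.permutation(P):       # random pattern order each epoch
            e_p = X[p] @ w - d[p]          # instantaneous error e^p
            w = w - 2 * eta * e_p * X[p]   # stochastic gradient step
    return w
```

With a small enough η the trajectory tracks the batch solution closely; with a larger η it reaches the vicinity of the minimum faster but keeps oscillating around it, as noted below.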
ADALINE – Incremental Algorithms
- Incremental gradient descent is also known as stochastic gradient descent.
- It is also called the LMS algorithm or the Delta Rule.
- Because it is based on an approximation of the gradient, it never goes exactly to the minimum of the cost function.
- After reaching a vicinity of the minimum, it oscillates around it.
- The amplitude of the oscillations can be reduced by reducing η.
ADALINE - Comparison
- The plots show the value of one weight over time: the batch version (horizontal axis in epochs) and the incremental version (horizontal axis in patterns).
- 1 epoch = P patterns (the full training set).
ADALINE vs Perceptron
- Both the ADALINE Delta Rule and the Perceptron weight update rule are instances of Error Correction Learning.

- ADALINE Delta Rule:

  $w_k^{(t+1)} = w_k^{(t)} - 2\eta\, x_k^p (s^p - d^p)$

- Perceptron update rule:

  $w_k^{(t+1)} = w_k^{(t)} - \eta\, x_k^p (y^p - d^p)$

- The ADALINE allows arbitrary real values at the output, whereas the perceptron assumes binary outputs.
- The ADALINE always converges (given a small enough η) to the minimum squared error, while the perceptron only converges when the data is separable.
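
The two update rules can be put side by side in code. This is a hedged sketch: the per-pattern function signatures and the {−1, +1} target encoding for the perceptron are assumptions of mine, not given on the slides.

```python
import numpy as np

def delta_rule_step(w, x, d, eta):
    """ADALINE Delta Rule: error taken at the linear output s = w^T x."""
    s = w @ x
    return w - 2 * eta * x * (s - d)

def perceptron_step(w, x, d, eta):
    """Perceptron rule: error taken at the thresholded output y = sign(w^T x).
    Assumes targets d in {-1, +1} (an assumption, not from the slides)."""
    y = np.sign(w @ x)
    return w - eta * x * (y - d)
```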
ADALINE – Statistical Interpretation
- The analytical solution for the weights was obtained by "averaging" quantities computed from the training set.
- It is possible to make a statistical interpretation of the process:
  - Inputs: observations of N random variables, X = [1, X_1, ..., X_k, ..., X_N]
  - Desired output: observations of one random variable, D
  - Output: observations of one random variable, Y = w^T X
ADALINE – Statistical Interpretation
- The error function can be interpreted as an approximation to the statistical expectation E[·]:

  $E(\vec{w}) = \frac{1}{P}\sum_{p=1}^{P} (y^p - d^p)^2 \approx E\!\left[(Y - D)^2\right]$

- Solution:

  $\vec{w}^* = R_{xx}^{-1}\, \vec{p}$

- The matrix $R_{xx}$ and the vector $\vec{p}$ can be interpreted as approximations to the statistical auto-covariance and cross-covariance of the random variables:

  $R_{xx} = \frac{1}{P}\sum_{p=1}^{P} \vec{x}^p \vec{x}^{pT} \approx E[X X^T]$

  $\vec{p} = \frac{1}{P}\sum_{p=1}^{P} \vec{x}^p d^p \approx E[X D]$
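
A small Monte Carlo sketch of this interpretation (the generative model and all names below are illustrative assumptions, not from the slides): with many patterns, the sample averages $R_{xx}$ and $\vec{p}$ approach $E[XX^T]$ and $E[XD]$, so the ADALINE solution $\vec{w}^* = R_{xx}^{-1}\vec{p}$ approaches the coefficients of the underlying linear model.

```python
import numpy as np

rng = np.random.default_rng(4)
P, N = 100_000, 2

# Assumed generative model: D = a^T X + noise, with X = [1, X_1, X_2].
a = np.array([0.7, -1.2, 0.4])
X = np.hstack([np.ones((P, 1)), rng.normal(size=(P, N))])
D = X @ a + 0.1 * rng.normal(size=P)

# Sample estimates of E[X X^T] and E[X D].
R_xx = X.T @ X / P
p_vec = X.T @ D / P

# w* = R_xx^{-1} p recovers (approximately) the underlying coefficients a.
print(np.round(np.linalg.solve(R_xx, p_vec), 3))   # close to [0.7, -1.2, 0.4]
```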
ADALINE – Statistical Interpretation
- The LMS algorithm is based on an instantaneous estimate of the gradient.
- This estimate can be modeled by:

  $\hat{g}(n) = g(n) + e_g(n)$

  where $e_g(n)$ is a random noise vector.
- LMS = stochastic gradient descent.
ADALINE – Statistical Interpretation
- Under reasonable conditions, stochastic gradient methods may converge to the exact solution.
- Convergence conditions [Robbins-Monro; Ljung]:
  - $e_g(n)$ is zero mean;
  - the pattern sequence is random;
  - η(n) tends slowly toward zero:

    $\sum_{n=0}^{\infty} \eta(n)^2 < \infty, \qquad \sum_{n=0}^{\infty} \eta(n) = \infty$
ADALINE – Statistical Interpretation
- Typical learning rate schedules:

  $\eta(n) = \frac{c}{n}$

  $\eta(n) = \frac{\eta_0}{1 + n/\tau}$

[Plots: the two schedules η(n) decreasing with n, shown for n between 0 and 1000 and η between 0 and 1.]
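
As a worked check (a standard result, not stated on the slides), the first schedule with $c > 0$, summed from $n = 1$, satisfies both convergence conditions: the harmonic series diverges while the series of its squares converges.

```latex
\sum_{n=1}^{\infty} \eta(n) = c \sum_{n=1}^{\infty} \frac{1}{n} = \infty ,
\qquad
\sum_{n=1}^{\infty} \eta(n)^2 = c^2 \sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{c^2 \pi^2}{6} < \infty .
```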