Last Lecture Summary
- Introduction to Neural Networks
- Biological Neurons
- Artificial Neurons
- McCulloch and Pitts TLU
- Rosenblatt's Perceptron
MACHINE LEARNING 09/10
Neural Networks: The ADALINE
Alexandre Bernardino, [email protected], Machine Learning 2009/2010
Perceptron Limitations
- The Perceptron's learning rule is not guaranteed to converge if the data is not linearly separable.
- Widrow-Hoff (1960): minimize the error at the output of the linear unit (e) rather than at the output of the threshold unit (e').
ADALINE – Adaptive Linear Element
- The separating hyperplane is equivalent to the perceptron's:

  $w_0 + w_1 x_1 + \dots + w_N x_N = 0$
ADALINE – Adaptive Linear Element
- The learning rule is different from the perceptron's.
- Given the training set: $\{(\vec{x}^p, d^p)\},\ p = 1, \dots, P$
- Minimize the cost function:

  $E(\vec{w}) = \frac{1}{P}\sum_{p=1}^{P} (e^p)^2 = \frac{1}{P}\sum_{p=1}^{P} (s^p - d^p)^2$
ADALINE - Simplification
- Let us consider that, for every pattern: $x_0^p = 1$
- Thus we can write:

  $s^p = \sum_{l=0}^{N} w_l x_l^p$
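
The bias trick above is easy to express in code. Below is a minimal NumPy sketch (variable names such as `X_raw`, `X`, and `w` are illustrative, not from the slides): a column of ones is prepended to the pattern matrix so that $w_0$ plays the role of the bias and the linear output $s^p$ becomes a single matrix-vector product.

```python
import numpy as np

# Illustrative data: P patterns with N raw features each.
P, N = 4, 3
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(P, N))

# Bias trick: prepend x_0^p = 1 to every pattern, so w = [w_0, w_1, ..., w_N].
X = np.hstack([np.ones((P, 1)), X_raw])   # shape (P, N+1)
w = np.zeros(N + 1)

# Linear output for every pattern: s^p = sum_{l=0}^{N} w_l x_l^p.
s = X @ w                                  # shape (P,)
```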
ADALINE – Analytic Solution
- Optimize the cost function:

  $E(\vec{w}) = \frac{1}{P}\sum_{p=1}^{P} (e^p)^2 \qquad (1)$

- Given that:

  $s^p = \sum_{l=0}^{N} w_l x_l^p \qquad (2)$

  $e^p = s^p - d^p \qquad (3)$

  $\vec{w} = [w_0\ w_1\ \dots\ w_N]^T$

- The minimum satisfies:

  $\frac{\partial E}{\partial w_k} = 0, \quad \forall k = 0, \dots, N \qquad (4)$
ADALINE – Analytic Solution
- Compute the gradient of the cost function:

  $E(\vec{w}) = \frac{1}{P}\sum_{p=1}^{P} (e^p)^2$

  $\frac{\partial E}{\partial w_k} = \frac{1}{P}\sum_{p=1}^{P} 2\, e^p \frac{\partial e^p}{\partial w_k} = \frac{1}{P}\sum_{p=1}^{P} 2\, e^p \frac{\partial}{\partial w_k}(s^p - d^p) = \frac{1}{P}\sum_{p=1}^{P} 2\, e^p \frac{\partial s^p}{\partial w_k}$
ADALINE – Analytic Solution
  $\frac{\partial E}{\partial w_k} = \frac{1}{P}\sum_{p=1}^{P} 2\, e^p \frac{\partial s^p}{\partial w_k}$

- Since $s^p = \sum_{l=0}^{N} w_l x_l^p$, we have $\frac{\partial s^p}{\partial w_k} = x_k^p$, and therefore:

  $\frac{\partial E}{\partial w_k} = \frac{1}{P}\sum_{p=1}^{P} 2\, e^p x_k^p$
ADALINE – Analytic Solution
- Very important!
- The partial derivative of the error function with respect to a weight is proportional to the sum, over all patterns, of the input on that weight multiplied by the error:

  $\frac{\partial E}{\partial w_k} = \frac{2}{P}\sum_{p=1}^{P} x_k^p e^p$
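
As a quick sanity check of this gradient expression, the sketch below (illustrative data and names, not from the slides) compares the analytic formula $\frac{\partial E}{\partial w_k} = \frac{2}{P}\sum_p x_k^p e^p$ against a central finite-difference estimate of the same derivative.

```python
import numpy as np

rng = np.random.default_rng(0)
P, N = 20, 3
X = np.hstack([np.ones((P, 1)), rng.normal(size=(P, N))])  # patterns as rows, x_0^p = 1
d = rng.normal(size=P)
w = rng.normal(size=N + 1)

def cost(w):
    e = X @ w - d                     # e^p = s^p - d^p
    return np.mean(e ** 2)            # E(w) = (1/P) * sum_p (e^p)^2

# Analytic gradient: dE/dw_k = (2/P) * sum_p x_k^p e^p (all k at once).
grad = 2.0 / P * X.T @ (X @ w - d)

# Central finite-difference estimate of one component, for comparison.
eps, k = 1e-6, 1
w_plus, w_minus = w.copy(), w.copy()
w_plus[k] += eps
w_minus[k] -= eps
fd = (cost(w_plus) - cost(w_minus)) / (2 * eps)
print(grad[k], fd)                    # the two values agree to several decimals
```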
ADALINE – Analytic Solution
- Given that $e^p = s^p - d^p$:

  $\frac{\partial E}{\partial w_k} = \frac{2}{P}\sum_{p=1}^{P} x_k^p e^p = \frac{2}{P}\sum_{p=1}^{P} (s^p - d^p)\, x_k^p = \frac{2}{P}\sum_{p=1}^{P} s^p x_k^p - \frac{2}{P}\sum_{p=1}^{P} d^p x_k^p$
ADALINE – Analytic Solution
- Substituting $s^p = \sum_{l=0}^{N} w_l x_l^p$:

  $\frac{\partial E}{\partial w_k} = \frac{2}{P}\sum_{p=1}^{P} s^p x_k^p - \frac{2}{P}\sum_{p=1}^{P} d^p x_k^p = \frac{2}{P}\sum_{p=1}^{P} \sum_{l=0}^{N} w_l x_l^p x_k^p - \frac{2}{P}\sum_{p=1}^{P} d^p x_k^p$
ADALINE – Analytic Solution
- Setting $\frac{\partial E}{\partial w_k} = 0, \ \forall k = 0, \dots, N$:

  $\frac{2}{P}\sum_{p=1}^{P} \sum_{l=0}^{N} w_l x_l^p x_k^p - \frac{2}{P}\sum_{p=1}^{P} d^p x_k^p = 0 \;\Leftrightarrow\; \sum_{p=1}^{P} \sum_{l=0}^{N} w_l x_l^p x_k^p = \sum_{p=1}^{P} d^p x_k^p$
ADALINE – Analytic Solution
  $\sum_{p=1}^{P} \sum_{l=0}^{N} w_l x_l^p x_k^p = \sum_{p=1}^{P} d^p x_k^p, \quad \forall k = 0, \dots, N$

- This is a linear system of N+1 equations in N+1 unknowns. How can it be solved?
ADALINE – Matrix Notation
  $\vec{w} = [w_0\ w_1\ \dots\ w_N]^T$

  $\vec{x}^p = [x_0^p\ x_1^p\ \dots\ x_N^p]^T, \quad x_0^p = 1$

  $s^p = \sum_{l=0}^{N} w_l x_l^p = \vec{w}^T \vec{x}^p$

  $e^p = \vec{w}^T \vec{x}^p - d^p$
ADALINE – Matrix Notation
  $E(\vec{w}) = \frac{1}{P}\sum_{p=1}^{P} (e^p)^2 = \frac{1}{P}\sum_{p=1}^{P} \left(\vec{w}^T \vec{x}^p - d^p\right)^2$

  $E(\vec{w}) = \frac{1}{P}\sum_{p=1}^{P} \left(\vec{w}^T \vec{x}^p - d^p\right)\left(\vec{x}^{pT} \vec{w} - d^p\right)$

  $E(\vec{w}) = \frac{1}{P}\sum_{p=1}^{P} \left(\vec{w}^T \vec{x}^p \vec{x}^{pT} \vec{w} - \vec{w}^T \vec{x}^p d^p - d^p \vec{x}^{pT} \vec{w} + (d^p)^2\right)$
ADALINE – Matrix Notation
  $E(\vec{w}) = \frac{1}{P}\sum_{p=1}^{P} \left(\vec{w}^T \vec{x}^p \vec{x}^{pT} \vec{w} - \vec{w}^T \vec{x}^p d^p - d^p \vec{x}^{pT} \vec{w} + (d^p)^2\right)$

  $E(\vec{w}) = \frac{1}{P}\sum_{p=1}^{P} \vec{w}^T \vec{x}^p \vec{x}^{pT} \vec{w} - \frac{2}{P}\sum_{p=1}^{P} \vec{w}^T \vec{x}^p d^p + \frac{1}{P}\sum_{p=1}^{P} (d^p)^2$

  $E(\vec{w}) = \vec{w}^T \left(\frac{1}{P}\sum_{p=1}^{P} \vec{x}^p \vec{x}^{pT}\right) \vec{w} - 2\,\vec{w}^T \left(\frac{1}{P}\sum_{p=1}^{P} \vec{x}^p d^p\right) + \frac{1}{P}\sum_{p=1}^{P} (d^p)^2$
ADALINE – Matrix Notation
- Let us introduce the average operator $\langle \cdot \rangle$:

  $\langle \cdot \rangle = \frac{1}{P}\sum_{p=1}^{P} (\cdot)$

- The cost function is written as:

  $E(\vec{w}) = \vec{w}^T \langle \vec{x}^p \vec{x}^{pT} \rangle \vec{w} - 2\,\vec{w}^T \langle \vec{x}^p d^p \rangle + \langle (d^p)^2 \rangle$
ADALINE – Matrix Notation
- Defining:

  $R_{xx} = \langle \vec{x}^p \vec{x}^{pT} \rangle = \frac{1}{P}\sum_{p=1}^{P} \vec{x}^p \vec{x}^{pT}$

  $\vec{p} = \langle \vec{x}^p d^p \rangle = \frac{1}{P}\sum_{p=1}^{P} \vec{x}^p d^p$

  $\sigma_d^2 = \langle (d^p)^2 \rangle = \frac{1}{P}\sum_{p=1}^{P} (d^p)^2$

- the cost function is:

  $E(\vec{w}) = \vec{w}^T R_{xx} \vec{w} - 2\,\vec{w}^T \vec{p} + \sigma_d^2$
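
A small NumPy sketch (illustrative data and names, not from the slides) that checks this quadratic form numerically: the value $\vec{w}^T R_{xx} \vec{w} - 2\,\vec{w}^T\vec{p} + \sigma_d^2$ computed from the sample averages coincides with the mean squared error evaluated directly over the patterns.

```python
import numpy as np

rng = np.random.default_rng(1)
P, N = 50, 2
X = np.hstack([np.ones((P, 1)), rng.normal(size=(P, N))])   # rows are x^p, with x_0^p = 1
d = rng.normal(size=P)
w = rng.normal(size=N + 1)

# Sample averages over the training set (the < . > operator).
R_xx = X.T @ X / P           # < x^p x^pT >
p_vec = X.T @ d / P          # < x^p d^p >
sigma_d2 = np.mean(d ** 2)   # < (d^p)^2 >

# The quadratic form of the cost ...
E_quad = w @ R_xx @ w - 2 * w @ p_vec + sigma_d2
# ... matches the mean squared error computed directly over the patterns.
E_direct = np.mean((X @ w - d) ** 2)
print(np.isclose(E_quad, E_direct))   # True
```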
ADALINE – Quadratic Cost
- $R_{xx}$ is a covariance matrix, hence positive semi-definite.

  $E(\vec{w}) = \vec{w}^T R_{xx} \vec{w} - 2\,\vec{w}^T \vec{p} + \sigma_d^2$

- The error function surface is a paraboloid (a parabola along each weight direction).
ADALINE – Gradient Vector
  $E(\vec{w}) = \vec{w}^T R_{xx} \vec{w} - 2\,\vec{w}^T \vec{p} + \sigma_d^2$

  $\nabla E(\vec{w}) = 2\, R_{xx} \vec{w} - 2\,\vec{p}$
ADALINE – Closed Form Solution
  $\nabla E(\vec{w}^*) = 0 \;\Leftrightarrow\; R_{xx} \vec{w}^* = \vec{p}$

- If $R_{xx}$ is positive definite, the minimum is unique:

  $\vec{w}^* = R_{xx}^{-1}\, \vec{p}$
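
A minimal sketch of the closed-form solution on synthetic data (names and data are illustrative): rather than forming $R_{xx}^{-1}$ explicitly, the linear system $R_{xx}\vec{w}^* = \vec{p}$ is solved directly, and the result is cross-checked against NumPy's least-squares solver, which minimizes the same squared error.

```python
import numpy as np

rng = np.random.default_rng(2)
P, N = 200, 3
X = np.hstack([np.ones((P, 1)), rng.normal(size=(P, N))])          # x_0^p = 1
d = X @ np.array([0.5, -1.0, 2.0, 0.3]) + 0.1 * rng.normal(size=P)

R_xx = X.T @ X / P
p_vec = X.T @ d / P

# Solve R_xx w* = p directly instead of forming the explicit inverse.
w_star = np.linalg.solve(R_xx, p_vec)

# Cross-check against NumPy's least-squares solver, which minimizes the same error.
w_ls, *_ = np.linalg.lstsq(X, d, rcond=None)
print(np.allclose(w_star, w_ls))   # True
```

In high dimensions, or when $R_{xx}$ is ill-conditioned, this direct solve is exactly the step the next slide warns about, which motivates the gradient methods that follow.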
ADALINE – Closed Form Solution
- The closed-form solution requires the inversion of the covariance matrix, which can be problematic in high dimensions.
- Gradient methods are simpler and have proven convergence properties for quadratic functions:

  $w_k^{(t+1)} = w_k^{(t)} - \eta \frac{\partial E}{\partial w_k}$
ADALINE – Gradient Based Solution
- Remember, from the analytic solution a few slides back:

  $E(\vec{w}) = \frac{1}{P}\sum_{p=1}^{P} (e^p)^2$

  $\frac{\partial E}{\partial w_k} = \frac{2}{P}\sum_{p=1}^{P} x_k^p e^p$
ADALINE – Gradient Based Solution
  $\frac{\partial E}{\partial w_k} = \frac{2}{P}\sum_{p=1}^{P} x_k^p e^p$

  $w_k^{(t+1)} = w_k^{(t)} - \eta \frac{\partial E}{\partial w_k}$

- Combining the two:

  $w_k^{(t+1)} = w_k^{(t)} - \frac{2\eta}{P}\sum_{p=1}^{P} x_k^p e^p$
ADALINE – Batch Algorithm
- Initialize the weights at arbitrary values.
- Define a learning rate η.
- Repeat:
  - For each pattern in the training set:
    - Apply x^p to the ADALINE input.
    - Observe the output s^p and compute the error e^p = s^p − d^p.
    - For each weight k, accumulate the product x_k^p e^p.
  - After processing all patterns, update each weight k by:

    $w_k^{(t+1)} = w_k^{(t)} - \frac{2\eta}{P}\sum_{p=1}^{P} x_k^p e^p$
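
A compact NumPy sketch of the batch algorithm above (function and variable names are mine, not from the slides); the accumulation over patterns is expressed as a single matrix product.

```python
import numpy as np

def adaline_batch(X, d, eta=0.05, epochs=500):
    """Batch ADALINE: one weight update per pass over all P patterns.
    X is (P, N+1) with x_0^p = 1 in the first column; d is (P,)."""
    P = X.shape[0]
    w = np.zeros(X.shape[1])            # arbitrary initial weights
    for _ in range(epochs):
        e = X @ w - d                   # e^p = s^p - d^p, for all patterns at once
        grad = 2.0 / P * X.T @ e        # dE/dw_k = (2/P) * sum_p x_k^p e^p
        w = w - eta * grad              # w_k <- w_k - eta * dE/dw_k
    return w

# Usage on synthetic data: the weights approach [1.0, -2.0, 0.5].
rng = np.random.default_rng(3)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])
d = X @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.normal(size=100)
print(adaline_batch(X, d))
```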
ADALINE – Batch Algorithm
ADALINE's batch algorithm properties:
- Guaranteed to converge to the weight set with minimum squared error:
  - given a sufficiently small learning rate η;
  - even when the training data contains noise;
  - even when the training data is not separable.
ADALINE – Batch Algorithm
- ADALINE's batch algorithm requires the availability of all the training data from the beginning.
- The weights are updated only after presenting the whole training data.
- But humans learn continuously!
- In some applications we may want to update the weights immediately after each training pattern becomes available.
ADALINE – Incremental Algorithm
- Incremental algorithm: approximate the complete gradient by its estimate for each pattern.

- Complete (exact) gradient:

  $\frac{\partial E}{\partial w_k} = \frac{2}{P}\sum_{p=1}^{P} x_k^p e^p$

- Stochastic (approximate) gradient:

  $\widehat{\frac{\partial E}{\partial w_k}} = x_k^p e^p$
ADALINE – Incremental Algorithm
- Incremental mode gradient descent:

  $w_k^{(t+1)} = w_k^{(t)} - 2\eta\, x_k^p e^p$

- Batch mode gradient descent:

  $w_k^{(t+1)} = w_k^{(t)} - \frac{2\eta}{P}\sum_{p=1}^{P} x_k^p e^p$

- Incremental gradient descent can approximate batch gradient descent arbitrarily closely if η is made small enough.
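
For contrast with the batch loop above, here is a minimal sketch of the incremental variant, assuming the same data layout (bias column of ones; function and variable names are illustrative): the weights are updated immediately after every single pattern, with the patterns visited in random order.

```python
import numpy as np

def adaline_incremental(X, d, eta=0.01, epochs=50, seed=0):
    """Incremental ADALINE: one weight update per pattern,
    w_k <- w_k - 2*eta*x_k^p*e^p, with e^p = s^p - d^p."""
    rng = np.random.default_rng(seed)
    P, M = X.shape
    w = np.zeros(M)
    for _ in range(epochs):
        for p in rng.permutation(P):       # random pattern order each epoch
            e_p = X[p] @ w - d[p]          # instantaneous error e^p
            w = w - 2 * eta * e_p * X[p]   # stochastic gradient step
    return w
```

With a small enough η the trajectory tracks the batch solution closely; with a larger η it reaches the vicinity of the minimum faster but keeps oscillating around it, as noted below.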
ADALINE – Incremental Algorithms
- Incremental gradient descent is also known as stochastic gradient descent.
- It is also called the LMS algorithm or the Delta Rule.
- Because it is based on an approximation of the gradient, it never goes exactly to the minimum of the cost function.
- After reaching a vicinity of the minimum, it oscillates around it.
- The amplitude of the oscillations can be reduced by reducing η.
ADALINE - Comparison
- The plots show the value of one weight over time: the batch version (horizontal axis in epochs) and the incremental version (horizontal axis in patterns).
- 1 epoch = P patterns (the full training set).
ADALINE vs Perceptron
- Both the ADALINE Delta Rule and the Perceptron weight update rule are instances of Error Correction Learning.

- ADALINE Delta Rule:

  $w_k^{(t+1)} = w_k^{(t)} - 2\eta\, x_k^p (s^p - d^p)$

- Perceptron update rule:

  $w_k^{(t+1)} = w_k^{(t)} - \eta\, x_k^p (y^p - d^p)$

- The ADALINE allows arbitrary real values at the output, whereas the perceptron assumes binary outputs.
- The ADALINE always converges (given a small enough η) to the minimum squared error, while the perceptron only converges when the data is separable.
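
The two update rules can be put side by side in code. This is a hedged sketch: the per-pattern function signatures and the {−1, +1} target encoding for the perceptron are assumptions of mine, not given on the slides.

```python
import numpy as np

def delta_rule_step(w, x, d, eta):
    """ADALINE Delta Rule: error taken at the linear output s = w^T x."""
    s = w @ x
    return w - 2 * eta * x * (s - d)

def perceptron_step(w, x, d, eta):
    """Perceptron rule: error taken at the thresholded output y = sign(w^T x).
    Assumes targets d in {-1, +1} (an assumption, not from the slides)."""
    y = np.sign(w @ x)
    return w - eta * x * (y - d)
```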
ADALINE – Statistical Interpretation
- The analytical solution for the weights was obtained by "averaging" quantities computed from the training set.
- It is possible to make a statistical interpretation of the process:
  - Inputs: observations of N random variables, X = [1, X_1, ..., X_k, ..., X_N]
  - Desired output: observations of one random variable, D
  - Output: observations of one random variable, Y = w^T X
ADALINE – Statistical Interpretation
- The error function can be interpreted as an approximation to the statistical expectation E[·]:

  $E(\vec{w}) = \frac{1}{P}\sum_{p=1}^{P} (y^p - d^p)^2 \approx E\!\left[(Y - D)^2\right]$

- Solution:

  $\vec{w}^* = R_{xx}^{-1}\, \vec{p}$

- The matrix $R_{xx}$ and the vector $\vec{p}$ can be interpreted as approximations to the statistical auto-covariance and cross-covariance of the random variables:

  $R_{xx} = \frac{1}{P}\sum_{p=1}^{P} \vec{x}^p \vec{x}^{pT} \approx E[X X^T]$

  $\vec{p} = \frac{1}{P}\sum_{p=1}^{P} \vec{x}^p d^p \approx E[X D]$
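
A small Monte Carlo sketch of this interpretation (the generative model and all names below are illustrative assumptions, not from the slides): with many patterns, the sample averages $R_{xx}$ and $\vec{p}$ approach $E[XX^T]$ and $E[XD]$, so the ADALINE solution $\vec{w}^* = R_{xx}^{-1}\vec{p}$ approaches the coefficients of the underlying linear model.

```python
import numpy as np

rng = np.random.default_rng(4)
P, N = 100_000, 2

# Assumed generative model: D = a^T X + noise, with X = [1, X_1, X_2].
a = np.array([0.7, -1.2, 0.4])
X = np.hstack([np.ones((P, 1)), rng.normal(size=(P, N))])
D = X @ a + 0.1 * rng.normal(size=P)

# Sample estimates of E[X X^T] and E[X D].
R_xx = X.T @ X / P
p_vec = X.T @ D / P

# w* = R_xx^{-1} p recovers (approximately) the underlying coefficients a.
print(np.round(np.linalg.solve(R_xx, p_vec), 3))   # close to [0.7, -1.2, 0.4]
```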
ADALINE – Statistical Interpretation
- The LMS algorithm is based on an instantaneous estimate of the gradient.
- This estimate can be modeled by:

  $\hat{g}(n) = g(n) + e_g(n)$

  where $e_g(n)$ is a random noise vector.
- LMS = stochastic gradient descent.
ADALINE – Statistical Interpretation
- Under reasonable conditions, stochastic gradient methods may converge to the exact solution.
- Convergence conditions [Robbins-Monro; Ljung]:
  - $e_g(n)$ is zero mean;
  - the pattern sequence is random;
  - η(n) tends slowly toward zero:

    $\sum_{n=0}^{\infty} \eta(n)^2 < \infty, \qquad \sum_{n=0}^{\infty} \eta(n) = \infty$
ADALINE – Statistical Interpretation
- Typical learning rate schedules:

  $\eta(n) = \frac{c}{n}$

  $\eta(n) = \frac{\eta_0}{1 + n/\tau}$

[Plots: the two schedules η(n) decreasing with n, shown for n between 0 and 1000 and η between 0 and 1.]
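
As a worked check (a standard result, not stated on the slides), the first schedule with $c > 0$, summed from $n = 1$, satisfies both convergence conditions: the harmonic series diverges while the series of its squares converges.

```latex
\sum_{n=1}^{\infty} \eta(n) = c \sum_{n=1}^{\infty} \frac{1}{n} = \infty ,
\qquad
\sum_{n=1}^{\infty} \eta(n)^2 = c^2 \sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{c^2 \pi^2}{6} < \infty .
```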