TRANSCRIPT
07/10/06 EC4460.SuFy06/MPF 1
VI. Backpropagation Neural Networks (BPNN)
• Review of Adaline
• Newton's method
• Backpropagation algorithm
  – definition
  – derivative computation
  – weight/bias computation
  – function approximation example
  – network generalization issues
  – potential problems with the BPNN
  – momentum filter
  – iteration schemes review
• Generalization
• Implementation issues
  – regularization
  – early stopping

References: [Hagan], [Mathworks]
NN FAQ at ftp://ftp.sas.com/pub/neural/FAQ.html
http://www-stat.stanford.edu/%7Etibs/stat315a.html
Pattern Classification, Duda & Hart, Wiley, 2001
• Recall the Adaline (LMS) network:

  Linear neuron, linear activation function:

    a = purelin(Wp + b) = Wp + b

  with input p (R x 1), weight matrix W (S x R), bias b (S x 1), and
  output a (S x 1).

• Restriction of the Adaline (LMS)?

• Problem solved with Adaline/LMS:

  Given a set {p_i, t_i}, find the weights and bias which minimize the
  mean square error E[(t - a)^2].

  For a single neuron, stack the parameters and the input as

    x = [w; b],   z = [p; 1]

  so that a = x^T z and the performance index is quadratic:

    F(x) = E[e^2] = E[(t - x^T z)^2] = x^T R x - 2 x^T h + c
  Steepest descent on F(x), with e = t - a:

    x_{k+1} = x_k - α ∇F(x_k)

• Practical application:
  – solving for the minimum of F(x) directly requires computing R, h, and R^{-1}
  – alternative: solve the problem iteratively using steepest descent only

  LMS iteration:
    pick x(0)
    a = x_k^T z_k
    e = t - a
    x_{k+1} = x_k + 2 α e z_k
    k = k + 1
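The LMS iteration above can be sketched for a single linear neuron. The toy data set, step size, and epoch count below are illustrative assumptions; the update lines follow the slide.

```python
def lms_fit(samples, alpha=0.1, epochs=100):
    """Adaline/LMS sketch: a = purelin(w*p + b), e = t - a,
    w <- w + 2*alpha*e*p, b <- b + 2*alpha*e, starting from x(0) = 0."""
    w = b = 0.0
    for _ in range(epochs):
        for p, t in samples:
            e = t - (w * p + b)          # error e = t - a
            w += 2 * alpha * e * p       # LMS weight update
            b += 2 * alpha * e           # LMS bias update
    return w, b

# Toy data drawn from t = 2p + 1 (realizable, so LMS converges to it)
w, b = lms_fit([(-1.0, -1.0), (0.0, 1.0), (1.0, 3.0)])
```

Because the data are generated by a linear map, the iteration drives the error to zero and recovers the generating weight and bias.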
• Extensions ==> Multilayer perceptron
• Why use multilayer structures?

  [Figure: two classes made up of subclasses; K = 9 subclasses, M = 2 classes]
• Example: pattern classification: the XOR gate

    {p1 = [0; 0], t1 = 0},  {p2 = [0; 1], t2 = 1},
    {p3 = [1; 0], t3 = 1},  {p4 = [1; 1], t4 = 0}

  − Can it be solved with a single-layer perceptron?
    No: the two classes are not linearly separable.
NN block diagram:

Note: the final network space partitioning varies as a function of the
number of neurons in the hidden layer.
Example:

  [Figure: two-input network with three hidden neurons; first-layer weights
  1, -0.5, 0.5, 2, -2, -1; biases b1, b2, b3; outputs y1, y2, y3]

• Assume b1 = 0.5, b2 = 2, b3 = 1
• Plot the decision boundaries obtained assuming hard-limit (HL) activation
  functions are used
• Derive the weight matrix and bias vector used for this network
• Design the NN second layer (following the given in-class guidelines,
  i.e., identify its weight matrix and bias)
• Example: multilayer perceptron (classification)
Assume dark = 1
• Example: function approximation

  1-2-1 network: log-sigmoid hidden layer, linear output layer.

    f^1(n) = 1 / (1 + e^{-n}),   f^2(n) = n

    a1 = logsig(W1 p + b1),   a2 = purelin(W2 a1 + b2)

  Nominal parameter values:

    w^1_{1,1} = 10,  w^1_{2,1} = 10,  b^1_1 = -10,  b^1_2 = 10
    w^2_{1,1} = 1,   w^2_{1,2} = 1,   b^2 = 0
  [Figures: example function approximation network; nominal response of the
  network above over p ∈ [-2, 2]; effect of parameter changes on the
  network response as b^2, b^{(1)}, w^{(2)}_{1,2}, and w^{(1)}_{1,2} are
  varied over ranges of the form -1 <= b^2 <= 1, -1 <= w^{(2)}_{1,2} <= 1,
  -1 <= w^{(1)}_{1,2} <= 1]
• Backpropagation algorithm:

• Goal: given a set of {p_i, t_i}, find the weights and biases which
  minimize the mean square error (performance surface)

    F(x) = E[(t - a)^2]

  Three-layer network (input dimension R; layer sizes S1, S2, S3):

    a1 = f1(W1 p + b1),   a2 = f2(W2 a1 + b2),   a3 = f3(W3 a2 + b3)

  i.e., a3 = f3(W3 f2(W2 f1(W1 p + b1) + b2) + b3)

  Discard the expectation operator and approximate F(x) by the squared
  error at iteration k:

    F_hat(x) = (t_k - a_k)^T (t_k - a_k)
• For a one-layer network with a purelin activation function, this reduces
  to the LMS updates:

    w_{k+1} = w_k + 2 α e_k p_k^T
    b_{k+1} = b_k + 2 α e_k
• How to compute the derivatives? → use steepest descent (SD)

  Recall:

    w^m_{i,j}(k+1) = w^m_{i,j}(k) - α (∂F/∂w^m_{i,j})
    b^m_i(k+1)     = b^m_i(k)     - α (∂F/∂b^m_i)

• Note: F(x) may not be expressed directly in terms of w^1_{i,j},
  w^2_{i,j}, etc. We need to use the chain rule:

    d f(n(w)) / dw = (df/dn) × (dn/dw)

• Example:

    f(n) = e^n,   n(w) = 5w + 1
    => d f(n(w)) / dw = e^{5w+1} × 5
  Applying the chain rule:

    ∂F/∂w^m_{i,j} = (∂F/∂n^m_i) × (∂n^m_i/∂w^m_{i,j})
    ∂F/∂b^m_i     = (∂F/∂n^m_i) × (∂n^m_i/∂b^m_i)

  where the net input to neuron i of layer m is

    n^m_i = sum_j w^m_{i,j} a^{m-1}_j + b^m_i

  so that

    ∂n^m_i/∂w^m_{i,j} = a^{m-1}_j,    ∂n^m_i/∂b^m_i = 1

  Notation: w^m_{i,j} is the layer-m weight associated with neuron i and
  input j; n^m_i is the net input associated with neuron i.

  The updates become:

    w^m_{i,j,k+1} = w^m_{i,j,k} - α (∂F/∂n^m_i) a^{m-1}_j
    b^m_{i,k+1}   = b^m_{i,k}   - α (∂F/∂n^m_i)

  Define the sensitivity s^m_i = ∂F/∂n^m_i: the sensitivity of F(.) to
  changes in the i-th neuron's net input at layer m.
Expressing Weight/Bias Updates in Matrix Form

  Element form:

    w^m_{i,j}(k+1) = w^m_{i,j}(k) - α (∂F/∂n^m_i) a^{m-1}_j
    b^m_i(k+1)     = b^m_i(k)     - α (∂F/∂n^m_i)

  Each row of the weight matrix is associated with one neuron:

    W = [ w_11  w_12  ...  w_1R ]    (row 1: neuron 1)
        [ w_21  w_22  ...  w_2R ]    (row 2: neuron 2)

  Collecting the sensitivities into the vector s^m = ∂F/∂n^m, the updates
  for a two-neuron layer read, element by element,

    w^m_{11,k+1} = w^m_{11,k} - α (∂F/∂n^m_1) a^{m-1}_1    (input j = 1, neuron i = 1)
    w^m_{12,k+1} = w^m_{12,k} - α (∂F/∂n^m_1) a^{m-1}_2    (input j = 2, neuron i = 1)
    w^m_{21,k+1} = w^m_{21,k} - α (∂F/∂n^m_2) a^{m-1}_1    (input j = 1, neuron i = 2)

  or, in matrix form,

    W^m_{k+1} = W^m_k - α s^m (a^{m-1})^T
  Collecting both updates in matrix form, for every layer m:

    W^m_{k+1} = W^m_k - α s^m (a^{m-1})^T
    b^m_{k+1} = b^m_k - α s^m
• Computing the sensitivities ∂F/∂n^m_i: we need to use the chain rule
  again. This will involve terms of the form

    (∂F/∂n^{m+1}_j) × (∂n^{m+1}_j/∂n^m_i)

  Define the Jacobian matrix (shown for two neurons per layer):

    ∂n^{m+1}/∂n^m = [ ∂n^{m+1}_1/∂n^m_1   ∂n^{m+1}_1/∂n^m_2 ]
                    [ ∂n^{m+1}_2/∂n^m_1   ∂n^{m+1}_2/∂n^m_2 ]

  A typical element:

    ∂n^{m+1}_i/∂n^m_j = ∂[ sum_l w^{m+1}_{i,l} a^m_l + b^{m+1}_i ] / ∂n^m_j
                      = w^{m+1}_{i,j} (∂a^m_j/∂n^m_j)
                      = w^{m+1}_{i,j} f^m'(n^m_j)

  Hence

    ∂n^{m+1}/∂n^m = W^{m+1} Fdot^m(n^m)

  where Fdot^m(n^m) is the diagonal matrix

    Fdot^m(n^m) = [ f^m'(n^m_1)      0       ]
                  [     0        f^m'(n^m_2) ]
  Define the sensitivity vector

    s^m = ∂F/∂n^m = [ ∂F/∂n^m_1 ; ∂F/∂n^m_2 ]

  (n^m_1: net input of the first neuron; n^m_2: of the second neuron).

  Next, apply the chain rule for vectors:

    ∂F/∂n^m_1 = (∂n^{m+1}_1/∂n^m_1)(∂F/∂n^{m+1}_1)
              + (∂n^{m+1}_2/∂n^m_1)(∂F/∂n^{m+1}_2)        (1)

    ∂F/∂n^m_2 = (∂n^{m+1}_1/∂n^m_2)(∂F/∂n^{m+1}_1)
              + (∂n^{m+1}_2/∂n^m_2)(∂F/∂n^{m+1}_2)        (2)

  ((1): sensitivity of F to a change in the 1st element of the net input at
  layer m; (2): the same for the 2nd element.)

  In matrix form, (1) & (2) give the sensitivity recursion:

    s^m = (∂n^{m+1}/∂n^m)^T (∂F/∂n^{m+1})
        = (W^{m+1} Fdot^m(n^m))^T s^{m+1}
        = Fdot^m(n^m) (W^{m+1})^T s^{m+1}
• We need to start the recursion by computing s^M at the last layer:

    s^M_i = ∂F/∂n^M_i = ∂[ sum_j (t_j - a_j)^2 ] / ∂n^M_i
          = -2 (t_i - a_i) (∂a^M_i/∂n^M_i)
          = -2 (t_i - a_i) f^M'(n^M_i)

  In matrix form:

    s^M = -2 Fdot^M(n^M) (t - a)

• Note: a = f(n)
Summary:

  Start:       a^0 = p
  Propagate:   a^{m+1} = f^{m+1}(W^{m+1} a^m + b^{m+1}),  m = 0, ..., M-1
               a = a^M
  Compute:     s^M = -2 Fdot^M(n^M) (t - a)
               s^m = Fdot^m(n^m) (W^{m+1})^T s^{m+1},  m = M-1, ..., 1
  Update:      W^m(k+1) = W^m(k) - α s^m (a^{m-1})^T
               b^m(k+1) = b^m(k) - α s^m

  Note: we will need derivatives for all activation functions.
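A minimal sketch of this summary for a 1-2-1 logsig/purelin network, with the gradients checked against finite differences. The helper names are assumptions; the demonstration weights reuse the nominal values from the earlier function approximation example.

```python
import numpy as np

def logsig(n):
    return 1.0 / (1.0 + np.exp(-n))

def forward(W1, b1, W2, b2, p):
    a1 = logsig(W1 * p + b1)            # a1 = f1(W1 p + b1), shape (2, 1)
    a2 = W2 @ a1 + b2                   # a2 = purelin(W2 a1 + b2), shape (1, 1)
    return a1, a2

def backprop_grads(W1, b1, W2, b2, p, t):
    """Gradients of F = (t - a2)^2 via the sensitivity recursion above."""
    a1, a2 = forward(W1, b1, W2, b2, p)
    s2 = -2 * (t - a2)                  # s^M = -2 Fdot^M(n^M)(t - a); purelin => Fdot = 1
    s1 = (a1 * (1 - a1)) * (W2.T @ s2)  # s1 = Fdot1(n1) (W2)^T s2; logsig' = a(1 - a)
    return s1 * p, s1, s2 @ a1.T, s2    # dF/dW1, dF/db1, dF/dW2, dF/db2

# One steepest-descent step, W^m <- W^m - alpha * s^m * (a^{m-1})^T:
W1, b1 = np.array([[10.0], [10.0]]), np.array([[-10.0], [10.0]])
W2, b2 = np.array([[1.0, 1.0]]), np.array([[0.0]])
gW1, gb1, gW2, gb2 = backprop_grads(W1, b1, W2, b2, p=1.0, t=1.5)
alpha = 0.1
W1, b1, W2, b2 = W1 - alpha * gW1, b1 - alpha * gb1, W2 - alpha * gW2, b2 - alpha * gb2
```

The sensitivity recursion can be validated numerically by perturbing one parameter at a time and comparing the finite-difference slope of F with the analytic gradient.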
Example: function approximation

  1-2-1 network trained on the error e = t - a to approximate

    g(p) = 1 + sin(k π p / 4),   k = 1
  1-2-1 network: log-sigmoid hidden layer, linear output layer:

    a1 = logsig(W1 p + b1),   a2 = purelin(W2 a1 + b2)
  Initial conditions (example: Textbook, pp. 11-14):

    W^(1)(0) = [-0.2; -0.4],   b^(1)(0) = [-0.5; -0.1]
    W^(2)(0) = [0  0.1],       b^(2)(0) = 0.5

  [Figure: network response vs. the sine wave for the initial values]
What does the 1-2-1 network look like?
Example:

    g_i(p) = 1 + sin(i π p / 4),   p ∈ [-2, 2],   i = 1, 2, 4, 8

  Function approximation with

    f^(1)(n) = 1 / (1 + e^{-n}),   f^(2)(n) = n

  [Figure 11.10: function approximation using a 1-3-1 network for
  i = 1, 2, 4, 8]

  Convergence issues:

    g(p) = 1 + sin(6 π p / 4),   p ∈ [-2, 2]

  [Figure: effect of increasing the number of hidden neurons; responses of
  1-2-1, 1-3-1, 1-4-1, and 1-5-1 networks]
Network generalization issues

    g(p) = 1 + sin(π p / 4),   p ∈ [-2, 2]

  [Figure 11.13: convergence to a local minimum]
  [Figure 11.14: 1-2-1 network approximation of g(p)]
  [Figure 11.15: 1-9-1 network approximation of g(p)]
• Potential problems with backpropagation:
  − activation functions may be nonlinear
  − the performance surface is not unimodal
  − convergence may be sped up with a variable learning rate:
      increase the step size when the performance index is flat
      decrease the step size when the performance index is steep

  Possible strategy:
  • If the error increases by more than a pre-defined value (typically
    4-5%):
      the new weights are discarded
      the learning rate is decreased (multiplied by 0.7)
  • If the error increases by less than 4-5%: keep the new weights
  • If the error decreases: the learning rate is increased by 5%
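The strategy above can be sketched on a toy quadratic objective. The 4% threshold, 0.7 decrease factor, and 5% increase follow the slide; the objective, its gradient, the starting point, and the iteration count are illustrative assumptions, and the momentum-coefficient reset is omitted for brevity.

```python
def adapt_step(x, alpha, grad, loss, zeta=0.04, rho=0.7, eta=1.05):
    """One update of x with the variable-learning-rate rules above."""
    x_new = x - alpha * grad(x)
    if loss(x_new) > (1 + zeta) * loss(x):
        return x, alpha * rho          # error up > 4%: discard step, cut rate
    if loss(x_new) < loss(x):
        return x_new, alpha * eta      # error down: keep step, raise rate 5%
    return x_new, alpha                # small increase: keep step, rate unchanged

loss = lambda x: x * x                 # toy performance index
grad = lambda x: 2 * x
x, alpha = 5.0, 2.0                    # deliberately too-large initial step
for _ in range(50):
    x, alpha = adapt_step(x, alpha, grad, loss)
```

Starting with a step size that overshoots, the rule repeatedly discards diverging updates and shrinks the rate until steady progress resumes.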
• Convergence may also be sped up with a momentum filter.

• Introduce memory and low-pass (LP) filter behavior to update W^m, b^m.

  Define the filter (input s_k, output x_k):

    x_k = γ x_{k-1} + (1 - γ) s_k

  − filter response:

      X(z) = γ z^{-1} X(z) + (1 - γ) S(z)
      ==>  H(z) = X(z)/S(z) = (1 - γ) / (1 - γ z^{-1})

• Apply the above concept to the iteration equations. Recall

    W^m_{k+1} = W^m_k - α s^m (a^{m-1})^T
    b^m_{k+1} = b^m_k - α s^m

  With momentum, the updates become

    ΔW^m_k = γ ΔW^m_{k-1} - (1 - γ) α s^m (a^{m-1})^T,   W^m_{k+1} = W^m_k + ΔW^m_k
    Δb^m_k = γ Δb^m_{k-1} - (1 - γ) α s^m,               b^m_{k+1} = b^m_k + Δb^m_k
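The filter x_k = γ x_{k-1} + (1 - γ) s_k is a first-order low-pass filter; a minimal sketch of its smoothing effect, where the oscillating input sequence is an illustrative assumption and γ = 0.8 matches the value used in the MOBP trajectory later:

```python
def momentum_filter(steps, gamma=0.8):
    """Apply x_k = gamma*x_{k-1} + (1 - gamma)*s_k to a sequence of updates."""
    out, x = [], 0.0
    for s in steps:
        x = gamma * x + (1 - gamma) * s   # low-pass filter on the raw updates
        out.append(x)
    return out

smoothed = momentum_filter([1.0, -1.0] * 10)  # oscillating raw updates
```

Sign-alternating updates (oscillation across a narrow valley) are strongly attenuated, while a constant update passes through unchanged in steady state.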
• Iteration Techniques

  − General iteration:

      x_{k+1} = x_k + α_k p_k,   where p_k is selected so that
      F(x_{k+1}) < F(x_k)

  − Use a Taylor series expansion:

    1. First-order expansion:

         F(x_{k+1}) = F(x_k + Δx_k) ≈ F(x_k) + ∇F(x)^T|_{x=x_k} Δx_k

       Leads to the steepest-descent (SD) scheme:

         x_{k+1} = x_k - α ∇F(x)|_{x=x_k}

    2. Second-order expansion:

         F(x_{k+1}) = F(x_k + Δx_k)
                    ≈ F(x_k) + ∇F(x)^T|_{x=x_k} Δx_k + (1/2) Δx_k^T A_k Δx_k,
         with A_k = ∇²F(x)|_{x=x_k}

       Leads to Newton's scheme:

         x_{k+1} = x_k - A_k^{-1} ∇F(x)|_{x=x_k}

  Recall: potential problems with the Newton scheme (Hessian, gradient,
  convergence).
• Levenberg-Marquardt Algorithm

  Designed to speed up the convergence of Newton's method by reducing the
  computational load.

  Recall:  F(x) = V(x)^T V(x) = sum_i v_i^2

  Assume:
    1. ∇F(x)  = 2 J^T(x) V(x)
    2. ∇²F(x) = 2 J^T(x) J(x) + 2 S(x)

  Newton step:  x_{k+1} = x_k - [∇²F(x)]^{-1} ∇F(x)|_{x=x_k}

  LM step:      x_{k+1} = x_k - [J^T(x_k) J(x_k) + μ_k I]^{-1} J^T(x_k) V(x_k)

• General guidelines for μ_k:
  start with μ_k = 0.01; if F(x) doesn't decrease, repeat with μ_k = 10 μ_k.
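One LM step as written above, with the μ schedule from the guideline (start at 0.01, multiply by 10 until F decreases). The one-parameter exponential-fit residual, starting point, and iteration count are illustrative assumptions.

```python
import numpy as np

def lm_step(x, residual, jacobian, mu=0.01, max_tries=20):
    """x <- x - [J^T J + mu I]^{-1} J^T v, raising mu until F decreases."""
    v, J = residual(x), jacobian(x)
    F = float(v @ v)                    # F(x) = V(x)^T V(x)
    for _ in range(max_tries):
        dx = np.linalg.solve(J.T @ J + mu * np.eye(len(x)), J.T @ v)
        x_new = x - dx
        if float(residual(x_new) @ residual(x_new)) < F:
            return x_new, mu            # F decreased: accept the step
        mu *= 10.0                      # F did not decrease: raise mu, retry
    return x, mu

t = np.linspace(0.0, 1.0, 5)
y = np.exp(0.5 * t)                     # data generated with c = 0.5
residual = lambda x: np.exp(x[0] * t) - y
jacobian = lambda x: (t * np.exp(x[0] * t)).reshape(-1, 1)
x = np.array([2.0])
for _ in range(20):
    x, _ = lm_step(x, residual, jacobian)
```

For small μ the step approaches Gauss-Newton; for large μ it approaches a small steepest-descent step, which is what makes the schedule robust far from the solution.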
• Squared error surface as a function of the weight values w^1_{1,1} and
  w^2_{1,1}

  [Figure 12.3: surface and contour plots of the squared error]
• Squared error surface as a function of the weight values w^1_{1,1} and
  w^2_{1,1} (second example)

  [Figure: surface and contour plots of the squared error]
  [Figure 12.6: two SDBP (batch mode) trajectories in the
  (w^1_{1,1}, w^2_{1,1}) plane]
  [Figure 12.8: trajectory with the learning rate too large]
Momentum Backpropagation

  Steepest Descent Backpropagation (SDBP) vs. Momentum Backpropagation
  (MOBP), with γ = 0.8:

  [Figure: SDBP and MOBP trajectories in the (w^1_{1,1}, w^2_{1,1}) plane]

  Standard updates:

    W^m_{k+1} = W^m_k - α s^m (a^{m-1})^T
    b^m_{k+1} = b^m_k - α s^m

  Momentum updates:

    ΔW^m_k = γ ΔW^m_{k-1} - (1 - γ) α s^m (a^{m-1})^T,   W^m_{k+1} = W^m_k + ΔW^m_k
    Δb^m_k = γ Δb^m_{k-1} - (1 - γ) α s^m,               b^m_{k+1} = b^m_k + Δb^m_k
Variable Learning Rate

• If the squared error (over the entire training set) increases by more
  than some set percentage ζ after a weight update, then the weight update
  is discarded, the learning rate is multiplied by some factor
  (0 < ρ < 1), and the momentum coefficient γ is set to zero.

• If the squared error decreases after a weight update, then the weight
  update is accepted and the learning rate is multiplied by some factor
  η > 1. If γ has been previously set to zero, it is reset to its original
  value.

• If the squared error increases by less than ζ, then the weight update is
  accepted, but the learning rate and the momentum coefficient are
  unchanged.
  [Figure 12.11: variable learning rate trajectory in the
  (w^1_{1,1}, w^2_{1,1}) plane, with η = 1.05, ρ = 0.7 (damping factor for
  the learning rate), ζ = 4% (error threshold); squared error and learning
  rate vs. iteration number]
Conjugate Gradient

  1. The first search direction is the steepest descent:

       p_0 = -g_0,   where g_k ≡ ∇F(x)|_{x=x_k}

  2. Take a step, choosing the learning rate to minimize the function
     along the search direction:

       x_{k+1} = x_k + α_k p_k

  3. Select the next search direction according to:

       p_k = -g_k + β_k p_{k-1}

     where

       β_k = (Δg_{k-1}^T g_k) / (Δg_{k-1}^T p_{k-1})    (Hestenes-Stiefel update)
     or
       β_k = (g_k^T g_k) / (g_{k-1}^T g_{k-1})          (Fletcher-Reeves update)
     or
       β_k = (Δg_{k-1}^T g_k) / (g_{k-1}^T g_{k-1})     (Polak-Ribiere update)
  [Figure: conjugate gradient trajectory in the (w^1_{1,1}, w^2_{1,1})
  plane]
  [Figure: Levenberg-Marquardt trajectory in the (w^1_{1,1}, w^2_{1,1})
  plane]
• Resilient Backpropagation Network

  • BPNNs usually use sigmoid functions (tansig, logsig) as activation
    functions to introduce nonlinear behavior.
  • These can cause the network to have very small gradients and the
    iterations to (almost) stall.
  • Resilient BPNN uses
    − only the signs of the gradient components to determine the direction
      of the weight update,
    − weight change magnitudes determined by a separate update value.
Algorithm Comparisons

• It is very difficult to know which training algorithm will be the
  fastest for a given problem.
• Convergence speed depends on many factors:
  − complexity of the problem,
  − number of data points in the training set,
  − number of weights and biases in the network,
  − error goal,
  − whether the network is being used for pattern recognition
    (discriminant analysis) or function approximation (regression),
  − etc.
Toy Example 1: Sinusoid function approximation

Network set-up: 1-5-1; activation functions (tansig, purelin)
Number of trials: 30, with random initial weights and biases
Error threshold: MSE < 0.002
(Sun Sparc 2 workstation)

Algorithm   Mean Time(s)   Ratio   Min. Time(s)   Max. Time(s)   Std.(s)
LM          1.14           1.00    0.65           1.83           0.38
BFG         5.22           4.58    3.17           14.38          2.08
RP          5.67           4.97    2.66           17.24          3.72
SCG         6.09           5.34    3.18           23.64          3.81
CGB         6.61           5.80    2.99           23.65          3.67
CGF         7.86           6.89    3.57           31.23          4.76
CGP         8.24           7.23    4.07           32.32          5.03
OSS         9.64           8.46    3.97           59.63          9.79
GDX         27.69          24.29   17.21          258.15         43.65

Algorithm acronyms:
LM (trainlm) - Levenberg-Marquardt
BFG (trainbfg) - BFGS Quasi-Newton
RP (trainrp) - Resilient Backpropagation
SCG (trainscg) - Scaled Conjugate Gradient
CGB (traincgb) - Conjugate Gradient with Powell/Beale Restarts
CGF (traincgf) - Fletcher-Powell Conjugate Gradient
CGP (traincgp) - Polak-Ribiere Conjugate Gradient
OSS (trainoss) - One-Step Secant
GDX (traingdx) - Variable Learning Rate Backpropagation
Example 2: function approximation (nonlinear regression) - Engine data set

Network set-up: 2-30-2
Network inputs: engine speed and fueling levels
Network outputs: torque and emission levels
Activation functions (tansig, purelin)
Number of trials: 30, with random initial weights and biases
Error threshold: MSE < 0.005
(Sun Enterprise 4000 workstation)

Algorithm   Mean Time(s)   Ratio   Min. Time(s)   Max. Time(s)   Std.(s)
LM          18.45          1.00    12.01          30.03          4.27
BFG         27.12          1.47    16.42          47.36          5.95
RP          36.02          1.95    19.39          52.45          7.78
SCG         37.93          2.06    18.89          50.34          6.12
CGB         39.93          2.16    23.33          55.42          7.50
CGF         44.30          2.40    24.99          71.55          9.89
CGP         48.71          2.64    23.51          80.90          12.33
OSS         65.91          3.57    31.83          134.31         34.24
GDX         188.50         10.22   81.59          279.90         66.67

(Algorithm acronyms as in Toy Example 1.)
Example 3: Pattern recognition - Cancer data set

Network set-up: 9-5-5-2
Network inputs: clump thickness, uniformity of cell size and cell shape,
amount of marginal adhesion, frequency of bare nuclei
Network outputs: benign or malignant tumor
Activation functions (tansig in all layers)
Number of trials: 30, with random initial weights and biases
Error threshold: MSE < 0.012
(Sun Sparc 2 workstation)

Algorithm   Mean Time(s)   Ratio   Min. Time(s)   Max. Time(s)   Std.(s)
CGB         80.27          1.00    55.07          102.31         13.17
RP          83.41          1.04    59.51          109.39         13.44
SCG         86.58          1.08    41.21          112.19         18.25
CGP         87.70          1.09    56.35          116.37         18.03
CGF         110.05         1.37    63.33          171.53         30.13
LM          110.33         1.37    58.94          201.07         38.20
BFG         209.60         2.61    118.92         318.18         58.44
GDX         313.22         3.90    166.48         446.43         75.44
OSS         463.87         5.78    250.62         599.99         97.35

(Algorithm acronyms as in Toy Example 1.)
Other examples are available at
http://www.mathworks.com/access/helpdesk/help/toolbox/nnet/backpr14.shtml
Several algorithm characteristics can be deduced from these experiments:
• In general, on function approximation problems, for networks that contain up to a fewhundred weights, the LM algorithm will have the fastest convergence. This advantage is especially noticeable if very accurate training is required.
• In many cases, trainlm is able to obtain lower mean square errors than any of the otheralgorithms tested.
• However, as the number of weights in the network increases, the advantage of trainlm decreases. In addition, trainlm performance is relatively poor on pattern recognition problems. The storage requirements of trainlm are larger than those of the other algorithms tested. By adjusting the mem_reduc parameter, discussed earlier, the storage requirements can be reduced, but at a cost of increased execution time.
• The trainrp function is the fastest algorithm on pattern recognition problems. However, it does not perform well on function approximation problems. Itsperformance also degrades as the error goal is reduced. The memory requirementsfor this algorithm are relatively small in comparison to the other algorithms considered.
• The conjugate gradient algorithms, in particular trainscg, seem to perform well over awide variety of problems, particularly for networks with a large number of weights.
The SCG algorithm is almost as fast as the LM algorithm on function approximationproblems (faster for large networks) and is almost as fast as trainrp on patternrecognition problems. Its performance does not degrade as quickly as trainrpperformance does when the error is reduced. The conjugate gradient algorithms haverelatively modest memory requirements.
• The trainbfg performance is similar to that of trainlm. It does not require as muchstorage as trainlm, but the computation required does increase geometrically with the size of the network, since the equivalent of a matrix inverse must be computed at each iteration.
• The variable learning rate algorithm traingdx is usually much slower than the othermethods, and has about the same storage requirements as trainrp, but it can still beuseful for some problems. There are certain situations in which it is better to convergemore slowly. For example, when using early stopping, you may have inconsistentresults if you use an algorithm that converges too quickly. You may overshoot the point at which the error on the validation set is minimized.
EXPERIMENT CONCLUSIONS
Generalization Issues

• The network may be overtrained (overfitting) when the MSE goal on the
  training set is set too low.
  Potential risk: the network memorizes the training examples, but doesn't
  learn to generalize to similar but new situations.
• Consequences:
  − very good performance on the training set,
  − very poor performance on the testing set.

  [Figure: 1-20-1 net fit to a noisy sine]

• How to prevent overfitting?
  • Use a network that is not too large for the problem
    (the appropriate network size is difficult to guess a priori)
  • Increase the training set size if possible
  • Apply
    − regularization
    − early stopping
• Regularization

  − Recall the basic performance (MSE) function is defined as:

      MSE = (1/N) sum_{i=1}^{N} e_i^2 = (1/N) sum_{i=1}^{N} (t_i - a_i)^2

  − The performance function is modified as:

      MSE_reg = γ MSE + (1 - γ) MSW,

    where MSW = (1/N) sum_{i=1}^{P} w_i^2 and γ is the performance ratio.

  − Consequences: MSE_reg forces the network
    • to have smaller weights and biases,
    • to produce a smoother response,
    • to be less likely to overfit.

  − Drawbacks:
    • difficult to estimate γ:
        γ too large → overfitting problem
        γ too small → poor fit of the training data
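The modified index can be sketched directly. The toy errors, weights, and γ value are illustrative assumptions; here each mean is taken over its own list, a common convention.

```python
def mse_reg(errors, weights, gamma):
    """Regularized index: gamma*MSE + (1 - gamma)*MSW."""
    mse = sum(e * e for e in errors) / len(errors)    # mean squared error
    msw = sum(w * w for w in weights) / len(weights)  # mean squared weight
    return gamma * mse + (1 - gamma) * msw

J = mse_reg(errors=[0.1, -0.2], weights=[3.0, -4.0, 1.0], gamma=0.9)
```

Even with γ close to 1, large weights dominate the index, pushing training toward smaller weights and a smoother response.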
• Automated Regularization (MATLAB: trainbr)

  Definition:
  − Assume the weights and biases are random variables with specific
    distributions.
  − Define the new performance function as: MSE_aut = α MSE + β MSW
  − Apply statistical concepts (Bayes' rule) to find optimum values for
    α and β (iterative procedure).

  [Figure: basic MSE fit vs. MSE_aut fit]
• Early Stopping (MATLAB: train with option "val")

  Definition: the training set is split into two sets:
  − training subset: used to compute the network weights and biases
  − validation subset: the error on the validation subset is monitored
    during training; the validation error
      goes down at training onset,
      goes back up when the network starts to overfit the data.
  − Training is continued until the validation error increases for a
    specified number of iterations.
  − The final weights & biases are those obtained for the minimum
    validation error.

  [Figure: basic MSE fit vs. early-stopping MSE fit]
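The stopping rule above can be sketched with a synthetic validation-error sequence standing in for a real training loop (an assumption); the patience parameter is the "specified number of iterations".

```python
def early_stop(val_errors, patience=3):
    """Track the minimum validation error; stop after `patience`
    consecutive epochs without a new minimum."""
    best_i, best_e, worse = 0, float("inf"), 0
    for i, e in enumerate(val_errors):
        if e < best_e:
            best_i, best_e, worse = i, e, 0   # new minimum: remember weights
        else:
            worse += 1
            if worse >= patience:
                break                          # validation error keeps rising
    return best_i, best_e  # epoch index / error of the weights to keep

# Validation error goes down, then back up as the network starts to overfit:
idx, err = early_stop([0.9, 0.5, 0.3, 0.25, 0.31, 0.4, 0.55, 0.7])
```

The returned index marks the epoch whose weights and biases would be kept as the final network.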
• Both regularization and early stopping can ensure network generalization
  when properly applied.
• When using Bayesian regularization, it is important to train the network
  until it reaches convergence. The MSE, MSW, and the effective number of
  parameters should reach constant values when the network has converged.
• For early stopping, be careful not to use an algorithm that converges
  too rapidly. If you are using a fast algorithm (like trainlm), set the
  training parameters so that the convergence is relatively slow (e.g.,
  set mu to a relatively large value, such as 1, and set mu_dec and mu_inc
  to values close to 1, such as 0.8 and 1.5, respectively). The training
  functions trainscg and trainrp usually work well with early stopping.
• With early stopping, the choice of the validation set is also important.
  The validation set should be representative of all points in the
  training set.
• With both regularization and early stopping, it is a good idea to train
  the network starting from several different initial conditions. It is
  possible for either method to fail in certain circumstances. By testing
  several different initial conditions, you can verify robust network
  performance.
• Based on our (MathWorks) experience, Bayesian regularization generally
  provides better generalization performance than early stopping when
  training function approximation networks. This is because Bayesian
  regularization does not require that a validation data set be separated
  out of the training data set; it uses all of the data. This advantage is
  especially noticeable when the size of the data set is small.
(MATHWORKS) CONCLUSIONS
Early Stopping/Validation discussion

Data Set Title   No. pts.   Network   Description
SINE (5% N)      41         1-15-1    Single-cycle sine wave with Gaussian noise at 5% level
SINE (2% N)      41         1-15-1    Single-cycle sine wave with Gaussian noise at 2% level
ENGINE (ALL)     1199       2-30-2    Engine sensor - full data set
ENGINE (1/4)     300        2-30-2    Engine sensor - 1/4 of data set

Mean squared test set error:

Method   Engine(All)   Engine(1/4)   Sine(5% N)   Sine(2% N)
ES       1.3e-2        1.9e-2        1.7e-1       1.3e-1
BR       2.6e-3        4.7e-3        3.0e-2       6.3e-3
ES/BR    5             4             5.7          21
• Some general design principles (from the NN FAQ)

  − Data encoding issues
  − Number of layers issues
  − Number of neurons per layer issues
  − Input variable standardization issues
  − Output variable standardization issues
  − Generalization error evaluation issues
• Data encoding issues (from the NN FAQ)

  [Figure: encoding example with class labels 1, 2, 3 and an unknown
  point X]
• Number of layers issues [NN FAQ]

  − You may not need any hidden layers at all. Linear and generalized
    linear models are useful in a wide variety of applications. And even
    if the function you want to learn is mildly nonlinear, you may get
    better generalization with a simple linear model than with a
    complicated nonlinear model if there is too little data or too much
    noise to estimate the nonlinearities accurately.
  − In MLPs with step/threshold/Heaviside activation functions, you need
    two hidden layers for full generality.
  − In MLPs with any of a wide variety of continuous nonlinear
    hidden-layer activation functions, one hidden layer with an
    arbitrarily large number of units suffices for the "universal
    approximation" property. But there is no theory yet to tell you how
    many hidden units are needed to approximate any given function.
• Number of neurons per layer issues [NN FAQ]

  − The best number of hidden units depends in a complex way on:
    • the numbers of input and output units
    • the number of training cases
    • the amount of noise in the targets
    • the complexity of the function or classification to be learned
    • the architecture
    • the type of hidden unit activation function
    • the training algorithm
    • regularization
  − In most situations, there is no way to determine the best number of
    hidden units without training several networks and estimating the
    generalization error of each. If you have too few hidden units, you
    will get high training error and high generalization error due to
    underfitting and high statistical bias. If you have too many hidden
    units, you may get low training error but still have high
    generalization error due to overfitting and high variance.
• Input variable standardization issues [NN FAQ]

  − An input's contribution depends on its variability relative to the
    other inputs.

    Example:
      Input 1 in range [-1, 1]
      Input 2 in range [0, 10000]
    Input 1's contribution will be swamped by Input 2's.

  − Scale the inputs so that their variability reflects their importance:
    * If importance is not known: scale all inputs to the same variability
      or the same range.
    * If importance is known: scale more important inputs so that they
      have larger variances/ranges.
  Standardizing input variables has different effects on different
  training algorithms for MLPs. For example:

  1) Steepest descent is very sensitive to scaling. The more
     ill-conditioned the Hessian is, the slower the convergence. Hence,
     scaling is an important consideration for gradient descent methods
     such as standard backpropagation.
  2) Quasi-Newton and conjugate gradient methods begin with a steepest
     descent step and are therefore scale sensitive. However, they
     accumulate second-order information as training proceeds and hence
     are less scale sensitive than pure gradient descent.
  3) Newton-Raphson and Gauss-Newton, if implemented correctly, are
     theoretically invariant under scale changes as long as none of the
     scaling is so extreme as to produce underflow or overflow.
  4) Levenberg-Marquardt is scale invariant as long as no ridging is
     required. There are several different ways to implement ridging;
     some are scale invariant and some are not. Performance under bad
     scaling will depend on details of the implementation.
• Output variable standardization issues [NN FAQ]
  − Target output value ranges should reflect possible neural network
    output values.

  − Standardizing target variables is typically more a convenience for
    getting good initial weights than a necessity. However, if you have
    two or more target variables and your error function is
    scale-sensitive, like the usual least (mean) squares error function,
    then the variability of each target relative to the others can affect
    how well the net learns that target. If one target has a range of 0
    to 1, while another target has a range of 0 to 10^6, the net will
    expend most of its effort learning the second target, to the possible
    exclusion of the first. So it is essential to rescale the targets so
    that their variability reflects their importance, or at least is not
    in inverse relation to their importance. If the targets are of equal
    importance, they should typically be standardized to the same range
    or the same standard deviation.

  − If the target variable does not have known upper and lower bounds, do
    not use an output activation function with a bounded range.
• Generalization error evaluation issues [NN FAQ]
  Three basic necessary (but not sufficient!) conditions for
  generalization:
  1) The network inputs must contain sufficient information pertaining to
     the target, so that there exists a mathematical function relating
     correct outputs to inputs with the desired degree of accuracy
     (neural nets are not clairvoyant!).
  2) The function relating inputs to correct outputs must be, in some
     sense, smooth, i.e., a small change in the inputs should, most of
     the time, produce a small change in the outputs. For continuous
     inputs and targets, smoothness of the function implies continuity
     and restrictions on the first derivative over most of the input
     space. Some neural nets can learn discontinuities as long as the
     function consists of a finite number of continuous pieces. Very
     nonsmooth functions, such as those produced by pseudo-random number
     generators and encryption algorithms, cannot be generalized by
     neural nets. Often a nonlinear transformation of the input space can
     increase the smoothness of the function and improve generalization.
  3) The training set must be a sufficiently large and representative
     subset of the set of all cases that you want to generalize to. The
     importance of this condition is related to the fact that there are,
     loosely speaking, two different types of generalization:
     interpolation and extrapolation. Interpolation applies to cases that
     are more or less surrounded by nearby training cases; everything
     else is extrapolation. In particular, cases that are outside the
     range of the training data require extrapolation. Cases inside large
     "holes" in the training data may also effectively require
     extrapolation. Interpolation can often be done reliably, but
     extrapolation is notoriously unreliable. Hence it is important to
     have sufficient training data to avoid the need for extrapolation.
• Cross-validation and bootstrapping schemes to evaluate generalization errors (and compare implementations)
These schemes are resampling methods: they estimate performance by repeatedly re-sampling the available data.
1) Cross-validation (Resampling without replacement)
− How does this work?
Split the data into k (~10) subsets of equal size.
Train the NN k times; each time:
leave one of the subsets out of the training;
test the NN on the omitted subset.
When k equals the sample size, this is "leave-one-out" cross-validation.
Overall accuracy is the mean of all testing-set accuracies.
− Recommended for small datasets
− Can be used to estimate model error or to compare different NN set-ups
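The splitting step above can be sketched as follows; this is a minimal sketch, and the `train_nn`/`evaluate` names in the usage comment are hypothetical placeholders, not functions from the course material:

```python
import random

def kfold_splits(n_samples, k=10, seed=0):
    """Return k (train_indices, test_indices) pairs for k-fold
    cross-validation: each sample is left out of training exactly
    once (resampling without replacement)."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal-size subsets
    splits = []
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        splits.append((train, test))
    return splits

# Train the NN k times, test on the omitted subset each time, then
# average the k testing-set accuracies (train_nn/evaluate are
# hypothetical placeholders):
# accs = [evaluate(train_nn(tr), te) for tr, te in kfold_splits(len(data))]
```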
2) Jackknife estimation
− Special case of cross-validation
− How does this work?
Split the data into subsets of size M−1 (for M data samples available).
Train the NN on each set; each time, test the NN on the single leave-one-out omitted sample (i.e., each testing set has only one sample).
Overall accuracy is mean of all testing set accuracies
− Recommended for small datasets
− Can be used to estimate model error or to compare different NN set-ups
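As a minimal sketch, the jackknife splits reduce to leave-one-out: M training sets of size M−1, each paired with the single omitted sample.

```python
def jackknife_splits(m):
    """Leave-one-out (jackknife) splits for M data samples: M training
    sets of size M-1, each paired with the single omitted sample."""
    return [([j for j in range(m) if j != i], [i]) for i in range(m)]
```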
[Bootstrap Methods and Permutation Tests, Hesterberg et al., W.H. Freeman and Company, 2003
http://www-stat.stanford.edu/%7Etibs/stat315a.html ]
3) Bootstrapping (Resampling with replacement)
− How does this work?
Select k (from 50 to 2000) subsets of the data, sampled with replacement.
Train the NN k times; each time:
train on one subset;
test on another subset.
Overall accuracy is the mean of all testing-set accuracies.
− Recommended for small datasets
− Expensive to implement. Seems to work better than cross-validation in many cases, but not always… in such cases it is not worth the investment
− Can be used to estimate model error or to compare different NN set-ups
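The resampling-with-replacement step can be sketched as below. Testing each replicate on its "out-of-bag" cases (those never drawn into the training resample) is one common way to realize the "train on one subset, test on another" scheme; that choice is an assumption here, since the slide does not fix how the test subset is formed.

```python
import random

def bootstrap_splits(n_samples, k=200, seed=0):
    """Draw k bootstrap training sets (sampled with replacement) and
    pair each with its out-of-bag test set (the samples never drawn
    into that training resample)."""
    rng = random.Random(seed)
    splits = []
    for _ in range(k):
        train = [rng.randrange(n_samples) for _ in range(n_samples)]
        test = sorted(set(range(n_samples)) - set(train))  # out-of-bag
        splits.append((train, test))
    return splits
```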
Performance Comparison
Which technique is best? Which is more accurate?
• Classifier performance assessment allows one to evaluate how well a classifier does and how it compares with other schemes.
• Useful when combining decisions/outputs from several classifiers/detectors (in data fusion applications).
Given two algorithms A and B, the comparison is set up as a hypothesis test:
• Hypothesis H0:
For a randomly drawn set of fixed size, algorithms A and B have the same error rate.
• Hypothesis H1:
For a randomly drawn set of fixed size, algorithms A and B do not have the same error rate.
• Need to define:
− Type 1 error rate:
Probability of incorrectly rejecting the true null hypothesis.
− Type 2 error rate:
Probability of incorrectly accepting a false null hypothesis.
Applied to this problem
Type 1 error rate:
Probability of incorrectly detecting a difference between classifier performances when no difference exists.
− Significance level α:
α represents how selective (i.e., restrictive) the user wants the decision between H0 and H1 to be; i.e., for α = 0.05, the user is willing to accept the fact that there is a 5% chance of deciding H0 is incorrect (or false) when it is in fact correct (or true).
Thus,
• The larger α is, the more likely the user is to decide the claim (H0) is incorrect when, in fact, it is correct; i.e., the user becomes more selective, as the user rejects more and more claims even though they are correct.
• The smaller α is, the less likely the user is to decide the claim is incorrect when it is, in fact, correct; i.e., the user becomes less selective, as the user will reject fewer claims; however, the user will accept more and more claims which are, in fact, incorrect.
• McNemar’s Test
− Define the following quantities:
n00: number of test cases misclassified by both A and B
n10: number of test cases misclassified by B but not by A
n01: number of test cases misclassified by A but not by B
n11: number of test cases misclassified by neither A nor B
• Total number of test cases
n = n01 + n10 + n11 + n00
Note:
• Under H0, A and B have the same error rates ⇒ n01 = n10
Theoretically, the expected number of errors made by only one of the two algorithms is
E = (n10 + n01) / 2
− McNemar’s test compares the observed number of errors obtained with one of the two algorithms and the expected number.
− Compute
z = (|n01 − n10| − 1)² / (n10 + n01)
− It turns out z is χ²_1 distributed (chi-square with 1 degree of freedom)
• H0 (the hypothesis that algorithms A and B have the same error rate) is rejected with significance level α (i.e., assuming we accept the α% chance of deciding H0 is incorrect when it is, in fact, correct) when
z > χ²_{1,1−α}
• How to read the χ² table:
χ²_{1,0.95} = 3.841
• Example:
Assume we have a problem with 9 classes and 60 test samples.
Results give
Algorithm A gives 48 correct decisions.
Algorithm B gives 45 correct decisions.
Are the two algorithms to be considered as having the same performance?
n00 = 11
n10 = 4
n01 = 1
n11 = 44
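The statistic for this example follows directly from the counts above; a quick check of the formula, using only n01 (errors by A alone) and n10 (errors by B alone):

```python
def mcnemar_z(n01, n10):
    """McNemar statistic with continuity correction; under H0 it is
    chi-square distributed with 1 degree of freedom."""
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

# Slide counts: n01 = 1 (misclassified by A only), n10 = 4 (by B only)
z = mcnemar_z(1, 4)  # (|1 - 4| - 1)^2 / 5 = 0.8
```

Since z = 0.8 < χ²_{1,0.95} = 3.841, H0 is not rejected at the 5% significance level: the two algorithms are considered to have the same performance.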