TRANSCRIPT
07/10/06 EC4460.SuFy06/MPF 1
VI. Backpropagation Neural Networks (BPNN)
• Review of Adaline
• Newton's method
• Backpropagation algorithm
  – definition
  – derivative computation
  – weight/bias computation
  – function approximation example
  – network generalization issues
  – potential problems with the BPNN
  – momentum filter
  – iteration schemes review
• Generalization
• Implementation issues
  – regularization
  – early stopping

References: [Hagan], [Mathworks]
NN FAQ at ftp://ftp.sas.com/pub/neural/FAQ.html
http://www-stat.stanford.edu/%7Etibs/stat315a.html
Pattern Classification, Duda & Hart, Wiley, 2001
• Recall the Adaline (LMS) network:

  Linear neuron, linear activation function:

    a = purelin(Wp + b) = Wp + b

  with input p (R x 1), weight matrix W (S x R), bias b (S x 1), and
  output a (S x 1).

• Restriction of the Adaline (LMS)?

• Problem solved with Adaline/LMS:

  Given a set {p_i, t_i}, find the weights and bias which minimize the
  mean square error E[(t - a)^2].

  For a single neuron, stack the parameters and the input as

    x = [w; b],   z = [p; 1]

  so that a = x^T z and the performance index is quadratic:

    F(x) = E[e^2] = E[(t - x^T z)^2] = x^T R x - 2 x^T h + c
  Steepest descent on F(x), with e = t - a:

    x_{k+1} = x_k - α ∇F(x_k)

• Practical application:
  – solving for the minimum of F(x) directly requires computing R, h, and R^{-1}
  – alternative: solve the problem iteratively using steepest descent only

  LMS iteration:
    pick x(0)
    a = x_k^T z_k
    e = t - a
    x_{k+1} = x_k + 2 α e z_k
    k = k + 1
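The LMS iteration above can be sketched for a single linear neuron. The toy data set, step size, and epoch count below are illustrative assumptions; the update lines follow the slide.

```python
def lms_fit(samples, alpha=0.1, epochs=100):
    """Adaline/LMS sketch: a = purelin(w*p + b), e = t - a,
    w <- w + 2*alpha*e*p, b <- b + 2*alpha*e, starting from x(0) = 0."""
    w = b = 0.0
    for _ in range(epochs):
        for p, t in samples:
            e = t - (w * p + b)          # error e = t - a
            w += 2 * alpha * e * p       # LMS weight update
            b += 2 * alpha * e           # LMS bias update
    return w, b

# Toy data drawn from t = 2p + 1 (realizable, so LMS converges to it)
w, b = lms_fit([(-1.0, -1.0), (0.0, 1.0), (1.0, 3.0)])
```

Because the data are generated by a linear map, the iteration drives the error to zero and recovers the generating weight and bias.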
• Extensions ==> Multilayer perceptron
• Why use multilayer structures?

  [Figure: two classes made up of subclasses; K = 9 subclasses, M = 2 classes]
• Example: pattern classification: the XOR gate

    {p1 = [0; 0], t1 = 0},  {p2 = [0; 1], t2 = 1},
    {p3 = [1; 0], t3 = 1},  {p4 = [1; 1], t4 = 0}

  − Can it be solved with a single-layer perceptron?
    No: the two classes are not linearly separable.
NN block diagram:

Note: the final network space partitioning varies as a function of the
number of neurons in the hidden layer.
Example:

  [Figure: two-input network with three hidden neurons; first-layer weights
  1, -0.5, 0.5, 2, -2, -1; biases b1, b2, b3; outputs y1, y2, y3]

• Assume b1 = 0.5, b2 = 2, b3 = 1
• Plot the decision boundaries obtained assuming hard-limit (HL) activation
  functions are used
• Derive the weight matrix and bias vector used for this network
• Design the NN second layer (following the given in-class guidelines,
  i.e., identify its weight matrix and bias)
• Example: multilayer perceptron (classification)
Assume dark = 1
• Example: function approximation

  1-2-1 network: log-sigmoid hidden layer, linear output layer.

    f^1(n) = 1 / (1 + e^{-n}),   f^2(n) = n

    a1 = logsig(W1 p + b1),   a2 = purelin(W2 a1 + b2)

  Nominal parameter values:

    w^1_{1,1} = 10,  w^1_{2,1} = 10,  b^1_1 = -10,  b^1_2 = 10
    w^2_{1,1} = 1,   w^2_{1,2} = 1,   b^2 = 0
  [Figures: example function approximation network; nominal response of the
  network above over p ∈ [-2, 2]; effect of parameter changes on the
  network response as b^2, b^{(1)}, w^{(2)}_{1,2}, and w^{(1)}_{1,2} are
  varied over ranges of the form -1 <= b^2 <= 1, -1 <= w^{(2)}_{1,2} <= 1,
  -1 <= w^{(1)}_{1,2} <= 1]
• Backpropagation algorithm:

• Goal: given a set of {p_i, t_i}, find the weights and biases which
  minimize the mean square error (performance surface)

    F(x) = E[(t - a)^2]

  Three-layer network (input dimension R; layer sizes S1, S2, S3):

    a1 = f1(W1 p + b1),   a2 = f2(W2 a1 + b2),   a3 = f3(W3 a2 + b3)

  i.e., a3 = f3(W3 f2(W2 f1(W1 p + b1) + b2) + b3)

  Discard the expectation operator and approximate F(x) by the squared
  error at iteration k:

    F_hat(x) = (t_k - a_k)^T (t_k - a_k)
• For a one-layer network with a purelin activation function, this reduces
  to the LMS updates:

    w_{k+1} = w_k + 2 α e_k p_k^T
    b_{k+1} = b_k + 2 α e_k
• How to compute the derivatives? → use steepest descent (SD)

  Recall:

    w^m_{i,j}(k+1) = w^m_{i,j}(k) - α (∂F/∂w^m_{i,j})
    b^m_i(k+1)     = b^m_i(k)     - α (∂F/∂b^m_i)

• Note: F(x) may not be expressed directly in terms of w^1_{i,j},
  w^2_{i,j}, etc. We need to use the chain rule:

    d f(n(w)) / dw = (df/dn) × (dn/dw)

• Example:

    f(n) = e^n,   n(w) = 5w + 1
    => d f(n(w)) / dw = e^{5w+1} × 5
  Applying the chain rule:

    ∂F/∂w^m_{i,j} = (∂F/∂n^m_i) × (∂n^m_i/∂w^m_{i,j})
    ∂F/∂b^m_i     = (∂F/∂n^m_i) × (∂n^m_i/∂b^m_i)

  where the net input to neuron i of layer m is

    n^m_i = sum_j w^m_{i,j} a^{m-1}_j + b^m_i

  so that

    ∂n^m_i/∂w^m_{i,j} = a^{m-1}_j,    ∂n^m_i/∂b^m_i = 1

  Notation: w^m_{i,j} is the layer-m weight associated with neuron i and
  input j; n^m_i is the net input associated with neuron i.

  The updates become:

    w^m_{i,j,k+1} = w^m_{i,j,k} - α (∂F/∂n^m_i) a^{m-1}_j
    b^m_{i,k+1}   = b^m_{i,k}   - α (∂F/∂n^m_i)

  Define the sensitivity s^m_i = ∂F/∂n^m_i: the sensitivity of F(.) to
  changes in the i-th neuron's net input at layer m.
Expressing Weight/Bias Updates in Matrix Form

  Element form:

    w^m_{i,j}(k+1) = w^m_{i,j}(k) - α (∂F/∂n^m_i) a^{m-1}_j
    b^m_i(k+1)     = b^m_i(k)     - α (∂F/∂n^m_i)

  Each row of the weight matrix is associated with one neuron:

    W = [ w_11  w_12  ...  w_1R ]    (row 1: neuron 1)
        [ w_21  w_22  ...  w_2R ]    (row 2: neuron 2)

  Collecting the sensitivities into the vector s^m = ∂F/∂n^m, the updates
  for a two-neuron layer read, element by element,

    w^m_{11,k+1} = w^m_{11,k} - α (∂F/∂n^m_1) a^{m-1}_1    (input j = 1, neuron i = 1)
    w^m_{12,k+1} = w^m_{12,k} - α (∂F/∂n^m_1) a^{m-1}_2    (input j = 2, neuron i = 1)
    w^m_{21,k+1} = w^m_{21,k} - α (∂F/∂n^m_2) a^{m-1}_1    (input j = 1, neuron i = 2)

  or, in matrix form,

    W^m_{k+1} = W^m_k - α s^m (a^{m-1})^T
  Collecting both updates in matrix form, for every layer m:

    W^m_{k+1} = W^m_k - α s^m (a^{m-1})^T
    b^m_{k+1} = b^m_k - α s^m
• Computing the sensitivities ∂F/∂n^m_i: we need to use the chain rule
  again. This will involve terms of the form

    (∂F/∂n^{m+1}_j) × (∂n^{m+1}_j/∂n^m_i)

  Define the Jacobian matrix (shown for two neurons per layer):

    ∂n^{m+1}/∂n^m = [ ∂n^{m+1}_1/∂n^m_1   ∂n^{m+1}_1/∂n^m_2 ]
                    [ ∂n^{m+1}_2/∂n^m_1   ∂n^{m+1}_2/∂n^m_2 ]

  A typical element:

    ∂n^{m+1}_i/∂n^m_j = ∂[ sum_l w^{m+1}_{i,l} a^m_l + b^{m+1}_i ] / ∂n^m_j
                      = w^{m+1}_{i,j} (∂a^m_j/∂n^m_j)
                      = w^{m+1}_{i,j} f^m'(n^m_j)

  Hence

    ∂n^{m+1}/∂n^m = W^{m+1} Fdot^m(n^m)

  where Fdot^m(n^m) is the diagonal matrix

    Fdot^m(n^m) = [ f^m'(n^m_1)      0       ]
                  [     0        f^m'(n^m_2) ]
  Define the sensitivity vector

    s^m = ∂F/∂n^m = [ ∂F/∂n^m_1 ; ∂F/∂n^m_2 ]

  (n^m_1: net input of the first neuron; n^m_2: of the second neuron).

  Next, apply the chain rule for vectors:

    ∂F/∂n^m_1 = (∂n^{m+1}_1/∂n^m_1)(∂F/∂n^{m+1}_1)
              + (∂n^{m+1}_2/∂n^m_1)(∂F/∂n^{m+1}_2)        (1)

    ∂F/∂n^m_2 = (∂n^{m+1}_1/∂n^m_2)(∂F/∂n^{m+1}_1)
              + (∂n^{m+1}_2/∂n^m_2)(∂F/∂n^{m+1}_2)        (2)

  ((1): sensitivity of F to a change in the 1st element of the net input at
  layer m; (2): the same for the 2nd element.)

  In matrix form, (1) & (2) give the sensitivity recursion:

    s^m = (∂n^{m+1}/∂n^m)^T (∂F/∂n^{m+1})
        = (W^{m+1} Fdot^m(n^m))^T s^{m+1}
        = Fdot^m(n^m) (W^{m+1})^T s^{m+1}
• We need to start the recursion by computing s^M at the last layer:

    s^M_i = ∂F/∂n^M_i = ∂[ sum_j (t_j - a_j)^2 ] / ∂n^M_i
          = -2 (t_i - a_i) (∂a^M_i/∂n^M_i)
          = -2 (t_i - a_i) f^M'(n^M_i)

  In matrix form:

    s^M = -2 Fdot^M(n^M) (t - a)

• Note: a = f(n)
Summary:

  Start:       a^0 = p
  Propagate:   a^{m+1} = f^{m+1}(W^{m+1} a^m + b^{m+1}),  m = 0, ..., M-1
               a = a^M
  Compute:     s^M = -2 Fdot^M(n^M) (t - a)
               s^m = Fdot^m(n^m) (W^{m+1})^T s^{m+1},  m = M-1, ..., 1
  Update:      W^m(k+1) = W^m(k) - α s^m (a^{m-1})^T
               b^m(k+1) = b^m(k) - α s^m

  Note: we will need derivatives for all activation functions.
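A minimal sketch of this summary for a 1-2-1 logsig/purelin network, with the gradients checked against finite differences. The helper names are assumptions; the demonstration weights reuse the nominal values from the earlier function approximation example.

```python
import numpy as np

def logsig(n):
    return 1.0 / (1.0 + np.exp(-n))

def forward(W1, b1, W2, b2, p):
    a1 = logsig(W1 * p + b1)            # a1 = f1(W1 p + b1), shape (2, 1)
    a2 = W2 @ a1 + b2                   # a2 = purelin(W2 a1 + b2), shape (1, 1)
    return a1, a2

def backprop_grads(W1, b1, W2, b2, p, t):
    """Gradients of F = (t - a2)^2 via the sensitivity recursion above."""
    a1, a2 = forward(W1, b1, W2, b2, p)
    s2 = -2 * (t - a2)                  # s^M = -2 Fdot^M(n^M)(t - a); purelin => Fdot = 1
    s1 = (a1 * (1 - a1)) * (W2.T @ s2)  # s1 = Fdot1(n1) (W2)^T s2; logsig' = a(1 - a)
    return s1 * p, s1, s2 @ a1.T, s2    # dF/dW1, dF/db1, dF/dW2, dF/db2

# One steepest-descent step, W^m <- W^m - alpha * s^m * (a^{m-1})^T:
W1, b1 = np.array([[10.0], [10.0]]), np.array([[-10.0], [10.0]])
W2, b2 = np.array([[1.0, 1.0]]), np.array([[0.0]])
gW1, gb1, gW2, gb2 = backprop_grads(W1, b1, W2, b2, p=1.0, t=1.5)
alpha = 0.1
W1, b1, W2, b2 = W1 - alpha * gW1, b1 - alpha * gb1, W2 - alpha * gW2, b2 - alpha * gb2
```

The sensitivity recursion can be validated numerically by perturbing one parameter at a time and comparing the finite-difference slope of F with the analytic gradient.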
Example: function approximation

  1-2-1 network trained on the error e = t - a to approximate

    g(p) = 1 + sin(k π p / 4),   k = 1
  1-2-1 network: log-sigmoid hidden layer, linear output layer:

    a1 = logsig(W1 p + b1),   a2 = purelin(W2 a1 + b2)
  Initial conditions (example: Textbook, pp. 11-14):

    W^(1)(0) = [-0.2; -0.4],   b^(1)(0) = [-0.5; -0.1]
    W^(2)(0) = [0  0.1],       b^(2)(0) = 0.5

  [Figure: network response vs. the sine wave for the initial values]
What does the 1-2-1 network look like?
Example:

    g_i(p) = 1 + sin(i π p / 4),   p ∈ [-2, 2],   i = 1, 2, 4, 8

  Function approximation with

    f^(1)(n) = 1 / (1 + e^{-n}),   f^(2)(n) = n

  [Figure 11.10: function approximation using a 1-3-1 network for
  i = 1, 2, 4, 8]

  Convergence issues:

    g(p) = 1 + sin(6 π p / 4),   p ∈ [-2, 2]

  [Figure: effect of increasing the number of hidden neurons; responses of
  1-2-1, 1-3-1, 1-4-1, and 1-5-1 networks]
Network generalization issues

    g(p) = 1 + sin(π p / 4),   p ∈ [-2, 2]

  [Figure 11.13: convergence to a local minimum]
  [Figure 11.14: 1-2-1 network approximation of g(p)]
  [Figure 11.15: 1-9-1 network approximation of g(p)]
• Potential problems with backpropagation:
  − activation functions may be nonlinear
  − the performance surface is not unimodal
  − convergence may be sped up with a variable learning rate:
      increase the step size when the performance index is flat
      decrease the step size when the performance index is steep

  Possible strategy:
  • If the error increases by more than a pre-defined value (typically
    4-5%):
      the new weights are discarded
      the learning rate is decreased (multiplied by 0.7)
  • If the error increases by less than 4-5%: keep the new weights
  • If the error decreases: the learning rate is increased by 5%
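The strategy above can be sketched on a toy quadratic objective. The 4% threshold, 0.7 decrease factor, and 5% increase follow the slide; the objective, its gradient, the starting point, and the iteration count are illustrative assumptions, and the momentum-coefficient reset is omitted for brevity.

```python
def adapt_step(x, alpha, grad, loss, zeta=0.04, rho=0.7, eta=1.05):
    """One update of x with the variable-learning-rate rules above."""
    x_new = x - alpha * grad(x)
    if loss(x_new) > (1 + zeta) * loss(x):
        return x, alpha * rho          # error up > 4%: discard step, cut rate
    if loss(x_new) < loss(x):
        return x_new, alpha * eta      # error down: keep step, raise rate 5%
    return x_new, alpha                # small increase: keep step, rate unchanged

loss = lambda x: x * x                 # toy performance index
grad = lambda x: 2 * x
x, alpha = 5.0, 2.0                    # deliberately too-large initial step
for _ in range(50):
    x, alpha = adapt_step(x, alpha, grad, loss)
```

Starting with a step size that overshoots, the rule repeatedly discards diverging updates and shrinks the rate until steady progress resumes.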
• Convergence may also be sped up with a momentum filter.

• Introduce memory and low-pass (LP) filter behavior to update W^m, b^m.

  Define the filter (input s_k, output x_k):

    x_k = γ x_{k-1} + (1 - γ) s_k

  − filter response:

      X(z) = γ z^{-1} X(z) + (1 - γ) S(z)
      ==>  H(z) = X(z)/S(z) = (1 - γ) / (1 - γ z^{-1})

• Apply the above concept to the iteration equations. Recall

    W^m_{k+1} = W^m_k - α s^m (a^{m-1})^T
    b^m_{k+1} = b^m_k - α s^m

  With momentum, the updates become

    ΔW^m_k = γ ΔW^m_{k-1} - (1 - γ) α s^m (a^{m-1})^T,   W^m_{k+1} = W^m_k + ΔW^m_k
    Δb^m_k = γ Δb^m_{k-1} - (1 - γ) α s^m,               b^m_{k+1} = b^m_k + Δb^m_k
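The filter x_k = γ x_{k-1} + (1 - γ) s_k is a first-order low-pass filter; a minimal sketch of its smoothing effect, where the oscillating input sequence is an illustrative assumption and γ = 0.8 matches the value used in the MOBP trajectory later:

```python
def momentum_filter(steps, gamma=0.8):
    """Apply x_k = gamma*x_{k-1} + (1 - gamma)*s_k to a sequence of updates."""
    out, x = [], 0.0
    for s in steps:
        x = gamma * x + (1 - gamma) * s   # low-pass filter on the raw updates
        out.append(x)
    return out

smoothed = momentum_filter([1.0, -1.0] * 10)  # oscillating raw updates
```

Sign-alternating updates (oscillation across a narrow valley) are strongly attenuated, while a constant update passes through unchanged in steady state.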
• Iteration Techniques

  − General iteration:

      x_{k+1} = x_k + α_k p_k,   where p_k is selected so that
      F(x_{k+1}) < F(x_k)

  − Use a Taylor series expansion:

    1. First-order expansion:

         F(x_{k+1}) = F(x_k + Δx_k) ≈ F(x_k) + ∇F(x)^T|_{x=x_k} Δx_k

       Leads to the steepest-descent (SD) scheme:

         x_{k+1} = x_k - α ∇F(x)|_{x=x_k}

    2. Second-order expansion:

         F(x_{k+1}) = F(x_k + Δx_k)
                    ≈ F(x_k) + ∇F(x)^T|_{x=x_k} Δx_k + (1/2) Δx_k^T A_k Δx_k,
         with A_k = ∇²F(x)|_{x=x_k}

       Leads to Newton's scheme:

         x_{k+1} = x_k - A_k^{-1} ∇F(x)|_{x=x_k}

  Recall: potential problems with the Newton scheme (Hessian, gradient,
  convergence).
• Levenberg-Marquardt Algorithm

  Designed to speed up the convergence of Newton's method by reducing the
  computational load.

  Recall:  F(x) = V(x)^T V(x) = sum_i v_i^2

  Assume:
    1. ∇F(x)  = 2 J^T(x) V(x)
    2. ∇²F(x) = 2 J^T(x) J(x) + 2 S(x)

  Newton step:  x_{k+1} = x_k - [∇²F(x)]^{-1} ∇F(x)|_{x=x_k}

  LM step:      x_{k+1} = x_k - [J^T(x_k) J(x_k) + μ_k I]^{-1} J^T(x_k) V(x_k)

• General guidelines for μ_k:
  start with μ_k = 0.01; if F(x) doesn't decrease, repeat with μ_k = 10 μ_k.
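One LM step as written above, with the μ schedule from the guideline (start at 0.01, multiply by 10 until F decreases). The one-parameter exponential-fit residual, starting point, and iteration count are illustrative assumptions.

```python
import numpy as np

def lm_step(x, residual, jacobian, mu=0.01, max_tries=20):
    """x <- x - [J^T J + mu I]^{-1} J^T v, raising mu until F decreases."""
    v, J = residual(x), jacobian(x)
    F = float(v @ v)                    # F(x) = V(x)^T V(x)
    for _ in range(max_tries):
        dx = np.linalg.solve(J.T @ J + mu * np.eye(len(x)), J.T @ v)
        x_new = x - dx
        if float(residual(x_new) @ residual(x_new)) < F:
            return x_new, mu            # F decreased: accept the step
        mu *= 10.0                      # F did not decrease: raise mu, retry
    return x, mu

t = np.linspace(0.0, 1.0, 5)
y = np.exp(0.5 * t)                     # data generated with c = 0.5
residual = lambda x: np.exp(x[0] * t) - y
jacobian = lambda x: (t * np.exp(x[0] * t)).reshape(-1, 1)
x = np.array([2.0])
for _ in range(20):
    x, _ = lm_step(x, residual, jacobian)
```

For small μ the step approaches Gauss-Newton; for large μ it approaches a small steepest-descent step, which is what makes the schedule robust far from the solution.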
• Squared error surface as a function of the weight values w^1_{1,1} and
  w^2_{1,1}

  [Figure 12.3: surface and contour plots of the squared error]
• Squared error surface as a function of the weight values w^1_{1,1} and
  w^2_{1,1} (second example)

  [Figure: surface and contour plots of the squared error]
  [Figure 12.6: two SDBP (batch mode) trajectories in the
  (w^1_{1,1}, w^2_{1,1}) plane]
  [Figure 12.8: trajectory with the learning rate too large]
Momentum Backpropagation

  Steepest Descent Backpropagation (SDBP) vs. Momentum Backpropagation
  (MOBP), with γ = 0.8:

  [Figure: SDBP and MOBP trajectories in the (w^1_{1,1}, w^2_{1,1}) plane]

  Standard updates:

    W^m_{k+1} = W^m_k - α s^m (a^{m-1})^T
    b^m_{k+1} = b^m_k - α s^m

  Momentum updates:

    ΔW^m_k = γ ΔW^m_{k-1} - (1 - γ) α s^m (a^{m-1})^T,   W^m_{k+1} = W^m_k + ΔW^m_k
    Δb^m_k = γ Δb^m_{k-1} - (1 - γ) α s^m,               b^m_{k+1} = b^m_k + Δb^m_k
Variable Learning Rate

• If the squared error (over the entire training set) increases by more
  than some set percentage ζ after a weight update, then the weight update
  is discarded, the learning rate is multiplied by some factor
  (0 < ρ < 1), and the momentum coefficient γ is set to zero.

• If the squared error decreases after a weight update, then the weight
  update is accepted and the learning rate is multiplied by some factor
  η > 1. If γ has been previously set to zero, it is reset to its original
  value.

• If the squared error increases by less than ζ, then the weight update is
  accepted, but the learning rate and the momentum coefficient are
  unchanged.
  [Figure 12.11: variable learning rate trajectory in the
  (w^1_{1,1}, w^2_{1,1}) plane, with η = 1.05, ρ = 0.7 (damping factor for
  the learning rate), ζ = 4% (error threshold); squared error and learning
  rate vs. iteration number]
Conjugate Gradient

  1. The first search direction is the steepest descent:

       p_0 = -g_0,   where g_k ≡ ∇F(x)|_{x=x_k}

  2. Take a step, choosing the learning rate to minimize the function
     along the search direction:

       x_{k+1} = x_k + α_k p_k

  3. Select the next search direction according to:

       p_k = -g_k + β_k p_{k-1}

     where

       β_k = (Δg_{k-1}^T g_k) / (Δg_{k-1}^T p_{k-1})    (Hestenes-Stiefel update)
     or
       β_k = (g_k^T g_k) / (g_{k-1}^T g_{k-1})          (Fletcher-Reeves update)
     or
       β_k = (Δg_{k-1}^T g_k) / (g_{k-1}^T g_{k-1})     (Polak-Ribiere update)
  [Figure: conjugate gradient trajectory in the (w^1_{1,1}, w^2_{1,1})
  plane]
  [Figure: Levenberg-Marquardt trajectory in the (w^1_{1,1}, w^2_{1,1})
  plane]
• Resilient Backpropagation Network

  • BPNNs usually use sigmoid functions (tansig, logsig) as activation
    functions to introduce nonlinear behavior.
  • These can cause the network to have very small gradients and the
    iterations to (almost) stall.
  • Resilient BPNN uses
    − only the signs of the gradient components to determine the direction
      of the weight update,
    − weight change magnitudes determined by a separate update value.
Algorithm Comparisons

• It is very difficult to know which training algorithm will be the
  fastest for a given problem.
• Convergence speed depends on many factors:
  − complexity of the problem,
  − number of data points in the training set,
  − number of weights and biases in the network,
  − error goal,
  − whether the network is being used for pattern recognition
    (discriminant analysis) or function approximation (regression),
  − etc.
Toy Example 1: Sinusoid function approximation

Network set-up: 1-5-1; activation functions (tansig, purelin)
Number of trials: 30, with random initial weights and biases
Error threshold: MSE < 0.002
(Sun Sparc 2 workstation)

Algorithm   Mean Time(s)   Ratio   Min. Time(s)   Max. Time(s)   Std.(s)
LM          1.14           1.00    0.65           1.83           0.38
BFG         5.22           4.58    3.17           14.38          2.08
RP          5.67           4.97    2.66           17.24          3.72
SCG         6.09           5.34    3.18           23.64          3.81
CGB         6.61           5.80    2.99           23.65          3.67
CGF         7.86           6.89    3.57           31.23          4.76
CGP         8.24           7.23    4.07           32.32          5.03
OSS         9.64           8.46    3.97           59.63          9.79
GDX         27.69          24.29   17.21          258.15         43.65

Algorithm acronyms:
LM (trainlm) - Levenberg-Marquardt
BFG (trainbfg) - BFGS Quasi-Newton
RP (trainrp) - Resilient Backpropagation
SCG (trainscg) - Scaled Conjugate Gradient
CGB (traincgb) - Conjugate Gradient with Powell/Beale Restarts
CGF (traincgf) - Fletcher-Powell Conjugate Gradient
CGP (traincgp) - Polak-Ribiere Conjugate Gradient
OSS (trainoss) - One-Step Secant
GDX (traingdx) - Variable Learning Rate Backpropagation
Example 2: function approximation (nonlinear regression) - Engine data set

Network set-up: 2-30-2
Network inputs: engine speed and fueling levels
Network outputs: torque and emission levels
Activation functions (tansig, purelin)
Number of trials: 30, with random initial weights and biases
Error threshold: MSE < 0.005
(Sun Enterprise 4000 workstation)

Algorithm   Mean Time(s)   Ratio   Min. Time(s)   Max. Time(s)   Std.(s)
LM          18.45          1.00    12.01          30.03          4.27
BFG         27.12          1.47    16.42          47.36          5.95
RP          36.02          1.95    19.39          52.45          7.78
SCG         37.93          2.06    18.89          50.34          6.12
CGB         39.93          2.16    23.33          55.42          7.50
CGF         44.30          2.40    24.99          71.55          9.89
CGP         48.71          2.64    23.51          80.90          12.33
OSS         65.91          3.57    31.83          134.31         34.24
GDX         188.50         10.22   81.59          279.90         66.67

(Algorithm acronyms as in Toy Example 1.)
Example 3: Pattern recognition - Cancer data set

Network set-up: 9-5-5-2
Network inputs: clump thickness, uniformity of cell size and cell shape,
amount of marginal adhesion, frequency of bare nuclei
Network outputs: benign or malignant tumor
Activation functions (tansig in all layers)
Number of trials: 30, with random initial weights and biases
Error threshold: MSE < 0.012
(Sun Sparc 2 workstation)

Algorithm   Mean Time(s)   Ratio   Min. Time(s)   Max. Time(s)   Std.(s)
CGB         80.27          1.00    55.07          102.31         13.17
RP          83.41          1.04    59.51          109.39         13.44
SCG         86.58          1.08    41.21          112.19         18.25
CGP         87.70          1.09    56.35          116.37         18.03
CGF         110.05         1.37    63.33          171.53         30.13
LM          110.33         1.37    58.94          201.07         38.20
BFG         209.60         2.61    118.92         318.18         58.44
GDX         313.22         3.90    166.48         446.43         75.44
OSS         463.87         5.78    250.62         599.99         97.35

(Algorithm acronyms as in Toy Example 1.)
Other examples are available at
http://www.mathworks.com/access/helpdesk/help/toolbox/nnet/backpr14.shtml
Several algorithm characteristics can be deduced from these experiments:
• In general, on function approximation problems, for networks that contain up to a fewhundred weights, the LM algorithm will have the fastest convergence. This advantage is especially noticeable if very accurate training is required.
• In many cases, trainlm is able to obtain lower mean square errors than any of the otheralgorithms tested.
• However, as the number of weights in the network increases, the advantage of trainlm decreases. In addition, trainlm performance is relatively poor on pattern recognition problems. The storage requirements of trainlm are larger than those of the other algorithms tested. By adjusting the mem_reduc parameter, discussed earlier, the storage requirements can be reduced, but at a cost of increased execution time.
• The trainrp function is the fastest algorithm on pattern recognition problems. However, it does not perform well on function approximation problems. Itsperformance also degrades as the error goal is reduced. The memory requirementsfor this algorithm are relatively small in comparison to the other algorithms considered.
• The conjugate gradient algorithms, in particular trainscg, seem to perform well over awide variety of problems, particularly for networks with a large number of weights.
The SCG algorithm is almost as fast as the LM algorithm on function approximationproblems (faster for large networks) and is almost as fast as trainrp on patternrecognition problems. Its performance does not degrade as quickly as trainrpperformance does when the error is reduced. The conjugate gradient algorithms haverelatively modest memory requirements.
• The trainbfg performance is similar to that of trainlm. It does not require as muchstorage as trainlm, but the computation required does increase geometrically with the size of the network, since the equivalent of a matrix inverse must be computed at each iteration.
• The variable learning rate algorithm traingdx is usually much slower than the othermethods, and has about the same storage requirements as trainrp, but it can still beuseful for some problems. There are certain situations in which it is better to convergemore slowly. For example, when using early stopping, you may have inconsistentresults if you use an algorithm that converges too quickly. You may overshoot the point at which the error on the validation set is minimized.
EXPERIMENT CONCLUSIONS
Generalization Issues

• The network may be overtrained (overfitting) when the MSE goal on the
  training set is set too low.
  Potential risk: the network memorizes the training examples, but doesn't
  learn to generalize to similar but new situations.
• Consequences:
  − very good performance on the training set,
  − very poor performance on the testing set.

  [Figure: 1-20-1 net fit to a noisy sine]

• How to prevent overfitting?
  • Use a network that is not too large for the problem
    (the appropriate network size is difficult to guess a priori)
  • Increase the training set size if possible
  • Apply
    − regularization
    − early stopping
• Regularization

  − Recall the basic performance (MSE) function is defined as:

      MSE = (1/N) sum_{i=1}^{N} e_i^2 = (1/N) sum_{i=1}^{N} (t_i - a_i)^2

  − The performance function is modified as:

      MSE_reg = γ MSE + (1 - γ) MSW,

    where MSW = (1/N) sum_{i=1}^{P} w_i^2 and γ is the performance ratio.

  − Consequences: MSE_reg forces the network
    • to have smaller weights and biases,
    • to produce a smoother response,
    • to be less likely to overfit.

  − Drawbacks:
    • difficult to estimate γ:
        γ too large → overfitting problem
        γ too small → poor fit of the training data
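The modified index can be sketched directly. The toy errors, weights, and γ value are illustrative assumptions; here each mean is taken over its own list, a common convention.

```python
def mse_reg(errors, weights, gamma):
    """Regularized index: gamma*MSE + (1 - gamma)*MSW."""
    mse = sum(e * e for e in errors) / len(errors)    # mean squared error
    msw = sum(w * w for w in weights) / len(weights)  # mean squared weight
    return gamma * mse + (1 - gamma) * msw

J = mse_reg(errors=[0.1, -0.2], weights=[3.0, -4.0, 1.0], gamma=0.9)
```

Even with γ close to 1, large weights dominate the index, pushing training toward smaller weights and a smoother response.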
• Automated Regularization (MATLAB: trainbr)

  Definition:
  − Assume the weights and biases are random variables with specific
    distributions.
  − Define the new performance function as: MSE_aut = α MSE + β MSW
  − Apply statistical concepts (Bayes' rule) to find optimum values for
    α and β (iterative procedure).

  [Figure: basic MSE fit vs. MSE_aut fit]
• Early Stopping (MATLAB: train with option "val")

  Definition: the training set is split into two sets:
  − training subset: used to compute the network weights and biases
  − validation subset: the error on the validation subset is monitored
    during training; the validation error
      goes down at training onset,
      goes back up when the network starts to overfit the data.
  − Training is continued until the validation error increases for a
    specified number of iterations.
  − The final weights & biases are those obtained for the minimum
    validation error.

  [Figure: basic MSE fit vs. early-stopping MSE fit]
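The stopping rule above can be sketched with a synthetic validation-error sequence standing in for a real training loop (an assumption); the patience parameter is the "specified number of iterations".

```python
def early_stop(val_errors, patience=3):
    """Track the minimum validation error; stop after `patience`
    consecutive epochs without a new minimum."""
    best_i, best_e, worse = 0, float("inf"), 0
    for i, e in enumerate(val_errors):
        if e < best_e:
            best_i, best_e, worse = i, e, 0   # new minimum: remember weights
        else:
            worse += 1
            if worse >= patience:
                break                          # validation error keeps rising
    return best_i, best_e  # epoch index / error of the weights to keep

# Validation error goes down, then back up as the network starts to overfit:
idx, err = early_stop([0.9, 0.5, 0.3, 0.25, 0.31, 0.4, 0.55, 0.7])
```

The returned index marks the epoch whose weights and biases would be kept as the final network.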
• Both regularization and early stopping can ensure network generalization
  when properly applied.
• When using Bayesian regularization, it is important to train the network
  until it reaches convergence. The MSE, MSW, and the effective number of
  parameters should reach constant values when the network has converged.
• For early stopping, be careful not to use an algorithm that converges
  too rapidly. If you are using a fast algorithm (like trainlm), set the
  training parameters so that the convergence is relatively slow (e.g.,
  set mu to a relatively large value, such as 1, and set mu_dec and mu_inc
  to values close to 1, such as 0.8 and 1.5, respectively). The training
  functions trainscg and trainrp usually work well with early stopping.
• With early stopping, the choice of the validation set is also important.
  The validation set should be representative of all points in the
  training set.
• With both regularization and early stopping, it is a good idea to train
  the network starting from several different initial conditions. It is
  possible for either method to fail in certain circumstances. By testing
  several different initial conditions, you can verify robust network
  performance.
• Based on our (MathWorks) experience, Bayesian regularization generally
  provides better generalization performance than early stopping when
  training function approximation networks. This is because Bayesian
  regularization does not require that a validation data set be separated
  out of the training data set; it uses all of the data. This advantage is
  especially noticeable when the size of the data set is small.
(MATHWORKS) CONCLUSIONS
Early Stopping/Validation discussion

Data Set Title   No. pts.   Network   Description
SINE (5% N)      41         1-15-1    Single-cycle sine wave with Gaussian noise at 5% level
SINE (2% N)      41         1-15-1    Single-cycle sine wave with Gaussian noise at 2% level
ENGINE (ALL)     1199       2-30-2    Engine sensor - full data set
ENGINE (1/4)     300        2-30-2    Engine sensor - 1/4 of data set

Mean squared test set error:

Method   Engine(All)   Engine(1/4)   Sine(5% N)   Sine(2% N)
ES       1.3e-2        1.9e-2        1.7e-1       1.3e-1
BR       2.6e-3        4.7e-3        3.0e-2       6.3e-3
ES/BR    5             4             5.7          21
• Some general design principles (from the NN FAQ)

  − Data encoding issues
  − Number of layers issues
  − Number of neurons per layer issues
  − Input variable standardization issues
  − Output variable standardization issues
  − Generalization error evaluation issues
• Data encoding issues (from the NN FAQ)

  [Figure: encoding example with class labels 1, 2, 3 and an unknown
  point X]
• Number of layers issues [NN FAQ]

  − You may not need any hidden layers at all. Linear and generalized
    linear models are useful in a wide variety of applications. And even
    if the function you want to learn is mildly nonlinear, you may get
    better generalization with a simple linear model than with a
    complicated nonlinear model if there is too little data or too much
    noise to estimate the nonlinearities accurately.
  − In MLPs with step/threshold/Heaviside activation functions, you need
    two hidden layers for full generality.
  − In MLPs with any of a wide variety of continuous nonlinear
    hidden-layer activation functions, one hidden layer with an
    arbitrarily large number of units suffices for the "universal
    approximation" property. But there is no theory yet to tell you how
    many hidden units are needed to approximate any given function.
• Number of neurons per layer issues [NN FAQ]

  − The best number of hidden units depends in a complex way on:
    • the numbers of input and output units
    • the number of training cases
    • the amount of noise in the targets
    • the complexity of the function or classification to be learned
    • the architecture
    • the type of hidden unit activation function
    • the training algorithm
    • regularization
  − In most situations, there is no way to determine the best number of
    hidden units without training several networks and estimating the
    generalization error of each. If you have too few hidden units, you
    will get high training error and high generalization error due to
    underfitting and high statistical bias. If you have too many hidden
    units, you may get low training error but still have high
    generalization error due to overfitting and high variance.
• Input variable standardization issues [NN FAQ]

  − An input's contribution depends on its variability relative to the
    other inputs.

    Example:
      Input 1 in range [-1, 1]
      Input 2 in range [0, 10000]
    Input 1's contribution will be swamped by Input 2's.

  − Scale the inputs so that their variability reflects their importance:
    * If importance is not known: scale all inputs to the same variability
      or the same range.
    * If importance is known: scale more important inputs so that they
      have larger variances/ranges.
  Standardizing input variables has different effects on different
  training algorithms for MLPs. For example:

  1) Steepest descent is very sensitive to scaling. The more
     ill-conditioned the Hessian is, the slower the convergence. Hence,
     scaling is an important consideration for gradient descent methods
     such as standard backpropagation.
  2) Quasi-Newton and conjugate gradient methods begin with a steepest
     descent step and are therefore scale sensitive. However, they
     accumulate second-order information as training proceeds and hence
     are less scale sensitive than pure gradient descent.
  3) Newton-Raphson and Gauss-Newton, if implemented correctly, are
     theoretically invariant under scale changes as long as none of the
     scaling is so extreme as to produce underflow or overflow.
  4) Levenberg-Marquardt is scale invariant as long as no ridging is
     required. There are several different ways to implement ridging;
     some are scale invariant and some are not. Performance under bad
     scaling will depend on details of the implementation.
• Output variable standardization issues [NN FAQ]
  − Target output value ranges should reflect possible neural network
    output values.

  − Standardizing target variables is typically more a convenience for
    getting good initial weights than a necessity. However, if you have
    two or more target variables and your error function is
    scale-sensitive, like the usual least (mean) squares error function,
    then the variability of each target relative to the others can affect
    how well the net learns that target. If one target has a range of 0
    to 1, while another target has a range of 0 to 10^6, the net will
    expend most of its effort learning the second target, to the possible
    exclusion of the first. So it is essential to rescale the targets so
    that their variability reflects their importance, or at least is not
    in inverse relation to their importance. If the targets are of equal
    importance, they should typically be standardized to the same range
    or the same standard deviation.

  − If the target variable does not have known upper and lower bounds, do
    not use an output activation function with a bounded range.
• Generalization error evaluation issues [NN FAQ]
  Three basic necessary (but not sufficient!) conditions for
  generalization:
  1) The network inputs must contain sufficient information pertaining to
     the target, so that there exists a mathematical function relating
     correct outputs to inputs with the desired degree of accuracy
     (neural nets are not clairvoyant!).
  2) The function relating inputs to correct outputs must be, in some
     sense, smooth, i.e., a small change in the inputs should, most of
     the time, produce a small change in the outputs. For continuous
     inputs and targets, smoothness of the function implies continuity
     and restrictions on the first derivative over most of the input
     space. Some neural nets can learn discontinuities as long as the
     function consists of a finite number of continuous pieces. Very
     nonsmooth functions, such as those produced by pseudo-random number
     generators and encryption algorithms, cannot be generalized by
     neural nets. Often a nonlinear transformation of the input space can
     increase the smoothness of the function and improve generalization.
  3) The training set must be a sufficiently large and representative
     subset of the set of all cases that you want to generalize to. The
     importance of this condition is related to the fact that there are,
     loosely speaking, two different types of generalization:
     interpolation and extrapolation. Interpolation applies to cases that
     are more or less surrounded by nearby training cases; everything
     else is extrapolation. In particular, cases that are outside the
     range of the training data require extrapolation. Cases inside large
     "holes" in the training data may also effectively require
     extrapolation. Interpolation can often be done reliably, but
     extrapolation is notoriously unreliable. Hence it is important to
     have sufficient training data to avoid the need for extrapolation.
• Cross-validation and bootstrapping schemes to evaluate generalization errors (and compare implementations)
These schemes are resampling methods: they estimate performance by repeatedly re-sampling the available data.
1) Cross-validation (Resampling without replacement)
− How does this work?
Split the data into k (~10) subsets of equal size.
Train the NN k times; each time:
leave one of the subsets out of the training;
test the NN on the omitted subset.
When k equals the sample size, this is "leave-one-out" cross-validation.
Overall accuracy is the mean of all testing-set accuracies.
− Recommended for small datasets
− Can be used to estimate model error or to compare different NN set-ups
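The splitting step above can be sketched as follows; this is a minimal sketch, and the `train_nn`/`evaluate` names in the usage comment are hypothetical placeholders, not functions from the course material:

```python
import random

def kfold_splits(n_samples, k=10, seed=0):
    """Return k (train_indices, test_indices) pairs for k-fold
    cross-validation: each sample is left out of training exactly
    once (resampling without replacement)."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal-size subsets
    splits = []
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        splits.append((train, test))
    return splits

# Train the NN k times, test on the omitted subset each time, then
# average the k testing-set accuracies (train_nn/evaluate are
# hypothetical placeholders):
# accs = [evaluate(train_nn(tr), te) for tr, te in kfold_splits(len(data))]
```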
2) Jackknife estimation
− Special case of cross-validation
− How does this work?
Split the data into subsets of size M−1 (for M data samples available).
Train the NN on each set; each time, test the NN on the single leave-one-out omitted sample (i.e., each testing set has only one sample).
Overall accuracy is mean of all testing set accuracies
− Recommended for small datasets
− Can be used to estimate model error or to compare different NN set-ups
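As a minimal sketch, the jackknife splits reduce to leave-one-out: M training sets of size M−1, each paired with the single omitted sample.

```python
def jackknife_splits(m):
    """Leave-one-out (jackknife) splits for M data samples: M training
    sets of size M-1, each paired with the single omitted sample."""
    return [([j for j in range(m) if j != i], [i]) for i in range(m)]
```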
[Bootstrap Methods and Permutation Tests, Hesterberg et al., W.H. Freeman and Company, 2003
http://www-stat.stanford.edu/%7Etibs/stat315a.html ]
3) Bootstrapping (Resampling with replacement)
− How does this work?
Select k (from 50 to 2000) subsets of the data, sampled with replacement.
Train the NN k times; each time:
train on one subset;
test on another subset.
Overall accuracy is the mean of all testing-set accuracies.
− Recommended for small datasets
− Expensive to implement. Seems to work better than cross-validation in many cases, but not always… in such cases it is not worth the investment
− Can be used to estimate model error or to compare different NN set-ups
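The resampling-with-replacement step can be sketched as below. Testing each replicate on its "out-of-bag" cases (those never drawn into the training resample) is one common way to realize the "train on one subset, test on another" scheme; that choice is an assumption here, since the slide does not fix how the test subset is formed.

```python
import random

def bootstrap_splits(n_samples, k=200, seed=0):
    """Draw k bootstrap training sets (sampled with replacement) and
    pair each with its out-of-bag test set (the samples never drawn
    into that training resample)."""
    rng = random.Random(seed)
    splits = []
    for _ in range(k):
        train = [rng.randrange(n_samples) for _ in range(n_samples)]
        test = sorted(set(range(n_samples)) - set(train))  # out-of-bag
        splits.append((train, test))
    return splits
```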
Performance Comparison
Which technique is best? Which is more accurate?
• Classifier performance assessment allows one to evaluate how well a classifier does and how it compares with other schemes.
• Useful when combining decisions/outputs from several classifiers/detectors (in data fusion applications).
Given two algorithms A and B, the comparison is set up as a hypothesis test:
• Hypothesis H0:
For a randomly drawn set of fixed size, algorithms A and B have the same error rate.
• Hypothesis H1:
For a randomly drawn set of fixed size, algorithms A and B do not have the same error rate.
• Need to define:
− Type 1 error rate:
Probability of incorrectly rejecting the true null hypothesis.
− Type 2 error rate:
Probability of incorrectly accepting a false null hypothesis.
Applied to this problem
Type 1 error rate:
Probability of incorrectly detecting a difference between classifier performances when no difference exists.
− Significance level α:
α represents how selective (i.e., restrictive) the user wants the decision between H0 and H1 to be; i.e., for α = 0.05, the user is willing to accept the fact that there is a 5% chance of deciding H0 is incorrect (or false) when it is in fact correct (or true).
Thus,
• The larger α is, the more likely the user is to decide the claim (H0) is incorrect when, in fact, it is correct; i.e., the user becomes more selective, as the user rejects more and more claims even though they are correct.
• The smaller α is, the less likely the user is to decide the claim is incorrect when it is, in fact, correct; i.e., the user becomes less selective, as the user will reject fewer claims; however, the user will accept more and more claims which are, in fact, incorrect.
• McNemar’s Test
− Define the following quantities:
n00: number of test cases misclassified by both A and B
n10: number of test cases misclassified by B but not by A
n01: number of test cases misclassified by A but not by B
n11: number of test cases misclassified by neither A nor B
• Total number of test cases
n = n01 + n10 + n11 + n00
Note:
• Under H0, A and B have the same error rates ⇒ n01 = n10
Theoretically, the expected number of errors made by only one of the two algorithms is
E = (n10 + n01) / 2
− McNemar’s test compares the observed number of errors obtained with one of the two algorithms and the expected number.
− Compute
z = (|n01 − n10| − 1)² / (n10 + n01)
− It turns out z is χ²_1 distributed (chi-square with 1 degree of freedom)
• H0 (the hypothesis that algorithms A and B have the same error rate) is rejected with significance level α (i.e., assuming we accept the α% chance of deciding H0 is incorrect when it is, in fact, correct) when
z > χ²_{1,1−α}
• How to read the χ² table:
χ²_{1,0.95} = 3.841
• Example:
Assume we have a problem with 9 classes and 60 test samples.
Results give
Algorithm A gives 48 correct decisions.
Algorithm B gives 45 correct decisions.
Are the two algorithms to be considered as having the same performance?
n00 = 11
n10 = 4
n01 = 1
n11 = 44
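The statistic for this example follows directly from the counts above; a quick check of the formula, using only n01 (errors by A alone) and n10 (errors by B alone):

```python
def mcnemar_z(n01, n10):
    """McNemar statistic with continuity correction; under H0 it is
    chi-square distributed with 1 degree of freedom."""
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

# Slide counts: n01 = 1 (misclassified by A only), n10 = 4 (by B only)
z = mcnemar_z(1, 4)  # (|1 - 4| - 1)^2 / 5 = 0.8
```

Since z = 0.8 < χ²_{1,0.95} = 3.841, H0 is not rejected at the 5% significance level: the two algorithms are considered to have the same performance.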