
Intelligent Control

Module I - Neural Networks, Lecture 7

Adaptive Learning Rate

Laxmidhar Behera

Department of Electrical Engineering, Indian Institute of Technology, Kanpur


Subjects to be covered

Motivation for adaptive learning rate

Lyapunov Stability Theory

Training Algorithm based on Lyapunov Stability Theory

Simulations and discussion

Conclusion


Training of a Feed Forward Network

Figure 1: A feed-forward network (inputs x1, x2; weight layers W; output y)

Here, W ∈ R^M is the weight vector. The training data consists of, say, N patterns (xp, yp), p = 1, 2, ..., N.

Weight update law: W(t + 1) = W(t) − η ∂E/∂W, where η is the learning rate.
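For reference, a minimal sketch (not from the lecture) of this fixed-learning-rate update is given below; the 2-2-1 sigmoid architecture and helper names are illustrative assumptions.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(W1, b1, W2, b2, x):
        h = sigmoid(W1 @ x + b1)           # hidden activations
        y_hat = sigmoid(W2 @ h + b2)       # network output
        return y_hat, h

    def gd_epoch(params, data, eta=0.5):
        """One pattern-by-pattern pass of W <- W - eta * dE/dW."""
        W1, b1, W2, b2 = params
        for x, y in data:
            y_hat, h = forward(W1, b1, W2, b2, x)
            e = y_hat - y                              # from E = 0.5*(y - y_hat)^2
            d_out = e * y_hat * (1.0 - y_hat)          # delta at the output layer
            d_hid = (W2.T @ d_out) * h * (1.0 - h)     # delta at the hidden layer
            W2 -= eta * np.outer(d_out, h); b2 -= eta * d_out
            W1 -= eta * np.outer(d_hid, x); b1 -= eta * d_hid
        return W1, b1, W2, b2

    # usage (XOR-sized example):
    # rng = np.random.default_rng(0)
    # params = (rng.normal(size=(2, 2)), np.zeros(2), rng.normal(size=(1, 2)), np.zeros(1))
    # data = [(np.array([0., 0.]), 0.), (np.array([0., 1.]), 1.),
    #         (np.array([1., 0.]), 1.), (np.array([1., 1.]), 0.)]
    # params = gd_epoch(params, data)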


Motivation for adaptive learning rate

Figure 2: Convergence to the global minimum. [Plot of f(x) versus x over x ∈ [−10, 10], comparing the actual function with the adaptive-learning-rate trajectory started from x0 = −6.7.]

With an adaptive learning rate, one can employ a higher learning rate when the error is far from the global minimum and a smaller learning rate when it is near to it.


Adaptive Learning Rate

The objective is to achieve global convergence for a non-quadratic, non-convex nonlinear function without increasing the computational complexity.

In gradient descent (GD) the learning rate is fixed. If one could use a larger learning rate for a point far away from the global minimum and a smaller learning rate for a point closer to the global minimum, it would be possible to avoid local minima and ensure global convergence. This necessitates an adaptive learning rate.


Lyapunov Stability Theory

Used extensively in control system problems.

If we choose a Lyapunov function candidate V(x(t), t) such that

V(x(t), t) is positive definite

V̇(x(t), t) is negative definite

then the system is asymptotically stable.
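As a standard illustration (not from the slides): for the scalar system ẋ = −x, the candidate V(x) = (1/2)x² is positive definite and V̇(x) = x ẋ = −x² is negative definite, so the origin is asymptotically stable.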

Local Invariant Set Theorem (La Salle): Consider an autonomous system of the form ẋ = f(x) with f continuous, and let V(x) be a scalar function with continuous partial derivatives. Assume that

* for some l > 0, the region Ωl defined by V(x) < l is bounded.


Lyapunov stability theory: contd...

* V̇(x) ≤ 0 for all x in Ωl.

Let R be the set of all points within Ωl where V̇(x) = 0, and let M be the largest invariant set in R. Then, every solution x(t) originating in Ωl tends to M as t → ∞.

The problem lies in choosing a proper Lyapunov function candidate.


Weight update law using Lyapunov based approach

The network output is given by

ŷp = f(W, xp),   p = 1, 2, ..., N    (1)

The usual quadratic cost function is given as:

E = (1/2) Σ_{p=1}^{N} (yp − ŷp)²    (2)

Let us choose a Lyapunov function candidate for the system as below:

V1 = (1/2) ỹᵀỹ    (3)

where ỹ = [y1 − ŷ1, ..., yp − ŷp, ..., yN − ŷN]ᵀ.

LF I Algorithm

The time derivative of the Lyapunov function V1 is given by

V̇1 = −ỹᵀ (∂ŷ/∂W) Ẇ = −ỹᵀ J Ẇ    (4)

where J = ∂ŷ/∂W ∈ R^(N×M) is the Jacobian.

Theorem 1. If an arbitrary initial weight W(0) is updated by

W(t′) = W(0) + ∫₀^t′ Ẇ dt    (5)

where

Ẇ = (‖ỹ‖² / (‖Jᵀỹ‖² + ε)) Jᵀỹ    (6)

and ε is a small positive constant, then ỹ converges to zero under the condition that Ẇ exists along the convergence trajectory.

Proof of LF - I Algorithm

Proof. Substitution of Eq. (6) into Eq. (4) yields

V̇1 = − ‖ỹ‖² ‖Jᵀỹ‖² / (‖Jᵀỹ‖² + ε) ≤ 0    (7)

where V̇1 < 0 for all ỹ ≠ 0. If V̇1 is uniformly continuous and bounded, then according to Barbalat's lemma, as t → ∞, V̇1 → 0 and ỹ → 0.


LF - I Algorithm: contd...

The weight update law above is a batch update law. The instantaneous LF I learning algorithm can be derived as:

Ẇ = (‖ỹ‖² / ‖Jiᵀỹ‖²) Jiᵀỹ    (8)

where ỹ = yp − ŷp ∈ R and Ji = ∂ŷp/∂W ∈ R^(1×M). The difference equation representation of the weight update equation is given by

W(t + 1) = W(t) + µ Ẇ(t)    (9)

Here µ is a constant.
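A minimal sketch (not the lecture's code) of the instantaneous LF-I step of Eqs. (8)-(9) and the adaptive rate of Eq. (11) below, assuming a simple linear-in-the-weights model ŷ = W·φ(x) so that Ji = φ(x); the small eps guard mirrors the ε of the batch law (6).

    import numpy as np

    def lf1_step(W, phi, y, mu=0.5, eps=1e-8):
        """One instantaneous LF-I update for a linear model y_hat = W @ phi."""
        y_hat = W @ phi                      # network output for this pattern
        e = y - y_hat                        # scalar error  y~ = yp - y_hat_p
        Ji = phi                             # dy_hat/dW for the linear model
        g = Ji * e                           # Ji^T y~
        eta_a = mu * e**2 / (g @ g + eps)    # adaptive rate, Eq. (11); eps avoids 0/0
        return W + eta_a * g                 # W(t+1) = W(t) + eta_a * Ji^T y~

    # usage: W = np.zeros(3); W = lf1_step(W, np.array([1.0, 0.5, -2.0]), 1.0)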


Comparison with BP Algorithm

In the gradient descent method we have

∆W = −η ∂E/∂W = η Jiᵀỹ

W(t + 1) = W(t) + η Jiᵀỹ    (10)

The update equation for the LF-I algorithm:

W(t + 1) = W(t) + (µ ‖ỹ‖² / ‖Jiᵀỹ‖²) Jiᵀỹ

Comparing the above two equations, we find that the fixed learning rate η in the BP algorithm is replaced by its adaptive version ηa:

ηa = µ ‖ỹ‖² / ‖Jiᵀỹ‖²    (11)


Adaptive Learning rate of LF-I

[Plot: learning rate (0 to 50) versus number of iterations (4 × no. of epochs), LF-I on XOR.]

Unlike the BP algorithm, the learning rate is not fixed.

The learning rate goes to zero as the error goes to zero.

Convergence of LF-I

The theorem states that global convergence of LF-I is guaranteed provided Ẇ exists along the convergence trajectory. This, in turn, necessitates ‖∂V1/∂W‖ = ‖Jᵀỹ‖ ≠ 0.

‖∂V1/∂W‖ = 0 indicates a local minimum of the error function.

Thus, the theorem says that the global minimum is reached only when local minima are avoided during training.

Since the instantaneous update rule introduces noise, it may be possible to reach the global minimum in some cases; however, global convergence is not guaranteed.


LF II Algorithm

We consider the following Lyapunov function:

V2 = (1/2)(ỹᵀỹ + λ ẆᵀẆ) = V1 + (λ/2) ẆᵀẆ    (12)

where λ is a positive constant. The time derivative of the above equation is given by

V̇2 = −ỹᵀ (∂ŷ/∂W) Ẇ + λ ẆᵀẄ = −ỹᵀ (J − D) Ẇ    (13)

where J = ∂ŷ/∂W ∈ R^(N×m) is the Jacobian matrix and D = (λ/‖ỹ‖²) ỹ Ẅᵀ ∈ R^(N×m).


LF II Algorithm: contd...

Theorem 2. If the update law for the weight vector W follows the dynamics given by the nonlinear differential equation

Ẇ = α(W) Jᵀỹ − λ α(W) Ẅ    (14)

where α(W) = ‖ỹ‖² / (‖Jᵀỹ‖² + ε) is a scalar function of the weight vector W and ε is a small positive constant, then ỹ converges to zero under the condition that (J − D)ᵀỹ is non-zero along the convergence trajectory.


Proof of LF II algorithm

Proof. Ẇ = α(W) Jᵀỹ − λ α(W) Ẅ may be rewritten as

Ẇ = (‖ỹ‖² / (‖Jᵀỹ‖² + ε)) (J − D)ᵀỹ    (15)

Substituting for Ẇ from the above equation into V̇2 = −ỹᵀ(J − D)Ẇ, we get

V̇2 = −(‖ỹ‖² / (‖Jᵀỹ‖² + ε)) ‖(J − D)ᵀỹ‖² ≤ 0    (16)

Since (J − D)ᵀỹ is non-zero, V̇2 < 0 for all ỹ ≠ 0 and V̇2 = 0 iff ỹ = 0. If V̇2 is uniformly continuous and bounded, then according to Barbalat's lemma, as t → ∞, V̇2 → 0 and ỹ → 0.

Proof of LF II algorithm: contd...

The instantaneous weight update equation for the LF II algorithm can finally be expressed in difference equation form as follows:

W(t + 1) = W(t) + µ (‖ỹ‖² / (‖Jpᵀỹ‖² + ε)) (Jp − D)ᵀỹ
         = W(t) + µ (‖ỹ‖² / (‖Jpᵀỹ‖² + ε)) Jpᵀỹ − µ1 Ẅ(t) / (‖Jpᵀỹ‖² + ε)    (17)

where µ1 = µλ and the acceleration Ẅ(t) is computed as

Ẅ(t) = (1/(∆t)²) [W(t) − 2W(t − 1) + W(t − 2)]

with ∆t taken to be one time unit in the simulations.
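A minimal sketch (using the same linear-in-the-weights toy model as in the LF-I sketch, ∆t = 1, and illustrative parameter values, none of which are from the slides) of the instantaneous LF-II step of Eq. (17) with the finite-difference acceleration term:

    import numpy as np

    def lf2_step(W_hist, phi, y, mu=0.5, lam=0.02, eps=1e-8):
        """One LF-II update; W_hist holds (W(t), W(t-1), W(t-2))."""
        W, W1, W2 = W_hist
        y_hat = W @ phi
        e = y - y_hat                          # y~ = yp - y_hat_p
        g = phi * e                            # Jp^T y~  (Jp = phi for this model)
        denom = g @ g + eps                    # ||Jp^T y~||^2 + eps
        W_ddot = W - 2.0 * W1 + W2             # acceleration term, delta_t = 1
        W_new = W + mu * e**2 / denom * g - (mu * lam) * W_ddot / denom
        return (W_new, W, W1)                  # shift history for the next step

    # usage:
    # W0 = np.zeros(3); hist = (W0, W0.copy(), W0.copy())
    # hist = lf2_step(hist, np.array([1.0, -0.3, 2.0]), 0.7)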

Comparison with BP Algorithm

Applying gradient descent to V2 = V1 + (λ/2) ẆᵀẆ,

∆W = −η (∂V2/∂W)ᵀ
   = −η (∂V1/∂W)ᵀ − η [d/dW ((λ/2) ẆᵀẆ)]ᵀ
   = η (∂ŷ/∂W)ᵀ ỹ − η λ Ẅ

Thus, the weight update equation for the gradient descent method may be written as

W(t + 1) = W(t) + η′ Jpᵀỹ − µ′ Ẅ    (18)

where the last term, −µ′ Ẅ, is the acceleration term.

Adaptive learning rate and adaptive acceleration

Comparing the two update laws, the adaptive learning rate in this case is given by

η′a = µ ‖ỹ‖² / (‖Jpᵀỹ‖² + ε)    (19)

and the adaptive acceleration rate is given by

µ′a = λ / (‖Jpᵀỹ‖² + ε)    (20)


Convergence of LF II

The global minimum of V2 is given by

ỹ = 0, Ẇ = 0   (ỹ ∈ R^n, Ẇ ∈ R^m)

The global minimum can be reached provided Ẇ does not vanish along the convergence trajectory.

Analyzing local minima conditions: Ẇ vanishes under the following conditions.

1. First condition: J = D (J, D ∈ R^(n×m)). In the case of neural networks it is very unlikely that each element of J would equal the corresponding element of D, so this possibility can easily be ruled out for a multi-layer perceptron network.


Convergence of LF II: contd...

2. Second condition: Ẇ vanishes whenever (J − D)ᵀỹ = 0. Assuming J ≠ D, rank ρ(J − D) = n ensures global convergence.

3. Third condition: Jᵀỹ = Dᵀỹ = λẄ. Solutions of this equation represent local minima. A solution exists for every vector Ẅ ∈ R^m whenever rank ρ(J) = m.


Convergence of LF II: contd...

For a neural network, n ≤ m ⇒ ρ(J) ≤ n. Hence there are at least m − n directions of Ẅ ∈ R^m for which solutions do not exist, and hence local minima do not occur (a concrete count is given at the end of this slide).

Thus, by increasing the number of hidden layers or hidden neurons (i.e., increasing m), the chances of encountering local minima can be reduced.

Increasing the number of output neurons increases both m and n, as well as n/m. Thus, for MIMO systems there are more local minima (for a fixed number of weights) as compared to single-output systems.
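As a concrete illustration of these counts (not from the slides): a single-output 2-2-1 MLP with biases has m = 9 weights and n = 1 output, so ρ(J) ≤ 1 and the set of Ẅ ∈ R^9 for which Jᵀỹ = λẄ can be solved is at most one-dimensional; adding hidden neurons increases m while n stays at 1, which is why enlarging the hidden layer reduces the chance of meeting the local-minimum condition.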


Avoiding local minima

[Figure: error V1 versus weight W, showing a local minimum and the global minimum. Points C, B, A, D mark the weight at times t − 2, t − 1, t, t + 1, with increments ∆W(t − 1), ∆W(t), ∆W(t + 1) between them.]


Avoiding local minima: contd...

Rewrite the update law for LF-II as

W(t + 1) = W(t) + ∆W(t + 1) = W(t) − η′ ∂V1/∂W(t) − µ′ Ẅ(t)

Consider point B (at time t − 1). The weight update for the interval (t − 1, t], computed at this instant, is ∆W(t) = ∆W1(t − 1) + ∆W2(t − 1), with

∆W1(t − 1) = −η ∂V1/∂W(t − 1) > 0
∆W2(t − 1) = −µ Ẅ(t − 1) = −µ(∆W(t − 1) − ∆W(t − 2)) > 0

It is to be noted that ∆W(t − 1) < ∆W(t − 2), as the velocity is decreasing towards the point of local minimum. Since ∆W(t) > 0, the speed increases.

Avoiding local minima: contd...

Consider point A (at time t). The weight increments are

∆W1(t) = −η ∂V1/∂W(t) = 0
∆W2(t) = −µ Ẅ(t) = −µ(∆W(t) − ∆W(t − 1)) > 0

Since ∆W(t) < ∆W(t − 1), we have ∆W2(t) > 0, and hence

∆W(t + 1) = ∆W1(t) + ∆W2(t) > 0

This helps in avoiding the local minimum.


Avoiding local minima: contd...

Consider point D (at instant t + 1). The weight contributions are

∆W1(t + 1) = −η ∂V1/∂W(t + 1) < 0
∆W2(t + 1) = −µ Ẅ(t + 1) = −µ(∆W(t + 1) − ∆W(t)) > 0

The contribution due to the BP term becomes negative because the slope ∂V1/∂W > 0 on the right-hand side of the local minimum, and ∆W(t + 1) < ∆W(t).

∆W(t + 2) = ∆W1(t + 1) + ∆W2(t + 1) > 0 if ∆W2(t + 1) > |∆W1(t + 1)|

Thus it is possible to avoid local minima by properly choosing µ.
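A tiny numerical illustration of this sign pattern at point D (my own toy example, not from the slides: a 1-D double-well error surface V1(w) = (w² − 1)²/4 − 0.1w with its local minimum near w ≈ −0.93, and hand-picked history values):

    def dV1(w):                              # derivative of V1(w) = (w**2 - 1)**2/4 - 0.1*w
        return w * (w**2 - 1.0) - 0.1

    eta, mu = 0.1, 0.5                       # illustrative step and acceleration gains
    W = [-1.10, -0.95, -0.85]                # W(t-1), W(t), W(t+1): climbing out, slowing down
    dW_t, dW_tp1 = W[1] - W[0], W[2] - W[1]  # 0.15 and 0.10, so dW(t+1) < dW(t)
    dW1 = -eta * dV1(W[2])                   # BP term: negative (slope > 0 right of the local minimum)
    dW2 = -mu * (dW_tp1 - dW_t)              # acceleration term: positive
    print(dW1, dW2, dW1 + dW2)               # net increment stays positive since dW2 > |dW1|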

Simulation results - LF-I vs LF-II: XOR

Figure 3: Performance comparison for XOR (training epochs versus runs, 50 runs; LF I: λ = 0.0, µ = 0.55; LF II: λ = 0.015, µ = 0.65).

Observation: LF II provides a tangible improvement over LF I both in terms of convergence time and training epochs.


LF I vs LF II: 3-bit parity

Figure 4: Performance comparison for 3-bit parity (training epochs versus runs, 50 runs; LF I: λ = 0.0, µ = 0.47; LF II: λ = 0.03, µ = 0.47).

Observation: LF II performs better than LF I both in terms of computation time and training epochs.

LF I vs LF II: 8-3 Encoder

Figure 5: Performance comparison for the 8-3 encoder (training epochs versus runs, 50 runs; LF I: λ = 0.0, µ = 0.46; LF II: λ = 0.01, µ = 0.465).

Observation: LF II takes the minimum number of epochs in most of the runs.

LF I vs LF II: 2D Gabor function

Figure 6: Performance comparison for the 2D Gabor function (RMS training error versus iterations over the training data points, up to 30000; LF I: µ = 0.8, λ = 0.0; LF II: µ = 0.8, λ = 0.6).

Observation: With increasing iterations, the performance of LF II improves as compared to LF I.


Simulation Results - Comparison: contd...

XOR

Algorithm   epochs   time (sec)   parameters
BP          5620     0.0578       η = 0.5
BP          3769     0.0354       η = 0.95
EKF         3512     0.1662       λ = 0.9
LF-I        165      0.0062       µ = 0.55
LF-II       120      0.0038       µ = 0.65, λ = 0.01


Comparison among BP, EKF and LF-II

[Plot: convergence time (seconds) versus run for BP, EKF, and LF-II.]

Observation: LF takes almost the same time for any arbitrary initial condition.


Comparison among BP, EKF and LF: contd...

3-bit Parity

Algorithm   epochs   time (sec)   parameters
BP          12032    0.483        η = 0.5
BP          5941     0.2408       η = 0.95
EKF         2186     0.4718       λ = 0.9
LF-I        1338     0.1176       µ = 0.47
LF-II       738      0.0676       µ = 0.47, λ = 0.03


Comparison among BP, EKF and LF: contd...

8-3 Encoder

Algorithm   epochs   time (sec)   parameters
BP          326      0.044        η = 0.7
BP          255      0.0568       η = 0.9
LF-I        72       0.0582       µ = 0.46
LF-II       42       0.051        µ = 0.465, λ = 0.01


Comparison among BP, EKF and LF: contd...

2D Gabor function

Algorithm   No. of Centers   rms error/run   parameters
BP          40               0.0847241       η1,2 = 0.2
BP          80               0.0314169       η1,2 = 0.2
LF-I        40               0.0192033       µ = 0.8
LF-II       40               0.0186757       µ = 0.8, λ = 0.3


Discussion

Global convergence of Lyapunov-based learning algorithms

Consider the following Lyapunov function candidate:

V2 = µ V1 + (σ/2) ‖∂V1/∂W‖²,   where V1 = (1/2) ỹᵀỹ    (21)

The objective is to select a weight update law Ẇ such that the global minimum (V1 = 0 and ∂V1/∂W = 0) is reached.

The time derivative of the Lyapunov function V2 is given as:

V̇2 = (∂V1/∂W) [µI + σ ∂²V1/∂W∂Wᵀ] Ẇ    (22)


If the weight update law Ẇ is selected as

Ẇ = −[µI + σ ∂²V1/∂W∂Wᵀ]⁻¹ ((∂V1/∂W)ᵀ / ‖∂V1/∂W‖²) (ζ ‖∂V1/∂W‖² + η ‖V1‖²)    (23)

with ζ > 0 and η > 0, then

V̇2 = −ζ ‖∂V1/∂W‖² − η ‖V1‖²    (24)

which is negative definite with respect to V1 and ∂V1/∂W. Thus, V2 will finally converge to its equilibrium point, given by V1 = 0 and ∂V1/∂W = 0.


But the implementation of this weight update algorithm becomes very difficult due to the presence of the Hessian term ∂²V1/∂W∂Wᵀ.

Thus, the above algorithm is of theoretical interest.

The above weight update algorithm is similar to the BP learning algorithm with a fixed learning rate.
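For intuition, a minimal sketch (illustrative assumption: a linear least-squares model, so the Hessian of V1 is simply XᵀX; the small 1e-12 guard is also an assumption) of one Euler step of the update in Eq. (23); the explicit M × M Hessian and the linear solve are what make the scheme impractical for large networks.

    import numpy as np

    def lyapunov_hessian_step(W, X, y, mu=1.0, sigma=0.1, zeta=0.5, eta=0.5, dt=0.1):
        r = X @ W - y                          # residuals (y_hat - y)
        V1 = 0.5 * r @ r
        grad = X.T @ r                         # dV1/dW
        H = X.T @ X                            # d^2 V1 / dW dW^T (M x M Hessian)
        A = mu * np.eye(len(W)) + sigma * H    # matrix to invert: O(M^3) per step
        scale = (zeta * grad @ grad + eta * V1**2) / (grad @ grad + 1e-12)
        W_dot = -np.linalg.solve(A, grad) * scale
        return W + dt * W_dot                  # Euler step of the weight dynamics

    # usage:
    # X = np.random.randn(20, 5); y = np.random.randn(20); W = np.zeros(5)
    # W = lyapunov_hessian_step(W, X, y)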


Conclusion

LF algorithms perform better than both EKF and BP algorithms in terms of speed and accuracy.

LF II avoids local minima to a greater extent as compared to LF I.

It is seen that, by choosing a proper network architecture, it is possible to reach the global minimum.

The LF-I algorithm has an interesting parallel with the conventional BP algorithm, where the fixed learning rate of BP is replaced by an adaptive learning rate.
