
Page 1: CS 621 Artificial Intelligence Lecture 25 – 14/10/05 Prof. Pushpak Bhattacharyya


CS 621 Artificial Intelligence

Lecture 25 – 14/10/05

Prof. Pushpak Bhattacharyya

Training the Feedforward Network; Backpropagation Algorithm

Page 2: CS 621 Artificial Intelligence Lecture 25 – 14/10/05 Prof. Pushpak Bhattacharyya


Multilayer Feedforward Network

- Needed for solving problems that are not linearly separable.

- Hidden layer neurons assist the computation.

Page 3: CS 621 Artificial Intelligence Lecture 25 – 14/10/05 Prof. Pushpak Bhattacharyya


Figure: an input layer, a hidden layer and an output layer of neurons, with forward connections only; no feedback connections.
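As a rough illustration of this architecture, here is a minimal sketch of one forward pass through a single-hidden-layer feedforward network in Python/NumPy. The layer sizes, the random weights, and the use of the sigmoid activation at every layer are illustrative assumptions, not part of the slides.

import numpy as np

def sigmoid(z):
    # characteristic function used later in the lecture: 1 / (1 + e^-z)
    return 1.0 / (1.0 + np.exp(-z))

# illustrative sizes: 3 inputs, 4 hidden neurons, 2 output neurons
rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(4, 3))   # weights from input layer to hidden layer
W_output = rng.normal(size=(2, 4))   # weights from hidden layer to output layer

x = np.array([0.5, -1.0, 0.25])      # one input pattern

h = sigmoid(W_hidden @ x)            # hidden-layer activations (forward connection)
o = sigmoid(W_output @ h)            # output-layer activations; no feedback anywhere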

Page 4: CS 621 Artificial Intelligence Lecture 25 – 14/10/05 Prof. Pushpak Bhattacharyya


Gradient Descent Rule

ΔWji ∝ - δE/δWji, where Wji is the weight on the connection from the feeding neuron i to the fed neuron j.

E = error = ½ Σ(p=1 to P) Σ(m=1 to M) (tm - om)², summed over the P training patterns and the M output neurons.

TOTAL SUM SQUARE ERROR (TSS)
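A small sketch of the TSS computation above, assuming made-up targets and outputs for P = 2 patterns and M = 2 output neurons:

import numpy as np

# targets t and observed outputs o for P = 2 patterns, M = 2 output neurons each
t = np.array([[1.0, 0.0],
              [0.0, 1.0]])
o = np.array([[0.8, 0.1],
              [0.3, 0.6]])

# E = 1/2 * sum over patterns p and output neurons m of (t - o)^2
E = 0.5 * np.sum((t - o) ** 2)
print(E)   # 0.5 * (0.04 + 0.01 + 0.09 + 0.16) = 0.15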

Page 5: CS 621 Artificial Intelligence Lecture 25 – 14/10/05 Prof. Pushpak Bhattacharyya


Gradient Descent For a Single Neuron

Figure: a single neuron with output y; the inputs X0, X1, …, Xn-1, Xn enter with weights W0, W1, …, Wn-1, Wn, where X0 = -1 and W0 is the threshold weight.

Net input = Σ(i=0 to n) WiXi

Page 6: CS 621 Artificial Intelligence Lecture 25 – 14/10/05 Prof. Pushpak Bhattacharyya


y = f(net) is the characteristic function, with f = sigmoid = 1 / (1 + e^(-net)).

df/dnet = f (1 - f)

Figure: the sigmoid y = f(net) plotted against net.
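To make the derivative identity concrete, a minimal check (with arbitrary test points and step size) that f(1 - f) matches a finite-difference estimate of df/dnet:

import numpy as np

def f(net):
    return 1.0 / (1.0 + np.exp(-net))

net = np.array([-2.0, 0.0, 1.5])
analytic = f(net) * (1.0 - f(net))                   # df/dnet = f(1 - f)
eps = 1e-6
numeric = (f(net + eps) - f(net - eps)) / (2 * eps)  # central difference
print(np.max(np.abs(analytic - numeric)))            # ~1e-10 or smaller: the two agree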

Page 7: CS 621 Artificial Intelligence Lecture 25 – 14/10/05 Prof. Pushpak Bhattacharyya


ΔWi ∝ - δE/δWi

E = ½ (t - o)², where t is the target output and o is the observed output.

Page 8: CS 621 Artificial Intelligence Lecture 25 – 14/10/05 Prof. Pushpak Bhattacharyya


W = <Wn, ……, W0> is randomly initialized.

ΔWi ∝ - δE/δWi, i.e. ΔWi = - η δE/δWi, where η is the learning rate, 0 ≤ η ≤ 1.

Page 9: CS 621 Artificial Intelligence Lecture 25 – 14/10/05 Prof. Pushpak Bhattacharyya


ΔWi = - η δE/δWi

δE/δWi = δ(½ (t - o)²)/δWi

= (δE/δo) * (δo/δWi)   ; chain rule

= - (t - o) * (δo/δnet) * (δnet/δWi)

Page 10: CS 621 Artificial Intelligence Lecture 25 – 14/10/05 Prof. Pushpak Bhattacharyya


δo/δnet = δf(net)/δnet

= f′(net)

= f (1 - f)

= o (1 - o)

Page 11: CS 621 Artificial Intelligence Lecture 25 – 14/10/05 Prof. Pushpak Bhattacharyya


δnet/δWi = Xi

since net = Σ(i=0 to n) WiXi

Page 12: CS 621 Artificial Intelligence Lecture 25 – 14/10/05 Prof. Pushpak Bhattacharyya


E = ½ (t - o)²

ΔWi = η (t - o) (1 - o) o Xi

Here the factor (t - o) comes from δE/δo, the factor (1 - o) o from δf/δnet, and the factor Xi from δnet/δWi.
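Putting the pieces together, a minimal sketch of one gradient-descent step for a single sigmoid neuron. The learning rate, inputs, and target are made up, and X0 = -1 carries the threshold weight as in the earlier slide.

import numpy as np

def f(net):
    return 1.0 / (1.0 + np.exp(-net))

eta = 0.5                                    # learning rate (illustrative)
X = np.array([-1.0, 0.2, 0.9])               # X0 = -1, then the actual inputs
W = np.array([0.1, -0.4, 0.3])               # W0 is the threshold weight
t = 1.0                                      # target output

o = f(W @ X)                                 # observed output of the neuron
delta_W = eta * (t - o) * o * (1.0 - o) * X  # ΔWi = η (t - o) o (1 - o) Xi
W = W + delta_W                              # one gradient-descent update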

Page 13: CS 621 Artificial Intelligence Lecture 25 – 14/10/05 Prof. Pushpak Bhattacharyya


E = ½ (t - o)²

ΔWi = η (t - o) (1 - o) o Xi

Observations:

If Xi = 0, then ΔWi = 0.

The larger Xi is, the larger ΔWi is.

BLAME/CREDIT ASSIGNMENT

Page 14: CS 621 Artificial Intelligence Lecture 25 – 14/10/05 Prof. Pushpak Bhattacharyya


The larger the difference (t - o), the larger Δw is.

If (t - o) is positive, so is Δw.

If (t - o) is negative, so is Δw.

Page 15: CS 621 Artificial Intelligence Lecture 25 – 14/10/05 Prof. Pushpak Bhattacharyya


If o is 0 or 1, Δw = 0.

o is 0 or 1 when net = -∞ or +∞.

Δw → 0 because o → 0 or 1. This is called "saturation" or "paralysis" of the network; it happens because of the sigmoid.

Figure: the sigmoid output o plotted against net, flattening out towards 0 and 1.
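A small numeric check of the saturation effect: for large |net| the factor o(1 - o) in ΔWi is nearly zero, so the weight change nearly vanishes regardless of the error (t - o).

import numpy as np

def f(net):
    return 1.0 / (1.0 + np.exp(-net))

for net in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    o = f(net)
    print(net, o, o * (1 - o))   # at net = ±10, o(1 - o) ≈ 4.5e-5, so Δw ≈ 0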

Page 16: CS 621 Artificial Intelligence Lecture 25 – 14/10/05 Prof. Pushpak Bhattacharyya


Solution to network saturation

1. y = k / (1 + e^(-x))

2. y = tanh(x)

Figure: both functions plotted against x, with the asymptote k marked for the first and k, -k marked for the second.
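A minimal sketch of the two alternative characteristic functions listed above; the value k = 2 is arbitrary, and whether the slide scales tanh by k is not recoverable from the transcript.

import numpy as np

k = 2.0

def scaled_sigmoid(x):
    return k / (1.0 + np.exp(-x))     # saturates at 0 and k instead of 0 and 1

def tanh_unit(x):
    return np.tanh(x)                 # saturates at -1 and +1

print(scaled_sigmoid(10.0), tanh_unit(10.0))   # ≈ 2.0 and ≈ 1.0 near saturation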

Page 17: CS 621 Artificial Intelligence Lecture 25 – 14/10/05 Prof. Pushpak Bhattacharyya


Solution to network saturation (Contd)

3. Scale the inputs: reduce the values. This, however, brings in the problem of floating/fixed-point number representation error.

Page 18: CS 621 Artificial Intelligence Lecture 25 – 14/10/05 Prof. Pushpak Bhattacharyya


ΔWi = η (t - o) o (1 - o) Xi

The smaller η is, the smaller ΔW is.

Page 19: CS 621 Artificial Intelligence Lecture 25 – 14/10/05 Prof. Pushpak Bhattacharyya


Start with large η, gradually decrease it.

Figure: the error E plotted against Wi, marking the global minimum and the current operating point.
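One possible way to realise this schedule; dividing the initial η by the iteration count in this particular form is an illustrative assumption, not something prescribed in the slides.

eta0 = 1.0                 # large initial learning rate
tau = 50.0                 # decay constant (illustrative)

def eta(n):
    # learning rate at iteration n: large at the start, gradually smaller
    return eta0 / (1.0 + n / tau)

print(eta(0), eta(100), eta(1000))   # 1.0, ~0.33, ~0.048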

Page 20: CS 621 Artificial Intelligence Lecture 25 – 14/10/05 Prof. Pushpak Bhattacharyya


Gradient descent training is typically slow:

First parameter: η, the learning rate

Second parameter: β, the momentum factor, 0 ≤ β ≤ 1

Page 21: CS 621 Artificial Intelligence Lecture 25 – 14/10/05 Prof. Pushpak Bhattacharyya


Use a part of the previous weight change in the current weight change:

(ΔWi)n = η (t - o) o (1 - o) Xi + β(ΔWi)n-1

Here n indexes the iteration and β is the momentum factor.
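A minimal sketch of this momentum update for a single weight, with made-up values of η, β, and the per-iteration (t, o, Xi):

eta, beta = 0.5, 0.05        # learning rate and momentum factor (illustrative)
W_i = 0.1                    # one weight
prev_delta = 0.0             # (ΔWi)n-1, zero before the first iteration

# made-up per-iteration values of the target t, output o, and input Xi
for t, o, X_i in [(1.0, 0.6, 0.8), (1.0, 0.7, 0.8), (1.0, 0.75, 0.8)]:
    delta = eta * (t - o) * o * (1.0 - o) * X_i + beta * prev_delta
    W_i += delta
    prev_delta = delta       # remembered for the next iteration's momentum term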

Page 22: CS 621 Artificial Intelligence Lecture 25 – 14/10/05 Prof. Pushpak Bhattacharyya


Effect of β

If (ΔWi)n and (ΔWi)n-1 have the same sign, then (ΔWi)n is enhanced.

If (ΔWi)n and (ΔWi)n-1 have opposite signs, then the effective (ΔWi)n is reduced.

Page 23: CS 621 Artificial Intelligence Lecture 25 – 14/10/05 Prof. Pushpak Bhattacharyya


1) Accelerates movement at A.

2) Dampens oscillation near the global minimum.

Figure: the error E plotted against W, marking points P, Q, R, S, the region A, and the operating point near the global minimum.

Page 24: CS 621 Artificial Intelligence Lecture 25 – 14/10/05 Prof. Pushpak Bhattacharyya


(ΔWi)n = η (t - o) o (1 – o) Xi + β(ΔWi )n-1

In this update, the first term, η (t - o) o (1 - o) Xi, is the pure gradient-descent term, and the second, β(ΔWi)n-1, is the momentum term.

What is the relation between η and β?

Page 25: CS 621 Artificial Intelligence Lecture 25 – 14/10/05 Prof. Pushpak Bhattacharyya


η >> β ?

η << β ?

(ΔWi)n = η (t - o) o (1 – o) Xi + β(ΔWi)n-1

Relation between η and β

Page 26: CS 621 Artificial Intelligence Lecture 25 – 14/10/05 Prof. Pushpak Bhattacharyya


If η << β:

(ΔWi)n = β(ΔWi)n-1   (recurrence relation, neglecting the gradient term)

(ΔWi)n = β(ΔWi)n-1 = β[β(ΔWi)n-2] = β²[β(ΔWi)n-3] = … = βⁿ(ΔWi)0
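For example, with β = 0.9, the inherited weight change after 50 iterations is scaled by 0.9⁵⁰ ≈ 0.005, so it has essentially died out.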

Relation between η and β (Contd)

Page 27: CS 621 Artificial Intelligence Lecture 25 – 14/10/05 Prof. Pushpak Bhattacharyya


Empirical practice: β is typically 1/10th of η.

If β is very large compared to η, the effect of the output error, the input, and the neuron characteristics is not felt. Also, ΔW keeps decreasing, since β is a fraction.

Relation between η and β (Contd)