1
Financial Informatics – XVI: Supervised Backpropagation Learning
Khurshid Ahmad, Professor of Computer Science,
Department of Computer Science
Trinity College, Dublin-2, IRELAND
November 19th, 2008.
https://www.cs.tcd.ie/Khurshid.Ahmad/Teaching.html
2
Preamble
Neural Networks 'learn' by adapting in accordance with a
training regimen: Five key algorithms.
ERROR-CORRECTION OR PERFORMANCE LEARNING
HEBBIAN OR COINCIDENCE LEARNING
BOLTZMANN LEARNING (STOCHASTIC NET LEARNING)
COMPETITIVE LEARNING
FILTER LEARNING (GROSSBERG'S NETS)
3
ANN Learning Algorithms
[Figure: error-correction (supervised) learning. A vector describing the environment goes both to a TEACHER, which produces the desired response, and to the LEARNING SYSTEM, which produces the actual response; a summing junction Σ (+ desired response, − actual response) yields the error signal fed back to the learning system.]
4
ANN Learning Algorithms
[Figure: unsupervised learning. The ENVIRONMENT supplies a vector describing the state of the environment directly to the LEARNING SYSTEM; there is no teacher.]
5
ANN Learning Algorithms
[Figure: reinforcement learning. The ENVIRONMENT provides a state-vector input to a CRITIC, which converts primary reinforcement into heuristic reinforcement for the LEARNING SYSTEM; the learning system acts back on the environment through actions.]
6
Back-propagation Algorithm:
Supervised Learning
Backpropagation (BP) is amongst the 'most popular algorithms for ANNs': it has been estimated by Paul Werbos, the person who first worked on the algorithm in the 1970s, that between 40% and 90% of real-world ANN applications use the BP algorithm. Werbos traces the algorithm to the psychologist Sigmund Freud's theory of psychodynamics. Werbos applied the algorithm in political forecasting.
• David Rumelhart, Geoffrey Hinton and others applied the BP algorithm in the 1980s to problems related to supervised learning, particularly pattern recognition.
• The most useful example of the BP algorithm has been in dealing with problems related to prediction and control.
7
Back-propagation Algorithm:
Architecture of a BP system
8
Back-propagation Algorithm
BASIC DEFINITIONS
1. Backpropagation is a procedure for efficiently calculating the derivatives of some output quantity of a non-linear system, with respect to all inputs and parameters of that system, through calculations proceeding backwards from outputs to inputs.
2. Backpropagation is any technique for adapting the weights or parameters of a nonlinear system by somehow using such derivatives or the equivalent.
According to Paul Werbos there is no such thing as a "backpropagation network"; he used an ANN design called a multilayer perceptron.
9
Back-propagation Algorithm
Paul Werbos provided a rule for updating the weights of a multi-layered network undergoing supervised learning. It is this weight adaptation rule which is called backpropagation.
Typically, a fully connected feedforward network is trained using the BP algorithm: activation in such networks travels in a direction from the input to the output layer, and the units in one layer are connected to every other unit in the next layer.
There are two sweeps of the fully connected network: forward sweep and backward sweep.
10
Back-propagation Algorithm
There are two sweeps of the fully connected network: forward sweep and backward sweep.
Forward Sweep: This sweep is similar to any other feedforward ANN – the input stimulus is given to the network, the network computes the weighted sum from all the input units and then passes the sum through a squashing function. The ANN subsequently generates an output. The ANN may have a number of hidden layers, for example, the multi-net perceptron, and the output from each hidden layer becomes the input to the next layer forward.
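A minimal sketch of such a forward sweep (added for illustration; the layer sizes and weight values below are assumptions, and the squashing function is taken to be the logistic sigmoid):

```python
import math

def logistic(v):
    """Squashing function: the logistic sigmoid 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + math.exp(-v))

def forward_sweep(inputs, layers):
    """Propagate `inputs` through `layers` (a list of weight matrices).

    Each unit computes the weighted sum of the previous layer's outputs
    and squashes it; the output of each layer becomes the input to the
    next layer forward.
    """
    activations = inputs
    for weight_matrix in layers:
        activations = [
            logistic(sum(w * a for w, a in zip(unit_weights, activations)))
            for unit_weights in weight_matrix
        ]
    return activations

# Hypothetical 2-2-1 network: two hidden units, then one output unit.
hidden_weights = [[0.3, -0.1], [-0.2, 0.3]]  # assumed values
output_weights = [[0.2, 0.1]]                # assumed values
print(forward_sweep([0.1, 0.9], [hidden_weights, output_weights]))
```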
11
Back-propagation Algorithm
There are two sweeps of the fully connected network: forward sweep and backward sweep.
Backwards Sweep: This sweep is similar to the forward sweep, except that what is 'swept' are the error values. These values essentially are the differences between the actual output and a desired output:
$e_j = (d_j - o_j)$
The ANN may have a number of hidden layers, for example, the multi-net perceptron, and the output from each hidden layer becomes the input to the next layer forward. In the backward sweep the output unit sends errors back to the first proximate hidden layer, which in turn passes them on to the next hidden layer. No error signal is sent to the input units.
12
Back-propagation Algorithm
For each input vector associate a target output vector
while not STOP
    STOP = TRUE
    for each input vector
        • perform a forward sweep to find the actual output
        • obtain an error vector by comparing the actual and target output
        • if the actual output is not within tolerance, set STOP = FALSE
        • use the backward sweep to determine weight changes
        • update weights
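A minimal Python sketch of this training loop for a network with one hidden layer of logistic units (the network shape, learning rate, tolerance and random initialisation are assumptions; biases are omitted to stay close to the pseudocode above):

```python
import math
import random

ETA = 0.8         # learning rate (assumed; matches the worked example below)
TOLERANCE = 0.05  # assumed stopping tolerance

def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))

def train(patterns, n_in, n_hidden):
    """Train an n_in - n_hidden - 1 logistic network with the loop above."""
    w_hid = [[random.uniform(-0.5, 0.5) for _ in range(n_in)]
             for _ in range(n_hidden)]
    w_out = [random.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    stop = False
    while not stop:
        stop = True
        for x, d in patterns:
            # Forward sweep: find the actual output.
            h = [logistic(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hid]
            y = logistic(sum(w * hi for w, hi in zip(w_out, h)))
            # Error vector (scalar here) between target and actual output.
            e = d - y
            if abs(e) > TOLERANCE:
                stop = False
            # Backward sweep: local gradients for logistic units.
            delta_out = e * y * (1.0 - y)
            delta_hid = [hi * (1.0 - hi) * delta_out * w
                         for hi, w in zip(h, w_out)]
            # Update weights.
            w_out = [w + ETA * delta_out * hi for w, hi in zip(w_out, h)]
            w_hid = [[w + ETA * dh * xi for w, xi in zip(ws, x)]
                     for ws, dh in zip(w_hid, delta_hid)]
    return w_hid, w_out

# Single training pattern from the example on the next slides:
# input [0.1, 0.9], target output 0.9.
print(train([([0.1, 0.9], 0.9)], n_in=2, n_hidden=2))
```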
13
Back-propagation Algorithm
Example: Perform a complete forward and backward sweep of a 2-2-1 (2 input units, 2 hidden layer units and 1 output unit) with the following architecture. The target output d = 0.9. The input is [0.1 0.9].
[Figure: network diagram with numbered units and the initial connection weights (values such as −0.2, 0.3, 0.3, −0.1, 0.2, 0.1 marked on the connections).]
14
Back-propagation Algorithm
Example of a forward pass and a backward pass through a 2-2-2-1 feedforward network. Inputs, outputs and errors are shown in boxes.
[Figure: forward pass – the net input to and output from each unit (the first hidden layer emits 0.993 and 0.500, the second 0.122 and 0.728, and the output unit receives −0.906 and emits 0.288); backward pass – the error values propagated back (0.125 at the output unit; 0.040, 0.025, −0.007 and −0.001 at the hidden units).]
Target value is 0.9, so the error for the output unit is:
(0.900 – 0.288) x 0.288 x (1 – 0.288) = 0.125
15
Back-propagation Algorithm
New weights calculated following the errors derived above:
-2 + (0.8 x 0.125) x 1 = -1.900
1.073 = 1 + (0.8 x 0.125) x 0.728
3 + (0.8 x 0.04) x 1 = 3.032
-1.98 = -2 + (0.8 x 0.025) x 1
3 + (0.8 x 0.125) x 0.122 = 3.012
-2 + (0.8 x 0.04) x 0.5 = -1.984
2.019 = 2 + (0.8 x 0.025) x 0.993
2 + (0.8 x -0.007) x 1 = 1.994
1.999 = 2 + (0.8 x -0.001) x 1
2.999 = 3 + (0.8 x -0.001) x 0.9
-2 + (0.8 x -0.007) x 0.1 = -2.001
-2.005 = -2 + (0.8 x -0.007) x 0.9
3 + (0.8 x -0.001) x 0.1 = 2.999
-3.968 = -4 + (0.8 x 0.04) x 0.993
2 + (0.8 x 0.025) x 0.5 = 2.010
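The arithmetic above can be spot-checked in a few lines (a sketch; the learning rate 0.8 and the activations 0.728 and 0.122 are taken from the worked example):

```python
# Output-unit error (local gradient) from the previous slide:
# (d - y) * y * (1 - y) for a logistic output unit.
d, y = 0.900, 0.288
delta = (d - y) * y * (1 - y)
print(round(delta, 3))                      # 0.125

# The updates above follow w_new = w_old + (eta * delta) * input_activation:
eta = 0.8
print(round(-2 + (eta * delta) * 1, 3))     # -1.9 (bias-like input of 1)
print(round(1 + (eta * delta) * 0.728, 3))  # 1.073
print(round(3 + (eta * delta) * 0.122, 3))  # 3.012
```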
16
Back-propagation Algorithm
A derivation of the BP algorithm
The error signal at the output of neuron j at the nth training cycle is given as:
$e_j(n) = d_j(n) - y_j(n)$
The instantaneous value of the error energy for neuron j is $\frac{1}{2} e_j^2(n)$.
The total error energy E(n) can be computed by summing up the instantaneous energy over all the neurons in the output layer:
$E(n) = \frac{1}{2} \sum_{j \in C} e_j^2(n)$
where the set C includes all the output layer neurons.
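These definitions map directly into a few lines of Python (a minimal sketch; the example error vector is invented for illustration):

```python
def error_energy(errors):
    """Instantaneous total error energy E(n) = 1/2 * sum of e_j(n)^2
    over the output-layer neurons (the set C)."""
    return 0.5 * sum(e * e for e in errors)

# Hypothetical output-layer error signals e_j(n) = d_j(n) - y_j(n):
print(error_energy([0.612, -0.1]))  # 0.192272
```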
17
Back-propagation Algorithm
Derivative or differential coefficient
For a function f(x) at the argument x, the derivative is the limit of the differential coefficient
$\frac{f(x + \Delta x) - f(x)}{\Delta x}$
as $\Delta x \to 0$.
[Figure: the curve y = f(x) with points P = (x, y) and Q = (x + Δx, y + Δy); Q approaches P as Δx → 0.]
$\frac{dy}{dx} = \lim_{\Delta x \to 0} \frac{\Delta y}{\Delta x}$
18
Back-propagation Algorithm
Derivative or differential coefficient
Typically defined for a function of a single variable: if the left and right hand limits exist and are equal, it is the gradient of the curve at x, and is the limit of the gradient of the chord adjoining the points (x, f(x)) and (x + Δx, f(x + Δx)). The function of x defined as this limit for each argument x is the first derivative of y = f(x).
19
Back-propagation Algorithm
Partial derivative or partial differential coefficient
The derivative of a function of two or more variables with respect to one of these variables, the others being regarded as constant; written as: $\frac{\partial f}{\partial x}$
20
Back-propagation Algorithm
Total Derivative
The derivative of a function of two or more variables with regard to a single parameter in terms of which these variables are expressed:
if z = f(x, y) with parametric equations x = U(t), y = V(t),
then under appropriate conditions the total derivative is:
$\frac{dz}{dt} = \frac{\partial z}{\partial x} \frac{dx}{dt} + \frac{\partial z}{\partial y} \frac{dy}{dt}$
21
Back-propagation Algorithm
Total or Exact Differential
The differential of a function of two or more variables with regard to a single parameter in terms of which these variables are expressed, equal to the sum of the products of each partial derivative of the function with the corresponding increment. If z = f(x, y), x = U(t), y = V(t), then under appropriate conditions, the total differential is:
$dz = \frac{\partial z}{\partial x} dx + \frac{\partial z}{\partial y} dy$
22
Back-propagation Algorithm
Chain rule of calculus
A theorem that may be used in the differentiation of a function of a function, where y is a differentiable function of t, and t is a differentiable function of x. This enables a function y = f(x) to be differentiated by finding a suitable intermediate function t such that y is a differentiable function of t and t is a differentiable function of x:
$\frac{dy}{dx} = \frac{dy}{dt} \cdot \frac{dt}{dx}$
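As a one-line worked example of the rule (added for illustration, not from the slides), take y = sin t with t = x²:

```latex
% Worked example (invented): y = \sin t with t = x^2
\[
  \frac{dy}{dx} \;=\; \frac{dy}{dt}\cdot\frac{dt}{dx}
                \;=\; \cos(t)\cdot 2x
                \;=\; 2x\cos\!\left(x^{2}\right)
\]
```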
23
Back-propagation Algorithm
Chain rule of calculus
Similarly for partial differentiation, where f is a function of u and u is a function of x:
$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial u} \cdot \frac{\partial u}{\partial x}$
24
Back-propagation Algorithm
Chain rule of calculus
Now consider the error signal at the output of a neuron j at iteration n, i.e. the presentation of the nth training example:
$e_j(n) = d_j(n) - y_j(n)$
where $d_j$ is the desired output and $y_j$ is the actual output, and the total error energy over all the neurons in the output layer is:
$E(n) = \frac{1}{2} \sum_{j \in C} e_j^2(n)$
where C is the set of all the neurons in the output layer.
25
Back-propagation Algorithm
Chain rule of calculus
If N is the total number of patterns, the averaged squared error energy is:
$E_{av} = \frac{1}{N} \sum_{n=1}^{N} E(n)$
Note that $e_j$ is a function of $y_j$ and $w_{ij}$ (the weights of connections between neurons in two adjacent layers 'i' and 'j').
26
Back-propagation Algorithm
A derivation of the BP algorithm
The total error energy E(n) can be computed by summing up the instantaneous energy over all the neurons in the output layer:
$E(n) = \frac{1}{2} \sum_{j \in C} e_j^2(n)$
where the set C includes all the neurons in the output layer.
$E_{av} = \frac{1}{N} \sum_{n=1}^{N} E(n)$
The instantaneous error energy E(n), and therefore the average energy $E_{av}$, is a function of the free parameters, including synaptic weights and bias levels.
27
Back-propagation Algorithm
The originators of the BP algorithm suggest that
$\Delta w_{ij}(n) = -\eta \frac{\partial E}{\partial w_{ij}}$
where η is the learning rate parameter of the BP algorithm. The minus sign indicates the use of gradient descent in the weights: seeking a direction for weight change that reduces the value of E(n).
$\therefore \Delta w_{ij}(n) = \eta\, \delta_j(n)\, y_i(n)$
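As a sketch in code, one gradient-descent step per weight looks like this (the learning rate and gradient value are invented):

```python
ETA = 0.8  # learning rate (assumed value)

def update_weight(w, dE_dw):
    """One gradient-descent step: move against the gradient of E."""
    return w - ETA * dE_dw

# Invented weight and gradient: a negative gradient increases the weight.
print(update_weight(0.3, -0.05))  # 0.34
```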
28
Back-propagation Algorithm
A derivation of the BP algorithm
E(n) is a function of $e_j(n)$; $e_j(n)$ is a function of $y_j(n)$; $y_j(n)$ is a function of $v_j(n)$; $v_j(n)$ is a function of $w_{ij}(n)$.
E(n) is a function of a function of a function of a function of $w_{ji}(n)$.
29
Back-propagation Algorithm
A derivation of the BP algorithm
The back-propagation algorithm trains a multilayer perceptron by propagating back some measure of responsibility to a hidden (non-output) unit.
Back-propagation:
• Is a local rule for synaptic adjustment;
• Takes into account the position of a neuron in a network to indicate how a neuron's weights are to change
30
Back-propagation Algorithm
A derivation of the BP algorithm
Layers in a back-propagating multi-layer perceptron
1. First layer – comprises fixed input units;
2. Second (and possibly several subsequent) layers – comprise trainable 'hidden' units carrying an internal representation;
3. Last layer – comprises the trainable output units of the multi-layer perceptron.
31
Back-propagation Algorithm
A derivation of the BP algorithm
Modern back-propagation algorithms are based on a formalism for propagating back the changes in the error energy E, with respect to all the weights $w_{ij}$ from a unit (i) to its inputs (j).
More precisely, what is being computed during the backwards propagation is the rate of change of the error energy E with respect to the network's weights. This computation is also called the computation of the gradient:
$\frac{\partial E}{\partial w_{ij}}$
32
Back-propagation Algorithm
A derivation of the BP algorithm
More precisely, what is being computed during the backwards propagation is the rate of change of the error energy E with respect to the network's weights. This computation is also called the computation of the gradient:
$\frac{\partial E}{\partial w_{ij}} = \frac{\partial \left(\sum_j e_j^2 / 2\right)}{\partial w_{ij}} = \frac{\partial \left(\sum_j (d_j - y_j)^2 / 2\right)}{\partial w_{ij}} = \frac{1}{2} \cdot 2 \cdot \sum_j (d_j - y_j) \cdot \frac{\partial (d_j - y_j)}{\partial w_{ij}}$
33
Back-propagation Algorithm
A derivation of the BP algorithm
Continuing the computation of the gradient, and noting that the desired outputs $d_j$ do not depend on the weights (so $\partial d_j / \partial w_{ij} = 0$):
$\frac{\partial E}{\partial w_{ij}} = \sum_j e_j \frac{\partial (d_j - y_j)}{\partial w_{ij}} = \sum_j e_j \left(0 - \frac{\partial y_j}{\partial w_{ij}}\right) = -\sum_j e_j \frac{\partial y_j}{\partial w_{ij}}$
34
Back-propagation Algorithm
A derivation of the BP algorithm
The back-propagation learning rule is formulated to change the connection weights $w_{ij}$ so as to reduce the error energy E by gradient descent:
$\Delta w_{ij} = w_{ij}^{new} - w_{ij}^{old} = -\eta \frac{\partial E}{\partial w_{ij}} = \eta \sum_j e_j \frac{\partial y_j}{\partial w_{ij}} = \eta \sum_j (d_j - y_j) \frac{\partial y_j}{\partial w_{ij}}$
35
Back-propagation Algorithm
A derivation of the BP algorithm
The error energy E is a function of the error e, the output y, the weighted sum of all the inputs v, and of the weights $w_{ij}$:
$E \equiv E(e_j, y_j, v_j, w_{ji})$
According to the chain rule, then:
$\frac{\partial E(n)}{\partial w_{ji}} = \frac{\partial E(n)}{\partial e_j} \cdot \frac{\partial e_j}{\partial w_{ji}}$
36
Back-propagation Algorithm
A derivation of the BP algorithm
Further applications of the chain rule suggest that:
$\frac{\partial E(n)}{\partial w_{ji}} = \frac{\partial E(n)}{\partial e_j} \cdot \frac{\partial e_j}{\partial y_j} \cdot \frac{\partial y_j}{\partial w_{ji}}$
$\frac{\partial E(n)}{\partial w_{ji}} = \frac{\partial E(n)}{\partial e_j} \cdot \frac{\partial e_j}{\partial y_j} \cdot \frac{\partial y_j}{\partial v_j} \cdot \frac{\partial v_j}{\partial w_{ji}}$
37
Back-propagation Algorithm
A derivation of the BP algorithm
Each of the partial derivatives can be simplified as:
$\frac{\partial E(n)}{\partial e_j} = \frac{\partial}{\partial e_j} \left(\frac{1}{2} \sum_{j \in C} e_j^2(n)\right) = e_j(n)$
$\frac{\partial e_j}{\partial y_j} = \frac{\partial (d_j - y_j)}{\partial y_j} = -1$
$\frac{\partial y_j}{\partial v_j} = \frac{\partial \varphi_j(v_j)}{\partial v_j} = \varphi'_j(v_j(n))$
$\frac{\partial v_j}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}} \left(\sum_i w_{ij}\, y_i(n)\right) = y_i(n)$
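Multiplying the four factors gives $\partial E(n)/\partial w_{ji} = -e_j(n)\,\varphi'_j(v_j(n))\,y_i(n)$, as the next slide states. A quick numerical cross-check of that product against a finite difference, for a single logistic unit (all values invented):

```python
import math

def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))

def energy(w, y_in, d):
    """E = 1/2 e^2 for one logistic unit: v = w * y_in, y = phi(v), e = d - y."""
    y = logistic(w * y_in)
    return 0.5 * (d - y) ** 2

w, y_in, d = 0.4, 0.7, 0.9               # invented weight, input activation, target
y = logistic(w * y_in)
e = d - y
analytic = -e * (y * (1.0 - y)) * y_in   # -e_j * phi'(v_j) * y_i
h = 1e-6
numeric = (energy(w + h, y_in, d) - energy(w - h, y_in, d)) / (2 * h)
print(analytic, numeric)                 # the two agree very closely
```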
38
Back-propagation Algorithm
A derivation of the BP algorithm
The rate of change of the error energy E with respect to changes in the synaptic weights is:
$\frac{\partial E(n)}{\partial w_{ji}} = -e_j(n)\, \varphi'_j(v_j(n))\, y_i(n)$
39
Back-propagation Algorithm
A derivation of the BP algorithm
The so-called delta rule suggests that
$\Delta w_{ji}(n) = -\eta \frac{\partial E(n)}{\partial w_{ji}}$  or  $\Delta w_{ji}(n) = \eta\, \delta_j(n)\, y_i(n)$
δ is called the local gradient.
40
Back-propagation Algorithm
A derivation of the BP algorithm
The local gradient δ is given by
$\delta_j(n) = -\frac{\partial E(n)}{\partial v_j(n)}$  or  $\delta_j(n) = e_j(n)\, \varphi'_j(v_j(n))$
41
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the output neuron j:
$\Delta w_{ji}(n) = \eta\, e_j(n)\, \varphi'_j(v_j(n))\, y_i(n)$
The weight adjustment requires the computation of the error signal.
42
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the output neuron j: the so-called delta rule suggests that
$\Delta w_{ji}(n) = \eta\, e_j(n)\, \varphi'_j(v_j(n))\, y_i(n)$
Logistic function:
$\varphi_j(v_j(n)) = \frac{1}{1 + \exp(-v_j(n))}$
Derivative of $\varphi_j(v_j(n))$ with respect to $v_j(n)$:
$\varphi'_j(v_j(n)) = \frac{\exp(-v_j(n))}{\left(1 + \exp(-v_j(n))\right)^2} = \frac{1}{1 + \exp(-v_j(n))} \left(1 - \frac{1}{1 + \exp(-v_j(n))}\right)$
43
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the output neuron j: the so-called delta rule suggests that
$\Delta w_{ji}(n) = \eta\, e_j(n)\, \varphi'_j(v_j(n))\, y_i(n)$  and  $\varphi'_j(v_j(n)) = y_j(n)\left(1 - y_j(n)\right)$
$\therefore \Delta w_{ji}(n) = \eta\, e_j(n)\, y_j(n)\left(1 - y_j(n)\right) y_i(n)$
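The identity $\varphi'(v) = \varphi(v)(1 - \varphi(v))$ used here is easy to verify numerically (a small sketch):

```python
import math

def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))

v, h = 0.5, 1e-6
numeric = (logistic(v + h) - logistic(v - h)) / (2 * h)
identity = logistic(v) * (1.0 - logistic(v))
print(numeric, identity)  # both approximately 0.235004
```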
44
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the hidden neuron (j) – the delta rule suggests that
$\Delta w_{ji}(n) = \eta\, e_j(n)\, \varphi'_j(v_j(n))\, y_i(n)$
The weight adjustment requires the computation of the error signal, but the situation is complicated in that we do not have a (set of) desired output(s) for the hidden neurons, and consequently we will have difficulty in computing the error signal $e_j$ for the hidden neuron j.
45
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the hidden neurons (j):
Recall that the local gradient of the hidden neuron j is given as $\delta_j$ and that $y_j$ is the output and equals $\varphi_j(v_j(n))$:
$\delta_j(n) = -\frac{\partial E(n)}{\partial v_j(n)} = -\frac{\partial E(n)}{\partial y_j(n)} \cdot \frac{\partial y_j(n)}{\partial v_j(n)} = -\frac{\partial E(n)}{\partial y_j(n)}\, \varphi'_j(v_j(n))$
We have used the chain rule of calculus here.
46
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the hidden neurons (j): The 'error' energy related to the hidden neurons is given as
$E(n) = \frac{1}{2} \sum_k e_k^2(n)$
The rate of change of the error (energy) with respect to the input during the backward pass, $y_j(n)$, is given as:
$\frac{\partial E(n)}{\partial y_j(n)} = \frac{\partial}{\partial y_j(n)} \left(\frac{1}{2} \sum_k e_k^2(n)\right) = \sum_k e_k(n)\, \frac{\partial e_k(n)}{\partial y_j(n)}$
47
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the hidden neurons (j): The rate of change of the error (energy) with respect to the input during the backward pass is given as:
$\frac{\partial E(n)}{\partial y_j(n)} = \sum_k e_k(n)\, \frac{\partial e_k(n)}{\partial v_k(n)} \cdot \frac{\partial v_k(n)}{\partial y_j(n)}$
48
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the hidden neurons (k) – the delta rule suggests that
$\Delta w_{ki}(n) = \eta\, e_k(n)\, \varphi'_k(v_k(n))\, y_j(n)$
The error signal for a hidden neuron has, thus, to be computed recursively in terms of the error signals of ALL the neurons to which the hidden neuron j is directly connected. Therefore, the back-propagation algorithm becomes complicated.
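In code the recursion is compact: a hidden neuron's local gradient is its own $\varphi'(v_j)$ times the weighted sum of the local gradients of all the neurons k it feeds into (a sketch with invented values, assuming logistic units):

```python
def hidden_delta(y_j, deltas_k, w_kj):
    """delta_j = phi'(v_j) * sum_k delta_k * w_kj, using
    phi'(v_j) = y_j * (1 - y_j) for logistic units. deltas_k are the
    local gradients of ALL the neurons k fed by hidden neuron j."""
    return y_j * (1.0 - y_j) * sum(d * w for d, w in zip(deltas_k, w_kj))

# Invented values: output of hidden neuron j, local gradients of two
# downstream neurons, and the weights from j to those neurons.
print(hidden_delta(0.728, deltas_k=[0.125, -0.04], w_kj=[3.0, -2.0]))
```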
49
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the hidden neurons (j)
The local gradient δ is given by
$\delta_j(n) = -\frac{\partial E(n)}{\partial v_j(n)}$  or  $\delta_j(n) = e_j(n)\, \varphi'_j(v_j(n))$
50
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the hidden neurons (j)
The local gradient δ is given by
$\delta_j(n) = -\frac{\partial E(n)}{\partial v_j(n)}$
which can be redefined as
$\delta_j(n) = -\frac{\partial E(n)}{\partial y_j} \cdot \frac{\partial y_j}{\partial v_j} = -\frac{\partial E(n)}{\partial y_j}\, \varphi'_j(v_j(n))$
51
Back-propagation Algorithm
A derivation of the BP algorithm
The case of the hidden neurons (j)
The error energy for the hidden neuron can be given as
$E(n) = \frac{1}{2} \sum_{k \in C} e_k^2(n)$
Note that we have used the index k instead of the index j in order to avoid confusion.