Notes on Backpropagation

Peter Sadowski
Department of Computer Science
University of California Irvine
Irvine, CA 92697
[email protected]

Abstract

This document derives backpropagation for some common error functions and describes some other tricks. I wrote this up because I kept having to reassure myself of the algebra.

1 Cross Entropy Error with Logistic Activation

Often we have multiple probabilistic output units that do not represent a simple classification task and are therefore not suited to a softmax activation function; an autoencoder network on binary data is one example. The sum-of-squares error is ill-suited because we still desire a probabilistic interpretation of the output. Instead, we can generalize the two-class case by summing the cross entropy errors of multiple output units, where each output is itself a two-class problem. This is simply the expectation of the log-likelihood of independent targets.

The cross entropy error for a single example is

E = -\sum_{i=1}^{n_{out}} \big( t_i \log(o_i) + (1 - t_i) \log(1 - o_i) \big)    (1)

where t is the target and o is the output, each indexed by i. Our activation function is the logistic

o_i = \frac{1}{1 + e^{-x_i}}    (2)

x_i = \sum_{j} o_j w_{ji}    (3)
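As a concrete illustration (my own sketch, not code from the notes), here are equations (1)-(3) in NumPy, for a hypothetical layer of 4 hidden units feeding 3 logistic output units; all sizes and values are made up for the example:

```python
import numpy as np

def logistic(x):
    # Logistic activation, equation (2): o_i = 1 / (1 + exp(-x_i))
    return 1.0 / (1.0 + np.exp(-x))

def cross_entropy(o, t):
    # Summed cross entropy error over the output units, equation (1)
    return -np.sum(t * np.log(o) + (1.0 - t) * np.log(1.0 - o))

# Hypothetical example: 4 hidden units, 3 output units
rng = np.random.default_rng(0)
o_hidden = logistic(rng.normal(size=4))   # stand-in hidden activations o_j
W = rng.normal(size=(4, 3))               # W[j, i] = w_ji
t = np.array([1.0, 0.0, 1.0])             # independent binary targets t_i

x = o_hidden @ W                          # net input, equation (3)
o = logistic(x)                           # output activations, equation (2)
print(cross_entropy(o, t))                # error E, equation (1)
```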

Our first goal is to compute the error gradient for the weights in the final layer, \partial E / \partial w_{ji}. We begin by observing

\frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial o_i} \frac{\partial o_i}{\partial x_i} \frac{\partial x_i}{\partial w_{ji}}    (4)

and proceed by analyzing the partial derivatives for specific data points,

\frac{\partial E}{\partial o_i} = -\frac{t_i}{o_i} + \frac{1 - t_i}{1 - o_i}    (5)

= \frac{o_i - t_i}{o_i (1 - o_i)}    (6)

\frac{\partial o_i}{\partial x_i} = o_i (1 - o_i)    (7)

\frac{\partial x_i}{\partial w_{ji}} = o_j    (8)


where o_j is the activation of the jth node in the hidden layer. Combining things back together,

\frac{\partial E}{\partial x_i} = o_i - t_i    (9)

and

\frac{\partial E}{\partial w_{ji}} = (o_i - t_i)\, o_j    (10)
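Equations (9) and (10) translate directly into code. Continuing the hypothetical NumPy sketch above (my illustration, not code from the notes):

```python
delta_out = o - t                        # dE/dx_i = o_i - t_i, equation (9)
grad_W = np.outer(o_hidden, delta_out)   # grad_W[j, i] = (o_i - t_i) o_j, equation (10)
```

A finite-difference check of grad_W against cross_entropy is a quick way to confirm the algebra.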

This gives us the gradient for the weights in the last layer of the network. We now need to calculate the error gradient for the weights of the lower layers. Here it is useful to calculate the quantity \partial E / \partial x_j, where j indexes the units in the second layer down.

\frac{\partial E}{\partial x_j} = \sum_{i=1}^{n_{out}} \frac{\partial E}{\partial x_i} \frac{\partial x_i}{\partial o_j} \frac{\partial o_j}{\partial x_j}    (11)

= \sum_{i=1}^{n_{out}} (o_i - t_i)(w_{ji})(o_j(1 - o_j))    (12)

Then a weight w_{kj} connecting units in the second and third layers down has gradient

\frac{\partial E}{\partial w_{kj}} = \frac{\partial E}{\partial x_j} \frac{\partial x_j}{\partial w_{kj}}    (13)

= \sum_{i=1}^{n_{out}} (o_i - t_i)(w_{ji})(o_j(1 - o_j))(o_k)    (14)

In conclusion, to compute \partial E / \partial w_{kj} for a general multilayer network, we simply need to compute \partial E / \partial x_j recursively, then multiply by \partial x_j / \partial w_{kj} = o_k.
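The recursion can be written compactly in code. The following sketch (mine, reusing the hypothetical logistic() helper from the earlier example) computes both weight gradients for a network with one hidden layer, i.e. two weight matrices:

```python
def backprop_two_layer(o_in, W1, W2, t):
    # Forward pass: o_in are the input activations o_k,
    # W1[k, j] = w_kj and W2[j, i] = w_ji (illustrative layout).
    o_hidden = logistic(o_in @ W1)        # hidden activations o_j
    o_out = logistic(o_hidden @ W2)       # output activations o_i

    # Backward pass through the recursion derived above.
    delta_out = o_out - t                                          # dE/dx_i, eq. (9)
    delta_hidden = (delta_out @ W2.T) * o_hidden * (1 - o_hidden)  # dE/dx_j, eq. (12)

    grad_W2 = np.outer(o_hidden, delta_out)   # dE/dw_ji = (o_i - t_i) o_j, eq. (10)
    grad_W1 = np.outer(o_in, delta_hidden)    # dE/dw_kj = dE/dx_j * o_k, eq. (14)
    return grad_W1, grad_W2
```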

2 Classification with Softmax Transfer and Cross Entropy Error

For classification problems with more than 2 classes, we need a way of assigning probabilities to each class. The softmax activation function allows us to do this while still using the cross entropy error and having a nice gradient to descend. (Note that this cross entropy error function is modified for multi-class output.) It turns out to be the same gradient as for the summed cross entropy case. The softmax activation of the ith output unit is

f_i(w, x) = \frac{e^{x_i}}{\sum_{k}^{n_{class}} e^{x_k}}    (15)

and the cross entropy error function for multi-class output is

E = -\sum_{i}^{n_{class}} t_i \log(o_i)    (16)
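For concreteness, here is a sketch of mine (not from the notes) of equations (15) and (16) in NumPy; subtracting the maximum before exponentiating is a standard stability trick that does not change the value of (15):

```python
def softmax(x):
    # Softmax activation, equation (15)
    e = np.exp(x - np.max(x))   # max subtraction for numerical stability
    return e / np.sum(e)

def multiclass_cross_entropy(o, t):
    # Multi-class cross entropy error, equation (16): E = -sum_i t_i log(o_i)
    return -np.sum(t * np.log(o))
```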

So

\frac{\partial E}{\partial o_i} = -\frac{t_i}{o_i}    (17)

\frac{\partial o_i}{\partial x_k} = \begin{cases} o_i(1 - o_i) & i = k \\ -o_i o_k & i \neq k \end{cases}    (18)


Combining (17) and (18),

\frac{\partial E}{\partial x_i} = \sum_{k}^{n_{class}} \frac{\partial E}{\partial o_k} \frac{\partial o_k}{\partial x_i}    (19)

= \frac{\partial E}{\partial o_i} \frac{\partial o_i}{\partial x_i} + \sum_{k \neq i} \frac{\partial E}{\partial o_k} \frac{\partial o_k}{\partial x_i}    (20)

= -t_i(1 - o_i) + \sum_{k \neq i} t_k o_i    (21)

= -t_i + o_i \sum_{k} t_k    (22)

= o_i - t_i    (23)

where the last step uses the fact that the targets of a classification problem sum to one, \sum_k t_k = 1.

The gradient for weights in the top layer is thus

\frac{\partial E}{\partial w_{ji}} = \sum_{i} \frac{\partial E}{\partial x_i} \frac{\partial x_i}{\partial w_{ji}}    (25)

= (o_i - t_i)\, o_j    (26)

and for units in the second lowest layer indexed by j,

\frac{\partial E}{\partial x_j} = \sum_{i}^{n_{class}} \frac{\partial E}{\partial x_i} \frac{\partial x_i}{\partial o_j} \frac{\partial o_j}{\partial x_j}    (27)

= \sum_{i}^{n_{class}} (o_i - t_i)(w_{ji})(o_j(1 - o_j))    (28)

Notice that this gradient has the same formula as in the summed cross entropy case, but it is different because the different activation function in the last layer means that o takes on different values.
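Continuing the sketches above, the softmax-layer gradients of equations (26) and (28) look like this in code (illustrative values of mine, not from the notes; the hidden units are assumed logistic as in Section 1):

```python
o_hidden = np.array([0.2, 0.7, 0.5])        # hidden activations o_j
W2 = np.array([[ 0.1, -0.3,  0.8,  0.0],
               [ 0.5,  0.2, -0.1,  0.4],
               [-0.7,  0.6,  0.3, -0.2]])   # W2[j, i] = w_ji
t = np.array([0.0, 0.0, 1.0, 0.0])          # one-hot target, sum_k t_k = 1

o = softmax(o_hidden @ W2)                  # output activations, eq. (15)
delta_out = o - t                           # dE/dx_i = o_i - t_i, eq. (23)
grad_W2 = np.outer(o_hidden, delta_out)     # dE/dw_ji = (o_i - t_i) o_j, eq. (26)
delta_hidden = (delta_out @ W2.T) * o_hidden * (1 - o_hidden)   # dE/dx_j, eq. (28)
```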

3 Algebraic trick for cross-entropy calculation

For a single output neuron with logistic activation, the cross-entropy error is given by

E = -\big( t \log o + (1 - t) \log(1 - o) \big)    (29)

= -\left( t \log\left(\frac{o}{1 - o}\right) + \log(1 - o) \right)    (30)

= -\left( t \log\left(\frac{\frac{1}{1 + e^{-x}}}{1 - \frac{1}{1 + e^{-x}}}\right) + \log\left(1 - \frac{1}{1 + e^{-x}}\right) \right)    (31)

= -\left( t x + \log\left(\frac{1}{1 + e^{x}}\right) \right)    (32)

= -t x + \log(1 + e^{x})    (33)
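This last form is also the convenient one to evaluate in code, since \log(1 + e^x) can be computed stably (for example with np.logaddexp) even when x is large. A small sketch of mine comparing it with the naive expression (29):

```python
import numpy as np

def cross_entropy_naive(t, x):
    # Equation (29), evaluated through the logistic output o
    o = 1.0 / (1.0 + np.exp(-x))
    return -(t * np.log(o) + (1.0 - t) * np.log(1.0 - o))

def cross_entropy_trick(t, x):
    # Equation (33): E = -t*x + log(1 + e^x), with log(1 + e^x) = logaddexp(0, x)
    return -t * x + np.logaddexp(0.0, x)

print(cross_entropy_naive(1.0, 5.0))    # ~0.00672
print(cross_entropy_trick(1.0, 5.0))    # same value
print(cross_entropy_trick(1.0, 500.0))  # still finite, whereas the naive form breaks down numerically
```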
