Notes on Backpropagation

Peter Sadowski
Department of Computer Science
University of California Irvine
Irvine, CA 92697

Abstract

This document derives backpropagation for some common error functions and describes some other tricks. I wrote this up because I kept having to reassure myself of the algebra.
1 Cross Entropy Error with Logistic Activation
Often we have multiple probabilistic output units that do not represent a simple classification task, and are therefore not suited for a softmax activation function; an example is an autoencoder network on binary data. The sum-of-squares error is ill-suited because we still desire the probabilistic interpretation of the output. Instead, we can generalize the two-class case by summing the cross entropy errors for multiple output units, where each output is itself a two-class problem. This is simply the expectation of the log-likelihood of independent targets. The cross entropy error for a single example is
\[ E = -\sum_{i=1}^{n_{out}} \big( t_i \log(o_i) + (1 - t_i) \log(1 - o_i) \big) \tag{1} \]
where \(t\) is the target and \(o\) is the output, indexed by \(i\). Our activation function is the logistic
\[ o_i = \frac{1}{1 + e^{-x_i}} \tag{2} \]
\[ x_i = \sum_j o_j w_{ji} \tag{3} \]
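As a quick concreteness check, the forward pass of Eqs. (2)-(3) can be sketched in a few lines of NumPy (all variable names below are ours, not from the text; `w[i, j]` is the weight from unit \(j\) to unit \(i\)):

```python
import numpy as np

def logistic(x):
    # Eq. (2): the logistic (sigmoid) activation
    return 1.0 / (1.0 + np.exp(-x))

o_prev = np.array([0.2, 0.7, 0.5])      # activations o_j of the layer below
w = np.array([[0.1, -0.4, 0.3],
              [0.8,  0.2, -0.5]])       # weights w_ji

x = w @ o_prev                          # Eq. (3): weighted sums x_i
o = logistic(x)                         # Eq. (2): output activations o_i
```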
Our first goal is to compute the error gradient for the weights in the final layer, \(\frac{\partial E}{\partial w_{ji}}\). We begin by observing,
\[ \frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial o_i} \frac{\partial o_i}{\partial x_i} \frac{\partial x_i}{\partial w_{ji}} \tag{4} \]
and proceed by analyzing the partial derivatives for specific data points,
\[ \frac{\partial E}{\partial o_i} = -\frac{t_i}{o_i} + \frac{1 - t_i}{1 - o_i} \tag{5} \]
\[ = \frac{o_i - t_i}{o_i(1 - o_i)} \tag{6} \]
\[ \frac{\partial o_i}{\partial x_i} = o_i(1 - o_i) \tag{7} \]
\[ \frac{\partial x_i}{\partial w_{ji}} = o_j \tag{8} \]
where \(o_j\) is the activation of the \(j\)th node in the hidden layer. Combining things back together,
\[ \frac{\partial E}{\partial x_i} = o_i - t_i \tag{9} \]
and
\[ \frac{\partial E}{\partial w_{ji}} = (o_i - t_i) o_j. \tag{10} \]
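The result \((o_i - t_i) o_j\) is easy to verify numerically. The sketch below (all names are ours, assuming a single layer of logistic output units with binary targets) compares the analytic gradient against a central finite difference:

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden, n_out = 4, 3
w = rng.normal(size=(n_out, n_hidden))           # w[i, j] connects hidden j to output i
o_hidden = rng.uniform(0.1, 0.9, size=n_hidden)  # hidden activations o_j
t = rng.integers(0, 2, size=n_out).astype(float) # binary targets t_i

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def error(w):
    # Eq. (1): cross entropy over independent logistic outputs
    o = sigmoid(w @ o_hidden)
    return -np.sum(t * np.log(o) + (1 - t) * np.log(1 - o))

# Analytic gradient, Eq. (10): dE/dw_ji = (o_i - t_i) * o_j
o = sigmoid(w @ o_hidden)
grad_analytic = np.outer(o - t, o_hidden)

# Central finite-difference gradient
eps = 1e-6
grad_numeric = np.zeros_like(w)
for i in range(n_out):
    for j in range(n_hidden):
        wp, wm = w.copy(), w.copy()
        wp[i, j] += eps
        wm[i, j] -= eps
        grad_numeric[i, j] = (error(wp) - error(wm)) / (2 * eps)

assert np.allclose(grad_analytic, grad_numeric, atol=1e-6)
```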
This gives us the gradient for the weights in the last layer of the network. We now need to calculate the error gradient for the weights of the lower layers. Here it is useful to calculate the quantity \(\frac{\partial E}{\partial x_j}\), where \(j\) indexes the units in the second layer down.
\[ \frac{\partial E}{\partial x_j} = \sum_{i=1}^{n_{out}} \frac{\partial E}{\partial x_i} \frac{\partial x_i}{\partial o_j} \frac{\partial o_j}{\partial x_j} \tag{11} \]
\[ = \sum_{i=1}^{n_{out}} (o_i - t_i)(w_{ji})(o_j(1 - o_j)) \tag{12} \]
Then a weight \(w_{kj}\) connecting units in the second and third layers down has gradient
\[ \frac{\partial E}{\partial w_{kj}} = \frac{\partial E}{\partial x_j} \frac{\partial x_j}{\partial w_{kj}} \tag{13} \]
\[ = \sum_{i=1}^{n_{out}} (o_i - t_i)(w_{ji})(o_j(1 - o_j))(o_k) \tag{14} \]
In conclusion, to compute \(\frac{\partial E}{\partial w_{ij}}\) for a general multilayer network, we simply need to compute \(\frac{\partial E}{\partial x_j}\) recursively, then multiply by \(\frac{\partial x_j}{\partial w_{kj}} = o_k\).
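The full recursion can be sketched for a network with one hidden layer of logistic units feeding logistic outputs (all names below are ours). The deltas follow Eqs. (9) and (12), the weight gradients follow Eqs. (10) and (14), and a finite-difference check confirms the algebra:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid, n_out = 3, 4, 2
o_k = rng.uniform(0.1, 0.9, size=n_in)       # input-layer activations o_k
W1 = rng.normal(size=(n_hid, n_in))          # W1[j, k] = w_kj
W2 = rng.normal(size=(n_out, n_hid))         # W2[i, j] = w_ji
t = rng.integers(0, 2, size=n_out).astype(float)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def error(W1, W2):
    o = sigmoid(W2 @ sigmoid(W1 @ o_k))
    return -np.sum(t * np.log(o) + (1 - t) * np.log(1 - o))

# Backward pass
h = sigmoid(W1 @ o_k)                        # hidden activations o_j
o = sigmoid(W2 @ h)                          # output activations o_i
delta_out = o - t                            # Eq. (9):  dE/dx_i
delta_hid = (W2.T @ delta_out) * h * (1 - h) # Eq. (12): dE/dx_j
grad_W2 = np.outer(delta_out, h)             # Eq. (10): (o_i - t_i) o_j
grad_W1 = np.outer(delta_hid, o_k)           # Eq. (14): dE/dx_j * o_k

# Finite-difference check on the hidden-layer weights
eps = 1e-6
grad_num = np.zeros_like(W1)
for j in range(n_hid):
    for k in range(n_in):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[j, k] += eps
        Wm[j, k] -= eps
        grad_num[j, k] = (error(Wp, W2) - error(Wm, W2)) / (2 * eps)

assert np.allclose(grad_W1, grad_num, atol=1e-6)
```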
2 Classification with Softmax Transfer and Cross Entropy Error
For classification problems with more than 2 classes, we need a way of assigning probabilities to each class. The softmax activation function allows us to do this while still using the cross entropy error and having a nice gradient to descend. (Note that this cross entropy error function is modified for multiclass output.) It turns out to be the same gradient as for the summed cross entropy case. The softmax activation of the \(i\)th output unit is
\[ f_i(w, x) = \frac{e^{x_i}}{\sum_k^{n_{class}} e^{x_k}} \tag{15} \]
and the cross entropy error function for multi-class output is
\[ E = -\sum_i^{n_{class}} t_i \log(o_i) \tag{16} \]
So
\[ \frac{\partial E}{\partial o_i} = -\frac{t_i}{o_i} \tag{17} \]
\[ \frac{\partial o_i}{\partial x_k} = \begin{cases} o_i(1 - o_i) & i = k \\ -o_i o_k & i \neq k \end{cases} \tag{18} \]
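Eq. (18) can be checked numerically: the full softmax Jacobian is compactly \(\mathrm{diag}(o) - o o^\top\), which encodes both cases at once. A sketch (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
x = rng.normal(size=n)

def softmax(x):
    e = np.exp(x - x.max())       # shift by max(x) for numerical stability
    return e / e.sum()

o = softmax(x)
# Eq. (18) in matrix form: do_i/dx_k = o_i(1-o_i) on the diagonal, -o_i o_k off it
jac_analytic = np.diag(o) - np.outer(o, o)

# Central finite-difference Jacobian
eps = 1e-6
jac_numeric = np.zeros((n, n))
for k in range(n):
    xp, xm = x.copy(), x.copy()
    xp[k] += eps
    xm[k] -= eps
    jac_numeric[:, k] = (softmax(xp) - softmax(xm)) / (2 * eps)

assert np.allclose(jac_analytic, jac_numeric, atol=1e-6)
```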
\[ \frac{\partial E}{\partial x_i} = \sum_k^{n_{class}} \frac{\partial E}{\partial o_k} \frac{\partial o_k}{\partial x_i} \tag{19} \]
\[ = \frac{\partial E}{\partial o_i} \frac{\partial o_i}{\partial x_i} + \sum_{k \neq i} \frac{\partial E}{\partial o_k} \frac{\partial o_k}{\partial x_i} \tag{20} \]
\[ = -t_i(1 - o_i) + \sum_{k \neq i} t_k o_i \tag{21} \]
\[ = -t_i + o_i \sum_k t_k \tag{22} \]
\[ = o_i - t_i \tag{23} \]
where the last step uses the fact that the targets sum to one, \(\sum_k t_k = 1\).
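The clean result of Eq. (23) can be verified directly (a sketch; names are ours, with a one-hot target so that the targets sum to one):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
x = rng.normal(size=n)
t = np.zeros(n)
t[1] = 1.0                         # one-hot target, sums to 1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def error(x):
    # Eq. (16): multiclass cross entropy with softmax outputs
    return -np.sum(t * np.log(softmax(x)))

# Eq. (23): dE/dx_i = o_i - t_i
grad_analytic = softmax(x) - t

# Central finite-difference gradient
eps = 1e-6
grad_numeric = np.array([
    (error(x + eps * np.eye(n)[i]) - error(x - eps * np.eye(n)[i])) / (2 * eps)
    for i in range(n)
])

assert np.allclose(grad_analytic, grad_numeric, atol=1e-6)
```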
The gradient for weights in the top layer is thus
\[ \frac{\partial E}{\partial w_{ji}} = \sum_i \frac{\partial E}{\partial x_i} \frac{\partial x_i}{\partial w_{ji}} \tag{25} \]
\[ = (o_i - t_i) o_j \tag{26} \]
and for units in the second lowest layer indexed by \(j\),
\[ \frac{\partial E}{\partial x_j} = \sum_i^{n_{class}} \frac{\partial E}{\partial x_i} \frac{\partial x_i}{\partial o_j} \frac{\partial o_j}{\partial x_j} \tag{27} \]
\[ = \sum_i^{n_{class}} (o_i - t_i)(w_{ji})(o_j(1 - o_j)) \tag{28} \]
Notice that this gradient has the same formula as in the summed cross entropy case, but it is different because the different activation function in the last layer means \(o\) takes on different values.
3 Algebraic trick for cross-entropy calculation
For a single output neuron with logistic activation, the cross-entropy error is given by
\[ E = -(t \log o + (1 - t) \log(1 - o)) \tag{29} \]
\[ = -\left( t \log\left( \frac{o}{1 - o} \right) + \log(1 - o) \right) \tag{30} \]
\[ = -\left( t \log\left( \frac{\frac{1}{1 + e^{-x}}}{1 - \frac{1}{1 + e^{-x}}} \right) + \log\left( 1 - \frac{1}{1 + e^{-x}} \right) \right) \tag{31} \]
\[ = -\left( t x + \log\left( \frac{1}{1 + e^{x}} \right) \right) \tag{32} \]
\[ = -t x + \log(1 + e^{x}) \tag{33} \]
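In code, Eq. (33) is the numerically preferable form: computing the error directly from the pre-activation \(x\) avoids forming \(o = \sigma(x)\) and taking \(\log(o)\), which underflows for large \(|x|\). A sketch (function names are ours; `np.logaddexp(0, x)` computes \(\log(1 + e^{x})\) stably):

```python
import numpy as np

def cross_entropy_naive(x, t):
    # Eq. (29), computed through the logistic output o
    o = 1.0 / (1.0 + np.exp(-x))
    return -(t * np.log(o) + (1 - t) * np.log(1 - o))

def cross_entropy_stable(x, t):
    # Eq. (33), computed directly from the pre-activation x
    return -t * x + np.logaddexp(0.0, x)

# Both forms agree where the naive one is well-behaved
x, t = 2.5, 1.0
assert np.isclose(cross_entropy_naive(x, t), cross_entropy_stable(x, t))

# For extreme pre-activations the naive form breaks down (log of 0),
# while Eq. (33) stays finite
print(cross_entropy_stable(1000.0, 0.0))   # prints 1000.0
```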