Notes on Backpropagation

Peter Sadowski
Department of Computer Science
University of California Irvine
Irvine, CA 92697
[email protected]

Abstract

This document derives backpropagation for some common error functions and describes some other tricks. I wrote this up because I kept having to reassure myself of the algebra.

1 Cross Entropy Error with Logistic Activation

Often we have multiple probabilistic output units that do not represent a simple classification task and are therefore not suited to a softmax activation function; an autoencoder network on binary data is one example. The sum-of-squares error is ill-suited because we still desire a probabilistic interpretation of the output. Instead, we can generalize the two-class case by summing the cross entropy errors of multiple output units, where each output is itself a two-class problem. This is simply the expectation of the log-likelihood of independent targets.

The cross entropy error for a single example is

E = -\sum_{i=1}^{n_{out}} \big( t_i \log(o_i) + (1 - t_i) \log(1 - o_i) \big)    (1)

where t is the target and o is the output, each indexed by i. Our activation function is the logistic

o_i = \frac{1}{1 + e^{-x_i}}    (2)

x_i = \sum_{j} o_j w_{ji}    (3)
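As a concrete illustration (my own sketch, not code from the notes), here are equations (1)-(3) in NumPy, for a hypothetical layer of 4 hidden units feeding 3 logistic output units; all sizes and values are made up for the example:

```python
import numpy as np

def logistic(x):
    # Logistic activation, equation (2): o_i = 1 / (1 + exp(-x_i))
    return 1.0 / (1.0 + np.exp(-x))

def cross_entropy(o, t):
    # Summed cross entropy error over the output units, equation (1)
    return -np.sum(t * np.log(o) + (1.0 - t) * np.log(1.0 - o))

# Hypothetical example: 4 hidden units, 3 output units
rng = np.random.default_rng(0)
o_hidden = logistic(rng.normal(size=4))   # stand-in hidden activations o_j
W = rng.normal(size=(4, 3))               # W[j, i] = w_ji
t = np.array([1.0, 0.0, 1.0])             # independent binary targets t_i

x = o_hidden @ W                          # net input, equation (3)
o = logistic(x)                           # output activations, equation (2)
print(cross_entropy(o, t))                # error E, equation (1)
```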

Our first goal is to compute the error gradient for the weights in the final layer, \partial E / \partial w_{ji}. We begin by observing

\frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial o_i} \frac{\partial o_i}{\partial x_i} \frac{\partial x_i}{\partial w_{ji}}    (4)

and proceed by analyzing the partial derivatives for specific data points,

\frac{\partial E}{\partial o_i} = -\frac{t_i}{o_i} + \frac{1 - t_i}{1 - o_i}    (5)

= \frac{o_i - t_i}{o_i (1 - o_i)}    (6)

\frac{\partial o_i}{\partial x_i} = o_i (1 - o_i)    (7)

\frac{\partial x_i}{\partial w_{ji}} = o_j    (8)


where o_j is the activation of the jth node in the hidden layer. Combining things back together,

\frac{\partial E}{\partial x_i} = o_i - t_i    (9)

and

\frac{\partial E}{\partial w_{ji}} = (o_i - t_i)\, o_j    (10)
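Equations (9) and (10) translate directly into code. Continuing the hypothetical NumPy sketch above (my illustration, not code from the notes):

```python
delta_out = o - t                        # dE/dx_i = o_i - t_i, equation (9)
grad_W = np.outer(o_hidden, delta_out)   # grad_W[j, i] = (o_i - t_i) o_j, equation (10)
```

A finite-difference check of grad_W against cross_entropy is a quick way to confirm the algebra.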

This gives us the gradient for the weights in the last layer of the network. We now need to calculate the error gradient for the weights of the lower layers. Here it is useful to calculate the quantity \partial E / \partial x_j, where j indexes the units in the second layer down.

\frac{\partial E}{\partial x_j} = \sum_{i=1}^{n_{out}} \frac{\partial E}{\partial x_i} \frac{\partial x_i}{\partial o_j} \frac{\partial o_j}{\partial x_j}    (11)

= \sum_{i=1}^{n_{out}} (o_i - t_i)(w_{ji})(o_j(1 - o_j))    (12)

Then a weight w_{kj} connecting units in the second and third layers down has gradient

\frac{\partial E}{\partial w_{kj}} = \frac{\partial E}{\partial x_j} \frac{\partial x_j}{\partial w_{kj}}    (13)

= \sum_{i=1}^{n_{out}} (o_i - t_i)(w_{ji})(o_j(1 - o_j))(o_k)    (14)

In conclusion, to compute \partial E / \partial w_{kj} for a general multilayer network, we simply need to compute \partial E / \partial x_j recursively, then multiply by \partial x_j / \partial w_{kj} = o_k.
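The recursion can be written compactly in code. The following sketch (mine, reusing the hypothetical logistic() helper from the earlier example) computes both weight gradients for a network with one hidden layer, i.e. two weight matrices:

```python
def backprop_two_layer(o_in, W1, W2, t):
    # Forward pass: o_in are the input activations o_k,
    # W1[k, j] = w_kj and W2[j, i] = w_ji (illustrative layout).
    o_hidden = logistic(o_in @ W1)        # hidden activations o_j
    o_out = logistic(o_hidden @ W2)       # output activations o_i

    # Backward pass through the recursion derived above.
    delta_out = o_out - t                                          # dE/dx_i, eq. (9)
    delta_hidden = (delta_out @ W2.T) * o_hidden * (1 - o_hidden)  # dE/dx_j, eq. (12)

    grad_W2 = np.outer(o_hidden, delta_out)   # dE/dw_ji = (o_i - t_i) o_j, eq. (10)
    grad_W1 = np.outer(o_in, delta_hidden)    # dE/dw_kj = dE/dx_j * o_k, eq. (14)
    return grad_W1, grad_W2
```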

2 Classification with Softmax Transfer and Cross Entropy Error

For classification problems with more than 2 classes, we need a way of assigning probabilities to each class. The softmax activation function allows us to do this while still using the cross entropy error and having a nice gradient to descend. (Note that this cross entropy error function is modified for multi-class output.) It turns out to be the same gradient as for the summed cross entropy case. The softmax activation of the ith output unit is

f_i(w, x) = \frac{e^{x_i}}{\sum_{k}^{n_{class}} e^{x_k}}    (15)

and the cross entropy error function for multi-class output is

E = -\sum_{i}^{n_{class}} t_i \log(o_i)    (16)
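For concreteness, here is a sketch of mine (not from the notes) of equations (15) and (16) in NumPy; subtracting the maximum before exponentiating is a standard stability trick that does not change the value of (15):

```python
def softmax(x):
    # Softmax activation, equation (15)
    e = np.exp(x - np.max(x))   # max subtraction for numerical stability
    return e / np.sum(e)

def multiclass_cross_entropy(o, t):
    # Multi-class cross entropy error, equation (16): E = -sum_i t_i log(o_i)
    return -np.sum(t * np.log(o))
```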

So

\frac{\partial E}{\partial o_i} = -\frac{t_i}{o_i}    (17)

\frac{\partial o_i}{\partial x_k} = \begin{cases} o_i(1 - o_i) & i = k \\ -o_i o_k & i \neq k \end{cases}    (18)


Combining (17) and (18),

\frac{\partial E}{\partial x_i} = \sum_{k}^{n_{class}} \frac{\partial E}{\partial o_k} \frac{\partial o_k}{\partial x_i}    (19)

= \frac{\partial E}{\partial o_i} \frac{\partial o_i}{\partial x_i} + \sum_{k \neq i} \frac{\partial E}{\partial o_k} \frac{\partial o_k}{\partial x_i}    (20)

= -t_i(1 - o_i) + \sum_{k \neq i} t_k o_i    (21)

= -t_i + o_i \sum_{k} t_k    (22)

= o_i - t_i    (23)

where the last step uses the fact that the targets of a classification problem sum to one, \sum_k t_k = 1.

The gradient for weights in the top layer is thus

\frac{\partial E}{\partial w_{ji}} = \sum_{i} \frac{\partial E}{\partial x_i} \frac{\partial x_i}{\partial w_{ji}}    (25)

= (o_i - t_i)\, o_j    (26)

and for units in the second lowest layer indexed by j,

\frac{\partial E}{\partial x_j} = \sum_{i}^{n_{class}} \frac{\partial E}{\partial x_i} \frac{\partial x_i}{\partial o_j} \frac{\partial o_j}{\partial x_j}    (27)

= \sum_{i}^{n_{class}} (o_i - t_i)(w_{ji})(o_j(1 - o_j))    (28)

Notice that this gradient has the same formula as in the summed cross entropy case, but it is different because the different activation function in the last layer means that o takes on different values.
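Continuing the sketches above, the softmax-layer gradients of equations (26) and (28) look like this in code (illustrative values of mine, not from the notes; the hidden units are assumed logistic as in Section 1):

```python
o_hidden = np.array([0.2, 0.7, 0.5])        # hidden activations o_j
W2 = np.array([[ 0.1, -0.3,  0.8,  0.0],
               [ 0.5,  0.2, -0.1,  0.4],
               [-0.7,  0.6,  0.3, -0.2]])   # W2[j, i] = w_ji
t = np.array([0.0, 0.0, 1.0, 0.0])          # one-hot target, sum_k t_k = 1

o = softmax(o_hidden @ W2)                  # output activations, eq. (15)
delta_out = o - t                           # dE/dx_i = o_i - t_i, eq. (23)
grad_W2 = np.outer(o_hidden, delta_out)     # dE/dw_ji = (o_i - t_i) o_j, eq. (26)
delta_hidden = (delta_out @ W2.T) * o_hidden * (1 - o_hidden)   # dE/dx_j, eq. (28)
```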

3 Algebraic trick for cross-entropy calculation

For a single output neuron with logistic activation, the cross-entropy error is given by

E = -\big( t \log o + (1 - t) \log(1 - o) \big)    (29)

= -\left( t \log\left(\frac{o}{1 - o}\right) + \log(1 - o) \right)    (30)

= -\left( t \log\left(\frac{\frac{1}{1 + e^{-x}}}{1 - \frac{1}{1 + e^{-x}}}\right) + \log\left(1 - \frac{1}{1 + e^{-x}}\right) \right)    (31)

= -\left( t x + \log\left(\frac{1}{1 + e^{x}}\right) \right)    (32)

= -t x + \log(1 + e^{x})    (33)
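This last form is also the convenient one to evaluate in code, since \log(1 + e^x) can be computed stably (for example with np.logaddexp) even when x is large. A small sketch of mine comparing it with the naive expression (29):

```python
import numpy as np

def cross_entropy_naive(t, x):
    # Equation (29), evaluated through the logistic output o
    o = 1.0 / (1.0 + np.exp(-x))
    return -(t * np.log(o) + (1.0 - t) * np.log(1.0 - o))

def cross_entropy_trick(t, x):
    # Equation (33): E = -t*x + log(1 + e^x), with log(1 + e^x) = logaddexp(0, x)
    return -t * x + np.logaddexp(0.0, x)

print(cross_entropy_naive(1.0, 5.0))    # ~0.00672
print(cross_entropy_trick(1.0, 5.0))    # same value
print(cross_entropy_trick(1.0, 500.0))  # still finite, whereas the naive form breaks down numerically
```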
