

XOR with intermediate (“hidden”) units

Intermediate units can re-represent input patterns as new patterns with altered similarities

Targets which are not linearly separable in the input space can be linearly separable in the intermediate representational space

Intermediate units are called “hidden” because their activations are not determined directly by the training environment (inputs and targets)

1 / 33
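A small numeric sketch of the idea above (the patterns and hand-picked hidden units below are illustrative, not taken from the slides): the XOR targets are not linearly separable over the two inputs, but if two hidden units re-represent each input as OR and AND, a single linear threshold on those hidden activations separates the targets.

```python
import numpy as np

# XOR truth table: inputs and targets (not linearly separable in input space)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0, 1, 1, 0])

# Hand-picked hidden re-representation: unit 1 computes OR, unit 2 computes AND
h = np.column_stack([(X.sum(axis=1) >= 1).astype(float),   # OR
                     (X.sum(axis=1) >= 2).astype(float)])  # AND

# In hidden space a single linear threshold unit solves XOR: OR - AND > 0.5
y = (h @ np.array([1.0, -1.0]) > 0.5).astype(int)
print(y, (y == t).all())   # [0 1 1 0] True
```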

[Diagram: network with hidden and output layers]

Hidden-to-output weights can be trainedwith the Delta rule

How can we train input-to-hidden weights?

Hidden units do not have targets (for determining error)

Trick: We don’t need to know the error, just the error derivatives

2 / 33

Delta rule as gradient descent in error (sigmoid units)

n_j = \sum_i a_i w_{ij}

a_j = \frac{1}{1 + \exp(-n_j)}

Error: E = \frac{1}{2} \sum_j (t_j - a_j)^2

[Diagram: a_i --(w_{ij})--> n_j --> a_j --> E, with target t_j]

Gradient descent: \Delta w_{ij} = -\varepsilon \frac{\partial E}{\partial w_{ij}}

\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial a_j} \frac{d a_j}{d n_j} \frac{\partial n_j}{\partial w_{ij}} = -(t_j - a_j)\, a_j (1 - a_j)\, a_i

\Delta w_{ij} = -\varepsilon \frac{\partial E}{\partial w_{ij}} = \varepsilon\, (t_j - a_j)\, a_j (1 - a_j)\, a_i

3 / 33
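As a concrete reading of the derivation above, here is a minimal NumPy sketch of one Delta-rule step for a layer of sigmoid output units; the function name, array layout, and learning-rate value are assumptions for illustration, not part of the slides.

```python
import numpy as np

def delta_rule_step(a_in, w, t, eps=0.5):
    """One Delta-rule update for sigmoid output units (illustrative sketch).
    a_in: (n_in,) input activations a_i; w: (n_in, n_out) weights w_ij;
    t: (n_out,) targets t_j; eps: learning rate (illustrative value)."""
    n = a_in @ w                                 # n_j = sum_i a_i w_ij
    a = 1.0 / (1.0 + np.exp(-n))                 # a_j = 1 / (1 + exp(-n_j))
    # Delta w_ij = eps (t_j - a_j) a_j (1 - a_j) a_i
    dw = eps * np.outer(a_in, (t - a) * a * (1 - a))
    return w + dw, 0.5 * np.sum((t - a) ** 2)    # new weights, error E
```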

Generalized Delta rule (“back-propagation”)

n_j = \sum_i a_i w_{ij}

a_j = \frac{1}{1 + \exp(-n_j)}

Error: E = \frac{1}{2} \sum_j (t_j - a_j)^2

[Diagram: hidden and output layers: n_i --> a_i --(w_{ij})--> n_j --> a_j --> E, with target t_j]

Gradient descent: \Delta w_{ij} = -\varepsilon \frac{\partial E}{\partial w_{ij}}

\frac{\partial E}{\partial n_j} = \frac{\partial E}{\partial a_j} \frac{d a_j}{d n_j} = -(t_j - a_j)\, a_j (1 - a_j)

\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial n_j} \frac{\partial n_j}{\partial w_{ij}} = \frac{\partial E}{\partial n_j}\, a_i

\frac{\partial E}{\partial a_i} = \sum_j \frac{\partial E}{\partial n_j} \frac{\partial n_j}{\partial a_i} = \sum_j \frac{\partial E}{\partial n_j}\, w_{ij}

4 / 33


Back-propagation

Forward pass (⇑):

a_i = \frac{1}{1 + \exp(-n_i)}

n_j = \sum_i a_i w_{ij}

a_j = \frac{1}{1 + \exp(-n_j)}

Backward pass (⇓):

\frac{\partial E}{\partial a_j} = -(t_j - a_j)

\frac{\partial E}{\partial n_j} = \frac{\partial E}{\partial a_j}\, a_j (1 - a_j)

\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial n_j}\, a_i

\frac{\partial E}{\partial a_i} = \sum_j \frac{\partial E}{\partial n_j}\, w_{ij}

5 / 33
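The forward and backward passes above translate almost line for line into code. Below is a minimal NumPy sketch for a single input-hidden-output sigmoid network; the absence of bias units, the function name, and the learning-rate value are simplifying assumptions, not part of the slides.

```python
import numpy as np

def backprop_step(a0, w_ih, w_ho, t, eps=0.5):
    """One forward/backward pass plus weight update (illustrative sketch).
    a0: (n_in,) input activations; w_ih: (n_in, n_hid); w_ho: (n_hid, n_out);
    t: (n_out,) targets; eps: learning rate. Bias units omitted for brevity."""
    # Forward pass: n_j = sum_i a_i w_ij, a_j = 1 / (1 + exp(-n_j))
    a_h = 1.0 / (1.0 + np.exp(-(a0 @ w_ih)))     # hidden activations
    a_o = 1.0 / (1.0 + np.exp(-(a_h @ w_ho)))    # output activations

    # Backward pass
    dE_dn_o = -(t - a_o) * a_o * (1 - a_o)       # dE/dn_j = dE/da_j * a_j (1 - a_j)
    dE_dw_ho = np.outer(a_h, dE_dn_o)            # dE/dw_ij = dE/dn_j * a_i
    dE_da_h = w_ho @ dE_dn_o                     # dE/da_i = sum_j dE/dn_j * w_ij
    dE_dn_h = dE_da_h * a_h * (1 - a_h)
    dE_dw_ih = np.outer(a0, dE_dn_h)

    # Gradient descent: Delta w = -eps * dE/dw
    return w_ih - eps * dE_dw_ih, w_ho - eps * dE_dw_ho, 0.5 * np.sum((t - a_o) ** 2)
```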

What do hidden representations learn?

Plaut and Shallice (1993)

Mapped orthography to semantics (unrelated similarities)

Compared similarities among hidden representations to those among orthographic and semantic representations (over settling)

Hidden representations “split the difference” between input and output similarity

6 / 33

Accelerating learning: Momentum descent

\Delta w_{ij}^{[t]} = -\varepsilon \frac{\partial E}{\partial w_{ij}} + \alpha \left( \Delta w_{ij}^{[t-1]} \right)

7 / 33
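A one-line sketch of the momentum update above: the current weight step blends the gradient term with the previous step, which damps oscillation across steep ravines in the error surface and speeds progress along shallow directions. The parameter values below are illustrative assumptions.

```python
def momentum_step(dE_dw, prev_dw, eps=0.1, alpha=0.9):
    """Delta w[t] = -eps * dE/dw + alpha * Delta w[t-1] (illustrative values)."""
    return -eps * dE_dw + alpha * prev_dw
```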

“Auto-encoder” network (4–2–4)

8 / 33
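As a usage sketch, the backprop_step function outlined after the back-propagation slide could train the 4-2-4 auto-encoder on the four one-hot patterns (target equals input); the initialization, learning rate, and epoch count are assumptions, and the bias-free simplification may limit how low the error gets.

```python
import numpy as np

rng = np.random.default_rng(0)
patterns = np.eye(4)                          # four one-hot input/target patterns
w_ih = rng.normal(scale=0.5, size=(4, 2))     # 4 inputs -> 2 hidden units
w_ho = rng.normal(scale=0.5, size=(2, 4))     # 2 hidden units -> 4 outputs

for epoch in range(5000):
    total_err = 0.0
    for p in patterns:                        # auto-encoding: target = input
        w_ih, w_ho, err = backprop_step(p, w_ih, w_ho, p, eps=0.5)
        total_err += err
# total_err should decrease over epochs as the network learns to compress
# each pattern through the 2-unit bottleneck.
```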


Projections of error surface in weight space

Asterisk: error of current set of weights

Tick mark: error of next set of weights

Solid curve (0): Gradient direction

Solid curve (21): Integrated gradient direction (including momentum)

This is the actual direction of the weight step (the tick mark is on this curve); the number is its angle with the gradient direction

Dotted curves: Random directions (each labeled by angle with gradient direction)

9 / 33

Epochs 1-2

10 / 33

Epochs 3-4

11 / 33

Epochs 5-6

12 / 33


Epochs 7-8

13 / 33

Epochs 9-10

14 / 33

Epochs 25-50

15 / 33

Epochs 75-end

16 / 33


High momentum (epochs 1-2)

17 / 33

High momentum (epochs 3-4)

18 / 33

High learning rate (epochs 1-2)

19 / 33

High learning rate (epochs 3-4)

20 / 33