

XOR with intermediate (“hidden”) units

Intermediate units can re-represent input patterns as new patterns with altered similarities

Targets which are not linearly separable in the input space can be linearly separable in the intermediate representational space

Intermediate units are called “hidden” because their activations are not determined directly by the training environment (inputs and targets)

1 / 33
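A small numeric sketch of the idea above (the patterns and hand-picked hidden units below are illustrative, not taken from the slides): the XOR targets are not linearly separable over the two inputs, but if two hidden units re-represent each input as OR and AND, a single linear threshold on those hidden activations separates the targets.

```python
import numpy as np

# XOR truth table: inputs and targets (not linearly separable in input space)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0, 1, 1, 0])

# Hand-picked hidden re-representation: unit 1 computes OR, unit 2 computes AND
h = np.column_stack([(X.sum(axis=1) >= 1).astype(float),   # OR
                     (X.sum(axis=1) >= 2).astype(float)])  # AND

# In hidden space a single linear threshold unit solves XOR: OR - AND > 0.5
y = (h @ np.array([1.0, -1.0]) > 0.5).astype(int)
print(y, (y == t).all())   # [0 1 1 0] True
```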

[Diagram: network with hidden and output layers]

Hidden-to-output weights can be trainedwith the Delta rule

How can we train input-to-hidden weights?

Hidden units do not have targets (for determining error)

Trick: We don’t need to know the error, just the error derivatives

2 / 33

Delta rule as gradient descent in error (sigmoid units)

n_j = \sum_i a_i w_{ij}

a_j = \frac{1}{1 + \exp(-n_j)}

Error: E = \frac{1}{2} \sum_j (t_j - a_j)^2

[Diagram: a_i --(w_{ij})--> n_j --> a_j --> E, with target t_j]

Gradient descent: \Delta w_{ij} = -\varepsilon \frac{\partial E}{\partial w_{ij}}

\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial a_j} \frac{d a_j}{d n_j} \frac{\partial n_j}{\partial w_{ij}} = -(t_j - a_j)\, a_j (1 - a_j)\, a_i

\Delta w_{ij} = -\varepsilon \frac{\partial E}{\partial w_{ij}} = \varepsilon\, (t_j - a_j)\, a_j (1 - a_j)\, a_i

3 / 33
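As a concrete reading of the derivation above, here is a minimal NumPy sketch of one Delta-rule step for a layer of sigmoid output units; the function name, array layout, and learning-rate value are assumptions for illustration, not part of the slides.

```python
import numpy as np

def delta_rule_step(a_in, w, t, eps=0.5):
    """One Delta-rule update for sigmoid output units (illustrative sketch).
    a_in: (n_in,) input activations a_i; w: (n_in, n_out) weights w_ij;
    t: (n_out,) targets t_j; eps: learning rate (illustrative value)."""
    n = a_in @ w                                 # n_j = sum_i a_i w_ij
    a = 1.0 / (1.0 + np.exp(-n))                 # a_j = 1 / (1 + exp(-n_j))
    # Delta w_ij = eps (t_j - a_j) a_j (1 - a_j) a_i
    dw = eps * np.outer(a_in, (t - a) * a * (1 - a))
    return w + dw, 0.5 * np.sum((t - a) ** 2)    # new weights, error E
```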

Generalized Delta rule (“back-propagation”)

n_j = \sum_i a_i w_{ij}

a_j = \frac{1}{1 + \exp(-n_j)}

Error: E = \frac{1}{2} \sum_j (t_j - a_j)^2

[Diagram: hidden and output layers: n_i --> a_i --(w_{ij})--> n_j --> a_j --> E, with target t_j]

Gradient descent: \Delta w_{ij} = -\varepsilon \frac{\partial E}{\partial w_{ij}}

\frac{\partial E}{\partial n_j} = \frac{\partial E}{\partial a_j} \frac{d a_j}{d n_j} = -(t_j - a_j)\, a_j (1 - a_j)

\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial n_j} \frac{\partial n_j}{\partial w_{ij}} = \frac{\partial E}{\partial n_j}\, a_i

\frac{\partial E}{\partial a_i} = \sum_j \frac{\partial E}{\partial n_j} \frac{\partial n_j}{\partial a_i} = \sum_j \frac{\partial E}{\partial n_j}\, w_{ij}

4 / 33


Back-propagation

Forward pass (⇑):

a_i = \frac{1}{1 + \exp(-n_i)}

n_j = \sum_i a_i w_{ij}

a_j = \frac{1}{1 + \exp(-n_j)}

Backward pass (⇓):

\frac{\partial E}{\partial a_j} = -(t_j - a_j)

\frac{\partial E}{\partial n_j} = \frac{\partial E}{\partial a_j}\, a_j (1 - a_j)

\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial n_j}\, a_i

\frac{\partial E}{\partial a_i} = \sum_j \frac{\partial E}{\partial n_j}\, w_{ij}

5 / 33
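The forward and backward passes above translate almost line for line into code. Below is a minimal NumPy sketch for a single input-hidden-output sigmoid network; the absence of bias units, the function name, and the learning-rate value are simplifying assumptions, not part of the slides.

```python
import numpy as np

def backprop_step(a0, w_ih, w_ho, t, eps=0.5):
    """One forward/backward pass plus weight update (illustrative sketch).
    a0: (n_in,) input activations; w_ih: (n_in, n_hid); w_ho: (n_hid, n_out);
    t: (n_out,) targets; eps: learning rate. Bias units omitted for brevity."""
    # Forward pass: n_j = sum_i a_i w_ij, a_j = 1 / (1 + exp(-n_j))
    a_h = 1.0 / (1.0 + np.exp(-(a0 @ w_ih)))     # hidden activations
    a_o = 1.0 / (1.0 + np.exp(-(a_h @ w_ho)))    # output activations

    # Backward pass
    dE_dn_o = -(t - a_o) * a_o * (1 - a_o)       # dE/dn_j = dE/da_j * a_j (1 - a_j)
    dE_dw_ho = np.outer(a_h, dE_dn_o)            # dE/dw_ij = dE/dn_j * a_i
    dE_da_h = w_ho @ dE_dn_o                     # dE/da_i = sum_j dE/dn_j * w_ij
    dE_dn_h = dE_da_h * a_h * (1 - a_h)
    dE_dw_ih = np.outer(a0, dE_dn_h)

    # Gradient descent: Delta w = -eps * dE/dw
    return w_ih - eps * dE_dw_ih, w_ho - eps * dE_dw_ho, 0.5 * np.sum((t - a_o) ** 2)
```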

What do hidden representations learn?

Plaut and Shallice (1993)

Mapped orthography to semantics (unrelated similarities)

Compared similarities among hidden representations to those among orthographic and semantic representations (over settling)

Hidden representations “split the difference” between input and output similarity

6 / 33

Accelerating learning: Momentum descent

\Delta w_{ij}^{[t]} = -\varepsilon \frac{\partial E}{\partial w_{ij}} + \alpha \left( \Delta w_{ij}^{[t-1]} \right)

7 / 33
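A one-line sketch of the momentum update above: the current weight step blends the gradient term with the previous step, which damps oscillation across steep ravines in the error surface and speeds progress along shallow directions. The parameter values below are illustrative assumptions.

```python
def momentum_step(dE_dw, prev_dw, eps=0.1, alpha=0.9):
    """Delta w[t] = -eps * dE/dw + alpha * Delta w[t-1] (illustrative values)."""
    return -eps * dE_dw + alpha * prev_dw
```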

“Auto-encoder” network (4–2–4)

8 / 33
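As a usage sketch, the backprop_step function outlined after the back-propagation slide could train the 4-2-4 auto-encoder on the four one-hot patterns (target equals input); the initialization, learning rate, and epoch count are assumptions, and the bias-free simplification may limit how low the error gets.

```python
import numpy as np

rng = np.random.default_rng(0)
patterns = np.eye(4)                          # four one-hot input/target patterns
w_ih = rng.normal(scale=0.5, size=(4, 2))     # 4 inputs -> 2 hidden units
w_ho = rng.normal(scale=0.5, size=(2, 4))     # 2 hidden units -> 4 outputs

for epoch in range(5000):
    total_err = 0.0
    for p in patterns:                        # auto-encoding: target = input
        w_ih, w_ho, err = backprop_step(p, w_ih, w_ho, p, eps=0.5)
        total_err += err
# total_err should decrease over epochs as the network learns to compress
# each pattern through the 2-unit bottleneck.
```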


Projections of error surface in weight space

Asterisk: error of current set of weights

Tick mark: error of next set of weights

Solid curve (0): Gradient direction

Solid curve (21): Integrated gradient direction (including momentum)

This is the actual direction of the weight step (the tick mark is on this curve); the number is its angle with the gradient direction

Dotted curves: Random directions (each labeled by angle with gradient direction)

9 / 33

Epochs 1-2

10 / 33

Epochs 3-4

11 / 33

Epochs 5-6

12 / 33


Epochs 7-8

13 / 33

Epochs 9-10

14 / 33

Epochs 25-50

15 / 33

Epochs 75-end

16 / 33


High momentum (epochs 1-2)

17 / 33

High momentum (epochs 3-4)

18 / 33

High learning rate (epochs 1-2)

19 / 33

High learning rate (epochs 3-4)

20 / 33