
Deep Learning Basics Lecture 2: Backpropagation

Princeton University COS 495

Instructor: Yingyu Liang

How to train the dragon?

[Figure: a deep feedforward network x → h1 → h2 → … → hL → y]

How to get the expected output

Loss of the system: l(x; θ) = l(f_θ, x, y). For the current parameters θ, the output f_θ(x) does not match the target y, so l(x; θ) ≠ 0.

Find a direction d so that the loss l(x; θ + d) ≈ 0.

How to find d: l(x; θ + εv) ≈ l(x; θ) + ∇l(x; θ) · εv for a small scalar ε (a numerical check of this first-order approximation is sketched below).

Conclusion: move θ along −∇l(x; θ) for a small amount.

Pictorial illustration of gradient descent

Neural Networks as real circuits

Gradient

• Gradient of the loss is simple
• E.g., l(f_θ, x, y) = (f_θ(x) − y)² / 2
• ∂l/∂θ = (f_θ(x) − y) ∂f/∂θ
• Key part: gradient of the hypothesis (a small worked example follows)
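
The sketch below spells out that factorization for an assumed linear hypothesis f_θ(x) = θ · x, for which ∂f/∂θ = x; the choice of hypothesis is an assumption for illustration only.

```python
# Slide's example loss l(f_theta, x, y) = (f_theta(x) - y)^2 / 2.
# For the assumed linear hypothesis f_theta(x) = theta · x, ∂f/∂theta = x,
# so ∂l/∂theta = (f_theta(x) - y) * x.
import numpy as np

def f(theta, x):
    return theta @ x                       # assumed linear hypothesis

def dl_dtheta(theta, x, y):
    residual = f(theta, x) - y             # (f_theta(x) - y)
    df_dtheta = x                          # ∂f/∂theta for the linear hypothesis
    return residual * df_dtheta            # ∂l/∂theta = residual * ∂f/∂theta

theta = np.array([0.5, -1.0, 2.0])
x, y = np.array([1.0, 2.0, 3.0]), 0.0
print(dl_dtheta(theta, x, y))              # [4.5, 9.0, 13.5] since f_theta(x) - y = 4.5
```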

Open the box: real circuit

Single neuron

[Figure: inputs x1 and x2 feed a "−" neuron that outputs f; the backward pass labels the edges with the local gradients 1 and −1]

Function: f = x1 − x2

Gradient: ∂f/∂x1 = 1, ∂f/∂x2 = −1

Two neurons

[Figure: x3 and x4 feed a "+" neuron producing x2; x1 and x2 feed a "−" neuron producing f; the backward pass labels the edges into the "+" neuron with −1]

Function: f = x1 − x2 = x1 − (x3 + x4)

Local gradients: ∂x2/∂x3 = 1, ∂x2/∂x4 = 1. What about ∂f/∂x3?

Gradient (chain rule): ∂f/∂x3 = (∂f/∂x2)(∂x2/∂x3) = (−1)(1) = −1 (a code sketch of this forward/backward computation follows)
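
A minimal sketch (not the lecture's code) of the two-neuron circuit: the forward pass computes the node values, and the backward pass multiplies local gradients along the path, exactly as in the chain rule above.

```python
# Two-neuron circuit: x2 = x3 + x4 ("+" neuron), f = x1 - x2 ("-" neuron).
def forward(x1, x3, x4):
    x2 = x3 + x4                     # "+" neuron
    f = x1 - x2                      # "-" neuron
    return f, x2

def backward(x1, x3, x4):
    # local gradients of each neuron
    df_dx1, df_dx2 = 1.0, -1.0       # from f = x1 - x2
    dx2_dx3, dx2_dx4 = 1.0, 1.0      # from x2 = x3 + x4
    # chain rule along the path f <- x2 <- x3 (and x4)
    df_dx3 = df_dx2 * dx2_dx3        # = -1
    df_dx4 = df_dx2 * dx2_dx4        # = -1
    return df_dx1, df_dx3, df_dx4

print(forward(1.0, 2.0, 3.0))        # (f, x2) = (-4.0, 5.0)
print(backward(1.0, 2.0, 3.0))       # (1.0, -1.0, -1.0)
```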

Multiple input

[Figure: x3, x4, and x5 all feed the "+" neuron producing x2; the backward pass labels each of those edges with −1]

Function: f = x1 − x2 = x1 − (x3 + x4 + x5)

Local gradient: ∂x2/∂x5 = 1

Gradient (chain rule): ∂f/∂x5 = (∂f/∂x2)(∂x2/∂x5) = (−1)(1) = −1

Weights on the edges

[Figure: the edges from x3 and x4 into the "+" neuron now carry weights w3 and w4; the backward pass labels them with −x3 and −x4]

Function: f = x1 − x2 = x1 − (w3 x3 + w4 x4)

Gradient (chain rule): ∂f/∂w3 = (∂f/∂x2)(∂x2/∂w3) = (−1)(x3) = −x3 (a quick numerical check follows)

Activation

[Figure: the "+" neuron is followed by an activation σ, so x2 = σ(net2); the backward pass labels the edges with −σ′ and −σ′ x3]

Function: f = x1 − x2 = x1 − σ(w3 x3 + w4 x4)

Let net2 = w3 x3 + w4 x4, so x2 = σ(net2)

Local gradients: ∂x2/∂net2 = σ′, ∂net2/∂w3 = x3

Gradient (chain rule): ∂f/∂w3 = (∂f/∂x2)(∂x2/∂net2)(∂net2/∂w3) = (−1)(σ′)(x3) = −σ′ x3 (a numerical check with a concrete σ follows)

Multiple paths

[Figure: x3 now reaches f along two paths: through x1 = x3 + x5 (a "+" neuron) and through net2 into x2 = σ(net2)]

Function: f = x1 − x2 = (x3 + x5) − σ(w3 x3 + w4 x4), where x1 = x3 + x5

Gradient (sum the contributions over all paths):
∂f/∂x3 = (∂f/∂x2)(∂x2/∂net2)(∂net2/∂x3) + (∂f/∂x1)(∂x1/∂x3) = (−1)(σ′)(w3) + (1)(1) = −σ′ w3 + 1
(a sketch that adds the two path contributions follows)
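
A sketch of the "sum over paths" rule for this circuit: x3 reaches f both through x1 = x3 + x5 and through net2, so the two contributions add. The sigmoid is again an assumed choice of σ.

```python
# f = (x3 + x5) - σ(w3*x3 + w4*x4); x3 contributes along two paths.
import math

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigma_prime(z):
    s = sigma(z)
    return s * (1.0 - s)

def f(x3, x4, x5, w3, w4):
    x1 = x3 + x5                      # "+" neuron
    net2 = w3 * x3 + w4 * x4
    x2 = sigma(net2)                  # activated neuron
    return x1 - x2

x3, x4, x5, w3, w4 = 2.0, 3.0, -1.0, 0.5, -0.25
net2 = w3 * x3 + w4 * x4
# path through x2/net2: (-1) * σ'(net2) * w3 ; path through x1: 1 * 1
analytic = -sigma_prime(net2) * w3 + 1.0
eps = 1e-6
numeric = (f(x3 + eps, x4, x5, w3, w4) - f(x3 - eps, x4, x5, w3, w4)) / (2 * eps)
print(numeric, analytic)              # the two gradients agree
```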

Summary

• Forward to compute f
• Backward to compute the gradients (a compact sketch for the figure's network follows)

[Figure: inputs x1, x2 feed net11 and net21; hidden units h11 = σ(net11) and h21 = σ(net21) are summed by a "+" neuron to give f]
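
The slide's figure does not name the weights on the edges into net11 and net21, so the sketch below assumes weights w11, w12, w21, w22 and a sigmoid σ purely to make the forward and backward passes concrete.

```python
# Forward/backward sketch for the summary network, under assumed weights:
# net11 = w11*x1 + w12*x2, net21 = w21*x1 + w22*x2,
# h11 = σ(net11), h21 = σ(net21), f = h11 + h21.
import math

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward_backward(x1, x2, w11, w12, w21, w22):
    # forward: compute f
    net11 = w11 * x1 + w12 * x2
    net21 = w21 * x1 + w22 * x2
    h11, h21 = sigma(net11), sigma(net21)
    f = h11 + h21
    # backward: propagate ∂f/∂(.) from the output toward the weights
    df_dh11 = df_dh21 = 1.0                      # from f = h11 + h21
    df_dnet11 = df_dh11 * h11 * (1.0 - h11)      # multiply by σ'(net11)
    df_dnet21 = df_dh21 * h21 * (1.0 - h21)      # multiply by σ'(net21)
    grads = {
        "w11": df_dnet11 * x1, "w12": df_dnet11 * x2,
        "w21": df_dnet21 * x1, "w22": df_dnet21 * x2,
    }
    return f, grads

print(forward_backward(1.0, -2.0, 0.1, 0.2, 0.3, 0.4))
```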

Math form

Gradient descent

• Minimize loss L(θ), where the hypothesis is parametrized by θ
• Gradient descent (a minimal loop is sketched below):
  • Initialize θ_0
  • θ_{t+1} = θ_t − η_t ∇L(θ_t)

Stochastic gradient descent (SGD)

• Suppose data points arrive one by one
• L(θ) = (1/n) Σ_{t=1}^{n} l(θ, x_t, y_t), but we only know l(θ, x_t, y_t) at time t
• Idea: simply do what you can based on local information (sketched below)
  • Initialize θ_0
  • θ_{t+1} = θ_t − η_t ∇l(θ_t, x_t, y_t)

Mini-batch

• Instead of one data point, work with a small batch of b points
  (x_{tb+1}, y_{tb+1}), …, (x_{tb+b}, y_{tb+b})
• Update rule (sketched below):
  θ_{t+1} = θ_t − η_t ∇ [(1/b) Σ_{1≤i≤b} l(θ_t, x_{tb+i}, y_{tb+i})]
• Typical batch size: b = 128
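
A minimal mini-batch SGD sketch: average the per-example gradients over a batch of b points before each update. As before, example_grad and the least-squares data are assumptions for illustration.

```python
# Mini-batch SGD: θ_{t+1} = θ_t − η ∇[(1/b) Σ_i l(θ_t, x_i, y_i)] over each batch.
import numpy as np

def minibatch_sgd(example_grad, data, theta0, b=128, eta=0.5, epochs=50):
    theta = np.array(theta0, dtype=float)
    for _ in range(epochs):
        for start in range(0, len(data) - b + 1, b):
            batch = data[start:start + b]
            # (1/b) Σ_i ∇l(θ_t, x_i, y_i): average gradient over the batch
            avg_grad = sum(example_grad(theta, x, y) for x, y in batch) / b
            theta = theta - eta * avg_grad
    return theta

# usage: the same least-squares example, with batch size b = 128
grad = lambda th, x, y: (th @ x - y) * x
rng = np.random.default_rng(0)
xs = rng.uniform(-1.0, 1.0, size=1024)
data = [(np.array([1.0, x]), 2.0 * x + 1.0) for x in xs]
print(minibatch_sgd(grad, data, theta0=[0.0, 0.0], b=128))    # → approximately [1, 2]
```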
