
Deep Learning Basics Lecture 2: Backpropagation

Princeton University COS 495

Instructor: Yingyu Liang

How to train the dragon?

[Figure: a deep feedforward network x → h1 → h2 → … → hL → y]

How to get the expected output

Loss of the system: l(x; θ) = l(f_θ, x, y). For the current parameters θ, the output f_θ(x) does not match the target y, so l(x; θ) ≠ 0.

Find a direction d so that the loss l(x; θ + d) ≈ 0.

How to find d: l(x; θ + εv) ≈ l(x; θ) + ∇l(x; θ) · εv for a small scalar ε (a numerical check of this first-order approximation is sketched below).

Conclusion: move θ along −∇l(x; θ) for a small amount.

Pictorial illustration of gradient descent

Neural Networks as real circuits

Gradient

• Gradient of the loss is simple
• E.g., l(f_θ, x, y) = (f_θ(x) − y)² / 2
• ∂l/∂θ = (f_θ(x) − y) ∂f/∂θ
• Key part: gradient of the hypothesis (a small worked example follows)
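
The sketch below spells out that factorization for an assumed linear hypothesis f_θ(x) = θ · x, for which ∂f/∂θ = x; the choice of hypothesis is an assumption for illustration only.

```python
# Slide's example loss l(f_theta, x, y) = (f_theta(x) - y)^2 / 2.
# For the assumed linear hypothesis f_theta(x) = theta · x, ∂f/∂theta = x,
# so ∂l/∂theta = (f_theta(x) - y) * x.
import numpy as np

def f(theta, x):
    return theta @ x                       # assumed linear hypothesis

def dl_dtheta(theta, x, y):
    residual = f(theta, x) - y             # (f_theta(x) - y)
    df_dtheta = x                          # ∂f/∂theta for the linear hypothesis
    return residual * df_dtheta            # ∂l/∂theta = residual * ∂f/∂theta

theta = np.array([0.5, -1.0, 2.0])
x, y = np.array([1.0, 2.0, 3.0]), 0.0
print(dl_dtheta(theta, x, y))              # [4.5, 9.0, 13.5] since f_theta(x) - y = 4.5
```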

Open the box: real circuit

Single neuron

[Figure: inputs x1 and x2 feed a "−" neuron that outputs f; the backward pass labels the edges with the local gradients 1 and −1]

Function: f = x1 − x2

Gradient: ∂f/∂x1 = 1, ∂f/∂x2 = −1

Two neurons

[Figure: x3 and x4 feed a "+" neuron producing x2; x1 and x2 feed a "−" neuron producing f; the backward pass labels the edges into the "+" neuron with −1]

Function: f = x1 − x2 = x1 − (x3 + x4)

Local gradients: ∂x2/∂x3 = 1, ∂x2/∂x4 = 1. What about ∂f/∂x3?

Gradient (chain rule): ∂f/∂x3 = (∂f/∂x2)(∂x2/∂x3) = (−1)(1) = −1 (a code sketch of this forward/backward computation follows)
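
A minimal sketch (not the lecture's code) of the two-neuron circuit: the forward pass computes the node values, and the backward pass multiplies local gradients along the path, exactly as in the chain rule above.

```python
# Two-neuron circuit: x2 = x3 + x4 ("+" neuron), f = x1 - x2 ("-" neuron).
def forward(x1, x3, x4):
    x2 = x3 + x4                     # "+" neuron
    f = x1 - x2                      # "-" neuron
    return f, x2

def backward(x1, x3, x4):
    # local gradients of each neuron
    df_dx1, df_dx2 = 1.0, -1.0       # from f = x1 - x2
    dx2_dx3, dx2_dx4 = 1.0, 1.0      # from x2 = x3 + x4
    # chain rule along the path f <- x2 <- x3 (and x4)
    df_dx3 = df_dx2 * dx2_dx3        # = -1
    df_dx4 = df_dx2 * dx2_dx4        # = -1
    return df_dx1, df_dx3, df_dx4

print(forward(1.0, 2.0, 3.0))        # (f, x2) = (-4.0, 5.0)
print(backward(1.0, 2.0, 3.0))       # (1.0, -1.0, -1.0)
```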

Multiple input

[Figure: x3, x4, and x5 all feed the "+" neuron producing x2; the backward pass labels each of those edges with −1]

Function: f = x1 − x2 = x1 − (x3 + x4 + x5)

Local gradient: ∂x2/∂x5 = 1

Gradient (chain rule): ∂f/∂x5 = (∂f/∂x2)(∂x2/∂x5) = (−1)(1) = −1

Weights on the edges

[Figure: the edges from x3 and x4 into the "+" neuron now carry weights w3 and w4; the backward pass labels them with −x3 and −x4]

Function: f = x1 − x2 = x1 − (w3 x3 + w4 x4)

Gradient (chain rule): ∂f/∂w3 = (∂f/∂x2)(∂x2/∂w3) = (−1)(x3) = −x3 (a quick numerical check follows)

Activation

[Figure: the "+" neuron is followed by an activation σ, so x2 = σ(net2); the backward pass labels the edges with −σ′ and −σ′ x3]

Function: f = x1 − x2 = x1 − σ(w3 x3 + w4 x4)

Let net2 = w3 x3 + w4 x4, so x2 = σ(net2)

Local gradients: ∂x2/∂net2 = σ′, ∂net2/∂w3 = x3

Gradient (chain rule): ∂f/∂w3 = (∂f/∂x2)(∂x2/∂net2)(∂net2/∂w3) = (−1)(σ′)(x3) = −σ′ x3 (a numerical check with a concrete σ follows)

Multiple paths

[Figure: x3 now reaches f along two paths: through x1 = x3 + x5 (a "+" neuron) and through net2 into x2 = σ(net2)]

Function: f = x1 − x2 = (x3 + x5) − σ(w3 x3 + w4 x4), where x1 = x3 + x5

Gradient (sum the contributions over all paths):
∂f/∂x3 = (∂f/∂x2)(∂x2/∂net2)(∂net2/∂x3) + (∂f/∂x1)(∂x1/∂x3) = (−1)(σ′)(w3) + (1)(1) = −σ′ w3 + 1
(a sketch that adds the two path contributions follows)
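
A sketch of the "sum over paths" rule for this circuit: x3 reaches f both through x1 = x3 + x5 and through net2, so the two contributions add. The sigmoid is again an assumed choice of σ.

```python
# f = (x3 + x5) - σ(w3*x3 + w4*x4); x3 contributes along two paths.
import math

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigma_prime(z):
    s = sigma(z)
    return s * (1.0 - s)

def f(x3, x4, x5, w3, w4):
    x1 = x3 + x5                      # "+" neuron
    net2 = w3 * x3 + w4 * x4
    x2 = sigma(net2)                  # activated neuron
    return x1 - x2

x3, x4, x5, w3, w4 = 2.0, 3.0, -1.0, 0.5, -0.25
net2 = w3 * x3 + w4 * x4
# path through x2/net2: (-1) * σ'(net2) * w3 ; path through x1: 1 * 1
analytic = -sigma_prime(net2) * w3 + 1.0
eps = 1e-6
numeric = (f(x3 + eps, x4, x5, w3, w4) - f(x3 - eps, x4, x5, w3, w4)) / (2 * eps)
print(numeric, analytic)              # the two gradients agree
```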

Summary

• Forward to compute f
• Backward to compute the gradients (a compact sketch for the figure's network follows)

[Figure: inputs x1, x2 feed net11 and net21; hidden units h11 = σ(net11) and h21 = σ(net21) are summed by a "+" neuron to give f]
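
The slide's figure does not name the weights on the edges into net11 and net21, so the sketch below assumes weights w11, w12, w21, w22 and a sigmoid σ purely to make the forward and backward passes concrete.

```python
# Forward/backward sketch for the summary network, under assumed weights:
# net11 = w11*x1 + w12*x2, net21 = w21*x1 + w22*x2,
# h11 = σ(net11), h21 = σ(net21), f = h11 + h21.
import math

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward_backward(x1, x2, w11, w12, w21, w22):
    # forward: compute f
    net11 = w11 * x1 + w12 * x2
    net21 = w21 * x1 + w22 * x2
    h11, h21 = sigma(net11), sigma(net21)
    f = h11 + h21
    # backward: propagate ∂f/∂(.) from the output toward the weights
    df_dh11 = df_dh21 = 1.0                      # from f = h11 + h21
    df_dnet11 = df_dh11 * h11 * (1.0 - h11)      # multiply by σ'(net11)
    df_dnet21 = df_dh21 * h21 * (1.0 - h21)      # multiply by σ'(net21)
    grads = {
        "w11": df_dnet11 * x1, "w12": df_dnet11 * x2,
        "w21": df_dnet21 * x1, "w22": df_dnet21 * x2,
    }
    return f, grads

print(forward_backward(1.0, -2.0, 0.1, 0.2, 0.3, 0.4))
```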

Math form

Gradient descent

• Minimize loss L(θ), where the hypothesis is parametrized by θ
• Gradient descent (a minimal loop is sketched below):
  • Initialize θ_0
  • θ_{t+1} = θ_t − η_t ∇L(θ_t)

Stochastic gradient descent (SGD)

• Suppose data points arrive one by one
• L(θ) = (1/n) Σ_{t=1}^{n} l(θ, x_t, y_t), but we only know l(θ, x_t, y_t) at time t
• Idea: simply do what you can based on local information (sketched below)
  • Initialize θ_0
  • θ_{t+1} = θ_t − η_t ∇l(θ_t, x_t, y_t)

Mini-batch

• Instead of one data point, work with a small batch of b points
  (x_{tb+1}, y_{tb+1}), …, (x_{tb+b}, y_{tb+b})
• Update rule (sketched below):
  θ_{t+1} = θ_t − η_t ∇ [(1/b) Σ_{1≤i≤b} l(θ_t, x_{tb+i}, y_{tb+i})]
• Typical batch size: b = 128
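
A minimal mini-batch SGD sketch: average the per-example gradients over a batch of b points before each update. As before, example_grad and the least-squares data are assumptions for illustration.

```python
# Mini-batch SGD: θ_{t+1} = θ_t − η ∇[(1/b) Σ_i l(θ_t, x_i, y_i)] over each batch.
import numpy as np

def minibatch_sgd(example_grad, data, theta0, b=128, eta=0.5, epochs=50):
    theta = np.array(theta0, dtype=float)
    for _ in range(epochs):
        for start in range(0, len(data) - b + 1, b):
            batch = data[start:start + b]
            # (1/b) Σ_i ∇l(θ_t, x_i, y_i): average gradient over the batch
            avg_grad = sum(example_grad(theta, x, y) for x, y in batch) / b
            theta = theta - eta * avg_grad
    return theta

# usage: the same least-squares example, with batch size b = 128
grad = lambda th, x, y: (th @ x - y) * x
rng = np.random.default_rng(0)
xs = rng.uniform(-1.0, 1.0, size=1024)
data = [(np.array([1.0, x]), 2.0 * x + 1.0) for x in xs]
print(minibatch_sgd(grad, data, theta0=[0.0, 0.0], b=128))    # → approximately [1, 2]
```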
