Deep Learning Basics Lecture 2: Backpropagation Princeton University COS 495 Instructor: Yingyu Liang


Page 1:

Deep Learning Basics Lecture 2: Backpropagation

Princeton University COS 495

Instructor: Yingyu Liang

Page 2:

How to train the dragon?

[Figure: a feedforward network, input $x$ → hidden layers $h_1, h_2, \ldots, h_L$ → output $y$]

Page 3:

How to get the expected output

Loss of the system: $l(x; \theta) = l(f_\theta, x, y)$

[Figure: input $x$ → network $f_\theta(x)$ → output compared with the label $y$; currently $l(x; \theta) \neq 0$]

Page 4:

How to get the expected output

๐‘ฅ ๐‘™ ๐‘ฅ; ๐œƒ + ๐‘‘ โ‰ˆ 0

Find direction ๐‘‘ so that:

Loss ๐‘™(๐‘ฅ; ๐œƒ + ๐‘‘)

Page 5:

How to get the expected output

How to find $d$: $l(x; \theta + \epsilon v) \approx l(x; \theta) + \nabla l(x; \theta) \cdot \epsilon v$ for a small scalar $\epsilon$

Goal: $l(x; \theta + d) \approx 0$

Page 6:

How to get the expected output

Conclusion: move $\theta$ along $-\nabla l(x; \theta)$ for a small amount; to first order, this is the direction that decreases the loss $l(x; \theta + d)$ the most.

Page 7:

Neural networks as real circuits

[Figure: pictorial illustration of gradient descent]

Page 8:

Gradient

• Gradient of the loss is simple

• E.g., $l(f_\theta, x, y) = (f_\theta(x) - y)^2 / 2$

• $\frac{\partial l}{\partial \theta} = (f_\theta(x) - y) \frac{\partial f}{\partial \theta}$

• Key part: gradient of the hypothesis
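As a sketch (assuming, for illustration, a linear hypothesis $f_\theta(x) = \theta \cdot x$, which the slide does not specify), the chain-rule formula above can be checked against finite differences:

```python
import numpy as np

# Sketch: dl/dtheta = (f_theta(x) - y) * df/dtheta for l = (f - y)^2 / 2,
# with the assumed linear hypothesis f_theta(x) = theta . x (so df/dtheta = x).
x, y = np.array([1.0, 2.0]), 3.0
theta = np.array([0.5, -0.5])

f = theta @ x                      # hypothesis output
grad_l = (f - y) * x               # (f - y) * df/dtheta

# Central finite-difference check of the analytic gradient
eps = 1e-6
num = np.array([
    (((theta + eps * e) @ x - y) ** 2 / 2 - ((theta - eps * e) @ x - y) ** 2 / 2)
    for e in np.eye(2)
]) / (2 * eps)
print(grad_l, num)                 # the two should agree closely
```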

Page 9:

Open the box: real circuit

Page 10:

Single neuron

[Figure: inputs $x_1$ and $x_2$ feed a '−' node that outputs $f$]

Function: $f = x_1 - x_2$

Page 11:

Single neuron

[Figure: the same circuit, with the local gradients 1 (edge from $x_1$) and −1 (edge from $x_2$) written on the edges]

Function: $f = x_1 - x_2$

Gradient: $\frac{\partial f}{\partial x_1} = 1$, $\frac{\partial f}{\partial x_2} = -1$

Page 12:

Two neurons

[Figure: $x_3$ and $x_4$ feed a '+' node that outputs $x_2$; $x_1$ and $x_2$ feed the '−' node that outputs $f$]

Function: $f = x_1 - x_2 = x_1 - (x_3 + x_4)$

Page 13:

Two neurons

[Figure: the same circuit, with local gradients on the edges: 1 and −1 at the '−' node, 1 and 1 at the '+' node]

Function: $f = x_1 - x_2 = x_1 - (x_3 + x_4)$

Gradient: $\frac{\partial x_2}{\partial x_3} = 1$, $\frac{\partial x_2}{\partial x_4} = 1$. What about $\frac{\partial f}{\partial x_3}$?

Page 14:

Two neurons

[Figure: the same circuit; the value −1 is propagated back from $f$ through $x_2$ to both $x_3$ and $x_4$]

Function: $f = x_1 - x_2 = x_1 - (x_3 + x_4)$

Gradient (chain rule): $\frac{\partial f}{\partial x_3} = \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial x_3} = (-1) \times 1 = -1$
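The same computation, written as a short sketch (input values assumed for illustration): the forward pass evaluates each neuron, and the backward pass multiplies the local gradients.

```python
# Sketch of the two-neuron circuit f = x1 - (x3 + x4):
# forward computes values, backward applies the chain rule.
x1, x3, x4 = 2.0, 1.0, 0.5

# Forward pass
x2 = x3 + x4                  # '+' neuron
f = x1 - x2                   # '-' neuron

# Backward pass
df_dx2 = -1.0                 # local gradient of the '-' neuron w.r.t. x2
dx2_dx3 = dx2_dx4 = 1.0       # local gradients of the '+' neuron
df_dx3 = df_dx2 * dx2_dx3     # chain rule: -1
df_dx4 = df_dx2 * dx2_dx4     # chain rule: -1
print(df_dx3, df_dx4)
```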

Page 15:

Multiple input

[Figure: $x_3$, $x_4$, and $x_5$ all feed the '+' node that outputs $x_2$]

Function: $f = x_1 - x_2 = x_1 - (x_3 + x_4 + x_5)$

Gradient: $\frac{\partial x_2}{\partial x_5} = 1$

Page 16:

Multiple input

[Figure: the same circuit; −1 is propagated back along every edge into the '+' node]

Function: $f = x_1 - x_2 = x_1 - (x_3 + x_4 + x_5)$

Gradient: $\frac{\partial f}{\partial x_5} = \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial x_5} = -1$

Page 17:

Weights on the edges

[Figure: the edges from $x_3$ and $x_4$ into the '+' node now carry weights $w_3$ and $w_4$]

Function: $f = x_1 - x_2 = x_1 - (w_3 x_3 + w_4 x_4)$


Page 19:

Weights on the edges

[Figure: the same circuit; the backpropagated gradients $-x_3$ and $-x_4$ appear on the weight edges $w_3$ and $w_4$]

Function: $f = x_1 - x_2 = x_1 - (w_3 x_3 + w_4 x_4)$

Gradient: $\frac{\partial f}{\partial w_3} = \frac{\partial f}{\partial x_2} \frac{\partial x_2}{\partial w_3} = (-1) \times x_3 = -x_3$

Page 20:

Activation

๐œŽ ๐‘ฅ2

โˆ’1

๐‘ฅ1

1โˆ’ ๐‘“

Function: ๐‘“ = ๐‘ฅ1 โˆ’ ๐‘ฅ2 = ๐‘ฅ1 โˆ’ ๐œŽ ๐‘ค3๐‘ฅ3 + ๐‘ค4๐‘ฅ4

๐‘ฅ3

๐‘ฅ4

๐‘ค3

๐‘ค4

Page 21:

Activation

๐œŽ ๐‘ฅ2

โˆ’1

๐‘ฅ1

1โˆ’ ๐‘“

Function: ๐‘“ = ๐‘ฅ1 โˆ’ ๐‘ฅ2 = ๐‘ฅ1 โˆ’ ๐œŽ ๐‘ค3๐‘ฅ3 + ๐‘ค4๐‘ฅ4

Let ๐‘›๐‘’๐‘ก2 = ๐‘ค3๐‘ฅ3 + ๐‘ค4๐‘ฅ4

๐‘ฅ3

๐‘ฅ4

๐‘ค3

๐‘ค4

๐‘›๐‘’๐‘ก2

Page 22:

Activation

๐œŽ ๐‘ฅ2

โˆ’1

๐‘ฅ1

1โˆ’ ๐‘“

Function: ๐‘“ = ๐‘ฅ1 โˆ’ ๐‘ฅ2 = ๐‘ฅ1 โˆ’ ๐œŽ ๐‘ค3๐‘ฅ3 + ๐‘ค4๐‘ฅ4

Gradient:๐œ•๐‘“

๐œ•๐‘ค3=

๐œ•๐‘“

๐œ•๐‘ฅ2

๐œ•๐‘ฅ2

๐œ•๐‘›๐‘’๐‘ก2

๐œ•๐‘›๐‘’๐‘ก2

๐œ•๐‘ค3= โˆ’1 ร— ๐œŽโ€ฒ ร— ๐‘ฅ3 = โˆ’๐œŽโ€ฒ๐‘ฅ3

๐‘ฅ3

๐‘ฅ4

๐‘ค3

๐‘ค4

๐‘›๐‘’๐‘ก2

๐œ•๐‘ฅ2

๐œ•๐‘›๐‘’๐‘ก2= ๐œŽโ€ฒ

๐œ•๐‘›๐‘’๐‘ก2

๐œ•๐‘ค3= ๐‘ฅ3

Page 23:

Activation

๐œŽ ๐‘ฅ2

โˆ’1

๐‘ฅ1

1โˆ’ ๐‘“

Function: ๐‘“ = ๐‘ฅ1 โˆ’ ๐‘ฅ2 = ๐‘ฅ1 โˆ’ ๐œŽ ๐‘ค3๐‘ฅ3 + ๐‘ค4๐‘ฅ4

Gradient:๐œ•๐‘“

๐œ•๐‘ค3=

๐œ•๐‘“

๐œ•๐‘ฅ2

๐œ•๐‘ฅ2

๐œ•๐‘›๐‘’๐‘ก2

๐œ•๐‘›๐‘’๐‘ก2

๐œ•๐‘ค3= โˆ’1 ร— ๐œŽโ€ฒ ร— ๐‘ฅ3 = โˆ’๐œŽโ€ฒ๐‘ฅ3

๐‘ฅ3

๐‘ฅ4

๐‘ค3

๐‘ค4

๐‘›๐‘’๐‘ก2โˆ’๐œŽโ€ฒ

โˆ’๐œŽโ€ฒ๐‘ฅ3

Page 24:

Multiple paths

๐œŽ ๐‘ฅ2

โˆ’1

+ ๐‘ฅ1

1โˆ’ ๐‘“

Function: ๐‘“ = ๐‘ฅ1 โˆ’ ๐‘ฅ2 = (๐‘ฅ1+๐‘ฅ5) โˆ’ ๐œŽ ๐‘ค3๐‘ฅ3 + ๐‘ค4๐‘ฅ4

๐‘ฅ3

๐‘ฅ4

๐‘ค3

๐‘ค4

๐‘ฅ5

๐‘›๐‘’๐‘ก2


Page 26:

Multiple paths

๐œŽ ๐‘ฅ2

โˆ’1

+ ๐‘ฅ1

1โˆ’ ๐‘“

Function: ๐‘“ = ๐‘ฅ1 โˆ’ ๐‘ฅ2 = (๐‘ฅ3+๐‘ฅ5) โˆ’ ๐œŽ ๐‘ค3๐‘ฅ3 + ๐‘ค4๐‘ฅ4

Gradient:๐œ•๐‘“

๐œ•๐‘ฅ3=

๐œ•๐‘“

๐œ•๐‘ฅ2

๐œ•๐‘ฅ2

๐œ•๐‘›๐‘’๐‘ก2

๐œ•๐‘›๐‘’๐‘ก2

๐œ•๐‘ฅ3+

๐œ•๐‘“

๐œ•๐‘ฅ1

๐œ•๐‘ฅ1

๐œ•๐‘ฅ3= โˆ’1 ร— ๐œŽโ€ฒ ร— ๐‘ค3 + 1 ร— 1 = โˆ’๐œŽโ€ฒ๐‘ค3 + 1

๐‘ฅ3 ๐‘ค3

๐‘ค4

๐‘›๐‘’๐‘ก2
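A sketch of the sum-over-paths rule, with the same assumed sigmoid and values as before:

```python
import numpy as np

# Sketch: when x3 reaches f along two paths, the path gradients add:
# df/dx3 = -sigma'(net2)*w3 + 1 for f = (x3 + x5) - sigma(w3*x3 + w4*x4).
def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

x3, x4, x5 = 1.0, 0.5, 2.0            # assumed input values
w3, w4 = 0.3, -0.2                    # assumed weights

net2 = w3 * x3 + w4 * x4
sigma_prime = sigma(net2) * (1 - sigma(net2))

# Path through net2 -> x2, plus path through x1 = x3 + x5
df_dx3 = (-1.0) * sigma_prime * w3 + 1.0 * 1.0
print(df_dx3)                          # = -sigma' * w3 + 1
```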

Page 27:

Summary

• Forward pass to compute $f$

• Backward pass to compute the gradients

[Figure: a small network $x_1, x_2$ → $net_{11}$ → $\sigma$ → $h_{11}$ → $net_{21}$ → $\sigma$ → $h_{21}$ → '+' → $f$]
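A compact sketch of this forward/backward pattern on the pictured chain (the weights and the sigmoid are assumptions; the slide does not give them): the forward pass caches every intermediate value, and the backward pass reuses those caches.

```python
import numpy as np

# Sketch of forward-then-backward on x -> net11 -> sigma -> h11
# -> net21 -> sigma -> h21 -> f (the figure's final '+' node is
# omitted here for simplicity).
def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.5])          # inputs x1, x2 (assumed)
w1 = np.array([0.3, -0.2])        # weights into net11 (assumed)
w2 = 0.7                          # weight into net21 (assumed)

# Forward: compute f, caching net values for the backward pass
net11 = w1 @ x
h11 = sigma(net11)
net21 = w2 * h11
h21 = sigma(net21)
f = h21

# Backward: propagate df/d(.) from the output back to the weights
df_dnet21 = h21 * (1 - h21)                       # sigma'(net21)
df_dw2 = df_dnet21 * h11
df_dnet11 = df_dnet21 * w2 * h11 * (1 - h11)      # chain through h11
df_dw1 = df_dnet11 * x
print(df_dw1, df_dw2)
```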

Page 28:

Math form

Page 29:

Gradient descent

• Minimize the loss $L(\theta)$, where the hypothesis is parametrized by $\theta$

• Gradient descent:
  • Initialize $\theta^0$
  • $\theta^{t+1} = \theta^t - \eta_t \nabla L(\theta^t)$

Page 30:

Stochastic gradient descent (SGD)

• Suppose data points arrive one by one

• $L(\theta) = \frac{1}{n} \sum_{t=1}^{n} l(\theta, x_t, y_t)$, but we only know $l(\theta, x_t, y_t)$ at time $t$

• Idea: simply do what you can based on local information
  • Initialize $\theta^0$
  • $\theta^{t+1} = \theta^t - \eta_t \nabla l(\theta^t, x_t, y_t)$

Page 31:

Mini-batch

• Instead of one data point, work with a small batch of $b$ points

  $(x_{tb+1}, y_{tb+1}), \ldots, (x_{tb+b}, y_{tb+b})$

• Update rule:

  $\theta^{t+1} = \theta^t - \eta_t \nabla \frac{1}{b} \sum_{1 \le i \le b} l(\theta^t, x_{tb+i}, y_{tb+i})$

• Typical batch size: $b = 128$
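The same loop with mini-batches, as a sketch (same assumed least-squares setup as above; $b = 128$ as on the slide):

```python
import numpy as np

# Sketch: each update averages b per-point gradients,
# for l(theta, x, y) = (theta . x - y)^2 / 2.
rng = np.random.default_rng(0)
theta_true = np.array([1.0, -2.0])       # assumed ground truth
theta = np.zeros(2)
b, eta = 128, 0.1                        # batch size from the slide; assumed step size

for t in range(200):
    X = rng.normal(size=(b, 2))          # batch (x_{tb+1}, ..., x_{tb+b})
    y = X @ theta_true
    grad = (X @ theta - y) @ X / b       # average of the b per-point gradients
    theta = theta - eta * grad
print(theta)                             # approaches theta_true
```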