Introduction to neural networks
16-385 Computer Vision, Spring 2019, Lecture 19
http://www.cs.cmu.edu/~16385/


Page 1:

Introduction to neural networks

16-385 Computer Vision, Spring 2019, Lecture 19
http://www.cs.cmu.edu/~16385/

Page 2:

Course announcements

• Homework 5 has been posted and is due on April 10th.
  - Any questions about the homework?
  - How many of you have looked at/started/finished homework 5?

• Homework 6 will be posted on Wednesday.
  - There will be some adjustment in deadlines (and homework lengths), so that homework 7 is not squeezed into 5 days.

Page 3:

Overview of today's lecture

• Perceptron.
• Neural networks.
• Training perceptrons.
• Gradient descent.
• Backpropagation.
• Stochastic gradient descent.

Page 4:

Slide credits

Most of these slides were adapted from:

• Kris Kitani (16-385, Spring 2017).

• Noah Snavely (Cornell University).

• Fei-Fei Li (Stanford University).

• Andrej Karpathy (Stanford University).

Page 5:

Perceptron

Page 6:

1957 The Perceptron (Rosenblatt)
1950s Age of the Perceptron
1969 Perceptrons (Minsky, Papert)
1980s Age of the Neural Network
1986 Backpropagation (Hinton)
1990s Age of the Graphical Model
2000s Age of the Support Vector Machine
2010s Age of the Deep Network

deep learning = known algorithms + computing power + big data

Page 7:

Page 8:

The Perceptron

inputs

weights

output

sum, sign function (Heaviside step function)

Page 9:

Aside: Inspiration from Biology

Neural nets/perceptrons are loosely inspired by biology. But they certainly are not a model of how the brain works, or even how neurons work.

Page 10:

N-d binary vector

perceptron is just one line of code!

sign of zero is +1
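A sketch of that one line in C++ (the function name and loop are illustrative assumptions; the slides' own code is not reproduced here):

#include <cstddef>
#include <vector>

// perceptron prediction: the sign of a dot product
int perceptron_predict(const std::vector<float>& x, const std::vector<float>& w)
{
    float a = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) a += x[i] * w[i];
    return (a >= 0.0f) ? +1 : -1;  // sign of zero is +1
}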

Page 11:

initialized to 0

Page 12:

observation (1,-1)

label -1

Page 13:

prediction: sign(w · x) = sign(0) = +1

observation (1,-1)

label -1

Page 14:

observation (1,-1)

label -1

Page 15:

update w

observation (1,-1)

label -1

Page 16:

update w

observation (1,-1)

label -1

label -1, prediction +1 → no match!
x = (1,-1), old w = (0,0), updated w = (-1,1)

Page 17:

observation (-1,1)

label +1

w = (-1,1)

Page 18:

observation (-1,1)

label +1

prediction: sign(w · x) = sign(2) = +1
x = (-1,1), w = (-1,1)

Page 19:

observation (-1,1)

label +1

prediction: sign(w · x) = sign(2) = +1
x = (-1,1), w = (-1,1)

Page 20:

update w

observation (-1,1)

label +1

label +1, prediction +1 → match!
x = (-1,1), w = (-1,1) → w unchanged

Page 21:

Page 22:

update w

Page 23:

update w

Page 24:

Page 25:

update w

Page 26:

repeat …
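Putting the walkthrough together, a minimal sketch of the perceptron learning rule (the names and loop structure are illustrative assumptions):

#include <cstddef>
#include <vector>

// One pass over the data: predict, and on a mismatch nudge the weights
// toward the example (w <- w + y * x).
void perceptron_train(const std::vector<std::vector<float>>& X,
                      const std::vector<int>& y,   // labels in {-1, +1}
                      std::vector<float>& w)       // initialized to 0
{
    for (std::size_t i = 0; i < X.size(); ++i) {
        float a = 0.0f;
        for (std::size_t d = 0; d < w.size(); ++d) a += w[d] * X[i][d];
        int pred = (a >= 0.0f) ? +1 : -1;          // sign of zero is +1
        if (pred != y[i])                          // no match? update w
            for (std::size_t d = 0; d < w.size(); ++d)
                w[d] += y[i] * X[i][d];
    }
    // repeat … (keep looping over the data until nothing changes)
}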

Page 27:

The Perceptron

inputs

weights

output

sum, activation function (e.g., step, sigmoid, tanh, ReLU)

bias

Page 28:

Another way to draw it…

inputs

weights

output

Activation Function (e.g., sigmoid function of weighted sum)

(1) Combine the sum and activation function
(2) Suppress the bias term (less clutter)

Page 29:

Programming the 'forward pass'

#include <cstddef>
#include <cmath>
#include <vector>
using std::vector;

// Activation function (sigmoid, logistic function)
float f(float a)
{
    return 1.0f / (1.0f + std::exp(-a));
}

// dot product of input and weight vectors
float dot(const vector<float>& x, const vector<float>& w)
{
    float s = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) s += x[i] * w[i];
    return s;
}

// Perceptron function (logistic regression): the forward-pass output
float perceptron(vector<float> x, vector<float> w)
{
    float a = dot(x, w);
    return f(a);
}

Page 30:

Neural networks

Page 31:

Connect a bunch of perceptrons together …

Page 32:

Neural Network: a collection of connected perceptrons

Connect a bunch of perceptrons together …

Page 33:

Neural Network: a collection of connected perceptrons

Connect a bunch of perceptrons together …

How many perceptrons in this neural network?

Page 34:

Neural Network: a collection of connected perceptrons

‘one perceptron’

Connect a bunch of perceptrons together …

Page 35:

Neural Network: a collection of connected perceptrons

‘two perceptrons’

Connect a bunch of perceptrons together …

Page 36:

Neural Network: a collection of connected perceptrons

‘three perceptrons’

Connect a bunch of perceptrons together …

Page 37:

Neural Network: a collection of connected perceptrons

‘four perceptrons’

Connect a bunch of perceptrons together …

Page 38:

Neural Network: a collection of connected perceptrons

‘five perceptrons’

Connect a bunch of perceptrons together …

Page 39:

Neural Network: a collection of connected perceptrons

‘six perceptrons’

Connect a bunch of perceptrons together …

Page 40:

‘input’ layer

Some terminology…

…also called a Multi-layer Perceptron (MLP)

Page 41:

‘input’ layer‘hidden’ layer

Some terminology…

…also called a Multi-layer Perceptron (MLP)

Page 42:

‘input’ layer‘hidden’ layer

‘output’ layer

Some terminology…

…also called a Multi-layer Perceptron (MLP)

Page 43:

this layer is a 'fully connected layer':
all pairwise neurons between layers are connected

Page 44:

so is this

all pairwise neurons between layers are connected

Page 45:

How many neurons (perceptrons)?

How many weights (edges)?

How many learnable parameters total?

Page 46:

How many neurons (perceptrons)?

How many weights (edges)?

How many learnable parameters total?

4 + 2 = 6

Page 47:

How many neurons (perceptrons)?

How many weights (edges)?

How many learnable parameters total?

4 + 2 = 6

(3 x 4) + (4 x 2) = 20

Page 48:

How many neurons (perceptrons)?

How many weights (edges)?

How many learnable parameters total?

4 + 2 = 6

(3 x 4) + (4 x 2) = 20

20 + 4 + 2 = 26 (20 weights plus the 4 + 2 bias terms)
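A quick sketch checking that arithmetic (the layer sizes 3 → 4 → 2 are read off the counts above):

#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
    std::vector<int> sizes = {3, 4, 2};          // input, hidden, output
    int weights = 0, biases = 0;
    for (std::size_t l = 1; l < sizes.size(); ++l) {
        weights += sizes[l - 1] * sizes[l];      // (3 x 4) + (4 x 2) = 20
        biases  += sizes[l];                     // 4 + 2 = 6
    }
    std::printf("weights=%d biases=%d total=%d\n",
                weights, biases, weights + biases);  // total = 26
    return 0;
}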

Page 49:

performance usually tops out at 2-3 layers; deeper networks don't really improve performance...

...with the exception of convolutional networks for images

Page 50:

Training perceptrons

Page 51:

Let’s start easy

Page 52:

world’s smallest perceptron!

What does this look like?

Page 53:

world’s smallest perceptron!

(a.k.a. line equation, linear regression: f(x) = wx)

Page 54:

Learning a Perceptron

Given a set of samples and a Perceptron,
estimate the parameters of the Perceptron.

Page 55:

Given training data:

x     f(x)
1     1.1
2     1.9
3.5   3.4
10    10.1

What do you think the weight parameter is?

Page 56:

Given training data:

x     f(x)
1     1.1
2     1.9
3.5   3.4
10    10.1

What do you think the weight parameter is? (here, w ≈ 1)

not so obvious as the network gets more complicated, so we use …

Page 57:

An Incremental Learning Strategy (gradient descent)

Given several examples and a perceptron

Page 58:

An Incremental Learning Strategy (gradient descent)

Given several examples and a perceptron,
modify the weight w such that the perceptron output f(x) gets 'closer' to the true label y

Page 59:

An Incremental Learning Strategy (gradient descent)

Given several examples and a perceptron,
modify the weight w such that the perceptron output f(x) gets 'closer' to the true label y

f(x): perceptron output
y: true label
w: perceptron parameter

Page 60:

An Incremental Learning Strategy (gradient descent)

Given several examples and a perceptron,
modify the weight w such that the perceptron output f(x) gets 'closer' to the true label y

f(x): perceptron output
y: true label
w: perceptron parameter

what does 'closer' mean?

Page 61:

Before diving into gradient descent, we need to understand the…

Loss Function

defines what it means to be close to the true solution

YOU get to choose the loss function!
(some are better than others depending on what you want to do)

Page 62:

Squared Error (L2): a popular loss function (why?)
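The formula itself did not survive extraction; in the notation above it is presumably the usual

$L\big(y, f(x)\big) = \tfrac{1}{2}\big(y - f(x)\big)^2$

It is popular partly because it is smooth and convex, and the 1/2 cancels when differentiating.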

Page 63:

Four common loss functions (plots): L1 loss, L2 loss, zero-one loss, hinge loss

Page 64:

World’s Smallest Perceptron!

back to the…

function of ONE parameter!

(a.k.a. line equation, linear regression)

Page 65:

Learning a Perceptron

Given a set of samples and a Perceptron,
estimate the parameter of the Perceptron.

what is this activation function?

Page 66:

Learning a Perceptron

Given a set of samples and a Perceptron,
estimate the parameter of the Perceptron.

what is this activation function? A linear function!

Page 67:

Learning Strategy (gradient descent)

Given several examples and a perceptron,
modify the weight w such that the perceptron output f(x) gets 'closer' to the true label y

f(x): perceptron output
y: true label
w: perceptron parameter

Page 68:

Let’s demystify this process first…

Code to train your perceptron:

Page 69:

Let’s demystify this process first…

Code to train your perceptron:

just one line of code!
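A sketch of what that loop plausibly looks like, assuming the one-parameter perceptron f(x) = wx and the squared-error loss (the names and learning rate are illustrative):

#include <cstddef>
#include <vector>

// gradient descent for f(x) = w*x with L = 0.5*(y - w*x)^2
float train(const std::vector<float>& x, const std::vector<float>& y,
            float lr, int epochs)
{
    float w = 0.0f;
    for (int e = 0; e < epochs; ++e)
        for (std::size_t i = 0; i < x.size(); ++i)
            w -= lr * (w * x[i] - y[i]) * x[i];  // just one line of code!
    return w;
}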

Now where does this come from?

Page 70:

Gradient descent

Page 71:

(partial) derivatives tell us how much one variable affects another

Page 72:

Two ways to think about them:
1. Slope of a function
2. Knobs on a machine

Page 73:

1. Slope of a function: describes the slope around a point

Page 74:

2. Knobs on a machine: describes how each 'knob' affects the output (input → machine → output)

a small change in a parameter tells you how much the output will change

Page 75:

Gradient descent: given a fixed point on a function, move in the direction opposite of the gradient

Page 76:

Gradient descent:

update rule:
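The rule on the slide did not survive extraction; the standard form, with step size $\eta$, is

$w \leftarrow w - \eta \, \frac{\partial L}{\partial w}$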

Page 77:

Backpropagation

Page 78:

World’s Smallest Perceptron!

back to the…

function of ONE parameter!

(a.k.a. line equation, linear regression)

Page 79:

Training the world’s smallest perceptron

this should be the

gradient of the loss

function

Now where does this come from?

This is just gradient

descent, that means…

Page 80:

The derivative is the rate at which this (the loss function) will change per unit change of this (the weight parameter).

Let's compute the derivative…

Page 81:

Compute the derivative

That means the weight update for gradient descent is:
(just shorthand) move in the direction of the negative gradient
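Assuming the squared-error loss from earlier, $L = \tfrac{1}{2}(y - wx)^2$, the missing derivation works out to

$\frac{dL}{dw} = -(y - wx)\,x \qquad\Rightarrow\qquad w \leftarrow w - \eta\,(wx - y)\,x$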

Page 82:

Gradient Descent (world’s smallest perceptron)

For each sample

1. Predict

a. Forward pass

b. Compute Loss

2. Update

a. Back Propagation

b. Gradient update

Page 83:

Training the world’s smallest perceptron

Page 84:

world’s (second) smallest

perceptron!

function of two parameters!

Page 85:

Gradient Descent

For each sample

1. Predict

a. Forward pass

b. Compute Loss

2. Update

a. Back Propagation

b. Gradient update

we just need to compute partial derivatives for this network

Page 86:

Derivative computation

Why do we have partial derivatives now?

Page 87:

Derivative computation

Gradient Update
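A sketch of those partials, assuming a two-input linear perceptron $f(\mathbf{x}) = w_1 x_1 + w_2 x_2$ and the squared-error loss:

$\frac{\partial L}{\partial w_1} = -\big(y - f(\mathbf{x})\big)\,x_1 \qquad \frac{\partial L}{\partial w_2} = -\big(y - f(\mathbf{x})\big)\,x_2$

Each weight then gets its own update $w_k \leftarrow w_k - \eta\,\partial L/\partial w_k$: the 'two lines' of the next slide.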

Page 88:

Gradient Descent

For each sample

1. Predict

a. Forward pass

b. Compute Loss

2. Update

a. Back Propagation

b. Gradient update (adjustable step size): two lines now

(side computation to track loss; not needed for backprop)

Page 89:

We haven’t seen a lot of ‘propagation’ yet

because our perceptrons only had one layer…

Page 90:

multi-layer perceptron

function of FOUR parameters and FOUR layers!

Page 91:

(network diagram: input, layer 1 → weight, sum, activation → hidden layer 2 → weight, activation → hidden layer 3 → weight, activation → output, layer 4)


Page 100:

Entire network can be written out as one long equation

What is known? What is unknown?

We need to train the network:

Page 101:

Entire network can be written out as a long equation

What is known? What is unknown?

known

We need to train the network:

Page 102:

Entire network can be written out as a long equation

What is known? What is unknown?

unknown

We need to train the network:

(the activation function sometimes has parameters)

Page 103:

Learning an MLP

Given a set of samples and an MLP,
estimate the parameters of the MLP.

Page 104:

Gradient Descent

For each random sample

1. Predict

a. Forward pass

b. Compute Loss

2. Update

a. Back Propagation (vector of parameter partial derivatives)

b. Gradient update (vector of parameter update equations)

Page 105:

So we need to compute the partial derivatives

Page 106:

Remember, the partial derivative describes how each parameter affects the loss (the loss layer). So, how do you compute it?

Page 107:

The Chain Rule

Page 108:

rest of the network

Intuitively, the effect of the weight on the loss function: the loss depends on the output, the output depends on the activation, and the activation depends on the weight.

According to the chain rule…
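Writing $a$ for the weighted sum feeding the activation and $f = \sigma(a)$ for the output (symbols assumed, since the slide's figure is lost), the chain rule gives

$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial f} \cdot \frac{\partial f}{\partial a} \cdot \frac{\partial a}{\partial w}$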

Page 109:

rest of the network

Chain Rule!

Page 110:

rest of the network

Just the partial derivative of the L2 loss
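For the squared-error loss $L = \tfrac{1}{2}(y - f)^2$, that factor is $\frac{\partial L}{\partial f} = -(y - f)$.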

Page 111:

rest of the network

Let’s use a Sigmoid function

Page 112:

rest of the network

Let’s use a Sigmoid function

Page 113:

rest of the network

Page 114:

Page 115:

already computed: re-use (propagate)!

Page 116:

The Chain Rule

a.k.a. backpropagation

Page 117:

The chain rule says… every quantity depends on the one after it: the loss depends on the output, the output depends on the last activation, and each activation depends on the previous layer's weight and activation, all the way back to the input.

Page 118:

The chain rule says… the same dependency chain holds for every weight, and the downstream factors are already computed: re-use (propagate)!
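A minimal sketch of that re-use, assuming a chain network $f(x) = \sigma(w_3\,\sigma(w_2\,\sigma(w_1 x)))$ with squared-error loss (the weight names are illustrative):

#include <cmath>

float sigmoid(float a) { return 1.0f / (1.0f + std::exp(-a)); }

// one step: the forward pass stores every activation, and the backward
// pass re-uses (propagates) each upstream factor exactly once
void backprop_step(float x, float y, float& w1, float& w2, float& w3, float lr)
{
    // forward pass
    float a1 = sigmoid(w1 * x);
    float a2 = sigmoid(w2 * a1);
    float f  = sigmoid(w3 * a2);

    // backward pass (chain rule), with L = 0.5*(y - f)^2
    float d3 = -(y - f) * f * (1.0f - f);   // dL/d(sum at layer 4)
    float d2 = d3 * w3 * a2 * (1.0f - a2);  // re-uses d3
    float d1 = d2 * w2 * a1 * (1.0f - a1);  // re-uses d2

    // gradient update
    w3 -= lr * d3 * a2;
    w2 -= lr * d2 * a1;
    w1 -= lr * d1 * x;
}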


Page 122:

Gradient Descent

For each sample

1. Predict

a. Forward pass

b. Compute Loss

2. Update

a. Back Propagation

b. Gradient update

Page 123:

vector of parameter update equations

vector of parameter partial derivatives

Gradient Descent

For each sample

1. Predict

a. Forward pass

b. Compute Loss

2. Update

a. Back Propagation

b. Gradient update

Page 124:

Stochastic gradient descent

Page 125:

What we are truly minimizing:

$\min_\theta \sum_{i=1}^{N} L\big(y_i, f_{\mathrm{MLP}}(x_i)\big)$

The gradient is:

Page 126:

What we are truly minimizing:

$\min_\theta \sum_{i=1}^{N} L\big(y_i, f_{\mathrm{MLP}}(x_i)\big)$

The gradient is:

$\sum_{i=1}^{N} \frac{\partial L\big(y_i, f_{\mathrm{MLP}}(x_i)\big)}{\partial \theta}$

What we use for gradient update is:

Page 127:

What we are truly minimizing:

$\min_\theta \sum_{i=1}^{N} L\big(y_i, f_{\mathrm{MLP}}(x_i)\big)$

The gradient is:

$\sum_{i=1}^{N} \frac{\partial L\big(y_i, f_{\mathrm{MLP}}(x_i)\big)}{\partial \theta}$

What we use for gradient update is:

$\frac{\partial L\big(y_i, f_{\mathrm{MLP}}(x_i)\big)}{\partial \theta}$ for some $i$

Page 128:

vector of parameter update equations

vector of parameter partial derivatives

Stochastic Gradient Descent

For each random sample

1. Predict

a. Forward pass

b. Compute Loss

2. Update

a. Back Propagation

b. Gradient update

Page 129:

How do we select which sample?

Page 130:

How do we select which sample?

• Select randomly!

Do we need to use only one sample?

Page 131:

How do we select which sample?

• Select randomly!

Do we need to use only one sample?

• You can use a minibatch of size B < N.

Why not do gradient descent with all samples?

Page 132:

How do we select which sample?

• Select randomly!

Do we need to use only one sample?

• You can use a minibatch of size B < N.

Why not do gradient descent with all samples?

• It’s very expensive when N is large (big data).

Do I lose anything by using stochastic GD?

Page 133:

How do we select which sample?

• Select randomly!

Do we need to use only one sample?

• You can use a minibatch of size B < N.

Why not do gradient descent with all samples?

• It’s very expensive when N is large (big data).

Do I lose anything by using stochastic GD?

• Same convergence guarantees and complexity!

• Better generalization.
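A sketch of these answers in code, re-using the earlier one-parameter perceptron f(x) = wx with squared-error loss (the sampling setup is an illustrative assumption):

#include <cstddef>
#include <random>
#include <vector>

// one SGD step on a random minibatch of size B < N, averaging the gradients
float sgd_step(const std::vector<float>& x, const std::vector<float>& y,
               float w, float lr, int B, std::mt19937& rng)
{
    std::uniform_int_distribution<std::size_t> pick(0, x.size() - 1);
    float grad = 0.0f;
    for (int b = 0; b < B; ++b) {
        std::size_t i = pick(rng);         // select randomly!
        grad += (w * x[i] - y[i]) * x[i];  // dL/dw for sample i
    }
    return w - lr * grad / B;              // touches B samples, not all N
}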

Page 134:

References

Basic reading: No standard textbooks yet! Some good resources:

• https://sites.google.com/site/deeplearningsummerschool/

• http://www.deeplearningbook.org/

• http://www.cs.toronto.edu/~hinton/absps/NatureDeepReview.pdf