
Page 1:

TTIC 31230, Fundamentals of Deep Learning

David McAllester, Winter 2018

Computation Graphs and Backpropagation

The Educational Framework (EDF)

Minibatching

Explicit Index Notation

1

Page 2:

Computation Graphs

A computation graph (sometimes called a “computational graph”) is a sequence of assignment statements.

L0[j] = Relu(∑_i W0[j, i] x[i] + b0[j])

L1[y] = σ(∑_j W1[y, j] L0[j] + b1[y])

P(y) = (1/Z) e^{L1[y]}
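
A minimal NumPy sketch of this forward computation (not EDF code; the shapes and random values are assumptions made only for illustration):

import numpy as np

def relu(z): return np.maximum(z, 0.)
def sigmoid(z): return 1. / (1. + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=10)                  # input vector x[i]
W0 = rng.normal(size=(20, 10)); b0 = rng.normal(size=20)
W1 = rng.normal(size=(5, 20));  b1 = rng.normal(size=5)

L0 = relu(W0 @ x + b0)                   # L0[j] = Relu(∑_i W0[j,i] x[i] + b0[j])
L1 = sigmoid(W1 @ L0 + b1)               # L1[y] = σ(∑_j W1[y,j] L0[j] + b1[y])
P = np.exp(L1) / np.exp(L1).sum()        # P(y) = (1/Z) e^{L1[y]}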

2

Page 3:

Computation Graphs

A simpler example:

ℓ = √(x² + y²)

u = x²

v = y²

r = u + v

ℓ = √r

A computation graph defines a DAG (a directed acyclic graph) where the variables are the nodes. Each assignment determines one or more directed edges from the left-hand variable to the right-hand variables.

3

Page 4:

Computation Graphs

1. u = x²

2. w = y²

3. r = u + w

4. ℓ = √r

For each variable z we can consider ∂ℓ/∂z.

Gradients are computed in the reverse order.

(4) ∂ℓ/∂r = 1/(2√r)

(3) ∂ℓ/∂u = ∂ℓ/∂r
(3) ∂ℓ/∂w = ∂ℓ/∂r
(2) ∂ℓ/∂y = (∂ℓ/∂w) ∗ (2y)
(1) ∂ℓ/∂x = (∂ℓ/∂u) ∗ (2x)
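
These values can be checked directly in Python (the inputs x = 3, y = 4 are chosen only for illustration):

import math

x, y = 3.0, 4.0
u = x**2                          # (1)
w = y**2                          # (2)
r = u + w                         # (3)
ell = math.sqrt(r)                # (4)

# Gradients in reverse order.
d_r = 1. / (2. * math.sqrt(r))    # (4)
d_u = d_r                         # (3)
d_w = d_r                         # (3)
d_y = d_w * 2 * y                 # (2)
d_x = d_u * 2 * x                 # (1)

print(ell, d_x, d_y)              # 5.0 0.6 0.8, i.e. (x/ℓ, y/ℓ)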

4

Page 5:

A More Abstract Example

y = f(x)

z = g(y, x)

u = h(z)

ℓ = u

For now assume all values are scalars.

We will “backpropagate” the assignments in the reverse order.

5

Page 6:

Backpropagation

y = f(x)

z = g(y, x)

u = h(z)

ℓ = u

∂ℓ/∂u = 1

6

Page 7:

Backpropagation

y = f(x)

z = g(y, x)

u = h(z)

ℓ = u

∂ℓ/∂u = 1

∂ℓ/∂z = (∂ℓ/∂u) (∂h/∂z) (this uses the value of z)

7

Page 8:

Backpropagation

y = f(x)

z = g(y, x)

u = h(z)

ℓ = u

∂ℓ/∂u = 1

∂ℓ/∂z = (∂ℓ/∂u) (∂h/∂z)

∂ℓ/∂y = (∂ℓ/∂z) (∂g/∂y) (this uses the values of y and x)

8

Page 9:

Backpropagation

y = f(x)

z = g(y, x)

u = h(z)

ℓ = u

∂ℓ/∂u = 1

∂ℓ/∂z = (∂ℓ/∂u) (∂h/∂z)

∂ℓ/∂y = (∂ℓ/∂z) (∂g/∂y)

∂ℓ/∂x = ??? Oops, we need to add up multiple occurrences (x occurs on the right-hand side of two assignments).

9

Page 10:

Backpropagation

y = f(x)

z = g(y, x)

u = h(z)

ℓ = u

We let x.grad be an attribute (as in Python) of node x.

We will initialize x.grad to zero.

During backpropagation we will accumulate contributions to ∂ℓ/∂x into x.grad.

10

Page 11:

Backpropagation

y = f(x)

z = g(y, x)

u = h(z)

ℓ = u

z.grad = y.grad = x.grad = 0

u.grad = 1

Loop Invariant: For any variable w whose definition has not yet been processed, we have that w.grad is ∂ℓ/∂w as defined by the set of assignments already processed.

11

Page 12:

Backpropagation

y = f(x)

z = g(y, x)

u = h(z)

ℓ = u

z.grad = y.grad = x.grad = 0

u.grad = 1

Loop Invariant: For any variable w whose definition has not yet been processed, we have that w.grad is ∂ℓ/∂w as defined by the set of assignments already processed.

z.grad += u.grad ∗ ∂h/∂z

12

Page 13:

Backpropagation

y = f(x)

z = g(y, x)

u = h(z)

ℓ = u

z.grad = y.grad = x.grad = 0

u.grad = 1

Loop Invariant: For any variable w whose definition has not yet been processed, we have that w.grad is ∂ℓ/∂w as defined by the set of assignments already processed.

z.grad += u.grad ∗ ∂h/∂z
y.grad += z.grad ∗ ∂g/∂y
x.grad += z.grad ∗ ∂g/∂x

Page 14:

Backpropagation

y = f(x)

z = g(y, x)

u = h(z)

ℓ = u

z.grad = y.grad = x.grad = 0

u.grad = 1

z.grad += u.grad ∗ ∂h/∂z
y.grad += z.grad ∗ ∂g/∂y
x.grad += z.grad ∗ ∂g/∂x
x.grad += y.grad ∗ ∂f/∂x
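
A scalar Python sketch of this accumulation scheme; the particular f, g, and h (and hence their partials) are placeholders chosen for illustration:

import math

x = 0.5
y = math.sin(x)                  # y = f(x)
z = y * x                        # z = g(y, x)
u = z * z                        # u = h(z)
ell = u

x_grad = y_grad = z_grad = 0.0
u_grad = 1.0
z_grad += u_grad * 2 * z         # ∂h/∂z for h(z) = z²
y_grad += z_grad * x             # ∂g/∂y for g(y, x) = y·x
x_grad += z_grad * y             # ∂g/∂x: first contribution to x
x_grad += y_grad * math.cos(x)   # ∂f/∂x: second contribution to x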

14

Page 15:

The Vector-Valued Case

y = f(x)

z = g(y, x)

u = h(z)

ℓ = u

Now suppose the variables can be vector-valued.

The loss ℓ is still a scalar.

In this case x.grad = ∇x ℓ.

These are now vectors of the same dimension with

x.grad[i] = ∂ℓ/∂x[i]

15

Page 16:

The Tensor-Valued Case

y = f(x)

z = g(y, x)

u = h(z)

ℓ = u

Now suppose the variables can be tensor-valued (the values are multi-dimensional arrays). The loss is still a scalar.

The computation is now a “tensor flow”.

16

Page 17:

The Tensor-Valued Case

For the tensor case we simply view tensors as a special case of vectors. The indices of x.grad are the same as the indices of x.

x.grad.shape = x.shape

The backpropagation equation for y = f(x) is

x.grad[i1, ..., in] = y.grad[j1, ..., jk] ∇x f(x)[j1, ..., jk, i1, ..., in]

j1, ..., jk are indices of y and i1, ..., in are indices of x.

Indices not on the left of the equation are implicitly summed.
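
For example, with a vector-valued y and a matrix-valued x, the implicit sum over j can be written with np.einsum (the shapes here are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(0)
y_grad = rng.normal(size=2)            # y has a single index j
grad_f = rng.normal(size=(2, 3, 4))    # ∇x f(x)[j, i1, i2]

# x.grad[i1, i2] = y.grad[j] ∇x f(x)[j, i1, i2], summing over j
x_grad = np.einsum('j,jkl->kl', y_grad, grad_f)
assert x_grad.shape == (3, 4)          # x.grad.shape = x.shape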

17

Page 18:

The EDF Framework

The Educational Framework (EDF) is a simple Python/NumPy implementation of a deep learning framework.

In EDF we write

y = F(x)

z = G(y, x)

u = H(z)

ℓ = u

This is Python code where variables are bound to objects.

18

Page 19:

The EDF Framework

y = F(x)

z = G(y, x)

u = H(z)

ℓ = u

This is Python code where variables are bound to objects.

x is an object in the class Input.

y is an object in the class F.

z is an object in the class G.

u and ℓ are the same object in the class H.

Page 20:

y = F(x)

class F(CompNode):

    def __init__(self, x):
        CompNodes.append(self)
        self.x = x

    def forward(self):
        self.value = f(self.x.value)

    def backward(self):
        self.x.addgrad(self.grad ∗ ∇x f(x))  # needs x.value

20

Page 21:

Nodes of the Computation Graph

There are three kinds of nodes in a computation graph: inputs, parameters, and computation nodes.

class Input:

    def __init__(self):
        pass

    def addgrad(self, delta):
        pass

class CompNode:  # initialization is handled by the subclass

    def addgrad(self, delta):
        self.grad += delta

21

Page 22:

class Parameter:

    def __init__(self, value):
        Parameters.append(self)
        self.value = value

    def addgrad(self, delta):
        # sums over the minibatch
        self.grad += np.sum(delta, axis=0)

    def SGD(self):
        self.value -= learning_rate * self.grad

22

Page 23:

MLP in EDF

The following Python code constructs the computation graph of a multi-layer perceptron (MLP) with one hidden layer.

L1 = Relu(Affine(Phi1,x))

Q = Softmax(Sigmoid(Affine(Phi2,L1)))

ell = LogLoss(Q,y)

Here x and y are input computation nodes whose values have been set, and Phi1 and Phi2 are “parameter packages” (a matrix and a bias vector in this case). We have computation node classes Affine, Relu, Sigmoid, and LogLoss, each of which has a forward and a backward method.

Page 24:

class Affine(CompNode):

    def __init__(self, Phi, x):
        CompNodes.append(self)
        self.x = x
        self.Phi = Phi

    def forward(self):
        self.value = (np.matmul(self.x.value,
                                self.Phi.w.value)
                      + self.Phi.b.value)

Page 25:

    def backward(self):
        self.x.addgrad(
            np.matmul(self.grad,
                      self.Phi.w.value.transpose()))

        self.Phi.b.addgrad(self.grad)

        self.Phi.w.addgrad(self.x.value[:,:,np.newaxis]
                           * self.grad[:,np.newaxis,:])

25

Page 26:

The Core of EDF

def Forward():
    for c in CompNodes: c.forward()

def Backward(loss):
    for c in CompNodes + Parameters: c.grad = 0
    loss.grad = 1/nBatch
    for c in CompNodes[::-1]: c.backward()

def SGD():
    for p in Parameters:
        p.SGD()
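
A training step then chains these three calls. The loop below is a sketch; the minibatches iterator and the direct assignment to the .value of the input nodes are assumptions about the surrounding code, not part of the slides:

for xb, yb in minibatches:   # hypothetical iterator over (input, label) batches
    x.value = xb             # set the values of the Input nodes
    y.value = yb
    Forward()
    Backward(ell)
    SGD()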

26

Page 27:

Minibatching

The running time of EDF, and of any framework, is greatly improved by minibatching.

Minibatching: We run some number of instances together (or in parallel) and then do a parameter update based on the average gradients of the instances of the batch.

For NumPy, minibatching is not so much about parallelism as about making the vector operations larger so that the vector operations dominate the slowness of Python. On a GPU, minibatching allows parallelism over the batch elements.

27

Page 28:

Minibatching

With minibatching each input value and each computed value is actually a batch of values.

We add a batch index as an additional first tensor dimension for each input and computed node.

For example, if a given input is a D-dimensional vector then the value of an input node has shape (B, D) where B is the size of the minibatch.

Parameters do not have a batch index.
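
For example (the batch size B = 64 and dimension D = 784 below are illustrative):

import numpy as np

B, D = 64, 784
x_value = np.zeros((B, D))     # value of an input node: batch index first
W_value = np.zeros((D, 100))   # value of a parameter: no batch index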

Page 29:

class Parameter:

    def __init__(self, value):
        Parameters.append(self)
        self.value = value

    def addgrad(self, delta):
        # sums over the minibatch
        self.grad += np.sum(delta, axis=0)

    def SGD(self):
        self.value -= learning_rate * self.grad

29

Page 30:

forward and backward must handle minibatching

The forward and backward methods must be written to handle minibatching. We will consider some examples.

30

Page 31:

An MLP

L1 = Relu(Affine(Phi1,x))

P = Softmax(Sigmoid(Affine(Phi2,L1)))

ell = LogLoss(P,y)

31

Page 32:

The Sigmoid Class

The Sigmoid and Relu classes just work.

class Sigmoid(CompNode):

    def __init__(self, x):
        CompNodes.append(self)
        self.x = x

    def forward(self):
        self.value = 1. / (1. + np.exp(-self.x.value))

    def backward(self):
        self.x.grad += (self.grad
                        * self.value
                        * (1. - self.value))

Page 33:

y = Affine([W,B],x)

forward:

y.value[b,j] = x.value[b,i] W.value[i,j]

y.value[b,j] += B.value[j]

backward:

x.grad[b,i] += y.grad[b,j] W.value[i,j]

W.grad[i,j] += y.grad[b,j] x.value[b,i]

B.grad[j] += y.grad[b,j]

Page 34:

class Affine(CompNode):

    def __init__(self, Phi, x):
        CompNodes.append(self)
        self.x = x
        self.Phi = Phi

    def forward(self):
        # y.value[b,j] = x.value[b,i] W.value[i,j]
        # y.value[b,j] += B.value[j]
        self.value = (np.matmul(self.x.value,
                                self.Phi.W.value)
                      + self.Phi.B.value)

Page 35:

    def backward(self):
        # x.grad[b,i] += y.grad[b,j] W.value[i,j]
        self.x.addgrad(
            np.matmul(self.grad,
                      self.Phi.W.value.transpose()))

        # B.grad[j] += y.grad[b,j]
        self.Phi.B.addgrad(self.grad)

        # W.grad[i,j] += y.grad[b,j] x.value[b,i]
        self.Phi.W.addgrad(self.x.value[:,:,np.newaxis]
                           * self.grad[:,np.newaxis,:])

35

Page 36:

Explicit Index Notation

A[x1, ..., xn] += e[x1, ..., xn, y1, ..., yk]

abbreviates

for x1, ..., xn, y1, ..., yk : A[x1, ..., xn] += e[x1, ..., xn, y1, ..., yk]

A[x1, ..., xn] = e[x1, ..., xn, y1, ..., yk]

abbreviates

for x1, ..., xn : A[x1, ..., xn] = 0
A[x1, ..., xn] += e[x1, ..., xn, y1, ..., yk]
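
For example, B.grad[j] += y.grad[b, j] sums over the batch index b, which NumPy performs with np.sum over axis 0. A small check (shapes illustrative):

import numpy as np

y_grad = np.ones((4, 5))       # a (B, K) gradient
B_grad = np.zeros(5)
for j in range(5):
    for b in range(4):         # index not on the left: implicitly summed
        B_grad[j] += y_grad[b, j]
assert np.allclose(B_grad, np.sum(y_grad, axis=0))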

36

Page 37:

Affine Transformation with Batch Index

y[b, i] = x[b, j] W[j, i]

y[b, i] += B[i]

37

Page 38:

The Swap Rule for Tensor Products

For backprop swap an input with the output and add “grad”.

y[b, i] = x[b, j] W[j, i]

x.grad[b, j] += y.grad[b, i] W[j, i]

W.grad[j, i] += x[b, j] y.grad[b, i]

y[b, i] += B[i]

B.grad[i] += y.grad[b, i]
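
The swap rule can be checked numerically with np.einsum (the shapes below are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                   # x[b, j]
W = rng.normal(size=(3, 5))                   # W[j, i]
y = np.einsum('bj,ji->bi', x, W)              # y[b,i] = x[b,j] W[j,i]

y_grad = rng.normal(size=y.shape)
x_grad = np.einsum('bi,ji->bj', y_grad, W)    # x.grad[b,j] += y.grad[b,i] W[j,i]
W_grad = np.einsum('bj,bi->ji', x, y_grad)    # W.grad[j,i] += x[b,j] y.grad[b,i]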

Page 39:

The Swap Rule for Functions of Tensor Products

For backprop swap an input with the output and add “grad” and f′.

y[b, i] = f(x[b, j] W[j, i])

x.grad[b, j] += y.grad[b, i] f′ W[j, i]

W.grad[j, i] += y.grad[b, i] f′ x[b, j]

Here f is scalar to scalar, and f′ is its derivative evaluated at the argument x[b, j] W[j, i].

39

Page 40:

NumPy: Reshaping Tensors

For an ndarray x (tensor) we have that x.shape is a tuple of dimensions. The product of the dimensions is the number of numbers.

In NumPy an ndarray (tensor) x can be reshaped into any shape with the same number of numbers.
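
For example:

import numpy as np

x = np.arange(24).reshape(2, 3, 4)   # 2 * 3 * 4 = 24 numbers
y = x.reshape(6, 4)                  # any shape with 24 numbers works
z = x.reshape(-1)                    # flatten to shape (24,)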

40

Page 41:

NumPy: Broadcasting

Shapes can contain dimensions of size 1.

Dimensions of size 1 are treated as “wild card” dimensions in operations on tensors.

x.shape = (5, 1)

y.shape = (1, 10)

z = x ∗ y

z.shape = (5, 10)

z[i, j] = x[i, 0] ∗ y[0, j]

Page 42:

class Affine(CompNode):

    ...

    def backward(self):
        ...
        # W.grad[i,j] += y.grad[b,j] x.value[b,i]
        self.Phi.W.addgrad(self.grad[:,np.newaxis,:]
                           * self.x.value[:,:,np.newaxis])

class Parameter:

    ...

    def addgrad(self, delta):
        self.grad += np.sum(delta, axis=0)

42

Page 43:

NumPy: Broadcasting

When a scalar is added to a matrix, the scalar is reshaped to shape (1, 1) so that it is added to each element of the matrix.

When a vector of shape (k) is added to a matrix, the vector is reshaped to (1, k) so that it is added to each row of the matrix.

In general, when two tensors of different order (number of dimensions) are added, unit dimensions are prepended to the shape of the tensor of smaller order to make the orders match.
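
For example:

import numpy as np

M = np.zeros((3, 4))
v = np.ones(4)            # shape (4,) is treated as (1, 4)
print((M + v).shape)      # (3, 4): v is added to each row
print((M + 2.0).shape)    # (3, 4): the scalar is added to each element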

43

Page 44:

Appendix: The Jacobian Matrix

Consider y = f(x) in the vector-valued case.

In the vector-valued case the backpropagation equation is

x.grad += (y.grad)⊤ ∇x f(x)

where

(∇x f(x))[i, j] = J[i, j] = ∂f(x)[i]/∂x[j]

The matrix J[i, j] is the Jacobian of f.

Index Types: j is an x-index and i is a y-index.
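
A small numerical check of this equation; the particular f is chosen only for illustration:

import numpy as np

# f(x) = (x[0]*x[1], x[1]**2), so J[i,j] = ∂f(x)[i]/∂x[j].
x = np.array([2., 3.])
J = np.array([[x[1], x[0]],
              [0.,   2 * x[1]]])

y_grad = np.array([1., 1.])
x_grad = y_grad @ J       # x.grad += (y.grad)⊤ ∇x f(x)  →  [3., 8.]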

44

Page 45:

END