neural networks basics and parallelization/vectorization
Post on 03-Dec-2021
Neural Networks
basics and parallelization/vectorization
Kenjiro Taura
1 / 40
Contents
1 What is machine learning?
    A simple linear regression
    A handwritten digits recognition
2 Training
    A simple gradient descent
    Stochastic gradient descent
3 Chain Rule
4 Back Propagation in Action
2 / 40
What is machine learning?
input: a set of training data
$$D = \{ (x_i, t_i) \mid i = 0, 1, \cdots \}$$
  each $x_i$ is normally a real vector (i.e. many real values)
  each $t_i$ is a real value (regression), 0/1 (binary classification), a discrete value (multi-class classification), etc., depending on the task
goal: supervised machine learning tries to find a function $f$ that “matches” the training data well, i.e.
$$f(x_i) \approx t_i \quad \text{for } (x_i, t_i) \in D$$
put formally, find $f$ that minimizes an error or a loss:
$$L(f; D) \equiv \sum_{(x_i,t_i)\in D} \mathrm{err}(f(x_i), t_i),$$
where $\mathrm{err}(y_i, t_i)$ is a function that measures an “error” or a “distance” between the predicted output and the true value
4 / 40
Machine learning as an optimization problem
finding a good function from the space of literally all possible functions is neither easy nor meaningful
we normally fix a search space of functions ($\mathcal{F}$) parameterized by $w$ and find a good function $f_w \in \mathcal{F}$ (parametric models)
the task is then to find the value of $w$ that minimizes the loss:
$$L(w; D) \equiv \sum_{(x_i,t_i)\in D} \mathrm{err}(f_w(x_i), t_i)$$
5 / 40
A simple example (linear regression)
training data $D = \{ (x_i, t_i) \mid i = 0, 1, \cdots \}$
  $x_i$ : a real value
  $t_i$ : a real value
let the search space be the set of polynomials of degree ≤ 2. a function is then parameterized by $w = (w_0\ w_1\ w_2)$, i.e.
$$f_w(x) \equiv w_2 x^2 + w_1 x + w_0$$
let the error function be a simple square distance:
$$\mathrm{err}(y, t) \equiv (y - t)^2$$
the task is to find $w = (w_0, w_1, w_2)$ that minimizes:
$$L(w; D) = \sum_{(x_i,t_i)\in D} \mathrm{err}(f_w(x_i), t_i) = \sum_{(x_i,t_i)\in D} (w_2 x_i^2 + w_1 x_i + w_0 - t_i)^2$$
7 / 40
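The loss above can be evaluated directly from its definition. The following is a minimal sketch in plain Python; the function names `f`, `err` and `loss` mirror the symbols on the slide, while the five data points are hypothetical, generated from $t = 1 + 2x + 3x^2$ so that the true parameters are known.

```python
# Sketch: sum-of-squares loss for the model f_w(x) = w2*x^2 + w1*x + w0.
# The training data is a made-up example, not taken from the slides.

def f(w, x):
    """Evaluate the polynomial with parameters w = (w0, w1, w2)."""
    w0, w1, w2 = w
    return w2 * x * x + w1 * x + w0

def err(y, t):
    """Squared distance between the prediction y and the target t."""
    return (y - t) ** 2

def loss(w, data):
    """L(w; D): sum of err(f_w(x_i), t_i) over all training pairs."""
    return sum(err(f(w, x), t) for x, t in data)

data = [(x, 1.0 + 2.0 * x + 3.0 * x * x) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
loss_at_truth = loss((1.0, 2.0, 3.0), data)  # parameters that generated the data
loss_at_zero = loss((0.0, 0.0, 0.0), data)   # an arbitrary poor guess
```

The loss is zero at the generating parameters and positive elsewhere, which is what the minimization exploits.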
A somewhat more realistic example: image (digits) recognition
training data $D = \{ (x_i, t_i) \mid i = 0, 1, \cdots \}$
  $x_i$ : a vector of pixel values of an image
  $t_i$ : a “one hot” vector representing the class $\in \{0, 1, \cdots, 9\}$ (e.g. ${}^t(0\ 0\ 0\ 0\ 1\ 0\ 0\ 0\ 0\ 0)$ represents “4”)
  we write $\mathbf{i}$ to mean the one-hot vector $v$ having $v_i = 1$
$$D = \{(\text{[image]}, \mathbf{4}), (\text{[image]}, \mathbf{9}), \ldots\}$$
the search space: the following composition parameterized by three matrices $W_0$, $W_1$ and $W_2$
$$f_{W_0,W_1,W_2}(x) \equiv \mathrm{softmax}(W_2\,\mathrm{ReLU}(W_1\,\mathrm{ReLU}(W_0 x)))$$
9 / 40
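The composition can be written out as a forward pass. This is only a sketch in plain Python (lists instead of a matrix library); the layer sizes and the weight values below are hypothetical, and subtracting the maximum inside `softmax` is a common numerical safeguard rather than something stated on the slide.

```python
import math

def matvec(W, x):
    """Matrix-vector product W x, with W given as a list of rows."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def relu(x):
    """Apply max(v, 0) to every component."""
    return [max(v, 0.0) for v in x]

def softmax(x):
    """exp(x_i) / sum_k exp(x_k); the shift by max(x) leaves the result unchanged."""
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]

def f(W0, W1, W2, x):
    """softmax(W2 ReLU(W1 ReLU(W0 x)))"""
    return softmax(matvec(W2, relu(matvec(W1, relu(matvec(W0, x))))))

# Hypothetical tiny network: 4 inputs -> 3 hidden -> 3 hidden -> 10 classes.
W0 = [[0.1 * (i - j) for j in range(4)] for i in range(3)]
W1 = [[0.2 * (i + j - 2) for j in range(3)] for i in range(3)]
W2 = [[0.05 * (i - 3 * j) for j in range(3)] for i in range(10)]
y = f(W0, W1, W2, [1.0, 0.5, -0.5, 2.0])  # a 10-vector of class probabilities
```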
A handwritten digits recognition
the value $f_{W_0,W_1,W_2}(x)$ is a 10-vector representing probabilities that $x$ belongs to each of the ten classes
the loss function is the cross entropy, commonly used in multi-class classification ($\cdot$ : a dot product)
$$\mathrm{err}(y, t) = H(t, y) \equiv -t \cdot \log y$$
the task is to find $W_0$, $W_1$ and $W_2$ that minimize:
$$L(W_0, W_1, W_2; D) = \sum_{(x_i,t_i)\in D} H(t_i, \mathrm{softmax}(W_2\,\mathrm{ReLU}(W_1\,\mathrm{ReLU}(W_0 x_i))))$$
[Figure: computation graph: x → (× W0) → x1 → ReLU → x2 → (× W1) → x3 → ReLU → x4 → (× W2) → x5 → softmax → y; y and t → H → e]
10 / 40
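As a small illustration of this loss, the sketch below sums the cross entropy over a data set. The predictor `uniform`, which assigns probability 0.1 to every class, and the two placeholder inputs are hypothetical stand-ins for a real network and real images.

```python
import math

def one_hot(c, n=10):
    """The one-hot vector with a 1 at index c."""
    return [1.0 if i == c else 0.0 for i in range(n)]

def cross_entropy(t, y):
    """H(t, y) = -t . log y; components with t_i = 0 contribute nothing."""
    return -sum(ti * math.log(yi) for ti, yi in zip(t, y) if ti != 0.0)

def total_loss(predict, data):
    """Sum of H(t_i, predict(x_i)) over the data set."""
    return sum(cross_entropy(t, predict(x)) for x, t in data)

def uniform(x):
    """Hypothetical placeholder predictor: ignores x, returns equal probabilities."""
    return [0.1] * 10

data = [("image-a", one_hot(4)), ("image-b", one_hot(9))]
loss = total_loss(uniform, data)  # each sample contributes -log(0.1)
```

With a one-hot target only the predicted probability of the true class enters the loss, so a confident correct prediction drives the loss toward zero.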
How to find the minimizing parameter?
it boils down to minimizing a function that takes lots of parameters $w$
$$L(w; D) = \sum_{(x_i,t_i)\in D} \mathrm{err}(f_w(x_i), t_i),$$
for which we compute the derivative of $L$ with respect to $w$ and move $w$ in the opposite direction (gradient descent; GD)
$$w = w - \eta\, {}^t\!\left(\frac{\partial L}{\partial w}\right)$$
($\eta$ : a scalar controlling the learning rate)
repeat this until $L(w; D)$ converges
13 / 40
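The update rule can be tried on a one-parameter problem first. The sketch below minimizes the hypothetical loss $L(w) = (w - 3)^2$, whose derivative is $2(w - 3)$; the learning rate and step count are arbitrary choices, and a fixed number of steps stands in for a real convergence test.

```python
def gradient_descent(grad, w, eta, steps):
    """Repeat w = w - eta * dL/dw for a fixed number of steps."""
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

# L(w) = (w - 3)^2, so dL/dw = 2 (w - 3); the minimum is at w = 3.
w_final = gradient_descent(lambda w: 2.0 * (w - 3.0), w=0.0, eta=0.1, steps=100)
```

Each step here shrinks the distance to the minimum by a factor of 0.8; a learning rate above 1.0 would make the same loop diverge.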
A linear regression example
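recall that in the linear regression example:
$$L(w; D) = \sum_{(x_i,t_i)\in D} (w_2 x_i^2 + w_1 x_i + w_0 - t_i)^2$$
differentiate $L$ by $w = {}^t(w_0\ w_1\ w_2)$ to get:
$$\frac{\partial L}{\partial w} = \sum_{(x_i,t_i)\in D} 2(w_2 x_i^2 + w_1 x_i + w_0 - t_i)\,(1\ \ x_i\ \ x_i^2)$$
(remark: we used the chain rule)
so you repeat:
$$w = w - \eta \sum_{(x_i,t_i)\in D} 2(w_2 x_i^2 + w_1 x_i + w_0 - t_i) \begin{pmatrix} 1 \\ x_i \\ x_i^2 \end{pmatrix}$$
until $L(w; D)$ converges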
14 / 40
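The repeated update for this example might look as follows in plain Python. The data set, the learning rate 0.05 and the step count are assumptions chosen so that the iteration is stable; the data is generated from $w = (1, 2, 3)$ so the result can be checked.

```python
def gradient(w, data):
    """dL/dw for L = sum (w2 x^2 + w1 x + w0 - t)^2, as a list (g0, g1, g2)."""
    w0, w1, w2 = w
    g = [0.0, 0.0, 0.0]
    for x, t in data:
        r = 2.0 * (w2 * x * x + w1 * x + w0 - t)  # common factor from the chain rule
        g[0] += r            # times d/dw0 = 1
        g[1] += r * x        # times d/dw1 = x
        g[2] += r * x * x    # times d/dw2 = x^2
    return g

def fit(data, eta, steps):
    """Full-batch gradient descent starting from w = 0."""
    w = [0.0, 0.0, 0.0]
    for _ in range(steps):
        g = gradient(w, data)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

# Hypothetical noise-free data generated from t = 1 + 2x + 3x^2.
data = [(x, 1.0 + 2.0 * x + 3.0 * x * x) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
w = fit(data, eta=0.05, steps=2000)
```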
A problem of the gradient descent
the loss function we want to minimize is normally a summation over all training data:
$$L(w; D) = \sum_{(x_i,t_i)\in D} \mathrm{err}(f_w(x_i), t_i)$$
the gradient descent method just described:
  1 computes $\frac{\partial}{\partial w}\mathrm{err}(f_w(x_i), t_i)$ for each training data $(x_i, t_i) \in D$, with the current value of $w$
  2 sums them over the whole data set and then updates $w$
it is commonly observed that convergence becomes faster when we update $w$ more “incrementally” → Stochastic Gradient Descent (SGD)
15 / 40
SGD
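repeat:
  1 randomly draw a subset of training data $D'$ (a mini batch; $D' \subset D$)
  2 compute the gradient of the loss over the mini batch
$$\frac{\partial L(w; D')}{\partial w} = \sum_{(x_i,t_i)\in D'} \frac{\partial}{\partial w}\mathrm{err}(f_w(x_i), t_i)$$
  3 update $w$
$$w = w - \eta\, {}^t\!\left(\frac{\partial L(w; D')}{\partial w}\right)$$
“update sooner rather than later”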
17 / 40
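The three steps can be sketched for the quadratic regression model used earlier. Everything specific here is an assumption for illustration: the data set, the batch size of 2, the learning rate, the step count, and the fixed random seed that makes the run repeatable.

```python
import random

def minibatch_gradient(w, batch):
    """Gradient of the squared-error loss summed over one mini batch."""
    w0, w1, w2 = w
    g = [0.0, 0.0, 0.0]
    for x, t in batch:
        r = 2.0 * (w2 * x * x + w1 * x + w0 - t)
        g[0] += r
        g[1] += r * x
        g[2] += r * x * x
    return g

def sgd(data, eta, batch_size, steps, seed=0):
    rng = random.Random(seed)                        # fixed seed: repeatable run
    w = [0.0, 0.0, 0.0]
    for _ in range(steps):
        batch = rng.sample(data, batch_size)         # 1. draw a mini batch D'
        g = minibatch_gradient(w, batch)             # 2. gradient over D'
        w = [wi - eta * gi for wi, gi in zip(w, g)]  # 3. update w
    return w

# Hypothetical noise-free data generated from t = 1 + 2x + 3x^2.
data = [(x, 1.0 + 2.0 * x + 3.0 * x * x) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
w = sgd(data, eta=0.05, batch_size=2, steps=5000)
```

Because this toy data has no noise, every mini-batch gradient vanishes at the true parameters and the iterates settle there; with noisy data SGD would instead fluctuate around the minimum unless the learning rate is decreased.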
SGD and neural networks
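in neural networks, a function is a composition of many stages, each represented by a lot of parameters
$$\begin{aligned}
x_1 &= f_1(w_1; x) \\
x_2 &= f_2(w_2; x_1) \\
&\ \ \vdots \\
y &= f_n(w_n; x_{n-1}) \\
e &= \mathrm{err}(y, t)
\end{aligned}$$
we need to differentiate $e$ by $w_1, \cdots, w_n$
[Figure: one stage of the network, $x_{i+1} = f(W x_i)$, with input $x_i$ and parameters $W$]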
18 / 40
The digits recognition example
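$$\begin{aligned}
x_1 &= W_0 x \\
x_2 &= \mathrm{ReLU}(x_1) \\
x_3 &= W_1 x_2 \\
x_4 &= \mathrm{ReLU}(x_3) \\
x_5 &= W_2 x_4 \\
y &= \mathrm{softmax}(x_5) \\
e &= H(t, y)
\end{aligned}$$
you need to get the differentiation of $e$ by $W_0$, $W_1$ and $W_2$ done right
[Figure: computation graph: x → (× W0) → x1 → ReLU → x2 → (× W1) → x3 → ReLU → x4 → (× W2) → x5 → softmax → y; y and t → H → e]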
19 / 40
Differentiating multivariable functions
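$x = {}^t(x_0\ \cdots\ x_{n-1}) \in \mathbb{R}^n$ (a column vector)
$f(x)$ : a scalar
definition: a derivative of $f$ with respect to $x$, written $f'(x)$ or $\frac{\partial f}{\partial x}$, is a row $n$-vector $a$ s.t.
$$f(x + \Delta x) \approx f(x) + a\,\Delta x = f(x) + \sum_{i=0}^{n-1} a_i \Delta x_i$$
(a row vector × a column vector, yielding a 1 × 1 matrix, identified with a scalar)
when it exists,
$$a = \left(\frac{\partial f}{\partial x_0}\ \cdots\ \frac{\partial f}{\partial x_{n-1}}\right)$$
21 / 40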
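The defining approximation can be checked numerically. In this sketch the function `f`, the point `x` and the displacement `dx` are hypothetical examples, and the row vector $a$ is estimated by central differences rather than derived by hand.

```python
def numerical_gradient(f, x, h=1e-6):
    """Central-difference estimate of (df/dx_0 ... df/dx_{n-1})."""
    grad = []
    for i in range(len(x)):
        xp = list(x)
        xp[i] += h
        xm = list(x)
        xm[i] -= h
        grad.append((f(xp) - f(xm)) / (2.0 * h))
    return grad

def f(x):
    """Hypothetical example: f(x) = x0^2 + 3 x0 x1."""
    return x[0] * x[0] + 3.0 * x[0] * x[1]

x = [1.0, 2.0]
a = numerical_gradient(f, x)   # analytically (2 x0 + 3 x1, 3 x0) = (8, 3)
dx = [1e-3, -2e-3]
linear_estimate = f(x) + sum(ai * di for ai, di in zip(a, dx))  # f(x) + a dx
actual = f([xi + di for xi, di in zip(x, dx)])                  # f(x + dx)
```

The two values differ only by a term of second order in `dx`, which is what the ≈ in the definition allows.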
The Chain Rule
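consider a function $f$ that depends on $y = (y_0, \cdots, y_{m-1}) \in \mathbb{R}^m$, each of which in turn depends on $x = (x_0, \cdots, x_{n-1}) \in \mathbb{R}^n$
the chain rule (math textbook version):
$$\frac{\partial f}{\partial x_i} = \sum_{0 \le j < m} \frac{\partial f}{\partial y_j}\frac{\partial y_j}{\partial x_i} \quad (0 \le i < n)$$
[Figure: dependency graph in which $x = (x_0 \cdots x_{n-1})$ feeds $y = (y_0 \cdots y_{m-1})$, which feeds $f$]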
22 / 40
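A small worked instance may make the index version concrete. The functions below are hypothetical: $y_0 = x_0 x_1$, $y_1 = x_0 + x_1$ and $f(y) = y_0 y_1$, so that $f = x_0^2 x_1 + x_0 x_1^2$ can also be differentiated directly for comparison.

```python
def chain_rule_gradient(df_dy, dy_dx):
    """df/dx_i = sum_j df/dy_j * dy_j/dx_i, where dy_dx[j][i] = dy_j/dx_i."""
    n = len(dy_dx[0])
    m = len(df_dy)
    return [sum(df_dy[j] * dy_dx[j][i] for j in range(m)) for i in range(n)]

x0, x1 = 2.0, 3.0
y0, y1 = x0 * x1, x0 + x1      # y depends on x
df_dy = [y1, y0]               # f(y) = y0 * y1
dy_dx = [[x1, x0],             # dy0/dx0, dy0/dx1
         [1.0, 1.0]]           # dy1/dx0, dy1/dx1
grad = chain_rule_gradient(df_dy, dy_dx)

# Direct differentiation of f = x0^2 x1 + x0 x1^2 for comparison.
direct = [2.0 * x0 * x1 + x1 ** 2, x0 ** 2 + 2.0 * x0 * x1]
```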
The Chain Rule : intuition
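[Figure: dependency graph in which $x = (x_0 \cdots x_{n-1})$ feeds $y = (y_0 \cdots y_{m-1})$, which feeds $f$]
say you increase an input variable $x_i$ by $\Delta x_i$; each $y_j$ will increase by
$$\approx \frac{\partial y_j}{\partial x_i}\Delta x_i,$$
which will contribute to increasing the final output ($f$) by
$$\approx \frac{\partial f}{\partial y_j}\frac{\partial y_j}{\partial x_i}\Delta x_i$$
summing these contributions over all $j$ gives the chain rule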
23 / 40
Chain Rule
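master the following “index-free” version for neural networks
$x$, $y$ : a scalar (a single component in a vector/matrix/high-dimensional array)
the chain rule (ML practitioner’s version):
$$\frac{\partial f}{\partial x} = \sum_{\text{all variables } y \text{ that } x \text{ directly affects}} \frac{\partial f}{\partial y}\frac{\partial y}{\partial x}$$
[Figure: dependency graph x → y → f]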
24 / 40
Chain Rule and “Back Propagation”
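The chain rule allows you to compute
$$\frac{\partial L}{\partial x},$$
the derivative of the loss with respect to a variable, from
$$\frac{\partial L}{\partial y},$$
the derivatives of the loss with respect to upstream variables
$$\frac{\partial L}{\partial x} = \sum_{\text{all variables } y \text{ a step ahead of } x} \frac{\partial L}{\partial y}\frac{\partial y}{\partial x}$$
[Figure: computation graph: x → (× W0) → x1 → ReLU → x2 → (× W1) → x3 → ReLU → x4 → (× W2) → x5 → softmax → y; y and t → H → e]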
25 / 40
Component functions
we used the following functions
  Convolution(W; x) : linear/local transformations to images
  Linear(W; x) : linear or “fully connected” layer
  ReLU(x) : rectified linear units
  softmax(x)
  H(t, x) : cross entropy
we summarize their definitions and their derivatives
27 / 40
Convolution
it takes
  an image = 2D pixels × a number of channels
  a “filter” or a “kernel”, which is essentially a small image
and slides the filter over all pixels of the input and takes the local inner product at each pixel
an illustration of a single channel 2D convolution (imagine a grayscale image)
[Figure: filter (kernel) W * input image x = output image y]
28 / 40
Convolution (a single channel version)
$w_{i,j}$ : a filter ($-K \le i \le K$, $-K \le j \le K$)
$x_{i,j}$ : an input image ($0 \le i < H$, $0 \le j < W$)
$y_{i,j}$ : an output image ($0 \le i < H$, $0 \le j < W$)
[Figure: filter (kernel) W * input image x = output image y]
assuming $x_{i,j} = 0$ for out-of-range (underflowed/overflowed) indices for brevity,
$$y_{i,j} = \sum_{-K \le i' \le K,\ -K \le j' \le K} w_{i',j'}\, x_{i+i',j+j'} \quad (\text{for each } i, j)$$
29 / 40
Convolution
say the input has $IC$ channels and the output $OC$ channels
$w_{oc,ic,i,j}$ : a filter ($0 \le ic < IC$, $0 \le oc < OC$)
$x_{ic,i,j}$ : an input image
$y_{oc,i,j}$ : an output image
$$y_{oc,i,j} = \sum_{ic,i',j'} w_{oc,ic,i',j'}\, x_{ic,i+i',j+j'} \quad (\text{for each } oc, i, j)$$
the actual code does this for each image in a batch
$$y_{b,oc,i,j} = \sum_{ic,i',j'} w_{oc,ic,i',j'}\, x_{b,ic,i+i',j+j'} \quad (\text{for each } b, oc, i, j)$$
30 / 40
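The batched formula translates almost literally into nested loops. The sketch below is a deliberately naive plain-Python version, not the course's actual code: arrays are nested lists, filter offsets $i', j' \in [-K, K]$ are stored shifted by $K$, out-of-range input pixels count as zero, and the 3×3 image and all-ones filter are hypothetical test inputs.

```python
def conv2d(w, x, K):
    """y[b][oc][i][j] = sum over ic, di, dj of w[oc][ic][di+K][dj+K] * x[b][ic][i+di][j+dj]."""
    B, IC, H, W = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    OC = len(w)
    y = [[[[0.0] * W for _ in range(H)] for _ in range(OC)] for _ in range(B)]
    for b in range(B):
        for oc in range(OC):
            for i in range(H):
                for j in range(W):
                    s = 0.0
                    for ic in range(IC):
                        for di in range(-K, K + 1):
                            for dj in range(-K, K + 1):
                                # skip pixels outside the image (treated as zero)
                                if 0 <= i + di < H and 0 <= j + dj < W:
                                    s += w[oc][ic][di + K][dj + K] * x[b][ic][i + di][j + dj]
                    y[b][oc][i][j] = s
    return y

# One 3x3 single-channel image and one all-ones 3x3 filter (K = 1).
x = [[[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]]]
w = [[[[1.0] * 3 for _ in range(3)]]]
y = conv2d(w, x, K=1)
```

Every output element is independent of the others, so the four outer loops (over b, oc, i, j) are the natural targets for parallelization and vectorization.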
Convolution (Back propagation)
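$$\frac{\partial L}{\partial x_{b,ic,i+i',j+j'}} = \sum_{b',oc,i,j} \frac{\partial L}{\partial y_{b',oc,i,j}} \frac{\partial y_{b',oc,i,j}}{\partial x_{b,ic,i+i',j+j'}} = \sum_{oc,i,j} \frac{\partial L}{\partial y_{b,oc,i,j}}\, w_{oc,ic,i',j'}$$
(the input pixel $(i+i', j+j')$ is held fixed, so $i'$ and $j'$ vary with $i$ and $j$ in the sums)
$$\frac{\partial L}{\partial w_{oc,ic,i',j'}} = \sum_{b,oc',i,j} \frac{\partial L}{\partial y_{b,oc',i,j}} \frac{\partial y_{b,oc',i,j}}{\partial w_{oc,ic,i',j'}} = \sum_{b,i,j} \frac{\partial L}{\partial y_{b,oc,i,j}}\, x_{b,ic,i+i',j+j'}$$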
31 / 40
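Both sums can be accumulated in a single pass over the same loop nest as the forward computation: each product $w \cdot x$ that contributes to an output element passes that element's gradient back to the $x$ and the $w$ involved. The following is a sketch under the same assumptions as a naive forward loop (nested lists, offsets stored shifted by $K$, zero padding); the names `gy`, `gx` and `gw` for $\partial L/\partial y$, $\partial L/\partial x$ and $\partial L/\partial w$ are chosen here, and the inputs are hypothetical.

```python
def conv2d_backward(w, x, gy, K):
    """Return (gx, gw) given gy[b][oc][i][j] = dL/dy[b][oc][i][j]."""
    B, IC, H, W = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    OC = len(w)
    F = 2 * K + 1
    gx = [[[[0.0] * W for _ in range(H)] for _ in range(IC)] for _ in range(B)]
    gw = [[[[0.0] * F for _ in range(F)] for _ in range(IC)] for _ in range(OC)]
    for b in range(B):
        for oc in range(OC):
            for i in range(H):
                for j in range(W):
                    g = gy[b][oc][i][j]
                    for ic in range(IC):
                        for di in range(-K, K + 1):
                            for dj in range(-K, K + 1):
                                if 0 <= i + di < H and 0 <= j + dj < W:
                                    # y[b][oc][i][j] contains the product
                                    # w[oc][ic][di+K][dj+K] * x[b][ic][i+di][j+dj]
                                    gx[b][ic][i + di][j + dj] += g * w[oc][ic][di + K][dj + K]
                                    gw[oc][ic][di + K][dj + K] += g * x[b][ic][i + di][j + dj]
    return gx, gw

# Hypothetical check: L = sum of all outputs, so dL/dy is all ones.
x = [[[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]]]
w = [[[[1.0] * 3 for _ in range(3)]]]
gy = [[[[1.0] * 3 for _ in range(3)]]]
gx, gw = conv2d_backward(w, x, gy, K=1)
```

Unlike the forward pass, different (b, oc, i, j) iterations here add into the same `gx` and `gw` elements, so a parallel version needs reductions or a reordering of the loops.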
Linear (a.k.a. Fully Connected Layer)
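definition:
$$y = \mathrm{Linear}(W; x) \equiv Wx, \qquad y_i = \sum_j W_{ij} x_j$$
derivatives:
  by $W$
$$\frac{\partial y_{i'}}{\partial W_{ij}} = \begin{cases} x_j & (i' = i) \\ 0 & (i' \ne i) \end{cases}$$
  by $x$
$$\frac{\partial y_i}{\partial x_j} = W_{ij}$$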
32 / 40
Linear (a.k.a. Fully Connected Layer)
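back propagation:
  $\frac{\partial L}{\partial W}$
$$\frac{\partial L}{\partial W_{ij}} = \sum_{i'} \frac{\partial L}{\partial y_{i'}} \frac{\partial y_{i'}}{\partial W_{ij}} = \frac{\partial L}{\partial y_i}\, x_j$$
$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y} \times x \quad \text{(outer product)}$$
  $\frac{\partial L}{\partial x}$
$$\frac{\partial L}{\partial x_j} = \sum_i \frac{\partial L}{\partial y_i} \frac{\partial y_i}{\partial x_j} = \sum_i W_{ij} \frac{\partial L}{\partial y_i}$$
$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y}\, W \quad \text{(vector-matrix product)}$$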
33 / 40
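The two results, an outer product for $W$ and a vector-matrix product for $x$, can be sketched in plain Python as follows. The names `gy`, `gW` and `gx` (for $\partial L/\partial y$, $\partial L/\partial W$ and $\partial L/\partial x$) and the numeric values are hypothetical.

```python
def linear_forward(W, x):
    """y_i = sum_j W_ij x_j"""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def linear_backward(W, x, gy):
    """Given gy[i] = dL/dy_i, return (gW, gx)."""
    # dL/dW_ij = dL/dy_i * x_j: the outer product of gy and x
    gW = [[gyi * xj for xj in x] for gyi in gy]
    # dL/dx_j = sum_i W_ij * dL/dy_i: the row vector gy times the matrix W
    gx = [sum(W[i][j] * gy[i] for i in range(len(W))) for j in range(len(x))]
    return gW, gx

W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
x = [1.0, 0.0, -1.0]
gy = [1.0, 2.0]
y = linear_forward(W, x)
gW, gx = linear_backward(W, x, gy)
```

Note that `gW` has the same shape as `W` and `gx` the same length as `x`, which is a quick sanity check for any backward routine.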
ReLU
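definition (scalar ReLU): for $x \in \mathbb{R}$, define
$$\mathrm{relu}(x) \equiv \max(x, 0)$$
derivatives of relu: for $y = \mathrm{relu}(x)$,
$$\frac{\partial y}{\partial x} = \begin{cases} 1 & (x > 0) \\ 0 & (x \le 0) \end{cases} = \max(\mathrm{sign}(x), 0)$$
[Figure: graph of y = relu(x)]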
34 / 40
ReLU
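definition (vector ReLU): for a vector $x \in \mathbb{R}^n$, define ReLU as the application of relu to each component
$$\mathrm{ReLU}(x) \equiv \begin{pmatrix} \mathrm{relu}(x_0) \\ \vdots \\ \mathrm{relu}(x_{n-1}) \end{pmatrix}$$
derivatives of ReLU:
$$\frac{\partial y_j}{\partial x_i} = \begin{cases} \max(\mathrm{sign}(x_i), 0) & (i = j) \\ 0 & (i \ne j) \end{cases}$$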
35 / 40
ReLU
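back propagation:
$$\frac{\partial L}{\partial x_j} = \sum_i \frac{\partial L}{\partial y_i} \frac{\partial y_i}{\partial x_j} = \frac{\partial L}{\partial y_j} \frac{\partial y_j}{\partial x_j} = \begin{cases} \dfrac{\partial L}{\partial y_j} & (x_j > 0) \\ 0 & (x_j \le 0) \end{cases}$$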
36 / 40
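In code, the backward step is a component-wise mask. A minimal sketch, using the convention from the derivative slide that the derivative at $x = 0$ is taken to be 0; `gy` stands for $\partial L/\partial y$ and the values are hypothetical.

```python
def relu_forward(x):
    """Apply max(v, 0) to every component."""
    return [max(v, 0.0) for v in x]

def relu_backward(x, gy):
    """dL/dx_j = dL/dy_j where x_j > 0, and 0 elsewhere."""
    return [g if v > 0.0 else 0.0 for v, g in zip(x, gy)]

x = [-1.5, 0.0, 2.0]
y = relu_forward(x)
gx = relu_backward(x, [10.0, 20.0, 30.0])
```

The backward function needs the forward input `x` (or equivalently the sign of the output), so an implementation has to keep it from the forward pass.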
softmax
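definition: for $x \in \mathbb{R}^n$
$$y = \mathrm{softmax}(x) \equiv \frac{1}{\sum_{j=0}^{n-1} \exp(x_j)} \begin{pmatrix} \exp(x_0) \\ \vdots \\ \exp(x_{n-1}) \end{pmatrix}$$
it is a vector whose:
  each component > 0,
  sum of all components = 1
  largest component “dominates”
[Figure: softmax maps a vector over classes 0-9 to a probability vector over classes 0-9 (vertical scale up to 1.0)]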
37 / 40
Derivative of softmax
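for $y = \mathrm{softmax}(x)$, $\left(\frac{\partial y}{\partial x}\right)$ is an $n \times n$ matrix whose elements are given by:
(diagonal elements)
$$\left(\frac{\partial y}{\partial x}\right)_{i,i} = \frac{\partial}{\partial x_i} \frac{\exp(x_i)}{\sum_k \exp(x_k)} = \frac{\exp(x_i) \sum_k \exp(x_k) - \exp(x_i)^2}{\left(\sum_k \exp(x_k)\right)^2} = y_i(1 - y_i)$$
(non-diagonal elements) for $i \ne j$,
$$\left(\frac{\partial y}{\partial x}\right)_{i,j} = \frac{\partial}{\partial x_j} \frac{\exp(x_i)}{\sum_k \exp(x_k)} = -\frac{\exp(x_i)\exp(x_j)}{\left(\sum_k \exp(x_k)\right)^2} = -y_i y_j$$
38 / 40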
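Both cases combine into $y_i(\delta_{ij} - y_j)$, which is easy to verify against finite differences. This is a sketch with a hypothetical input vector; the step size `h` and the choice of checking only column $j = 1$ are arbitrary.

```python
import math

def softmax(x):
    """exp(x_i) / sum_k exp(x_k), shifted by max(x) for numerical safety."""
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]

def softmax_jacobian(x):
    """J[i][j] = dy_i/dx_j = y_i (1 - y_i) if i == j, else -y_i y_j."""
    y = softmax(x)
    n = len(y)
    return [[y[i] * ((1.0 if i == j else 0.0) - y[j]) for j in range(n)]
            for i in range(n)]

x = [1.0, 2.0, 0.5]
J = softmax_jacobian(x)

# Central-difference estimate of column j = 1, i.e. dy_i/dx_1 for every i.
h = 1e-6
xp = list(x)
xp[1] += h
xm = list(x)
xm[1] -= h
fd_col = [(a - b) / (2.0 * h) for a, b in zip(softmax(xp), softmax(xm))]
```

Each row of the Jacobian sums to zero, reflecting that the components of $y$ always sum to 1.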
Cross entropy
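definition: for $y \in \mathbb{R}^n$,
$$H(t, y) \equiv -\sum_{i=0}^{n-1} t_i \log y_i$$
derivative of $H$: for $z = H(t, y)$, $\frac{\partial z}{\partial y}$ is an $n$-vector
$$\frac{\partial z}{\partial y} = -\left(\frac{t_0}{y_0}\ \cdots\ \frac{t_{n-1}}{y_{n-1}}\right)$$
if $t$ is a one-hot vector, this vector also has a single nonzero component; if $t = \mathbf{c}$,
$$\frac{\partial z}{\partial y} = -\left(0\ \cdots\ 0\ \ \frac{1}{y_c}\ \ 0\ \cdots\ 0\right) = -\frac{\mathbf{c}}{y_c}$$
39 / 40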
Composition of softmax and cross entropy
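In particular, the composition of softmax and $H$ enjoys a remarkable simplification when differentiated:
for $z = H(t, y) = H(t, \mathrm{softmax}(x))$ with a one-hot $t = \mathbf{c}$,
$$\begin{aligned}
\frac{\partial z}{\partial x} &= \frac{\partial z}{\partial y} \frac{\partial y}{\partial x} \\
&= -\frac{\mathbf{c}}{y_c} \frac{\partial y}{\partial x} \\
&= -\frac{1}{y_c} \frac{\partial y_c}{\partial x} \\
&= \frac{1}{y_c}\left(y_0 y_c\ \ y_1 y_c\ \cdots\ -y_c(1 - y_c)\ \ y_{c+1} y_c\ \cdots\ y_{n-1} y_c\right) \\
&= \left(y_0\ \ y_1\ \cdots\ (y_c - 1)\ \ y_{c+1}\ \cdots\ y_{n-1}\right) \\
&= {}^t(y - t)
\end{aligned}$$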
40 / 40
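The simplification means the backward step for softmax followed by cross entropy is a single subtraction. The sketch below compares $y - t$ with a central-difference gradient of the composed loss; the input vector, the target class and the step size `h` are hypothetical choices.

```python
import math

def softmax(x):
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]

def loss(x, t):
    """z = H(t, softmax(x)) = -sum_i t_i log y_i"""
    y = softmax(x)
    return -sum(ti * math.log(yi) for ti, yi in zip(t, y) if ti != 0.0)

def loss_gradient(x, t):
    """dz/dx = y - t for a one-hot t."""
    return [yi - ti for yi, ti in zip(softmax(x), t)]

x = [0.2, -1.0, 3.0, 0.5]
t = [0.0, 0.0, 1.0, 0.0]          # one-hot target, class 2
g = loss_gradient(x, t)

# Central-difference estimate of the same gradient.
h = 1e-6
fd = []
for i in range(len(x)):
    xp = list(x)
    xp[i] += h
    xm = list(x)
    xm[i] -= h
    fd.append((loss(xp, t) - loss(xm, t)) / (2.0 * h))
```

Computing the gradient this way also avoids dividing by $y_c$, which would be numerically fragile when the predicted probability of the true class is very small.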