neural networks basics and parallelization/vectorization

Neural Networks

basics and parallelization/vectorization

Kenjiro Taura

1 / 40

Contents

1 What is machine learning?A simple linear regressionA handwritten digits recognition

2 TrainingA simple gradient descentStochastic gradient descent

3 Chain Rule

4 Back Propagation in Action

2 / 40

Contents

3 Chain Rule

3 / 40

What is machine learning?

input: a set of training data set

D = { (xi, ti) | i = 0, 1, · · · }

each xi is normally a real vector (i.e. many real values)each ti is a real value (regression), 0/1 (binary classification),a discrete value (multi-class classification), etc., dependingon the task

goal: a supervised machine learning tries to find a function fthat “matches” training data well. i.e.

f(xi) ≈ ti for (xi, ti) ∈ D

put formally, find f that minimizes an error or a loss:

L(f ;D) ≡∑

(xi,ti)∈D

err(f(xi), ti),

where err(yi, ti) is a function that measures an “error” or a“distance” between the predicted output and the true value

4 / 40

D = { (xi, ti) | i = 0, 1, · · · }each xi is normally a real vector (i.e. many real values)each ti is a real value (regression), 0/1 (binary classification),a discrete value (multi-class classification), etc., dependingon the task

L(f ;D) ≡∑

(xi,ti)∈D

err(f(xi), ti),

4 / 40

L(f ;D) ≡∑

(xi,ti)∈D

err(f(xi), ti),

4 / 40

L(f ;D) ≡∑

(xi,ti)∈D

err(f(xi), ti),

4 / 40

Machine learning as an optimization problem

finding a good function from the space of literally all possiblefunctions is neither easy nor meaningful

we normally fix a search space of functions (F) parameterizedby w and find a good function fw ∈ F (parametric models)

the task is then to find the value of w that minimizes the loss:

L(w;D) ≡∑

(xi,ti)∈D

err(fw(xi), ti)

5 / 40

L(w;D) ≡∑

(xi,ti)∈D

err(fw(xi), ti)

5 / 40

L(w;D) ≡∑

(xi,ti)∈D

err(fw(xi), ti)

5 / 40

Contents

3 Chain Rule

6 / 40

A simple example (linear regression)

training data D = { (xi, ti) | i = 0, 1, · · · }xi : a real valueti : a real value

let the search space be a set of polynomials of degree ≤ 2. afunction is then parameterized by w = (w0 w1 w2). i.e.

fw(x) ≡ w2x2 + w1x+ w0

let the error function be a simple square distance:

err(y, t) ≡ (y − t)2

the task is to find w = (w0, w1, w2) that minimizes:

L(w;D) =∑

(xi,ti)∈D

err(fw(xi), ti) =∑

(xi,ti)∈D

(w2x2i+w1xi+w0−ti)

7 / 40

fw(x) ≡ w2x2 + w1x+ w0

err(y, t) ≡ (y − t)2

L(w;D) =∑

(xi,ti)∈D

7 / 40

fw(x) ≡ w2x2 + w1x+ w0

err(y, t) ≡ (y − t)2

L(w;D) =∑

(xi,ti)∈D

7 / 40

fw(x) ≡ w2x2 + w1x+ w0

err(y, t) ≡ (y − t)2

L(w;D) =∑

(xi,ti)∈D

7 / 40

Contents

3 Chain Rule

8 / 40

A somewhat more realistic example: image

(digits) recognition

training data D = { (xi, ti) | i = 0, 1, · · · }

xi : a vector of pixel values of an image:ti : a “one hot” vector representing theclass ∈ {0, 1, · · · , 9} (e.g.∼ t(0 0 0 0 1 0 0 0 0 0 0) represents “4”)we write i to mean the hot vector vhaving vi = 1

D = {( ,4), ( ,9), . . .}

the search space: the followingcomposition parameterized by threematrices W0,W1 and W2

fW0,W1,W2(x) ≡ softmax(W2ReLU(W1ReLU(W0x)))

9 / 40

A somewhat more realistic example: image

(digits) recognition

training data D = { (xi, ti) | i = 0, 1, · · · }

xi : a vector of pixel values of an image:ti : a “one hot” vector representing theclass ∈ {0, 1, · · · , 9} (e.g.∼ t(0 0 0 0 1 0 0 0 0 0 0) represents “4”)we write i to mean the hot vector vhaving vi = 1

D = {( ,4), ( ,9), . . .}the search space: the followingcomposition parameterized by threematrices W0,W1 and W2

fW0,W1,W2(x) ≡ softmax(W2ReLU(W1ReLU(W0x)))×

softmax

9 / 40

A handwritten digits recognition

the value fW0,W1,W2(x) is a 10-vectorrepresenting probabilities that x belongsto each of the ten classes

a loss function is the cross entropycommonly used in multiclassclassifications (· : a dot product)

err(y, t) = H(t, y) ≡ −t · log y

the task is to find W0,W1 and W2 thatminimizes:

L(W0,W1,W2;D)

(xi,ti)∈D

H(softmax(ti,W2 ReLU(W1 ReLU(W0xi))))

softmax

10 / 40

L(W0,W1,W2;D)

(xi,ti)∈D

H(softmax(ti,W2 ReLU(W1 ReLU(W0xi))))

softmax

10 / 40

L(W0,W1,W2;D)

(xi,ti)∈D

H(softmax(ti,W2 ReLU(W1 ReLU(W0xi))))×

softmax

10 / 40

Contents

3 Chain Rule

11 / 40

Contents

3 Chain Rule

12 / 40

How to find the minimizing parameter?

it boils down to minimizing a function that takes lots ofparameters w

L(w;D) =∑

(xi,ti)∈D

err(fw(xi), ti),

for which we compute a derivative of L with respect to w andmove w to its opposite direction (gradient descent; GD)

w = w − ηt∂L

(η : a scalar controlling a learning rate)

repeat this until L(w;D) converges

13 / 40

L(w;D) =∑

(xi,ti)∈D

err(fw(xi), ti),

w = w − ηt∂L

13 / 40

L(w;D) =∑

(xi,ti)∈D

err(fw(xi), ti),

w = w − ηt∂L

13 / 40

A linear regression example

recall that in the linear regression example:

L(w;D) =∑

(xi,ti)∈D

(w2x2i + w1xi + w0 − ti)

differentiate L by w = t(w0 w1 w2) to get:

∑(xi,ti)∈D

2(w2x2i + w1xi + w0 − ti)(1 xi x2

(remark: we used a chain rule)

so you repeat:

w = w − η∑

(xi,ti)∈D

2(w2x2i + w1xi + w0 − ti)

until L(w;D) converges

14 / 40

L(w;D) =∑

(xi,ti)∈D

(w2x2i + w1xi + w0 − ti)

∑(xi,ti)∈D

2(w2x2i + w1xi + w0 − ti)(1 xi x2

so you repeat:

w = w − η∑

(xi,ti)∈D

2(w2x2i + w1xi + w0 − ti)

14 / 40

L(w;D) =∑

(xi,ti)∈D

(w2x2i + w1xi + w0 − ti)

∑(xi,ti)∈D

2(w2x2i + w1xi + w0 − ti)(1 xi x2

so you repeat:

w = w − η∑

(xi,ti)∈D

2(w2x2i + w1xi + w0 − ti)

14 / 40

A problem of the gradient descent

the loss function we want to minimize is normally asummation over all training data:

L(w;D) =∑

(xi,ti)∈D

err(fw(xi), ti)

the gradient descent method just described:

1 computes∂

∂werr(fw(xi), ti) for each training data

(xi, ti) ∈ D, with the current value of w2 sum them over whole data set and then update w

it is commonly observed that the convergence becomes fasterwhen we update w more “incrementally” → StochasticGradient Descent (SGD)

15 / 40

L(w;D) =∑

(xi,ti)∈D

err(fw(xi), ti)

1 computes∂

15 / 40

L(w;D) =∑

(xi,ti)∈D

err(fw(xi), ti)

1 computes∂

15 / 40

Contents

3 Chain Rule

16 / 40

repeat:

1 randomly draw a subset of training data D′ (a mini batch;D′ ⊂ D)

2 compute the gradient of loss over the mini batch

∂L(w;D′)

∑(xi,ti)∈D′

∂werr(fw(xi), ti)

3 update w

w = w − ηt∂L(w;D′)

∂w4 “update sooner rather than later”

17 / 40

repeat:

∂L(w;D′)

∑(xi,ti)∈D′

∂werr(fw(xi), ti)

3 update w

w = w − ηt∂L(w;D′)

17 / 40

repeat:

∂L(w;D′)

∑(xi,ti)∈D′

∂werr(fw(xi), ti)

3 update w

w = w − ηt∂L(w;D′)

17 / 40

SGD and neural networks

in neural networks, a function is acomposition of many stages eachrepresented by a lot of parameters

x1 = f1(w1;x)

x2 = f2(w2;x1)

y = fn(wn;xn)

e = err(y, t)

we need to differentiate e byw1, · · · , wn

xi+1 = f(Wxi)

18 / 40

The digits recognition example

x1 = W0x

x2 = ReLU(x1)

x3 = W1x2

x4 = ReLU(x3)

x5 = W2x4

y = softmax(x5)

e = H(y, t)

you need to get differentiation of e byW0,W1 and W2 done right

softmax

19 / 40

Contents

3 Chain Rule

20 / 40

Differentiating multivariable functions

x = t(x0 · · · xn−1) ∈ Rn (a column vector)

f(x) : a scalar

definition: a derivative of f with respect to x, written f ′(x)

or∂f

∂x, is a row n-vector a s.t.

f(x+∆x) ≈ f(x) + a∆x

= f(x) +n−1∑i=0

ai∆xi

(a row vector × a column vector, yielding a 1× 1 matrix,identified with a scalar)

when it exists,

· · · ∂f

∂xn−1

)21 / 40

The Chain Rule

consider a function f that depends ony = (y0, · · · , ym−1) ∈ Rm, each of which in turn depends onx = (x0, · · · , xn−1) ∈ Rn

the chain rule (math textbook version):

0≤j<m

∂yj∂xi

(0 ≤ i < n)

x0 · · · xn−1

y0 · · · ym−1

22 / 40

The Chain Rule : intuition

x0 · · · xn−1

yy0 · · · ym−1

say you increase an input variable xi by ∆xi, each yj willincrease by

≈ ∂yj∂xi

∆xi,

which will contribute to increasing the final output (f) by

≈ ∂f

∂yj∂xi

23 / 40

Chain Rule

master the following “index-free” version for neural network

x, y : a scalar (a single component in a vector/matrix/highdimensional array)

the chain rule (ML practioner’s version):

∑all variables y that x directly affects

24 / 40

Chain Rule and “Back Propagation”

Chain rule allows you to compute

the derivative of the loss with respect toa variable, from

the derivatives of the loss with respect toupstream variables

∑all variables y a step ahead of x

∂x ×

softmax

25 / 40

Contents

3 Chain Rule

26 / 40

Component functions

we used the following functions

Convolution(W ;x) : linear/local transformations to imagesLinear(W ;x) : linear or “fully connected” layerReLU(x) : rectified linear unitssoftmax(x)H(t, x) : cross entropy

we summarize their definitions and their derivatives

27 / 40

Convolution

it takesan image = 2D pixels × a number of channelsa “filter” or a “kernel”, which is essentially a small image

and slides the filter over all pixels of the input and takes thelocal inner product at each pixelan illustration of a single channel 2D convolution (imagine agrayscale image)

filter (kernel)

output image output image

28 / 40

Convolution (a single channel version)

Wi,j : a filter (−K ≤ i ≤ K, −K ≤ j ≤ K)xi,j : an input image (0 ≤ i < H, 0 ≤ j < W )yi,j : an output image (0 ≤ i < H, 0 ≤ j < W )

filter (kernel)

output image output image

assuming xi,j = 0 for underflowed/overflowed indices forbrevity,

yi,j =∑

−K≤i′≤K,−K≤j′≤K

wi′,j′xi+i′,j+j′

(for each i, j) 29 / 40

Convolution

say input has IC channels and output OC channels

Woc,ic,i,j : filter (0 ≤ ic < IC, 0 ≤ oc < OC)

xic,i,j : an input image

yoc,i,j : an output image

yoc,i,j =∑ic,i′,j′

woc,ic,i′,j′xic,i+i′,j+j′

(for each oc, i, j)

the actual code does this for each image in a batch

yb,oc,i,j =∑ic,i′,j′

woc,ic,i′,j′xb,ic,i+i′,j+j′

(for each b, oc, i, j)

30 / 40

Convolution (Back propagation)

∂xb,ic,i+i′,j+j′=

∑b′,oc,i,j

∂yb′,oc,i,j

∂yb′,oc,i,j∂xb,ic,i+i′,j+j′

=∑oc,i,j

∂yb,oc,i,jwoc,ic,i′,j′

∂woc,ic,i′,j′=

∑b,oc′,i,j

∂yb,oc′,i,j

∂yb,oc′,i,j∂woc,ic,i′,j′

=∑b,i,j

∂yb,oc,i,jxb,ic,i+i′,j+j′

31 / 40

Linear (a.k.a. Fully Connected Layer)

definition:

y = Linear(W ;x) ≡ Wx

yi =∑j

derivatives:

∂yi′

∂Wij=

{xj (i′ = i)0 (i′ ̸= i)

∂yi∂xj

32 / 40

Linear (a.k.a. Fully Connected Layer)

back propagation:∂L∂W

∂Wij=

∑i′

∂yi′

∂Wij

∂yixj

∂y× x (outer product)

∂L∂x

∂xj=

∂yi∂xj

wij∂L

∂yW (vector-matrix product)

33 / 40

definition (scalar ReLU): for x ∈ R, define

relu(x) ≡ max(x, 0)

derivatives of relu: for y = relu(x),

{1 (x > 0)0 (x ≤ 0)

= max(sign(x), 0)

y = relu(x)

34 / 40

definition (vector ReLU): for a vector x ∈ Rn, defineReLU as the application of relu to each component

ReLU(x) ≡

relu(x0)...

relu(xn−1)

derivatives of ReLU:

∂yj∂xi

{max(sign(xi), 0) (i = j)0 (i ̸= j)

35 / 40

back propagation:

∂yi∂xj

∂yj∂xj

∂yj(xj ≥ 0)

0 (xj < 0)

36 / 40

softmax

definition: for x ∈ Rn

y = softmax(x) ≡ 1∑n−1i=0 exp(xj)

exp(x0)...

exp(xn−1)

it is a vector whose:

each component > 0,sum of all components = 1largest component “dominates”

0123456789 0123456789

softmax1.0

37 / 40

Derivative of softmax

for y = softmax(x),

)is an n× n matrix whose elements

are given by:(diagonal elements)(

exp(xi)∑k exp(xk)

=exp(xi)

∑k exp(xk)− exp(xi)

k exp(xk))2

= yi(1− yi)

(non-diagonal elements) for i ̸= j,(∂y

exp(xi)∑k exp(xk)

= − exp(xi)exp(xj)

k exp(xk))2

= −yiyj 38 / 40

Cross entropy

definition: for y ∈ Rn,

H(t, y) ≡ −n−1∑i=0

ti log yi

derivative of H: for z = H(t, y),∂z

∂yis an n-vector

∂y= −

· · · tn−1

yn−1

)if t is a one hot vector, so is this vector; if t = c,

∂y= −

(0 · · · 0 1

yc0 · · · 0

)= − c

yc39 / 40

Composition of softmax and cross entropy

In particular, composition of softmax and H enjoy aremarkable simplification when differentiated:

for z = H(t, y) = H(t, softmax(x)),

= − c

= − 1

∂yc∂x

yc(y0yc y1yc · · · yc(1− yc) yc+1yc · · · yn−1yc)

= (y0 y1 · · · (yc − 1) yc+1 · · · yn−1)

= t(y − t)

40 / 40

neural networks basics and parallelization/vectorization

Documents

turbodecodingalgorithm parallelization

parallelization and vectorization in gcc€¦ · 3 july...

parallel hybrid computing · gpu gpu gpu gpu openmp hmpp...

llvm auto-vectorization - fosdem

detection, vectorization and characterization of linear...

compiler auto vectorization guide

parallelization stategies of deeplearning neural network...

semantic segmentation for line drawing vectorization using...

vectorization vs. compilation in query execution · 3...

vectorization: an introduction

neural networks: forward and backpropagation · exploring...

comp 515: advanced compilation for vector and parallel...

seamless parallelization and vectorization integration...

lhcb ideas and path toward vectorization and parallelization...

vectorization in matlab

vectorization in the polyhedral model

dynamic parallelization and vectorization of binary ...the...

vectorization of linear discrete filtering algorithms...

parallel programming...

parallelization of artificial neural networks joe bradish...