Source: u.cs.biu.ac.il/~jkeshet/teaching/iml2016/iml2016_tirgul09.pdf

Deep Neural Networks

Tirgul 9 (Recitation 9)


Setup

• Handful of labeled examples, say images of cats with the label “Cat” and images of other things with the label “Not Cat”

• An algorithm “learns” to identify images of cats and, when fed a new image, should produce the correct label.

• Incredibly general setting:
  • Data: symptoms; labels: illnesses
  • Image recognition, automatic caption generation, speech recognition, etc.


Perceptrons: Early Deep Learning Algorithms


• Basic neural network building block: perceptron

• Say we have n points in the plane, labeled ‘0’ and ‘1’. We’re given a new point and we want to guess its label

• Solution: find a separating hyperplane, i.e. pick a line that best separates the labeled data and use that as your classifier.


• Each piece of input data is represented as a vector x = (x1, x2).

• Our function would be: ‘0’ if below the line, ‘1’ if above.

• The decision boundary: f(x) = w · x + b

• Activation function:
  h(x) = 1 if f(x) = w · x + b > 0, and 0 otherwise

*The activation function of a node defines the output of that node given an input.
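A minimal sketch of this decision rule (the function name and the example point/weights below are illustrative, not from the slides):

```python
def perceptron_predict(x, w, b):
    """Step-activation perceptron: h(x) = 1 if w·x + b > 0, else 0."""
    f = sum(wi * xi for wi, xi in zip(w, x)) + b   # decision function f(x) = w·x + b
    return 1 if f > 0 else 0

# A point on the positive side of the line w·x + b = 0 gets the label 1.
print(perceptron_predict((2.0, 1.0), w=(1.0, -1.0), b=0.5))   # -> 1
```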


Training the Perceptron

• Feed the perceptron multiple training samples and calculate the output for each of them.

• After each sample, the weights w are adjusted so as to minimize the output error. For y ∈ {0, 1}, the update rule is: w ← w + (y − ŷ) x
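A minimal training sketch built around this update rule (the function name and toy data are illustrative, and the bias is treated as a weight on a constant input of 1, which the slide does not spell out):

```python
def train_perceptron(samples, epochs=10):
    """Perceptron training: predict each sample, then apply w <- w + (y - y_hat) * x."""
    dim = len(samples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in samples:
            y_hat = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            # The update is non-zero only when the prediction is wrong (y - y_hat = ±1).
            w = [wi + (y - y_hat) * xi for wi, xi in zip(w, x)]
            b = b + (y - y_hat)   # bias update: treat b as a weight on a constant input 1
    return w, b

# Linearly separable toy data: label 1 roughly when x1 + x2 > 1.
data = [((0.0, 0.0), 0), ((1.0, 1.0), 1), ((0.2, 0.1), 0), ((0.9, 0.8), 1)]
print(train_perceptron(data))
```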


Perceptron omnipotent?

• Logic Gate - AND

AND | 0     | 1
0   | FALSE | FALSE
1   | FALSE | TRUE

Perceptron implementing the AND gate (from the figure): two 0/1 inputs, weights 1 and 1, bias −1.5, step-activation output.


• Logic Gate - OR

OR  | 0     | 1
0   | FALSE | TRUE
1   | TRUE  | TRUE

Perceptron implementing the OR gate (from the figure): two 0/1 inputs, weights 1 and 1, bias −0.5.


• Logic Gate - NOT

NOT | 0    | 1
    | TRUE | FALSE

Perceptron implementing the NOT gate (from the figure): a single 0/1 input, weight −1, bias 0.5.
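As a quick check (a sketch, not the slides' code), a step-activation perceptron with the weights shown above reproduces all three truth tables:

```python
def step_perceptron(x, w, b):
    """Return 1 if w·x + b > 0, else 0 (step activation)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def AND(a, b):   # weights (1, 1), bias -1.5, as in the AND-gate figure
    return step_perceptron((a, b), (1, 1), -1.5)

def OR(a, b):    # weights (1, 1), bias -0.5, as in the OR-gate figure
    return step_perceptron((a, b), (1, 1), -0.5)

def NOT(a):      # weight -1, bias 0.5, as in the NOT-gate figure
    return step_perceptron((a,), (-1,), 0.5)

for a in (0, 1):
    print(f"NOT({a}) = {NOT(a)}")
    for b in (0, 1):
        print(f"AND({a},{b}) = {AND(a, b)}   OR({a},{b}) = {OR(a, b)}")
```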


• Logic Gate - XOR

XOR | 0     | 1
0   | FALSE | TRUE
1   | TRUE  | FALSE

? — XOR is not a linear function: its TRUE and FALSE outputs cannot be separated by a single line, so no single perceptron can compute it.


Single Perceptron Drawbacks

• Can only learn linearly separable functions.

• To address this problem, we’ll need to use a multilayer perceptron, a.k.a. a feedforward neural network.

• Multiple perceptrons give us a more powerful mechanism for learning.


Multilayer Perceptron

• A neural network is a composition of perceptrons, connected in different ways.

• Example: a small two-layer composition is sketched below.
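The example figure itself is not reproduced here. As a small concrete composition (a standard construction, not taken from the slides), XOR, which no single perceptron can represent, comes out of two layers of the gate perceptrons above:

```python
def step(x, w, b):
    """Step-activation perceptron: 1 if w·x + b > 0, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def XOR(a, b):
    """Layer 1 computes OR(a, b) and AND(a, b); layer 2 fires only when OR did and AND did not."""
    o = step((a, b), (1, 1), -0.5)       # OR perceptron (weights from the OR slide)
    n = step((a, b), (1, 1), -1.5)       # AND perceptron (weights from the AND slide)
    return step((o, n), (1, -1), -0.5)   # o AND (NOT n)

for a in (0, 1):
    for b in (0, 1):
        print(f"XOR({a},{b}) = {XOR(a, b)}")   # -> 0, 1, 1, 0
```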


Feedforward Neural Networks for Deep Learning


• An input layer, an output layer, and one or more hidden layers. Here: a 3-unit input layer, a 4-unit hidden layer, and an output layer with 2 units.

• Each unit is a single perceptron.

• The units of the input layer serve as inputs for the units of the hidden layer, while the hidden layer units are inputs to the output layer.


• Each connection between two neurons has a weight w.

• Fully connected case: Each unit of layer t is typically connected to every unit of the previous layer t – 1.

• The information moves in only one direction, forward, from the input nodes, through the hidden and to the output nodes.


Beyond Linearity

• What if each of our perceptrons is only allowed to use a linear activation function?
  • A composition of linear functions is still just a linear function.

• The final output of our network will still be some linear function of the inputs.

• If we’re restricted to linear activation functions, then the feedforward neural network is no more powerful than the perceptron, no matter how many layers it has.
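To make the composition argument concrete for one hidden layer (a standard identity, stated here for completeness): with linear layers h = W1 x + b1 and output W2 h + b2,
W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2),
which is again a single linear (affine) function of x, no matter what W1, W2, b1, b2 are.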


• Because of this, most neural networks use non-linear activation functions like the logistic (sigmoid), tanh, or rectifier (ReLU).

• Without them the network can only learn functions which are linear combinations of its inputs.


Activation Functions

Function                     | Range
Logistic (sigmoid)           | (0, 1)
tanh                         | (−1, 1)
Rectified linear unit (ReLU) | [0, ∞)
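A minimal sketch of the three functions in the table, using only the standard library (the function names are illustrative):

```python
import math

def sigmoid(x):
    """Logistic (sigmoid): output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Hyperbolic tangent: output in (-1, 1)."""
    return math.tanh(x)

def relu(x):
    """Rectified linear unit: output in [0, inf)."""
    return max(0.0, x)

for v in (-2.0, 0.0, 2.0):
    print(f"x={v:+.1f}  sigmoid={sigmoid(v):.4f}  tanh={tanh(v):+.4f}  relu={relu(v):.1f}")
```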


Layer diagram (from the figure): x → h1 → h2 → y

θi = the weights of layer i
fi = the activation function used in layer i
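A hedged sketch of that composition (layer sizes, weight values, and names are mine, and sigmoid is assumed for every f_i): each layer applies its weights θ_i and its activation f_i to the previous layer's output.

```python
import math

def sigmoid_vec(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def dense(x, weights, bias):
    """One fully connected layer: each output unit computes w·x + b."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b for row, b in zip(weights, bias)]

def forward(x, layers):
    """Compose the layers: the output of layer i-1 is the input of layer i."""
    for weights, bias in layers:
        x = sigmoid_vec(dense(x, weights, bias))
    return x

# Illustrative 2-3-1 network with arbitrary weights (not the example from the slides).
layers = [
    ([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], [0.1, 0.1, 0.1]),   # theta_1: 3 hidden units
    ([[0.7, 0.8, 0.9]], [0.2]),                                # theta_2: 1 output unit
]
print(forward([1.0, 0.5], layers))
```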


Loss

• The measure of how well the model fits the training set is given by a suitable loss function L(x, y; θ), e.g.:
  • Sum-of-squares: Σ_{i=1..K} (y_i − ŷ_i)²
  • Negative log likelihood: − log p(class = k | x; θ)

• The loss depends on the input x, the target label y, and the parameters θ, where θi = (wi, bi).
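As a small sketch of the two losses named above (names are illustrative; note that the worked example later multiplies the sum-of-squares by ½):

```python
import math

def sum_of_squares(y, y_hat):
    """Sum-of-squares over the K output units: sum_i (y_i - y_hat_i)^2."""
    return sum((t - p) ** 2 for t, p in zip(y, y_hat))

def negative_log_likelihood(class_probs, k):
    """Negative log likelihood of the target class k: -log p(class = k | x; theta)."""
    return -math.log(class_probs[k])

print(sum_of_squares([0.01], [0.7513]))             # single output unit
print(negative_log_likelihood([0.2, 0.7, 0.1], 1))  # target class k = 1
```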


Activation Functions Derivatives

Function                     | Derivative (w.r.t. x)
Logistic (sigmoid)           | σ(x)(1 − σ(x))
tanh                         | 1 − tanh²(x)
Rectified linear unit (ReLU) | 1 if x > 0, 0 if x < 0 (conventionally taken as 0 at x = 0)


DNN algorithm

1. Feedforward

2. Backpropagation


Feedforward

• Values are fed in at the input layer and propagate forward: from the input layer to the hidden layer(s), and from the hidden layer(s) to the output layer.


Feedforward Example

• Activation function: sigmoid, σ(Net)

• Loss function: sum of squares, E = ½ Σ_i (y_i − ŷ_i)²

• Network (from the figure): inputs x(1), x(2); hidden neurons h1(1), h1(2); output neuron z. Each neuron computes Net = Σ_i w_i x_i + b and then applies the activation function.

• Notation: h_i(j) refers to the j-th neuron at the i-th hidden layer.


• x = (0.05, 0.1), y = 0.01

• Suppose the weight values of our network are given (shown in red in the figure):
  • Hidden layer: w11 = (0.15, 0.20), w12 = (0.25, 0.30), bias b1 = 0.35
  • Output layer: w2 = (0.40, 0.45), bias b2 = 0.60

• Notation: w_lj denotes the weight vector entering neuron j of layer l.


• x = (0.05, 0.1), y = 0.01

• Net_h1(1) = Σ_i x(i) · w11(i) + 1 · b1 = x(1) · w11(1) + x(2) · w11(2) + 1 · b1
            = 0.05 · 0.15 + 0.1 · 0.20 + 1 · 0.35 = 0.3775

• Out_h1(1) = σ(Net_h1(1)) = 1 / (1 + e^(−Net_h1(1))) = 1 / (1 + e^(−0.3775)) = 0.59326


• Out_h1(1) = 0.59326

• Out_h1(2) = 0.59688  ← calculated in the same way

• Net_z = Σ_i h1(i) · w2(i) + 1 · b2 = h1(1) · w2(1) + h1(2) · w2(2) + 1 · b2
        = 0.59326 · 0.40 + 0.59688 · 0.45 + 1 · 0.60 = 1.1059

• Out_z = σ(Net_z) = 1 / (1 + e^(−1.1059)) = 0.7513


Calculating the Error

• Loss function: E = ½ Σ_i (y_i − ŷ_i)², where i iterates over the output nodes (in this case there is only one).

• Our error: E = ½ (0.7513 − 0.01)² = 0.2747
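To check the arithmetic, here is a small sketch (not from the slides) that reproduces the forward pass and the error above, assuming the weight layout reconstructed earlier (w11 = (0.15, 0.20), w12 = (0.25, 0.30), b1 = 0.35, w2 = (0.40, 0.45), b2 = 0.60); the variable names are mine.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

x = (0.05, 0.10)
y = 0.01
w11, w12, b1 = (0.15, 0.20), (0.25, 0.30), 0.35    # hidden layer
w2, b2 = (0.40, 0.45), 0.60                        # output layer

net_h1 = x[0] * w11[0] + x[1] * w11[1] + b1        # 0.3775
net_h2 = x[0] * w12[0] + x[1] * w12[1] + b1
out_h1, out_h2 = sigmoid(net_h1), sigmoid(net_h2)  # ~0.59326, ~0.59688

net_z = out_h1 * w2[0] + out_h2 * w2[1] + b2       # ~1.1059
out_z = sigmoid(net_z)                             # ~0.7513

E = 0.5 * (y - out_z) ** 2                         # ~0.2747, up to rounding
print(out_h1, out_h2, out_z, E)
```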


Backpropagation

• Goal: update every weight vector so that the output ŷ becomes closer to the target y, thus minimizing the error of the network.


• We will start by updating w2, using gradient descent.

• Recall the update rule: w2 ← w2 − η ∂E/∂w2


• Find ∂E/∂w2. We will use the chain rule.

• E = ½ (y − ŷ)²

• ∂E/∂w2 = ∂E/∂Out_z · ∂Out_z/∂Net_z · ∂Net_z/∂w2


The Chain Rule

• Feedforward calculations:

  1. Net_h1(j) = ⟨w1j, x⟩ (plus the bias term)
  2. Out_h1(j) = σ(Net_h1(j))
  3. Net_z = ⟨w2, Out_h1⟩ (plus the bias term)
  4. Out_z = σ(Net_z)


Backpropagation:

• ∂E/∂w2 = ∂E/∂Out_z · ∂Out_z/∂Net_z · ∂Net_z/∂w2

• ∂E/∂Out_z = 2 · ½ (y − Out_z) · (−1)

• ∂Out_z/∂Net_z = σ(Net_z)(1 − σ(Net_z))

• ∂Net_z/∂w2 = Out_h1

Recall the feedforward calculations: Net_h1(j) = ⟨w1j, x⟩, Out_h1(j) = σ(Net_h1(j)), Net_z = ⟨w2, Out_h1⟩, Out_z = σ(Net_z); and the error E = ½ (y − Out_z)².


The Update Rule

• ∂E/∂w2 = ∂E/∂Out_z · ∂Out_z/∂Net_z · ∂Net_z/∂w2
         = 2 · ½ (y − Out_z) · (−1) · σ(Net_z)(1 − σ(Net_z)) · Out_h1
         = −(y − Out_z) · Out_z (1 − Out_z) · Out_h1

• Updating w2:
  w2 ← w2 − η [ −(y − Out_z) · Out_z (1 − Out_z) · Out_h1 ]
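Continuing the numeric example, the gradient and update for w2 can be evaluated directly from the expression above (a sketch; the rounded values are taken from the slides, and the learning rate η = 0.5 is an assumed value, since the slides leave η unspecified):

```python
# Values from the forward pass (rounded as on the slides).
y, out_z = 0.01, 0.7513
out_h = (0.59326, 0.59688)
w2 = (0.40, 0.45)

# delta_z collects the first two chain-rule factors: -(y - Out_z) * Out_z * (1 - Out_z).
delta_z = -(y - out_z) * out_z * (1 - out_z)

# Third factor: dNet_z/dw2 = Out_h1, giving one gradient component per weight in w2.
grad_w2 = tuple(delta_z * h for h in out_h)

eta = 0.5   # assumed learning rate
w2_new = tuple(wi - eta * gi for wi, gi in zip(w2, grad_w2))
print(delta_z, grad_w2, w2_new)
```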


The Chain Rule

• Next, we continue the backward pass in order to compute ∂E/∂w1j and update w1j.


Backpropagation:

• ∂E/∂w1j = ∂E/∂Out_z · ∂Out_z/∂Net_z · ∂Net_z/∂Out_h1(j) · ∂Out_h1(j)/∂Net_h1(j) · ∂Net_h1(j)/∂w1j

• ∂Net_z/∂Out_h1(j) = w2(j)

• ∂Out_h1(j)/∂Net_h1(j) = σ(Net_h1(j))(1 − σ(Net_h1(j)))

• ∂Net_h1(j)/∂w1j = x

(∂E/∂Out_z and ∂Out_z/∂Net_z were calculated previously.)


The Update Rule

• ∂E/∂w1j = ∂E/∂Out_z · ∂Out_z/∂Net_z · ∂Net_z/∂Out_h1(j) · ∂Out_h1(j)/∂Net_h1(j) · ∂Net_h1(j)/∂w1j
          = 2 · ½ (y − Out_z) · (−1) · Out_z (1 − Out_z) · w2(j) · σ(Net_h1(j))(1 − σ(Net_h1(j))) · x
          = −(y − Out_z) · Out_z (1 − Out_z) · w2(j) · Out_h1(j) (1 − Out_h1(j)) · x
  (the first two factors were calculated previously)

• Updating w1j:
  w1j ← w1j − η [ −(y − Out_z) · Out_z (1 − Out_z) · w2(j) · Out_h1(j) (1 − Out_h1(j)) · x ]
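Pulling the feedforward pass and both update rules together, a hedged sketch of one full gradient-descent step for this 2-2-1 sigmoid network (the function name and η = 0.5 are my assumptions, and the biases are left unchanged because the slides only derive the weight updates):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def train_step(x, y, w1, b1, w2, b2, eta=0.5):
    """One feedforward + backpropagation step for the 2-2-1 network of the example."""
    # Feedforward.
    net_h = [sum(wi * xi for wi, xi in zip(w1[j], x)) + b1 for j in range(2)]
    out_h = [sigmoid(n) for n in net_h]
    net_z = sum(wi * hi for wi, hi in zip(w2, out_h)) + b2
    out_z = sigmoid(net_z)

    # Backpropagation, using the chain-rule factors derived on the slides.
    delta_z = -(y - out_z) * out_z * (1 - out_z)
    new_w2 = [wi - eta * delta_z * hi for wi, hi in zip(w2, out_h)]
    new_w1 = [[w1[j][i] - eta * delta_z * w2[j] * out_h[j] * (1 - out_h[j]) * x[i]
               for i in range(2)] for j in range(2)]

    error = 0.5 * (y - out_z) ** 2   # error before the update
    return new_w1, new_w2, error

# Weights from the example; biases b1 = 0.35, b2 = 0.60.
w1 = [[0.15, 0.20], [0.25, 0.30]]
w2 = [0.40, 0.45]
w1, w2, err = train_step((0.05, 0.10), 0.01, w1, 0.35, w2, 0.60)
print(err, w2)
```

Calling train_step repeatedly on the same example keeps shrinking the error, which is what the next slide describes.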


Updated Weights

• Finally, we updated the weights.

• When we fed forward the first input (0.05, 0.1), the error was: 0.2983

• After a single update, the error is down to: 0.29102

• After repeating the process 10,000 times, the error is: 0.00003


Summary

• Single-layer perceptron
  • Only solves linearly separable problems

• Neural networks
  • Non-linear activation functions
  • Feedforward
  • Backpropagation

• Example from: https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/
