Training Neural Networks
SS 2020 – Machine Perception
Otmar Hilliges
12 March 2020
Errata
• Projects are in teams of two – Mea Culpa
• See the Piazza announcement for more details and post questions there
3/12/20 2
More announcements – COVID-19
The situation changes daily; we keep monitoring it.
If a minimum distance of 1 meter cannot be guaranteed, events at ETH are forbidden.
This affects lectures -> please sit apart from each other. If possible, prioritize the video recordings over the lecture hall.
Last lecture
• Perceptron learning algorithm
• MLP as an engineering model of a neural network
This lecture
• What types of functions can be approximated by neural networks?
  • Universal approximation theorem
• How do we train neural networks?
  • Backprop algorithm
Perceptron - Block diagram
[Block diagram: x → (· w + b) → σ → y]

x = [x₁, x₂, …, x_d]
w = [w₁, w₂, …, w_d]
y = σ(wᵀx + b)
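As a minimal sketch (not from the slides; input, weight, and bias values are arbitrary illustrative numbers), the perceptron's forward pass y = σ(wᵀx + b) can be written directly in NumPy:

```python
import numpy as np

def sigmoid(z):
    # logistic activation: sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(x, w, b):
    # y = sigma(w^T x + b)
    return sigmoid(np.dot(w, x) + b)

# arbitrary example values
x = np.array([1.0, 0.5, -0.3])
w = np.array([0.2, -0.1, 0.4])
b = 0.1
y = perceptron(x, w, b)  # a single scalar activation in (0, 1)
```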
Multi-layered Perceptron - Block diagram
We can combine several layers: with x^(0) = x,

∀l = 1, …, L:  x^(l) = σ(w^(l)ᵀ x^(l−1) + b^(l))

and f(x; w, b) = x^(L)

[Block diagram: x → (· w^(1) + b^(1)) → σ → … → (· w^(L) + b^(L)) → σ → y]
Linear activation functions?
Assuming that σ is a linear transform, ∀x ∈ ℝⁿ: σ(x) = αx + βI,
with α, β ∈ ℝ, we get:

∀l = 1, …, L:  x^(l) = α w^(l)ᵀ x^(l−1) + α b^(l) + βI,

which results in an affine mapping:

f(x; w, b) = A^(L) x + B^(L),

where A^(0) = I, B^(0) = 0 and, for l = 1, …, L:

A^(l) = α w^(l) A^(l−1)
B^(l) = α w^(l) B^(l−1) + α b^(l) + βI
Important message: The activation function should be non-linear, or the resulting MLP is an affine mapping with a peculiar parametrization!
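This collapse can be checked numerically. The sketch below (hypothetical random weights, not from the slides) stacks two layers with a linear "activation" σ(x) = αx + β and verifies the composition equals a single affine map A x + B:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 2.0, 0.5  # linear "activation": sigma(x) = alpha * x + beta

# two layers with arbitrary random parameters
W1, b1 = rng.normal(size=(3, 3)), rng.normal(size=3)
W2, b2 = rng.normal(size=(3, 3)), rng.normal(size=3)

def mlp(x):
    # two-layer MLP with the linear activation applied after each layer
    h = alpha * (W1.T @ x + b1) + beta
    return alpha * (W2.T @ h + b2) + beta

# the same map collapsed into a single affine transform A x + B
A = alpha * W2.T @ (alpha * W1.T)
B = alpha * W2.T @ (alpha * b1 + beta) + alpha * b2 + beta

x = rng.normal(size=3)
# identical outputs: the "deep" linear network is just one affine mapping
assert np.allclose(mlp(x), A @ x + B)
```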
So what happens with a non-linearity?
Example: Solving ’XOR’
Task: learn the function y = f*(x) that maps the binary variables x₁, x₂ to "true" if one and only one xᵢ == 1, and to "false" otherwise.
We will use a function f(x; Θ) and find parameters Θ such that f ≈ f*.

X = [[0, 0], [0, 1], [1, 0], [1, 1]],  Y = [0, 1, 1, 0]ᵀ
Attempt I: Use a linear model?
Assuming a linear function: f(x; w, b) = xᵀw + b

and the squared loss over the dataset (N = 4):

J(Θ) = (1/N) Σ_{x∈X} (f*(x) − f(x; Θ))²

Solving via the normal equation yields: w = 0 and b = 1/2
Visualization
[Figure 6.1, Goodfellow et al., *Deep Learning*, Ch. 6, p. 173. Left panel: original x space; right panel: learned h space.]

Figure 6.1: Solving the XOR problem by learning a representation. The bold numbers printed on the plot indicate the value that the learned function must output at each point. (Left) A linear model applied directly to the original input cannot implement the XOR function. When x1 = 0, the model's output must increase as x2 increases. When x1 = 1, the model's output must decrease as x2 increases. A linear model must apply a fixed coefficient w2 to x2. The linear model therefore cannot use the value of x1 to change the coefficient on x2 and cannot solve this problem. (Right) In the transformed space represented by the features extracted by a neural network, a linear model can now solve the problem. In our example solution, the two points that must have output 1 have been collapsed into a single point in feature space. In other words, the nonlinear features have mapped both x = [1, 0]ᵀ and x = [0, 1]ᵀ to a single point in feature space, h = [1, 0]ᵀ. The linear model can now describe the function as increasing in h1 and decreasing in h2. In this example, the motivation for learning the feature space is only to make the model capacity greater so that it can fit the training set. In more realistic applications, learned representations can also help the model to generalize.
ReLU

[Figure 6.3, Goodfellow et al., *Deep Learning*, Ch. 6, p. 175: plot of the rectified linear function.]

Figure 6.3: The rectified linear activation function. This activation function is the default activation function recommended for use with most feedforward neural networks. Applying this function to the output of a linear transformation yields a nonlinear transformation. However, the function remains very close to linear, in the sense that it is a piecewise linear function with two linear pieces. Because rectified linear units are nearly linear, they preserve many of the properties that make linear models easy to optimize with gradient-based methods. They also preserve many of the properties that make linear models generalize well. A common principle throughout computer science is that we can build complicated systems from minimal components. Much as a Turing machine's memory needs only to be able to store 0 or 1 states, we can build a universal function approximator from rectified linear functions.

g(z) = max{0, z}
Feedforward Neural Network
[Block diagram: X → (· W + c) → σ (ReLU) → (· w + b) → y]

f(x; W, c, w, b) = wᵀ max{0, XW + c} + b
The XOR Multi-Layered Perceptron
Function we want to learn: f(x; W, c, w, b) = wᵀ max{0, XW + c} + b

Oracle solution:
W = [[1, 1], [1, 1]], c = [0, −1]ᵀ, w = [1, −2]ᵀ, b = 0

X = [[0, 0], [0, 1], [1, 0], [1, 1]],  Y = [0, 1, 1, 0]ᵀ

Step 1: XW = [[0, 0], [1, 1], [1, 1], [2, 2]]
The XOR Multi-Layered Perceptron
Function we want to learn: f(x; W, c, w, b) = wᵀ max{0, XW + c} + b

Oracle solution:
W = [[1, 1], [1, 1]], c = [0, −1]ᵀ, w = [1, −2]ᵀ, b = 0

X = [[0, 0], [0, 1], [1, 0], [1, 1]],  Y = [0, 1, 1, 0]ᵀ

Step 2: XW + c = [[0, −1], [1, 0], [1, 0], [2, 1]]
The XOR Multi-Layered Perceptron
Function we want to learn: f(x; W, c, w, b) = wᵀ max{0, XW + c} + b

Oracle solution:
W = [[1, 1], [1, 1]], c = [0, −1]ᵀ, w = [1, −2]ᵀ, b = 0

X = [[0, 0], [0, 1], [1, 0], [1, 1]],  Y = [0, 1, 1, 0]ᵀ

Step 3: max{0, XW + c} = [[0, 0], [1, 0], [1, 0], [2, 1]]
The hidden representation (columns h₁, h₂):

max{0, XW + c} = [[0, 0], [1, 0], [1, 0], [2, 1]]

[Figure 6.1 (repeated), Goodfellow et al., *Deep Learning*: the four inputs X = [[0, 0], [0, 1], [1, 0], [1, 1]] map into the learned h space, where the two classes become linearly separable.]
The XOR Multi-Layered Perceptron
Function we want to learn: f(x; W, c, w, b) = wᵀ max{0, XW + c} + b

Oracle solution:
W = [[1, 1], [1, 1]], c = [0, −1]ᵀ, w = [1, −2]ᵀ, b = 0

X = [[0, 0], [0, 1], [1, 0], [1, 1]],  Y = [0, 1, 1, 0]ᵀ

Step 4: max{0, XW + c} w + b = [[0, 0], [1, 0], [1, 0], [2, 1]] [1, −2]ᵀ + 0 = [0, 1, 1, 0]ᵀ = Y
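The XOR forward pass with the oracle parameters can be verified in a few lines of NumPy (a sketch of the slide's computation):

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
Y = np.array([0, 1, 1, 0])

# oracle parameters from the slide
W = np.array([[1, 1], [1, 1]])
c = np.array([0, -1])
w = np.array([1, -2])
b = 0

h = np.maximum(0, X @ W + c)  # hidden-layer activations (ReLU)
y = h @ w + b                 # linear output layer
assert np.array_equal(y, Y)   # the network reproduces XOR exactly
```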
So far: The non-linearity is critically important
Next: What type of functions can we approximate?
Universal approximation theorem
Given that σ ∈ C(ℝ) is non-linear (e.g., the sigmoid), what type of functions can we learn?

For any continuous function f(x) there exists a neural network g(x) with

g(x) ≈ f(x) and |g(x) − f(x)| < ε

It is an approximation, given enough hidden units. One layer is enough in theory; in practice, deeper is better.

Original proof: "Multilayer feedforward networks are universal approximators" [Hornik et al., 1989]
Universal approximation theorem
[see also: http://neuralnetworksanddeeplearning.com/chap4.html]
g(x) = σ(wᵀx + b) = exp(wᵀx + b) / (exp(wᵀx + b) + 1), where σ is the sigmoid function

w = 16, b = −9

[Plot: g(x) vs. x]
Universal approximation theorem
g(x) = σ(wᵀx + b) = exp(wᵀx + b) / (exp(wᵀx + b) + 1)

[Plot: varying the weight, w ∈ {8, 16, 40}, with b = −9: a larger w makes the sigmoid steeper]
Universal approximation theorem
g(x) = σ(wᵀx + b) = exp(wᵀx + b) / (exp(wᵀx + b) + 1), where σ is the sigmoid function

[Plot: varying the bias, b ∈ {−12, −9, −7}, with w = 16: the bias shifts the curve left/right]
Universal approximation theorem
w = 100, b = −40: the sigmoid approaches a step function

Leading edge: s = −b/w = 0.4

[Plot: g(x) vs. x]
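This step-function regime can be checked numerically (a sketch, not from the slides): with w = 100, b = −40 the sigmoid's leading edge sits at s = −b/w = 0.4, and the unit saturates on either side of it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = 100.0, -40.0
s = -b / w  # leading-edge position of the near-step sigmoid

# well below the edge the unit is ~0, well above it is ~1
lo = sigmoid(w * 0.2 + b)  # argument is -20: essentially off
hi = sigmoid(w * 0.6 + b)  # argument is +20: essentially on
```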
Universal approximation theorem
[Diagram: two step-function hidden units, each at position s = 0.4 with output weight h = 0.5]

The output layer is always assumed linear.
Universal approximation theorem
[Diagram: two step-function units at positions s₁ = 0.2 and s₂ = 0.4, with output weights h₁ = 0.5 and h₂ = 0.5]
Universal approximation theorem
“Bump function”
[Diagram: with output weights h₁ = 0.5 and h₂ = −0.5 at positions s₁ = 0.2 and s₂ = 0.4, the two units form a "bump" of height 0.5 on [0.2, 0.4]]
Universal approximation theorem
[Diagram: pairs of step units with weights ±h₁ at positions s₁, s₂ and ±h₂ at positions s₃, s₄ produce multiple bumps; summing many such bumps approximates an arbitrary function g(x)]
Universal approximation theorem
• Tells us that feed-forward networks can approximate f(x)
• It does not tell us anything about the chances of learning the correct parameters
More precisely…
Let σ: ℝ → ℝ be a non-constant, bounded and continuous (activation) function. Iₘ denotes the m-dimensional unit hypercube [0, 1]ᵐ, and the space of real-valued functions on Iₘ is denoted C(Iₘ). Then any function f ∈ C(Iₘ) can be approximated: given any ε > 0, there exist an integer N, real constants vᵢ, bᵢ ∈ ℝ and real vectors wᵢ ∈ ℝᵐ for i = 1, …, N such that

f(x) ≈ g(x) = Σᵢ₌₁ᴺ vᵢ σ(wᵢᵀx + bᵢ)

and |g(x) − f(x)| < ε for all x ∈ Iₘ.

[Cybenko, G. "Approximation by superpositions of a sigmoidal function", Mathematics of Control, Signals, and Systems, 1989.]
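The construction behind the theorem can be sketched numerically: sum many steep sigmoids, each switching on at a grid point sᵢ, with output weights vᵢ chosen as the jumps of the target function between grid points. The target function, grid size, and steepness below are arbitrary illustrative choices, not from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

target = lambda x: np.sin(2 * np.pi * x)  # arbitrary continuous target on [0, 1]

N = 100                       # number of hidden units
s = np.linspace(0.0, 1.0, N)  # step positions s_i = -b_i / w_i
w_i = 2000.0                  # large weight -> each sigmoid is nearly a step
b_i = -w_i * s                # unit i switches on at x = s_i

# v_i = jump of the target between consecutive steps, so the partial sums
# reproduce a piecewise-constant approximation of the target
v = np.diff(np.concatenate([[0.0], target(s)]))

def g(x):
    # g(x) = sum_i v_i * sigma(w_i * x + b_i)
    return np.sum(v * sigmoid(w_i * x[:, None] + b_i), axis=1)

x = np.linspace(0.05, 0.95, 200)
err = np.max(np.abs(g(x) - target(x)))  # shrinks as N grows
```

With N = 100 units the maximum error is already below 0.1; increasing N tightens the bound, matching the "given enough hidden units" caveat.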
However…
Networks with a single hidden layer may need to have exponential width.

In practice, deeper networks work better.

There is lots of ongoing work to provide theory on which network properties lead to which approximation capabilities.

E.g., [Lu et al. "The Expressive Power of Neural Networks: A View from the Width". NeurIPS 2017]
So far: NNs are universal approximators
Next: How do we find the network parameters?
General procedure
Iterative gradient descent to find parameters Θ:

• Initialize weights with small random values
• Initialize biases with 0 or small positive values
• Compute gradients
• Update parameters with SGD

SGD in a nutshell: compute the negative gradient at θᵗ, −∇C(θᵗ), scale it by the learning rate η, and step: θᵗ⁺¹ = θᵗ − η ∇C(θᵗ)
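The procedure above can be sketched on a toy problem (hypothetical data and a one-parameter linear model standing in for a network; the data-generating line y = 3x − 1 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy regression data: y = 3x - 1 plus a little noise
X = rng.uniform(-1, 1, size=100)
Y = 3 * X - 1 + 0.01 * rng.normal(size=100)

# initialize the weight small and random, the bias at zero (as on the slide)
w = 0.01 * rng.normal()
b = 0.0
eta = 0.1  # learning rate

for epoch in range(200):
    for i in rng.permutation(len(X)):     # one random sample per update
        y_hat = w * X[i] + b
        grad_w = 2 * (y_hat - Y[i]) * X[i]  # dC/dw for the squared error
        grad_b = 2 * (y_hat - Y[i])         # dC/db
        w -= eta * grad_w                   # theta <- theta - eta * grad C
        b -= eta * grad_b
```

After training, (w, b) sits close to the generating parameters (3, −1).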
Chain rule
y = g(x) and z = f(g(x)) = f(y)

dz/dx = (dz/dy) (dy/dx)

For vector-valued variables: ∂z/∂x = (∂z/∂y)(∂y/∂x)  (a product of Jacobians)
Two ways to compute gradients
Consider the scalar function:

f = exp(exp(x) + exp(x)²) + sin(exp(x) + exp(x)²)

Symbolic differentiation gives us:

df/dx = exp(exp(x) + exp(x)²) (exp(x) + 2 exp(x)²) + cos(exp(x) + exp(x)²) (exp(x) + 2 exp(x)²)
If we were to write a program…
Consider the scalar function:

f = exp(exp(x) + exp(x)²) + sin(exp(x) + exp(x)²)

We would define and compute intermediate variables:

a = exp(x)
b = a²
c = a + b
d = exp(c)
e = sin(c)
f = d + e
We can draw this as graph
f = exp(exp(x) + exp(x)²) + sin(exp(x) + exp(x)²)

a = exp(x), b = a², c = a + b, d = exp(c), e = sin(c), f = d + e

[Computation graph: x → exp(·) → a; a → (·)² → b; (a, b) → + → c; c → exp(·) → d; c → sin(·) → e; (d, e) → + → f]
Mechanically writing down derivatives
Applying the chain rule node by node, from the output backwards:

∂f/∂d = 1
∂f/∂e = 1
∂f/∂c = (∂f/∂d)(∂d/∂c) + (∂f/∂e)(∂e/∂c)
∂f/∂b = (∂f/∂c)(∂c/∂b)
∂f/∂a = (∂f/∂c)(∂c/∂a) + (∂f/∂b)(∂b/∂a)
∂f/∂x = (∂f/∂a)(∂a/∂x)

Substituting the local derivatives:

∂f/∂d = 1
∂f/∂e = 1
∂f/∂c = (∂f/∂d) exp(c) + (∂f/∂e) cos(c)
∂f/∂b = ∂f/∂c
∂f/∂a = ∂f/∂c + (∂f/∂b) · 2a
∂f/∂x = (∂f/∂a) exp(x)

[Computation graph repeated: x → a → b → c → d, e → f]
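The two passes translate directly into code: a forward pass storing the intermediate variables, then a backward pass accumulating ∂f/∂(·) in reverse order. This is a hand-written sketch of reverse-mode differentiation for this one function, not a general autodiff system:

```python
import numpy as np

def f_and_grad(x):
    # forward pass: compute and store the intermediate variables
    a = np.exp(x)
    b = a ** 2
    c = a + b
    d = np.exp(c)
    e = np.sin(c)
    f = d + e
    # backward pass: accumulate df/d(variable) from the output backwards,
    # reusing already-computed upstream derivatives at shared nodes
    df_dd = 1.0
    df_de = 1.0
    df_dc = df_dd * np.exp(c) + df_de * np.cos(c)  # d = exp(c), e = sin(c)
    df_db = df_dc                                  # c = a + b
    df_da = df_dc + df_db * 2 * a                  # c = a + b and b = a^2
    df_dx = df_da * np.exp(x)                      # a = exp(x)
    return f, df_dx

x = 0.3
f, g = f_and_grad(x)

# check against the symbolic derivative from the earlier slide
u = np.exp(x) + np.exp(x) ** 2
sym = (np.exp(u) + np.cos(u)) * (np.exp(x) + 2 * np.exp(x) ** 2)
assert abs(g - sym) < 1e-9
```

Note that the backward pass evaluates each local derivative exactly once, which is what makes backprop cheaper than the expanded symbolic expression.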
Backpropagation in Neural Networks
Example & Derivation on the blackboard
Next week
Convolutional Neural Networks