security and fairness of deep learningece739/lectures/18739... · security and fairness of deep...
TRANSCRIPT
![Page 1: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/1.jpg)
Stochastic Gradient Descent
Spring 2020
Security and Fairness of Deep Learning
![Page 2: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/2.jpg)
Image Classification
![Page 3: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/3.jpg)
Linear model
• Score function• Maps raw data to class scores
• Loss function• Measures how well predicted classes agree with ground truth labels
• Multiclass Support Vector Machine loss (SVM loss)• Softmax classifier (cross-entropy loss)
• Learning• Find parameters of score function that minimize loss function
• Multiclass Support Vector Machine loss (SVM loss)
![Page 4: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/4.jpg)
Recall: Linear model with SVM loss
• Score function• Maps raw data to class scores
• Loss function• Measures how well predicted classes agree with ground truth labels
f(xi,W ) = Wxi
L =1
N
X
i
X
j 6=yi
[max(0, f(xi;W )j � f(xi;W )yi +�)] + �X
k
X
l
W 2k,l
![Page 5: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/5.jpg)
Today
• Learning model parameters with Stochastic Gradient Descent that minimize loss
• Later• Different score functions: deep networks• Same loss functions and learning algorithm
![Page 6: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/6.jpg)
Outline
• Visualizing the loss function• Optimization
• Random search• Random local search• Gradient descent
• Mini-batch gradient descent
![Page 7: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/7.jpg)
Visualizing SVM loss function
• Difficult to visualize fully • CIFAR-10 a linear classifier weight matrix is of size [10 x 3073] for a total of
30,730 parameters
• Can gain intuition by visualizing along rays (1 dimension) or planes (2 dimensions)
![Page 8: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/8.jpg)
Visualizing in 1-D
• Generate random weight matrix• Generate random direction• Compute loss along this direction
W
W1
L(W + aW1)
Loss for single example
Where is the minima?
![Page 9: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/9.jpg)
Visualizing in 2-D
• Compute loss along plane L(W + aW1 + bW2)
Loss for single example Average loss for 100 examples (convex function)
![Page 10: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/10.jpg)
How do we find weights that minimize loss?
• Random search• Try many random weight matrices and pick the best one• Performance: poor
• Random local search• Start with random weight matrix• Try many local perturbations, pick the best one, and iterate• Performance: better but still quite poor
• Useful idea: iterative refinement of weight matrix
![Page 11: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/11.jpg)
Optimization basics
![Page 12: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/12.jpg)
The problem of optimization
Find the value of x where f(x) is minimum
Our setting: x represents weights, f(x) represents loss function
x
f(x)
![Page 13: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/13.jpg)
In two stages
• Function of single variable• Function of multiple variables
![Page 14: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/14.jpg)
Derivative of a function of single variable
df(x)
dx= lim
h !0
f(x+ h)� f(x)
h
![Page 15: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/15.jpg)
Derivatives
d
dx(x2) = 2x
d
dx(ex) = ex
d
dx(ln x) =
1
xif x > 0
![Page 16: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/16.jpg)
Finding minima
Increase x if derivative negative, decrease if positivei.e., take step in direction opposite to sign of derivative(key idea of gradient descent)
x
f(x)𝑑𝑦𝑑𝑥
= 0
𝑑𝑦𝑑𝑥 < 0
𝑑𝑦𝑑𝑥 > 0
![Page 17: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/17.jpg)
Doesn’t always work
• Theoretical and empirical evidence that gradient descent works quite well for deep networks
f(x)
x
global minimum
inflection point
local minimum
global maximum
![Page 18: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/18.jpg)
In two stages
• Function of single variable• Function of multiple variables
![Page 19: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/19.jpg)
Partial derivatives
The partial derivative of an n-ary function f(x1,...,xn) in the direction xi at the point (a1,...,an) is defined to be:
![Page 20: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/20.jpg)
Partial derivative example
By IkamusumeFan - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=42262627
At (1, 1), the slope is 3
![Page 21: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/21.jpg)
The gradient of a scalar function
• The gradient 𝛻𝑓(𝑋) of a scalar function 𝑓(𝑋) of a multi-variate input 𝑋 is a multiplicative factor that gives us the change in 𝑓(𝑋) for tiny variations in 𝑋
𝑑𝑓 𝑋 = 𝛻𝑓 𝑋 𝑑𝑋𝛻𝑓 𝑋 = 𝑑𝑓(𝑋)/𝑑𝑋
𝑑𝑓(𝑋)
𝑑𝑋
![Page 22: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/22.jpg)
Gradients of scalar functions with multi-variate inputs•Consider 𝑓 𝑋 = 𝑓 𝑥., 𝑥0, … , 𝑥2
•𝛻𝑓(𝑋) = 45 6478
45 6479
⋯ 45 647;
![Page 23: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/23.jpg)
Computing gradients analytically
f(x, y) = x+ y ! @f
@x= 1
@f
@y= 1
![Page 24: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/24.jpg)
Computing gradients analytically
f(x, y) = xy ! @f
@x= y
@f
@y= x
rf = [@f
@x,@f
@y] = [y, x]
![Page 25: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/25.jpg)
Derivatives measure sensitivity
If we were to increase by a tiny amount, the effect on the whole expression would be to decrease it (due to the negative sign), and by three times that amount.
x = 4, y = �3 f(x, y) = �12@f
@x= �3
x
f(x, y) = xy ! @f
@x= y
@f
@y= x
![Page 26: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/26.jpg)
Finding minima
Take step in direction opposite to sign of gradient
Gradientvector 𝛻𝑓(𝑋)
Moving in this direction increases 𝑓(𝑋) fastest
−𝛻𝑓(𝑋)Moving in this
direction decreases 𝑓 𝑋 fastest
![Page 27: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/27.jpg)
Gradient descent algorithm
• Initialize: • 𝑥=• 𝑘 = 0
• Do• 𝑥?@. = 𝑥? − 𝜂?𝛻𝑓 𝑥?
• 𝑘 = 𝑘 + 1
• Until 𝑓 𝑥? − 𝑓 𝑥?D. ≤ 𝜀
Average gradient across all training
examples
f(xk) =1
N⌃N
i=1fi(xk)
5f(xk) =1
N⌃N
i=1 5 fi(xk)
![Page 28: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/28.jpg)
Step size affects convergence of gradient descent
Murphy, Machine Learning, Fig 8.2
![Page 29: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/29.jpg)
Gradient descent algorithm
• Initialize: • 𝑥=• 𝑘 = 0
• Do• 𝑥?@. = 𝑥? − 𝜂?𝛻𝑓 𝑥?
• 𝑘 = 𝑘 + 1
• Until 𝑓 𝑥? − 𝑓 𝑥?D. ≤ 𝜀
Challenge to discuss later: How to choose step size?
Average gradient across all training
examples
Challenge: Not scalable for very large data sets
![Page 30: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/30.jpg)
Mini-batch gradient descent
• Initialize: • 𝑥=• 𝑘 = 0
• Do• 𝑥?@. = 𝑥? − 𝜂?𝛻𝑓 𝑥?
• 𝑘 = 𝑘 + 1
• Until 𝑓 𝑥? − 𝑓 𝑥?D. ≤ 𝜀
Faster convergence
Average gradient over small batches
of training examples (e.g., sample of 256
examples)
Special case: Stochastic or online gradient descent àuse single training example in each update step
![Page 31: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/31.jpg)
Stochastic gradient descent convergence
Murphy, Machine Learning, Fig 8.8
![Page 32: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/32.jpg)
SVM loss visualization
Challenge: Gradient does not exist
![Page 33: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/33.jpg)
Computing subgradients analytically
The set of subderivatives at x0 for a convex function is a nonempty closed interval [a, b], where a and b are the one-sided limits:
![Page 34: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/34.jpg)
Computing subgradients analytically
f(x, y) = max(x, y) ! @f
@x= I(x >= y)
@f
@y= I(y >= x)
The (sub)gradient is 1 on the input that is larger and 0 on the other input
![Page 35: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/35.jpg)
Subgradient of SVM loss
Li =X
j 6=yi
⇥max(0, wT
j xi � wTyixi +�)
⇤
rwyiLi = �
0
@X
j 6=yi
I(wTj xi � wT
yixi +� > 0)
1
Axi
Number of classes that didn’t meet the desired margin
rwjLi = I(wTj xi � wT
yixi +� > 0)xi
j-th class didn’t meet the desired margin
![Page 36: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/36.jpg)
Review derivatives
• Please review rules for computing derivatives and partial derivatives of functions, including the chain rule
• https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives
• You will need to use them in HW1!
![Page 37: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/37.jpg)
Summary
![Page 38: Security and Fairness of Deep Learningece739/lectures/18739... · Security and Fairness of Deep Learning. Image Classification. Linear model ... scores •Loss function •Measures](https://reader034.vdocuments.us/reader034/viewer/2022043019/5f3b9effc99335015963221e/html5/thumbnails/38.jpg)
Acknowledgment
• Based on material from• Stanford CS231n http://cs231n.github.io/• CMU 11-785 Course• Spring 2019 Course