TRANSCRIPT
Stochastic Gradient, Multi-Class Classification, Regularization
CS540 Introduction to Artificial Intelligence, Lecture 4
Young Wu, based on lecture slides by Jerry Zhu, Yingyu Liang, and Charles Dyer
June 15, 2020

Perceptron Algorithm vs Logistic Regression (Motivation)

For LTU perceptrons, w is updated for each instance x_i sequentially:
$w \leftarrow w - \alpha (a_i - y_i) x_i$
For logistic perceptrons, w is updated using the gradient that involves all instances in the training data:
$w \leftarrow w - \alpha \sum_{i=1}^{n} (a_i - y_i) x_i$
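
To make the contrast concrete, here is a minimal NumPy sketch of both update rules; the toy data, the learning rate, and the step/sigmoid helpers are illustrative assumptions, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # 100 instances, 3 features
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)
alpha = 0.1

# LTU perceptron: one update per instance, in sequence.
w = np.zeros(3)
for xi, yi in zip(X, y):
    ai = float(xi @ w > 0)                     # step (LTU) activation
    w -= alpha * (ai - yi) * xi                # w <- w - alpha (a_i - y_i) x_i

# Logistic perceptron: one update from the gradient over ALL instances.
w = np.zeros(3)
a = 1.0 / (1.0 + np.exp(-(X @ w)))             # sigmoid activations
w -= alpha * (a - y) @ X                       # w <- w - alpha sum_i (a_i - y_i) x_i
```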

Stochastic Gradient Descent (Motivation)

Each gradient descent step requires computing the gradients for all training instances i = 1, 2, ..., n, which is very costly.
Stochastic gradient descent picks one instance x_i at random, computes the gradient on that instance alone, and updates the weights and biases.
When a small subset of instances is selected randomly each time, it is called mini-batch gradient descent; see the sketch below.
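
A minimal sketch of this loop, assuming a logistic model; setting batch_size = 1 recovers pure stochastic gradient descent, and the data and constants are illustrative.

```python
import numpy as np

def grad(w, Xb, yb):
    """Gradient of the logistic loss on a batch: sum_i (a_i - y_i) x_i."""
    a = 1.0 / (1.0 + np.exp(-(Xb @ w)))
    return (a - yb) @ Xb

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)
w = np.zeros(3)
alpha, batch_size = 0.1, 10

for epoch in range(20):
    order = rng.permutation(len(X))              # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]    # batch_size = 1 gives pure SGD
        w -= alpha * grad(w, X[idx], y[idx])
```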

Stochastic Gradient Descent Diagram 1 (Motivation)

Stochastic Gradient Descent Diagram 2 (Motivation)

Stochastic Gradient Descent, Part 1 (Algorithm)

Inputs, Outputs: same as backpropagation.
Initialize the weights.
Randomly permute (shuffle) the training set. Evaluate the activation functions at one instance at a time.
Compute the gradient using the chain rule:
$\frac{\partial C}{\partial w^{(l)}_{j'j}} = \delta^{(l)}_{ij} a^{(l-1)}_{ij'}$
$\frac{\partial C}{\partial b^{(l)}_{j}} = \delta^{(l)}_{ij}$

Stochastic Gradient Descent, Part 2 (Algorithm)

Update the weights and biases using gradient descent. For l = 1, 2, ..., L:
$w^{(l)}_{j'j} \leftarrow w^{(l)}_{j'j} - \alpha \frac{\partial C}{\partial w^{(l)}_{j'j}}, \quad j' = 1, 2, ..., m^{(l-1)}, \; j = 1, 2, ..., m^{(l)}$
$b^{(l)}_{j} \leftarrow b^{(l)}_{j} - \alpha \frac{\partial C}{\partial b^{(l)}_{j}}, \quad j = 1, 2, ..., m^{(l)}$
Repeat the process until convergence:
$|C - C_{\text{prev}}| < \varepsilon$
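
A sketch of the whole loop with this stopping rule, using plain logistic regression in place of the multi-layer network so the per-instance updates stay visible; the cost, data, and constants are illustrative assumptions.

```python
import numpy as np

def forward(w, b, X):
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))    # sigmoid activation

def cost(w, b, X, y):
    return float(np.mean((forward(w, b, X) - y) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)
w, b = np.zeros(3), 0.0
alpha, eps, C_prev = 0.1, 1e-6, np.inf

while True:
    for i in rng.permutation(len(X)):            # shuffled, one instance at a time
        ai = forward(w, b, X[i])
        w -= alpha * (ai - y[i]) * X[i]          # w <- w - alpha dC/dw
        b -= alpha * (ai - y[i])                 # b <- b - alpha dC/db
    C = cost(w, b, X, y)
    if abs(C - C_prev) < eps:                    # |C - C_prev| < eps
        break
    C_prev = C
```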

Choice of Learning Rate (Discussion)

Changing the learning rate α as the weights get closer to the optimal weights could speed up convergence.
Popular choices of learning rate include $\frac{\alpha}{\sqrt{t}}$ and $\frac{\alpha}{t}$, where t is the current number of iterations.
Other methods of choosing the step size include using second derivative (Hessian) information, such as Newton's method and BFGS, or using information about the gradient in previous steps, such as adaptive gradient methods like AdaGrad and Adam.
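
A minimal sketch of the two decaying schedules above, assuming a base rate alpha0; the prints just show how each shrinks with the iteration count t.

```python
alpha0 = 0.1                                   # assumed base learning rate

def lr_sqrt(t):
    """alpha / sqrt(t) schedule."""
    return alpha0 / t ** 0.5

def lr_linear(t):
    """alpha / t schedule."""
    return alpha0 / t

print([round(lr_sqrt(t), 4) for t in (1, 4, 9, 100)])     # [0.1, 0.05, 0.0333, 0.01]
print([round(lr_linear(t), 4) for t in (1, 2, 10, 100)])  # [0.1, 0.05, 0.01, 0.001]
```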

Stochastic vs Full Gradient Descent (Quiz)

Given the same initial weights and biases, stochastic gradient descent with instances picked randomly without replacement and full gradient descent lead to the same updated weights.
A: Do not choose this.
B: True.
C: Do not choose this.
D: False.
E: Do not choose this.

Multi-Class Classification (Motivation)

When there are K categories to classify, the labels can take K different values, $y_i \in \{1, 2, ..., K\}$.
Logistic regression and neural networks cannot be directly applied to these problems.

Method 1, One vs All (Discussion)

Train a binary classification model with labels $y'_i = \mathbb{1}_{\{y_i = j\}}$ for each j = 1, 2, ..., K.
Given a new test instance x_i, evaluate the activation function $a^{(j)}_i$ from model j:
$y_i = \arg\max_j a^{(j)}_i$
One problem is that the scale of $a^{(j)}_i$ may be different for different j.
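
A sketch of one-vs-all prediction, assuming models is a list of K already-trained binary scorers returning $a^{(j)}_i$; the linear toy scorers are stand-ins, not the slides' training procedure.

```python
import numpy as np

def predict_ova(models, x):
    scores = np.array([m(x) for m in models])   # a_i^(j) for j = 1, ..., K
    return int(np.argmax(scores))               # y_i = argmax_j a_i^(j)

# Toy usage: one linear scorer per class standing in for a trained model.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
models = [lambda x, w=w: float(w @ x) for w in W]
print(predict_ova(models, np.array([2.0, 0.5])))  # -> 0
```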

Method 2, One vs One (Discussion)

Train a binary classification model for each of the $\frac{K(K-1)}{2}$ pairs of labels.
Given a new test instance x_i, apply all $\frac{K(K-1)}{2}$ models and output the class that receives the largest number of votes:
$y_i = \arg\max_j \sum_{j' \neq j} y^{(j \text{ vs } j')}_i$
One problem is that it is not clear what to do if multiple classes receive the same number of votes.
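
A sketch of one-vs-one voting, assuming pairwise maps each class pair (j, j') to a trained binary model that returns the winning class; the nearest-prototype toy models are stand-ins, and argmax silently breaks ties by the lowest class index, which is exactly the ambiguity noted above.

```python
from itertools import combinations
import numpy as np

def predict_ovo(pairwise, x, K):
    votes = np.zeros(K, dtype=int)
    for j, jp in combinations(range(K), 2):     # all K(K-1)/2 pairs
        votes[pairwise[(j, jp)](x)] += 1        # model returns j or jp
    return int(np.argmax(votes))                # ties broken by lowest index

# Toy usage: each pair model votes for the class whose prototype is closer to x.
protos = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
pairwise = {(j, jp): (lambda x, j=j, jp=jp:
                      j if np.linalg.norm(x - protos[j]) < np.linalg.norm(x - protos[jp])
                      else jp)
            for j, jp in combinations(range(3), 2)}
print(predict_ovo(pairwise, np.array([0.9, 0.8]), K=3))  # -> 1
```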

One Hot Encoding (Discussion)

If y is not binary, use one-hot encoding for y.
For example, if y has three categories, then
$y_i \in \left\{ \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} \right\}$
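
A one-line NumPy idiom for building these vectors (indexing rows of the identity matrix); one common approach, not the only one.

```python
import numpy as np

def one_hot(y, K):
    return np.eye(K)[y]          # row y_i of I_K is the one-hot vector for class y_i

print(one_hot(np.array([0, 2, 1]), K=3))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
```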

Method 3, Softmax Function (Discussion)

For both logistic regression and neural networks, the last layer will have K units, a_{ij} for j = 1, 2, ..., K, and the softmax function is used instead of the sigmoid function:
$a_{ij} = g\left(w_j^T x_i + b_j\right) = \frac{\exp\left(w_j^T x_i + b_j\right)}{\sum_{j'=1}^{K} \exp\left(w_{j'}^T x_i + b_{j'}\right)}, \quad j = 1, 2, ..., K$
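
A sketch of the softmax computation; subtracting the maximum logit before exponentiating is a standard numerical-stability detail that is not on the slide and does not change the output.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)            # shift for stability; the ratio is unchanged
    e = np.exp(z)
    return e / np.sum(e)

# a_ij = exp(w_j^T x_i + b_j) / sum_j' exp(w_j'^T x_i + b_j'), illustrative values
W = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])   # K = 3 rows w_j
b = np.zeros(3)
x = np.array([1.0, 2.0])
a = softmax(W @ x + b)
print(a, a.sum())                # K probabilities summing to 1
```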

Softmax Derivatives (Discussion)

Cross-entropy loss is also commonly used with a softmax activation function.
The gradient of the cross-entropy loss with respect to z_{ij}, the linear input (logit) of output unit j for instance i, has the same form as the one for logistic regression:
$\frac{\partial C}{\partial z_{ij}} = a_{ij} - y_{ij} \Rightarrow \nabla_{z_i} C = a_i - y_i$
The gradient with respect to the weights can then be found using the chain rule.
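
A sketch that verifies this form numerically with finite differences; the logits and one-hot label are arbitrary test values.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(z, y):
    return -float(y @ np.log(softmax(z)))      # C = -sum_j y_j log a_j

z = np.array([0.2, -1.0, 0.5])                 # logits
y = np.array([0.0, 1.0, 0.0])                  # one-hot label
analytic = softmax(z) - y                      # claimed gradient: a - y

h, numeric = 1e-6, np.zeros_like(z)            # central finite differences
for j in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[j] += h
    zm[j] -= h
    numeric[j] = (cross_entropy(zp, y) - cross_entropy(zm, y)) / (2 * h)

print(np.allclose(analytic, numeric, atol=1e-6))   # True
```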

Generalization Error (Motivation)

With a large number of hidden units and a small enough learning rate α, a multi-layer neural network can fit every finite training set perfectly.
It does not imply the performance on the test set will be good.
This problem is called overfitting.

Generalization Error Diagram (Motivation)

Method 1, Validation Set (Discussion)

Set aside a subset of the training set as the validation set.
During training, the cost on the training set will keep decreasing until it hits 0 (equivalently, training accuracy keeps increasing).
Train the network only until the cost on the validation set begins to increase (or accuracy on the validation set begins to decrease); see the sketch below.
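
A self-contained sketch of this early-stopping rule on a toy logistic model; the data, the 150/50 split, and the learning rate are illustrative assumptions.

```python
import numpy as np

def cost(w, Xs, ys):
    a = 1.0 / (1.0 + np.exp(-(Xs @ w)))
    return float(np.mean((a - ys) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X @ rng.normal(size=5) + 0.5 * rng.normal(size=200) > 0).astype(float)
X_tr, y_tr = X[:150], y[:150]                  # training portion
X_val, y_val = X[150:], y[150:]                # held-out validation set

w, alpha, prev_val = np.zeros(5), 0.5, np.inf
for epoch in range(1000):
    a = 1.0 / (1.0 + np.exp(-(X_tr @ w)))
    w -= alpha * (a - y_tr) @ X_tr / len(X_tr)
    val = cost(w, X_val, y_val)
    if val > prev_val:                         # validation cost started increasing
        break
    prev_val = val
print(epoch, cost(w, X_tr, y_tr), val)
```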

Validation Set Diagram (Discussion)

Method 2, Drop Out (Discussion)

At each hidden layer, a random set of units from that layer is set to 0 during training.
For example, each unit is retained with probability p = 0.5. At test time, the activations are scaled by p = 0.5 (that is, reduced by 50 percent) so their expected values match training.
The intuition is that if a hidden unit works well with different combinations of other units, it does not rely on other units and is likely to be individually useful.
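
A sketch of the two dropout modes with assumed retention p = 0.5: training zeroes a random mask of units, and testing scales all activations by p.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                        # retention probability

def dropout_train(h):
    mask = rng.random(h.shape) < p             # keep each unit with probability p
    return h * mask

def dropout_test(h):
    return h * p                               # scale so expected values match training

h = np.array([0.2, 1.5, -0.7, 0.9])            # one hidden layer's activations
print(dropout_train(h))                        # random units zeroed
print(dropout_test(h))                         # every unit scaled by 0.5
```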

Drop Out Diagram (Discussion)

Method 3, L1 and L2 Regularization (Discussion)

The idea is to include an additional cost for non-zero weights.
The models are simpler if many weights are zero.
For example, if logistic regression has only a few non-zero weights, it means only a few features are relevant, so only these features are used for prediction.

Method 3, L1 Regularization (Discussion)

For L1 regularization, add the 1-norm of the weights to the cost:
$C = \sum_{i=1}^{n} (a_i - y_i)^2 + \lambda \left\| \begin{pmatrix} w \\ b \end{pmatrix} \right\|_1 = \sum_{i=1}^{n} (a_i - y_i)^2 + \lambda \left( \sum_{j=1}^{m} |w_j| + |b| \right)$
Linear regression with L1 regularization is called LASSO (least absolute shrinkage and selection operator).

Method 3, L2 Regularization (Discussion)

For L2 regularization, add the squared 2-norm of the weights to the cost:
$C = \sum_{i=1}^{n} (a_i - y_i)^2 + \lambda \left\| \begin{pmatrix} w \\ b \end{pmatrix} \right\|_2^2 = \sum_{i=1}^{n} (a_i - y_i)^2 + \lambda \left( \sum_{j=1}^{m} w_j^2 + b^2 \right)$
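
A sketch of both penalized costs and of the penalty's contribution to a gradient step; λ, the weights, and the squared-error base cost are illustrative, and the L1 term uses a subgradient (sign) since |·| is not differentiable at 0.

```python
import numpy as np

def penalized_cost(w, b, a, y, lam, norm):
    base = float(np.sum((a - y) ** 2))
    if norm == "l1":
        return base + lam * (np.sum(np.abs(w)) + abs(b))   # lambda * ||(w, b)||_1
    return base + lam * (np.sum(w ** 2) + b ** 2)          # lambda * ||(w, b)||_2^2

def penalty_grad(w, lam, norm):
    """Penalty term's contribution to dC/dw: lam*sign(w) for L1, 2*lam*w for L2."""
    return lam * np.sign(w) if norm == "l1" else 2 * lam * w

w = np.array([0.8, -0.1, 0.0])
print(penalty_grad(w, lam=0.1, norm="l1"))     # [ 0.1 -0.1  0. ]
print(penalty_grad(w, lam=0.1, norm="l2"))     # [ 0.16 -0.02  0. ]
```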

L1 and L2 Regularization Comparison (Discussion)

L1 regularization leads to more weights that are exactly 0. It is useful for feature selection.
L2 regularization leads to more weights that are close to 0 but not exactly 0. It is easier to use with gradient descent, because the squared 2-norm is differentiable everywhere while the 1-norm is not differentiable at 0.

L1 and L2 Regularization Diagram (Discussion)

Method 4, Data Augmentation (Discussion)

More training data can be created from the existing data, for example, by translating or rotating the handwritten digits; see the sketch below.
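
A sketch of such augmentation using scipy.ndimage (an assumed dependency); the shift offsets and rotation angles are arbitrary illustrative choices.

```python
import numpy as np
from scipy.ndimage import shift, rotate

def augment(image):
    """Return the original image plus shifted and rotated copies."""
    copies = [image]
    for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]:       # 1-pixel translations
        copies.append(shift(image, (dy, dx), mode="constant"))
    for angle in (-10, 10):                                  # small rotations
        copies.append(rotate(image, angle, reshape=False, mode="constant"))
    return np.stack(copies)

digit = np.zeros((28, 28))
digit[10:18, 12:16] = 1.0                      # toy stand-in for a handwritten digit
print(augment(digit).shape)                    # (7, 28, 28): 1 original + 6 copies
```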

Hyperparameters (Discussion)

It is not clear how to choose the learning rate α, the stopping criterion ε, and the regularization parameters.
For neural networks, it is also not clear how to choose the number of hidden layers and the number of hidden units in each layer.
The parameters that are not parameters of the functions in the hypothesis space are called hyperparameters.

K Fold Cross Validation (Discussion)

Partition the training set into K groups.
Pick one group as the validation set.
Train the model on the remaining training set.
Repeat the process for each of the K groups.
Compare accuracy (or cost) for models with different hyperparameters and select the best one.

5 Fold Cross Validation Example (Discussion)

Partition the training set S into 5 subsets S_1, S_2, S_3, S_4, S_5 with
$S_i \cap S_j = \emptyset$ for $i \neq j$, and $\bigcup_{i=1}^{5} S_i = S$

Iteration | Training                | Validation
1         | S_2 ∪ S_3 ∪ S_4 ∪ S_5   | S_1
2         | S_1 ∪ S_3 ∪ S_4 ∪ S_5   | S_2
3         | S_1 ∪ S_2 ∪ S_4 ∪ S_5   | S_3
4         | S_1 ∪ S_2 ∪ S_3 ∪ S_5   | S_4
5         | S_1 ∪ S_2 ∪ S_3 ∪ S_4   | S_5
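
A sketch of the K-fold loop for selecting one hyperparameter, with an L2-regularized logistic model standing in for whatever model is being tuned; the λ grid, data, and training routine are illustrative assumptions.

```python
import numpy as np

def fit(Xtr, ytr, lam, alpha=0.1, epochs=200):
    """Gradient descent on logistic loss plus an L2 penalty with weight lam."""
    w = np.zeros(Xtr.shape[1])
    for _ in range(epochs):
        a = 1.0 / (1.0 + np.exp(-(Xtr @ w)))
        w -= alpha * ((a - ytr) @ Xtr / len(Xtr) + 2 * lam * w)
    return w

def accuracy(w, Xs, ys):
    return float(np.mean(((Xs @ w) > 0) == ys))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X @ np.array([1.0, -1.0, 0.0, 0.0]) > 0).astype(float)

K = 5
folds = np.array_split(rng.permutation(len(X)), K)   # indices of S_1, ..., S_K
for lam in (0.0, 0.01, 0.1):
    accs = []
    for k in range(K):                               # fold k is the validation set
        val = folds[k]
        tr = np.concatenate([folds[j] for j in range(K) if j != k])
        w = fit(X[tr], y[tr], lam)
        accs.append(accuracy(w, X[val], y[val]))
    print(lam, round(float(np.mean(accs)), 3))       # mean validation accuracy
```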

Leave One Out Cross Validation (Discussion)

If K = n, each time exactly one training instance is left out as the validation set. This special case is called Leave One Out Cross Validation (LOOCV).