Optimization for Machine Learning - Computer Science at UBC


Page 1: Optimization for Machine Learning - Computer Science at UBC

Optimization for Machine Learning
CS 406

Mark Schmidt

UBC Computer Science

Term 2, 2014-15

Page 2: Optimization for Machine Learning - Computer Science at UBC

Goals of this Lecture

1. Give an overview and motivation for the machine learning technique of supervised learning.

2. Generalize convergence rates of gradient methods for solving linear systems to general smooth convex optimization problems.

3. Introduce the proximal-gradient algorithm, one of the most efficient algorithms for solving special classes of non-smooth convex optimization problems.

4. Introduce the stochastic-gradient algorithm, for solving data-fitting problems when the size of the data is very large.

Page 3: Optimization for Machine Learning - Computer Science at UBC

Machine Learning

Machine Learning

Study of using computers to automatically detect patterns in data, and use these to make predictions or decisions.

One of the fastest-growing areas of science/engineering.

Recent successes: Kinect, book/movie recommendation, spam detection, credit card fraud detection, face recognition, speech recognition, object recognition, self-driving cars.


Page 5: Optimization for Machine Learning - Computer Science at UBC

Machine Learning / Supervised Learning

Supervised learning

Supervised learning:
- Given input and output examples.
- Build a model that predicts the output from the inputs.
- You can use the model to predict the output on new inputs.

Canonical example: hand-written digit recognition:

[Figure 1.5 from the accompanying textbook: the first 9 MNIST test gray-scale images, each labelled with its true class (7, 2, 1, 0, 4, 1, 4, 9, 5). MNIST contains 60,000 training images and 10,000 test images of the digits 0 to 9, each 28 x 28 with grayscale values in 0:255. Available from http://yann.lecun.com/exdb/mnist/.]


Page 7: Optimization for Machine Learning - Computer Science at UBC

Machine Learning / Supervised Learning

Supervised Learning

You have a well-defined pattern recognition problem.

But don't know how to write a program to solve it.

And you have lots of labeled data.

Key reason for machine learning's popularity and success.


Page 10: Optimization for Machine Learning - Computer Science at UBC

Machine Learning / Supervised Learning

Training and Testing

Steps for supervised learning:
1. Training phase: build a model that maps from input features to labels (based on many examples of the correct behaviour).
2. Testing phase: the model is used to label new inputs.

Page 11: Optimization for Machine Learning - Computer Science at UBC

Machine Learning / Problem Formulation

Connection to Numerical Optimization

Typically, the training phase is formulated as an optimization problem,

$$\min_x \sum_{i=1}^m f_i(x) + \lambda r(x),$$

a data-fitting term plus a regularizer.

Data-fitting term: how well does x fit data sample i?

Regularizer: how simple is x? (Simple models are more likely to do well at test time.)

An example is least squares,

$$f_i(x) = \frac{1}{2}\Big(b_i - \sum_{j=1}^n a_{ij} x_j\Big)^2.$$

Squared $\ell_2$-norm regularization:

$$r(x) = \|x\|^2.$$

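As a concrete illustration of this formulation (not from the slides), here is a minimal Python/NumPy sketch that evaluates the regularized least-squares objective and its gradient; the data A, b and the value of lambda are made-up placeholders.

    import numpy as np

    def objective_and_gradient(x, A, b, lam):
        """Regularized least squares: sum_i 0.5*(b_i - a_i^T x)^2 + lam*||x||^2."""
        residual = b - A @ x                   # r_i = b_i - sum_j a_ij x_j
        f = 0.5 * residual @ residual + lam * x @ x
        grad = -A.T @ residual + 2.0 * lam * x # gradient of data fit plus regularizer
        return f, grad

    # Tiny made-up example.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((100, 5))          # 100 samples, 5 features
    b = rng.standard_normal(100)
    f, grad = objective_and_gradient(np.zeros(5), A, b, lam=0.1)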

Page 15: Optimization for Machine Learning - Computer Science at UBC

Convergence Rates of First-Order Algorithms

Outline

1. Machine Learning
2. Convergence Rates of First-Order Algorithms
   - Motivation and Notation
   - Convergence Rate
3. Proximal-Gradient Methods
4. Stochastic Gradient Methods

Page 16: Optimization for Machine Learning - Computer Science at UBC

Convergence Rates of First-Order Algorithms / Motivation and Notation

Motivation for First-Order Methods

We first consider the unconstrained optimization problem,

$$\min_x f(x).$$

In typical ML models, the dimension n is very large.

We will focus on matrix-free methods, as in the previous lecture:
- This allows n to be in the billions or more.
- We can show dimension-independent convergence rates.

As before, the simplest case is gradient descent,

$$x^{k+1} = x^k - \alpha_k \nabla f(x^k).$$

How many iterations are needed?

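A minimal sketch of this iteration in Python; the least-squares objective, the fixed step size 1/lambda_1, and the iteration count are illustrative assumptions, not something the slides specify.

    import numpy as np

    def gradient_descent(grad, x0, step, iters=100):
        """Run x^{k+1} = x^k - step * grad(x^k) for a fixed number of iterations."""
        x = x0.copy()
        for _ in range(iters):
            x = x - step * grad(x)
        return x

    # Least-squares example: f(x) = 0.5*||b - Ax||^2, gradient A^T(Ax - b).
    rng = np.random.default_rng(1)
    A = rng.standard_normal((50, 10))
    b = rng.standard_normal(50)
    grad = lambda x: A.T @ (A @ x - b)          # only matrix-vector products are needed
    L = np.linalg.eigvalsh(A.T @ A).max()       # lambda_1: largest Hessian eigenvalue
    x_hat = gradient_descent(grad, np.zeros(10), step=1.0 / L)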

Page 18: Optimization for Machine Learning - Computer Science at UBC

Convergence Rates of First-Order Algorithms / Motivation and Notation

Strongly-Convex and Strongly-Smooth

Consider the special case of least squares:

$$\min_x f(x) = \frac{1}{2}\|b - Ax\|^2.$$

Recall that

$$\nabla^2 f(x) = A^T A,$$

so the eigenvalues of $\nabla^2 f(x)$ are between $\lambda_1$ and $\lambda_n$ for all x.

Functions f whose Hessian eigenvalues are bounded between positive constants for all x are called 'strongly smooth' and 'strongly convex'.

These assumptions are sufficient to show a linear convergence rate.

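To make the constants concrete, a small NumPy check (the random A below is purely illustrative) that the Hessian A^T A of the least-squares objective has its eigenvalues between these two positive constants:

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.standard_normal((200, 20))

    H = A.T @ A                       # Hessian of 0.5*||b - Ax||^2 (independent of x)
    eigs = np.linalg.eigvalsh(H)      # all eigenvalues, sorted ascending
    lam_n, lam_1 = eigs[0], eigs[-1]  # smallest and largest eigenvalue
    print("strong-convexity constant lambda_n:", lam_n)
    print("strong-smoothness constant lambda_1:", lam_1)
    print("ratio lambda_1 / lambda_n (the condition number used later):", lam_1 / lam_n)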

Page 20: Optimization for Machine Learning - Computer Science at UBC

Convergence Rates of First-Order Algorithms / Motivation and Notation

Implication of Strong-Smoothness

From Taylor's theorem, for some z we have:

$$f(y) = f(x) + \nabla f(x)^T (y - x) + \frac{1}{2}(y - x)^T \nabla^2 f(z)(y - x).$$

Use that $v^T \nabla^2 f(z) v \le \lambda_1 v^T v = \lambda_1 \|v\|^2$ for any v and z:

$$f(y) \le f(x) + \nabla f(x)^T (y - x) + \frac{\lambda_1}{2}\|y - x\|^2.$$

This is a global quadratic upper bound on the function value.

[Figure: f(x), its linear approximation f(x) + ∇f(x)^T(y - x), and the quadratic upper bound f(x) + ∇f(x)^T(y - x) + (L/2)||y - x||^2, with L = λ1.]


Page 25: Optimization for Machine Learning - Computer Science at UBC

Convergence Rates of First-Order Algorithms / Motivation and Notation

Implication of Strong-Convexity

From Taylor's theorem, for some z we have:

$$f(y) = f(x) + \nabla f(x)^T (y - x) + \frac{1}{2}(y - x)^T \nabla^2 f(z)(y - x).$$

Use that $v^T \nabla^2 f(z) v \ge \lambda_n \|v\|^2$ for any v and z:

$$f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{\lambda_n}{2}\|y - x\|^2.$$

This is a global quadratic lower bound on the function value.

[Figure: f(x), its linear approximation f(x) + ∇f(x)^T(y - x), and the quadratic lower bound f(x) + ∇f(x)^T(y - x) + (μ/2)||y - x||^2, with μ = λn.]


Page 30: Optimization for Machine Learning - Computer Science at UBC

Convergence Rates of First-Order Algorithms / Convergence Rate

Bounds on Progress and Sub-Optimality

We have the upper bound

$$f(x^{k+1}) \le f(x^k) + \nabla f(x^k)^T (x^{k+1} - x^k) + \frac{\lambda_1}{2}\|x^{k+1} - x^k\|^2.$$

Treating $x^{k+1}$ as a variable and minimizing the right side gives

$$x^{k+1} = x^k - \frac{1}{\lambda_1}\nabla f(x^k), \qquad f(x^{k+1}) \le f(x^k) - \frac{1}{2\lambda_1}\|\nabla f(x^k)\|^2,$$

which is gradient descent with a particular step-size.

We have the lower bound

$$f(y) \ge f(x^k) + \nabla f(x^k)^T (y - x^k) + \frac{\lambda_n}{2}\|y - x^k\|^2,$$

and minimizing both sides with respect to y gives

$$f(x^*) \ge f(x^k) - \frac{1}{2\lambda_n}\|\nabla f(x^k)\|^2,$$

which bounds how far $f(x^k)$ is above the optimal value $f(x^*)$.

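For completeness, the two minimization steps above can be filled in explicitly; this short derivation is not on the slide but follows directly from the stated bounds.

$$\nabla f(x^k) + \lambda_1 (x^{k+1} - x^k) = 0 \;\Longrightarrow\; x^{k+1} = x^k - \tfrac{1}{\lambda_1}\nabla f(x^k),$$

and substituting this choice back into the upper bound gives

$$f(x^{k+1}) \le f(x^k) - \tfrac{1}{\lambda_1}\|\nabla f(x^k)\|^2 + \tfrac{1}{2\lambda_1}\|\nabla f(x^k)\|^2 = f(x^k) - \tfrac{1}{2\lambda_1}\|\nabla f(x^k)\|^2.$$

Similarly, the right side of the lower bound is minimized at $y = x^k - \tfrac{1}{\lambda_n}\nabla f(x^k)$, and using $f(x^*) = \min_y f(y) \ge \min_y \{\text{right side}\}$ gives

$$f(x^*) \ge f(x^k) - \tfrac{1}{2\lambda_n}\|\nabla f(x^k)\|^2.$$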

Page 35: Optimization for Machine Learning - Computer Science at UBC

Convergence Rates of First-Order Algorithms / Convergence Rate

Linear Convergence of Gradient Descent

We have bounds on $x^{k+1}$ and $x^*$:

$$f(x^{k+1}) \le f(x^k) - \frac{1}{2\lambda_1}\|\nabla f(x^k)\|^2, \qquad f(x^*) \ge f(x^k) - \frac{1}{2\lambda_n}\|\nabla f(x^k)\|^2.$$

These bounds give guaranteed progress and a maximum sub-optimality.

[Figure: f(x) with the quadratic upper bound showing the guaranteed progress to f(x^+), and the quadratic lower bound showing the maximum sub-optimality.]


Page 41: Optimization for Machine Learning - Computer Science at UBC

Convergence Rates of First-Order Algorithms / Convergence Rate

Linear Convergence of Gradient Descent

We have bounds on $x^{k+1}$ and $x^*$:

$$f(x^{k+1}) \le f(x^k) - \frac{1}{2\lambda_1}\|\nabla f(x^k)\|^2, \qquad f(x^*) \ge f(x^k) - \frac{1}{2\lambda_n}\|\nabla f(x^k)\|^2.$$

Combining them gives

$$f(x^{k+1}) - f(x^*) \le \left(1 - \frac{\lambda_n}{\lambda_1}\right)[f(x^k) - f(x^*)].$$

Applying this recursively gives a linear convergence rate:

$$f(x^k) - f(x^*) \le \left(1 - \frac{\lambda_n}{\lambda_1}\right)^k [f(x^0) - f(x^*)].$$

We say that the condition number of f is $\kappa_f = \lambda_1/\lambda_n$.

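A small numerical sanity check of this rate (purely illustrative: random least-squares data, the step size 1/lambda_1 from the earlier slide, and a reference solution from a direct solver):

    import numpy as np

    rng = np.random.default_rng(3)
    A = rng.standard_normal((80, 8))
    b = rng.standard_normal(80)

    f = lambda x: 0.5 * np.sum((b - A @ x) ** 2)
    grad = lambda x: A.T @ (A @ x - b)

    eigs = np.linalg.eigvalsh(A.T @ A)
    lam_n, lam_1 = eigs[0], eigs[-1]
    rate = 1.0 - lam_n / lam_1

    x_star = np.linalg.lstsq(A, b, rcond=None)[0]   # exact minimizer, for reference only
    x = np.zeros(8)
    gap0 = f(x) - f(x_star)
    for k in range(1, 51):
        x = x - (1.0 / lam_1) * grad(x)
        # Check f(x^k) - f(x^*) <= (1 - lam_n/lam_1)^k * (f(x^0) - f(x^*)).
        assert f(x) - f(x_star) <= rate ** k * gap0 + 1e-10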

Page 43: Optimization for Machine Learning - Computer Science at UBC

Convergence Rates of First-Order Algorithms / Convergence Rate

Convergence Rate of Gradient Descent

What about line-search?
- Exact line-search has the same rate (using $\alpha_k = 1/\lambda_1$ is a special case).
- Backtracking line-search has a slightly slower rate (but doesn't need $\lambda_1$).

It is also possible to show that a different step-size gives

$$\|x^k - x^*\| \le \left(\frac{\kappa_f - 1}{\kappa_f + 1}\right)^k \|x^0 - x^*\|.$$

This is similar to the rate for solving linear systems (last lecture).

Can we derive a method with the faster rate of conjugate gradient?
('Non-linear' conjugate gradient methods don't actually have a faster rate.)


Page 46: Optimization for Machine Learning - Computer Science at UBC

Convergence Rates of First-Order Algorithms / Convergence Rate

Nesterov's accelerated gradient method

There is a method similar to conjugate gradient,

$$x^{k+1} = y^k - \alpha_k \nabla f(y^k),$$
$$y^{k+1} = x^k + \beta_k (x^{k+1} - x^k),$$

called Nesterov's accelerated gradient method.

It has a faster rate of

$$f(x^k) - f(x^*) \le \left(1 - \frac{1}{\sqrt{\kappa_f}}\right)^k [f(x^0) - f(x^*)].$$

It is slower in practice than non-linear conjugate gradient and quasi-Newton methods, but the rate does not depend on the dimension and the method generalizes to non-smooth problems...

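A minimal Python sketch of this method, written in the slide's form y^{k+1} = x^k + beta_k (x^{k+1} - x^k) (so beta_k > 1). The constant momentum value used below is one standard choice for strongly convex problems and is an assumption, not something the slide specifies.

    import numpy as np

    def nesterov(grad, x0, L, mu, iters=200):
        """Nesterov's accelerated gradient for an L-smooth, mu-strongly-convex f."""
        kappa = L / mu
        # Assumed momentum: equals 1 + (sqrt(kappa)-1)/(sqrt(kappa)+1) in the usual form.
        beta = 2.0 * np.sqrt(kappa) / (np.sqrt(kappa) + 1.0)
        x = x0.copy()
        y = x0.copy()
        for _ in range(iters):
            x_new = y - (1.0 / L) * grad(y)   # gradient step from the extrapolated point
            y = x + beta * (x_new - x)        # extrapolate past the new iterate
            x = x_new
        return x

    # Example: accelerated gradient on a least-squares problem.
    rng = np.random.default_rng(4)
    A = rng.standard_normal((100, 10))
    b = rng.standard_normal(100)
    grad = lambda x: A.T @ (A @ x - b)
    eigs = np.linalg.eigvalsh(A.T @ A)
    x_hat = nesterov(grad, np.zeros(10), L=eigs[-1], mu=eigs[0])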

Page 49: Optimization for Machine Learning - Computer Science at UBC

Proximal-Gradient Methods

Outline

1. Machine Learning
2. Convergence Rates of First-Order Algorithms
3. Proximal-Gradient Methods
   - Motivation: LASSO
   - Projected Gradient
   - Proximal-Gradient
4. Stochastic Gradient Methods

Page 50: Optimization for Machine Learning - Computer Science at UBC

Proximal-Gradient Methods / Motivation: LASSO

Motivation: Spam Filtering

We want to recognize e-mail spam.

We will look at phrases in the e-mail messages:
- "CPSC 406".
- "Meet singles in your area now".

There are too many possible phrases (the model would be huge).

But some are more helpful than others: feature selection.


Page 53: Optimization for Machine Learning - Computer Science at UBC

Proximal-Gradient Methods / Motivation: LASSO

LASSO: Sparse Regularization

Consider the $\ell_1$-regularized least squares problem (LASSO),

$$\min_x \frac{1}{2}\|b - Ax\|^2 + \lambda\|x\|_1.$$

Recall the definition of the $\ell_1$-norm,

$$\|x\|_1 = \sum_j |x_j|.$$

The $\ell_1$-norm shrinks x, and encourages $x_j$ to be exactly zero:
- The weight $x_j$ for "meet singles" should be high (relevant).
- The weight $x_j$ for "Hello" should be 0 (not relevant).

Each column of A contains the values of one feature, so setting $x_j = 0$ means we ignore the feature.

The challenge is that $|x_j|$ is non-differentiable.


Page 56: Optimization for Machine Learning - Computer Science at UBC

Proximal-Gradient Methods / Motivation: LASSO

LASSO: Sparse Regularization

How can we solve non-differentiable problems like the LASSO?

Try to convert it into a smooth problem?
- We can write the LASSO as a quadratic program (QP).
- But we can't solve general huge-dimensional QPs.

Use an off-the-shelf non-smooth solver?
- These methods have sub-linear convergence rates.
- They are very slow!

Use a special class of methods called proximal-gradient methods.


Page 60: Optimization for Machine Learning - Computer Science at UBC

Proximal-Gradient Methods / Projected Gradient

Example: Non-Negative Least Squares

Consider non-negative least squares,

$$\min_{x \ge 0} \sum_{i=1}^m \Big(b_i - \sum_{j=1}^n a_{ij} x_j\Big)^2.$$

Should this be easier than with general constraints?

The constraints are simple: given y, we can efficiently find the closest x satisfying the constraints (just set the negative $y_i$ to zero).

Gradient projection alternates between a gradient step and a projection step:

$$x^{k+1} = \mathrm{project}[x^k - \alpha_k \nabla f(x^k)],$$

$$\mathrm{project}[y] = \arg\min_x \frac{1}{2}\|x - y\|^2 \quad \text{s.t. } x \ge 0.$$

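A minimal Python sketch of gradient projection for this non-negative least-squares problem; the random data and the fixed step size (one over the Lipschitz constant of the gradient) are illustrative assumptions.

    import numpy as np

    def project_nonnegative(y):
        """Closest point to y with x >= 0: clip the negative entries to zero."""
        return np.maximum(y, 0.0)

    def projected_gradient_nnls(A, b, iters=200):
        grad = lambda x: 2.0 * A.T @ (A @ x - b)          # gradient of sum_i (b_i - a_i^T x)^2
        step = 1.0 / (2.0 * np.linalg.eigvalsh(A.T @ A)[-1])
        x = np.zeros(A.shape[1])
        for _ in range(iters):
            x = project_nonnegative(x - step * grad(x))   # gradient step, then projection
        return x

    rng = np.random.default_rng(5)
    A = rng.standard_normal((60, 6))
    b = rng.standard_normal(60)
    x = projected_gradient_nnls(A, b)
    print(x.min() >= 0)    # the iterates stay feasible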

Page 63: Optimization for Machine Learning - Computer Science at UBC

Proximal-Gradient Methods / Projected Gradient

Gradient Projection

[Figure: gradient projection in two dimensions (x1, x2) on f(x) with a feasible set: from the iterate x^k, the gradient step x^k - α∇f(x^k) leaves the feasible set and is projected back onto it, giving [x^k - α∇f(x^k)]^+.]

Page 70: Optimization for Machine Learning - Computer Science at UBC

Proximal-Gradient Methods / Projected Gradient

Simple Constraints

Gradient projection has the same convergence rate as the unconstrained gradient method,

$$f(x^k) - f(x^*) \le \left(1 - \frac{1}{\kappa_f}\right)^k [f(x^0) - f(x^*)].$$

You can do a line-search to select the step-size.

Accelerated gradient projection,

$$x^{k+1} = \mathrm{project}[y^k - \alpha_k \nabla f(y^k)],$$
$$y^{k+1} = x^k + \beta_k (x^{k+1} - x^k),$$

gives a better dependence on the condition number,

$$f(x^k) - f(x^*) \le \left(1 - \frac{1}{\sqrt{\kappa_f}}\right)^k [f(x^0) - f(x^*)].$$


Page 72: Optimization for Machine Learning - Computer Science at UBC

Proximal-Gradient Methods / Proximal-Gradient

Proximal-Gradient Method

The proximal-gradient method addresses problems of the form

$$\min_x f(x) + r(x),$$

where f is smooth but r is a general convex function.

It alternates between a gradient descent step on f and the proximity operator of r:

$$x^{k+\frac{1}{2}} = x^k - \alpha_k \nabla f(x^k),$$

$$x^{k+1} = \arg\min_y \left\{ \frac{1}{2}\|y - x^{k+\frac{1}{2}}\|^2 + \alpha_k r(y) \right\}.$$

Convergence rates are still the same as for minimizing f alone.

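A generic proximal-gradient sketch in Python; prox_r stands for the proximity operator of the regularizer and is supplied by the caller. The example regularizer below (the indicator of the non-negative orthant, whose prox is projection) and the data are illustrative assumptions, not part of the slides.

    import numpy as np

    def proximal_gradient(grad_f, prox_r, x0, step, iters=200):
        """Alternate a gradient step on f with the proximity operator of r.

        prox_r(z, t) should return argmin_y 0.5*||y - z||^2 + t*r(y).
        """
        x = x0.copy()
        for _ in range(iters):
            x_half = x - step * grad_f(x)     # gradient step on the smooth part f
            x = prox_r(x_half, step)          # proximity operator of step * r
        return x

    # Example: r(x) = indicator of {x >= 0}, so prox_r is projection onto the orthant.
    rng = np.random.default_rng(6)
    A = rng.standard_normal((60, 6))
    b = rng.standard_normal(60)
    grad_f = lambda x: A.T @ (A @ x - b)
    prox_r = lambda z, t: np.maximum(z, 0.0)
    step = 1.0 / np.linalg.eigvalsh(A.T @ A)[-1]
    x = proximal_gradient(grad_f, prox_r, np.zeros(6), step)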

Page 75: Optimization for Machine Learning - Computer Science at UBC

Proximal-Gradient Methods / Proximal-Gradient

Special case of Projected-Gradient Methods

Projected-gradient methods are a special case: taking

$$r(x) = \begin{cases} 0 & \text{if } x \in \mathcal{C} \\ \infty & \text{if } x \notin \mathcal{C} \end{cases}$$

gives

$$x^{k+1} = \mathrm{project}_{\mathcal{C}}[x^k - \alpha_k \nabla f(x^k)].$$

[Figure: f(x) with a feasible set C; the gradient step x - α f'(x) leaves C and is projected back onto it, giving the point x^+.]


Page 82: Optimization for Machine Learning - Computer Science at UBC

Proximal-Gradient Methods / Proximal-Gradient

Proximal Operator, Iterative Soft Thresholding

For L1-regularization, we obtain iterative soft-thresholding:

$$x^{k+1} = \mathrm{softThresh}_{\alpha_k \lambda}[x^k - \alpha_k \nabla f(x^k)].$$

Example with $\lambda = 1$:

    Input      Threshold   Soft-Threshold
    0.6715     0           0
    -1.2075    -1.2075     -0.2075
    0.7172     0           0
    1.6302     1.6302      0.6302
    0.4889     0           0

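A small sketch of the soft-threshold operator (the proximity operator of the scaled L1-norm), applied to the example vector from the table above with threshold 1; this is a standard formula, not code from the course.

    import numpy as np

    def soft_threshold(z, t):
        """Entrywise prox of t*|.|: shrink towards zero by t, setting small entries to exactly 0."""
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    z = np.array([0.6715, -1.2075, 0.7172, 1.6302, 0.4889])
    print(soft_threshold(z, 1.0))   # -> [ 0.     -0.2075  0.      0.6302  0.    ]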

Page 86: Optimization for Machine Learning - Computer Science at UBC

Proximal-Gradient Methods / Proximal-Gradient

Exact Proximal-Gradient Methods

For what problems can we apply these methods?

We can efficiently compute the proximity operator for:
1. Lower and upper bounds.
2. A small number of linear constraints.
3. Probability constraints.
4. L1-regularization.
5. Group L1-regularization.
6. A few other simple regularizers/constraints.
(Two of these proximity operators are sketched below.)

We can solve huge instances of these constrained/non-smooth problems.

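Two of the items on the list above have closed-form proximity operators that are easy to state; the sketches below (bound constraints and group L1) use standard formulas and made-up example data, and are not code from the course.

    import numpy as np

    def prox_bounds(z, lower, upper):
        """Prox of the indicator of {lower <= x <= upper}: clip each coordinate into the box."""
        return np.clip(z, lower, upper)

    def prox_group_l1(z, t, groups):
        """Prox of t * sum_g ||x_g||_2: shrink each group's norm by t (block soft-threshold)."""
        x = z.copy()
        for g in groups:                       # each g is a list of coordinate indices
            norm = np.linalg.norm(z[g])
            x[g] = 0.0 if norm <= t else (1.0 - t / norm) * z[g]
        return x

    z = np.array([3.0, -0.5, 0.2, 2.0])
    print(prox_bounds(z, 0.0, 1.0))                    # [1.  0.  0.2 1. ]
    print(prox_group_l1(z, 1.0, [[0, 1], [2, 3]]))     # shrinks each pair towards zero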

Page 95: Optimization for Machine Learning - Computer Science at UBC

Stochastic Gradient Methods

Outline

1. Machine Learning
2. Convergence Rates of First-Order Algorithms
3. Proximal-Gradient Methods
4. Stochastic Gradient Methods
   - Motivation: Big-M Problems
   - Notation and Algorithm
   - Convergence Rate

Page 96: Optimization for Machine Learning - Computer Science at UBC

Stochastic Gradient Methods Motivation: Big-M Problems

Big-N Problems

Consider the problem of minimizing a finite sum,

minx

1

m

m∑

i=1

fi (x),

where m is very large.

This could be a big least squares problem, or another ML model.

Examples:

Each i is a Facebook user.Each i is a product on Amazon.Each i is a webpage on the internet.

We can’t afford to go through all m examples many times.

One way to deal with this restriction is stochastic gradient methods.
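For instance, in the big least-squares case mentioned above, each term could be (an illustrative instantiation, with (a_i, b_i) denoting the i-th training example):

f_i(x) = \tfrac{1}{2}(a_i^\top x - b_i)^2,
\qquad
\frac{1}{m}\sum_{i=1}^m f_i(x) = \frac{1}{2m}\|Ax - b\|^2.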


Stochastic Gradient Methods Notation and Algorithm

Stochastic Gradient Methods

Stochastic gradient methods consider minimizing expectations,

min_x E[f(x)].

They assume we can generate a random vector pk whose expectation is the gradient,

E[pk] = ∇f(xk),

and take a gradient step using this direction,

xk+1 = xk − αk pk.

For convergence, we usually require the step sizes αk to converge to 0, e.g., the Robbins-Monro conditions,

∑_{k=1}^∞ αk = ∞,    ∑_{k=1}^∞ αk² < ∞,

which suggest using αk = γ/k for some constant γ.
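As a quick check (a worked step, not from the slides), αk = γ/k satisfies both conditions because the harmonic series diverges while its squared version converges:

\sum_{k=1}^{\infty} \frac{\gamma}{k} = \infty,
\qquad
\sum_{k=1}^{\infty} \frac{\gamma^2}{k^2} = \frac{\gamma^2 \pi^2}{6} < \infty.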


Stochastic Gradient Methods Notation and Algorithm

Gradient Method vs. Stochastic Gradient Method

Both panels show the same setup (from the "Stochastic vs. deterministic methods" figure):

Minimizing g(θ) = (1/n) ∑_{i=1}^n fi(θ) with fi(θ) = ℓ(yi, θᵀΦ(xi)) + µΩ(θ).

Gradient (batch) method:  θt = θt−1 − γt g′(θt−1) = θt−1 − (γt/n) ∑_{i=1}^n f′i(θt−1).

Stochastic gradient method:  θt = θt−1 − γt f′_{i(t)}(θt−1), for a single randomly selected index i(t).
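As an illustration of the per-step cost difference (a sketch with made-up data; n, d, Phi, y, the squared loss, and the L2 regularizer weight mu are all illustrative assumptions), the batch update touches all n examples while the stochastic update touches one:

import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 50
Phi = rng.standard_normal((n, d))   # rows are the feature vectors Phi(x_i)
y = Phi @ rng.standard_normal(d)
mu = 1e-3                           # weight of the L2 regularizer Omega(theta) = 0.5*||theta||^2

def grad_full(theta):
    # g'(theta) for the squared loss: (1/n) * Phi^T (Phi theta - y) + mu * theta  -- O(n*d) work
    return Phi.T @ (Phi @ theta - y) / n + mu * theta

def grad_single(theta, i):
    # f_i'(theta) for a single example -- O(d) work
    return (Phi[i] @ theta - y[i]) * Phi[i] + mu * theta

theta = np.zeros(d)
theta = theta - 0.1 * grad_full(theta)                      # one batch gradient step
theta = theta - 0.1 * grad_single(theta, rng.integers(n))   # one stochastic gradient step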


Stochastic Gradient Methods Notation and Algorithm

Application to Finite Sums

Returning to the problem of minimizing a finite sum,

min_x (1/m) ∑_{i=1}^m fi(x).

Set pk to the gradient of a randomly (uniformly) chosen function fik, so that

E[pk] = Ei[∇fi(xk)] = ∑_{i=1}^m p(i) ∇fi(xk) = (1/m) ∑_{i=1}^m ∇fi(xk).

This gives us the stochastic gradient algorithm

xk+1 = xk − αk∇fik (xk).

The iteration cost is independent of m.
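To make the iteration concrete, here is a minimal Python sketch of the stochastic gradient algorithm above on a synthetic least-squares problem, with the αk = γ/k schedule from the earlier slide; the problem sizes, the constant gamma, and the data are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
m, p = 100_000, 20                       # many examples, few variables
A = rng.standard_normal((m, p))
b = A @ rng.standard_normal(p) + 0.1 * rng.standard_normal(m)

def grad_fi(x, i):
    # Gradient of f_i(x) = 0.5 * (a_i^T x - b_i)^2 for the i-th example.
    return (A[i] @ x - b[i]) * A[i]

x = np.zeros(p)
gamma = 0.01
for k in range(1, 50_001):
    i = rng.integers(m)                  # pick one example uniformly at random
    x = x - (gamma / k) * grad_fi(x, i)  # per-iteration cost is O(p), independent of m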


Stochastic Gradient Methods Convergence Rate

Convergence Rate of Stochastic Gradient

Stochastic gradient has much faster iterations.

But how many iterations are required?

If we set αk = 1/(λn k), we have that

E[f (xk)− f (x∗)] = O(1/k).

This is a sublinear rate.

Often works badly in practice:

Initial αk might be huge.
Later αk might be tiny.

Nesterov/Newton-like variations can only improve the constant (even in low dimensions).


Stochastic Gradient Methods Convergence Rate

Comparison of Gradient and Stochastic Gradient

Goal = best of both worlds: linear rate with O(1) iteration cost.

[Figure: "Stochastic vs. deterministic methods" — schematic of log(excess cost) versus time, with one curve for the stochastic method and one for the deterministic method.]


Stochastic Gradient Methods Convergence Rate

Improving Stochastic Gradient

How can we improve this algorithm?

1 Averaging: If you use a bigger step size, αk = γ/√k, then the average of the iterates, (1/k) ∑_{i=1}^k xi, has nearly-optimal constants (see the averaging sketch below).

2 Constant step sizes: If you use a constant step size αk = α, you can show

E[f(xk) − f(x∗)] = (1 − 2αλn)^k [f(x0) − f(x∗)] + O(α),

which shows rapid initial progress but non-convergence.

3 Use special problem structures: For certain problems, you can show faster rates.

Since 2012, there has been a large focus on better algorithms for the finite-sum structure.
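A minimal sketch of the averaging idea (iterate averaging with αk = γ/√k), repeating the synthetic least-squares setup from the earlier SGD sketch so the snippet is self-contained; gamma and the data remain illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
m, p = 100_000, 20
A = rng.standard_normal((m, p))
b = A @ rng.standard_normal(p) + 0.1 * rng.standard_normal(m)

def grad_fi(x, i):
    # Gradient of f_i(x) = 0.5 * (a_i^T x - b_i)^2 for the i-th example.
    return (A[i] @ x - b[i]) * A[i]

x = np.zeros(p)
x_avg = np.zeros(p)
gamma = 0.01
for k in range(1, 50_001):
    i = rng.integers(m)
    x = x - (gamma / np.sqrt(k)) * grad_fi(x, i)  # larger step size gamma / sqrt(k)
    x_avg += (x - x_avg) / k                      # running average (1/k) * sum of the iterates

# x_avg is the averaged iterate with nearly-optimal constants.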


Stochastic Gradient Methods Convergence Rate

Stochastic Average Gradient

The stochastic average gradient (SAG) algorithm uses,

xk+1 = xk − (αk/m) ∑_{i=1}^m y_i^k,

where at each iteration a random ∇fi(xk) is evaluated and y_i^k is the most recent evaluation of ∇fi.

With αk = 1/(16Λ1), the SAG algorithm has a linear rate,

E[f(xk)] − f(x∗) ≤ (1 − min{1/(8m), λn/(16Λ1)})^k [f(x0) − f(x∗)],

where Λ1 bounds the eigenvalues of each ∇²fi(x).

Iteration cost independent of m, rate similar to gradient method.
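A minimal Python sketch of the SAG update above, again on the synthetic least-squares setup used in the earlier sketches; the constant step size alpha is an illustrative stand-in for 1/(16Λ1).

import numpy as np

rng = np.random.default_rng(0)
m, p = 100_000, 20
A = rng.standard_normal((m, p))
b = A @ rng.standard_normal(p) + 0.1 * rng.standard_normal(m)

def grad_fi(x, i):
    # Gradient of f_i(x) = 0.5 * (a_i^T x - b_i)^2 for the i-th example.
    return (A[i] @ x - b[i]) * A[i]

x = np.zeros(p)
y = np.zeros((m, p))         # y[i] stores the most recent gradient of f_i
g_sum = np.zeros(p)          # maintains sum_i y[i], so each step is O(p) rather than O(m*p)
alpha = 1e-3

for k in range(200_000):
    i = rng.integers(m)
    g_new = grad_fi(x, i)
    g_sum += g_new - y[i]    # refresh the running sum with the new gradient of f_i
    y[i] = g_new
    x = x - (alpha / m) * g_sum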


Stochastic Gradient Methods Convergence Rate

Comparing FG and SG Methods

quantum (n = 50000, p = 78) and rcv1 (n = 697641, p = 47236)

[Figure: objective minus optimum (log scale) versus effective passes through the data, comparing AFG, L-BFGS, SG, ASG, and IAG on the two datasets.]


Stochastic Gradient Methods Convergence Rate

SAG Compared to FG and SG Methods

quantum (n = 50000, p = 78) and rcv1 (n = 697641, p = 47236)

[Figure: objective minus optimum (log scale) versus effective passes through the data, with SAG-LS added to the comparison of AFG, L-BFGS, SG, ASG, and IAG on the same two datasets.]


Stochastic Gradient Methods Convergence Rate

Summary

Part 1: Numerical optimization is at the core of many modern machine learning applications.

Part 2: Gradient-based methods allow elegant scaling with dimensionality for smooth problems.

Part 3: Proximal-gradient methods allow the same scaling for many non-smooth problems.

Part 4: Stochastic gradient methods allow scaling to a huge number of data samples.
