Stochastic Gradient Methods · Optimization for Data Science


TRANSCRIPT

1

Optimization for Data Science
Stochastic Gradient Methods

Lecturer: Robert M. Gower & Alexandre Gramfort

Tutorials: Quentin Bertrand, Nidham Gazagnadou

Master 2 Data Science, Institut Polytechnique de Paris (IPP)

2

Solving the Finite Sum Training Problem

3

General methods: Recap

● Gradient Descent
● Proximal gradient (ISTA)
● Fast proximal gradient (FISTA)

Training Problem L(w): two parts

4

A Datum Function

Finite Sum Training Problem

Optimization of a sum of terms

Can we use this sum structure?
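The equations on this slide did not survive extraction; here is a hedged reconstruction of the standard finite-sum setup (notation assumed, not copied from the slides):

\min_{w \in \mathbb{R}^d} f(w), \qquad f(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w),

where each datum function f_i(w) is the (regularized) loss of the model w on the i-th data point (x_i, y_i).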

5-6

The Training Problem

7-10

Stochastic Gradient Descent

Is it possible to design a method that uses only the gradient of a single data function at each iteration?

Unbiased Estimate: Let j be an index sampled uniformly at random from {1, …, n}. Then the stochastic gradient of f_j is an unbiased estimate of the full gradient (see the sketch below).

EXE:
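A hedged reconstruction of the unbiasedness identity the slide states (standard SGD notation assumed):

\mathbb{E}_j\big[\nabla f_j(w)\big] = \sum_{i=1}^{n} \frac{1}{n}\,\nabla f_i(w) = \nabla f(w).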

11-12

Stochastic Gradient Descent

(figure: SGD iterates converging towards the optimal point)
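The update rule shown on these slides did not extract. Below is a minimal Python sketch of vanilla SGD for a finite sum, assuming the standard update w ← w − α ∇f_j(w); the function names and signature are illustrative, not the lecture's code.

import numpy as np

def sgd(grad_fi, n, w0, alpha, n_iters, rng=None):
    """Vanilla SGD: at each step sample j uniformly and step along -grad f_j(w)."""
    rng = np.random.default_rng() if rng is None else rng
    w = w0.copy()
    for _ in range(n_iters):
        j = rng.integers(n)              # pick a single data index uniformly at random
        w = w - alpha * grad_fi(j, w)    # constant-stepsize stochastic gradient step
    return w

For example, for least squares with data matrix X and targets y, one would pass grad_fi = lambda j, w: (X[j] @ w - y[j]) * X[j].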

13-16

Assumptions for Convergence

● Strong Convexity
● Expected Bounded Stochastic Gradients
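The formal statements on these slides were lost; hedged standard definitions (constants μ > 0 and B assumed, matching the labels above):

f(w) \ge f(w') + \langle \nabla f(w'),\, w - w' \rangle + \frac{\mu}{2}\|w - w'\|^2 \quad \text{for all } w, w' \qquad (\mu\text{-strong convexity})

\mathbb{E}_j\big[\|\nabla f_j(w^t)\|^2\big] \le B^2 \quad \text{for all iterates } w^t \qquad (\text{expected bounded stochastic gradients})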

17-19

Complexity / Convergence

Theorem (constant stepsize; a hedged statement is sketched below).

EXE: Do the exercises on convergence of random sequences.
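The theorem statement itself did not extract. Under the two assumptions above, the standard constant-stepsize result (plausibly what the slide shows; constants may differ) is: for 0 < α ≤ 1/(2μ),

\mathbb{E}\big[\|w^{t} - w^*\|^2\big] \;\le\; (1 - 2\alpha\mu)^{t}\,\|w^{0} - w^*\|^2 \;+\; \frac{\alpha B^2}{2\mu},

i.e. linear convergence up to a noise floor proportional to the stepsize.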

20

Proof sketch: expand the SGD step, take expectation with respect to j (using the unbiased estimator), then take total expectation; bound the resulting terms using strong convexity and the bounded stochastic gradient assumption.
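A hedged reconstruction of the one-step recursion behind those labels (standard argument with the constants above):

\mathbb{E}_j\|w^{t+1} - w^*\|^2 = \|w^t - w^*\|^2 - 2\alpha\,\langle \nabla f(w^t),\, w^t - w^* \rangle + \alpha^2\,\mathbb{E}_j\|\nabla f_j(w^t)\|^2 \;\le\; (1 - 2\alpha\mu)\,\|w^t - w^*\|^2 + \alpha^2 B^2,

where the inner-product term is bounded via strong convexity (⟨∇f(w), w − w*⟩ ≥ μ‖w − w*‖²). Taking total expectation and unrolling the recursion gives the theorem.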

21-24

Stochastic Gradient Descent (plots of the iterates for constant stepsizes α = 0.01, 0.1, 0.2, 0.5)

25-28

Assumptions for Convergence

● Strong Convexity
● Expected Bounded Stochastic Gradients

EXE:

29

EXE:

Proof: taking expectation …

30

Realistic Assumptions for Convergence

● f is strongly quasi-convex
● Each fi is convex and Li-smooth

Definition: Gradient Noise
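The formal definitions on this slide were lost; hedged standard versions (in the notation of the Gower et al. (2019) paper cited on slide 40):

f(w^*) \ge f(w) + \langle \nabla f(w),\, w^* - w \rangle + \frac{\mu}{2}\|w^* - w\|^2 \qquad (\mu\text{-strong quasi-convexity, required only at } w^*)

\|\nabla f_i(w) - \nabla f_i(w')\| \le L_i\,\|w - w'\| \qquad (L_i\text{-smoothness of } f_i)

\sigma^2 := \mathbb{E}_j\big[\|\nabla f_j(w^*)\|^2\big] = \frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(w^*)\|^2 \qquad (\text{gradient noise at the optimum; assumed definition})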

31-37

Assumptions for Convergence

EXE: Calculate the Li's and … for …

HINT: A twice-differentiable fi is Li-smooth if and only if ∇²fi(w) ⪯ Li·I for all w, i.e. every eigenvalue of the Hessian is at most Li.

38-39

Relationship between smoothness constants

EXE: show that …

Proof: From the Hessian definition of smoothness, … Furthermore, … The final result now follows by taking the max over w, then the max over i.
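A hedged reconstruction of the elided relationship, consistent with the proof outline above (L is the smoothness constant of f, Lmax = max_i Li):

\nabla^2 f(w) = \frac{1}{n}\sum_{i=1}^{n}\nabla^2 f_i(w) \preceq \frac{1}{n}\sum_{i=1}^{n} L_i\, I \preceq L_{\max}\, I \quad\Rightarrow\quad L \;\le\; \frac{1}{n}\sum_{i=1}^{n} L_i \;\le\; L_{\max}.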

40

Complexity / Convergence

Theorem.

EXE: The steps of the proof are given in the SGD_proof exercise list for homework!

R. M. Gower, N. Loizou, X. Qian, A. Sailanbayev, E. Shulgin, P. Richtárik (2019). SGD: General Analysis and Improved Rates. ICML 2019.
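The theorem body did not extract. The rate from the cited paper, specialized to uniform single-element sampling (plausibly what the slide states, up to constants): for convex Li-smooth fi, μ-strongly quasi-convex f, and stepsize α ≤ 1/(2 L_max),

\mathbb{E}\big[\|w^{t} - w^*\|^2\big] \;\le\; (1 - \alpha\mu)^{t}\,\|w^{0} - w^*\|^2 \;+\; \frac{2\alpha\sigma^2}{\mu},

with σ² the gradient noise defined on slide 30.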

41-43

Stochastic Gradient Descent α = 0.5 (plots)

1) Start with big steps and end with smaller steps
2) Try averaging the points

44-45

SGD with Shrinking Stepsize

How should we sample j? Does this converge?
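The stepsize schedule on these slides did not extract; a minimal Python sketch assuming a 1/t-type schedule (the exact schedule used in the lecture may differ):

import numpy as np

def sgd_shrinking(grad_fi, n, w0, alpha0, n_iters, rng=None):
    """SGD with a shrinking stepsize alpha_t = alpha0 / (t + 1)."""
    rng = np.random.default_rng() if rng is None else rng
    w = w0.copy()
    for t in range(n_iters):
        j = rng.integers(n)              # uniform sampling of one data index
        alpha_t = alpha0 / (t + 1)       # assumed schedule, decaying like 1/t
        w = w - alpha_t * grad_fi(j, w)
    return w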

46-47

SGD with shrinking stepsize compared with Gradient Descent (plots: Gradient Descent vs SGD)

49-51

Complexity / Convergence

Theorem for shrinking stepsizes.
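This theorem's statement also did not extract. The standard flavour of such results (hedged; the slide's exact schedule and constants may differ) is that with a stepsize decaying like α_t ∝ 1/t under strong (quasi-)convexity, the noise floor disappears and

\mathbb{E}\big[\|w^{t} - w^*\|^2\big] = O(1/t),

i.e. sublinear convergence to the exact minimizer rather than linear convergence to a neighbourhood.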

52

Stochastic Gradient Descent compared with Gradient Descent (plot: Gradient Descent vs SGD)

Noisy iterates. Take averages?

53-54

SGD with (late start) averaging

This is not efficient. How to make this efficient?

B. T. Polyak and A. B. Juditsky (1992). Acceleration of Stochastic Approximation by Averaging. SIAM Journal on Control and Optimization.

55-56

Stochastic Gradient Descent with and without averaging (plots)

Starts slow, but can reach higher accuracy. Only use averaging towards the end?

57

Stochastic Gradient Descent, averaging only the last few iterates (plot annotation: "Averaging starts here")
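A hedged sketch of late-start (Polyak-Ruppert style) averaging done efficiently with an O(d) running mean instead of storing all iterates; the start index t0 and the names are assumptions for illustration:

import numpy as np

def sgd_late_averaging(grad_fi, n, w0, alpha, n_iters, t0, rng=None):
    """SGD that returns the running average of the iterates from step t0 onwards."""
    rng = np.random.default_rng() if rng is None else rng
    w = w0.copy()
    w_avg = w0.copy()
    k = 0                                  # number of iterates included in the average
    for t in range(n_iters):
        j = rng.integers(n)
        w = w - alpha * grad_fi(j, w)
        if t >= t0:                        # only start averaging towards the end
            k += 1
            w_avg += (w - w_avg) / k       # O(d) running-mean update
    return w_avg if k > 0 else w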

58-62

Comparison of GD and SGD for strongly convex problems (table: SGD vs GD)

Iteration complexity
Cost of an iteration
Total complexity*

What happens if n is big? What happens if … is small?
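The table entries did not extract. A hedged reconstruction using the standard rates for a μ-strongly convex, L-smooth finite sum over n points in d dimensions (target accuracy ε; constants and log factors may differ from the slide):

                       GD                                  SGD
Iteration complexity   (L/μ) log(1/ε)                      O(1/ε) (sublinear; constants depend on μ and the gradient noise)
Cost of an iteration   O(n d)                              O(d)
Total complexity*      O(n d (L/μ) log(1/ε))               O(d / ε)

When n is big, GD's O(n d) per-iteration cost is the bottleneck and SGD wins at low to moderate accuracy; when the required accuracy ε is very small, GD's log(1/ε) dependence eventually wins.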

68-71

Comparison SGD vs GD (plot: error versus time, with curves for SGD, GD, and the Stochastic Average Gradient (SAG))

M. Schmidt, N. Le Roux, F. Bach (2016). Minimizing Finite Sums with the Stochastic Average Gradient. Mathematical Programming.

72

Practical SGD for Sparse Data

73-75

Finite Sum Training Problem

Lazy SGD updates for Sparse Data: L2 regularizer + linear hypothesis.

Assume each data point xi is s-sparse. How many operations does each SGD step cost?

Rescaling: O(d)
Addition of a sparse vector: O(s)
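The model on these slides did not extract; a hedged reconstruction for an L2-regularized linear hypothesis with loss ℓ (assumed notation) makes the two costs above explicit:

w^{t+1} = w^t - \alpha\big(\ell'(\langle x_j, w^t\rangle, y_j)\, x_j + \lambda w^t\big) = (1 - \alpha\lambda)\, w^t - \alpha\, \ell'(\langle x_j, w^t\rangle, y_j)\, x_j,

i.e. a dense rescaling of w (naively O(d)) plus the addition of a multiple of the s-sparse vector x_j (O(s)).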

76-79

SGD step: Lazy SGD updates for Sparse Data

O(1) scaling + O(s) sparse add = O(s) update

EXE:
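A minimal Python sketch of the lazy-update trick for the L2-regularized linear model above: store w as scale · v, so the dense rescaling becomes an O(1) update of the scalar and only the s nonzero coordinates of x_j are touched. The data layout (CSR matrix) and names are illustrative assumptions, not the lecture's code; it also assumes α·λ < 1 so the scale stays positive.

import numpy as np

def lazy_sparse_sgd(X_csr, y, loss_grad, lam, alpha, n_iters, rng=None):
    """SGD for sum_i loss(<x_i, w>, y_i) + (lam/2)||w||^2 with O(s) per-step cost."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X_csr.shape
    v = np.zeros(d)      # w is represented implicitly as w = scale * v
    scale = 1.0
    for _ in range(n_iters):
        j = rng.integers(n)
        row = X_csr.getrow(j)
        idx, vals = row.indices, row.data             # the s nonzeros of x_j
        pred = scale * (v[idx] @ vals)                # <w, x_j> in O(s)
        g = loss_grad(pred, y[j])                     # scalar loss derivative
        scale *= (1.0 - alpha * lam)                  # O(1): lazy rescaling for the L2 term
        v[idx] -= (alpha * g / scale) * vals          # O(s): sparse correction
    return scale * v

For squared loss one would pass loss_grad = lambda p, t: p - t.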

80

Why Machine Learners Like SGD

81-82

Why Machine Learners like SGD

The statistical learning problem: minimize the expected loss over an unknown distribution.

We want to solve the expected-loss problem, though in practice we solve the finite-sum training problem.

SGD can solve the statistical learning problem!
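A hedged reconstruction of the two problems being contrasted (distribution D, hypothesis h_w, and loss ℓ are assumed notation):

\text{want:}\;\; \min_{w} \; \mathbb{E}_{(x,y)\sim \mathcal{D}}\big[\ell(h_w(x), y)\big] \qquad\qquad \text{solve:}\;\; \min_{w} \; \frac{1}{n}\sum_{i=1}^{n} \ell(h_w(x_i), y_i).

Because a freshly drawn sample (x, y) yields an unbiased stochastic gradient of the expected loss, SGD run over i.i.d. samples optimizes the expected loss directly, which is why machine learners view (single-pass) SGD as solving the statistical learning problem.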

83

Exercise List time! Please solve:

stoch_ridge_reg_exe
SGD_proof_exe

84

Appendix
