
Page 1:

Optimization for Data Science: Stochastic Gradient Methods

Lecturer: Robert M. Gower & Alexandre Gramfort
Tutorials: Quentin Bertrand, Nidham Gazagnadou
Master 2 Data Science, Institut Polytechnique de Paris (IPP)

Page 2: Solving the Finite Sum Training Problem

Page 3: Recap: General Methods

Training problem: minimize L(w), an objective with two parts (a smooth term plus a regularizer).

● Gradient descent
● Proximal gradient (ISTA)
● Fast proximal gradient (FISTA)

Page 4: The Finite Sum Training Problem

Each data point contributes a datum function f_i, and the training objective is a sum of these terms:

L(w) = (1/n) Σ_{i=1}^n f_i(w)

Can we use this sum structure?
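To make the sum structure concrete, here is a minimal Python sketch of a finite-sum loss. The regularized least-squares datum function is an assumption chosen for illustration (the slides keep f_i generic), and all names here are hypothetical.

```python
import numpy as np

def f_i(w, x_i, y_i, lam):
    """One datum function: assumed regularized least-squares loss
    f_i(w) = 0.5*(x_i^T w - y_i)^2 + (lam/2)*||w||^2."""
    return 0.5 * (x_i @ w - y_i) ** 2 + 0.5 * lam * (w @ w)

def L(w, X, y, lam):
    """Full training loss: the average of the n datum functions."""
    n = X.shape[0]
    return sum(f_i(w, X[i], y[i], lam) for i in range(n)) / n
```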

Page 5: The Training Problem

Page 6: The Training Problem (continued)

Page 7: Stochastic Gradient Descent

Is it possible to design a method that uses only the gradient of a single data function at each iteration?

Page 8: Stochastic Gradient Descent

Is it possible to design a method that uses only the gradient of a single data function at each iteration?

Unbiased estimate: let j be an index sampled uniformly at random from {1, …, n}. Then

E_j[∇f_j(w)] = (1/n) Σ_{i=1}^n ∇f_i(w) = ∇L(w).

Page 9: Stochastic Gradient Descent (unbiased estimate, continued)

Page 10: Stochastic Gradient Descent (unbiased estimate, continued)

EXE:

Page 11: Stochastic Gradient Descent

Page 12: Stochastic Gradient Descent

[Plot: SGD iterates and the optimal point]
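A minimal sketch of the SGD iteration these plots illustrate: sample one index j uniformly and step along the negative gradient of that single datum function. The least-squares gradient and all names are assumptions carried over from the sketch above.

```python
import numpy as np

def sgd(X, y, lam, alpha, n_iters, rng=None):
    """Plain SGD with constant stepsize alpha: w <- w - alpha * grad f_j(w),
    with j drawn uniformly from {0, ..., n-1} at every iteration."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        j = rng.integers(n)                          # uniform random index
        g = (X[j] @ w - y[j]) * X[j] + lam * w       # grad of assumed ridge datum loss
        w -= alpha * g                               # unbiased gradient step
    return w
```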

Page 13: Assumptions for Convergence

Strong Convexity

Page 14: Assumptions for Convergence (continued)

Page 15: Assumptions for Convergence

Expected Bounded Stochastic Gradients
Strong Convexity

Page 16: Assumptions for Convergence (continued)
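The two definitions on these slides are images; their standard forms (an assumption that the slides use the usual statements) are:

$$\text{Strong convexity:}\quad L(v) \ge L(w) + \langle \nabla L(w),\, v - w\rangle + \tfrac{\mu}{2}\|v - w\|^2 \quad \text{for all } v, w,$$

$$\text{Expected bounded stochastic gradients:}\quad \mathbb{E}_j\big[\|\nabla f_j(w)\|^2\big] \le B^2 \quad \text{for all } w.$$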

Page 17: Complexity / Convergence

Theorem

EXE: Do exercises on convergence of random sequences.

Page 18: Complexity / Convergence (continued)

Page 19: Complexity / Convergence (continued)
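The theorem itself is an image; a standard statement of this type for constant-stepsize SGD under the two assumptions above (the slide's exact constants may differ) is: for $0 < \alpha < \tfrac{1}{2\mu}$,

$$\mathbb{E}\big[\|w^t - w^*\|^2\big] \le (1 - 2\alpha\mu)^t\,\|w^0 - w^*\|^2 + \frac{\alpha B^2}{2\mu}.$$

That is, SGD contracts linearly to a noise ball whose radius is proportional to the stepsize: a smaller α gives a smaller floor but a slower contraction.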

Page 20: Proof

The proof chains the following steps: expand the squared distance after one SGD step, take the expectation with respect to j (using the unbiased estimator), apply strong convexity, bound the stochastic gradient, then take the total expectation.
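Filling in the algebra behind those steps (a standard derivation under the stated assumptions): one SGD step gives

$$\|w^{t+1} - w^*\|^2 = \|w^t - w^*\|^2 - 2\alpha\,\langle \nabla f_j(w^t),\, w^t - w^*\rangle + \alpha^2\,\|\nabla f_j(w^t)\|^2.$$

Taking the expectation with respect to j and using the unbiased estimator, then strong convexity in the form $\langle \nabla L(w),\, w - w^*\rangle \ge \mu\,\|w - w^*\|^2$ (since $\nabla L(w^*) = 0$) and the bounded stochastic gradients, yields

$$\mathbb{E}_j\big[\|w^{t+1} - w^*\|^2\big] \le (1 - 2\alpha\mu)\,\|w^t - w^*\|^2 + \alpha^2 B^2.$$

Taking the total expectation and unrolling this recursion gives the theorem.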

Page 21: Stochastic Gradient Descent, α = 0.01

Page 22: Stochastic Gradient Descent, α = 0.1

Page 23: Stochastic Gradient Descent, α = 0.2

Page 24: Stochastic Gradient Descent, α = 0.5

Page 25: Assumptions for Convergence

Expected Bounded Stochastic Gradients
Strong Convexity

Page 26: Assumptions for Convergence (continued)

Page 27: Assumptions for Convergence (continued)

Page 28: Assumptions for Convergence (continued)

EXE:

Page 29:

EXE:

Proof: taking expectation.

Page 30: Realistic Assumptions for Convergence

● Strong quasi-convexity
● Each f_i is convex and L_i-smooth

Definition: Gradient Noise
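Standard forms of these three definitions (assumed to match the slides; they are the ones used in the Gower et al. 2019 reference cited on page 40):

$$\text{Strong quasi-convexity:}\quad L(w^*) \ge L(w) + \langle \nabla L(w),\, w^* - w\rangle + \tfrac{\mu}{2}\|w^* - w\|^2 \quad \text{for all } w,$$

$$L_i\text{-smoothness:}\quad \|\nabla f_i(w) - \nabla f_i(v)\| \le L_i\,\|w - v\| \quad \text{for all } w, v,$$

$$\text{Gradient noise:}\quad \sigma^2 := \frac{1}{n}\sum_{i=1}^n \|\nabla f_i(w^*)\|^2.$$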

Page 31: Assumptions for Convergence

EXE: Calculate the L_i's and … for the given datum functions.

HINT: A twice differentiable f_i is L_i-smooth if and only if ∇²f_i(w) ⪯ L_i I for all w.

Page 32: Assumptions for Convergence (continued)

Page 33: Assumptions for Convergence (continued)

Page 34: Assumptions for Convergence (continued)
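Assuming the exercise's datum function is the ridge-regression loss (consistent with the stoch_ridge_reg_exe sheet on page 77), the hint gives the $L_i$'s directly:

$$f_i(w) = \tfrac{1}{2}(x_i^\top w - y_i)^2 + \tfrac{\lambda}{2}\|w\|^2 \quad\Longrightarrow\quad \nabla^2 f_i(w) = x_i x_i^\top + \lambda I \preceq \big(\|x_i\|^2 + \lambda\big)\, I,$$

so $L_i = \|x_i\|^2 + \lambda$, since $\|x_i\|^2$ is the largest eigenvalue of the rank-one matrix $x_i x_i^\top$.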

Page 35: Assumptions for Convergence (continued)

Page 36: Assumptions for Convergence (continued)

Page 37: Assumptions for Convergence (continued)

Page 38: Relationship Between Smoothness Constants

EXE: show that L ≤ (1/n) Σ_i L_i ≤ max_i L_i =: L_max.

Page 39: Relationship Between Smoothness Constants

EXE: show that L ≤ (1/n) Σ_i L_i ≤ L_max.

Proof: From the Hessian definition of smoothness, ∇²L(w) = (1/n) Σ_i ∇²f_i(w) ⪯ ((1/n) Σ_i L_i) I. Furthermore, (1/n) Σ_i L_i ≤ max_i L_i. The final result now follows by taking the max over w, then the max over i.

Page 40: Complexity / Convergence

Theorem.

EXE: The steps of the proof are given in the SGD_proof exercise list for homework!

R. M. Gower, N. Loizou, X. Qian, A. Sailanbayev, E. Shulgin, P. Richtárik (2019). SGD: General Analysis and Improved Rates. ICML 2019.
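The theorem on this slide is an image; the headline result of the cited paper, which it plausibly matches, is: if each $f_i$ is convex and $L_{\max}$-smooth, $L$ is $\mu$-strongly quasi-convex, and $\alpha \le \tfrac{1}{2 L_{\max}}$, then

$$\mathbb{E}\big[\|w^t - w^*\|^2\big] \le (1 - \alpha\mu)^t\,\|w^0 - w^*\|^2 + \frac{2\alpha\sigma^2}{\mu}.$$

This replaces the restrictive bounded-gradient assumption with smoothness plus the gradient noise $\sigma^2$ at the optimum.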

Page 41: Stochastic Gradient Descent, α = 0.5

Page 42: Stochastic Gradient Descent, α = 0.5

1) Start with big steps and end with smaller steps

Page 43: Stochastic Gradient Descent, α = 0.5

1) Start with big steps and end with smaller steps
2) Try averaging the points

Page 44: SGD with Shrinking Stepsize

Page 45: SGD with Shrinking Stepsize

How should we sample j? Does this converge?
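A sketch of the shrinking-stepsize variant; the α_t = α₀/(1+t) schedule is an assumption (the slides only say the stepsize shrinks), and the ridge datum gradient is carried over from the earlier sketches.

```python
import numpy as np

def sgd_shrinking(X, y, lam, alpha0, n_iters, rng=None):
    """SGD with a decaying stepsize alpha_t = alpha0 / (1 + t):
    big steps early, small steps late, which removes the noise floor."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(n_iters):
        j = rng.integers(n)                          # sample j uniformly
        g = (X[j] @ w - y[j]) * X[j] + lam * w       # assumed ridge datum gradient
        w -= (alpha0 / (1 + t)) * g                  # shrinking step
    return w
```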

Page 46: SGD with Shrinking Stepsize Compared with Gradient Descent

[Plot: Gradient Descent vs SGD with stepsize 1.0]

Page 47: SGD with Shrinking Stepsize Compared with Gradient Descent (continued)

Page 48: Complexity / Convergence

Theorem for shrinking stepsizes

Page 49: Complexity / Convergence (continued)

Page 50: Complexity / Convergence (continued)
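The statements are images; results of this type (constants assumed) say that a suitably shrinking schedule, e.g. $\alpha_t = \Theta(1/t)$ after a warm-up phase, removes the noise floor and yields a sublinear rate such as

$$\mathbb{E}\big[\|w^t - w^*\|^2\big] = O\!\left(\frac{\sigma^2}{\mu^2\, t}\right),$$

in contrast to constant-stepsize SGD, which converges linearly but only to a neighborhood of $w^*$.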

Page 51: Stochastic Gradient Descent Compared with Gradient Descent

[Plot: Gradient Descent vs SGD iterates]

Noisy iterates. Take averages?

Page 52: SGD with (Late Start) Averaging

B. T. Polyak and A. B. Juditsky (1992). Acceleration of Stochastic Approximation by Averaging. SIAM Journal on Control and Optimization.

Page 53: SGD with (Late Start) Averaging (continued)

This is not efficient. How can we make it efficient?
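One answer to the efficiency question: the average does not need to be recomputed from stored iterates; it can be maintained as a running mean in O(d) per step. A sketch with late-start averaging (the loss and all names are assumptions, as above):

```python
import numpy as np

def sgd_late_averaging(X, y, lam, alpha, n_iters, avg_start, rng=None):
    """SGD returning the average of the iterates from step avg_start onwards,
    maintained incrementally: O(d) extra work per step, no stored history."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    w = np.zeros(d)
    w_avg, k = np.zeros(d), 0
    for t in range(n_iters):
        j = rng.integers(n)
        g = (X[j] @ w - y[j]) * X[j] + lam * w       # assumed ridge datum gradient
        w -= alpha * g
        if t >= avg_start:                           # late start: skip noisy early phase
            k += 1
            w_avg += (w - w_avg) / k                 # incremental running mean
    return w_avg if k > 0 else w
```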

Page 54: Stochastic Gradient Descent, With and Without Averaging

Averaging starts slow, but can reach higher accuracy.

Page 55: Stochastic Gradient Descent, With and Without Averaging (continued)

Only use averaging towards the end?

Page 56: Stochastic Gradient Descent, Averaging the Last Few Iterates

[Plot: averaging starts partway through the run]

Page 57: Comparison of GD and SGD for Strongly Convex Problems

Iteration complexity

Page 58: Comparison of GD and SGD (continued)

Cost of an iteration

Page 59: Comparison of GD and SGD (continued)

Total complexity*

Page 60: Comparison of GD and SGD (continued)

Page 61: Comparison of GD and SGD (continued)

What happens if n is big? What happens if … is small?
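The table entries are images; the standard strongly convex rates of this flavor (an assumption, up to constants and log factors) fill it in as:

                          GD                          SGD
Iteration complexity      (L/μ) log(1/ε)              1/(με)
Cost of an iteration      O(nd)                       O(d)
Total complexity*         O(nd (L/μ) log(1/ε))        O(d/(με))

So when n is big, SGD's cheap iterations win; when the tolerance ε is small, GD's log(1/ε) dependence wins.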

Page 62: Comparison of SGD vs GD

[Plot: error vs time, SGD]

M. Schmidt, N. Le Roux, F. Bach (2016). Minimizing Finite Sums with the Stochastic Average Gradient. Mathematical Programming.

Page 63: Comparison of SGD vs GD (continued)

Page 64: Comparison of SGD vs GD (continued)

[Plot: error vs time, SGD and GD]

Page 65: Comparison of SGD vs GD (continued)

[Plot: error vs time, SGD, GD, and the Stochastic Average Gradient (SAG)]

Page 66: Practical SGD for Sparse Data

Page 67: Lazy SGD Updates for Sparse Data

Finite sum training problem with an L2 regularizer and a linear hypothesis.

Assume each data point x_i is s-sparse: how many operations does each SGD step cost?

Page 68: Lazy SGD Updates for Sparse Data (continued)

Page 69: Lazy SGD Updates for Sparse Data (continued)

Rescaling: O(d)
Adding a sparse vector: O(s)

Page 70: Lazy SGD Updates for Sparse Data

SGD step

EXE:

Page 71: Lazy SGD Updates for Sparse Data (continued)

Page 72: Lazy SGD Updates for Sparse Data (continued)

Page 73: Lazy SGD Updates for Sparse Data (continued)

SGD step: O(1) scaling + O(s) sparse add = O(s) per update

EXE:
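A sketch of the lazy update: store w implicitly as w = β·v, so the dense L2 shrinkage becomes an O(1) scalar update and only the s nonzeros of x_i are touched. The least-squares loss and all names are assumptions; the same trick works for any loss of the form ℓ_i(x_iᵀw) + (λ/2)‖w‖².

```python
import numpy as np

def lazy_sgd_step(v, beta, x_idx, x_val, y_i, alpha, lam):
    """One lazy SGD step for f_i(w) = 0.5*(x_i^T w - y_i)^2 + (lam/2)*||w||^2,
    with w represented implicitly as w = beta * v.
    x_idx, x_val hold the s nonzero positions and values of the sparse x_i."""
    xw = beta * np.dot(x_val, v[x_idx])          # x_i^T w using only s nonzeros: O(s)
    g = xw - y_i                                 # scalar derivative of the squared loss
    beta_new = (1.0 - alpha * lam) * beta        # O(1) rescaling replaces O(d) shrinkage
    v[x_idx] -= (alpha * g / beta_new) * x_val   # O(s) sparse addition
    return v, beta_new
```

One can check that beta_new * v after the step equals the dense update (1 - αλ)w - α(x_iᵀw - y_i)x_i; if 1 - αλ ever reaches zero, materialize w = βv and reset β to 1.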

Page 74: Why Machine Learners Like SGD

Page 75: Why Machine Learners Like SGD

The statistical learning problem: minimize the expected loss over an unknown distribution.

We want to solve the expected-loss problem, though what we actually solve is the finite-sum training problem.

SGD can solve the statistical learning problem!
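In symbols (a standard formulation, assumed to match the slide):

$$\text{we want}\quad \min_w \; \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell(w; x, y)\big], \qquad \text{though we solve}\quad \min_w \; \frac{1}{n}\sum_{i=1}^n \ell(w; x_i, y_i).$$

If each SGD step uses a freshly drawn sample $z_t \sim \mathcal{D}$, then $\nabla \ell(w^t; z_t)$ is an unbiased estimate of the expected-risk gradient, so single-pass SGD optimizes the statistical learning problem directly.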

Page 76: Why Machine Learners Like SGD (continued)

Page 77: Exercise List time! Please solve:

stoch_ridge_reg_exe
SGD_proof_exe

Page 78: Appendix