subgradient methods for huge-scale optimization problems - Юрий Нестеров, catholic...
DESCRIPTION
We consider a new class of huge-scale problems, the problems with sparse subgradients. The most important functions of this type are piecewise linear. For optimization problems with uniform sparsity of corresponding linear operators, we suggest a very efficient implementation of subgradient iterations, the total cost of which depends logarithmically in the dimension. This technique is based on a recursive update of the results of matrix/vector products and the values of symmetric functions. It works well, for example, for matrices with few nonzero diagonals and for max-type functions. We show that the updating technique can be efficiently coupled with the simplest subgradient methods. Similar results can be obtained for a new non-smooth random variant of a coordinate descent scheme. We also present promising results of preliminary computational experiments.TRANSCRIPT
![Page 1: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/1.jpg)
Subgradient methods for huge-scale optimizationproblems
Yurii Nesterov, CORE/INMA (UCL)
November 10, 2014 (Yandex, Moscow)
Yu. Nesterov Subgradient methods for huge-scale problems 1/22
![Page 2: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/2.jpg)
Outline
1 Problems sizes
2 Sparse Optimization problems
3 Sparse updates for linear operators
4 Fast updates in computational trees
5 Simple subgradient methods
6 Application examples
7 Computational experiments: Google problem
Yu. Nesterov Subgradient methods for huge-scale problems 2/22
![Page 3: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/3.jpg)
Nonlinear Optimization: problems sizes
Class Operations Dimension Iter.Cost Memory
Small-size All 100 − 102 n4 → n3 Kilobyte: 103
Medium-size A−1 103 − 104 n3 → n2 Megabyte: 106
Large-scale Ax 105 − 107 n2 → n Gigabyte: 109
Huge-scale x + y 108 − 1012 n→ log n Terabyte: 1012
Sources of Huge-Scale problems
Internet (New)
Telecommunications (New)
Finite-element schemes (Old)
PDE, Weather prediction (Old)
Main hope: Sparsity.
Yu. Nesterov Subgradient methods for huge-scale problems 3/22
![Page 4: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/4.jpg)
Nonlinear Optimization: problems sizes
Class Operations Dimension Iter.Cost Memory
Small-size All 100 − 102 n4 → n3 Kilobyte: 103
Medium-size A−1 103 − 104 n3 → n2 Megabyte: 106
Large-scale Ax 105 − 107 n2 → n Gigabyte: 109
Huge-scale x + y 108 − 1012 n→ log n Terabyte: 1012
Sources of Huge-Scale problems
Internet (New)
Telecommunications (New)
Finite-element schemes (Old)
PDE, Weather prediction (Old)
Main hope: Sparsity.
Yu. Nesterov Subgradient methods for huge-scale problems 3/22
![Page 5: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/5.jpg)
Nonlinear Optimization: problems sizes
Class Operations Dimension Iter.Cost Memory
Small-size All 100 − 102 n4 → n3 Kilobyte: 103
Medium-size A−1 103 − 104 n3 → n2 Megabyte: 106
Large-scale Ax 105 − 107 n2 → n Gigabyte: 109
Huge-scale x + y 108 − 1012 n→ log n Terabyte: 1012
Sources of Huge-Scale problems
Internet (New)
Telecommunications (New)
Finite-element schemes (Old)
PDE, Weather prediction (Old)
Main hope: Sparsity.
Yu. Nesterov Subgradient methods for huge-scale problems 3/22
![Page 6: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/6.jpg)
Nonlinear Optimization: problems sizes
Class Operations Dimension Iter.Cost Memory
Small-size All 100 − 102 n4 → n3 Kilobyte: 103
Medium-size A−1 103 − 104 n3 → n2 Megabyte: 106
Large-scale Ax 105 − 107 n2 → n Gigabyte: 109
Huge-scale x + y 108 − 1012 n→ log n Terabyte: 1012
Sources of Huge-Scale problems
Internet (New)
Telecommunications (New)
Finite-element schemes (Old)
PDE, Weather prediction (Old)
Main hope: Sparsity.
Yu. Nesterov Subgradient methods for huge-scale problems 3/22
![Page 7: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/7.jpg)
Nonlinear Optimization: problems sizes
Class Operations Dimension Iter.Cost Memory
Small-size All 100 − 102 n4 → n3 Kilobyte: 103
Medium-size A−1 103 − 104 n3 → n2 Megabyte: 106
Large-scale Ax 105 − 107 n2 → n Gigabyte: 109
Huge-scale x + y 108 − 1012 n→ log n Terabyte: 1012
Sources of Huge-Scale problems
Internet (New)
Telecommunications (New)
Finite-element schemes (Old)
PDE, Weather prediction (Old)
Main hope: Sparsity.
Yu. Nesterov Subgradient methods for huge-scale problems 3/22
![Page 8: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/8.jpg)
Nonlinear Optimization: problems sizes
Class Operations Dimension Iter.Cost Memory
Small-size All 100 − 102 n4 → n3 Kilobyte: 103
Medium-size A−1 103 − 104 n3 → n2 Megabyte: 106
Large-scale Ax 105 − 107 n2 → n Gigabyte: 109
Huge-scale x + y 108 − 1012 n→ log n Terabyte: 1012
Sources of Huge-Scale problems
Internet (New)
Telecommunications (New)
Finite-element schemes (Old)
PDE, Weather prediction (Old)
Main hope: Sparsity.
Yu. Nesterov Subgradient methods for huge-scale problems 3/22
![Page 9: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/9.jpg)
Nonlinear Optimization: problems sizes
Class Operations Dimension Iter.Cost Memory
Small-size All 100 − 102 n4 → n3 Kilobyte: 103
Medium-size A−1 103 − 104 n3 → n2 Megabyte: 106
Large-scale Ax 105 − 107 n2 → n Gigabyte: 109
Huge-scale x + y 108 − 1012 n→ log n Terabyte: 1012
Sources of Huge-Scale problems
Internet (New)
Telecommunications (New)
Finite-element schemes (Old)
PDE, Weather prediction (Old)
Main hope: Sparsity.Yu. Nesterov Subgradient methods for huge-scale problems 3/22
![Page 10: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/10.jpg)
Sparse problems
Problem: minx∈Q
f (x), where Q is closed and convex in RN , and
f (x) = Ψ(Ax), where Ψ is a simple convex function:
Ψ(y1) ≥ Ψ(y2) + 〈Ψ′(y2), y1 − y2〉, y1, y2 ∈ RM ,
A : RN → RM is a sparse matrix.
Let p(x)def= # of nonzeros in x . Sparsity coefficient: γ(A)
def= p(A)
MN .
Example 1: Matrix-vector multiplication
Computation of vector Ax needs p(A) operations.
Initial complexity MN is reduced in γ(A) times.
Yu. Nesterov Subgradient methods for huge-scale problems 4/22
![Page 11: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/11.jpg)
Sparse problems
Problem: minx∈Q
f (x), where Q is closed and convex in RN ,
and
f (x) = Ψ(Ax), where Ψ is a simple convex function:
Ψ(y1) ≥ Ψ(y2) + 〈Ψ′(y2), y1 − y2〉, y1, y2 ∈ RM ,
A : RN → RM is a sparse matrix.
Let p(x)def= # of nonzeros in x . Sparsity coefficient: γ(A)
def= p(A)
MN .
Example 1: Matrix-vector multiplication
Computation of vector Ax needs p(A) operations.
Initial complexity MN is reduced in γ(A) times.
Yu. Nesterov Subgradient methods for huge-scale problems 4/22
![Page 12: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/12.jpg)
Sparse problems
Problem: minx∈Q
f (x), where Q is closed and convex in RN , and
f (x) = Ψ(Ax), where Ψ is a simple convex function:
Ψ(y1) ≥ Ψ(y2) + 〈Ψ′(y2), y1 − y2〉, y1, y2 ∈ RM ,
A : RN → RM is a sparse matrix.
Let p(x)def= # of nonzeros in x . Sparsity coefficient: γ(A)
def= p(A)
MN .
Example 1: Matrix-vector multiplication
Computation of vector Ax needs p(A) operations.
Initial complexity MN is reduced in γ(A) times.
Yu. Nesterov Subgradient methods for huge-scale problems 4/22
![Page 13: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/13.jpg)
Sparse problems
Problem: minx∈Q
f (x), where Q is closed and convex in RN , and
f (x) = Ψ(Ax), where Ψ is a simple convex function:
Ψ(y1) ≥ Ψ(y2) + 〈Ψ′(y2), y1 − y2〉, y1, y2 ∈ RM ,
A : RN → RM is a sparse matrix.
Let p(x)def= # of nonzeros in x . Sparsity coefficient: γ(A)
def= p(A)
MN .
Example 1: Matrix-vector multiplication
Computation of vector Ax needs p(A) operations.
Initial complexity MN is reduced in γ(A) times.
Yu. Nesterov Subgradient methods for huge-scale problems 4/22
![Page 14: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/14.jpg)
Sparse problems
Problem: minx∈Q
f (x), where Q is closed and convex in RN , and
f (x) = Ψ(Ax), where Ψ is a simple convex function:
Ψ(y1) ≥ Ψ(y2) + 〈Ψ′(y2), y1 − y2〉, y1, y2 ∈ RM ,
A : RN → RM is a sparse matrix.
Let p(x)def= # of nonzeros in x .
Sparsity coefficient: γ(A)def= p(A)
MN .
Example 1: Matrix-vector multiplication
Computation of vector Ax needs p(A) operations.
Initial complexity MN is reduced in γ(A) times.
Yu. Nesterov Subgradient methods for huge-scale problems 4/22
![Page 15: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/15.jpg)
Sparse problems
Problem: minx∈Q
f (x), where Q is closed and convex in RN , and
f (x) = Ψ(Ax), where Ψ is a simple convex function:
Ψ(y1) ≥ Ψ(y2) + 〈Ψ′(y2), y1 − y2〉, y1, y2 ∈ RM ,
A : RN → RM is a sparse matrix.
Let p(x)def= # of nonzeros in x . Sparsity coefficient: γ(A)
def= p(A)
MN .
Example 1: Matrix-vector multiplication
Computation of vector Ax needs p(A) operations.
Initial complexity MN is reduced in γ(A) times.
Yu. Nesterov Subgradient methods for huge-scale problems 4/22
![Page 16: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/16.jpg)
Sparse problems
Problem: minx∈Q
f (x), where Q is closed and convex in RN , and
f (x) = Ψ(Ax), where Ψ is a simple convex function:
Ψ(y1) ≥ Ψ(y2) + 〈Ψ′(y2), y1 − y2〉, y1, y2 ∈ RM ,
A : RN → RM is a sparse matrix.
Let p(x)def= # of nonzeros in x . Sparsity coefficient: γ(A)
def= p(A)
MN .
Example 1: Matrix-vector multiplication
Computation of vector Ax needs p(A) operations.
Initial complexity MN is reduced in γ(A) times.
Yu. Nesterov Subgradient methods for huge-scale problems 4/22
![Page 17: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/17.jpg)
Sparse problems
Problem: minx∈Q
f (x), where Q is closed and convex in RN , and
f (x) = Ψ(Ax), where Ψ is a simple convex function:
Ψ(y1) ≥ Ψ(y2) + 〈Ψ′(y2), y1 − y2〉, y1, y2 ∈ RM ,
A : RN → RM is a sparse matrix.
Let p(x)def= # of nonzeros in x . Sparsity coefficient: γ(A)
def= p(A)
MN .
Example 1: Matrix-vector multiplication
Computation of vector Ax needs p(A) operations.
Initial complexity MN is reduced in γ(A) times.
Yu. Nesterov Subgradient methods for huge-scale problems 4/22
![Page 18: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/18.jpg)
Sparse problems
Problem: minx∈Q
f (x), where Q is closed and convex in RN , and
f (x) = Ψ(Ax), where Ψ is a simple convex function:
Ψ(y1) ≥ Ψ(y2) + 〈Ψ′(y2), y1 − y2〉, y1, y2 ∈ RM ,
A : RN → RM is a sparse matrix.
Let p(x)def= # of nonzeros in x . Sparsity coefficient: γ(A)
def= p(A)
MN .
Example 1: Matrix-vector multiplication
Computation of vector Ax needs p(A) operations.
Initial complexity MN is reduced in γ(A) times.
Yu. Nesterov Subgradient methods for huge-scale problems 4/22
![Page 19: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/19.jpg)
Example: Gradient Method
x0 ∈ Q, xk+1 = πQ(xk − hf ′(xk)), k ≥ 0.
Main computational expenses
Projection of simple set Q needs O(N) operations.
Displacement xk → xk − hf ′(xk) needs O(N) operations.
f ′(x) = AT Ψ′(Ax). If Ψ is simple, then the main efforts arespent for two matrix-vector multiplications: 2p(A).
Conclusion: As compared with full matrices, we accelerate inγ(A) times.Note: For Large- and Huge-scale problems, we often have
γ(A) ≈ 10−4 . . . 10−6. Can we get more?
Yu. Nesterov Subgradient methods for huge-scale problems 5/22
![Page 20: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/20.jpg)
Example: Gradient Method
x0 ∈ Q, xk+1 = πQ(xk − hf ′(xk)), k ≥ 0.
Main computational expenses
Projection of simple set Q needs O(N) operations.
Displacement xk → xk − hf ′(xk) needs O(N) operations.
f ′(x) = AT Ψ′(Ax). If Ψ is simple, then the main efforts arespent for two matrix-vector multiplications: 2p(A).
Conclusion: As compared with full matrices, we accelerate inγ(A) times.Note: For Large- and Huge-scale problems, we often have
γ(A) ≈ 10−4 . . . 10−6. Can we get more?
Yu. Nesterov Subgradient methods for huge-scale problems 5/22
![Page 21: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/21.jpg)
Example: Gradient Method
x0 ∈ Q, xk+1 = πQ(xk − hf ′(xk)), k ≥ 0.
Main computational expenses
Projection of simple set Q needs O(N) operations.
Displacement xk → xk − hf ′(xk) needs O(N) operations.
f ′(x) = AT Ψ′(Ax). If Ψ is simple, then the main efforts arespent for two matrix-vector multiplications: 2p(A).
Conclusion: As compared with full matrices, we accelerate inγ(A) times.Note: For Large- and Huge-scale problems, we often have
γ(A) ≈ 10−4 . . . 10−6. Can we get more?
Yu. Nesterov Subgradient methods for huge-scale problems 5/22
![Page 22: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/22.jpg)
Example: Gradient Method
x0 ∈ Q, xk+1 = πQ(xk − hf ′(xk)), k ≥ 0.
Main computational expenses
Projection of simple set Q needs O(N) operations.
Displacement xk → xk − hf ′(xk) needs O(N) operations.
f ′(x) = AT Ψ′(Ax). If Ψ is simple, then the main efforts arespent for two matrix-vector multiplications: 2p(A).
Conclusion: As compared with full matrices, we accelerate inγ(A) times.Note: For Large- and Huge-scale problems, we often have
γ(A) ≈ 10−4 . . . 10−6. Can we get more?
Yu. Nesterov Subgradient methods for huge-scale problems 5/22
![Page 23: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/23.jpg)
Example: Gradient Method
x0 ∈ Q, xk+1 = πQ(xk − hf ′(xk)), k ≥ 0.
Main computational expenses
Projection of simple set Q needs O(N) operations.
Displacement xk → xk − hf ′(xk) needs O(N) operations.
f ′(x) = AT Ψ′(Ax). If Ψ is simple, then the main efforts arespent for two matrix-vector multiplications: 2p(A).
Conclusion: As compared with full matrices, we accelerate inγ(A) times.Note: For Large- and Huge-scale problems, we often have
γ(A) ≈ 10−4 . . . 10−6. Can we get more?
Yu. Nesterov Subgradient methods for huge-scale problems 5/22
![Page 24: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/24.jpg)
Example: Gradient Method
x0 ∈ Q, xk+1 = πQ(xk − hf ′(xk)), k ≥ 0.
Main computational expenses
Projection of simple set Q needs O(N) operations.
Displacement xk → xk − hf ′(xk) needs O(N) operations.
f ′(x) = AT Ψ′(Ax).
If Ψ is simple, then the main efforts arespent for two matrix-vector multiplications: 2p(A).
Conclusion: As compared with full matrices, we accelerate inγ(A) times.Note: For Large- and Huge-scale problems, we often have
γ(A) ≈ 10−4 . . . 10−6. Can we get more?
Yu. Nesterov Subgradient methods for huge-scale problems 5/22
![Page 25: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/25.jpg)
Example: Gradient Method
x0 ∈ Q, xk+1 = πQ(xk − hf ′(xk)), k ≥ 0.
Main computational expenses
Projection of simple set Q needs O(N) operations.
Displacement xk → xk − hf ′(xk) needs O(N) operations.
f ′(x) = AT Ψ′(Ax). If Ψ is simple, then the main efforts arespent for two matrix-vector multiplications: 2p(A).
Conclusion: As compared with full matrices, we accelerate inγ(A) times.Note: For Large- and Huge-scale problems, we often have
γ(A) ≈ 10−4 . . . 10−6. Can we get more?
Yu. Nesterov Subgradient methods for huge-scale problems 5/22
![Page 26: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/26.jpg)
Example: Gradient Method
x0 ∈ Q, xk+1 = πQ(xk − hf ′(xk)), k ≥ 0.
Main computational expenses
Projection of simple set Q needs O(N) operations.
Displacement xk → xk − hf ′(xk) needs O(N) operations.
f ′(x) = AT Ψ′(Ax). If Ψ is simple, then the main efforts arespent for two matrix-vector multiplications: 2p(A).
Conclusion:
As compared with full matrices, we accelerate inγ(A) times.Note: For Large- and Huge-scale problems, we often have
γ(A) ≈ 10−4 . . . 10−6. Can we get more?
Yu. Nesterov Subgradient methods for huge-scale problems 5/22
![Page 27: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/27.jpg)
Example: Gradient Method
x0 ∈ Q, xk+1 = πQ(xk − hf ′(xk)), k ≥ 0.
Main computational expenses
Projection of simple set Q needs O(N) operations.
Displacement xk → xk − hf ′(xk) needs O(N) operations.
f ′(x) = AT Ψ′(Ax). If Ψ is simple, then the main efforts arespent for two matrix-vector multiplications: 2p(A).
Conclusion: As compared with full matrices, we accelerate inγ(A) times.
Note: For Large- and Huge-scale problems, we often haveγ(A) ≈ 10−4 . . . 10−6. Can we get more?
Yu. Nesterov Subgradient methods for huge-scale problems 5/22
![Page 28: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/28.jpg)
Example: Gradient Method
x0 ∈ Q, xk+1 = πQ(xk − hf ′(xk)), k ≥ 0.
Main computational expenses
Projection of simple set Q needs O(N) operations.
Displacement xk → xk − hf ′(xk) needs O(N) operations.
f ′(x) = AT Ψ′(Ax). If Ψ is simple, then the main efforts arespent for two matrix-vector multiplications: 2p(A).
Conclusion: As compared with full matrices, we accelerate inγ(A) times.Note: For Large- and Huge-scale problems, we often have
γ(A) ≈ 10−4 . . . 10−6.
Can we get more?
Yu. Nesterov Subgradient methods for huge-scale problems 5/22
![Page 29: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/29.jpg)
Example: Gradient Method
x0 ∈ Q, xk+1 = πQ(xk − hf ′(xk)), k ≥ 0.
Main computational expenses
Projection of simple set Q needs O(N) operations.
Displacement xk → xk − hf ′(xk) needs O(N) operations.
f ′(x) = AT Ψ′(Ax). If Ψ is simple, then the main efforts arespent for two matrix-vector multiplications: 2p(A).
Conclusion: As compared with full matrices, we accelerate inγ(A) times.Note: For Large- and Huge-scale problems, we often have
γ(A) ≈ 10−4 . . . 10−6. Can we get more?
Yu. Nesterov Subgradient methods for huge-scale problems 5/22
![Page 30: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/30.jpg)
Sparse updating strategy
Main idea
After update x+ = x + d we have y+def= Ax+ = Ax︸︷︷︸
y
+Ad .
What happens if d is sparse?
Denote σ(d) = {j : d (j) 6= 0}. Then y+ = y +∑
j∈σ(d)
d (j) · Aej .
Its complexity, κA(d)def=
∑j∈σ(d)
p(Aej), can be VERY small!
κA(d) = M∑
j∈σ(d)
γ(Aej) = γ(d) · 1p(d)
∑j∈σ(d)
γ(Aej) ·MN
≤ γ(d) max1≤j≤m
γ(Aej) ·MN.
If γ(d) ≤ cγ(A), γ(Aj) ≤ cγ(A), then κA(d) ≤ c2 · γ2(A) ·MN .
Expected acceleration: (10−6)2 = 10−12 ⇒ 1 sec ≈ 32 000years!
Yu. Nesterov Subgradient methods for huge-scale problems 6/22
![Page 31: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/31.jpg)
Sparse updating strategy
Main idea
After update x+ = x + d we have y+def= Ax+ = Ax︸︷︷︸
y
+Ad .
What happens if d is sparse?
Denote σ(d) = {j : d (j) 6= 0}. Then y+ = y +∑
j∈σ(d)
d (j) · Aej .
Its complexity, κA(d)def=
∑j∈σ(d)
p(Aej), can be VERY small!
κA(d) = M∑
j∈σ(d)
γ(Aej) = γ(d) · 1p(d)
∑j∈σ(d)
γ(Aej) ·MN
≤ γ(d) max1≤j≤m
γ(Aej) ·MN.
If γ(d) ≤ cγ(A), γ(Aj) ≤ cγ(A), then κA(d) ≤ c2 · γ2(A) ·MN .
Expected acceleration: (10−6)2 = 10−12 ⇒ 1 sec ≈ 32 000years!
Yu. Nesterov Subgradient methods for huge-scale problems 6/22
![Page 32: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/32.jpg)
Sparse updating strategy
Main idea
After update x+ = x + d
we have y+def= Ax+ = Ax︸︷︷︸
y
+Ad .
What happens if d is sparse?
Denote σ(d) = {j : d (j) 6= 0}. Then y+ = y +∑
j∈σ(d)
d (j) · Aej .
Its complexity, κA(d)def=
∑j∈σ(d)
p(Aej), can be VERY small!
κA(d) = M∑
j∈σ(d)
γ(Aej) = γ(d) · 1p(d)
∑j∈σ(d)
γ(Aej) ·MN
≤ γ(d) max1≤j≤m
γ(Aej) ·MN.
If γ(d) ≤ cγ(A), γ(Aj) ≤ cγ(A), then κA(d) ≤ c2 · γ2(A) ·MN .
Expected acceleration: (10−6)2 = 10−12 ⇒ 1 sec ≈ 32 000years!
Yu. Nesterov Subgradient methods for huge-scale problems 6/22
![Page 33: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/33.jpg)
Sparse updating strategy
Main idea
After update x+ = x + d we have y+def= Ax+
= Ax︸︷︷︸y
+Ad .
What happens if d is sparse?
Denote σ(d) = {j : d (j) 6= 0}. Then y+ = y +∑
j∈σ(d)
d (j) · Aej .
Its complexity, κA(d)def=
∑j∈σ(d)
p(Aej), can be VERY small!
κA(d) = M∑
j∈σ(d)
γ(Aej) = γ(d) · 1p(d)
∑j∈σ(d)
γ(Aej) ·MN
≤ γ(d) max1≤j≤m
γ(Aej) ·MN.
If γ(d) ≤ cγ(A), γ(Aj) ≤ cγ(A), then κA(d) ≤ c2 · γ2(A) ·MN .
Expected acceleration: (10−6)2 = 10−12 ⇒ 1 sec ≈ 32 000years!
Yu. Nesterov Subgradient methods for huge-scale problems 6/22
![Page 34: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/34.jpg)
Sparse updating strategy
Main idea
After update x+ = x + d we have y+def= Ax+ = Ax︸︷︷︸
y
+Ad .
What happens if d is sparse?
Denote σ(d) = {j : d (j) 6= 0}. Then y+ = y +∑
j∈σ(d)
d (j) · Aej .
Its complexity, κA(d)def=
∑j∈σ(d)
p(Aej), can be VERY small!
κA(d) = M∑
j∈σ(d)
γ(Aej) = γ(d) · 1p(d)
∑j∈σ(d)
γ(Aej) ·MN
≤ γ(d) max1≤j≤m
γ(Aej) ·MN.
If γ(d) ≤ cγ(A), γ(Aj) ≤ cγ(A), then κA(d) ≤ c2 · γ2(A) ·MN .
Expected acceleration: (10−6)2 = 10−12 ⇒ 1 sec ≈ 32 000years!
Yu. Nesterov Subgradient methods for huge-scale problems 6/22
![Page 35: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/35.jpg)
Sparse updating strategy
Main idea
After update x+ = x + d we have y+def= Ax+ = Ax︸︷︷︸
y
+Ad .
What happens if d is sparse?
Denote σ(d) = {j : d (j) 6= 0}. Then y+ = y +∑
j∈σ(d)
d (j) · Aej .
Its complexity, κA(d)def=
∑j∈σ(d)
p(Aej), can be VERY small!
κA(d) = M∑
j∈σ(d)
γ(Aej) = γ(d) · 1p(d)
∑j∈σ(d)
γ(Aej) ·MN
≤ γ(d) max1≤j≤m
γ(Aej) ·MN.
If γ(d) ≤ cγ(A), γ(Aj) ≤ cγ(A), then κA(d) ≤ c2 · γ2(A) ·MN .
Expected acceleration: (10−6)2 = 10−12 ⇒ 1 sec ≈ 32 000years!
Yu. Nesterov Subgradient methods for huge-scale problems 6/22
![Page 36: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/36.jpg)
Sparse updating strategy
Main idea
After update x+ = x + d we have y+def= Ax+ = Ax︸︷︷︸
y
+Ad .
What happens if d is sparse?
Denote σ(d) = {j : d (j) 6= 0}.
Then y+ = y +∑
j∈σ(d)
d (j) · Aej .
Its complexity, κA(d)def=
∑j∈σ(d)
p(Aej), can be VERY small!
κA(d) = M∑
j∈σ(d)
γ(Aej) = γ(d) · 1p(d)
∑j∈σ(d)
γ(Aej) ·MN
≤ γ(d) max1≤j≤m
γ(Aej) ·MN.
If γ(d) ≤ cγ(A), γ(Aj) ≤ cγ(A), then κA(d) ≤ c2 · γ2(A) ·MN .
Expected acceleration: (10−6)2 = 10−12 ⇒ 1 sec ≈ 32 000years!
Yu. Nesterov Subgradient methods for huge-scale problems 6/22
![Page 37: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/37.jpg)
Sparse updating strategy
Main idea
After update x+ = x + d we have y+def= Ax+ = Ax︸︷︷︸
y
+Ad .
What happens if d is sparse?
Denote σ(d) = {j : d (j) 6= 0}. Then y+ = y +∑
j∈σ(d)
d (j) · Aej .
Its complexity, κA(d)def=
∑j∈σ(d)
p(Aej), can be VERY small!
κA(d) = M∑
j∈σ(d)
γ(Aej) = γ(d) · 1p(d)
∑j∈σ(d)
γ(Aej) ·MN
≤ γ(d) max1≤j≤m
γ(Aej) ·MN.
If γ(d) ≤ cγ(A), γ(Aj) ≤ cγ(A), then κA(d) ≤ c2 · γ2(A) ·MN .
Expected acceleration: (10−6)2 = 10−12 ⇒ 1 sec ≈ 32 000years!
Yu. Nesterov Subgradient methods for huge-scale problems 6/22
![Page 38: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/38.jpg)
Sparse updating strategy
Main idea
After update x+ = x + d we have y+def= Ax+ = Ax︸︷︷︸
y
+Ad .
What happens if d is sparse?
Denote σ(d) = {j : d (j) 6= 0}. Then y+ = y +∑
j∈σ(d)
d (j) · Aej .
Its complexity, κA(d)def=
∑j∈σ(d)
p(Aej),
can be VERY small!
κA(d) = M∑
j∈σ(d)
γ(Aej) = γ(d) · 1p(d)
∑j∈σ(d)
γ(Aej) ·MN
≤ γ(d) max1≤j≤m
γ(Aej) ·MN.
If γ(d) ≤ cγ(A), γ(Aj) ≤ cγ(A), then κA(d) ≤ c2 · γ2(A) ·MN .
Expected acceleration: (10−6)2 = 10−12 ⇒ 1 sec ≈ 32 000years!
Yu. Nesterov Subgradient methods for huge-scale problems 6/22
![Page 39: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/39.jpg)
Sparse updating strategy
Main idea
After update x+ = x + d we have y+def= Ax+ = Ax︸︷︷︸
y
+Ad .
What happens if d is sparse?
Denote σ(d) = {j : d (j) 6= 0}. Then y+ = y +∑
j∈σ(d)
d (j) · Aej .
Its complexity, κA(d)def=
∑j∈σ(d)
p(Aej), can be VERY small!
κA(d) = M∑
j∈σ(d)
γ(Aej) = γ(d) · 1p(d)
∑j∈σ(d)
γ(Aej) ·MN
≤ γ(d) max1≤j≤m
γ(Aej) ·MN.
If γ(d) ≤ cγ(A), γ(Aj) ≤ cγ(A), then κA(d) ≤ c2 · γ2(A) ·MN .
Expected acceleration: (10−6)2 = 10−12 ⇒ 1 sec ≈ 32 000years!
Yu. Nesterov Subgradient methods for huge-scale problems 6/22
![Page 40: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/40.jpg)
Sparse updating strategy
Main idea
After update x+ = x + d we have y+def= Ax+ = Ax︸︷︷︸
y
+Ad .
What happens if d is sparse?
Denote σ(d) = {j : d (j) 6= 0}. Then y+ = y +∑
j∈σ(d)
d (j) · Aej .
Its complexity, κA(d)def=
∑j∈σ(d)
p(Aej), can be VERY small!
κA(d) = M∑
j∈σ(d)
γ(Aej)
= γ(d) · 1p(d)
∑j∈σ(d)
γ(Aej) ·MN
≤ γ(d) max1≤j≤m
γ(Aej) ·MN.
If γ(d) ≤ cγ(A), γ(Aj) ≤ cγ(A), then κA(d) ≤ c2 · γ2(A) ·MN .
Expected acceleration: (10−6)2 = 10−12 ⇒ 1 sec ≈ 32 000years!
Yu. Nesterov Subgradient methods for huge-scale problems 6/22
![Page 41: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/41.jpg)
Sparse updating strategy
Main idea
After update x+ = x + d we have y+def= Ax+ = Ax︸︷︷︸
y
+Ad .
What happens if d is sparse?
Denote σ(d) = {j : d (j) 6= 0}. Then y+ = y +∑
j∈σ(d)
d (j) · Aej .
Its complexity, κA(d)def=
∑j∈σ(d)
p(Aej), can be VERY small!
κA(d) = M∑
j∈σ(d)
γ(Aej) = γ(d) · 1p(d)
∑j∈σ(d)
γ(Aej) ·MN
≤ γ(d) max1≤j≤m
γ(Aej) ·MN.
If γ(d) ≤ cγ(A), γ(Aj) ≤ cγ(A), then κA(d) ≤ c2 · γ2(A) ·MN .
Expected acceleration: (10−6)2 = 10−12 ⇒ 1 sec ≈ 32 000years!
Yu. Nesterov Subgradient methods for huge-scale problems 6/22
![Page 42: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/42.jpg)
Sparse updating strategy
Main idea
After update x+ = x + d we have y+def= Ax+ = Ax︸︷︷︸
y
+Ad .
What happens if d is sparse?
Denote σ(d) = {j : d (j) 6= 0}. Then y+ = y +∑
j∈σ(d)
d (j) · Aej .
Its complexity, κA(d)def=
∑j∈σ(d)
p(Aej), can be VERY small!
κA(d) = M∑
j∈σ(d)
γ(Aej) = γ(d) · 1p(d)
∑j∈σ(d)
γ(Aej) ·MN
≤ γ(d) max1≤j≤m
γ(Aej) ·MN.
If γ(d) ≤ cγ(A), γ(Aj) ≤ cγ(A), then κA(d) ≤ c2 · γ2(A) ·MN .
Expected acceleration: (10−6)2 = 10−12 ⇒ 1 sec ≈ 32 000years!
Yu. Nesterov Subgradient methods for huge-scale problems 6/22
![Page 43: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/43.jpg)
Sparse updating strategy
Main idea
After update x+ = x + d we have y+def= Ax+ = Ax︸︷︷︸
y
+Ad .
What happens if d is sparse?
Denote σ(d) = {j : d (j) 6= 0}. Then y+ = y +∑
j∈σ(d)
d (j) · Aej .
Its complexity, κA(d)def=
∑j∈σ(d)
p(Aej), can be VERY small!
κA(d) = M∑
j∈σ(d)
γ(Aej) = γ(d) · 1p(d)
∑j∈σ(d)
γ(Aej) ·MN
≤ γ(d) max1≤j≤m
γ(Aej) ·MN.
If γ(d) ≤ cγ(A), γ(Aj) ≤ cγ(A),
then κA(d) ≤ c2 · γ2(A) ·MN .
Expected acceleration: (10−6)2 = 10−12 ⇒ 1 sec ≈ 32 000years!
Yu. Nesterov Subgradient methods for huge-scale problems 6/22
![Page 44: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/44.jpg)
Sparse updating strategy
Main idea
After update x+ = x + d we have y+def= Ax+ = Ax︸︷︷︸
y
+Ad .
What happens if d is sparse?
Denote σ(d) = {j : d (j) 6= 0}. Then y+ = y +∑
j∈σ(d)
d (j) · Aej .
Its complexity, κA(d)def=
∑j∈σ(d)
p(Aej), can be VERY small!
κA(d) = M∑
j∈σ(d)
γ(Aej) = γ(d) · 1p(d)
∑j∈σ(d)
γ(Aej) ·MN
≤ γ(d) max1≤j≤m
γ(Aej) ·MN.
If γ(d) ≤ cγ(A), γ(Aj) ≤ cγ(A), then κA(d) ≤ c2 · γ2(A) ·MN .
Expected acceleration: (10−6)2 = 10−12 ⇒ 1 sec ≈ 32 000years!
Yu. Nesterov Subgradient methods for huge-scale problems 6/22
![Page 45: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/45.jpg)
Sparse updating strategy
Main idea
After update x+ = x + d we have y+def= Ax+ = Ax︸︷︷︸
y
+Ad .
What happens if d is sparse?
Denote σ(d) = {j : d (j) 6= 0}. Then y+ = y +∑
j∈σ(d)
d (j) · Aej .
Its complexity, κA(d)def=
∑j∈σ(d)
p(Aej), can be VERY small!
κA(d) = M∑
j∈σ(d)
γ(Aej) = γ(d) · 1p(d)
∑j∈σ(d)
γ(Aej) ·MN
≤ γ(d) max1≤j≤m
γ(Aej) ·MN.
If γ(d) ≤ cγ(A), γ(Aj) ≤ cγ(A), then κA(d) ≤ c2 · γ2(A) ·MN .
Expected acceleration:
(10−6)2 = 10−12 ⇒ 1 sec ≈ 32 000years!
Yu. Nesterov Subgradient methods for huge-scale problems 6/22
![Page 46: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/46.jpg)
Sparse updating strategy
Main idea
After update x+ = x + d we have y+def= Ax+ = Ax︸︷︷︸
y
+Ad .
What happens if d is sparse?
Denote σ(d) = {j : d (j) 6= 0}. Then y+ = y +∑
j∈σ(d)
d (j) · Aej .
Its complexity, κA(d)def=
∑j∈σ(d)
p(Aej), can be VERY small!
κA(d) = M∑
j∈σ(d)
γ(Aej) = γ(d) · 1p(d)
∑j∈σ(d)
γ(Aej) ·MN
≤ γ(d) max1≤j≤m
γ(Aej) ·MN.
If γ(d) ≤ cγ(A), γ(Aj) ≤ cγ(A), then κA(d) ≤ c2 · γ2(A) ·MN .
Expected acceleration: (10−6)2 = 10−12
⇒ 1 sec ≈ 32 000years!
Yu. Nesterov Subgradient methods for huge-scale problems 6/22
![Page 47: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/47.jpg)
Sparse updating strategy
Main idea
After update x+ = x + d we have y+def= Ax+ = Ax︸︷︷︸
y
+Ad .
What happens if d is sparse?
Denote σ(d) = {j : d (j) 6= 0}. Then y+ = y +∑
j∈σ(d)
d (j) · Aej .
Its complexity, κA(d)def=
∑j∈σ(d)
p(Aej), can be VERY small!
κA(d) = M∑
j∈σ(d)
γ(Aej) = γ(d) · 1p(d)
∑j∈σ(d)
γ(Aej) ·MN
≤ γ(d) max1≤j≤m
γ(Aej) ·MN.
If γ(d) ≤ cγ(A), γ(Aj) ≤ cγ(A), then κA(d) ≤ c2 · γ2(A) ·MN .
Expected acceleration: (10−6)2 = 10−12 ⇒ 1 sec
≈ 32 000years!
Yu. Nesterov Subgradient methods for huge-scale problems 6/22
![Page 48: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/48.jpg)
Sparse updating strategy
Main idea
After update x+ = x + d we have y+def= Ax+ = Ax︸︷︷︸
y
+Ad .
What happens if d is sparse?
Denote σ(d) = {j : d (j) 6= 0}. Then y+ = y +∑
j∈σ(d)
d (j) · Aej .
Its complexity, κA(d)def=
∑j∈σ(d)
p(Aej), can be VERY small!
κA(d) = M∑
j∈σ(d)
γ(Aej) = γ(d) · 1p(d)
∑j∈σ(d)
γ(Aej) ·MN
≤ γ(d) max1≤j≤m
γ(Aej) ·MN.
If γ(d) ≤ cγ(A), γ(Aj) ≤ cγ(A), then κA(d) ≤ c2 · γ2(A) ·MN .
Expected acceleration: (10−6)2 = 10−12 ⇒ 1 sec ≈ 32 000years!
Yu. Nesterov Subgradient methods for huge-scale problems 6/22
![Page 49: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/49.jpg)
When it can work?
Simple methods: No full-vector operations! (Is it possible?)
Simple problems: Functions with sparse gradients.
Let us try:
1 Quadratic function f (x) = 12〈Ax , x〉 − 〈b, x〉. The gradient
f ′(x) = Ax − b, x ∈ RN ,
is not sparse even if A is sparse.
2 Piece-wise linear function g(x) = max1≤i≤m
[〈ai , x〉 − b(i)]. Its
subgradient f ′(x) = ai(x), i(x) : f (x) = 〈ai(x), x〉 − b(i(x)),can be sparse is ai is sparse!
But: We need a fast procedure for updating max-type operations.
Yu. Nesterov Subgradient methods for huge-scale problems 7/22
![Page 50: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/50.jpg)
When it can work?
Simple methods:
No full-vector operations! (Is it possible?)
Simple problems: Functions with sparse gradients.
Let us try:
1 Quadratic function f (x) = 12〈Ax , x〉 − 〈b, x〉. The gradient
f ′(x) = Ax − b, x ∈ RN ,
is not sparse even if A is sparse.
2 Piece-wise linear function g(x) = max1≤i≤m
[〈ai , x〉 − b(i)]. Its
subgradient f ′(x) = ai(x), i(x) : f (x) = 〈ai(x), x〉 − b(i(x)),can be sparse is ai is sparse!
But: We need a fast procedure for updating max-type operations.
Yu. Nesterov Subgradient methods for huge-scale problems 7/22
![Page 51: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/51.jpg)
When it can work?
Simple methods: No full-vector operations!
(Is it possible?)
Simple problems: Functions with sparse gradients.
Let us try:
1 Quadratic function f (x) = 12〈Ax , x〉 − 〈b, x〉. The gradient
f ′(x) = Ax − b, x ∈ RN ,
is not sparse even if A is sparse.
2 Piece-wise linear function g(x) = max1≤i≤m
[〈ai , x〉 − b(i)]. Its
subgradient f ′(x) = ai(x), i(x) : f (x) = 〈ai(x), x〉 − b(i(x)),can be sparse is ai is sparse!
But: We need a fast procedure for updating max-type operations.
Yu. Nesterov Subgradient methods for huge-scale problems 7/22
![Page 52: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/52.jpg)
When it can work?
Simple methods: No full-vector operations! (Is it possible?)
Simple problems: Functions with sparse gradients.
Let us try:
1 Quadratic function f (x) = 12〈Ax , x〉 − 〈b, x〉. The gradient
f ′(x) = Ax − b, x ∈ RN ,
is not sparse even if A is sparse.
2 Piece-wise linear function g(x) = max1≤i≤m
[〈ai , x〉 − b(i)]. Its
subgradient f ′(x) = ai(x), i(x) : f (x) = 〈ai(x), x〉 − b(i(x)),can be sparse is ai is sparse!
But: We need a fast procedure for updating max-type operations.
Yu. Nesterov Subgradient methods for huge-scale problems 7/22
![Page 53: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/53.jpg)
When it can work?
Simple methods: No full-vector operations! (Is it possible?)
Simple problems:
Functions with sparse gradients.
Let us try:
1 Quadratic function f (x) = 12〈Ax , x〉 − 〈b, x〉. The gradient
f ′(x) = Ax − b, x ∈ RN ,
is not sparse even if A is sparse.
2 Piece-wise linear function g(x) = max1≤i≤m
[〈ai , x〉 − b(i)]. Its
subgradient f ′(x) = ai(x), i(x) : f (x) = 〈ai(x), x〉 − b(i(x)),can be sparse is ai is sparse!
But: We need a fast procedure for updating max-type operations.
Yu. Nesterov Subgradient methods for huge-scale problems 7/22
![Page 54: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/54.jpg)
When it can work?
Simple methods: No full-vector operations! (Is it possible?)
Simple problems: Functions with sparse gradients.
Let us try:
1 Quadratic function f (x) = 12〈Ax , x〉 − 〈b, x〉. The gradient
f ′(x) = Ax − b, x ∈ RN ,
is not sparse even if A is sparse.
2 Piece-wise linear function g(x) = max1≤i≤m
[〈ai , x〉 − b(i)]. Its
subgradient f ′(x) = ai(x), i(x) : f (x) = 〈ai(x), x〉 − b(i(x)),can be sparse is ai is sparse!
But: We need a fast procedure for updating max-type operations.
Yu. Nesterov Subgradient methods for huge-scale problems 7/22
![Page 55: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/55.jpg)
When it can work?
Simple methods: No full-vector operations! (Is it possible?)
Simple problems: Functions with sparse gradients.
Let us try:
1 Quadratic function f (x) = 12〈Ax , x〉 − 〈b, x〉. The gradient
f ′(x) = Ax − b, x ∈ RN ,
is not sparse even if A is sparse.
2 Piece-wise linear function g(x) = max1≤i≤m
[〈ai , x〉 − b(i)]. Its
subgradient f ′(x) = ai(x), i(x) : f (x) = 〈ai(x), x〉 − b(i(x)),can be sparse is ai is sparse!
But: We need a fast procedure for updating max-type operations.
Yu. Nesterov Subgradient methods for huge-scale problems 7/22
![Page 56: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/56.jpg)
When it can work?
Simple methods: No full-vector operations! (Is it possible?)
Simple problems: Functions with sparse gradients.
Let us try:
1 Quadratic function f (x) = 12〈Ax , x〉 − 〈b, x〉.
The gradient
f ′(x) = Ax − b, x ∈ RN ,
is not sparse even if A is sparse.
2 Piece-wise linear function g(x) = max1≤i≤m
[〈ai , x〉 − b(i)]. Its
subgradient f ′(x) = ai(x), i(x) : f (x) = 〈ai(x), x〉 − b(i(x)),can be sparse is ai is sparse!
But: We need a fast procedure for updating max-type operations.
Yu. Nesterov Subgradient methods for huge-scale problems 7/22
![Page 57: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/57.jpg)
When it can work?
Simple methods: No full-vector operations! (Is it possible?)
Simple problems: Functions with sparse gradients.
Let us try:
1 Quadratic function f (x) = 12〈Ax , x〉 − 〈b, x〉. The gradient
f ′(x) = Ax − b, x ∈ RN ,
is not sparse even if A is sparse.
2 Piece-wise linear function g(x) = max1≤i≤m
[〈ai , x〉 − b(i)]. Its
subgradient f ′(x) = ai(x), i(x) : f (x) = 〈ai(x), x〉 − b(i(x)),can be sparse is ai is sparse!
But: We need a fast procedure for updating max-type operations.
Yu. Nesterov Subgradient methods for huge-scale problems 7/22
![Page 58: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/58.jpg)
When it can work?
Simple methods: No full-vector operations! (Is it possible?)
Simple problems: Functions with sparse gradients.
Let us try:
1 Quadratic function f (x) = 12〈Ax , x〉 − 〈b, x〉. The gradient
f ′(x) = Ax − b, x ∈ RN ,
is not sparse even if A is sparse.
2 Piece-wise linear function g(x) = max1≤i≤m
[〈ai , x〉 − b(i)]. Its
subgradient f ′(x) = ai(x), i(x) : f (x) = 〈ai(x), x〉 − b(i(x)),can be sparse is ai is sparse!
But: We need a fast procedure for updating max-type operations.
Yu. Nesterov Subgradient methods for huge-scale problems 7/22
![Page 59: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/59.jpg)
When it can work?
Simple methods: No full-vector operations! (Is it possible?)
Simple problems: Functions with sparse gradients.
Let us try:
1 Quadratic function f (x) = 12〈Ax , x〉 − 〈b, x〉. The gradient
f ′(x) = Ax − b, x ∈ RN ,
is not sparse even if A is sparse.
2 Piece-wise linear function g(x) = max1≤i≤m
[〈ai , x〉 − b(i)].
Its
subgradient f ′(x) = ai(x), i(x) : f (x) = 〈ai(x), x〉 − b(i(x)),can be sparse is ai is sparse!
But: We need a fast procedure for updating max-type operations.
Yu. Nesterov Subgradient methods for huge-scale problems 7/22
![Page 60: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/60.jpg)
When it can work?
Simple methods: No full-vector operations! (Is it possible?)
Simple problems: Functions with sparse gradients.
Let us try:
1 Quadratic function f (x) = 12〈Ax , x〉 − 〈b, x〉. The gradient
f ′(x) = Ax − b, x ∈ RN ,
is not sparse even if A is sparse.
2 Piece-wise linear function g(x) = max1≤i≤m
[〈ai , x〉 − b(i)]. Its
subgradient f ′(x) = ai(x), i(x) : f (x) = 〈ai(x), x〉 − b(i(x)),
can be sparse is ai is sparse!
But: We need a fast procedure for updating max-type operations.
Yu. Nesterov Subgradient methods for huge-scale problems 7/22
![Page 61: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/61.jpg)
When it can work?
Simple methods: No full-vector operations! (Is it possible?)
Simple problems: Functions with sparse gradients.
Let us try:
1 Quadratic function f (x) = 12〈Ax , x〉 − 〈b, x〉. The gradient
f ′(x) = Ax − b, x ∈ RN ,
is not sparse even if A is sparse.
2 Piece-wise linear function g(x) = max1≤i≤m
[〈ai , x〉 − b(i)]. Its
subgradient f ′(x) = ai(x), i(x) : f (x) = 〈ai(x), x〉 − b(i(x)),can be sparse is ai is sparse!
But: We need a fast procedure for updating max-type operations.
Yu. Nesterov Subgradient methods for huge-scale problems 7/22
![Page 62: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/62.jpg)
When it can work?
Simple methods: No full-vector operations! (Is it possible?)
Simple problems: Functions with sparse gradients.
Let us try:
1 Quadratic function f (x) = 12〈Ax , x〉 − 〈b, x〉. The gradient
f ′(x) = Ax − b, x ∈ RN ,
is not sparse even if A is sparse.
2 Piece-wise linear function g(x) = max1≤i≤m
[〈ai , x〉 − b(i)]. Its
subgradient f ′(x) = ai(x), i(x) : f (x) = 〈ai(x), x〉 − b(i(x)),can be sparse is ai is sparse!
But:
We need a fast procedure for updating max-type operations.
Yu. Nesterov Subgradient methods for huge-scale problems 7/22
![Page 63: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/63.jpg)
When it can work?
Simple methods: No full-vector operations! (Is it possible?)
Simple problems: Functions with sparse gradients.
Let us try:
1 Quadratic function f (x) = 12〈Ax , x〉 − 〈b, x〉. The gradient
f ′(x) = Ax − b, x ∈ RN ,
is not sparse even if A is sparse.
2 Piece-wise linear function g(x) = max1≤i≤m
[〈ai , x〉 − b(i)]. Its
subgradient f ′(x) = ai(x), i(x) : f (x) = 〈ai(x), x〉 − b(i(x)),can be sparse is ai is sparse!
But: We need a fast procedure for updating max-type operations.
Yu. Nesterov Subgradient methods for huge-scale problems 7/22
![Page 64: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/64.jpg)
Fast updates in short computational trees
Def: Function f (x), x ∈ Rn, is short-tree representable, if it canbe computed by a short binary tree with the height ≈ ln n.
Let n = 2k and the tree has k + 1 levels: v0,i = x (i), i = 1, . . . , n.Size of the next level halves the size of the previous one:
vi+1,j = ψi+1,j(vi ,2j−1, vi ,2j), j = 1, . . . , 2k−i−1, i = 0, . . . , k − 1,
where ψi ,j are some bivariate functions.
v2,1v1,1 v1,2
v0,1 v0,2 v0,3 v0,4
v2,n/4v1,n/2−1 v1,n/2
v0,n−3v0,n−2v0,n−1 v0,n
. . . . . . . . .
. . .
vk−1,1 vk−1,2
vk,1
Yu. Nesterov Subgradient methods for huge-scale problems 8/22
![Page 65: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/65.jpg)
Fast updates in short computational trees
Def: Function f (x), x ∈ Rn, is short-tree representable, if it canbe computed by a short binary tree with the height ≈ ln n.
Let n = 2k and the tree has k + 1 levels: v0,i = x (i), i = 1, . . . , n.Size of the next level halves the size of the previous one:
vi+1,j = ψi+1,j(vi ,2j−1, vi ,2j), j = 1, . . . , 2k−i−1, i = 0, . . . , k − 1,
where ψi ,j are some bivariate functions.
v2,1v1,1 v1,2
v0,1 v0,2 v0,3 v0,4
v2,n/4v1,n/2−1 v1,n/2
v0,n−3v0,n−2v0,n−1 v0,n
. . . . . . . . .
. . .
vk−1,1 vk−1,2
vk,1
Yu. Nesterov Subgradient methods for huge-scale problems 8/22
![Page 66: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/66.jpg)
Fast updates in short computational trees
Def: Function f (x), x ∈ Rn, is short-tree representable, if it canbe computed by a short binary tree with the height ≈ ln n.
Let n = 2k and the tree has k + 1 levels: v0,i = x (i), i = 1, . . . , n.
Size of the next level halves the size of the previous one:
vi+1,j = ψi+1,j(vi ,2j−1, vi ,2j), j = 1, . . . , 2k−i−1, i = 0, . . . , k − 1,
where ψi ,j are some bivariate functions.
v2,1v1,1 v1,2
v0,1 v0,2 v0,3 v0,4
v2,n/4v1,n/2−1 v1,n/2
v0,n−3v0,n−2v0,n−1 v0,n
. . . . . . . . .
. . .
vk−1,1 vk−1,2
vk,1
Yu. Nesterov Subgradient methods for huge-scale problems 8/22
![Page 67: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/67.jpg)
Fast updates in short computational trees
Def: Function f (x), x ∈ Rn, is short-tree representable, if it canbe computed by a short binary tree with the height ≈ ln n.
Let n = 2k and the tree has k + 1 levels: v0,i = x (i), i = 1, . . . , n.Size of the next level halves the size of the previous one:
vi+1,j = ψi+1,j(vi ,2j−1, vi ,2j), j = 1, . . . , 2k−i−1, i = 0, . . . , k − 1,
where ψi ,j are some bivariate functions.
v2,1v1,1 v1,2
v0,1 v0,2 v0,3 v0,4
v2,n/4v1,n/2−1 v1,n/2
v0,n−3v0,n−2v0,n−1 v0,n
. . . . . . . . .
. . .
vk−1,1 vk−1,2
vk,1
Yu. Nesterov Subgradient methods for huge-scale problems 8/22
![Page 68: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/68.jpg)
Fast updates in short computational trees
Def: Function f (x), x ∈ Rn, is short-tree representable, if it canbe computed by a short binary tree with the height ≈ ln n.
Let n = 2k and the tree has k + 1 levels: v0,i = x (i), i = 1, . . . , n.Size of the next level halves the size of the previous one:
vi+1,j = ψi+1,j(vi ,2j−1, vi ,2j), j = 1, . . . , 2k−i−1, i = 0, . . . , k − 1,
where ψi ,j are some bivariate functions.
v2,1v1,1 v1,2
v0,1 v0,2 v0,3 v0,4
v2,n/4v1,n/2−1 v1,n/2
v0,n−3v0,n−2v0,n−1 v0,n
. . . . . . . . .
. . .
vk−1,1 vk−1,2
vk,1
Yu. Nesterov Subgradient methods for huge-scale problems 8/22
![Page 69: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/69.jpg)
Main advantages
Important examples (symmetric functions)
f (x) = ‖x‖p, p ≥ 1, ψi ,j(t1, t2) ≡ [ |t1|p + |t2|p ]1/p ,
f (x) = ln
(n∑
i=1ex(i)
), ψi ,j(t1, t2) ≡ ln (et1 + et2) ,
f (x) = max1≤i≤n
x (i), ψi ,j(t1, t2) ≡ max {t1, t2} .
The binary tree requires only n − 1 auxiliary cells.
Its value needs n − 1 applications of ψi ,j(·, ·) ( ≡ operations).
If x+ differs from x in one entry only, then for re-computingf (x+) we need only k ≡ log2 n operations.
Thus, we can have pure subgradient minimization schemes withSublinear Iteration Cost
.
Yu. Nesterov Subgradient methods for huge-scale problems 9/22
![Page 70: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/70.jpg)
Main advantages
Important examples (symmetric functions)
f (x) = ‖x‖p, p ≥ 1, ψi ,j(t1, t2) ≡ [ |t1|p + |t2|p ]1/p ,
f (x) = ln
(n∑
i=1ex(i)
), ψi ,j(t1, t2) ≡ ln (et1 + et2) ,
f (x) = max1≤i≤n
x (i), ψi ,j(t1, t2) ≡ max {t1, t2} .
The binary tree requires only n − 1 auxiliary cells.
Its value needs n − 1 applications of ψi ,j(·, ·) ( ≡ operations).
If x+ differs from x in one entry only, then for re-computingf (x+) we need only k ≡ log2 n operations.
Thus, we can have pure subgradient minimization schemes withSublinear Iteration Cost
.
Yu. Nesterov Subgradient methods for huge-scale problems 9/22
![Page 71: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/71.jpg)
Main advantages
Important examples (symmetric functions)
f (x) = ‖x‖p, p ≥ 1, ψi ,j(t1, t2) ≡ [ |t1|p + |t2|p ]1/p ,
f (x) = ln
(n∑
i=1ex(i)
), ψi ,j(t1, t2) ≡ ln (et1 + et2) ,
f (x) = max1≤i≤n
x (i), ψi ,j(t1, t2) ≡ max {t1, t2} .
The binary tree requires only n − 1 auxiliary cells.
Its value needs n − 1 applications of ψi ,j(·, ·) ( ≡ operations).
If x+ differs from x in one entry only, then for re-computingf (x+) we need only k ≡ log2 n operations.
Thus, we can have pure subgradient minimization schemes withSublinear Iteration Cost
.
Yu. Nesterov Subgradient methods for huge-scale problems 9/22
![Page 72: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/72.jpg)
Main advantages
Important examples (symmetric functions)
f (x) = ‖x‖p, p ≥ 1, ψi ,j(t1, t2) ≡ [ |t1|p + |t2|p ]1/p ,
f (x) = ln
(n∑
i=1ex(i)
), ψi ,j(t1, t2) ≡ ln (et1 + et2) ,
f (x) = max1≤i≤n
x (i), ψi ,j(t1, t2) ≡ max {t1, t2} .
The binary tree requires only n − 1 auxiliary cells.
Its value needs n − 1 applications of ψi ,j(·, ·) ( ≡ operations).
If x+ differs from x in one entry only, then for re-computingf (x+) we need only k ≡ log2 n operations.
Thus, we can have pure subgradient minimization schemes withSublinear Iteration Cost
.
Yu. Nesterov Subgradient methods for huge-scale problems 9/22
![Page 73: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/73.jpg)
Main advantages
Important examples (symmetric functions)
f (x) = ‖x‖p, p ≥ 1, ψi ,j(t1, t2) ≡ [ |t1|p + |t2|p ]1/p ,
f (x) = ln
(n∑
i=1ex(i)
), ψi ,j(t1, t2) ≡ ln (et1 + et2) ,
f (x) = max1≤i≤n
x (i), ψi ,j(t1, t2) ≡ max {t1, t2} .
The binary tree requires only n − 1 auxiliary cells.
Its value needs n − 1 applications of ψi ,j(·, ·) ( ≡ operations).
If x+ differs from x in one entry only, then for re-computingf (x+) we need only k ≡ log2 n operations.
Thus, we can have pure subgradient minimization schemes withSublinear Iteration Cost
.
Yu. Nesterov Subgradient methods for huge-scale problems 9/22
![Page 74: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/74.jpg)
Main advantages
Important examples (symmetric functions)
f (x) = ‖x‖p, p ≥ 1, ψi ,j(t1, t2) ≡ [ |t1|p + |t2|p ]1/p ,
f (x) = ln
(n∑
i=1ex(i)
), ψi ,j(t1, t2) ≡ ln (et1 + et2) ,
f (x) = max1≤i≤n
x (i), ψi ,j(t1, t2) ≡ max {t1, t2} .
The binary tree requires only n − 1 auxiliary cells.
Its value needs n − 1 applications of ψi ,j(·, ·) ( ≡ operations).
If x+ differs from x in one entry only, then for re-computingf (x+) we need only k ≡ log2 n operations.
Thus, we can have pure subgradient minimization schemes withSublinear Iteration Cost
.
Yu. Nesterov Subgradient methods for huge-scale problems 9/22
![Page 75: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/75.jpg)
Main advantages
Important examples (symmetric functions)
f (x) = ‖x‖p, p ≥ 1, ψi ,j(t1, t2) ≡ [ |t1|p + |t2|p ]1/p ,
f (x) = ln
(n∑
i=1ex(i)
), ψi ,j(t1, t2) ≡ ln (et1 + et2) ,
f (x) = max1≤i≤n
x (i), ψi ,j(t1, t2) ≡ max {t1, t2} .
The binary tree requires only n − 1 auxiliary cells.
Its value needs n − 1 applications of ψi ,j(·, ·) ( ≡ operations).
If x+ differs from x in one entry only, then for re-computingf (x+) we need only k ≡ log2 n operations.
Thus, we can have pure subgradient minimization schemes withSublinear Iteration Cost
.
Yu. Nesterov Subgradient methods for huge-scale problems 9/22
![Page 76: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/76.jpg)
Main advantages
Important examples (symmetric functions)
f (x) = ‖x‖p, p ≥ 1, ψi ,j(t1, t2) ≡ [ |t1|p + |t2|p ]1/p ,
f (x) = ln
(n∑
i=1ex(i)
), ψi ,j(t1, t2) ≡ ln (et1 + et2) ,
f (x) = max1≤i≤n
x (i), ψi ,j(t1, t2) ≡ max {t1, t2} .
The binary tree requires only n − 1 auxiliary cells.
Its value needs n − 1 applications of ψi ,j(·, ·) ( ≡ operations).
If x+ differs from x in one entry only, then for re-computingf (x+) we need only k ≡ log2 n operations.
Thus, we can have pure subgradient minimization schemes withSublinear Iteration Cost
.
Yu. Nesterov Subgradient methods for huge-scale problems 9/22
![Page 77: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/77.jpg)
Main advantages
Important examples (symmetric functions)
f (x) = ‖x‖p, p ≥ 1, ψi ,j(t1, t2) ≡ [ |t1|p + |t2|p ]1/p ,
f (x) = ln
(n∑
i=1ex(i)
), ψi ,j(t1, t2) ≡ ln (et1 + et2) ,
f (x) = max1≤i≤n
x (i), ψi ,j(t1, t2) ≡ max {t1, t2} .
The binary tree requires only n − 1 auxiliary cells.
Its value needs n − 1 applications of ψi ,j(·, ·) ( ≡ operations).
If x+ differs from x in one entry only, then for re-computingf (x+) we need only k ≡ log2 n operations.
Thus, we can have pure subgradient minimization schemes with
Sublinear Iteration Cost.
Yu. Nesterov Subgradient methods for huge-scale problems 9/22
![Page 78: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/78.jpg)
Main advantages
Important examples (symmetric functions)
f (x) = ‖x‖p, p ≥ 1, ψi ,j(t1, t2) ≡ [ |t1|p + |t2|p ]1/p ,
f (x) = ln
(n∑
i=1ex(i)
), ψi ,j(t1, t2) ≡ ln (et1 + et2) ,
f (x) = max1≤i≤n
x (i), ψi ,j(t1, t2) ≡ max {t1, t2} .
The binary tree requires only n − 1 auxiliary cells.
Its value needs n − 1 applications of ψi ,j(·, ·) ( ≡ operations).
If x+ differs from x in one entry only, then for re-computingf (x+) we need only k ≡ log2 n operations.
Thus, we can have pure subgradient minimization schemes withSublinear Iteration Cost
.Yu. Nesterov Subgradient methods for huge-scale problems 9/22
![Page 79: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/79.jpg)
Simple subgradient methods
I. Problem: f ∗def= min
x∈Qf (x), where
Q is a closed and convex and ‖f ′(x)‖ ≤ L(f ), x ∈ Q,
the optimal value f ∗ is known.
Consider the following optimization scheme (B.Polyak, 1967):
x0 ∈ Q, xk+1 = πQ
(xk −
f (xk)− f ∗
‖f ′(xk)‖2f ′(xk)
), k ≥ 0.
Denote f ∗k = min0≤i≤k
f (xi ). Then for any k ≥ 0 we have:
f ∗k − f ∗ ≤ L(f )‖x0−πX∗ (x0)‖(k+1)1/2 ,
‖xk − x∗‖ ≤ ‖x0 − x∗‖, ∀x∗ ∈ X∗.
Yu. Nesterov Subgradient methods for huge-scale problems 10/22
![Page 80: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/80.jpg)
Simple subgradient methods
I. Problem: f ∗def= min
x∈Qf (x), where
Q is a closed and convex and ‖f ′(x)‖ ≤ L(f ), x ∈ Q,
the optimal value f ∗ is known.
Consider the following optimization scheme (B.Polyak, 1967):
x0 ∈ Q, xk+1 = πQ
(xk −
f (xk)− f ∗
‖f ′(xk)‖2f ′(xk)
), k ≥ 0.
Denote f ∗k = min0≤i≤k
f (xi ). Then for any k ≥ 0 we have:
f ∗k − f ∗ ≤ L(f )‖x0−πX∗ (x0)‖(k+1)1/2 ,
‖xk − x∗‖ ≤ ‖x0 − x∗‖, ∀x∗ ∈ X∗.
Yu. Nesterov Subgradient methods for huge-scale problems 10/22
![Page 81: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/81.jpg)
Simple subgradient methods
I. Problem: f ∗def= min
x∈Qf (x), where
Q is a closed and convex and ‖f ′(x)‖ ≤ L(f ), x ∈ Q,
the optimal value f ∗ is known.
Consider the following optimization scheme (B.Polyak, 1967):
x0 ∈ Q, xk+1 = πQ
(xk −
f (xk)− f ∗
‖f ′(xk)‖2f ′(xk)
), k ≥ 0.
Denote f ∗k = min0≤i≤k
f (xi ). Then for any k ≥ 0 we have:
f ∗k − f ∗ ≤ L(f )‖x0−πX∗ (x0)‖(k+1)1/2 ,
‖xk − x∗‖ ≤ ‖x0 − x∗‖, ∀x∗ ∈ X∗.
Yu. Nesterov Subgradient methods for huge-scale problems 10/22
![Page 82: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/82.jpg)
Simple subgradient methods
I. Problem: f ∗def= min
x∈Qf (x), where
Q is a closed and convex and ‖f ′(x)‖ ≤ L(f ), x ∈ Q,
the optimal value f ∗ is known.
Consider the following optimization scheme (B.Polyak, 1967):
x0 ∈ Q, xk+1 = πQ
(xk −
f (xk)− f ∗
‖f ′(xk)‖2f ′(xk)
), k ≥ 0.
Denote f ∗k = min0≤i≤k
f (xi ). Then for any k ≥ 0 we have:
f ∗k − f ∗ ≤ L(f )‖x0−πX∗ (x0)‖(k+1)1/2 ,
‖xk − x∗‖ ≤ ‖x0 − x∗‖, ∀x∗ ∈ X∗.
Yu. Nesterov Subgradient methods for huge-scale problems 10/22
![Page 83: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/83.jpg)
Simple subgradient methods
I. Problem: f ∗def= min
x∈Qf (x), where
Q is a closed and convex and ‖f ′(x)‖ ≤ L(f ), x ∈ Q,
the optimal value f ∗ is known.
Consider the following optimization scheme (B.Polyak, 1967):
x0 ∈ Q, xk+1 = πQ
(xk −
f (xk)− f ∗
‖f ′(xk)‖2f ′(xk)
), k ≥ 0.
Denote f ∗k = min0≤i≤k
f (xi ). Then for any k ≥ 0 we have:
f ∗k − f ∗ ≤ L(f )‖x0−πX∗ (x0)‖(k+1)1/2 ,
‖xk − x∗‖ ≤ ‖x0 − x∗‖, ∀x∗ ∈ X∗.
Yu. Nesterov Subgradient methods for huge-scale problems 10/22
![Page 84: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/84.jpg)
Simple subgradient methods
I. Problem: f ∗def= min
x∈Qf (x), where
Q is a closed and convex and ‖f ′(x)‖ ≤ L(f ), x ∈ Q,
the optimal value f ∗ is known.
Consider the following optimization scheme (B.Polyak, 1967):
x0 ∈ Q, xk+1 = πQ
(xk −
f (xk)− f ∗
‖f ′(xk)‖2f ′(xk)
), k ≥ 0.
Denote f ∗k = min0≤i≤k
f (xi ). Then for any k ≥ 0 we have:
f ∗k − f ∗ ≤ L(f )‖x0−πX∗ (x0)‖(k+1)1/2 ,
‖xk − x∗‖ ≤ ‖x0 − x∗‖, ∀x∗ ∈ X∗.
Yu. Nesterov Subgradient methods for huge-scale problems 10/22
![Page 85: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/85.jpg)
Simple subgradient methods
I. Problem: f ∗def= min
x∈Qf (x), where
Q is a closed and convex and ‖f ′(x)‖ ≤ L(f ), x ∈ Q,
the optimal value f ∗ is known.
Consider the following optimization scheme (B.Polyak, 1967):
x0 ∈ Q, xk+1 = πQ
(xk −
f (xk)− f ∗
‖f ′(xk)‖2f ′(xk)
), k ≥ 0.
Denote f ∗k = min0≤i≤k
f (xi ).
Then for any k ≥ 0 we have:
f ∗k − f ∗ ≤ L(f )‖x0−πX∗ (x0)‖(k+1)1/2 ,
‖xk − x∗‖ ≤ ‖x0 − x∗‖, ∀x∗ ∈ X∗.
Yu. Nesterov Subgradient methods for huge-scale problems 10/22
![Page 86: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/86.jpg)
Simple subgradient methods
I. Problem: f ∗def= min
x∈Qf (x), where
Q is a closed and convex and ‖f ′(x)‖ ≤ L(f ), x ∈ Q,
the optimal value f ∗ is known.
Consider the following optimization scheme (B.Polyak, 1967):
x0 ∈ Q, xk+1 = πQ
(xk −
f (xk)− f ∗
‖f ′(xk)‖2f ′(xk)
), k ≥ 0.
Denote f ∗k = min0≤i≤k
f (xi ). Then for any k ≥ 0 we have:
f ∗k − f ∗ ≤ L(f )‖x0−πX∗ (x0)‖(k+1)1/2 ,
‖xk − x∗‖ ≤ ‖x0 − x∗‖, ∀x∗ ∈ X∗.
Yu. Nesterov Subgradient methods for huge-scale problems 10/22
![Page 87: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/87.jpg)
Simple subgradient methods
I. Problem: f ∗def= min
x∈Qf (x), where
Q is a closed and convex and ‖f ′(x)‖ ≤ L(f ), x ∈ Q,
the optimal value f ∗ is known.
Consider the following optimization scheme (B.Polyak, 1967):
x0 ∈ Q, xk+1 = πQ
(xk −
f (xk)− f ∗
‖f ′(xk)‖2f ′(xk)
), k ≥ 0.
Denote f ∗k = min0≤i≤k
f (xi ). Then for any k ≥ 0 we have:
f ∗k − f ∗ ≤ L(f )‖x0−πX∗ (x0)‖(k+1)1/2 ,
‖xk − x∗‖ ≤ ‖x0 − x∗‖, ∀x∗ ∈ X∗.
Yu. Nesterov Subgradient methods for huge-scale problems 10/22
![Page 88: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/88.jpg)
Proof:
Let us fix x∗ ∈ X∗. Denote rk(x∗) = ‖xk − x∗‖. Then
r2k+1(x∗) ≤
∥∥∥xk − f (xk )−f ∗
‖f ′(xk )‖2 f ′(xk)− x∗∥∥∥2
= r2k (x∗)− 2 f (xk )−f ∗
‖f ′(xk )‖2 〈f′(xk), xk − x∗〉+ (f (xk )−f ∗)2
‖f ′(xk )‖2
≤ r2k (x∗)− (f (xk )−f ∗)2
‖f ′(xk )‖2 ≤ r2k (x∗)− (f ∗k −f ∗)2
L2(f ).
From this reasoning, ‖xk+1 − x∗‖2 ≤ ‖xk − x∗‖2, ∀x∗ ∈ X ∗.Corollary: Assume X∗ has recession direction d∗. Then
‖xk − πX∗(x0)‖ ≤ ‖x0 − πX∗(x0)‖, 〈d∗, xk〉 ≥ 〈d∗, x0〉.
(Proof: consider x∗ = πX∗(x0) + αd∗, α ≥ 0.)
Yu. Nesterov Subgradient methods for huge-scale problems 11/22
![Page 89: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/89.jpg)
Proof:
Let us fix x∗ ∈ X∗. Denote rk(x∗) = ‖xk − x∗‖.
Then
r2k+1(x∗) ≤
∥∥∥xk − f (xk )−f ∗
‖f ′(xk )‖2 f ′(xk)− x∗∥∥∥2
= r2k (x∗)− 2 f (xk )−f ∗
‖f ′(xk )‖2 〈f′(xk), xk − x∗〉+ (f (xk )−f ∗)2
‖f ′(xk )‖2
≤ r2k (x∗)− (f (xk )−f ∗)2
‖f ′(xk )‖2 ≤ r2k (x∗)− (f ∗k −f ∗)2
L2(f ).
From this reasoning, ‖xk+1 − x∗‖2 ≤ ‖xk − x∗‖2, ∀x∗ ∈ X ∗.Corollary: Assume X∗ has recession direction d∗. Then
‖xk − πX∗(x0)‖ ≤ ‖x0 − πX∗(x0)‖, 〈d∗, xk〉 ≥ 〈d∗, x0〉.
(Proof: consider x∗ = πX∗(x0) + αd∗, α ≥ 0.)
Yu. Nesterov Subgradient methods for huge-scale problems 11/22
![Page 90: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/90.jpg)
Proof:
Let us fix x∗ ∈ X∗. Denote rk(x∗) = ‖xk − x∗‖. Then
r2k+1(x∗) ≤
∥∥∥xk − f (xk )−f ∗
‖f ′(xk )‖2 f ′(xk)− x∗∥∥∥2
= r2k (x∗)− 2 f (xk )−f ∗
‖f ′(xk )‖2 〈f′(xk), xk − x∗〉+ (f (xk )−f ∗)2
‖f ′(xk )‖2
≤ r2k (x∗)− (f (xk )−f ∗)2
‖f ′(xk )‖2 ≤ r2k (x∗)− (f ∗k −f ∗)2
L2(f ).
From this reasoning, ‖xk+1 − x∗‖2 ≤ ‖xk − x∗‖2, ∀x∗ ∈ X ∗.Corollary: Assume X∗ has recession direction d∗. Then
‖xk − πX∗(x0)‖ ≤ ‖x0 − πX∗(x0)‖, 〈d∗, xk〉 ≥ 〈d∗, x0〉.
(Proof: consider x∗ = πX∗(x0) + αd∗, α ≥ 0.)
Yu. Nesterov Subgradient methods for huge-scale problems 11/22
![Page 91: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/91.jpg)
Proof:
Let us fix x∗ ∈ X∗. Denote rk(x∗) = ‖xk − x∗‖. Then
r2k+1(x∗) ≤
∥∥∥xk − f (xk )−f ∗
‖f ′(xk )‖2 f ′(xk)− x∗∥∥∥2
= r2k (x∗)− 2 f (xk )−f ∗
‖f ′(xk )‖2 〈f′(xk), xk − x∗〉+ (f (xk )−f ∗)2
‖f ′(xk )‖2
≤ r2k (x∗)− (f (xk )−f ∗)2
‖f ′(xk )‖2 ≤ r2k (x∗)− (f ∗k −f ∗)2
L2(f ).
From this reasoning, ‖xk+1 − x∗‖2 ≤ ‖xk − x∗‖2, ∀x∗ ∈ X ∗.Corollary: Assume X∗ has recession direction d∗. Then
‖xk − πX∗(x0)‖ ≤ ‖x0 − πX∗(x0)‖, 〈d∗, xk〉 ≥ 〈d∗, x0〉.
(Proof: consider x∗ = πX∗(x0) + αd∗, α ≥ 0.)
Yu. Nesterov Subgradient methods for huge-scale problems 11/22
![Page 92: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/92.jpg)
Proof:
Let us fix x∗ ∈ X∗. Denote rk(x∗) = ‖xk − x∗‖. Then
r2k+1(x∗) ≤
∥∥∥xk − f (xk )−f ∗
‖f ′(xk )‖2 f ′(xk)− x∗∥∥∥2
= r2k (x∗)− 2 f (xk )−f ∗
‖f ′(xk )‖2 〈f′(xk), xk − x∗〉+ (f (xk )−f ∗)2
‖f ′(xk )‖2
≤ r2k (x∗)− (f (xk )−f ∗)2
‖f ′(xk )‖2
≤ r2k (x∗)− (f ∗k −f ∗)2
L2(f ).
From this reasoning, ‖xk+1 − x∗‖2 ≤ ‖xk − x∗‖2, ∀x∗ ∈ X ∗.Corollary: Assume X∗ has recession direction d∗. Then
‖xk − πX∗(x0)‖ ≤ ‖x0 − πX∗(x0)‖, 〈d∗, xk〉 ≥ 〈d∗, x0〉.
(Proof: consider x∗ = πX∗(x0) + αd∗, α ≥ 0.)
Yu. Nesterov Subgradient methods for huge-scale problems 11/22
![Page 93: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/93.jpg)
Proof:
Let us fix x∗ ∈ X∗. Denote rk(x∗) = ‖xk − x∗‖. Then
r2k+1(x∗) ≤
∥∥∥xk − f (xk )−f ∗
‖f ′(xk )‖2 f ′(xk)− x∗∥∥∥2
= r2k (x∗)− 2 f (xk )−f ∗
‖f ′(xk )‖2 〈f′(xk), xk − x∗〉+ (f (xk )−f ∗)2
‖f ′(xk )‖2
≤ r2k (x∗)− (f (xk )−f ∗)2
‖f ′(xk )‖2 ≤ r2k (x∗)− (f ∗k −f ∗)2
L2(f ).
From this reasoning, ‖xk+1 − x∗‖2 ≤ ‖xk − x∗‖2, ∀x∗ ∈ X ∗.Corollary: Assume X∗ has recession direction d∗. Then
‖xk − πX∗(x0)‖ ≤ ‖x0 − πX∗(x0)‖, 〈d∗, xk〉 ≥ 〈d∗, x0〉.
(Proof: consider x∗ = πX∗(x0) + αd∗, α ≥ 0.)
Yu. Nesterov Subgradient methods for huge-scale problems 11/22
![Page 94: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/94.jpg)
Proof:
Let us fix x∗ ∈ X∗. Denote rk(x∗) = ‖xk − x∗‖. Then
r2k+1(x∗) ≤
∥∥∥xk − f (xk )−f ∗
‖f ′(xk )‖2 f ′(xk)− x∗∥∥∥2
= r2k (x∗)− 2 f (xk )−f ∗
‖f ′(xk )‖2 〈f′(xk), xk − x∗〉+ (f (xk )−f ∗)2
‖f ′(xk )‖2
≤ r2k (x∗)− (f (xk )−f ∗)2
‖f ′(xk )‖2 ≤ r2k (x∗)− (f ∗k −f ∗)2
L2(f ).
From this reasoning, ‖xk+1 − x∗‖2 ≤ ‖xk − x∗‖2, ∀x∗ ∈ X ∗.
Corollary: Assume X∗ has recession direction d∗. Then
‖xk − πX∗(x0)‖ ≤ ‖x0 − πX∗(x0)‖, 〈d∗, xk〉 ≥ 〈d∗, x0〉.
(Proof: consider x∗ = πX∗(x0) + αd∗, α ≥ 0.)
Yu. Nesterov Subgradient methods for huge-scale problems 11/22
![Page 95: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/95.jpg)
Proof:
Let us fix x∗ ∈ X∗. Denote rk(x∗) = ‖xk − x∗‖. Then
r2k+1(x∗) ≤
∥∥∥xk − f (xk )−f ∗
‖f ′(xk )‖2 f ′(xk)− x∗∥∥∥2
= r2k (x∗)− 2 f (xk )−f ∗
‖f ′(xk )‖2 〈f′(xk), xk − x∗〉+ (f (xk )−f ∗)2
‖f ′(xk )‖2
≤ r2k (x∗)− (f (xk )−f ∗)2
‖f ′(xk )‖2 ≤ r2k (x∗)− (f ∗k −f ∗)2
L2(f ).
From this reasoning, ‖xk+1 − x∗‖2 ≤ ‖xk − x∗‖2, ∀x∗ ∈ X ∗.Corollary: Assume X∗ has recession direction d∗.
Then
‖xk − πX∗(x0)‖ ≤ ‖x0 − πX∗(x0)‖, 〈d∗, xk〉 ≥ 〈d∗, x0〉.
(Proof: consider x∗ = πX∗(x0) + αd∗, α ≥ 0.)
Yu. Nesterov Subgradient methods for huge-scale problems 11/22
![Page 96: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/96.jpg)
Proof:
Let us fix x∗ ∈ X∗. Denote rk(x∗) = ‖xk − x∗‖. Then
r2k+1(x∗) ≤
∥∥∥xk − f (xk )−f ∗
‖f ′(xk )‖2 f ′(xk)− x∗∥∥∥2
= r2k (x∗)− 2 f (xk )−f ∗
‖f ′(xk )‖2 〈f′(xk), xk − x∗〉+ (f (xk )−f ∗)2
‖f ′(xk )‖2
≤ r2k (x∗)− (f (xk )−f ∗)2
‖f ′(xk )‖2 ≤ r2k (x∗)− (f ∗k −f ∗)2
L2(f ).
From this reasoning, ‖xk+1 − x∗‖2 ≤ ‖xk − x∗‖2, ∀x∗ ∈ X ∗.Corollary: Assume X∗ has recession direction d∗. Then
‖xk − πX∗(x0)‖ ≤ ‖x0 − πX∗(x0)‖,
〈d∗, xk〉 ≥ 〈d∗, x0〉.
(Proof: consider x∗ = πX∗(x0) + αd∗, α ≥ 0.)
Yu. Nesterov Subgradient methods for huge-scale problems 11/22
![Page 97: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/97.jpg)
Proof:
Let us fix x∗ ∈ X∗. Denote rk(x∗) = ‖xk − x∗‖. Then
r2k+1(x∗) ≤
∥∥∥xk − f (xk )−f ∗
‖f ′(xk )‖2 f ′(xk)− x∗∥∥∥2
= r2k (x∗)− 2 f (xk )−f ∗
‖f ′(xk )‖2 〈f′(xk), xk − x∗〉+ (f (xk )−f ∗)2
‖f ′(xk )‖2
≤ r2k (x∗)− (f (xk )−f ∗)2
‖f ′(xk )‖2 ≤ r2k (x∗)− (f ∗k −f ∗)2
L2(f ).
From this reasoning, ‖xk+1 − x∗‖2 ≤ ‖xk − x∗‖2, ∀x∗ ∈ X ∗.Corollary: Assume X∗ has recession direction d∗. Then
‖xk − πX∗(x0)‖ ≤ ‖x0 − πX∗(x0)‖, 〈d∗, xk〉 ≥ 〈d∗, x0〉.
(Proof: consider x∗ = πX∗(x0) + αd∗, α ≥ 0.)
Yu. Nesterov Subgradient methods for huge-scale problems 11/22
![Page 98: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/98.jpg)
Proof:
Let us fix x∗ ∈ X∗. Denote rk(x∗) = ‖xk − x∗‖. Then
r2k+1(x∗) ≤
∥∥∥xk − f (xk )−f ∗
‖f ′(xk )‖2 f ′(xk)− x∗∥∥∥2
= r2k (x∗)− 2 f (xk )−f ∗
‖f ′(xk )‖2 〈f′(xk), xk − x∗〉+ (f (xk )−f ∗)2
‖f ′(xk )‖2
≤ r2k (x∗)− (f (xk )−f ∗)2
‖f ′(xk )‖2 ≤ r2k (x∗)− (f ∗k −f ∗)2
L2(f ).
From this reasoning, ‖xk+1 − x∗‖2 ≤ ‖xk − x∗‖2, ∀x∗ ∈ X ∗.Corollary: Assume X∗ has recession direction d∗. Then
‖xk − πX∗(x0)‖ ≤ ‖x0 − πX∗(x0)‖, 〈d∗, xk〉 ≥ 〈d∗, x0〉.
(Proof: consider x∗ = πX∗(x0) + αd∗, α ≥ 0.)
Yu. Nesterov Subgradient methods for huge-scale problems 11/22
![Page 99: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/99.jpg)
Constrained minimization (N.Shor (1964) & B.Polyak)
II. Problem: minx∈Q{f (x) : g(x) ≤ 0}, where
Q is closed and convex,
f , g have uniformly bounded subgradients.
Consider the following method. It has step-size parameter h > 0.
If g(xk) > h ‖g ′(xk)‖, then (A): xk+1 = πQ
(xk − g(xk )
‖g ′(xk )‖2 g ′(xk)),
else (B): xk+1 = πQ
(xk − h
‖f ′(xk )‖ f ′(xk)).
Let Fk ⊆ {0, . . . , k} be the set (B)-iterations, and f ∗k = mini∈Fk
f (xi ).
Theorem: If k > ‖x0 − x∗‖2/h2, then Fk 6= ∅ and
f ∗k − f (x) ≤ hL(f ), maxi∈Fk
g(xi ) ≤ hL(g).
Yu. Nesterov Subgradient methods for huge-scale problems 12/22
![Page 100: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/100.jpg)
Constrained minimization (N.Shor (1964) & B.Polyak)
II. Problem: minx∈Q{f (x) : g(x) ≤ 0},
where
Q is closed and convex,
f , g have uniformly bounded subgradients.
Consider the following method. It has step-size parameter h > 0.
If g(xk) > h ‖g ′(xk)‖, then (A): xk+1 = πQ
(xk − g(xk )
‖g ′(xk )‖2 g ′(xk)),
else (B): xk+1 = πQ
(xk − h
‖f ′(xk )‖ f ′(xk)).
Let Fk ⊆ {0, . . . , k} be the set (B)-iterations, and f ∗k = mini∈Fk
f (xi ).
Theorem: If k > ‖x0 − x∗‖2/h2, then Fk 6= ∅ and
f ∗k − f (x) ≤ hL(f ), maxi∈Fk
g(xi ) ≤ hL(g).
Yu. Nesterov Subgradient methods for huge-scale problems 12/22
![Page 101: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/101.jpg)
Constrained minimization (N.Shor (1964) & B.Polyak)
II. Problem: minx∈Q{f (x) : g(x) ≤ 0}, where
Q is closed and convex,
f , g have uniformly bounded subgradients.
Consider the following method. It has step-size parameter h > 0.
If g(xk) > h ‖g ′(xk)‖, then (A): xk+1 = πQ
(xk − g(xk )
‖g ′(xk )‖2 g ′(xk)),
else (B): xk+1 = πQ
(xk − h
‖f ′(xk )‖ f ′(xk)).
Let Fk ⊆ {0, . . . , k} be the set (B)-iterations, and f ∗k = mini∈Fk
f (xi ).
Theorem: If k > ‖x0 − x∗‖2/h2, then Fk 6= ∅ and
f ∗k − f (x) ≤ hL(f ), maxi∈Fk
g(xi ) ≤ hL(g).
Yu. Nesterov Subgradient methods for huge-scale problems 12/22
![Page 102: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/102.jpg)
Constrained minimization (N.Shor (1964) & B.Polyak)
II. Problem: minx∈Q{f (x) : g(x) ≤ 0}, where
Q is closed and convex,
f , g have uniformly bounded subgradients.
Consider the following method. It has step-size parameter h > 0.
If g(xk) > h ‖g ′(xk)‖, then (A): xk+1 = πQ
(xk − g(xk )
‖g ′(xk )‖2 g ′(xk)),
else (B): xk+1 = πQ
(xk − h
‖f ′(xk )‖ f ′(xk)).
Let Fk ⊆ {0, . . . , k} be the set (B)-iterations, and f ∗k = mini∈Fk
f (xi ).
Theorem: If k > ‖x0 − x∗‖2/h2, then Fk 6= ∅ and
f ∗k − f (x) ≤ hL(f ), maxi∈Fk
g(xi ) ≤ hL(g).
Yu. Nesterov Subgradient methods for huge-scale problems 12/22
![Page 103: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/103.jpg)
Constrained minimization (N.Shor (1964) & B.Polyak)
II. Problem: minx∈Q{f (x) : g(x) ≤ 0}, where
Q is closed and convex,
f , g have uniformly bounded subgradients.
Consider the following method.
It has step-size parameter h > 0.
If g(xk) > h ‖g ′(xk)‖, then (A): xk+1 = πQ
(xk − g(xk )
‖g ′(xk )‖2 g ′(xk)),
else (B): xk+1 = πQ
(xk − h
‖f ′(xk )‖ f ′(xk)).
Let Fk ⊆ {0, . . . , k} be the set (B)-iterations, and f ∗k = mini∈Fk
f (xi ).
Theorem: If k > ‖x0 − x∗‖2/h2, then Fk 6= ∅ and
f ∗k − f (x) ≤ hL(f ), maxi∈Fk
g(xi ) ≤ hL(g).
Yu. Nesterov Subgradient methods for huge-scale problems 12/22
![Page 104: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/104.jpg)
Constrained minimization (N.Shor (1964) & B.Polyak)
II. Problem: minx∈Q{f (x) : g(x) ≤ 0}, where
Q is closed and convex,
f , g have uniformly bounded subgradients.
Consider the following method. It has step-size parameter h > 0.
If g(xk) > h ‖g ′(xk)‖, then (A): xk+1 = πQ
(xk − g(xk )
‖g ′(xk )‖2 g ′(xk)),
else (B): xk+1 = πQ
(xk − h
‖f ′(xk )‖ f ′(xk)).
Let Fk ⊆ {0, . . . , k} be the set (B)-iterations, and f ∗k = mini∈Fk
f (xi ).
Theorem: If k > ‖x0 − x∗‖2/h2, then Fk 6= ∅ and
f ∗k − f (x) ≤ hL(f ), maxi∈Fk
g(xi ) ≤ hL(g).
Yu. Nesterov Subgradient methods for huge-scale problems 12/22
![Page 105: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/105.jpg)
Constrained minimization (N.Shor (1964) & B.Polyak)
II. Problem: minx∈Q{f (x) : g(x) ≤ 0}, where
Q is closed and convex,
f , g have uniformly bounded subgradients.
Consider the following method. It has step-size parameter h > 0.
If g(xk) > h ‖g ′(xk)‖, then (A):
xk+1 = πQ
(xk − g(xk )
‖g ′(xk )‖2 g ′(xk)),
else (B): xk+1 = πQ
(xk − h
‖f ′(xk )‖ f ′(xk)).
Let Fk ⊆ {0, . . . , k} be the set (B)-iterations, and f ∗k = mini∈Fk
f (xi ).
Theorem: If k > ‖x0 − x∗‖2/h2, then Fk 6= ∅ and
f ∗k − f (x) ≤ hL(f ), maxi∈Fk
g(xi ) ≤ hL(g).
Yu. Nesterov Subgradient methods for huge-scale problems 12/22
![Page 106: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/106.jpg)
Constrained minimization (N.Shor (1964) & B.Polyak)
II. Problem: minx∈Q{f (x) : g(x) ≤ 0}, where
Q is closed and convex,
f , g have uniformly bounded subgradients.
Consider the following method. It has step-size parameter h > 0.
If g(xk) > h ‖g ′(xk)‖, then (A): xk+1 = πQ
(xk − g(xk )
‖g ′(xk )‖2 g ′(xk)),
else (B): xk+1 = πQ
(xk − h
‖f ′(xk )‖ f ′(xk)).
Let Fk ⊆ {0, . . . , k} be the set (B)-iterations, and f ∗k = mini∈Fk
f (xi ).
Theorem: If k > ‖x0 − x∗‖2/h2, then Fk 6= ∅ and
f ∗k − f (x) ≤ hL(f ), maxi∈Fk
g(xi ) ≤ hL(g).
Yu. Nesterov Subgradient methods for huge-scale problems 12/22
![Page 107: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/107.jpg)
Constrained minimization (N.Shor (1964) & B.Polyak)
II. Problem: minx∈Q{f (x) : g(x) ≤ 0}, where
Q is closed and convex,
f , g have uniformly bounded subgradients.
Consider the following method. It has step-size parameter h > 0.
If g(xk) > h ‖g ′(xk)‖, then (A): xk+1 = πQ
(xk − g(xk )
‖g ′(xk )‖2 g ′(xk)),
else (B): xk+1 = πQ
(xk − h
‖f ′(xk )‖ f ′(xk)).
Let Fk ⊆ {0, . . . , k} be the set (B)-iterations, and f ∗k = mini∈Fk
f (xi ).
Theorem: If k > ‖x0 − x∗‖2/h2, then Fk 6= ∅ and
f ∗k − f (x) ≤ hL(f ), maxi∈Fk
g(xi ) ≤ hL(g).
Yu. Nesterov Subgradient methods for huge-scale problems 12/22
![Page 108: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/108.jpg)
Constrained minimization (N.Shor (1964) & B.Polyak)
II. Problem: minx∈Q{f (x) : g(x) ≤ 0}, where
Q is closed and convex,
f , g have uniformly bounded subgradients.
Consider the following method. It has step-size parameter h > 0.
If g(xk) > h ‖g ′(xk)‖, then (A): xk+1 = πQ
(xk − g(xk )
‖g ′(xk )‖2 g ′(xk)),
else (B): xk+1 = πQ
(xk − h
‖f ′(xk )‖ f ′(xk)).
Let Fk ⊆ {0, . . . , k} be the set (B)-iterations, and f ∗k = mini∈Fk
f (xi ).
Theorem: If k > ‖x0 − x∗‖2/h2, then Fk 6= ∅ and
f ∗k − f (x) ≤ hL(f ), maxi∈Fk
g(xi ) ≤ hL(g).
Yu. Nesterov Subgradient methods for huge-scale problems 12/22
![Page 109: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/109.jpg)
Constrained minimization (N.Shor (1964) & B.Polyak)
II. Problem: minx∈Q{f (x) : g(x) ≤ 0}, where
Q is closed and convex,
f , g have uniformly bounded subgradients.
Consider the following method. It has step-size parameter h > 0.
If g(xk) > h ‖g ′(xk)‖, then (A): xk+1 = πQ
(xk − g(xk )
‖g ′(xk )‖2 g ′(xk)),
else (B): xk+1 = πQ
(xk − h
‖f ′(xk )‖ f ′(xk)).
Let Fk ⊆ {0, . . . , k} be the set (B)-iterations, and f ∗k = mini∈Fk
f (xi ).
Theorem:
If k > ‖x0 − x∗‖2/h2, then Fk 6= ∅ and
f ∗k − f (x) ≤ hL(f ), maxi∈Fk
g(xi ) ≤ hL(g).
Yu. Nesterov Subgradient methods for huge-scale problems 12/22
![Page 110: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/110.jpg)
Constrained minimization (N.Shor (1964) & B.Polyak)
II. Problem: minx∈Q{f (x) : g(x) ≤ 0}, where
Q is closed and convex,
f , g have uniformly bounded subgradients.
Consider the following method. It has step-size parameter h > 0.
If g(xk) > h ‖g ′(xk)‖, then (A): xk+1 = πQ
(xk − g(xk )
‖g ′(xk )‖2 g ′(xk)),
else (B): xk+1 = πQ
(xk − h
‖f ′(xk )‖ f ′(xk)).
Let Fk ⊆ {0, . . . , k} be the set (B)-iterations, and f ∗k = mini∈Fk
f (xi ).
Theorem: If k > ‖x0 − x∗‖2/h2,
then Fk 6= ∅ and
f ∗k − f (x) ≤ hL(f ), maxi∈Fk
g(xi ) ≤ hL(g).
Yu. Nesterov Subgradient methods for huge-scale problems 12/22
![Page 111: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/111.jpg)
Constrained minimization (N.Shor (1964) & B.Polyak)
II. Problem: minx∈Q{f (x) : g(x) ≤ 0}, where
Q is closed and convex,
f , g have uniformly bounded subgradients.
Consider the following method. It has step-size parameter h > 0.
If g(xk) > h ‖g ′(xk)‖, then (A): xk+1 = πQ
(xk − g(xk )
‖g ′(xk )‖2 g ′(xk)),
else (B): xk+1 = πQ
(xk − h
‖f ′(xk )‖ f ′(xk)).
Let Fk ⊆ {0, . . . , k} be the set (B)-iterations, and f ∗k = mini∈Fk
f (xi ).
Theorem: If k > ‖x0 − x∗‖2/h2, then Fk 6= ∅
and
f ∗k − f (x) ≤ hL(f ), maxi∈Fk
g(xi ) ≤ hL(g).
Yu. Nesterov Subgradient methods for huge-scale problems 12/22
![Page 112: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/112.jpg)
Constrained minimization (N.Shor (1964) & B.Polyak)
II. Problem: minx∈Q{f (x) : g(x) ≤ 0}, where
Q is closed and convex,
f , g have uniformly bounded subgradients.
Consider the following method. It has step-size parameter h > 0.
If g(xk) > h ‖g ′(xk)‖, then (A): xk+1 = πQ
(xk − g(xk )
‖g ′(xk )‖2 g ′(xk)),
else (B): xk+1 = πQ
(xk − h
‖f ′(xk )‖ f ′(xk)).
Let Fk ⊆ {0, . . . , k} be the set (B)-iterations, and f ∗k = mini∈Fk
f (xi ).
Theorem: If k > ‖x0 − x∗‖2/h2, then Fk 6= ∅ and
f ∗k − f (x) ≤ hL(f ), maxi∈Fk
g(xi ) ≤ hL(g).
Yu. Nesterov Subgradient methods for huge-scale problems 12/22
![Page 113: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/113.jpg)
Computational strategies
1. Constants L(f ), L(g) are known (e.g. Linear Programming)
We can take h = εmax{L(f ),L(g)} . Then we need to decide on the
number of steps N (easy!).
Note: The standard advice is h = R√N+1
(much more difficult!)
2. Constants L(f ), L(g) are not known
Start from a guess.
Restart from scratch each time we see the guess is wrong.
The guess is doubled after restart.
3. Tracking the record value f ∗k
Double run. Other ideas are welcome!
Yu. Nesterov Subgradient methods for huge-scale problems 13/22
![Page 114: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/114.jpg)
Computational strategies
1. Constants L(f ), L(g) are known
(e.g. Linear Programming)
We can take h = εmax{L(f ),L(g)} . Then we need to decide on the
number of steps N (easy!).
Note: The standard advice is h = R√N+1
(much more difficult!)
2. Constants L(f ), L(g) are not known
Start from a guess.
Restart from scratch each time we see the guess is wrong.
The guess is doubled after restart.
3. Tracking the record value f ∗k
Double run. Other ideas are welcome!
Yu. Nesterov Subgradient methods for huge-scale problems 13/22
![Page 115: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/115.jpg)
Computational strategies
1. Constants L(f ), L(g) are known (e.g. Linear Programming)
We can take h = εmax{L(f ),L(g)} . Then we need to decide on the
number of steps N (easy!).
Note: The standard advice is h = R√N+1
(much more difficult!)
2. Constants L(f ), L(g) are not known
Start from a guess.
Restart from scratch each time we see the guess is wrong.
The guess is doubled after restart.
3. Tracking the record value f ∗k
Double run. Other ideas are welcome!
Yu. Nesterov Subgradient methods for huge-scale problems 13/22
![Page 116: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/116.jpg)
Computational strategies
1. Constants L(f ), L(g) are known (e.g. Linear Programming)
We can take h = εmax{L(f ),L(g)} .
Then we need to decide on the
number of steps N (easy!).
Note: The standard advice is h = R√N+1
(much more difficult!)
2. Constants L(f ), L(g) are not known
Start from a guess.
Restart from scratch each time we see the guess is wrong.
The guess is doubled after restart.
3. Tracking the record value f ∗k
Double run. Other ideas are welcome!
Yu. Nesterov Subgradient methods for huge-scale problems 13/22
![Page 117: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/117.jpg)
Computational strategies
1. Constants L(f ), L(g) are known (e.g. Linear Programming)
We can take h = εmax{L(f ),L(g)} . Then we need to decide on the
number of steps N
(easy!).
Note: The standard advice is h = R√N+1
(much more difficult!)
2. Constants L(f ), L(g) are not known
Start from a guess.
Restart from scratch each time we see the guess is wrong.
The guess is doubled after restart.
3. Tracking the record value f ∗k
Double run. Other ideas are welcome!
Yu. Nesterov Subgradient methods for huge-scale problems 13/22
![Page 118: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/118.jpg)
Computational strategies
1. Constants L(f ), L(g) are known (e.g. Linear Programming)
We can take h = εmax{L(f ),L(g)} . Then we need to decide on the
number of steps N (easy!).
Note: The standard advice is h = R√N+1
(much more difficult!)
2. Constants L(f ), L(g) are not known
Start from a guess.
Restart from scratch each time we see the guess is wrong.
The guess is doubled after restart.
3. Tracking the record value f ∗k
Double run. Other ideas are welcome!
Yu. Nesterov Subgradient methods for huge-scale problems 13/22
![Page 119: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/119.jpg)
Computational strategies
1. Constants L(f ), L(g) are known (e.g. Linear Programming)
We can take h = εmax{L(f ),L(g)} . Then we need to decide on the
number of steps N (easy!).
Note: The standard advice is h = R√N+1
(much more difficult!)
2. Constants L(f ), L(g) are not known
Start from a guess.
Restart from scratch each time we see the guess is wrong.
The guess is doubled after restart.
3. Tracking the record value f ∗k
Double run. Other ideas are welcome!
Yu. Nesterov Subgradient methods for huge-scale problems 13/22
![Page 120: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/120.jpg)
Computational strategies
1. Constants L(f ), L(g) are known (e.g. Linear Programming)
We can take h = εmax{L(f ),L(g)} . Then we need to decide on the
number of steps N (easy!).
Note: The standard advice is h = R√N+1
(much more difficult!)
2. Constants L(f ), L(g) are not known
Start from a guess.
Restart from scratch each time we see the guess is wrong.
The guess is doubled after restart.
3. Tracking the record value f ∗k
Double run. Other ideas are welcome!
Yu. Nesterov Subgradient methods for huge-scale problems 13/22
![Page 121: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/121.jpg)
Computational strategies
1. Constants L(f ), L(g) are known (e.g. Linear Programming)
We can take h = εmax{L(f ),L(g)} . Then we need to decide on the
number of steps N (easy!).
Note: The standard advice is h = R√N+1
(much more difficult!)
2. Constants L(f ), L(g) are not known
Start from a guess.
Restart from scratch each time we see the guess is wrong.
The guess is doubled after restart.
3. Tracking the record value f ∗k
Double run. Other ideas are welcome!
Yu. Nesterov Subgradient methods for huge-scale problems 13/22
![Page 122: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/122.jpg)
Computational strategies
1. Constants L(f ), L(g) are known (e.g. Linear Programming)
We can take h = εmax{L(f ),L(g)} . Then we need to decide on the
number of steps N (easy!).
Note: The standard advice is h = R√N+1
(much more difficult!)
2. Constants L(f ), L(g) are not known
Start from a guess.
Restart from scratch each time we see the guess is wrong.
The guess is doubled after restart.
3. Tracking the record value f ∗k
Double run. Other ideas are welcome!
Yu. Nesterov Subgradient methods for huge-scale problems 13/22
![Page 123: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/123.jpg)
Computational strategies
1. Constants L(f ), L(g) are known (e.g. Linear Programming)
We can take h = εmax{L(f ),L(g)} . Then we need to decide on the
number of steps N (easy!).
Note: The standard advice is h = R√N+1
(much more difficult!)
2. Constants L(f ), L(g) are not known
Start from a guess.
Restart from scratch each time we see the guess is wrong.
The guess is doubled after restart.
3. Tracking the record value f ∗k
Double run. Other ideas are welcome!
Yu. Nesterov Subgradient methods for huge-scale problems 13/22
![Page 124: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/124.jpg)
Computational strategies
1. Constants L(f ), L(g) are known (e.g. Linear Programming)
We can take h = εmax{L(f ),L(g)} . Then we need to decide on the
number of steps N (easy!).
Note: The standard advice is h = R√N+1
(much more difficult!)
2. Constants L(f ), L(g) are not known
Start from a guess.
Restart from scratch each time we see the guess is wrong.
The guess is doubled after restart.
3. Tracking the record value f ∗k
Double run. Other ideas are welcome!
Yu. Nesterov Subgradient methods for huge-scale problems 13/22
![Page 125: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/125.jpg)
Computational strategies
1. Constants L(f ), L(g) are known (e.g. Linear Programming)
We can take h = εmax{L(f ),L(g)} . Then we need to decide on the
number of steps N (easy!).
Note: The standard advice is h = R√N+1
(much more difficult!)
2. Constants L(f ), L(g) are not known
Start from a guess.
Restart from scratch each time we see the guess is wrong.
The guess is doubled after restart.
3. Tracking the record value f ∗k
Double run. Other ideas are welcome!
Yu. Nesterov Subgradient methods for huge-scale problems 13/22
![Page 126: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/126.jpg)
Computational strategies
1. Constants L(f ), L(g) are known (e.g. Linear Programming)
We can take h = εmax{L(f ),L(g)} . Then we need to decide on the
number of steps N (easy!).
Note: The standard advice is h = R√N+1
(much more difficult!)
2. Constants L(f ), L(g) are not known
Start from a guess.
Restart from scratch each time we see the guess is wrong.
The guess is doubled after restart.
3. Tracking the record value f ∗k
Double run.
Other ideas are welcome!
Yu. Nesterov Subgradient methods for huge-scale problems 13/22
![Page 127: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/127.jpg)
Computational strategies
1. Constants L(f ), L(g) are known (e.g. Linear Programming)
We can take h = εmax{L(f ),L(g)} . Then we need to decide on the
number of steps N (easy!).
Note: The standard advice is h = R√N+1
(much more difficult!)
2. Constants L(f ), L(g) are not known
Start from a guess.
Restart from scratch each time we see the guess is wrong.
The guess is doubled after restart.
3. Tracking the record value f ∗k
Double run. Other ideas are welcome!
Yu. Nesterov Subgradient methods for huge-scale problems 13/22
![Page 128: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/128.jpg)
Application examples
Observations:
1 Very often, Large- and Huge- scale problems have repetitivesparsity patterns and/or limited connectivity.
Social networks.Mobile phone networks.Truss topology design (local bars).Finite elements models (2D: four neighbors, 3D: six neighbors).
2 For p-diagonal matrices κ(A) ≤ p2.
Yu. Nesterov Subgradient methods for huge-scale problems 14/22
![Page 129: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/129.jpg)
Application examples
Observations:
1 Very often, Large- and Huge- scale problems have repetitivesparsity patterns and/or limited connectivity.
Social networks.Mobile phone networks.Truss topology design (local bars).Finite elements models (2D: four neighbors, 3D: six neighbors).
2 For p-diagonal matrices κ(A) ≤ p2.
Yu. Nesterov Subgradient methods for huge-scale problems 14/22
![Page 130: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/130.jpg)
Application examples
Observations:
1 Very often, Large- and Huge- scale problems have repetitivesparsity patterns and/or limited connectivity.
Social networks.Mobile phone networks.Truss topology design (local bars).Finite elements models (2D: four neighbors, 3D: six neighbors).
2 For p-diagonal matrices κ(A) ≤ p2.
Yu. Nesterov Subgradient methods for huge-scale problems 14/22
![Page 131: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/131.jpg)
Application examples
Observations:
1 Very often, Large- and Huge- scale problems have repetitivesparsity patterns and/or limited connectivity.
Social networks.
Mobile phone networks.Truss topology design (local bars).Finite elements models (2D: four neighbors, 3D: six neighbors).
2 For p-diagonal matrices κ(A) ≤ p2.
Yu. Nesterov Subgradient methods for huge-scale problems 14/22
![Page 132: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/132.jpg)
Application examples
Observations:
1 Very often, Large- and Huge- scale problems have repetitivesparsity patterns and/or limited connectivity.
Social networks.Mobile phone networks.
Truss topology design (local bars).Finite elements models (2D: four neighbors, 3D: six neighbors).
2 For p-diagonal matrices κ(A) ≤ p2.
Yu. Nesterov Subgradient methods for huge-scale problems 14/22
![Page 133: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/133.jpg)
Application examples
Observations:
1 Very often, Large- and Huge- scale problems have repetitivesparsity patterns and/or limited connectivity.
Social networks.Mobile phone networks.Truss topology design (local bars).
Finite elements models (2D: four neighbors, 3D: six neighbors).
2 For p-diagonal matrices κ(A) ≤ p2.
Yu. Nesterov Subgradient methods for huge-scale problems 14/22
![Page 134: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/134.jpg)
Application examples
Observations:
1 Very often, Large- and Huge- scale problems have repetitivesparsity patterns and/or limited connectivity.
Social networks.Mobile phone networks.Truss topology design (local bars).Finite elements models (2D: four neighbors, 3D: six neighbors).
2 For p-diagonal matrices κ(A) ≤ p2.
Yu. Nesterov Subgradient methods for huge-scale problems 14/22
![Page 135: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/135.jpg)
Application examples
Observations:
1 Very often, Large- and Huge- scale problems have repetitivesparsity patterns and/or limited connectivity.
Social networks.Mobile phone networks.Truss topology design (local bars).Finite elements models (2D: four neighbors, 3D: six neighbors).
2 For p-diagonal matrices κ(A) ≤ p2.
Yu. Nesterov Subgradient methods for huge-scale problems 14/22
![Page 136: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/136.jpg)
Google problem
Goal: Rank the agents in the society by their social weights.
Unknown: xi ≥ 0 - social influence of agent i = 1, . . . ,N.
Known: σi - set of friends of agent i .
Hypothesis
Agent i shares his support among all friends by equal parts.
The influence of agent i is equal to the total support obtainedfrom his friends.
Yu. Nesterov Subgradient methods for huge-scale problems 15/22
![Page 137: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/137.jpg)
Google problem
Goal: Rank the agents in the society by their social weights.
Unknown: xi ≥ 0 - social influence of agent i = 1, . . . ,N.
Known: σi - set of friends of agent i .
Hypothesis
Agent i shares his support among all friends by equal parts.
The influence of agent i is equal to the total support obtainedfrom his friends.
Yu. Nesterov Subgradient methods for huge-scale problems 15/22
![Page 138: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/138.jpg)
Google problem
Goal: Rank the agents in the society by their social weights.
Unknown: xi ≥ 0 - social influence of agent i = 1, . . . ,N.
Known: σi - set of friends of agent i .
Hypothesis
Agent i shares his support among all friends by equal parts.
The influence of agent i is equal to the total support obtainedfrom his friends.
Yu. Nesterov Subgradient methods for huge-scale problems 15/22
![Page 139: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/139.jpg)
Google problem
Goal: Rank the agents in the society by their social weights.
Unknown: xi ≥ 0 - social influence of agent i = 1, . . . ,N.
Known: σi - set of friends of agent i .
Hypothesis
Agent i shares his support among all friends by equal parts.
The influence of agent i is equal to the total support obtainedfrom his friends.
Yu. Nesterov Subgradient methods for huge-scale problems 15/22
![Page 140: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/140.jpg)
Google problem
Goal: Rank the agents in the society by their social weights.
Unknown: xi ≥ 0 - social influence of agent i = 1, . . . ,N.
Known: σi - set of friends of agent i .
Hypothesis
Agent i shares his support among all friends by equal parts.
The influence of agent i is equal to the total support obtainedfrom his friends.
Yu. Nesterov Subgradient methods for huge-scale problems 15/22
![Page 141: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/141.jpg)
Google problem
Goal: Rank the agents in the society by their social weights.
Unknown: xi ≥ 0 - social influence of agent i = 1, . . . ,N.
Known: σi - set of friends of agent i .
Hypothesis
Agent i shares his support among all friends by equal parts.
The influence of agent i is equal to the total support obtainedfrom his friends.
Yu. Nesterov Subgradient methods for huge-scale problems 15/22
![Page 142: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/142.jpg)
Google problem
Goal: Rank the agents in the society by their social weights.
Unknown: xi ≥ 0 - social influence of agent i = 1, . . . ,N.
Known: σi - set of friends of agent i .
Hypothesis
Agent i shares his support among all friends by equal parts.
The influence of agent i is equal to the total support obtainedfrom his friends.
Yu. Nesterov Subgradient methods for huge-scale problems 15/22
![Page 143: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/143.jpg)
Mathematical formulation: quadratic problem
Let E ∈ RN×N be an incidence matrix of the connections graph.Denote e = (1, . . . , 1)T ∈ RN and E = E · diag (ET e)−1.Since, ET e = e, this matrix is stochastic.
Problem: Find x∗ ≥ 0 : E x∗ = x∗, x∗ 6= 0.The size is very big!
Known technique:
Regularization + Fixed Point (Google Founders, B.Polyak &coauthors, etc.)
N09: Solve it by random CD-method as applied to12‖E x − x‖2 + γ
2 [〈e, x〉 − 1]2, γ > 0.
Main drawback: No interpretation for the objective function!
Yu. Nesterov Subgradient methods for huge-scale problems 16/22
![Page 144: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/144.jpg)
Mathematical formulation: quadratic problem
Let E ∈ RN×N be an incidence matrix of the connections graph.
Denote e = (1, . . . , 1)T ∈ RN and E = E · diag (ET e)−1.Since, ET e = e, this matrix is stochastic.
Problem: Find x∗ ≥ 0 : E x∗ = x∗, x∗ 6= 0.The size is very big!
Known technique:
Regularization + Fixed Point (Google Founders, B.Polyak &coauthors, etc.)
N09: Solve it by random CD-method as applied to12‖E x − x‖2 + γ
2 [〈e, x〉 − 1]2, γ > 0.
Main drawback: No interpretation for the objective function!
Yu. Nesterov Subgradient methods for huge-scale problems 16/22
![Page 145: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/145.jpg)
Mathematical formulation: quadratic problem
Let E ∈ RN×N be an incidence matrix of the connections graph.Denote e = (1, . . . , 1)T ∈ RN and E = E · diag (ET e)−1.
Since, ET e = e, this matrix is stochastic.
Problem: Find x∗ ≥ 0 : E x∗ = x∗, x∗ 6= 0.The size is very big!
Known technique:
Regularization + Fixed Point (Google Founders, B.Polyak &coauthors, etc.)
N09: Solve it by random CD-method as applied to12‖E x − x‖2 + γ
2 [〈e, x〉 − 1]2, γ > 0.
Main drawback: No interpretation for the objective function!
Yu. Nesterov Subgradient methods for huge-scale problems 16/22
![Page 146: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/146.jpg)
Mathematical formulation: quadratic problem
Let E ∈ RN×N be an incidence matrix of the connections graph.Denote e = (1, . . . , 1)T ∈ RN and E = E · diag (ET e)−1.Since, ET e = e, this matrix is stochastic.
Problem: Find x∗ ≥ 0 : E x∗ = x∗, x∗ 6= 0.The size is very big!
Known technique:
Regularization + Fixed Point (Google Founders, B.Polyak &coauthors, etc.)
N09: Solve it by random CD-method as applied to12‖E x − x‖2 + γ
2 [〈e, x〉 − 1]2, γ > 0.
Main drawback: No interpretation for the objective function!
Yu. Nesterov Subgradient methods for huge-scale problems 16/22
![Page 147: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/147.jpg)
Mathematical formulation: quadratic problem
Let E ∈ RN×N be an incidence matrix of the connections graph.Denote e = (1, . . . , 1)T ∈ RN and E = E · diag (ET e)−1.Since, ET e = e, this matrix is stochastic.
Problem: Find x∗ ≥ 0 : E x∗ = x∗, x∗ 6= 0.
The size is very big!
Known technique:
Regularization + Fixed Point (Google Founders, B.Polyak &coauthors, etc.)
N09: Solve it by random CD-method as applied to12‖E x − x‖2 + γ
2 [〈e, x〉 − 1]2, γ > 0.
Main drawback: No interpretation for the objective function!
Yu. Nesterov Subgradient methods for huge-scale problems 16/22
![Page 148: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/148.jpg)
Mathematical formulation: quadratic problem
Let E ∈ RN×N be an incidence matrix of the connections graph.Denote e = (1, . . . , 1)T ∈ RN and E = E · diag (ET e)−1.Since, ET e = e, this matrix is stochastic.
Problem: Find x∗ ≥ 0 : E x∗ = x∗, x∗ 6= 0.The size is very big!
Known technique:
Regularization + Fixed Point (Google Founders, B.Polyak &coauthors, etc.)
N09: Solve it by random CD-method as applied to12‖E x − x‖2 + γ
2 [〈e, x〉 − 1]2, γ > 0.
Main drawback: No interpretation for the objective function!
Yu. Nesterov Subgradient methods for huge-scale problems 16/22
![Page 149: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/149.jpg)
Mathematical formulation: quadratic problem
Let E ∈ RN×N be an incidence matrix of the connections graph.Denote e = (1, . . . , 1)T ∈ RN and E = E · diag (ET e)−1.Since, ET e = e, this matrix is stochastic.
Problem: Find x∗ ≥ 0 : E x∗ = x∗, x∗ 6= 0.The size is very big!
Known technique:
Regularization + Fixed Point (Google Founders, B.Polyak &coauthors, etc.)
N09: Solve it by random CD-method as applied to12‖E x − x‖2 + γ
2 [〈e, x〉 − 1]2, γ > 0.
Main drawback: No interpretation for the objective function!
Yu. Nesterov Subgradient methods for huge-scale problems 16/22
![Page 150: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/150.jpg)
Mathematical formulation: quadratic problem
Let E ∈ RN×N be an incidence matrix of the connections graph.Denote e = (1, . . . , 1)T ∈ RN and E = E · diag (ET e)−1.Since, ET e = e, this matrix is stochastic.
Problem: Find x∗ ≥ 0 : E x∗ = x∗, x∗ 6= 0.The size is very big!
Known technique:
Regularization + Fixed Point (Google Founders, B.Polyak &coauthors, etc.)
N09: Solve it by random CD-method as applied to12‖E x − x‖2 + γ
2 [〈e, x〉 − 1]2, γ > 0.
Main drawback: No interpretation for the objective function!
Yu. Nesterov Subgradient methods for huge-scale problems 16/22
![Page 151: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/151.jpg)
Mathematical formulation: quadratic problem
Let E ∈ RN×N be an incidence matrix of the connections graph.Denote e = (1, . . . , 1)T ∈ RN and E = E · diag (ET e)−1.Since, ET e = e, this matrix is stochastic.
Problem: Find x∗ ≥ 0 : E x∗ = x∗, x∗ 6= 0.The size is very big!
Known technique:
Regularization + Fixed Point (Google Founders, B.Polyak &coauthors, etc.)
N09: Solve it by random CD-method as applied to12‖E x − x‖2 + γ
2 [〈e, x〉 − 1]2, γ > 0.
Main drawback: No interpretation for the objective function!
Yu. Nesterov Subgradient methods for huge-scale problems 16/22
![Page 152: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/152.jpg)
Mathematical formulation: quadratic problem
Let E ∈ RN×N be an incidence matrix of the connections graph.Denote e = (1, . . . , 1)T ∈ RN and E = E · diag (ET e)−1.Since, ET e = e, this matrix is stochastic.
Problem: Find x∗ ≥ 0 : E x∗ = x∗, x∗ 6= 0.The size is very big!
Known technique:
Regularization + Fixed Point (Google Founders, B.Polyak &coauthors, etc.)
N09: Solve it by random CD-method as applied to12‖E x − x‖2 + γ
2 [〈e, x〉 − 1]2, γ > 0.
Main drawback: No interpretation for the objective function!
Yu. Nesterov Subgradient methods for huge-scale problems 16/22
![Page 153: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/153.jpg)
Nonsmooth formulation of Google Problem
Main property of spectral radius (A ≥ 0)
If A ∈ Rn×n+ , then ρ(A) = min
x≥0max
1≤i≤n
1x(i) 〈ei ,Ax〉.
The minimum is attained at the corresponding eigenvector.
Since ρ(E ) = 1, our problem is as follows:
f (x)def= max
1≤i≤N[〈ei , E x〉 − x (i)] → min
x≥0.
Interpretation: Increase self-confidence!Since f ∗ = 0, we can apply Polyak’s method with sparse updates.Additional features; the optimal set X ∗ is a convex cone.If x0 = e, then the whole sequence is separated from zero:
〈x∗, e〉 ≤ 〈x∗, xk〉 ≤ ‖x∗‖1 · ‖xk‖∞ = 〈x∗, e〉 · ‖xk‖∞.Goal: Find x ≥ 0 such that ‖x‖∞ ≥ 1 and f (x) ≤ ε.(First condition is satisfied automatically.)
Yu. Nesterov Subgradient methods for huge-scale problems 17/22
![Page 154: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/154.jpg)
Nonsmooth formulation of Google Problem
Main property of spectral radius (A ≥ 0)
If A ∈ Rn×n+ , then ρ(A) = min
x≥0max
1≤i≤n
1x(i) 〈ei ,Ax〉.
The minimum is attained at the corresponding eigenvector.
Since ρ(E ) = 1, our problem is as follows:
f (x)def= max
1≤i≤N[〈ei , E x〉 − x (i)] → min
x≥0.
Interpretation: Increase self-confidence!Since f ∗ = 0, we can apply Polyak’s method with sparse updates.Additional features; the optimal set X ∗ is a convex cone.If x0 = e, then the whole sequence is separated from zero:
〈x∗, e〉 ≤ 〈x∗, xk〉 ≤ ‖x∗‖1 · ‖xk‖∞ = 〈x∗, e〉 · ‖xk‖∞.Goal: Find x ≥ 0 such that ‖x‖∞ ≥ 1 and f (x) ≤ ε.(First condition is satisfied automatically.)
Yu. Nesterov Subgradient methods for huge-scale problems 17/22
![Page 155: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/155.jpg)
Nonsmooth formulation of Google Problem
Main property of spectral radius (A ≥ 0)
If A ∈ Rn×n+ , then ρ(A) = min
x≥0max
1≤i≤n
1x(i) 〈ei ,Ax〉.
The minimum is attained at the corresponding eigenvector.
Since ρ(E ) = 1, our problem is as follows:
f (x)def= max
1≤i≤N[〈ei , E x〉 − x (i)] → min
x≥0.
Interpretation: Increase self-confidence!Since f ∗ = 0, we can apply Polyak’s method with sparse updates.Additional features; the optimal set X ∗ is a convex cone.If x0 = e, then the whole sequence is separated from zero:
〈x∗, e〉 ≤ 〈x∗, xk〉 ≤ ‖x∗‖1 · ‖xk‖∞ = 〈x∗, e〉 · ‖xk‖∞.Goal: Find x ≥ 0 such that ‖x‖∞ ≥ 1 and f (x) ≤ ε.(First condition is satisfied automatically.)
Yu. Nesterov Subgradient methods for huge-scale problems 17/22
![Page 156: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/156.jpg)
Nonsmooth formulation of Google Problem
Main property of spectral radius (A ≥ 0)
If A ∈ Rn×n+ , then ρ(A) = min
x≥0max
1≤i≤n
1x(i) 〈ei ,Ax〉.
The minimum is attained at the corresponding eigenvector.
Since ρ(E ) = 1, our problem is as follows:
f (x)def= max
1≤i≤N[〈ei , E x〉 − x (i)] → min
x≥0.
Interpretation: Increase self-confidence!Since f ∗ = 0, we can apply Polyak’s method with sparse updates.Additional features; the optimal set X ∗ is a convex cone.If x0 = e, then the whole sequence is separated from zero:
〈x∗, e〉 ≤ 〈x∗, xk〉 ≤ ‖x∗‖1 · ‖xk‖∞ = 〈x∗, e〉 · ‖xk‖∞.Goal: Find x ≥ 0 such that ‖x‖∞ ≥ 1 and f (x) ≤ ε.(First condition is satisfied automatically.)
Yu. Nesterov Subgradient methods for huge-scale problems 17/22
![Page 157: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/157.jpg)
Nonsmooth formulation of Google Problem
Main property of spectral radius (A ≥ 0)
If A ∈ Rn×n+ , then ρ(A) = min
x≥0max
1≤i≤n
1x(i) 〈ei ,Ax〉.
The minimum is attained at the corresponding eigenvector.
Since ρ(E ) = 1, our problem is as follows:
f (x)def= max
1≤i≤N[〈ei , E x〉 − x (i)] → min
x≥0.
Interpretation: Increase self-confidence!
Since f ∗ = 0, we can apply Polyak’s method with sparse updates.Additional features; the optimal set X ∗ is a convex cone.If x0 = e, then the whole sequence is separated from zero:
〈x∗, e〉 ≤ 〈x∗, xk〉 ≤ ‖x∗‖1 · ‖xk‖∞ = 〈x∗, e〉 · ‖xk‖∞.Goal: Find x ≥ 0 such that ‖x‖∞ ≥ 1 and f (x) ≤ ε.(First condition is satisfied automatically.)
Yu. Nesterov Subgradient methods for huge-scale problems 17/22
![Page 158: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/158.jpg)
Nonsmooth formulation of Google Problem
Main property of spectral radius (A ≥ 0)
If A ∈ Rn×n+ , then ρ(A) = min
x≥0max
1≤i≤n
1x(i) 〈ei ,Ax〉.
The minimum is attained at the corresponding eigenvector.
Since ρ(E ) = 1, our problem is as follows:
f (x)def= max
1≤i≤N[〈ei , E x〉 − x (i)] → min
x≥0.
Interpretation: Increase self-confidence!Since f ∗ = 0, we can apply Polyak’s method with sparse updates.
Additional features; the optimal set X ∗ is a convex cone.If x0 = e, then the whole sequence is separated from zero:
〈x∗, e〉 ≤ 〈x∗, xk〉 ≤ ‖x∗‖1 · ‖xk‖∞ = 〈x∗, e〉 · ‖xk‖∞.Goal: Find x ≥ 0 such that ‖x‖∞ ≥ 1 and f (x) ≤ ε.(First condition is satisfied automatically.)
Yu. Nesterov Subgradient methods for huge-scale problems 17/22
![Page 159: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/159.jpg)
Nonsmooth formulation of Google Problem
Main property of spectral radius (A ≥ 0)
If A ∈ Rn×n+ , then ρ(A) = min
x≥0max
1≤i≤n
1x(i) 〈ei ,Ax〉.
The minimum is attained at the corresponding eigenvector.
Since ρ(E ) = 1, our problem is as follows:
f (x)def= max
1≤i≤N[〈ei , E x〉 − x (i)] → min
x≥0.
Interpretation: Increase self-confidence!Since f ∗ = 0, we can apply Polyak’s method with sparse updates.Additional features; the optimal set X ∗ is a convex cone.
If x0 = e, then the whole sequence is separated from zero:
〈x∗, e〉 ≤ 〈x∗, xk〉 ≤ ‖x∗‖1 · ‖xk‖∞ = 〈x∗, e〉 · ‖xk‖∞.Goal: Find x ≥ 0 such that ‖x‖∞ ≥ 1 and f (x) ≤ ε.(First condition is satisfied automatically.)
Yu. Nesterov Subgradient methods for huge-scale problems 17/22
![Page 160: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/160.jpg)
Nonsmooth formulation of Google Problem
Main property of spectral radius (A ≥ 0)
If A ∈ Rn×n+ , then ρ(A) = min
x≥0max
1≤i≤n
1x(i) 〈ei ,Ax〉.
The minimum is attained at the corresponding eigenvector.
Since ρ(E ) = 1, our problem is as follows:
f (x)def= max
1≤i≤N[〈ei , E x〉 − x (i)] → min
x≥0.
Interpretation: Increase self-confidence!Since f ∗ = 0, we can apply Polyak’s method with sparse updates.Additional features; the optimal set X ∗ is a convex cone.If x0 = e, then the whole sequence is separated from zero:
〈x∗, e〉 ≤ 〈x∗, xk〉 ≤ ‖x∗‖1 · ‖xk‖∞ = 〈x∗, e〉 · ‖xk‖∞.Goal: Find x ≥ 0 such that ‖x‖∞ ≥ 1 and f (x) ≤ ε.(First condition is satisfied automatically.)
Yu. Nesterov Subgradient methods for huge-scale problems 17/22
![Page 161: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/161.jpg)
Nonsmooth formulation of Google Problem
Main property of spectral radius (A ≥ 0)
If A ∈ Rn×n+ , then ρ(A) = min
x≥0max
1≤i≤n
1x(i) 〈ei ,Ax〉.
The minimum is attained at the corresponding eigenvector.
Since ρ(E ) = 1, our problem is as follows:
f (x)def= max
1≤i≤N[〈ei , E x〉 − x (i)] → min
x≥0.
Interpretation: Increase self-confidence!Since f ∗ = 0, we can apply Polyak’s method with sparse updates.Additional features; the optimal set X ∗ is a convex cone.If x0 = e, then the whole sequence is separated from zero:
〈x∗, e〉
≤ 〈x∗, xk〉 ≤ ‖x∗‖1 · ‖xk‖∞ = 〈x∗, e〉 · ‖xk‖∞.Goal: Find x ≥ 0 such that ‖x‖∞ ≥ 1 and f (x) ≤ ε.(First condition is satisfied automatically.)
Yu. Nesterov Subgradient methods for huge-scale problems 17/22
![Page 162: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/162.jpg)
Nonsmooth formulation of Google Problem
Main property of spectral radius (A ≥ 0)
If A ∈ Rn×n+ , then ρ(A) = min
x≥0max
1≤i≤n
1x(i) 〈ei ,Ax〉.
The minimum is attained at the corresponding eigenvector.
Since ρ(E ) = 1, our problem is as follows:
f (x)def= max
1≤i≤N[〈ei , E x〉 − x (i)] → min
x≥0.
Interpretation: Increase self-confidence!Since f ∗ = 0, we can apply Polyak’s method with sparse updates.Additional features; the optimal set X ∗ is a convex cone.If x0 = e, then the whole sequence is separated from zero:
〈x∗, e〉 ≤ 〈x∗, xk〉
≤ ‖x∗‖1 · ‖xk‖∞ = 〈x∗, e〉 · ‖xk‖∞.Goal: Find x ≥ 0 such that ‖x‖∞ ≥ 1 and f (x) ≤ ε.(First condition is satisfied automatically.)
Yu. Nesterov Subgradient methods for huge-scale problems 17/22
![Page 163: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/163.jpg)
Nonsmooth formulation of Google Problem
Main property of spectral radius (A ≥ 0)
If A ∈ Rn×n+ , then ρ(A) = min
x≥0max
1≤i≤n
1x(i) 〈ei ,Ax〉.
The minimum is attained at the corresponding eigenvector.
Since ρ(E ) = 1, our problem is as follows:
f (x)def= max
1≤i≤N[〈ei , E x〉 − x (i)] → min
x≥0.
Interpretation: Increase self-confidence!Since f ∗ = 0, we can apply Polyak’s method with sparse updates.Additional features; the optimal set X ∗ is a convex cone.If x0 = e, then the whole sequence is separated from zero:
〈x∗, e〉 ≤ 〈x∗, xk〉 ≤ ‖x∗‖1 · ‖xk‖∞
= 〈x∗, e〉 · ‖xk‖∞.Goal: Find x ≥ 0 such that ‖x‖∞ ≥ 1 and f (x) ≤ ε.(First condition is satisfied automatically.)
Yu. Nesterov Subgradient methods for huge-scale problems 17/22
![Page 164: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/164.jpg)
Nonsmooth formulation of Google Problem
Main property of spectral radius (A ≥ 0)
If A ∈ Rn×n+ , then ρ(A) = min
x≥0max
1≤i≤n
1x(i) 〈ei ,Ax〉.
The minimum is attained at the corresponding eigenvector.
Since ρ(E ) = 1, our problem is as follows:
f (x)def= max
1≤i≤N[〈ei , E x〉 − x (i)] → min
x≥0.
Interpretation: Increase self-confidence!Since f ∗ = 0, we can apply Polyak’s method with sparse updates.Additional features; the optimal set X ∗ is a convex cone.If x0 = e, then the whole sequence is separated from zero:
〈x∗, e〉 ≤ 〈x∗, xk〉 ≤ ‖x∗‖1 · ‖xk‖∞ = 〈x∗, e〉 · ‖xk‖∞.
Goal: Find x ≥ 0 such that ‖x‖∞ ≥ 1 and f (x) ≤ ε.(First condition is satisfied automatically.)
Yu. Nesterov Subgradient methods for huge-scale problems 17/22
![Page 165: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/165.jpg)
Nonsmooth formulation of Google Problem
Main property of spectral radius (A ≥ 0)
If A ∈ Rn×n+ , then ρ(A) = min
x≥0max
1≤i≤n
1x(i) 〈ei ,Ax〉.
The minimum is attained at the corresponding eigenvector.
Since ρ(E ) = 1, our problem is as follows:
f (x)def= max
1≤i≤N[〈ei , E x〉 − x (i)] → min
x≥0.
Interpretation: Increase self-confidence!Since f ∗ = 0, we can apply Polyak’s method with sparse updates.Additional features; the optimal set X ∗ is a convex cone.If x0 = e, then the whole sequence is separated from zero:
〈x∗, e〉 ≤ 〈x∗, xk〉 ≤ ‖x∗‖1 · ‖xk‖∞ = 〈x∗, e〉 · ‖xk‖∞.Goal: Find x ≥ 0 such that ‖x‖∞ ≥ 1 and f (x) ≤ ε.
(First condition is satisfied automatically.)
Yu. Nesterov Subgradient methods for huge-scale problems 17/22
![Page 166: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/166.jpg)
Nonsmooth formulation of Google Problem
Main property of spectral radius (A ≥ 0)
If A ∈ Rn×n+ , then ρ(A) = min
x≥0max
1≤i≤n
1x(i) 〈ei ,Ax〉.
The minimum is attained at the corresponding eigenvector.
Since ρ(E ) = 1, our problem is as follows:
f (x)def= max
1≤i≤N[〈ei , E x〉 − x (i)] → min
x≥0.
Interpretation: Increase self-confidence!Since f ∗ = 0, we can apply Polyak’s method with sparse updates.Additional features; the optimal set X ∗ is a convex cone.If x0 = e, then the whole sequence is separated from zero:
〈x∗, e〉 ≤ 〈x∗, xk〉 ≤ ‖x∗‖1 · ‖xk‖∞ = 〈x∗, e〉 · ‖xk‖∞.Goal: Find x ≥ 0 such that ‖x‖∞ ≥ 1 and f (x) ≤ ε.(First condition is satisfied automatically.)
Yu. Nesterov Subgradient methods for huge-scale problems 17/22
![Page 167: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/167.jpg)
Computational experiments: Iteration Cost
We compare Polyak’s GM with sparse update (GMs) with thestandard one (GM).
Setup: Each agent has exactly p random friends.
Thus, κ(A)def= max
1≤i≤MκA(AT ei ) ≈ p2.
Iteration Cost: GMs ≤ κ(A) log2 N ≈ p2 log2 N, GM ≈ pN.(log2 103 = 10, log2 106 = 20, log2 109 = 30)
Time for 104 iterations (p = 32)
N κ(A) GMs GM
1024 1632 3.00 2.982048 1792 3.36 6.414096 1888 3.75 15.118192 1920 4.20 139.92
16384 1824 4.69 408.38
Time for 103 iterations (p = 16)
N κ(A) GMs GM
131072 576 0.19 213.9262144 592 0.25 477.8524288 592 0.32 1095.5
1048576 608 0.40 2590.8
1 sec ≈ 100 min!
Yu. Nesterov Subgradient methods for huge-scale problems 18/22
![Page 168: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/168.jpg)
Computational experiments: Iteration Cost
We compare Polyak’s GM with sparse update (GMs) with thestandard one (GM).
Setup: Each agent has exactly p random friends.
Thus, κ(A)def= max
1≤i≤MκA(AT ei ) ≈ p2.
Iteration Cost: GMs ≤ κ(A) log2 N ≈ p2 log2 N, GM ≈ pN.(log2 103 = 10, log2 106 = 20, log2 109 = 30)
Time for 104 iterations (p = 32)
N κ(A) GMs GM
1024 1632 3.00 2.982048 1792 3.36 6.414096 1888 3.75 15.118192 1920 4.20 139.92
16384 1824 4.69 408.38
Time for 103 iterations (p = 16)
N κ(A) GMs GM
131072 576 0.19 213.9262144 592 0.25 477.8524288 592 0.32 1095.5
1048576 608 0.40 2590.8
1 sec ≈ 100 min!
Yu. Nesterov Subgradient methods for huge-scale problems 18/22
![Page 169: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/169.jpg)
Computational experiments: Iteration Cost
We compare Polyak’s GM with sparse update (GMs) with thestandard one (GM).
Setup: Each agent has exactly p random friends.
Thus, κ(A)def= max
1≤i≤MκA(AT ei ) ≈ p2.
Iteration Cost: GMs ≤ κ(A) log2 N ≈ p2 log2 N, GM ≈ pN.(log2 103 = 10, log2 106 = 20, log2 109 = 30)
Time for 104 iterations (p = 32)
N κ(A) GMs GM
1024 1632 3.00 2.982048 1792 3.36 6.414096 1888 3.75 15.118192 1920 4.20 139.92
16384 1824 4.69 408.38
Time for 103 iterations (p = 16)
N κ(A) GMs GM
131072 576 0.19 213.9262144 592 0.25 477.8524288 592 0.32 1095.5
1048576 608 0.40 2590.8
1 sec ≈ 100 min!
Yu. Nesterov Subgradient methods for huge-scale problems 18/22
![Page 170: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/170.jpg)
Computational experiments: Iteration Cost
We compare Polyak’s GM with sparse update (GMs) with thestandard one (GM).
Setup: Each agent has exactly p random friends.
Thus, κ(A)def= max
1≤i≤MκA(AT ei ) ≈ p2.
Iteration Cost: GMs ≤ κ(A) log2 N
≈ p2 log2 N, GM ≈ pN.(log2 103 = 10, log2 106 = 20, log2 109 = 30)
Time for 104 iterations (p = 32)
N κ(A) GMs GM
1024 1632 3.00 2.982048 1792 3.36 6.414096 1888 3.75 15.118192 1920 4.20 139.92
16384 1824 4.69 408.38
Time for 103 iterations (p = 16)
N κ(A) GMs GM
131072 576 0.19 213.9262144 592 0.25 477.8524288 592 0.32 1095.5
1048576 608 0.40 2590.8
1 sec ≈ 100 min!
Yu. Nesterov Subgradient methods for huge-scale problems 18/22
![Page 171: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/171.jpg)
Computational experiments: Iteration Cost
We compare Polyak’s GM with sparse update (GMs) with thestandard one (GM).
Setup: Each agent has exactly p random friends.
Thus, κ(A)def= max
1≤i≤MκA(AT ei ) ≈ p2.
Iteration Cost: GMs ≤ κ(A) log2 N ≈ p2 log2 N,
GM ≈ pN.(log2 103 = 10, log2 106 = 20, log2 109 = 30)
Time for 104 iterations (p = 32)
N κ(A) GMs GM
1024 1632 3.00 2.982048 1792 3.36 6.414096 1888 3.75 15.118192 1920 4.20 139.92
16384 1824 4.69 408.38
Time for 103 iterations (p = 16)
N κ(A) GMs GM
131072 576 0.19 213.9262144 592 0.25 477.8524288 592 0.32 1095.5
1048576 608 0.40 2590.8
1 sec ≈ 100 min!
Yu. Nesterov Subgradient methods for huge-scale problems 18/22
![Page 172: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/172.jpg)
Computational experiments: Iteration Cost
We compare Polyak’s GM with sparse update (GMs) with thestandard one (GM).
Setup: Each agent has exactly p random friends.
Thus, κ(A)def= max
1≤i≤MκA(AT ei ) ≈ p2.
Iteration Cost: GMs ≤ κ(A) log2 N ≈ p2 log2 N, GM ≈ pN.
(log2 103 = 10, log2 106 = 20, log2 109 = 30)
Time for 104 iterations (p = 32)
N κ(A) GMs GM
1024 1632 3.00 2.982048 1792 3.36 6.414096 1888 3.75 15.118192 1920 4.20 139.92
16384 1824 4.69 408.38
Time for 103 iterations (p = 16)
N κ(A) GMs GM
131072 576 0.19 213.9262144 592 0.25 477.8524288 592 0.32 1095.5
1048576 608 0.40 2590.8
1 sec ≈ 100 min!
Yu. Nesterov Subgradient methods for huge-scale problems 18/22
![Page 173: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/173.jpg)
Computational experiments: Iteration Cost
We compare Polyak’s GM with sparse update (GMs) with thestandard one (GM).
Setup: Each agent has exactly p random friends.
Thus, κ(A)def= max
1≤i≤MκA(AT ei ) ≈ p2.
Iteration Cost: GMs ≤ κ(A) log2 N ≈ p2 log2 N, GM ≈ pN.(log2 103 = 10,
log2 106 = 20, log2 109 = 30)
Time for 104 iterations (p = 32)
N κ(A) GMs GM
1024 1632 3.00 2.982048 1792 3.36 6.414096 1888 3.75 15.118192 1920 4.20 139.92
16384 1824 4.69 408.38
Time for 103 iterations (p = 16)
N κ(A) GMs GM
131072 576 0.19 213.9262144 592 0.25 477.8524288 592 0.32 1095.5
1048576 608 0.40 2590.8
1 sec ≈ 100 min!
Yu. Nesterov Subgradient methods for huge-scale problems 18/22
![Page 174: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/174.jpg)
Computational experiments: Iteration Cost
We compare Polyak’s GM with sparse update (GMs) with thestandard one (GM).
Setup: Each agent has exactly p random friends.
Thus, κ(A)def= max
1≤i≤MκA(AT ei ) ≈ p2.
Iteration Cost: GMs ≤ κ(A) log2 N ≈ p2 log2 N, GM ≈ pN.(log2 103 = 10, log2 106 = 20,
log2 109 = 30)
Time for 104 iterations (p = 32)
N κ(A) GMs GM
1024 1632 3.00 2.982048 1792 3.36 6.414096 1888 3.75 15.118192 1920 4.20 139.92
16384 1824 4.69 408.38
Time for 103 iterations (p = 16)
N κ(A) GMs GM
131072 576 0.19 213.9262144 592 0.25 477.8524288 592 0.32 1095.5
1048576 608 0.40 2590.8
1 sec ≈ 100 min!
Yu. Nesterov Subgradient methods for huge-scale problems 18/22
![Page 175: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/175.jpg)
Computational experiments: Iteration Cost
We compare Polyak’s GM with sparse update (GMs) with thestandard one (GM).
Setup: Each agent has exactly p random friends.
Thus, κ(A)def= max
1≤i≤MκA(AT ei ) ≈ p2.
Iteration Cost: GMs ≤ κ(A) log2 N ≈ p2 log2 N, GM ≈ pN.(log2 103 = 10, log2 106 = 20, log2 109 = 30)
Time for 104 iterations (p = 32)
N κ(A) GMs GM
1024 1632 3.00 2.982048 1792 3.36 6.414096 1888 3.75 15.118192 1920 4.20 139.92
16384 1824 4.69 408.38
Time for 103 iterations (p = 16)
N κ(A) GMs GM
131072 576 0.19 213.9262144 592 0.25 477.8524288 592 0.32 1095.5
1048576 608 0.40 2590.8
1 sec ≈ 100 min!
Yu. Nesterov Subgradient methods for huge-scale problems 18/22
![Page 176: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/176.jpg)
Computational experiments: Iteration Cost
We compare Polyak’s GM with sparse update (GMs) with thestandard one (GM).
Setup: Each agent has exactly p random friends.
Thus, κ(A)def= max
1≤i≤MκA(AT ei ) ≈ p2.
Iteration Cost: GMs ≤ κ(A) log2 N ≈ p2 log2 N, GM ≈ pN.(log2 103 = 10, log2 106 = 20, log2 109 = 30)
Time for 104 iterations (p = 32)
N κ(A) GMs GM
1024 1632 3.00 2.982048 1792 3.36 6.414096 1888 3.75 15.118192 1920 4.20 139.92
16384 1824 4.69 408.38
Time for 103 iterations (p = 16)
N κ(A) GMs GM
131072 576 0.19 213.9262144 592 0.25 477.8524288 592 0.32 1095.5
1048576 608 0.40 2590.8
1 sec ≈ 100 min!
Yu. Nesterov Subgradient methods for huge-scale problems 18/22
![Page 177: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/177.jpg)
Computational experiments: Iteration Cost
We compare Polyak’s GM with sparse update (GMs) with thestandard one (GM).
Setup: Each agent has exactly p random friends.
Thus, κ(A)def= max
1≤i≤MκA(AT ei ) ≈ p2.
Iteration Cost: GMs ≤ κ(A) log2 N ≈ p2 log2 N, GM ≈ pN.(log2 103 = 10, log2 106 = 20, log2 109 = 30)
Time for 104 iterations (p = 32)
N κ(A) GMs GM
1024 1632 3.00 2.982048 1792 3.36 6.414096 1888 3.75 15.118192 1920 4.20 139.92
16384 1824 4.69 408.38
Time for 103 iterations (p = 16)
N κ(A) GMs GM
131072 576 0.19 213.9262144 592 0.25 477.8524288 592 0.32 1095.5
1048576 608 0.40 2590.8
1 sec ≈ 100 min!
Yu. Nesterov Subgradient methods for huge-scale problems 18/22
![Page 178: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/178.jpg)
Computational experiments: Iteration Cost
We compare Polyak’s GM with sparse update (GMs) with thestandard one (GM).
Setup: Each agent has exactly p random friends.
Thus, κ(A)def= max
1≤i≤MκA(AT ei ) ≈ p2.
Iteration Cost: GMs ≤ κ(A) log2 N ≈ p2 log2 N, GM ≈ pN.(log2 103 = 10, log2 106 = 20, log2 109 = 30)
Time for 104 iterations (p = 32)
N κ(A) GMs GM
1024 1632 3.00 2.982048 1792 3.36 6.414096 1888 3.75 15.118192 1920 4.20 139.92
16384 1824 4.69 408.38
Time for 103 iterations (p = 16)
N κ(A) GMs GM
131072 576 0.19 213.9262144 592 0.25 477.8524288 592 0.32 1095.5
1048576 608 0.40 2590.8
1 sec ≈ 100 min!
Yu. Nesterov Subgradient methods for huge-scale problems 18/22
![Page 179: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/179.jpg)
Convergence of GMs : Medium Size
Let N = 131072, p = 16, κ(A) = 576, and L(f ) = 0.21.
Iterations f − f ∗ Time (sec)
1.0 · 105 0.1100 16.443.0 · 105 0.0429 49.326.0 · 105 0.0221 98.651.1 · 106 0.0119 180.852.2 · 106 0.0057 361.714.1 · 106 0.0028 674.097.6 · 106 0.0014 1249.541.0 · 107 0.0010 1644.13
Dimension and accuracy are sufficiently high, but the time is stillreasonable.
Yu. Nesterov Subgradient methods for huge-scale problems 19/22
![Page 180: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/180.jpg)
Convergence of GMs : Medium Size
Let N = 131072, p = 16, κ(A) = 576, and L(f ) = 0.21.
Iterations f − f ∗ Time (sec)
1.0 · 105 0.1100 16.443.0 · 105 0.0429 49.326.0 · 105 0.0221 98.651.1 · 106 0.0119 180.852.2 · 106 0.0057 361.714.1 · 106 0.0028 674.097.6 · 106 0.0014 1249.541.0 · 107 0.0010 1644.13
Dimension and accuracy are sufficiently high, but the time is stillreasonable.
Yu. Nesterov Subgradient methods for huge-scale problems 19/22
![Page 181: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/181.jpg)
Convergence of GMs : Medium Size
Let N = 131072, p = 16, κ(A) = 576, and L(f ) = 0.21.
Iterations f − f ∗ Time (sec)
1.0 · 105 0.1100 16.443.0 · 105 0.0429 49.326.0 · 105 0.0221 98.651.1 · 106 0.0119 180.852.2 · 106 0.0057 361.714.1 · 106 0.0028 674.097.6 · 106 0.0014 1249.541.0 · 107 0.0010 1644.13
Dimension and accuracy are sufficiently high, but the time is stillreasonable.
Yu. Nesterov Subgradient methods for huge-scale problems 19/22
![Page 182: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/182.jpg)
Convergence of GMs : Medium Size
Let N = 131072, p = 16, κ(A) = 576, and L(f ) = 0.21.
Iterations f − f ∗ Time (sec)
1.0 · 105 0.1100 16.443.0 · 105 0.0429 49.326.0 · 105 0.0221 98.651.1 · 106 0.0119 180.852.2 · 106 0.0057 361.714.1 · 106 0.0028 674.097.6 · 106 0.0014 1249.541.0 · 107 0.0010 1644.13
Dimension and accuracy are sufficiently high, but the time is stillreasonable.
Yu. Nesterov Subgradient methods for huge-scale problems 19/22
![Page 183: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/183.jpg)
Convergence of GMs : Large Scale
Let N = 1048576, p = 8, κ(A) = 192, and L(f ) = 0.21.
Iterations f − f ∗ Time (sec)
0 2.000000 0.001.0 · 105 0.546662 7.694.0 · 105 0.276866 30.741.0 · 106 0.137822 76.862.5 · 106 0.063099 192.145.1 · 106 0.032092 391.979.9 · 106 0.016162 760.881.5 · 107 0.010009 1183.59
Final point x∗: ‖x∗‖∞ = 2.941497, R20
def= ‖x∗ − e‖22 = 1.2 · 105.
Theoretical bound:L2(f )R2
0ε2
= 5.3 · 107. Time for GM: ≈ 1 year!
Yu. Nesterov Subgradient methods for huge-scale problems 20/22
![Page 184: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/184.jpg)
Convergence of GMs : Large Scale
Let N = 1048576, p = 8, κ(A) = 192, and L(f ) = 0.21.
Iterations f − f ∗ Time (sec)
0 2.000000 0.001.0 · 105 0.546662 7.694.0 · 105 0.276866 30.741.0 · 106 0.137822 76.862.5 · 106 0.063099 192.145.1 · 106 0.032092 391.979.9 · 106 0.016162 760.881.5 · 107 0.010009 1183.59
Final point x∗: ‖x∗‖∞ = 2.941497, R20
def= ‖x∗ − e‖22 = 1.2 · 105.
Theoretical bound:L2(f )R2
0ε2
= 5.3 · 107. Time for GM: ≈ 1 year!
Yu. Nesterov Subgradient methods for huge-scale problems 20/22
![Page 185: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/185.jpg)
Convergence of GMs : Large Scale
Let N = 1048576, p = 8, κ(A) = 192, and L(f ) = 0.21.
Iterations f − f ∗ Time (sec)
0 2.000000 0.001.0 · 105 0.546662 7.694.0 · 105 0.276866 30.741.0 · 106 0.137822 76.862.5 · 106 0.063099 192.145.1 · 106 0.032092 391.979.9 · 106 0.016162 760.881.5 · 107 0.010009 1183.59
Final point x∗: ‖x∗‖∞ = 2.941497, R20
def= ‖x∗ − e‖22 = 1.2 · 105.
Theoretical bound:L2(f )R2
0ε2
= 5.3 · 107. Time for GM: ≈ 1 year!
Yu. Nesterov Subgradient methods for huge-scale problems 20/22
![Page 186: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/186.jpg)
Convergence of GMs : Large Scale
Let N = 1048576, p = 8, κ(A) = 192, and L(f ) = 0.21.
Iterations f − f ∗ Time (sec)
0 2.000000 0.001.0 · 105 0.546662 7.694.0 · 105 0.276866 30.741.0 · 106 0.137822 76.862.5 · 106 0.063099 192.145.1 · 106 0.032092 391.979.9 · 106 0.016162 760.881.5 · 107 0.010009 1183.59
Final point x∗: ‖x∗‖∞ = 2.941497, R20
def= ‖x∗ − e‖22 = 1.2 · 105.
Theoretical bound:L2(f )R2
0ε2
= 5.3 · 107. Time for GM: ≈ 1 year!
Yu. Nesterov Subgradient methods for huge-scale problems 20/22
![Page 187: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/187.jpg)
Convergence of GMs : Large Scale
Let N = 1048576, p = 8, κ(A) = 192, and L(f ) = 0.21.
Iterations f − f ∗ Time (sec)
0 2.000000 0.001.0 · 105 0.546662 7.694.0 · 105 0.276866 30.741.0 · 106 0.137822 76.862.5 · 106 0.063099 192.145.1 · 106 0.032092 391.979.9 · 106 0.016162 760.881.5 · 107 0.010009 1183.59
Final point x∗: ‖x∗‖∞ = 2.941497, R20
def= ‖x∗ − e‖22 = 1.2 · 105.
Theoretical bound:L2(f )R2
0ε2
= 5.3 · 107.
Time for GM: ≈ 1 year!
Yu. Nesterov Subgradient methods for huge-scale problems 20/22
![Page 188: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/188.jpg)
Convergence of GMs : Large Scale
Let N = 1048576, p = 8, κ(A) = 192, and L(f ) = 0.21.
Iterations f − f ∗ Time (sec)
0 2.000000 0.001.0 · 105 0.546662 7.694.0 · 105 0.276866 30.741.0 · 106 0.137822 76.862.5 · 106 0.063099 192.145.1 · 106 0.032092 391.979.9 · 106 0.016162 760.881.5 · 107 0.010009 1183.59
Final point x∗: ‖x∗‖∞ = 2.941497, R20
def= ‖x∗ − e‖22 = 1.2 · 105.
Theoretical bound:L2(f )R2
0ε2
= 5.3 · 107. Time for GM: ≈ 1 year!
Yu. Nesterov Subgradient methods for huge-scale problems 20/22
![Page 189: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/189.jpg)
Conclusion
1 Sparse GM is an efficient and reliable method for solvingLarge- and Huge- Scale problems with uniform sparsity.
2 We can treat also dense rows. Assume that inequality〈a, x〉 ≤ b is dense. It is equivalent to the following system:
y (1) = a(1) x (1), y (j) = y (j−1) + a(j) x (j), j = 2, . . . , n,
y (n) ≤ b.
We need new variables y (j) for all nonzero coefficients of a.
Introduce p(a) additional variables and p(A) additionalequality constraints. (No problem!)Hidden drawback: the above equalities are satisfied with errors.May be it is not too bad?
3 Similar technique can be applied to dense columns.
Yu. Nesterov Subgradient methods for huge-scale problems 21/22
![Page 190: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/190.jpg)
Conclusion
1 Sparse GM is an efficient and reliable method for solvingLarge- and Huge- Scale problems with uniform sparsity.
2 We can treat also dense rows. Assume that inequality〈a, x〉 ≤ b is dense. It is equivalent to the following system:
y (1) = a(1) x (1), y (j) = y (j−1) + a(j) x (j), j = 2, . . . , n,
y (n) ≤ b.
We need new variables y (j) for all nonzero coefficients of a.
Introduce p(a) additional variables and p(A) additionalequality constraints. (No problem!)Hidden drawback: the above equalities are satisfied with errors.May be it is not too bad?
3 Similar technique can be applied to dense columns.
Yu. Nesterov Subgradient methods for huge-scale problems 21/22
![Page 191: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/191.jpg)
Conclusion
1 Sparse GM is an efficient and reliable method for solvingLarge- and Huge- Scale problems with uniform sparsity.
2 We can treat also dense rows.
Assume that inequality〈a, x〉 ≤ b is dense. It is equivalent to the following system:
y (1) = a(1) x (1), y (j) = y (j−1) + a(j) x (j), j = 2, . . . , n,
y (n) ≤ b.
We need new variables y (j) for all nonzero coefficients of a.
Introduce p(a) additional variables and p(A) additionalequality constraints. (No problem!)Hidden drawback: the above equalities are satisfied with errors.May be it is not too bad?
3 Similar technique can be applied to dense columns.
Yu. Nesterov Subgradient methods for huge-scale problems 21/22
![Page 192: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/192.jpg)
Conclusion
1 Sparse GM is an efficient and reliable method for solvingLarge- and Huge- Scale problems with uniform sparsity.
2 We can treat also dense rows. Assume that inequality〈a, x〉 ≤ b is dense.
It is equivalent to the following system:
y (1) = a(1) x (1), y (j) = y (j−1) + a(j) x (j), j = 2, . . . , n,
y (n) ≤ b.
We need new variables y (j) for all nonzero coefficients of a.
Introduce p(a) additional variables and p(A) additionalequality constraints. (No problem!)Hidden drawback: the above equalities are satisfied with errors.May be it is not too bad?
3 Similar technique can be applied to dense columns.
Yu. Nesterov Subgradient methods for huge-scale problems 21/22
![Page 193: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/193.jpg)
Conclusion
1 Sparse GM is an efficient and reliable method for solvingLarge- and Huge- Scale problems with uniform sparsity.
2 We can treat also dense rows. Assume that inequality〈a, x〉 ≤ b is dense. It is equivalent to the following system:
y (1) = a(1) x (1), y (j) = y (j−1) + a(j) x (j), j = 2, . . . , n,
y (n) ≤ b.
We need new variables y (j) for all nonzero coefficients of a.
Introduce p(a) additional variables and p(A) additionalequality constraints. (No problem!)Hidden drawback: the above equalities are satisfied with errors.May be it is not too bad?
3 Similar technique can be applied to dense columns.
Yu. Nesterov Subgradient methods for huge-scale problems 21/22
![Page 194: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/194.jpg)
Conclusion
1 Sparse GM is an efficient and reliable method for solvingLarge- and Huge- Scale problems with uniform sparsity.
2 We can treat also dense rows. Assume that inequality〈a, x〉 ≤ b is dense. It is equivalent to the following system:
y (1) = a(1) x (1), y (j) = y (j−1) + a(j) x (j), j = 2, . . . , n,
y (n) ≤ b.
We need new variables y (j) for all nonzero coefficients of a.
Introduce p(a) additional variables and p(A) additionalequality constraints. (No problem!)Hidden drawback: the above equalities are satisfied with errors.May be it is not too bad?
3 Similar technique can be applied to dense columns.
Yu. Nesterov Subgradient methods for huge-scale problems 21/22
![Page 195: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/195.jpg)
Conclusion
1 Sparse GM is an efficient and reliable method for solvingLarge- and Huge- Scale problems with uniform sparsity.
2 We can treat also dense rows. Assume that inequality〈a, x〉 ≤ b is dense. It is equivalent to the following system:
y (1) = a(1) x (1), y (j) = y (j−1) + a(j) x (j), j = 2, . . . , n,
y (n) ≤ b.
We need new variables y (j) for all nonzero coefficients of a.
Introduce p(a) additional variables and p(A) additionalequality constraints.
(No problem!)Hidden drawback: the above equalities are satisfied with errors.May be it is not too bad?
3 Similar technique can be applied to dense columns.
Yu. Nesterov Subgradient methods for huge-scale problems 21/22
![Page 196: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/196.jpg)
Conclusion
1 Sparse GM is an efficient and reliable method for solvingLarge- and Huge- Scale problems with uniform sparsity.
2 We can treat also dense rows. Assume that inequality〈a, x〉 ≤ b is dense. It is equivalent to the following system:
y (1) = a(1) x (1), y (j) = y (j−1) + a(j) x (j), j = 2, . . . , n,
y (n) ≤ b.
We need new variables y (j) for all nonzero coefficients of a.
Introduce p(a) additional variables and p(A) additionalequality constraints. (No problem!)
Hidden drawback: the above equalities are satisfied with errors.May be it is not too bad?
3 Similar technique can be applied to dense columns.
Yu. Nesterov Subgradient methods for huge-scale problems 21/22
![Page 197: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/197.jpg)
Conclusion
1 Sparse GM is an efficient and reliable method for solvingLarge- and Huge- Scale problems with uniform sparsity.
2 We can treat also dense rows. Assume that inequality〈a, x〉 ≤ b is dense. It is equivalent to the following system:
y (1) = a(1) x (1), y (j) = y (j−1) + a(j) x (j), j = 2, . . . , n,
y (n) ≤ b.
We need new variables y (j) for all nonzero coefficients of a.
Introduce p(a) additional variables and p(A) additionalequality constraints. (No problem!)Hidden drawback: the above equalities are satisfied with errors.
May be it is not too bad?
3 Similar technique can be applied to dense columns.
Yu. Nesterov Subgradient methods for huge-scale problems 21/22
![Page 198: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/198.jpg)
Conclusion
1 Sparse GM is an efficient and reliable method for solvingLarge- and Huge- Scale problems with uniform sparsity.
2 We can treat also dense rows. Assume that inequality〈a, x〉 ≤ b is dense. It is equivalent to the following system:
y (1) = a(1) x (1), y (j) = y (j−1) + a(j) x (j), j = 2, . . . , n,
y (n) ≤ b.
We need new variables y (j) for all nonzero coefficients of a.
Introduce p(a) additional variables and p(A) additionalequality constraints. (No problem!)Hidden drawback: the above equalities are satisfied with errors.May be it is not too bad?
3 Similar technique can be applied to dense columns.
Yu. Nesterov Subgradient methods for huge-scale problems 21/22
![Page 199: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/199.jpg)
Conclusion
1 Sparse GM is an efficient and reliable method for solvingLarge- and Huge- Scale problems with uniform sparsity.
2 We can treat also dense rows. Assume that inequality〈a, x〉 ≤ b is dense. It is equivalent to the following system:
y (1) = a(1) x (1), y (j) = y (j−1) + a(j) x (j), j = 2, . . . , n,
y (n) ≤ b.
We need new variables y (j) for all nonzero coefficients of a.
Introduce p(a) additional variables and p(A) additionalequality constraints. (No problem!)Hidden drawback: the above equalities are satisfied with errors.May be it is not too bad?
3 Similar technique can be applied to dense columns.
Yu. Nesterov Subgradient methods for huge-scale problems 21/22
![Page 200: Subgradient Methods for Huge-Scale Optimization Problems - Юрий Нестеров, Catholic University of Louvain, Belgium](https://reader033.vdocuments.us/reader033/viewer/2022042816/5589306ed8b42a3e608b466c/html5/thumbnails/200.jpg)
Thank you for your attention!
Yu. Nesterov Subgradient methods for huge-scale problems 22/22