
Advanced Optimization (10-801: CMU)

Lecture 19: Parallel proximal; Incremental gradient

26 Mar, 2014

Suvrit Sra


Douglas-Rachford

min f(x) + h(x)

z ← (1/2)(I + R_f R_h) z

Reflection operator

R_f := 2 prox_f − I.

Observe: R_f = −R_{f^*} (another justification of "reflection")

prox_f + prox_{f^*} = I
2 prox_f = 2I − 2 prox_{f^*}
2 prox_f − I = I − 2 prox_{f^*}
R_f = −R_{f^*}

2 / 25
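To make the iteration concrete, here is a minimal numerical sketch (not from the slides). The toy instance, f = indicator of the nonnegative orthant and h(x) = (1/2)‖x − b‖_2^2 with the vector b below, is an illustrative assumption; both proxes have well-known closed forms.

import numpy as np

def douglas_rachford(prox_f, prox_h, z0, iters=200):
    # z <- 0.5 * (z + R_f(R_h(z))), with R_g = 2 prox_g - I;
    # the primal iterate is read off as x = prox_h(z).
    z = z0.copy()
    for _ in range(iters):
        rh = 2.0 * prox_h(z) - z          # R_h z
        rf = 2.0 * prox_f(rh) - rh        # R_f R_h z
        z = 0.5 * (z + rf)
    return prox_h(z)

# Toy instance (assumed for illustration): f = indicator of {x >= 0},
# h(x) = 0.5*||x - b||^2, so the minimizer of f + h is max(b, 0).
b = np.array([-1.0, 2.0, 0.5])
prox_f = lambda v: np.maximum(v, 0.0)     # projection onto the orthant
prox_h = lambda v: 0.5 * (v + b)          # prox of 0.5*||. - b||^2 (step 1)
x = douglas_rachford(prox_f, prox_h, np.zeros_like(b))
# x is close to [0.0, 2.0, 0.5]

At a fixed point z, the shadow x = prox_h(z) satisfies 0 ∈ ∂f(x) + ∂h(x), which is why the solution is read off through prox_h rather than from z itself.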


Douglas-Rachford – open problem

min f(x) + g(x) + h(x)

Naive candidate: z ← (1/2)(I + R_f R_g R_h) z

Trying to mimic the two-function derivation:

0 ∈ ∂f(x) + ∂g(x) + ∂h(x)
3x ∈ (I + ∂f)(x) + (I + ∂g)(x) + (I + ∂h)(x)
3x ∈ (I + ∂f)(x) + z + w
now what?

Partial solution (Borwein, Tam (2013))

T_{hf} := (1/2)(I + R_f R_h)
T_{[fgh]} := T_{hf} T_{gh} T_{fg}
z ← T_{[fgh]} z

◦ Works for more than 3 functions too!
◦ For two functions T_{[fg]} = T_{gf} T_{fg}
◦ Does not coincide with usual DR.
◦ Finding the "correct" generalization is an open problem.

3 / 25
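A minimal sketch of one sweep of the cyclically composed operator above (illustrative, not from the slides; it assumes the pairwise operators follow the same ordering convention as T_{hf} = (1/2)(I + R_f R_h), and prox_list is a hypothetical list of prox routines):

def reflect(prox, v):
    return 2.0 * prox(v) - v              # R_g v = 2 prox_g(v) - v

def T_pair(prox_a, prox_b, z):
    # T_{ab} z = 0.5 * (z + R_b(R_a(z))), the two-function DR operator
    return 0.5 * (z + reflect(prox_b, reflect(prox_a, z)))

def borwein_tam_sweep(prox_list, z):
    # One sweep of T_[f1 ... fm] = T_{fm f1} ... T_{f2 f3} T_{f1 f2}:
    # apply the pairwise operators cyclically over (f1,f2), (f2,f3), ..., (fm,f1).
    m = len(prox_list)
    for i in range(m):
        z = T_pair(prox_list[i], prox_list[(i + 1) % m], z)
    return z

For m = 2 this reduces to T_{gf} T_{fg}, which, as the slide notes, is not the usual DR map.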


Parallel proximal methods

Optimizing separable objective functions

f(x) := (1/2)‖x − y‖_2^2 + ∑_i f_i(x)

f(x) := ∑_i f_i(x)

Let us consider

min f(x) = ∑_{i=1}^m f_i(x),   x ∈ R^n.

4 / 25



Product space technique

▶ Original problem over H = R^n
▶ Suppose we have ∑_{i=1}^m f_i(x)
▶ Introduce new variables (x_1, . . . , x_m)
▶ Now the problem is over the domain H^m := H × H × · · · × H (m times)
▶ New constraint: x_1 = x_2 = · · · = x_m

min_{(x_1,...,x_m)} ∑_i f_i(x_i)
s.t. x_1 = x_2 = · · · = x_m.

Technique due to: G. Pierra (1976)

5 / 25



Product space technique

Two block problem

min_x f(x) + I_B(x)

where x ∈ H^m and B = {z ∈ H^m | z = (x, x, . . . , x)}

▶ Let y = (y_1, . . . , y_m)
▶ prox_f(y) = (prox_{f_1}(y_1), . . . , prox_{f_m}(y_m))
▶ prox_{I_B}(y) ≡ Π_B(y) can be solved as follows:

min_{z∈B} (1/2)‖z − y‖_2^2
min_{x∈H} ∑_i (1/2)‖x − y_i‖_2^2
⟹ x = (1/m) ∑_i y_i

Exercise: Work out the details of DR using the product space idea

This technique commonly exploited in ADMM too

6 / 25
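One way the exercise can be sketched in code (an illustrative outline under the assumptions above, not the slides' intended solution): run the two-function DR from slide 2 on the product space, with prox_f acting blockwise and prox_{I_B} given by the averaging formula. The quadratic toy instance at the end is assumed purely to check the output.

import numpy as np

def parallel_dr(prox_list, z0, iters=300):
    # z has shape (m, n): one block per function.
    # h = I_B (consensus), so prox_h replicates the block average;
    # f(x) = sum_i f_i(x_i), so prox_f acts blockwise.
    z = z0.copy()
    for _ in range(iters):
        xbar = z.mean(axis=0)                         # Pi_B(z)
        rh = 2.0 * xbar[None, :] - z                  # R_h z
        pf = np.stack([p(rh[i]) for i, p in enumerate(prox_list)])
        rf = 2.0 * pf - rh                            # R_f R_h z
        z = 0.5 * (z + rf)
    return z.mean(axis=0)                             # x = Pi_B(z)

# Toy check: f_i(x) = 0.5*||x - a_i||^2, whose sum is minimized at mean(a_i).
A = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]])
prox_list = [lambda v, a=a: 0.5 * (v + a) for a in A]
x = parallel_dr(prox_list, np.zeros_like(A))
# x is close to A.mean(axis=0) = [1.0, 1.0]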



Alternative: two block proximity

min_x (1/2)‖x − y‖_2^2 + f(x) + h(x)

Usually prox_{f+h} ≠ prox_f ◦ prox_h

Proximal-Dykstra method

1. Let x_0 = y; u_0 = 0, z_0 = 0
2. k-th iteration (k ≥ 0):
   w_k = prox_f(x_k + u_k),   u_{k+1} = x_k + u_k − w_k
   x_{k+1} = prox_h(w_k + z_k),   z_{k+1} = w_k + z_k − x_{k+1}

Why does it work?

Exercise: Use the product-space technique to extend this to a parallel prox-Dykstra method for m ≥ 3 functions.
Combettes, Pesquet (2010); Bauschke, Combettes (2012)

7 / 25
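A direct transcription of the method into code (sketch only; the two-box instance at the end is an assumed toy: with indicator functions, prox-Dykstra reduces to Dykstra's projection algorithm, so the answer is the clipped vector).

import numpy as np

def prox_dykstra(prox_f, prox_h, y, iters=100):
    # Computes argmin_x 0.5*||x - y||^2 + f(x) + h(x)
    # following the iteration on the slide.
    x = y.copy()
    u = np.zeros_like(y)
    z = np.zeros_like(y)
    for _ in range(iters):
        w = prox_f(x + u)
        u = x + u - w
        x = prox_h(w + z)
        z = w + z - x
    return x

# Toy instance: f, h are the indicators of {x >= 0} and {x <= 2}.
y = np.array([-1.0, 0.5, 3.0])
prox_f = lambda v: np.maximum(v, 0.0)
prox_h = lambda v: np.minimum(v, 2.0)
x = prox_dykstra(prox_f, prox_h, y)
# x is close to np.clip(y, 0.0, 2.0) = [0.0, 0.5, 2.0]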



Proximal-Dykstra – some insight

min_x (1/2)‖x − y‖_2^2 + f(x) + h(x)

L(x, z, w, ν, µ) := (1/2)‖x − y‖_2^2 + f(z) + h(w) + ν^T(x − z) + µ^T(x − w).

▶ Let's derive the dual from L:

g(ν, µ) := inf_{x,z,w} L(x, z, w, ν, µ)

x − y + ν + µ = 0 ⟹ x = y − ν − µ
inf_z f(z) − ν^T z = −f^*(ν),   (similarly get −h^*(µ))

g(ν, µ) = −(1/2)‖ν + µ‖_2^2 + (ν + µ)^T y − f^*(ν) − h^*(µ)

Equivalent dual problem

min G(ν, µ) := (1/2)‖ν + µ − y‖_2^2 + f^*(ν) + h^*(µ).

8 / 25
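One small step is left implicit above: why minimizing G is equivalent to maximizing g. Completing the square (a standard manipulation, spelled out here for completeness) gives

−g(ν, µ) = (1/2)‖ν + µ‖_2^2 − (ν + µ)^T y + f^*(ν) + h^*(µ) = (1/2)‖ν + µ − y‖_2^2 − (1/2)‖y‖_2^2 + f^*(ν) + h^*(µ),

so the maximizers of g are exactly the minimizers of G; the optimal values differ only by the constant (1/2)‖y‖_2^2.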



Proximal-Dykstra – key insight

Dual problem

min G(ν, µ) := (1/2)‖ν + µ − y‖_2^2 + f^*(ν) + h^*(µ).

Solve this dual via Block-Coordinate Descent!

ν_{k+1} = argmin_ν G(ν, µ_k),
µ_{k+1} = argmin_µ G(ν_{k+1}, µ).

0 ∈ ν_{k+1} + µ_k − y + ∂f^*(ν_{k+1}) ⟹ y − µ_k ∈ ν_{k+1} + ∂f^*(ν_{k+1})
⟹ ν_{k+1} = prox_{f^*}(y − µ_k) ⟹ ν_{k+1} = y − µ_k − prox_f(y − µ_k)
Similarly, µ_{k+1} = y − ν_{k+1} − prox_h(y − ν_{k+1})

9 / 25


Proximal-Dykstra – key insight

▶ 0 ∈ ν_{k+1} + µ_k − y + ∂f^*(ν_{k+1})
▶ 0 ∈ ν_{k+1} + µ_{k+1} − y + ∂h^*(µ_{k+1}).

ν_{k+1} = y − µ_k − prox_f(y − µ_k)
µ_{k+1} = y − ν_{k+1} − prox_h(y − ν_{k+1})

Now use Lagrangian stationarity condition

x = y − ν − µ ⟹ y − µ = x + ν

to rewrite BCD using primal and dual variables.

Prox-Dykstra

w_k ← prox_f(x_k + ν_k)
ν_{k+1} ← x_k + ν_k − w_k
x_{k+1} ← prox_h(w_k + µ_k)
µ_{k+1} ← µ_k + w_k − x_{k+1}

10 / 25


Example practical use

Anisotropic 2D-TV Proximity operator

min_X (1/2)‖X − Y‖_F^2 + ∑_{ij} w^c_{ij} |x_{i,j+1} − x_{ij}| + ∑_{ij} w^r_{ij} |x_{i+1,j} − x_{ij}|

• Amenable to prox-Dykstra

• Used in (Barbero, Sra, ICML 2011).

• The subproblem:

min (1/2)‖a − b‖_2^2 + ∑_i w_i |a_i − a_{i+1}|

itself has been the subject of over 15 papers!

• I still use it now and then

11 / 25
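As a structural sketch of how this problem maps onto prox-Dykstra (illustrative only): take f(X) = the sum of 1D-TV terms along rows and h(X) = the sum of 1D-TV terms along columns, so each prox is a batch of independent 1D-TV proximity problems. The helper prox_tv1d is an assumed placeholder for any 1D total-variation prox solver (e.g. a taut-string routine), and a single uniform weight w replaces the slide's per-edge weights for simplicity.

import numpy as np

def tv2d_prox(Y, w, prox_tv1d, iters=50):
    # Anisotropic 2D-TV prox via prox-Dykstra, splitting the penalty into
    # f = TV along each row and h = TV along each column.
    # prox_tv1d(b, w) is assumed to return
    #   argmin_a 0.5*||a - b||^2 + w * sum_i |a_{i+1} - a_i|.
    prox_rows = lambda V: np.stack([prox_tv1d(r, w) for r in V])        # f separates over rows
    prox_cols = lambda V: np.stack([prox_tv1d(c, w) for c in V.T]).T    # h separates over columns
    X, U, Z = Y.copy(), np.zeros_like(Y), np.zeros_like(Y)
    for _ in range(iters):
        W = prox_rows(X + U)
        U = X + U - W
        X = prox_cols(W + Z)
        Z = W + Z - X
    return X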



Incremental first-order methods

12 / 25


Separable objectives

min f(x) = ∑_{i=1}^m f_i(x) + λ r(x)

Gradient / subgradient methods

x_{k+1} = x_k − α_k ∇f(x_k)   (λ = 0)
x_{k+1} = x_k − α_k g(x_k),   g(x_k) ∈ ∂f(x_k) + λ ∂r(x_k)
x_{k+1} = prox_{α_k r}(x_k − α_k ∇f(x_k))

Product-space based methods

min F(x_1, . . . , x_m) + I_B(x_1, . . . , x_m)

(x_{1,k+1}, . . . , x_{m,k+1}) ← prox_F(y_{1,k}, . . . , y_{m,k})

How much computation does one iteration take?

13 / 25



Incremental gradient methods

What if at iteration k, we randomly pick an integer i(k) ∈ {1, 2, . . . , m}?

And instead just perform the update

x_{k+1} = x_k − α_k ∇f_{i(k)}(x_k)

▶ The update requires only the gradient of f_{i(k)}
▶ One iteration is now m times faster than with ∇f(x)

But does this make sense?

14 / 25
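A small sketch of this update for the least-squares components f_i(x) = (1/2)(a_i^T x − b_i)^2 used in the example a few slides below (the constant step size, iteration count, and the synthetic data at the end are illustrative assumptions):

import numpy as np

def incremental_gradient(A, b, alpha=0.01, steps=20000, seed=0):
    # x_{k+1} = x_k - alpha_k * grad f_{i(k)}(x_k), with i(k) uniform in {1,...,m};
    # here f_i(x) = 0.5*(a_i^T x - b_i)^2, so grad f_i(x) = (a_i^T x - b_i) a_i.
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    for k in range(steps):
        i = rng.integers(m)
        x -= alpha * (A[i] @ x - b[i]) * A[i]
    return x

# Each iteration touches a single row of A, i.e. it is m times cheaper
# than a full-gradient step x - alpha * A.T @ (A @ x - b).
A = np.random.default_rng(1).normal(size=(200, 5))
x_true = np.arange(5.0)
b = A @ x_true
x_hat = incremental_gradient(A, b)   # approaches x_true on this consistent system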



Incremental gradient methods

♠ Old idea; has been used extensively as backpropagation inneural networks, Widrow-Hoff least mean squares, gradientmethods with errors, stochastic gradient, etc.

♠ Can “stream” through data — go through components one byone, say cyclically instead of randomly

♠ For large m many fi(x) may have similar minimizers; using thefi individually we could take advantage, and greatly speed up.

♠ Incremental methods usually effective far from the eventuallimit (solution) — become very slow close to the solution.

♠ Several open questions related to convergence and rate ofconvergence (for both convex, nonconvex)

♠ Usually randomization greatly simplifies convergence analysis

15 / 25



Example (Bertsekas)

▶ Assume all variables involved are scalars.

min f(x) = (1/2) ∑_{i=1}^m (a_i x − b_i)^2

▶ Solving f′(x) = 0 we obtain

x^* = (∑_i a_i b_i) / (∑_i a_i^2)

▶ Minimum of a single f_i(x) = (1/2)(a_i x − b_i)^2 is x^*_i = b_i / a_i
▶ Notice now that

x^* ∈ [min_i x^*_i, max_i x^*_i] =: R

(Use: ∑_i a_i b_i = ∑_i a_i^2 (b_i / a_i))

16 / 25
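A quick numerical sanity check of the claim x^* ∈ R (the particular numbers for a_i, b_i are made up for illustration):

import numpy as np

a = np.array([1.0, 2.0, -1.0])
b = np.array([2.0, 2.0, 1.0])
x_star = (a @ b) / (a @ a)                 # zero of f'(x) = sum_i a_i*(a_i*x - b_i)
x_i = b / a                                # per-component minimizers b_i / a_i
assert x_i.min() <= x_star <= x_i.max()    # x* lies in R = [min_i x_i*, max_i x_i*]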



Example (Bertsekas)

▶ Assume all variables involved are scalars.

min f(x) = (1/2) ∑_{i=1}^m (a_i x − b_i)^2

▶ Notice: x^* ∈ [min_i x^*_i, max_i x^*_i] =: R
▶ What if we have a scalar x that lies outside R?
▶ We see that

∇f_i(x) = a_i (a_i x − b_i)
∇f(x) = ∑_i a_i (a_i x − b_i)

▶ ∇f_i(x) has the same sign as ∇f(x). So using ∇f_i(x) instead of ∇f(x) also ensures progress.
▶ But once inside the region R, there is no guarantee that the incremental method will make progress towards the optimum.

17 / 25



Incremental proximal method

min f(x) = ∑_i f_i(x)

What if the f_i are nonsmooth?

x_{k+1} = prox_{α_k f}(x_k)

x_{k+1} = prox_{α_k f_{i(k)}}(x_k)

x_{k+1} = argmin_x ( (1/2)‖x − x_k‖_2^2 + α_k f_{i(k)}(x) ),
with i(k) ∈ {1, 2, . . . , m} picked uniformly at random.

Convergence rate analysis?

18 / 25



Example

Fermat-Weber problem (historically the first facility-location problem)

min_x ∑_i w_i ‖x − a_i‖

▶ Assuming ‖·‖ = ‖·‖_2
▶ Also assume no a_i is an optimum
▶ [Weiszfeld; '37] Let T := x ↦ (∑_i w_i a_i / ‖x − a_i‖) / (∑_i w_i / ‖x − a_i‖)
▶ Assuming T is well-defined, T^k(x_0) → argmin
▶ [Kuhn; '73] completed the proof
▶ What if ‖·‖ = ‖·‖_p?
▶ 100s of papers discuss the Fermat-Weber problem

19 / 25
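A minimal sketch of iterating the map T (illustrative; the guard simply stops if an iterate lands on some a_i, where T is undefined):

import numpy as np

def weiszfeld(A, w, x0, iters=200):
    # T(x) = (sum_i w_i a_i / ||x - a_i||) / (sum_i w_i / ||x - a_i||)
    x = x0.copy()
    for _ in range(iters):
        d = np.linalg.norm(A - x, axis=1)
        if np.any(d < 1e-12):          # T is not defined at a data point a_i
            break
        c = w / d
        x = (c[:, None] * A).sum(axis=0) / c.sum()
    return x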



Incremental proximal method

Fermat-Weber problem

min_x ∑_i w_i ‖x − a_i‖

Now, f_i(x) := w_i ‖x − a_i‖_2.

x_{k+1} = prox_{α_k f_{i(k)}}(x_k)

x_{k+1} = argmin_x ( (1/2)‖x − x_k‖_2^2 + α_k w_{i(k)} ‖x − a_{i(k)}‖_2 ),
with i(k) ∈ {1, 2, . . . , m} picked uniformly at random.

Exercise: Obtain a closed-form solution for x_{k+1}

Rate of convergence? Most likely sublinear? Can we somehow get linear convergence?

20 / 25
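For reference when checking the exercise: the prox of t‖· − a‖_2 is the standard block soft-thresholding operator, which gives the sketch below (the diminishing step α_k = α_0/(k + 1) is an illustrative choice, not prescribed by the slide):

import numpy as np

def prox_weighted_norm(v, a, t):
    # prox of t*||. - a||_2 at v: return a if ||v - a|| <= t,
    # otherwise pull v towards a by the amount t.
    d = v - a
    nrm = np.linalg.norm(d)
    return a.copy() if nrm <= t else a + (1.0 - t / nrm) * d

def incremental_prox_fw(A, w, x0, alpha0=1.0, iters=5000, seed=0):
    # x_{k+1} = prox_{alpha_k f_{i(k)}}(x_k) with f_i = w_i*||. - a_i||_2.
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for k in range(iters):
        i = rng.integers(len(A))
        x = prox_weighted_norm(x, A[i], (alpha0 / (k + 1)) * w[i])
    return x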



Incremental proximal-gradients

min ∑_i f_i(x) + r(x).

x_{k+1} = prox_{η_k r}( x_k − η_k ∑_{i=1}^m ∇f_i(z_i) ),   k = 0, 1, . . . ,
z_1 = x_k
z_{i+1} = z_i − η_k ∇f_i(z_i),   i = 1, . . . , m − 1.

We can choose η_k = 1/L, where L is the Lipschitz constant of ∇f(x).

Might be easier to analyze:

x_{k+1} = prox_{η_k r}( x_k − η_k ∑_{i=1}^m ∇f_i(z_i) ),   k = 0, 1, . . . ,
z_1 = x_k
z_{i+1} = prox_{η_k r}( z_i − η_k ∇f_i(z_i) ),   i = 1, . . . , m − 1.

Moreover, the analysis is easier if we go through the f_i randomly (so-called stochastic).

21 / 25
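A sketch of the first scheme above (the choice r = λ‖·‖_1 with its soft-thresholding prox, the fixed step η, and the outer iteration count are illustrative assumptions; grad_list supplies the ∇f_i):

import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)   # prox of t*||.||_1

def incremental_prox_grad(grad_list, lam, x0, eta, outer=100):
    # One outer step: z_1 = x_k, z_{i+1} = z_i - eta*grad f_i(z_i),
    # then x_{k+1} = prox_{eta*r}(x_k - eta * sum_i grad f_i(z_i)).
    x = x0.copy()
    m = len(grad_list)
    for _ in range(outer):
        z = x.copy()
        total = np.zeros_like(x)
        for i in range(m):
            g = grad_list[i](z)      # gradient of the i-th component at z_i
            total += g
            z = z - eta * g          # inner incremental sweep
        x = soft_threshold(x - eta * total, eta * lam)   # prox_{eta r}, r = lam*||.||_1
    return x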



Incremental methods: deterministic

min ( f(x) = ∑_i f_i(x) ) + r(x)

Gradient with error

∇f_{i(k)}(x) = ∇f(x) + e

x_{k+1} = prox_{α_k r}[ x_k − α_k (∇f(x_k) + e_k) ]

So if in the limit the error α_k e_k disappears, we should be OK!

22 / 25



Incremental gradient methods

Incremental gradient methods may be viewed as

Gradient methods with error in gradient computation

▶ If we can control this error, we can control convergence
▶ Error makes even the smooth case more like the nonsmooth case
▶ So, convergence crucially depends on the stepsize α_k

Some stepsize choices

♠ α_k = c, a small enough constant
♠ α_k → 0, ∑_k α_k = ∞ (diminishing scalar)
♠ Constant for some iterations, diminish, again constant, repeat
♠ α_k = min(c, a/(b + k)), where a, b, c > 0 (user chosen).

23 / 25
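Three of the schedules above, spelled out as code (the constants a, b, c are user-chosen, as the slide notes):

def stepsize_constant(k, c=0.01):
    return c                              # alpha_k = c

def stepsize_diminishing(k, a=1.0):
    return a / (k + 1)                    # alpha_k -> 0, sum_k alpha_k = infinity

def stepsize_capped(k, a=1.0, b=10.0, c=0.1):
    return min(c, a / (b + k))            # alpha_k = min(c, a/(b + k))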


Pages 102-110:

Incremental gradient – summary

♠ Usually much faster when far from convergence (especially when the number of components m is large)

♠ Slow progress near the optimum (because α_k is often too small)

♠ A constant step α_k = α doesn't always yield convergence

♠ A diminishing step α_k = O(1/k) leads to convergence

♠ Usually slow, sublinear rate of convergence

♠ If the f_i are strongly convex, a linear rate is available (SAG, SVRG; see the sketch after this slide)

♠ The idea extends to subgradient and proximal setups

♠ Some extensions also apply to nonconvex problems

♠ Some extend to parallel and distributed computation

24 / 25
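
SVRG is only named on the slide above; as a concrete illustration, here is a minimal sketch of the standard SVRG recipe for the smooth, unregularized case on an assumed least-squares objective. The function name, the stepsize 0.1/L, and the epoch length 2m are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def svrg_least_squares(A, b, outer_iters=20, inner_iters=None, seed=0):
    """SVRG for min_x (1/m) * sum_i 0.5*(a_i^T x - b_i)^2."""
    m, n = A.shape
    inner_iters = inner_iters or 2 * m
    L = np.max(np.sum(A * A, axis=1))      # per-component smoothness constant
    eta = 0.1 / L                          # constant stepsize (illustrative choice)
    rng = np.random.default_rng(seed)
    x_tilde = np.zeros(n)
    for _ in range(outer_iters):
        full_grad = A.T @ (A @ x_tilde - b) / m          # full-gradient snapshot at x_tilde
        x = x_tilde.copy()
        for _ in range(inner_iters):
            i = rng.integers(m)
            gi_x = (A[i] @ x - b[i]) * A[i]              # grad f_i at the current iterate
            gi_snap = (A[i] @ x_tilde - b[i]) * A[i]     # grad f_i at the snapshot
            x -= eta * (gi_x - gi_snap + full_grad)      # variance-reduced gradient step
        x_tilde = x                                      # new snapshot
    return x_tilde
```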

Page 111:

References

♠ EE227A slides, S. Sra

♠ Introductory Lectures on Convex Optimization, Yu. Nesterov

♠ Proximal splitting methods, Combettes & Pesquet

25 / 25