Scientific Computing for X-Ray Computed Tomography
Introduction to Optimization (part I)

Martin S. Andersen
Section for Scientific Computing, DTU Compute
January 21, 2019



Unconstrained minimization

g : R^n → R differentiable

    minimize g(x)

Gradient method: choose x^(0) ∈ R^n and iterate

    x^(k+1) = x^(k) − t_k ∇g(x^(k)),   k = 0, 1, 2, …

• constant steps: t_k = t > 0
• diminishing steps: t_k = t/√k > 0
• exact line search: t_k = argmin_{t ≥ 0} g(x^(k) − t ∇g(x^(k)))
• backtracking line search

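Below is a minimal Python (NumPy) sketch of the gradient method with the constant and diminishing step rules; the function name, stopping tolerance, and defaults are illustrative, not part of the slides.

    import numpy as np

    def gradient_method(grad, x0, t=1e-2, diminishing=False, maxiter=1000, tol=1e-8):
        """Iterate x^(k+1) = x^(k) - t_k * grad(x^(k)).

        t           -- base step size (constant rule: t_k = t)
        diminishing -- if True, use t_k = t / sqrt(k + 1) (shifted to avoid k = 0)
        """
        x = np.asarray(x0, dtype=float)
        for k in range(maxiter):
            gk = grad(x)
            if np.linalg.norm(gk) < tol:   # gradient nearly zero: stationary point
                break
            tk = t / np.sqrt(k + 1) if diminishing else t
            x = x - tk * gk
        return x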


Example

    g(x_1, x_2) = ½x_1² + ¼x_2⁴ − ½x_2²,    ∇g(x_1, x_2) = [ x_1,  x_2(x_2² − 1) ]^T

[Figure: surface plot of g(x_1, x_2) over x_1, x_2 ∈ [−2, 2]]

The gradient method does not converge to a local minimum if x^(0)_2 = 0: the second component of the gradient is then zero, so every iterate stays on the line x_2 = 0 and the method can only approach the saddle point (0, 0).

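Running the gradient_method sketch from above on this g makes the claim concrete (step size, iteration count, and starting points are illustrative):

    import numpy as np

    # gradient of g(x1, x2) = 0.5*x1**2 + 0.25*x2**4 - 0.5*x2**2
    grad = lambda x: np.array([x[0], x[1] * (x[1]**2 - 1.0)])

    print(gradient_method(grad, x0=[2.0, 0.0], t=0.5, maxiter=200))  # ~[0, 0]: the saddle point
    print(gradient_method(grad, x0=[2.0, 0.1], t=0.5, maxiter=200))  # ~[0, 1]: a local minimizer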


Lipschitz continuity

g : R → R is Lipschitz continuous if there exists a constant L such that

    |g(y) − g(x)| ≤ L|y − x| for all x, y

L is referred to as a Lipschitz constant

Interpretation: left and right derivatives belong to [−L,L]


Lipschitz continuity

Multivariate functions

F : R^n → R^m is Lipschitz continuous with constant L if

    ‖F(y) − F(x)‖ ≤ L‖y − x‖ for all x, y

Lipschitz continuous gradient

∇g : R^n → R^n is Lipschitz continuous with constant L if

    ‖∇g(y) − ∇g(x)‖ ≤ L‖y − x‖ for all x, y

for g : R^n → R continuously differentiable


Lipschitz continuous gradient (I)

Suppose ∇g is Lipschitz continuous and let φ(τ) = g(x + τ(y − x))

Newton–Leibniz rule

    ∫₀¹ φ′(τ) dτ = φ(1) − φ(0)

and since φ′(τ) = ∇g(x + τ(y − x))^T (y − x), this becomes

    ∫₀¹ ∇g(x + τ(y − x))^T (y − x) dτ = g(y) − g(x)


Lipschitz continuous gradient (II)

Let p = y − x

    g(y) = g(x) + ∇g(x)^T p + ∫₀¹ (∇g(x + τp) − ∇g(x))^T p dτ

         ≤ g(x) + ∇g(x)^T p + ‖p‖₂ ∫₀¹ ‖∇g(x + τp) − ∇g(x)‖₂ dτ

using the Cauchy–Schwarz inequality. The Lipschitz property gives
‖∇g(x + τp) − ∇g(x)‖₂ ≤ L‖x + τp − x‖₂ = τL‖p‖₂, so

    g(x + p) ≤ g(x) + ∇g(x)^T p + ‖p‖₂ ∫₀¹ τL‖p‖₂ dτ

which yields the quadratic upper bound

    g(y) ≤ g(x) + ∇g(x)^T (y − x) + (L/2)‖y − x‖₂²

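A quick numerical sanity check of the quadratic upper bound: for g(x) = ½‖Ax − b‖₂², the gradient is Lipschitz with constant L = ‖A‖₂², so the bound must hold for any pair x, y (random illustrative data):

    import numpy as np

    rng = np.random.default_rng(0)
    A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)

    g = lambda x: 0.5 * np.linalg.norm(A @ x - b)**2
    grad = lambda x: A.T @ (A @ x - b)
    L = np.linalg.norm(A, 2)**2              # spectral norm squared = lambda_max(A^T A)

    x, y = rng.standard_normal(5), rng.standard_normal(5)
    bound = g(x) + grad(x) @ (y - x) + 0.5 * L * np.linalg.norm(y - x)**2
    print(g(y) <= bound + 1e-12)             # True: the quadratic upper bound holds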


Majorization minimization

A function ψ(y;x) is said to majorize g at x if

i. ψ(y;x) ≥ g(y) for all y

ii. ψ(x;x) = g(x)

ψ minorizes g provided that −ψ majorizes −g

Majorization minimization

    x^(k+1) = argmin_x ψ(x; x^(k))

yields descent method

    g(x^(k)) = ψ(x^(k); x^(k)) ≥ ψ(x^(k+1); x^(k)) ≥ g(x^(k+1))

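A generic sketch of the loop, assuming the caller supplies a routine argmin_psi that minimizes ψ(·; x^(k)) for a given x^(k) (the name and interface are illustrative):

    def mm(argmin_psi, x0, maxiter=100):
        """Majorization-minimization: x^(k+1) = argmin_x psi(x; x^(k))."""
        x = x0
        for _ in range(maxiter):
            x = argmin_psi(x)   # by the descent property, g can only decrease
        return x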


Functions with Lipschitz continuous gradient

Lipschitz property (quadratic upper bound) yields majorization

    g(y) ≤ g(x) + ∇g(x)^T (y − x) + (L/2)‖y − x‖₂²  =: ψ(y; x)

Minimizing the majorizer (minimizing the right-hand side with respect to y) yields

    x^(k+1) = x^(k) − (1/L) ∇g(x^(k))

with descent property

    g(x^(k+1)) ≤ g(x^(k)) − (1/(2L)) ‖∇g(x^(k))‖₂²

• possible to show that ‖∇g(x^(k))‖₂ → 0 as k → ∞
• x^(k) may not converge to a local minimum!

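For a least-squares objective the 1/L step and its descent property can be checked directly (same kind of illustrative random data as before):

    import numpy as np

    rng = np.random.default_rng(0)
    A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
    g = lambda x: 0.5 * np.linalg.norm(A @ x - b)**2
    grad = lambda x: A.T @ (A @ x - b)
    L = np.linalg.norm(A, 2)**2

    x = rng.standard_normal(5)
    for _ in range(10):
        x_new = x - grad(x) / L    # x^(k+1) = x^(k) - (1/L) * grad(x^(k))
        # descent property: g decreases by at least ||grad||^2 / (2L)
        assert g(x_new) <= g(x) - np.linalg.norm(grad(x))**2 / (2 * L) + 1e-10
        x = x_new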


Convex sets

C ⊆ R^n is convex if for all x, y ∈ C and θ ∈ [0, 1]

    θx + (1 − θ)y ∈ C

[Figures: a convex set and a nonconvex set, each marked with two points x and y]


Convex functions

g : C → R is convex if for all x, y ∈ C and θ ∈ [0, 1]

    g(θx + (1 − θ)y) ≤ θg(x) + (1 − θ)g(y)

[Figure: graph of a convex function; the chord joining (x, g(x)) and (y, g(y)) lies above the graph]

• domain C ⊆ R^n must be a convex set
• g is concave if −g is convex


Strict and strong convexity

g is strictly convex if for all x, y ∈ dom g, x ≠ y, and θ ∈ (0, 1)

    g(θx + (1 − θ)y) < θg(x) + (1 − θ)g(y)

g is strongly convex if for all x, y ∈ dom g and θ ∈ [0, 1]

    g(θx + (1 − θ)y) ≤ θg(x) + (1 − θ)g(y) − (µ/2) θ(1 − θ) ‖x − y‖₂²

µ > 0 is called the modulus of strong convexity

strongly convex ⊂ strictly convex ⊂ convex


First-order conditions for convexity

Differentiable g is convex if and only if dom g is convex and

    g(y) ≥ g(x) + ∇g(x)^T (y − x)

for all x, y ∈ dom g

[Figure: graph of g and the tangent g(x) + ∇g(x)^T (y − x) at x, lying below the graph]

• first-order Taylor approximation is a global underestimator
• ∇g(x) = 0 implies that x is a global minimizer of g


First-order conditions for strict and strong convexity

Suppose g is differentiable and dom g is convex

g is strictly convex if and only if for all x, y ∈ dom g, x ≠ y

    g(y) > g(x) + ∇g(x)^T (y − x)

∇g(x) = 0 implies that x is the unique global minimizer of g

g is strongly convex with modulus µ > 0 if for all x, y ∈ dom g

    g(y) ≥ g(x) + ∇g(x)^T (y − x) + (µ/2)‖y − x‖₂²

provides global quadratic underestimator


Second-order conditions for convexity

Twice differentiable g is convex if and only if dom g is convex and

    ∇²g(x) ⪰ 0, ∀x ∈ dom g

Strict convexity (sufficient condition)

    ∇²g(x) ≻ 0, ∀x ∈ dom g

Strong convexity (necessary and sufficient condition)

    ∇²g(x) ⪰ µI, ∀x ∈ dom g

• ∇²g(x) ⪰ µI means that ∇²g(x) − µI is positive semidefinite
• implies that g(x) − (µ/2)‖x‖₂² is convex

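For a quadratic these conditions reduce to eigenvalue checks. For example, g(x) = ½‖Ax − b‖₂² + (γ/2)‖x‖₂² has the constant Hessian A^T A + γI, and its smallest eigenvalue certifies strong convexity (illustrative data):

    import numpy as np

    rng = np.random.default_rng(0)
    A, gamma = rng.standard_normal((20, 5)), 0.1

    H = A.T @ A + gamma * np.eye(5)     # constant Hessian
    mu = np.linalg.eigvalsh(H).min()    # smallest eigenvalue of H
    print(mu >= gamma)                  # True: A^T A is PSD, so H ⪰ gamma*I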


Gradient descent — rate of convergence

g : R^n → R differentiable with Lipschitz gradient

• convex g and constant step size t = 1/L:

    g(x^(k)) − g(x*) ≤ 2L‖x^(0) − x*‖₂² / (k + 4)

• strongly convex g and constant step size t = 2/(µ + L):

    g(x^(k)) − g(x*) ≤ (L/2) ((Q_g − 1)/(Q_g + 1))^(2k) ‖x^(0) − x*‖₂²

    ‖x^(k) − x*‖₂ ≤ ((Q_g − 1)/(Q_g + 1))^k ‖x^(0) − x*‖₂

where Q_g = L/µ

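A small experiment for a strongly convex quadratic, checking the observed error against the ((Q_g − 1)/(Q_g + 1))^k bound with t = 2/(µ + L) (illustrative data; x* obtained by a direct solve):

    import numpy as np

    rng = np.random.default_rng(0)
    A, b, gamma = rng.standard_normal((20, 5)), rng.standard_normal(20), 0.1

    H = A.T @ A + gamma * np.eye(5)            # Hessian of the quadratic
    evals = np.linalg.eigvalsh(H)
    mu, L = evals.min(), evals.max()
    x_star = np.linalg.solve(H, A.T @ b)       # exact minimizer

    x = np.zeros(5)
    t, q = 2 / (mu + L), (L - mu) / (L + mu)   # q = (Q_g - 1)/(Q_g + 1)
    for k in range(1, 51):
        x = x - t * (H @ x - A.T @ b)          # gradient step
        # ||x^(0) - x*|| = ||x*|| since x^(0) = 0; error stays within the bound
        assert np.linalg.norm(x - x_star) <= q**k * np.linalg.norm(x_star) + 1e-12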


Backtracking line search

Parameters α ∈ (0, 1/2) and β ∈ (0, 1)

Start with t = 1 and repeat t := βt until

    g(x + t∆x) < g(x) + αt ∇g(x)^T ∆x

Sufficient decrease condition (t ≤ t₀)

[Figure: g(x + t∆x) as a function of t, together with the lines g(x) + t∇g(x)^T ∆x and g(x) + tα∇g(x)^T ∆x; the sufficient decrease condition holds for t ≤ t₀]

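A direct Python sketch of this procedure; g is the objective, dx the search direction, and grad_dot_dx the directional derivative ∇g(x)^T ∆x (all names illustrative):

    def backtracking(g, x, dx, grad_dot_dx, alpha=0.25, beta=0.5):
        """Shrink t until g(x + t*dx) < g(x) + alpha*t*grad(x)^T dx."""
        t = 1.0
        while g(x + t * dx) >= g(x) + alpha * t * grad_dot_dx:
            t *= beta
        return t

For the gradient method one takes ∆x = −∇g(x), so grad_dot_dx = −‖∇g(x)‖₂²; the loop terminates for any descent direction since grad_dot_dx < 0.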


Regularized least-squares (I)

    minimize g(x) = ½‖Ax − b‖₂² + (γ/2)‖x‖₂²

∇g(x) = A^T(Ax − b) + γx,    ∇²g(x) = A^T A + γI

Gradient descent

    x^(k+1) = x^(k) − t(A^T(Ax^(k) − b) + γx^(k))
            = (I − t(A^T A + γI))x^(k) + tA^T b
            = (I − t∇²g(x^(k)))x^(k) + tA^T b

contraction map if ‖I − t∇²g(x^(k))‖₂ < 1 or, equivalently,

    |1 − t(‖A‖₂² + γ)| < 1

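A minimal sketch of this iteration with the step t = 1/(‖A‖₂² + γ), which satisfies the contraction condition above (illustrative random data):

    import numpy as np

    rng = np.random.default_rng(0)
    A, b, gamma = rng.standard_normal((20, 5)), rng.standard_normal(20), 0.1

    t = 1.0 / (np.linalg.norm(A, 2)**2 + gamma)       # t = 1/L
    x = np.zeros(5)
    for _ in range(5000):
        x = x - t * (A.T @ (A @ x - b) + gamma * x)   # gradient step

    # compare with the closed-form regularized least-squares solution
    x_star = np.linalg.solve(A.T @ A + gamma * np.eye(5), A.T @ b)
    print(np.linalg.norm(x - x_star))                 # small: linear convergence to x_star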


Regularized least-squares (II)

Lipschitz constant L = ‖A‖₂² + γ

can be estimated iteratively via power iteration:

    let x ≠ 0 be a random vector
    for k = 1, …, M
        z ← Ax / ‖x‖₂
        x ← A^T z
    end
    L̂ = ‖x‖₂ + γ ≤ ‖A‖₂² + γ

• modulus of strong convexity: µ ≥ γ > 0
• linear rate of convergence: Q_g = L/µ ≤ (‖A‖₂² + γ)/γ

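A runnable NumPy version of the power iteration above; note that L̂ approaches ‖A‖₂² + γ from below, so in practice one either runs enough iterations or adds a small safety margin before using 1/L̂ as a step size:

    import numpy as np

    def estimate_L(A, gamma, M=50, seed=0):
        """Estimate L = ||A||_2^2 + gamma by power iteration on A^T A."""
        rng = np.random.default_rng(seed)
        x = rng.standard_normal(A.shape[1])   # random nonzero starting vector
        for _ in range(M):
            z = A @ x / np.linalg.norm(x)
            x = A.T @ z
        return np.linalg.norm(x) + gamma

This only needs products with A and A^T, which matters when A is a large, possibly implicitly represented, tomography system matrix.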
