COM S 672: Advanced Topics in Computational
Models of Learning – Optimization for Learning
Lecture Note 9: Higher-Order Methods – II
Jia (Kevin) Liu
Assistant Professor
Department of Computer Science
Iowa State University, Ames, Iowa, USA
Fall 2017
JKL (CS@ISU) COM S 672: Lecture 9 1 / 26
Outline
In this lecture:
Quasi-Newton methods
Interior-point methods
Quasi-Newton Theory
Key idea: Maintain an approximation to the Hessian that is filled in using information gained on successive steps, and generate H-conjugate directions
Suppose f(x) = cᵀx + (1/2) xᵀHx, where H ⪰ 0
Define p_k = x_{k+1} − x_k and q_k = ∇f(x_{k+1}) − ∇f(x_k). Note that
    H p_k = H(x_{k+1} − x_k) = (c + H x_{k+1}) − (c + H x_k) = q_k
Construct an estimate B_k for H satisfying B_k p_j = q_j for all j thus far. Thus:
    H⁻¹ B_k p_j = H⁻¹ q_j = p_j
This implies (H⁻¹B_k) p_j = p_j, ∀ j = 1, ..., k−1, i.e., p_1, ..., p_{k−1} are
eigenvectors of H⁻¹B_k with unit eigenvalues
Hence, (H⁻¹B_{n+1}) p_k = p_k, ∀ k = 1, ..., n
[BSS Ch. 8.8]
Quasi-Newton Theory
Suppose that p_1, ..., p_n are linearly independent
Denote P = [p_1 p_2 ··· p_n] ∈ R^{n×n}. Then we have
    (H⁻¹B_{n+1}) P = P,
which implies:
    H⁻¹B_{n+1} = I, i.e., B_{n+1} = H
Thus, the goal of quasi-Newton methods is to find a sequence {B_k} of
approximate Hessians satisfying, for all k,
    B_k p_j = q_j, ∀ j = 1, ..., k−1,
which is termed the quasi-Newton equation or secant equation
Once B_k is determined, find d_k satisfying B_k d_k = −∇f(x_k). It can be
shown that the generated d_1, ..., d_n are H-conjugate
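The relation H p_k = q_k can be checked numerically on a small quadratic; a minimal sketch (the quadratic and the two points are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
M = rng.standard_normal((n, n))
H = M @ M.T + n * np.eye(n)          # H positive definite
c = rng.standard_normal(n)

grad = lambda x: c + H @ x           # gradient of f(x) = c@x + 0.5*x@H@x

x0, x1 = rng.standard_normal(n), rng.standard_normal(n)
p = x1 - x0                          # p_k = x_{k+1} - x_k
q = grad(x1) - grad(x0)              # q_k = grad f(x_{k+1}) - grad f(x_k)

assert np.allclose(H @ p, q)         # H p_k = q_k holds exactly for quadratics
```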
Quasi-Newton Theory
From the secant equations, designing quasi-Newton methods boils down to:
- Given some B_k ⪰ 0 such that B_k p_j = q_j, ∀ j = 1, ..., k−1
- Want to find a B_{k+1} ⪰ 0 such that B_{k+1} p_j = q_j, ∀ j = 1, ..., k
Key idea: Try B_{k+1} = B_k + C_k for some correction matrix C_k
- This implies B_k p_j + C_k p_j = q_j, ∀ j = 1, ..., k, i.e.,
      C_k p_j = 0, for j = 1, ..., k−1
      C_k p_k = q_k − B_k p_k
These two equations give rise to a variety of Quasi-Newton methods:
- Broyden family (Broyden–Fletcher–Goldfarb–Shanno (BFGS) update)
- Davidon–Fletcher–Powell method (dual construct of the Broyden family)
- See [BSS Ch. 8.8] for an excellent treatment of quasi-Newton theory
Broyden-Fletcher-Goldfarb-Shanno (BFGS) Update
Try the following correction matrix C_k^BFGS:
    C_k^BFGS = (q_k q_kᵀ)/(q_kᵀ p_k) − (B_k p_k p_kᵀ B_k)/(p_kᵀ B_k p_k)
Obtained independently by Broyden, Fletcher, Goldfarb, and Shanno in 1970,
hence the name BFGS
Highly successful due to its efficiency & robustness; implemented in many
numerical optimizers (e.g., MATLAB, R, GNU C regression libraries, ...)
Implementing BFGS in Practice
Having found B_{k+1}, find d_{k+1} by solving B_{k+1} d_{k+1} = −∇f(x_{k+1}), i.e.,
    d_{k+1} = −B_{k+1}⁻¹ ∇f(x_{k+1})
Often more convenient to update the inverse series {D_k} ≜ {B_k⁻¹} directly:
- Let D_1 = B_1⁻¹ = I.
- In iteration k, given D_k, compute D_{k+1} as follows:
      D_{k+1} = [B_{k+1}]⁻¹ = [B_k + C_k^BFGS]⁻¹ = [B_k + a_1 b_1ᵀ + a_2 b_2ᵀ]⁻¹,   (1)
  where a_1 = q_k/(q_kᵀ p_k), b_1 = q_k, a_2 = −(B_k p_k)/(p_kᵀ B_k p_k), and b_2 = B_k p_k
- Eq. (1) shows that B_{k+1} can be obtained from B_k with a rank-two update
Implementing BFGS in Practice
Therefore, D_{k+1} can be computed by two sequential applications of the
Sherman–Morrison–Woodbury (SMW) matrix inverse formula:
    [A + a bᵀ]⁻¹ = A⁻¹ − (A⁻¹ a bᵀ A⁻¹)/(1 + bᵀ A⁻¹ a)
Note: In general, the SMW inverse formula is advantageous when A⁻¹ is
known or cheap to compute (e.g., diagonal, sparse, structured, etc.)
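The rank-one SMW formula translates directly into code; a minimal sketch (the test matrix is an arbitrary well-conditioned example, and the formula assumes 1 + bᵀA⁻¹a ≠ 0):

```python
import numpy as np

def smw_rank1_inverse(A_inv, a, b):
    """Return (A + a b^T)^{-1} given A^{-1}, via Sherman-Morrison-Woodbury."""
    Ainv_a = A_inv @ a
    bT_Ainv = b @ A_inv
    return A_inv - np.outer(Ainv_a, bT_Ainv) / (1.0 + b @ Ainv_a)

# check against direct inversion on a random, diagonally dominant matrix
rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n)) + n * np.eye(n)
a, b = rng.standard_normal(n), rng.standard_normal(n)
assert np.allclose(smw_rank1_inverse(np.linalg.inv(A), a, b),
                   np.linalg.inv(A + np.outer(a, b)))
```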
As a result, we obtain the following BFGS update for the sequence {D_k}:
    D_{k+1} = D_k + (1 + (q_kᵀ D_k q_k)/(p_kᵀ q_k)) (p_k p_kᵀ)/(p_kᵀ q_k)
              − (D_k q_k p_kᵀ + p_k q_kᵀ D_k)/(p_kᵀ q_k)
            ≜ D_k + C̄_k^BFGS
Can prove superlinear local convergence for BFGS (and other quasi-Newton
methods): ‖x_{k+1} − x*‖/‖x_k − x*‖ → 0. Not as fast as Newton, but fast!
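The D-update can be implemented and checked against the secant equation D_{k+1} q_k = p_k; a minimal sketch with random test vectors (not from any particular objective), assuming the curvature condition p_kᵀ q_k > 0:

```python
import numpy as np

def bfgs_inverse_update(D, p, q):
    """BFGS update of the inverse Hessian approximation D = B^{-1}."""
    pq = p @ q                        # p_k^T q_k (curvature; assumed > 0)
    Dq = D @ q
    return (D
            + (1.0 + (q @ Dq) / pq) * np.outer(p, p) / pq
            - (np.outer(Dq, p) + np.outer(p, Dq)) / pq)

rng = np.random.default_rng(2)
n = 6
D = np.eye(n)
p = rng.standard_normal(n)
q = p + 0.1 * rng.standard_normal(n)  # keeps p@q > 0 (curvature condition)
D_new = bfgs_inverse_update(D, p, q)
assert np.allclose(D_new @ q, p)      # secant equation: D_{k+1} q_k = p_k
```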
L-BFGS
In BFGS (and other quasi-Newton methods), we need n × n storage space to
maintain the approximate Hessian B_k (or approximate inverse D_k)
Still expensive when n is large. Enter the limited-memory BFGS (L-BFGS)!
L-BFGS doesn't store B_k or D_k. Rather, it only keeps track of the p_k and q_k
from the last few iterations (say 5 to 10), and reconstructs matrices as needed
- Take an initial B_0 or D_0 and assume m steps have been taken since
- Compute B_k p_k via a series of inner and outer products with the vectors
  p_{k−j} and q_{k−j} from the last m iterations, j = 1, ..., m−1
Attractive for problems where n is large (typical in machine learning
problems). Requires 2mn storage and O(mn) linear algebra operations, plus the
cost of function and gradient evaluations, and line search
No superlinear convergence proof, but good behavior has been observed in
many applications (see [Liu & Nocedal, '89], [Nocedal & Wright, Chap. 7.2])
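The reconstruction is usually organized as the standard two-loop recursion, which applies D_k to a vector using only the stored (p, q) pairs; a sketch assuming an initial D_0 = γI (here γ = 1) and curvature p_jᵀ q_j > 0 for every stored pair:

```python
import numpy as np

def lbfgs_direction(grad, pairs, gamma=1.0):
    """Two-loop recursion: returns D_k @ grad using only the stored (p, q)
    pairs (oldest first), never forming the n-by-n matrix D_k."""
    d = grad.copy()
    alphas = []
    for p, q in reversed(pairs):                        # first loop: newest to oldest
        alpha = (p @ d) / (q @ p)
        alphas.append(alpha)
        d -= alpha * q
    d *= gamma                                          # apply initial D_0 = gamma * I
    for (p, q), alpha in zip(pairs, reversed(alphas)):  # second loop: oldest first
        beta = (q @ d) / (q @ p)
        d += (alpha - beta) * p
    return d

# sanity check against the dense inverse-BFGS update
def dense_update(D, p, q):
    pq = p @ q
    Dq = D @ q
    return (D + (1.0 + (q @ Dq) / pq) * np.outer(p, p) / pq
              - (np.outer(Dq, p) + np.outer(p, Dq)) / pq)

rng = np.random.default_rng(3)
n, m = 8, 3
pairs, D = [], np.eye(n)
for _ in range(m):
    p = rng.standard_normal(n)
    q = p + 0.1 * rng.standard_normal(n)                # keeps curvature p@q > 0
    pairs.append((p, q))
    D = dense_update(D, p, q)
g = rng.standard_normal(n)
assert np.allclose(lbfgs_direction(g, pairs), D @ g)
```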
Interior-Point Methods
Consider the following constrained minimization problem:
Minimize   f(x)
subject to g_i(x) ≤ 0, i = 1, ..., m
           Ax = b
where:
- f and g_i are convex, twice continuously differentiable
- A ∈ R^{p×n} with rank(A) = p
- Assume that the optimal value p* is finite and attained
- Assume that the problem is strictly feasible (Slater's condition), i.e., ∃ x̃ with
      x̃ ∈ dom f, g_i(x̃) < 0, i = 1, ..., m, Ax̃ = b,
  hence strong duality holds and the dual optimum is attained
Logarithmic Barrier Function
Reformulate the problem via an indicator function:
    Minimize   f(x) + Σ_{i=1}^m I(g_i(x))
    subject to Ax = b,
where I(u) = 0 if u ≤ 0, I(u) = ∞ otherwise (indicator function of R₋)
Consider the approximation through the logarithmic barrier:
    Minimize   f(x) − (1/μ) Σ_{i=1}^m log(−g_i(x))
    subject to Ax = b
where μ > 0 is a parameter
The Log Barrier Approximate Problem
An equality constrained problem
For μ > 0, −(1/μ) log(−u) is a smooth approximation of the indicator I(·)
The approximation improves as μ → ∞
Properties of Log Barrier Function
φ(x) = −Σ_{i=1}^m log(−g_i(x)),   dom φ = {x | g_i(x) < 0, i = 1, ..., m}
Convex (following composition rules of convexity)
Twice continuously differentiable, with derivatives:
    ∇φ(x) = −Σ_{i=1}^m (1/g_i(x)) ∇g_i(x)
    ∇²φ(x) = Σ_{i=1}^m (1/g_i(x)²) ∇g_i(x) ∇g_i(x)ᵀ − Σ_{i=1}^m (1/g_i(x)) ∇²g_i(x)
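For linear inequalities g_i(x) = a_iᵀx − b_i (so ∇g_i(x) = a_i and ∇²g_i(x) = 0), these formulas specialize to expressions in the residuals b_i − a_iᵀx; a sketch with a finite-difference check of the gradient:

```python
import numpy as np

def barrier_grad_hess(A, b, x):
    """Gradient and Hessian of phi(x) = -sum(log(b_i - a_i@x)),
    i.e., g_i(x) = a_i@x - b_i, grad g_i = a_i, hess g_i = 0."""
    r = b - A @ x                             # -g_i(x) > 0 on the domain
    grad = A.T @ (1.0 / r)                    # sum a_i / (b_i - a_i@x)
    hess = A.T @ ((1.0 / r**2)[:, None] * A)  # sum a_i a_i^T / g_i^2
    return grad, hess

# finite-difference check at a strictly feasible point
rng = np.random.default_rng(4)
m, n = 7, 3
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)
b = A @ x + 1.0                               # ensures b - A@x = 1 > 0
phi = lambda z: -np.sum(np.log(b - A @ z))
g, H = barrier_grad_hess(A, b, x)
eps = 1e-6
g_fd = np.array([(phi(x + eps * e) - phi(x - eps * e)) / (2 * eps)
                 for e in np.eye(n)])
assert np.allclose(g, g_fd, atol=1e-4)
```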
Central Path
For μ > 0, define x*(μ) as the solution of
    Minimize   μ f(x) + φ(x)
    subject to Ax = b
Assume that x*(μ) exists and is unique for all μ > 0
The central path is defined as {x*(μ) | μ > 0}
Example: Central path for an LP
[Figure: central path for an LP — the hyperplane cᵀx = cᵀx*(μ) is tangent to the
level curve of φ through x*(μ)]
Dual Points on Central Path
For x = x*(μ), if there exists a w such that
    μ ∇f(x) − Σ_{i=1}^m (1/g_i(x)) ∇g_i(x) + Aᵀw = 0,   Ax = b
Then, x*(μ) minimizes the Lagrangian
    L(x, u*(μ), v*(μ)) = f(x) + Σ_{i=1}^m u_i*(μ) g_i(x) + v*(μ)ᵀ(Ax − b),
where u_i*(μ) ≜ 1/(−μ g_i(x*(μ))) and v*(μ) ≜ w/μ
This confirms the intuitive idea that f(x*(μ)) → p* as μ → ∞, since:
    p* ≥ Θ(u*(μ), v*(μ)) = L(x*(μ), u*(μ), v*(μ)) = f(x*(μ)) − m/μ,
which implies f(x*(μ)) − p* ≤ m/μ ↓ 0 as μ → ∞
Interpretation as Perturbed KKT System
The primal-dual solutions x = x*(μ), u = u*(μ), and v = v*(μ) satisfy:
    (ST):     ∇f(x) + Σ_{i=1}^m u_i ∇g_i(x) + Aᵀv = 0
    (1/μ-CS): u_i g_i(x) = −1/μ, i = 1, ..., m
    (PF):     g_i(x) ≤ 0, i = 1, ..., m,   Ax = b
    (DF):     u ≥ 0, v unconstrained
That is, the only difference from the KKT conditions is that (1/μ-CS) replaces
(CS): u_i g_i(x) = 0
Force Field Interpretation
Consider the following “centering” problem (without equality constraints)
    Minimize μ f(x) − Σ_{i=1}^m log(−g_i(x))
It admits the following force field interpretation:
μ f(x) is the potential of the force field F_0(x) = −μ ∇f(x)
−log(−g_i(x)) is the potential of the force field F_i(x) = (1/g_i(x)) ∇g_i(x)
The forces balance at x*(μ):
    F_0(x*(μ)) + Σ_{i=1}^m F_i(x*(μ)) = 0
Force Field Interpretation
Example:  Minimize   cᵀx
          subject to a_iᵀx ≤ b_i, i = 1, ..., m
The objective force field is constant: F_0(x) = −μc
The constraint force field decays as the inverse distance to the constraint hyperplane:
    F_i(x) = −a_i/(b_i − a_iᵀx),   ‖F_i(x)‖₂ = 1/dist(x, H_i)
where H_i = {x | a_iᵀx = b_i}
The Barrier Method
1. Initialization: a strictly feasible x (interior point), μ = μ_0 > 0, γ > 1,
   tolerance ε > 0.
2. Centering step: Compute x*(μ) by minimizing μf + φ, subject to Ax = b.
   Update x = x*(μ).
3. Stop if m/μ < ε. Otherwise, let μ = γμ and go to Step 2.
Remarks:
Terminates with f(x) − p* ≤ ε (following from f(x*(μ)) − p* ≤ m/μ)
Centering is usually done using Newton's method, starting at the current x
Choice of γ involves a trade-off: a larger γ means fewer outer iterations but
more inner (Newton) iterations; typical values: γ ∈ [10, 20]
As μ gets larger (nearer the optimal solution), it becomes harder and harder
for Newton's method to converge (due to ill-conditioning with large μ)
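A compact sketch of the barrier method for an inequality-only LP (no equality constraints, so each centering step is a plain damped-Newton iteration); the box example, tolerances, and backtracking constants are illustrative choices:

```python
import numpy as np

def newton_center(c, A, b, x, mu, tol=1e-9, max_iter=100):
    """Damped Newton on  mu*c@x - sum(log(b - A@x))  from a strictly feasible x."""
    f = lambda z: mu * (c @ z) - np.sum(np.log(b - A @ z))
    for _ in range(max_iter):
        r = b - A @ x
        g = mu * c + A.T @ (1.0 / r)                   # gradient
        H = A.T @ ((1.0 / r**2)[:, None] * A)          # Hessian
        dx = np.linalg.solve(H, -g)
        lam2 = -g @ dx                                 # Newton decrement squared
        if lam2 / 2.0 <= tol:
            break
        t = 1.0
        while np.any(b - A @ (x + t * dx) <= 0):       # stay strictly feasible
            t *= 0.5
        while f(x + t * dx) > f(x) - 0.25 * t * lam2:  # backtracking (Armijo)
            t *= 0.5
        x = x + t * dx
    return x

def barrier_method(c, A, b, x0, mu0=1.0, gamma=10.0, eps=1e-6):
    x, mu, m = x0, mu0, A.shape[0]
    while True:
        x = newton_center(c, A, b, x, mu)              # centering step
        if m / mu < eps:                               # duality-gap bound m/mu
            return x
        mu *= gamma

# illustrative LP: minimize x1 + x2 over the box 0 <= x <= 1 (optimum at origin)
c = np.array([1.0, 1.0])
A = np.vstack([np.eye(2), -np.eye(2)])                 # encodes x <= 1 and -x <= 0
b = np.array([1.0, 1.0, 0.0, 0.0])
x_star = barrier_method(c, A, b, x0=np.array([0.5, 0.5]))
assert np.allclose(x_star, [0.0, 0.0], atol=1e-4)
```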
Note: it is not necessary to compute x*(μ) exactly in the centering step.
Convergence Analysis
Number of outer (centering) iterations:
    ⌈ log(m/(ε μ_0)) / log γ ⌉
plus the initial centering step (to compute x*(μ_0))
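Plugging illustrative numbers into the bound (m = 100 constraints, ε = 10⁻⁶, μ_0 = 1, γ = 15):

```python
import math

m, eps, mu0, gamma = 100, 1e-6, 1.0, 15.0
outer = math.ceil(math.log(m / (eps * mu0)) / math.log(gamma))
# log(1e8)/log(15) is about 6.8, so 7 outer iterations
# plus the initial centering step
assert outer == 7
```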
Convergence of the centering problem
    Minimize μ f(x) + φ(x)
follows the convergence analysis of Newton's method:
- μf + φ must have closed sublevel sets for μ ≥ μ_0
- Classical analysis requires strong convexity and a Lipschitz condition
- Analysis via self-concordance requires self-concordance of μf + φ
Feasibility and Phase I Methods
Feasibility problem: Find x such that
    g_i(x) ≤ 0, i = 1, ..., m,   Ax = b                          (2)
Phase I: Computes a strictly feasible starting point for the barrier method
    Minimize_{x,s}  s
    subject to      g_i(x) ≤ s, i = 1, ..., m                    (3)
                    Ax = b
- If (x, s) is feasible with s < 0, then x is strictly feasible for (2)
- If the optimal value p̄* of (3) is positive, then (2) is infeasible
- If p̄* = 0 and attained, then problem (2) is feasible (but not strictly)
- If p̄* = 0 and not attained, then problem (2) is infeasible
Primal-Dual Interior-Point Methods
Primal-dual interior-point methods are another class of interior-point
methods, powerful for linear and convex quadratic programming
Consider the following linearly constrained quadratic program:
    Minimize   cᵀx + (1/2) xᵀQx
    subject to Ax = b, x ≥ 0
where Q is symmetric PSD (LP is a special case with Q = 0)
The KKT conditions are that there exist u and v such that:
    Qx + c − Aᵀu − v = 0,   Ax = b,   (x, v) ≥ 0,   x_i v_i = 0, i = 1, ..., n
Defining
    X ≜ Diag(x_1, ..., x_n),   V ≜ Diag(v_1, ..., v_n),
we can rewrite the last condition as XVe = 0, where e = [1, 1, ..., 1]ᵀ
Primal-Dual Interior-Point Methods
Thus, the KKT conditions can be rewritten as a square system of constrained,
nonlinear equations:
    [Qx + c − Aᵀu − v]
    [Ax − b          ] = 0,   (x, v) ≥ 0
    [XVe             ]
Primal-dual interior-point methods generate iterates (x_k, u_k, v_k) with:
- (x_k, v_k) > 0 (i.e., interior)
- Each step (Δx_k, Δu_k, Δv_k) is a Newton step on a perturbed version of the
  equations (the perturbation eventually goes to zero)
- A step size α_k is used to maintain (x_{k+1}, v_{k+1}) > 0. Set
      (x_{k+1}, u_{k+1}, v_{k+1}) = (x_k, u_k, v_k) + α_k (Δx_k, Δu_k, Δv_k)
Primal-Dual Interior-Point Methods
The perturbed Newton step is a linear system:
    [Q    −Aᵀ   −I ] [Δx_k]   [r_k^(x)]
    [A     0     0 ] [Δu_k] = [r_k^(u)]
    [V_k   0    X_k] [Δv_k]   [r_k^(v)]
where r_k^(x) = −(Qx_k + c − Aᵀu_k − v_k)
      r_k^(u) = −(Ax_k − b)
      r_k^(v) = −X_k V_k e + σ_k μ_k e
Here, r_k^(x), r_k^(u), r_k^(v) are the current residuals, μ_k = (x_kᵀ v_k)/n is the
current duality gap, and σ_k ∈ (0, 1] is a centering parameter
There is a lot of structure in the system that can be exploited for algorithm
design. More efficient than the barrier method if high accuracy is needed
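One perturbed Newton step can be formed and solved densely as a sketch (real solvers exploit the block structure, e.g., by eliminating Δv); the LP test instance below is constructed to be primal and dual feasible, so the step preserves A Δx = 0:

```python
import numpy as np

def pd_newton_step(Q, A, c, b, x, u, v, sigma=0.1):
    """One primal-dual Newton step on the perturbed KKT system (dense sketch)."""
    n, p = len(x), A.shape[0]
    X, V = np.diag(x), np.diag(v)
    e = np.ones(n)
    mu = (x @ v) / n                          # current duality gap measure
    r_x = -(Q @ x + c - A.T @ u - v)
    r_u = -(A @ x - b)
    r_v = -X @ V @ e + sigma * mu * e
    K = np.block([[Q, -A.T,              -np.eye(n)],
                  [A,  np.zeros((p, p)),  np.zeros((p, n))],
                  [V,  np.zeros((n, p)),  X]])
    sol = np.linalg.solve(K, np.concatenate([r_x, r_u, r_v]))
    return sol[:n], sol[n:n + p], sol[n + p:]

# illustrative LP (Q = 0): min c@x s.t. Ax = b, x >= 0, from an interior point
rng = np.random.default_rng(5)
n, p = 6, 2
A = rng.standard_normal((p, n))
x = np.ones(n)
v = np.ones(n)                                # strictly positive iterate
u = rng.standard_normal(p)
b = A @ x                                     # makes x primal feasible
c = A.T @ u + v                               # makes (u, v) dual feasible (Q = 0)
dx, du, dv = pd_newton_step(np.zeros((n, n)), A, c, b, x, u, v)
assert np.allclose(A @ dx, 0)                 # feasibility preserved to 1st order
```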
See [Wright, ’97] for a description of primal-dual interior-point method
Interior-Point Methods for Learning Problems
Interior-point methods were used early for compressed sensing, regularized leastsquares, SVM:
SVM with hinge loss formulated as a QP, solved with a primal-dual interior-point
method (e.g., [Gertz & Wright, '03], [Fine & Scheinberg, '01], [Ferris &
Munson, '02])
Compressed sensing & LASSO variable selection formulated as a bound-constrained
QP and solved by primal-dual methods; or as an SOCP solved by the barrier
method (e.g., [Candès & Romberg, '05])
However, they were mostly superseded by first-order methods due to the
increasingly large size of machine learning problems
Stochastic gradient descent (low accuracy, simple data access)
Gradient projection with sparsity regularization and prox-gradient in
compressed sensing (require only matrix-vector multiplications)
Perhaps we are just a few clever ideas away from reviving interior-point methods?
Next Class
Sparse/Regularized Optimization
Check BFGS: one can verify directly that
    C_k^BFGS p_k = (q_k q_kᵀ p_k)/(q_kᵀ p_k) − (B_k p_k p_kᵀ B_k p_k)/(p_kᵀ B_k p_k)
                 = q_k − B_k p_k,
as required (a rank-two update).
The BFGS update can also be derived by solving a "minimal change" optimization
problem: choose the new approximation as close as possible to the current one
(in a suitably weighted norm), subject to symmetry and the secant equation.
Barrier method / path-following method: [Fiacco & McCormick '68], the sequential
unconstrained minimization technique (SUMT).
Karmarkar '84 (at Bell Labs): "interior-point method for LP".
Khachiyan '79: "ellipsoid method".
Nesterov & Nemirovski: a special class of barriers (self-concordant) can encode
any convex set, so the number of iterations is bounded by a polynomial in both
the dimension of the problem and the accuracy.