Frank-Wolfe optimization insights in machine learning
Simon Lacoste-Julien, INRIA / École Normale Supérieure
SIERRA Project Team
SMILE – November 4th, 2013
Outline
- Frank-Wolfe optimization
- Frank-Wolfe for structured prediction: links with previous algorithms, block-coordinate extension, results for sequence prediction
- Herding as Frank-Wolfe optimization: extension to weighted Herding, simulations for quadrature
Frank-Wolfe algorithm [Frank, Wolfe 1956]
(aka conditional gradient)
Algorithm for constrained optimization:
    $\min_{\alpha \in \mathcal{M}} f(\alpha)$
where f is convex & continuously differentiable and $\mathcal{M}$ is convex & compact.
FW algorithm – repeat:
1) Find a good feasible direction by minimizing the linearization of f:
    $s_{t+1} \in \arg\min_{s \in \mathcal{M}} \langle \nabla f(\alpha_t), s \rangle$
2) Take a convex step in that direction:
    $\alpha_{t+1} = (1 - \gamma_t)\, \alpha_t + \gamma_t\, s_{t+1}$
Properties: O(1/T) rate; sparse iterates; get the duality gap for free; affine invariant; rate holds even if the linear subproblem is solved approximately.
(A minimal code sketch follows below.)
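As a concrete illustration, here is a minimal Python sketch of the generic Frank-Wolfe loop; the names `grad` and `lmo` and the step-size rule $\gamma_t = 2/(t+2)$ are illustrative choices, not taken from the slides:

```python
import numpy as np

def frank_wolfe(x0, grad, lmo, num_iters=100, tol=1e-6):
    """Generic Frank-Wolfe: x0 is a feasible point, grad(x) the gradient,
    lmo(g) returns argmin_{s in M} <g, s> (the linear minimization oracle)."""
    x = x0.copy()
    for t in range(num_iters):
        g = grad(x)
        s = lmo(g)                       # 1) best vertex of M for the linearized objective
        gap = g.dot(x - s)               # Frank-Wolfe duality gap: a certificate for free
        if gap < tol:
            break
        gamma = 2.0 / (t + 2.0)          # standard step size (line search also possible)
        x = (1 - gamma) * x + gamma * s  # 2) convex step keeps x feasible
    return x

# Toy example: minimize ||x - b||^2 over the probability simplex.
b = np.array([0.1, 0.7, 0.4])
grad = lambda x: 2 * (x - b)
lmo = lambda g: np.eye(len(g))[np.argmin(g)]   # simplex vertex with the smallest gradient coordinate
x_star = frank_wolfe(np.ones(3) / 3, grad, lmo)
```

The linear minimization oracle is the only problem-specific part: for the structured SVM dual discussed below, it becomes loss-augmented decoding.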
Frank-Wolfe: properties
- convex steps => the iterate is a sparse convex combination of vertices:
    $\alpha_T = \rho_0 \alpha_0 + \sum_{t=1}^{T} \rho_t s_t$ where $\sum_{t=0}^{T} \rho_t = 1$
- get a duality gap certificate for free (special case of the Fenchel duality gap); it also converges as O(1/T)!
- only need to solve the linear subproblem *approximately* (additive/multiplicative bound)
- affine invariant!
[see Jaggi ICML 2013]
Block-Coordinate Frank-Wolfe Optimization for Structured SVMs [ICML 2013]
Simon Lacoste-Julien, Martin Jaggi, Patrick Pletscher, Mark Schmidt
Structured SVM optimization
- learn a classifier for structured prediction: $h_w(x) = \arg\max_{y \in \mathcal{Y}} \langle w, \phi(x, y) \rangle$ (decoding)
- structured SVM primal: regularized empirical risk with the structured hinge loss (vs. the binary hinge loss); evaluating the structured hinge loss requires loss-augmented decoding (see the formulation sketched below)
- structured SVM dual: one variable per (example, output) pair -> exponential number of variables!
- primal-dual pair: the dual variables are linked to the primal weight vector.
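The formulas on this slide are not legible in the extraction; for reference, here is the standard structured SVM formulation (the notation $\psi_i(y) = \phi(x_i, y_i) - \phi(x_i, y)$ and task loss $L_i(y)$ follow the BCFW paper and are assumptions on my part):

```latex
\begin{align*}
\text{structured hinge loss:}\quad & H_i(w) \;=\; \max_{y \in \mathcal{Y}_i} \; L_i(y) - \langle w, \psi_i(y) \rangle
  \qquad \text{(evaluating it = loss-augmented decoding)} \\
\text{primal:}\quad & \min_{w} \;\; \tfrac{\lambda}{2} \|w\|^2 \;+\; \tfrac{1}{n} \sum_{i=1}^{n} H_i(w) \\
\text{primal-dual link:}\quad & w \;=\; A\alpha \;=\; \sum_{i,\; y \in \mathcal{Y}_i} \alpha_i(y)\, \frac{\psi_i(y)}{\lambda n},
  \qquad \alpha_i \in \Delta_{|\mathcal{Y}_i|} \text{ (one simplex per example).}
\end{align*}
```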
Structured SVM optimization (2)
Popular approaches:
- stochastic subgradient method [Ratliff et al. 07, Shalev-Shwartz et al. 10]: pros: online! cons: sensitive to the step-size; don't know when to stop
- cutting plane method (SVMstruct) [Tsochantaridis et al. 05, Joachims et al. 09]: pros: automatic step-size; duality gap. cons: batch! -> slow for large n
Our approach: block-coordinate Frank-Wolfe on the dual
-> combines the best of both worlds: online! automatic step-size via analytic line search; duality gap; rates also hold for approximate oracles
Rate: an O(1/K) error after K passes through the data.
Frank-Wolfe for structured SVM
Apply the FW algorithm to the structured SVM dual $\min_{\alpha \in \mathcal{M}} f(\alpha)$:
1) Find a good feasible direction by minimizing the linearization of f. Key insight: the linear subproblem decomposes over the training examples, and solving it amounts to loss-augmented decoding on each example i.
2) Take a convex step in that direction: $\alpha_{t+1} = (1 - \gamma_t)\, \alpha_t + \gamma_t\, s_{t+1}$.
Using the primal-dual link $w = A\alpha$, this FW step on the dual becomes a batch subgradient step in the primal, with $\gamma_t$ chosen by an analytic line search on the quadratic dual (a sketch of the line-search step is given below).
Link between FW and the subgradient method: see [Bach 12].
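The analytic line search has a closed form; the following sketch uses the notation introduced above, where $w$ and $\ell$ denote the primal vector and expected loss corresponding to $\alpha_t$, and $w_s$, $\ell_s$ those of the FW corner $s_{t+1}$ (the derivation is standard for a quadratic dual, so treat the exact expression as a reconstruction rather than a quote from the slides):

```latex
% Dual objective via the primal-dual link: f(\alpha) = \tfrac{\lambda}{2}\|w\|^2 - \ell,
% with w = A\alpha and \ell = b^\top \alpha (the expected task loss under \alpha).
\begin{align*}
\gamma^{*} \;=\; \operatorname*{arg\,min}_{\gamma \in [0,1]} \; f\big((1-\gamma)\,\alpha_t + \gamma\, s_{t+1}\big)
\;=\; \left[ \frac{\lambda\, (w - w_s)^{\top} w \;-\; \ell \;+\; \ell_s}{\lambda\, \|w - w_s\|^{2}} \right]_{[0,1]}
\end{align*}
```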
FW for structured SVM: properties
- running FW on the dual is equivalent to batch subgradient on the primal, but with an adaptive step-size from the analytic line search and a duality-gap stopping criterion
- 'fully corrective' FW on the dual is equivalent to the cutting plane algorithm (SVMstruct): still an O(1/T) rate, but this provides a simpler proof of SVMstruct convergence + guarantees for approximate oracles; not faster than simple FW in our experiments
- BUT: still batch => slow for large n...
Block-Coordinate Frank-Wolfe (new!)
For constrained optimization over a compact product domain $\mathcal{M} = \mathcal{M}^{(1)} \times \dots \times \mathcal{M}^{(n)}$:
pick a block i at random; update only block i with a FW step.
We proved the same O(1/T) rate as batch FW
-> each step is n times cheaper, though
-> the constant can be the same (e.g. for the SVM)
Properties: O(1/T) rate; sparse iterates; duality gap guarantees; affine invariant; rate holds even if the linear subproblem is solved approximately.
For the structured SVM, the block-i linear subproblem is exactly loss-augmented decoding on example i, so each BCFW step touches a single training example while keeping the batch-FW guarantees above (a code sketch follows below).
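A minimal Python sketch of BCFW for the structured SVM in the spirit of the paper (and of the BCFWstruct code linked below); the callables `joint_feature`, `loss_augmented_decode` and `task_loss` are placeholders for your structured model, and the exact bookkeeping is a reconstruction, not the reference implementation:

```python
import numpy as np

def bcfw_ssvm(data, joint_feature, loss_augmented_decode, task_loss,
              lam=0.01, num_passes=20, seed=0):
    """Block-coordinate Frank-Wolfe on the structured SVM dual.
    data: list of (x_i, y_i) pairs.
    joint_feature(x, y): feature vector phi(x, y).
    loss_augmented_decode(w, x, y): argmax_y' L(y, y') + <w, phi(x, y')>.
    task_loss(y, y_hat): structured loss L(y, y_hat)."""
    rng = np.random.default_rng(seed)
    n = len(data)
    d = joint_feature(*data[0]).shape[0]
    w = np.zeros(d)                 # global iterate, w = sum_i w_i (primal-dual link)
    ell = 0.0                       # ell = sum_i ell_i (expected task loss)
    w_i = np.zeros((n, d))          # one block per example (the memory caveat)
    ell_i = np.zeros(n)

    for _ in range(num_passes):
        for i in rng.permutation(n):
            x, y = data[i]
            y_star = loss_augmented_decode(w, x, y)         # 1 oracle call on block i
            psi = joint_feature(x, y) - joint_feature(x, y_star)
            w_s = psi / (lam * n)                           # FW corner for block i
            ell_s = task_loss(y, y_star) / n
            # analytic line search on the quadratic dual, clipped to [0, 1]
            num = lam * (w_i[i] - w_s).dot(w) - ell_i[i] + ell_s
            den = lam * np.sum((w_i[i] - w_s) ** 2) + 1e-12  # guard against zero direction
            gamma = np.clip(num / den, 0.0, 1.0)
            w_i_new = (1 - gamma) * w_i[i] + gamma * w_s
            ell_i_new = (1 - gamma) * ell_i[i] + gamma * ell_s
            w += w_i_new - w_i[i]                            # update the global iterate
            ell += ell_i_new - ell_i[i]
            w_i[i], ell_i[i] = w_i_new, ell_i_new
    return w
```

Storing one `w_i` (or, equivalently, the dual variables) per example is the memory caveat mentioned on the next slide.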
BCFW for structured SVM: properties
- each update requires 1 oracle call (vs. n for SVMstruct), so we get an O(1/K) error after K passes through the data (vs. a full batch of oracle calls per update for SVMstruct)
- advantages over stochastic subgradient: step-sizes by line search -> more robust; duality gap certificate -> know when to stop; guarantees hold for approximate oracles
- implementation: https://github.com/ppletscher/BCFWstruct (almost as simple as the stochastic subgradient method); caveat: need to store one parameter vector per example (or store the dual variables)
- for the binary SVM it reduces to the DCA method [Hsieh et al. 08]; interesting link with prox SDCA [Shalev-Shwartz et al. 12]
More info about constants...
- the BCFW rate is governed by the "product curvature" of f, while the batch FW rate is governed by its "curvature" (the standard definitions are recalled below)
- comparing the constants for the structured SVM; an example where the two constants are the same is an identity Hessian with a cube constraint (so no speed-up per oracle call in that case)
- the extra term in the rate can be removed with line search
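For reference, the standard definitions behind these constants in the Frank-Wolfe literature (this is the usual form from Jaggi ICML 2013 and the BCFW paper; the exact expressions on the slide are not legible, so take this as a reconstruction):

```latex
\begin{align*}
C_f &= \sup_{\substack{x, s \in \mathcal{M},\; \gamma \in [0,1] \\ y = x + \gamma(s - x)}}
       \frac{2}{\gamma^2}\Big( f(y) - f(x) - \langle y - x, \nabla f(x) \rangle \Big)
       && \text{("curvature", batch FW)} \\
C_f^{\otimes} &= \sum_{i=1}^{n} C_f^{(i)}
       && \text{("product curvature", BCFW; } C_f^{(i)} \text{ restricts the direction } s \text{ to block } i\text{)}
\end{align*}
```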
Sidenote: weighted averaging
It is standard to average the iterates of the stochastic subgradient method:
- uniform averaging: $\bar{w}_T = \frac{1}{T} \sum_{t=1}^{T} w_t$
- vs. t-weighted averaging: $\bar{w}_T = \frac{2}{T(T+1)} \sum_{t=1}^{T} t\, w_t$  [L.-J. et al. 12], [Shamir & Zhang 13]
Weighted averaging improves the duality gap for BCFW, and also makes a big difference in test error! (A running-update sketch follows below.)
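Both averages can be maintained online; a small sketch of the running update for the t-weighted average (the recursion is standard, the variable names are my own):

```python
import numpy as np

def t_weighted_average(iterates):
    """Return 2/(T(T+1)) * sum_{t=1}^{T} t * w_t, computed with the
    running update  w_avg <- (1 - 2/(t+1)) * w_avg + (2/(t+1)) * w_t."""
    w_avg = np.zeros_like(iterates[0])
    for t, w_t in enumerate(iterates, start=1):
        rho = 2.0 / (t + 1.0)
        w_avg = (1 - rho) * w_avg + rho * w_t
    return w_avg
```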
Experiments
[Plots: optimization error and test error as a function of the effective number of passes, on the OCR and CoNLL sequence-labeling datasets.]
Surprising test error on CoNLL though: the ordering of the methods on test error is flipped compared to their ordering on optimization error!
Conclusions for 1st part
- applying FW on the dual of the structured SVM unified previous algorithms and provided a line-search version of batch subgradient
- new block-coordinate variant of the Frank-Wolfe algorithm: same convergence rate but with cheaper iteration cost
- yields a robust & fast algorithm for the structured SVM
- future work: caching tricks, non-uniform sampling, regularization path, explaining the weighted averaging / test error mystery
On the Equivalence between Herding and Conditional Gradient Algorithms [ICML 2012]
Simon Lacoste-Julien, Francis Bach, Guillaume Obozinski
A motivation: quadrature
Approximating integrals: $\int_{\mathcal{X}} f(x)\, p(x)\, dx \approx \frac{1}{T} \sum_{t=1}^{T} f(x_t)$
- random sampling $x_t \sim p(x)$ yields $O(1/\sqrt{T})$ error
- Herding [Welling 2009] yields $O(1/T)$ error! [Chen et al. 2010] (like quasi-MC)
This part:
-> links herding with an optimization algorithm (conditional gradient / Frank-Wolfe)
-> suggests extensions, e.g. a weighted version with $O(e^{-cT})$ error
-> BUT the extensions are worse for learning??? -> yields interesting insights on the properties of herding...
Outline
- Background: Herding [Conditional gradient algorithm]
- Equivalence between herding & cond. gradient; extensions; new rates & theorems
- Simulations: approximation of integrals with cond. gradient variants; learned distribution vs. max entropy
Review of herding [Welling ICML 2009]
Motivation: learning in an MRF (exponential family) with feature map $\Phi : \mathcal{X} \to \mathcal{F}$:
    $p_\theta(x) = \frac{1}{Z_\theta} \exp(\langle \theta, \Phi(x) \rangle)$
- learning: data -> parameter $\theta_{ML}$ via (approximate) ML / max. entropy, i.e. moment matching
- (approximate) inference: parameter -> samples via sampling
- (pseudo-)herding short-circuits the two steps: it goes directly from the data moments to pseudo-samples.
Herding updates
Zero temperature limit of the log-likelihood (the 'Tipi' function; thanks to Max Welling for the picture):
    $\lim_{\beta \to 0} \; \langle \theta, \mu \rangle - \beta \log \Big( \sum_{x \in \mathcal{X}} \exp\big( \tfrac{1}{\beta} \langle \theta, \Phi(x) \rangle \big) \Big) \;=\; \langle \theta, \mu \rangle - \max_{x \in \mathcal{X}} \langle \theta, \Phi(x) \rangle$
Herding updates = subgradient ascent updates on this objective:
    $x_{t+1} \in \arg\max_{x \in \mathcal{X}} \langle \theta_t, \Phi(x) \rangle$
    $\theta_{t+1} = \theta_t + \mu - \Phi(x_{t+1})$
Properties:
1) weakly chaotic -> entropy?
2) moment matching (our focus): $\big\| \mu - \tfrac{1}{T} \sum_{t=1}^{T} \Phi(x_t) \big\|^2 = O(1/T^2)$
(A small code sketch of these updates follows below.)
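A minimal Python sketch of the herding updates on the independent-bits setup used later in the talk ($\mathcal{X} = \{-1, 1\}^d$, $\Phi(x) = x$, so the argmax is a coordinate-wise sign); the initialization $\theta_0 = \mu$ is one common choice, not prescribed by the slides:

```python
import numpy as np

def herding_bits(mu, T=1000):
    """Herding on X = {-1, 1}^d with Phi(x) = x:
    x_{t+1} = argmax_x <theta_t, x> = sign(theta_t),
    theta_{t+1} = theta_t + mu - x_{t+1}."""
    d = len(mu)
    theta = mu.copy()                         # theta_0 = mu (one common choice)
    samples = np.zeros((T, d))
    for t in range(T):
        x = np.where(theta >= 0, 1.0, -1.0)   # argmax over {-1, 1}^d
        theta = theta + mu - x
        samples[t] = x
    return samples

mu = np.array([0.3, -0.5, 0.8, 0.0, 0.1])     # target moments E[x]
samples = herding_bits(mu, T=2000)
moment_error = np.linalg.norm(samples.mean(axis=0) - mu)   # shrinks much faster than 1/sqrt(T)
```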
Approximating integrals in an RKHS
- Reproducing property: $f \in \mathcal{F} \Rightarrow f(x) = \langle f, \Phi(x) \rangle$. Define the mean map $\mu = \mathbb{E}_{p(x)} \Phi(x)$.
- Want to approximate integrals of the form: $\mathbb{E}_{p(x)} f(x) = \mathbb{E}_{p(x)} \langle f, \Phi(x) \rangle = \langle f, \mu \rangle$
- Use a weighted sum to get an approximated mean: $\hat{\mu} = \mathbb{E}_{\hat{p}(x)} \Phi(x) = \sum_{t=1}^{T} w_t \Phi(x_t)$
- The approximation error is then bounded by: $|\mathbb{E}_{p(x)} f(x) - \mathbb{E}_{\hat{p}(x)} f(x)| \le \|f\| \, \|\mu - \hat{\mu}\|$
=> Controlling the moment discrepancy $\|\mu - \hat{\mu}\|$ is enough to control the error of integrals in the RKHS $\mathcal{F}$.
Conditional gradient algorithm (aka Frank-Wolfe)
Algorithm to optimize $\min_{g \in \mathcal{M}} J(g)$, with J convex & (twice) continuously differentiable and $\mathcal{M}$ convex & compact. Repeat:
1) Find a good feasible direction by minimizing the linearization of J, $J(g_t) + \langle J'(g_t), g - g_t \rangle$:
    $\bar{g}_{t+1} \in \arg\min_{g \in \mathcal{M}} \langle J'(g_t), g \rangle$
2) Take a convex step in that direction:
    $g_{t+1} = (1 - \rho_t)\, g_t + \rho_t\, \bar{g}_{t+1}$, with step size $\rho_t = 1/(t+1)$
-> converges in O(1/T) in general.
Herding & cond. grad. are equivalent
Trick: look at conditional gradient on the dummy objective $\min_{g \in \mathcal{M}} \{ J(g) = \frac{1}{2}\|g - \mu\|^2 \}$ with $\mathcal{M} = \mathrm{conv}\{\Phi(x)\}$.
Cond. grad. updates:
    $\bar{g}_{t+1} \in \arg\min_{g \in \mathcal{M}} \langle g_t - \mu, g \rangle$
    $g_{t+1} = (1 - \rho_t)\, g_t + \rho_t\, \bar{g}_{t+1}$
Herding updates:
    $x_{t+1} \in \arg\max_{x \in \mathcal{X}} \langle \theta_t, \Phi(x) \rangle$
    $\theta_{t+1} = \theta_t + \mu - \Phi(x_{t+1})$
Do the change of variable $g_t - \mu = -\theta_t / t$: with the step size $\rho_t = 1/(t+1)$ (and $\rho_0 = 1$), the two coincide, with $\bar{g}_{t+1} = \Phi(x_{t+1})$ and $(t+1)\, g_{t+1} = t\, g_t + \Phi(x_{t+1})$, so that $g_T = \frac{1}{T} \sum_{t=1}^{T} \Phi(x_t) = \hat{\mu}_T$.
Subgradient ascent and cond. gradient are Fenchel duals of each other! (see also [Bach 2012]) (A small numerical sketch follows below.)
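A small numerical sketch of the equivalence on the independent-bits setup from the earlier snippet ($\mathcal{X} = \{-1, 1\}^d$, $\Phi(x) = x$): conditional gradient on the dummy objective picks, at each step, exactly the herding argmax (under the change of variable $\theta_t = t(\mu - g_t)$), and its iterate $g_T$ is the running mean $\hat{\mu}_T$ of the selected points. The setup and names are illustrative:

```python
import numpy as np

def cond_grad_herding(mu, T=1000):
    """Conditional gradient on J(g) = 0.5 * ||g - mu||^2 over M = conv{Phi(x)},
    with Phi(x) = x and X = {-1, 1}^d.  The linear subproblem
    argmin_{g in M} <g_t - mu, g> is attained at a vertex Phi(x_{t+1}),
    which is exactly the herding argmax when theta_t = t * (mu - g_t)."""
    g = np.where(mu >= 0, 1.0, -1.0)                # rho_0 = 1: the first step jumps to a vertex
    samples = [g.copy()]
    for t in range(1, T):
        grad = g - mu                               # J'(g_t)
        x_next = np.where(-grad >= 0, 1.0, -1.0)    # vertex minimizing <grad, x> over {-1, 1}^d
        rho = 1.0 / (t + 1.0)
        g = (1 - rho) * g + rho * x_next            # keeps g_t equal to the running mean
        samples.append(x_next)
    return g, np.array(samples)

mu = np.array([0.3, -0.5, 0.8, 0.0, 0.1])
g_T, samples = cond_grad_herding(mu, T=2000)
assert np.allclose(g_T, samples.mean(axis=0))       # g_T = mu_hat_T, as in the equivalence
```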
Extensions of herding
More general step-sizes $\rho_t$ -> give a weighted sum: $g_T = \sum_{t=1}^{T} w_t \Phi(x_t)$
Two extensions:
1) line search for $\rho_t$
2) min-norm point algorithm (minimize J(g) on the convex hull of the previously visited points)
Rates of convergence & theorems
- No assumption: cond. grad. (with $\rho_t = 1/(t+1)$) yields* $\|g_t - \mu\|^2 = O(1/t)$
- If we assume $\mu$ lies in the relative interior of $\mathcal{M}$ with radius $r > 0$: [Chen et al. 2010] yields $\|g_t - \mu\|^2 = O(1/t^2)$ for herding, whereas the line-search version yields $\|g_t - \mu\|^2 = O(e^{-ct})$ [Guélat & Marcotte 1986, Beck & Teboulle 2004]
Propositions (suppose $\mathcal{X}$ compact and $\Phi$ continuous):
1) $\mathcal{F}$ finite-dimensional and p with full support implies $\exists\, r > 0$
2) $\mathcal{F}$ infinite-dimensional implies $r = 0$ (i.e. [Chen et al. 2010] doesn't hold!)
Simulation 1: approximating integrals
Kernel herding on $\mathcal{X} = [0, 1]$, with $p(x) \propto \big( \sum_{k=1}^{K} a_k \cos(2 k \pi x) + b_k \sin(2 k \pi x) \big)^2$.
We use an RKHS with the Bernoulli polynomial kernel (infinite-dimensional; available in closed form).
[Plot: $\log_{10} \|\hat{\mu}_T - \mu\|$ as a function of T for the different variants.]
Simulation 2: max entropy?
Learning independent bits: $\mathcal{X} = \{-1, 1\}^d$, $d = 10$, $\Phi(x) = x$; irrational $\mu$ vs. rational $\mu$.
[Plots: error on moments $\log_{10} \|\hat{\mu}_T - \mu\|$ and error on the distribution $\log_{10} \|\hat{p}_T - p\|$.]
Conclusions for 2nd part
Equivalence of herding and cond. gradient:
-> yields better algorithms for quadrature based on moments
-> but highlights the max entropy / moment matching tradeoff!
Other interesting points:
- setting up fake optimization problems -> harvest the properties of known algorithms
- the conditional gradient algorithm is useful to know...
- the duality of subgradient & cond. gradient is more general
Recent related work:
- link with Bayesian quadrature [Huszár & Duvenaud UAI 2012]
- herded Gibbs sampling [Bornn et al. ICLR 2013]
Thank you!