TRANSCRIPT
A Tutorial on Zero-Order Optimization
Yujie Tang
A Review of Gradient Descent
• Gradient descent: x_{k+1} = x_k − η ∇f(x_k)
• If f is L-smooth and the step size satisfies η ≤ 1/L, then min_{0 ≤ k < K} ‖∇f(x_k)‖² = O(1/K)
• If f is L-smooth and convex, with η ≤ 1/L, then f(x_K) − f* = O(1/K)
• If f is L-smooth and µ-strongly convex, with η ≤ 1/L, then f(x_K) − f* = O((1 − µ/L)^K), i.e., linear convergence (a minimal sketch follows below)
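As a concrete reference for these rates, here is a minimal sketch of gradient descent on a strongly convex quadratic; the matrix, step size, and iteration count are illustrative choices, not taken from the tutorial.

```python
import numpy as np

# Minimal gradient descent sketch on a strongly convex quadratic
# f(x) = 0.5 * x^T A x, for which L = lambda_max(A) and mu = lambda_min(A).
A = np.diag([1.0, 10.0])                 # illustrative problem data
L = float(np.linalg.eigvalsh(A).max())   # smoothness constant

def grad_f(x):
    return A @ x

x = np.array([5.0, -3.0])
eta = 1.0 / L                            # standard step size for an L-smooth function
for _ in range(200):
    x = x - eta * grad_f(x)              # x_{k+1} = x_k - eta * grad f(x_k)

print("final iterate:", x)               # approaches the minimizer 0 linearly
```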
Optimization without First-Order Information
• Case 1: the function values can be evaluated accurately, for K = 2 (or more) queries per iteration
• Case 2: the evaluation noises are independent across queries, for any K
• In both cases, the algorithm interacts only with a zero-order oracle that returns (possibly noisy) function values
One approach:
• construct gradient estimators based on 0-order information
• replace the true gradients in first-order methods by their 0-order estimators
Case 1: 2-Point & Multi-Point Estimators
• A naïve approach: coordinate-wise finite differences,
    ĝ_i(x) = [f(x + u·e_i) − f(x)] / u,  i = 1, …, d  (d: problem dimension; d + 1 evaluations per gradient)
• When f is L-smooth, we have |ĝ_i(x) − ∂f(x)/∂x_i| ≤ L·u/2
• Works well for low-dimensional problems
• Not favorable for high-dimensional problems (see the sketch below)
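A minimal sketch of the naïve coordinate-wise (forward-difference) estimator mentioned above; it needs d + 1 function evaluations per gradient, which is exactly why it scales poorly with dimension. The test function and step u are illustrative.

```python
import numpy as np

def naive_fd_gradient(f, x, u=1e-5):
    """Forward-difference gradient estimate: (f(x + u e_i) - f(x)) / u.

    Uses d + 1 function evaluations for a d-dimensional x, which is why
    this approach becomes expensive in high dimensions.
    """
    d = x.size
    fx = f(x)
    g = np.zeros(d)
    for i in range(d):
        e_i = np.zeros(d)
        e_i[i] = 1.0
        g[i] = (f(x + u * e_i) - fx) / u
    return g

# illustrative check on a smooth function with known gradient
f = lambda x: np.sum(x ** 2)
x = np.array([1.0, -2.0, 3.0])
print(naive_fd_gradient(f, x))   # close to the true gradient 2*x
```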
Case 1: 2-Point & Multi-Point Estimators
• 2-point gradient estimator:
    g(x) = (d / (2u)) [f(x + u·v) − f(x − u·v)] v,
  where v is spherically symmetric with ‖v‖ = 1 (e.g., uniform on the unit sphere)
• u: smoothing radius
• Under quite general conditions, we have E[g(x)] = ∇f_u(x),
  where f_u is a smoothed version of f (a numerical sketch follows below)
• If v is instead drawn from a density, e.g. the standard Gaussian density p(z) ∝ exp(−‖z‖²/2), then the analogous estimator (1/(2u)) [f(x + u·v) − f(x − u·v)] v satisfies E[g(x)] = ∇f_u(x) with f_u(x) = E[f(x + u·v)] (Gaussian smoothing)
• If v is uniform on the unit sphere, then f_u(x) = E[f(x + u·B)], where B is uniform on the unit ball
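A minimal sketch of the 2-point estimator, assuming v is drawn uniformly from the unit sphere; the test function, smoothing radius, and sample count are illustrative.

```python
import numpy as np

def two_point_estimator(f, x, u=1e-3, rng=np.random.default_rng()):
    """Two-point estimator g(x) = (d / (2u)) [f(x + u v) - f(x - u v)] v,
    with v uniform on the unit sphere."""
    d = x.size
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)                     # uniform direction on the sphere
    return (d / (2 * u)) * (f(x + u * v) - f(x - u * v)) * v

# illustrative check: averaging many estimates recovers the gradient of the
# smoothed function, which here coincides with the true gradient 2*x
f = lambda x: np.sum(x ** 2)
x = np.array([1.0, -2.0, 0.5])
samples = np.array([two_point_estimator(f, x) for _ in range(20000)])
print("averaged estimate:", samples.mean(axis=0))   # close to 2*x
print("true gradient:    ", 2 * x)
```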
Case 1: 2-Point & Multi-Point Estimators
• 2-point gradient estimator:
    g(x) = (d / (2u)) [f(x + u·v) − f(x − u·v)] v,
  where v is spherically symmetric with ‖v‖ = 1
• Some facts for an L-smooth / convex / µ-strongly convex function f:
  • f_u is L-smooth / convex / µ-strongly convex
  • |f_u(x) − f(x)| ≤ L·u²/2 for all x, so the smoothing bias vanishes as u → 0 (checked numerically below)
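A quick Monte Carlo check of the bias bound above, assuming the unit-ball smoothing f_u(x) = E[f(x + u·B)] with B uniform in the unit ball; the dimension, radius, and test function are illustrative.

```python
import numpy as np

# Monte Carlo check of |f_u(x) - f(x)| <= L*u^2/2 for unit-ball smoothing
rng = np.random.default_rng(2)
d, u, L = 5, 0.3, 2.0
f = lambda x: np.sum(x ** 2)        # L-smooth with L = 2

def sample_unit_ball(d):
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    return v * rng.random() ** (1.0 / d)   # uniform point in the unit ball

x = rng.standard_normal(d)
f_u = np.mean([f(x + u * sample_unit_ball(d)) for _ in range(100000)])
print("|f_u(x) - f(x)| =", abs(f_u - f(x)))   # approx u^2 * d/(d+2) for this f
print("bound L*u^2/2   =", L * u ** 2 / 2)    # the gap stays below this bound
```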
Case 1: 2-Point & Multi-Point Estimators
• Gradient descent with 2-point estimator:  x_{k+1} = x_k − η·g(x_k)
• Roughly follows the trajectory of true gradient descent, with fluctuations
• Somewhat like stochastic gradient descent
Case 1: 2-Point & Multi-Point Estimators
• Gradient descent with 2-point estimator:
• If f is L-smooth, with step size η = O(1/(dL)), then min_{0 ≤ k < K} E‖∇f(x_k)‖² = O(d/K), up to a bias term that vanishes as u → 0
• If f is L-smooth and convex, with η = O(1/(dL)), then E[f(x_K)] − f* = O(d/K), up to a bias term that vanishes as u → 0
• Compared with exact gradient descent, the rate picks up a factor of the dimension d
Case 1: 2-Point & Multi-Point Estimators
• Gradient descent with 2-point estimator:
• If f is L-smooth and µ-strongly convex, with η = O(1/(dL)), then E[f(x_K)] − f* = O(ρ^K) plus a bias term that vanishes as u → 0,
  where ρ = 1 − Θ(µ/(dL)): linear convergence at a rate slowed down by the dimension d (a sketch of the full loop follows below)
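A minimal sketch of the resulting zero-order gradient descent loop, reusing the unit-sphere 2-point estimator; the quadratic objective, the step size scaled down by d (as the rates above suggest), and the smoothing radius are illustrative assumptions.

```python
import numpy as np

def two_point_estimator(f, x, u, rng):
    d = x.size
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    return (d / (2 * u)) * (f(x + u * v) - f(x - u * v)) * v

def zo_gradient_descent(f, x0, eta, u, num_iters, rng=np.random.default_rng(0)):
    """Zero-order GD: x_{k+1} = x_k - eta * g(x_k), with g the 2-point estimator."""
    x = x0.copy()
    for _ in range(num_iters):
        x -= eta * two_point_estimator(f, x, u, rng)
    return x

# illustrative strongly convex quadratic; eta ~ 1/(d*L) as the analysis suggests
d, L = 20, 10.0
A = np.diag(np.linspace(1.0, L, d))
f = lambda x: 0.5 * x @ A @ x
x0 = np.ones(d)
x_final = zo_gradient_descent(f, x0, eta=1.0 / (d * L), u=1e-4, num_iters=20000)
print("f(x_final) =", f(x_final))   # small; fluctuates around the optimum 0
```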
Case 1: 2-Point & Multi-Point Estimators
• Gradient descent with 2-point estimator: [convergence plots omitted; N denotes the total number of function evaluations]
Case 1: 2-Point & Multi-Point Estimators
• Add more function evaluations (multi-point estimator):
    g(x) = (1/m) Σ_{i=1}^{m} (d / (2u)) [f(x + u·v_i) − f(x − u·v_i)] v_i,
  where v_1, …, v_m are i.i.d. and spherically symmetric with ‖v_i‖ = 1
• bias: the same, variance: reduced by 1/m
• # of function evaluations: multiplied by m
• More evaluations do not necessarily improve convergence (in terms of N)
• Excessively large m should therefore be avoided (see the sketch below)
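A minimal sketch of the averaged (multi-point) estimator: m independent 2-point estimators are averaged, which keeps the bias but cuts the variance roughly by 1/m at the cost of 2m evaluations; the test point and parameters are illustrative.

```python
import numpy as np

def multi_point_estimator(f, x, u, m, rng=np.random.default_rng()):
    """Average of m independent two-point estimators (2m function evaluations)."""
    d = x.size
    g = np.zeros(d)
    for _ in range(m):
        v = rng.standard_normal(d)
        v /= np.linalg.norm(v)
        g += (d / (2 * u)) * (f(x + u * v) - f(x - u * v)) * v
    return g / m

# illustrative variance comparison at a fixed point
f = lambda x: np.sum(x ** 2)
x = np.linspace(-1.0, 1.0, 10)
for m in (1, 4, 16):
    ests = np.array([multi_point_estimator(f, x, u=1e-3, m=m) for _ in range(2000)])
    print(f"m = {m:2d}: total variance ~ {ests.var(axis=0).sum():.3f}")  # shrinks roughly like 1/m
```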
Case 2: Single-Point Estimator
• Single-point estimator:
    g(x) = (d / u) [f(x + u·v) + noise] v,
  where v is spherically symmetric with ‖v‖ = 1, and the evaluation noise is independent across queries
• We still have E[g(x)] = ∇f_u(x),
  and f_u is the same smoothed version of f as before
• Variance is much worse: E‖g(x)‖² scales like d²·(f(x)² + noise variance)/u², which blows up as u → 0 (sketch below)
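A minimal sketch of the single-point estimator queried through a noisy zero-order oracle (each query returns f(x) plus independent noise, as in Case 2); the oracle, noise level, and test function are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sum(x ** 2)                                   # illustrative smooth function
noisy_oracle = lambda x: f(x) + 0.01 * rng.standard_normal()   # independent noise per query

def single_point_estimator(oracle, x, u):
    """Single-point estimator g(x) = (d / u) * oracle(x + u v) * v, v uniform on the unit sphere."""
    d = x.size
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    return (d / u) * oracle(x + u * v) * v

x = np.array([1.0, -2.0, 0.5])
samples = np.array([single_point_estimator(noisy_oracle, x, u=0.5) for _ in range(100000)])
print("averaged estimate:", samples.mean(axis=0))              # roughly 2*x on average
print("per-sample variance (total):", samples.var(axis=0).sum())
# the variance scales like d^2 * f(x)^2 / u^2, so it explodes as the smoothing radius u shrinks
```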
Case 2: Single-Point Estimator
• Single-point estimator:
    g(x) = (d / u) [f(x + u·v) + noise] v,
  where v is spherically symmetric with ‖v‖ = 1
• Does averaging multiple single-point estimators work?
• Averaging single-point estimates at symmetrically perturbed points x ± u·v reproduces the function-difference terms of the 2-point estimator in Case 1
• Still one remaining term: the independent evaluation noises do not cancel, and their contribution to the variance scales like 1/u² (see the illustration below)
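One way to read the decomposition above: pair single-point queries at x + u·v and x − u·v, so the function-difference part matches the 2-point estimator, while the independent noises remain. The sketch below contrasts this with the noise-free 2-point estimator; the pairing, noise level, and test function are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sum(x ** 2)
sigma = 0.1                                      # illustrative noise level
noisy = lambda x: f(x) + sigma * rng.standard_normal()

def averaged_single_point(x, u):
    """Average of single-point estimates at x + u v and x - u v (independent noises)."""
    d = x.size
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    return (d / (2 * u)) * (noisy(x + u * v) - noisy(x - u * v)) * v

def exact_two_point(x, u):
    """Two-point estimator with accurate (noise-free) evaluations, as in Case 1."""
    d = x.size
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    return (d / (2 * u)) * (f(x + u * v) - f(x - u * v)) * v

x = np.array([1.0, -2.0, 0.5])
for u in (1e-1, 1e-2, 1e-3, 1e-4):
    noisy_var = np.array([averaged_single_point(x, u) for _ in range(5000)]).var(axis=0).sum()
    exact_var = np.array([exact_two_point(x, u) for _ in range(5000)]).var(axis=0).sum()
    print(f"u = {u:.0e}:  with independent noise {noisy_var:12.1f}   noise-free {exact_var:8.1f}")
# the noise-free variance stays bounded, while the noisy one blows up like 1/u^2
```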
Case 2: Single-Point Estimator
• GD with single-point estimator: the large estimator variance forces very small step sizes, and the resulting convergence rates are much slower than in Case 1
Case 2: Single-Point Estimator
• Best known lower bound for smooth and µ-strongly convex functions: with noisy single-point feedback, the optimization error after N evaluations is at least of order 1/√N (with additional dimension-dependent factors), so the fast rates of Case 1 are unattainable
• No lower bounds are known for other classes of functions.
• In fact, convergence can be achieved for convex functions
by other types of zero-order methods that do not use gradient estimators.
References
2-point and multi-point estimators
• [Nesterov2017] Y. Nesterov and V. Spokoiny. “Random gradient-free minimization of convex functions,” 2017.
  (deterministic, 2-point)
• [Duchi2015] J. C. Duchi et al. “Optimal rates for zero-order convex optimization: The power of two function evaluations,” 2015.
  (stochastic, 2-point and multi-point, minimax lower bound)
• [Shamir2017] O. Shamir. “An optimal algorithm for bandit and zero-order convex optimization with two-point feedback,” 2017.
  (online bandit, 2-point, optimal regret for nonsmooth cases)
References
Single-point estimators
• [Flaxman2005] A. D. Flaxman, A. T. Kalai, and H. B. McMahan. “Online convex optimization in the bandit setting: gradient descent without a gradient,” 2005.
• [Bach2016] F. Bach and V. Perchet. “Highly-smooth zero-th order online optimization,” 2016.
  (single-point gradient estimator)
• [Agarwal2013] A. Agarwal et al. “Stochastic convex optimization with bandit feedback,” 2013.
• [Belloni2015] A. Belloni et al. “Escaping the local minima via simulated annealing: Optimization of approximately convex functions,” 2015.
• [Bubeck2017] S. Bubeck, Y. T. Lee, and R. Eldan. “Kernel-based methods for bandit convex optimization,” 2017.
  (single-point evaluation, no gradient estimation, convergence)
References
Single-point estimators
• [Dani2008] V. Dani, S. M. Kakade, and T. P. Hayes. “The price of bandit information for online optimization,” 2008.
• [Jamieson2012] K. G. Jamieson, R. Nowak, and B. Recht. “Query complexity of derivative-free optimization,” 2012.
• [Shamir2013] O. Shamir. “On the complexity of bandit and derivative-free stochastic convex optimization,” 2013.
  (minimax lower bounds)