TRANSCRIPT
A Tutorial on Zero-Order Optimization
Yujie Tang
A Review of Gradient Descent
• Gradient descent: x_{k+1} = x_k − η ∇f(x_k)
• If f is L-smooth and the step size satisfies η ≤ 1/L, then min_{0 ≤ k < K} ‖∇f(x_k)‖² = O(1/K)
• If f is L-smooth and convex, with η ≤ 1/L, then f(x_K) − f* = O(1/K)
• If f is L-smooth and µ-strongly convex, with η ≤ 1/L, then f(x_K) − f* = O((1 − µ/L)^K), i.e., linear convergence (a minimal sketch follows below)
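As a concrete reference for these rates, here is a minimal sketch of gradient descent on a strongly convex quadratic; the matrix, step size, and iteration count are illustrative choices, not taken from the tutorial.

```python
import numpy as np

# Minimal gradient descent sketch on a strongly convex quadratic
# f(x) = 0.5 * x^T A x, for which L = lambda_max(A) and mu = lambda_min(A).
A = np.diag([1.0, 10.0])                 # illustrative problem data
L = float(np.linalg.eigvalsh(A).max())   # smoothness constant

def grad_f(x):
    return A @ x

x = np.array([5.0, -3.0])
eta = 1.0 / L                            # standard step size for an L-smooth function
for _ in range(200):
    x = x - eta * grad_f(x)              # x_{k+1} = x_k - eta * grad f(x_k)

print("final iterate:", x)               # approaches the minimizer 0 linearly
```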
Optimization without First-Order Information
• Case 1: the function values can be evaluated accurately, for K = 2 (or more) queries per iteration
• Case 2: the evaluation noises are independent across queries, for any K
• In both cases, the algorithm interacts only with a zero-order oracle that returns (possibly noisy) function values
One approach:
• construct gradient estimators based on 0-order information
• replace the true gradients in first-order methods by their 0-order estimators
Case 1: 2-Point & Multi-Point Estimators
• A naïve approach: coordinate-wise finite differences,
    ĝ_i(x) = [f(x + u·e_i) − f(x)] / u,  i = 1, …, d  (d: problem dimension; d + 1 evaluations per gradient)
• When f is L-smooth, we have |ĝ_i(x) − ∂f(x)/∂x_i| ≤ L·u/2
• Works well for low-dimensional problems
• Not favorable for high-dimensional problems (see the sketch below)
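A minimal sketch of the naïve coordinate-wise (forward-difference) estimator mentioned above; it needs d + 1 function evaluations per gradient, which is exactly why it scales poorly with dimension. The test function and step u are illustrative.

```python
import numpy as np

def naive_fd_gradient(f, x, u=1e-5):
    """Forward-difference gradient estimate: (f(x + u e_i) - f(x)) / u.

    Uses d + 1 function evaluations for a d-dimensional x, which is why
    this approach becomes expensive in high dimensions.
    """
    d = x.size
    fx = f(x)
    g = np.zeros(d)
    for i in range(d):
        e_i = np.zeros(d)
        e_i[i] = 1.0
        g[i] = (f(x + u * e_i) - fx) / u
    return g

# illustrative check on a smooth function with known gradient
f = lambda x: np.sum(x ** 2)
x = np.array([1.0, -2.0, 3.0])
print(naive_fd_gradient(f, x))   # close to the true gradient 2*x
```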
Case 1: 2-Point & Multi-Point Estimators
• 2-point gradient estimator:
    g(x) = (d / (2u)) [f(x + u·v) − f(x − u·v)] v,
  where v is spherically symmetric with ‖v‖ = 1 (e.g., uniform on the unit sphere)
• u: smoothing radius
• Under quite general conditions, we have E[g(x)] = ∇f_u(x),
  where f_u is a smoothed version of f (a numerical sketch follows below)
• If v is instead drawn from a density, e.g. the standard Gaussian density p(z) ∝ exp(−‖z‖²/2), then the analogous estimator (1/(2u)) [f(x + u·v) − f(x − u·v)] v satisfies E[g(x)] = ∇f_u(x) with f_u(x) = E[f(x + u·v)] (Gaussian smoothing)
• If v is uniform on the unit sphere, then f_u(x) = E[f(x + u·B)], where B is uniform on the unit ball
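A minimal sketch of the 2-point estimator, assuming v is drawn uniformly from the unit sphere; the test function, smoothing radius, and sample count are illustrative.

```python
import numpy as np

def two_point_estimator(f, x, u=1e-3, rng=np.random.default_rng()):
    """Two-point estimator g(x) = (d / (2u)) [f(x + u v) - f(x - u v)] v,
    with v uniform on the unit sphere."""
    d = x.size
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)                     # uniform direction on the sphere
    return (d / (2 * u)) * (f(x + u * v) - f(x - u * v)) * v

# illustrative check: averaging many estimates recovers the gradient of the
# smoothed function, which here coincides with the true gradient 2*x
f = lambda x: np.sum(x ** 2)
x = np.array([1.0, -2.0, 0.5])
samples = np.array([two_point_estimator(f, x) for _ in range(20000)])
print("averaged estimate:", samples.mean(axis=0))   # close to 2*x
print("true gradient:    ", 2 * x)
```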
Case 1: 2-Point & Multi-Point Estimators
• 2-point gradient estimator:
    g(x) = (d / (2u)) [f(x + u·v) − f(x − u·v)] v,
  where v is spherically symmetric with ‖v‖ = 1
• Some facts for an L-smooth / convex / µ-strongly convex function f:
  • f_u is L-smooth / convex / µ-strongly convex
  • |f_u(x) − f(x)| ≤ L·u²/2 for all x, so the smoothing bias vanishes as u → 0 (checked numerically below)
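A quick Monte Carlo check of the bias bound above, assuming the unit-ball smoothing f_u(x) = E[f(x + u·B)] with B uniform in the unit ball; the dimension, radius, and test function are illustrative.

```python
import numpy as np

# Monte Carlo check of |f_u(x) - f(x)| <= L*u^2/2 for unit-ball smoothing
rng = np.random.default_rng(2)
d, u, L = 5, 0.3, 2.0
f = lambda x: np.sum(x ** 2)        # L-smooth with L = 2

def sample_unit_ball(d):
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    return v * rng.random() ** (1.0 / d)   # uniform point in the unit ball

x = rng.standard_normal(d)
f_u = np.mean([f(x + u * sample_unit_ball(d)) for _ in range(100000)])
print("|f_u(x) - f(x)| =", abs(f_u - f(x)))   # approx u^2 * d/(d+2) for this f
print("bound L*u^2/2   =", L * u ** 2 / 2)    # the gap stays below this bound
```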
Case 1: 2-Point & Multi-Point Estimators
• Gradient descent with 2-point estimator:  x_{k+1} = x_k − η·g(x_k)
• Roughly follows the trajectory of true gradient descent, with fluctuations
• Somewhat like stochastic gradient descent
Case 1: 2-Point & Multi-Point Estimators
• Gradient descent with 2-point estimator:
• If f is L-smooth, with step size η = O(1/(dL)), then min_{0 ≤ k < K} E‖∇f(x_k)‖² = O(d/K), up to a bias term that vanishes as u → 0
• If f is L-smooth and convex, with η = O(1/(dL)), then E[f(x_K)] − f* = O(d/K), up to a bias term that vanishes as u → 0
• Compared with exact gradient descent, the rate picks up a factor of the dimension d
Case 1: 2-Point & Multi-Point Estimators
• Gradient descent with 2-point estimator:
• If f is L-smooth and µ-strongly convex, with η = O(1/(dL)), then E[f(x_K)] − f* = O(ρ^K) plus a bias term that vanishes as u → 0,
  where ρ = 1 − Θ(µ/(dL)): linear convergence at a rate slowed down by the dimension d (a sketch of the full loop follows below)
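A minimal sketch of the resulting zero-order gradient descent loop, reusing the unit-sphere 2-point estimator; the quadratic objective, the step size scaled down by d (as the rates above suggest), and the smoothing radius are illustrative assumptions.

```python
import numpy as np

def two_point_estimator(f, x, u, rng):
    d = x.size
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    return (d / (2 * u)) * (f(x + u * v) - f(x - u * v)) * v

def zo_gradient_descent(f, x0, eta, u, num_iters, rng=np.random.default_rng(0)):
    """Zero-order GD: x_{k+1} = x_k - eta * g(x_k), with g the 2-point estimator."""
    x = x0.copy()
    for _ in range(num_iters):
        x -= eta * two_point_estimator(f, x, u, rng)
    return x

# illustrative strongly convex quadratic; eta ~ 1/(d*L) as the analysis suggests
d, L = 20, 10.0
A = np.diag(np.linspace(1.0, L, d))
f = lambda x: 0.5 * x @ A @ x
x0 = np.ones(d)
x_final = zo_gradient_descent(f, x0, eta=1.0 / (d * L), u=1e-4, num_iters=20000)
print("f(x_final) =", f(x_final))   # small; fluctuates around the optimum 0
```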
Case 1: 2-Point & Multi-Point Estimators
• Gradient descent with 2-point estimator: [convergence plots omitted; N denotes the total number of function evaluations]
Case 1: 2-Point & Multi-Point Estimators
• Add more function evaluations (multi-point estimator):
    g(x) = (1/m) Σ_{i=1}^{m} (d / (2u)) [f(x + u·v_i) − f(x − u·v_i)] v_i,
  where v_1, …, v_m are i.i.d. and spherically symmetric with ‖v_i‖ = 1
• bias: the same, variance: reduced by 1/m
• # of function evaluations: multiplied by m
• More evaluations do not necessarily improve convergence (in terms of N)
• Excessively large m should therefore be avoided (see the sketch below)
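A minimal sketch of the averaged (multi-point) estimator: m independent 2-point estimators are averaged, which keeps the bias but cuts the variance roughly by 1/m at the cost of 2m evaluations; the test point and parameters are illustrative.

```python
import numpy as np

def multi_point_estimator(f, x, u, m, rng=np.random.default_rng()):
    """Average of m independent two-point estimators (2m function evaluations)."""
    d = x.size
    g = np.zeros(d)
    for _ in range(m):
        v = rng.standard_normal(d)
        v /= np.linalg.norm(v)
        g += (d / (2 * u)) * (f(x + u * v) - f(x - u * v)) * v
    return g / m

# illustrative variance comparison at a fixed point
f = lambda x: np.sum(x ** 2)
x = np.linspace(-1.0, 1.0, 10)
for m in (1, 4, 16):
    ests = np.array([multi_point_estimator(f, x, u=1e-3, m=m) for _ in range(2000)])
    print(f"m = {m:2d}: total variance ~ {ests.var(axis=0).sum():.3f}")  # shrinks roughly like 1/m
```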
Case 2: Single-Point Estimator
• Single-point estimator:
    g(x) = (d / u) [f(x + u·v) + noise] v,
  where v is spherically symmetric with ‖v‖ = 1, and the evaluation noise is independent across queries
• We still have E[g(x)] = ∇f_u(x),
  and f_u is the same smoothed version of f as before
• Variance is much worse: E‖g(x)‖² scales like d²·(f(x)² + noise variance)/u², which blows up as u → 0 (sketch below)
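A minimal sketch of the single-point estimator queried through a noisy zero-order oracle (each query returns f(x) plus independent noise, as in Case 2); the oracle, noise level, and test function are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sum(x ** 2)                                   # illustrative smooth function
noisy_oracle = lambda x: f(x) + 0.01 * rng.standard_normal()   # independent noise per query

def single_point_estimator(oracle, x, u):
    """Single-point estimator g(x) = (d / u) * oracle(x + u v) * v, v uniform on the unit sphere."""
    d = x.size
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    return (d / u) * oracle(x + u * v) * v

x = np.array([1.0, -2.0, 0.5])
samples = np.array([single_point_estimator(noisy_oracle, x, u=0.5) for _ in range(100000)])
print("averaged estimate:", samples.mean(axis=0))              # roughly 2*x on average
print("per-sample variance (total):", samples.var(axis=0).sum())
# the variance scales like d^2 * f(x)^2 / u^2, so it explodes as the smoothing radius u shrinks
```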
Case 2: Single-Point Estimator
• Single-point estimator:
    g(x) = (d / u) [f(x + u·v) + noise] v,
  where v is spherically symmetric with ‖v‖ = 1
• Does averaging multiple single-point estimators work?
• Averaging single-point estimates at symmetrically perturbed points x ± u·v reproduces the function-difference terms of the 2-point estimator in Case 1
• Still one remaining term: the independent evaluation noises do not cancel, and their contribution to the variance scales like 1/u² (see the illustration below)
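One way to read the decomposition above: pair single-point queries at x + u·v and x − u·v, so the function-difference part matches the 2-point estimator, while the independent noises remain. The sketch below contrasts this with the noise-free 2-point estimator; the pairing, noise level, and test function are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sum(x ** 2)
sigma = 0.1                                      # illustrative noise level
noisy = lambda x: f(x) + sigma * rng.standard_normal()

def averaged_single_point(x, u):
    """Average of single-point estimates at x + u v and x - u v (independent noises)."""
    d = x.size
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    return (d / (2 * u)) * (noisy(x + u * v) - noisy(x - u * v)) * v

def exact_two_point(x, u):
    """Two-point estimator with accurate (noise-free) evaluations, as in Case 1."""
    d = x.size
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    return (d / (2 * u)) * (f(x + u * v) - f(x - u * v)) * v

x = np.array([1.0, -2.0, 0.5])
for u in (1e-1, 1e-2, 1e-3, 1e-4):
    noisy_var = np.array([averaged_single_point(x, u) for _ in range(5000)]).var(axis=0).sum()
    exact_var = np.array([exact_two_point(x, u) for _ in range(5000)]).var(axis=0).sum()
    print(f"u = {u:.0e}:  with independent noise {noisy_var:12.1f}   noise-free {exact_var:8.1f}")
# the noise-free variance stays bounded, while the noisy one blows up like 1/u^2
```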
Case 2: Single-Point Estimator
• GD with single-point estimator: the large estimator variance forces very small step sizes, and the resulting convergence rates are much slower than in Case 1
Case 2: Single-Point Estimator
• Best known lower bound for smooth and µ-strongly convex functions: with noisy single-point feedback, the optimization error after N evaluations is at least of order 1/√N (with additional dimension-dependent factors), so the fast rates of Case 1 are unattainable
• No lower bounds are known for other classes of functions.
• In fact, convergence can be achieved for convex functions
by other types of zero-order methods that do not use gradient estimators.
References
2-point and multi-point estimators
• [Nesterov2017] Y. Nesterov and V. Spokoiny. “Random gradient-free minimization of convex functions,” 2017.
  (deterministic, 2-point)
• [Duchi2015] J. C. Duchi et al. “Optimal rates for zero-order convex optimization: The power of two function evaluations,” 2015.
  (stochastic, 2-point and multi-point, minimax lower bound)
• [Shamir2017] O. Shamir. “An optimal algorithm for bandit and zero-order convex optimization with two-point feedback,” 2017.
  (online bandit, 2-point, optimal regret for nonsmooth cases)
References
Single-point estimators
• [Flaxman2005] A. D. Flaxman, A. T. Kalai, and H. B. McMahan. “Online convex optimization in the bandit setting: gradient descent without a gradient,” 2005.
• [Bach2016] F. Bach and V. Perchet. “Highly-smooth zero-th order online optimization,” 2016.
  (single-point gradient estimator)
• [Agarwal2013] A. Agarwal et al. “Stochastic convex optimization with bandit feedback,” 2013.
• [Belloni2015] A. Belloni et al. “Escaping the local minima via simulated annealing: Optimization of approximately convex functions,” 2015.
• [Bubeck2017] S. Bubeck, Y. T. Lee, and R. Eldan. “Kernel-based methods for bandit convex optimization,” 2017.
  (single-point evaluation, no gradient estimation, convergence)
References
Single-point estimators
• [Dani2008] V. Dani, S. M. Kakade, and T. P. Hayes. “The price of bandit information for online optimization,” 2008.
• [Jamieson2012] K. G. Jamieson, R. Nowak, and B. Recht. “Query complexity of derivative-free optimization,” 2012.
• [Shamir2013] O. Shamir. “On the complexity of bandit and derivative-free stochastic convex optimization,” 2013.
  (minimax lower bounds)