Parameter tuning based on response surface models
An update on work in progress
EARG, Feb 27th, 2008
Presenter: Frank Hutter
Motivation
• Parameter tuning is important
• Recent approaches (ParamILS, racing, CALIBRA) “only” return the best parameter configuration. Extra information would be nice, e.g.:
- The most important parameter is X
- The effects of parameters X and Y are largely independent
- For parameter X, options 1 and 2 are bad, 3 is best, 4 is decent
• ANOVA is one tool for that, but it has limitations (e.g. it requires discretizing the parameters and assumes a linear model)
More motivation
• Support the actual design process by providing feedback about parameters, e.g. “parameter X should always be i” (the code gets simpler!)
• Predictive models of runtime are widely applicable
- Predictions can be updated based on new information (such as “the algorithm has been running unsuccessfully for X seconds”)
- (True) portfolios of algorithms
• Once we can learn a function f: Θ → runtime, learning a function g: Θ × X → runtime should be a simple extension (X = instance characteristics; Lin learns h: X → runtime)
The problem setting
• For now: static algorithm configuration, i.e. find the best fixed parameter setting across instances
- But as mentioned above, this approach extends to PIAC (per-instance algorithm configuration)
• Randomized algorithms: variance even for a single instance (runtime distributions)
• High inter-instance variance in hardness
• We focus on minimizing runtime
- But the approach also applies to other objectives (special treatment of censoring and of the cost of gathering a data point is then simply unnecessary)
• We focus on optimizing averages across instances
- Generalization to other objectives may not be straightforward
Learning a predictive model
• Supervised learning problem, regression: given training data (x1, o1), …, (xn, on), learn a function f such that f(xi) ≈ oi
• What is a data point xi?
1) Predictive model of average cost
- Average over how many instances/runs?
- Not too many data points, but each one is very costly
- Doesn’t have to be average cost; could be anything
2) Predictive model of single costs; get average cost by aggregation
- Have to deal with tens of thousands of data points
- If predictions are Gaussian, the aggregates are Gaussian (means and variances add); see the sketch below
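A minimal sketch of that aggregation (assuming independent Gaussian predictions per run; illustrative, not the actual implementation):

```python
import numpy as np

def aggregate_gaussians(means, variances):
    """Combine independent Gaussian predictions of single-run costs into a
    Gaussian over their average: sums of Gaussians are Gaussian (means and
    variances add), and dividing by n scales the variance by 1/n**2."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    n = len(means)
    return means.sum() / n, variances.sum() / n**2
```

For example, 100 per-instance predictions collapse into a single Gaussian over the average cost across those 100 instances.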
Desired properties of the model
• 1) Discrete and continuous inputs
- Parameters are discrete/continuous; instance features are (so far) all continuous
• 2) Censoring: when a run times out, we only have a lower bound on its true runtime
• 3) Scalability: tens of thousands of points
• 4) Explicit predictive uncertainties
• 5) Accuracy of predictions
• Considered models: linear regression (which basis functions, especially for discrete inputs?), regression trees (no uncertainty estimates), Gaussian processes (4 & 5 ok, 1 done, 2 almost done, hopefully 3)
Coming up
• 1) Implemented: model average runtimes, optimize based on that model; censoring “almost” integrated
• 2) Further TODOs: active learning criterion under noise; scaling via the Bayesian committee machine
Active learning for function optimization
• EGO [Jones, Schonlau & Welch, 1998] assumes deterministic functions
- Here: averages over 100 instances
• Start with a Latin hypercube design
- Run the algorithm, get (xi, oi) pairs
• While not terminated:
- Fit the model (kernel parameter optimization, all continuous)
- Find the best point to sample (optimization in the space of parameter configurations)
- Run the algorithm at that point, add the new (x, y) pair
The loop is sketched below.
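A minimal sketch of this EGO loop (in Python; `fit_gp` and `expected_improvement` are hypothetical placeholders for the model and criterion on this slide, and the candidate sampling stands in for a real inner optimizer):

```python
import numpy as np

def latin_hypercube(n, bounds, rng):
    """Latin hypercube design: one stratified sample per dimension."""
    d = len(bounds)
    cells = rng.permuted(np.tile(np.arange(n), (d, 1)), axis=1).T
    u = (cells + rng.random((n, d))) / n
    lo, hi = np.array(bounds, dtype=float).T
    return lo + u * (hi - lo)

def ego(run_algorithm, bounds, n_init=10, n_iter=50, rng=None):
    """Sketch of the EGO loop [Jones, Schonlau & Welch, 1998]."""
    rng = np.random.default_rng(rng)
    X = latin_hypercube(n_init, bounds, rng)
    y = np.array([run_algorithm(x) for x in X])    # e.g. average over 100 instances
    for _ in range(n_iter):
        model = fit_gp(X, y)                       # kernel parameter optimization
        cand = latin_hypercube(4096, bounds, rng)  # stand-in for the inner optimizer
        ei = expected_improvement(model, cand, y.min())
        x_new = cand[np.argmax(ei)]                # best point to sample
        X = np.vstack([X, x_new])                  # add the new (x, y) pair
        y = np.append(y, run_algorithm(x_new))
    return X[np.argmin(y)], y.min()
```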
Active learning criterion
• EGO uses maximum expected improvement:
EI(x) = ∫ N(y; μx, σx²) max(0, f_min − y) dy
- Easy to evaluate (can be solved in closed form)
• Problem in EGO: sometimes it is not the actual runtime y that is modeled but a transformation, e.g. log(y). Expected improvement then needs to be adapted:
EI(x) = ∫ N(y; μx, σx²) max(0, f_min − exp(y)) dy
- Easy to evaluate (can still be solved in closed form)
• Take into account the cost of a sample:
EI(x) = ∫ N(y; μx, σx²) · (1/exp(y)) · max(0, f_min − exp(y)) dy
- Easy to evaluate (can still be solved in closed form)
- Not implemented yet (the others are implemented)
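For reference, the first two variants in closed form (a sketch using the standard Gaussian and log-normal partial expectations; the cost-weighted variant is omitted since it is not implemented yet):

```python
import numpy as np
from scipy.stats import norm

def ei(mu, sigma, f_min):
    """Closed-form EI for a Gaussian prediction N(mu, sigma^2) of the
    objective itself: (f_min - mu) * Phi(z) + sigma * phi(z)."""
    z = (f_min - mu) / sigma
    return (f_min - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def ei_log(mu, sigma, f_min):
    """EI when the model predicts log-runtime: E[max(0, f_min - exp(Y))]
    with Y ~ N(mu, sigma^2), via the log-normal partial expectation."""
    v = (np.log(f_min) - mu) / sigma
    return f_min * norm.cdf(v) - np.exp(mu + 0.5 * sigma**2) * norm.cdf(v - sigma)
```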
Which kernel to use?
• Kernel: distance measure between two data points; low distance → high correlation
• Squared exponential, Matérn, etc.
- SE: k(x, x') = σf² exp(−Σi λi (xi − xi')²)
• For discrete parameters: a new Hamming-distance kernel
- k(x, x') = σf² exp(−Σi λi · 1[xi ≠ xi'])
- Positive definite by reduction to string kernels
• “Automatic relevance determination”: one length-scale parameter λi per dimension
- Many kernel parameters lead to problems with overfitting
- …and to very long runtimes for kernel parameter optimization
- For CPLEX: 60 extra parameters, about 15h for a single kernel parameter optimization using DIRECT, without any improvement
• Thus: no length-scale parameters. Only two parameters: the noise σn and the overall variability of the signal, σf (see the sketch below)
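A sketch of the two kernels with the length scales fixed to 1 as just argued (only σf remains in the kernel; the noise σn enters the GP likelihood, not the kernel function):

```python
import numpy as np

def se_kernel(x, x2, sigma_f=1.0):
    """Squared-exponential kernel with all length scales fixed to 1,
    leaving only the signal variability sigma_f free."""
    diff = np.asarray(x, dtype=float) - np.asarray(x2, dtype=float)
    return sigma_f**2 * np.exp(-np.sum(diff**2))

def hamming_kernel(x, x2, sigma_f=1.0):
    """Hamming-distance kernel for discrete parameters: exponential decay
    in the number of positions where the two configurations differ."""
    mismatches = np.sum(np.asarray(x) != np.asarray(x2))
    return sigma_f**2 * np.exp(-mismatches)
```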
How to optimize kernel parameters?
• Objective
- Standard: maximizing the marginal likelihood
- Doesn’t work under censoring
- Alternative: maximizing the likelihood of unseen data using cross-validation
- Efficient when not too many folds k are used: the marginal likelihood requires inverting an N × N matrix, whereas cross-validation with k = 2 requires inverting two N/2 × N/2 matrices. In practice still quite a bit slower (some algebra tricks may help); see the sketch below
• Algorithm: DIRECT (DIviding RECTangles), a global sampling-based method (does not scale to high dimensions)
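A sketch of the two-fold cross-validated objective (my illustration, not the actual implementation; `kernel(theta, A, B)` is a placeholder returning the cross-covariance matrix of A and B):

```python
import numpy as np
from scipy.stats import norm

def cv2_loglik(kernel, theta, X, y, sigma_n):
    """Two-fold cross-validated predictive log-likelihood of GP kernel
    parameters theta: fit on one half, score Gaussian predictions on the
    other, then swap. Only N/2 x N/2 systems are solved, vs. the N x N
    system needed for the marginal likelihood."""
    n = len(y)
    idx = np.arange(n)
    halves = (idx[: n // 2], idx[n // 2:])
    total = 0.0
    for train, test in (halves, halves[::-1]):
        K = kernel(theta, X[train], X[train]) + sigma_n**2 * np.eye(len(train))
        Ks = kernel(theta, X[train], X[test])
        kss = np.diag(kernel(theta, X[test], X[test]))
        mu = Ks.T @ np.linalg.solve(K, y[train])
        var = kss - np.einsum('ij,ij->j', Ks, np.linalg.solve(K, Ks)) + sigma_n**2
        total += norm.logpdf(y[test], mu, np.sqrt(var)).sum()
    return total  # to be maximized, e.g. with DIRECT over theta
```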
How to optimize exp. improvement?
• Currently only 3 algorithms to be tuned:
- SAPS (4 continuous params)
- SPEAR (26 parameters, about half of them discrete; for now the continuous ones are discretized)
- CPLEX (60 params, 50 of them discrete; for now the continuous ones are discretized)
• Purely continuous / purely discrete optimization: DIRECT / multiple-restart local search (sketched below)
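A minimal sketch of the discrete side (multiple-restart one-exchange local search; `ei(config)` is a placeholder for the expected-improvement criterion under the current model, and `domains` lists the allowed values of each discretized parameter):

```python
import numpy as np

def local_search_ei(ei, domains, n_restarts=10, rng=None):
    """Multiple-restart one-exchange local search maximizing expected
    improvement over discrete configurations."""
    rng = np.random.default_rng(rng)
    best_x, best_val = None, -np.inf
    for _ in range(n_restarts):
        x = [rng.choice(dom) for dom in domains]       # random restart
        val, improved = ei(x), True
        while improved:                                # climb to a local optimum
            improved = False
            for i, dom in enumerate(domains):          # one-exchange neighbourhood
                for v in dom:
                    if v == x[i]:
                        continue
                    cand = x[:i] + [v] + x[i + 1:]
                    cand_val = ei(cand)
                    if cand_val > val:
                        x, val, improved = cand, cand_val, True
        if val > best_val:
            best_x, best_val = x, val
    return best_x, best_val
```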
TODO: integrate censoring
• Learning with censored data is almost done (needs solid testing since it’ll be central later)
• Active selection of the censoring threshold?
- Something simple might suffice, such as picking the cutoff equal to the predicted runtime or to the best runtime so far
- The integration bounds in the expected improvement would change, but nothing else
• Runtime: with censoring, about 3 times slower than without (Newton’s method takes about 3 steps)
- “Good” scaling: 42 points: 19 seconds; 402 points: 143 seconds
- Maybe Newton does not need as many steps with more points
• Runtimes of the censoring variants:
- Treat as “completed at threshold”: 4s
- Don’t use censored data: 4s
- Laplace approximation to the posterior: 10s
- Schmee & Hahn, 21 iterations: 36s
- Anecdotal: Lin’s original implementation of Schmee & Hahn, on my machine – beware of normpdf
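A hedged sketch of the Schmee & Hahn variant as I understand the cited scheme (`fit(X, y)` is a placeholder returning per-point predictive means and standard deviations):

```python
import numpy as np
from scipy.stats import norm

def truncated_mean(mu, sigma, c):
    """E[Y | Y >= c] for Y ~ N(mu, sigma^2): the imputed value for a run
    censored at threshold c. Uses norm.sf rather than 1 - norm.cdf for
    tail stability ("beware of normpdf": naive ratios underflow)."""
    alpha = (c - mu) / sigma
    return mu + sigma * norm.pdf(alpha) / norm.sf(alpha)

def schmee_hahn(fit, X, y, censored, n_iter=21):
    """Alternately fit the model to imputed targets and re-impute each
    censored point with its truncated-normal mean under the current fit."""
    y = np.asarray(y, dtype=float)
    y_imp = y.copy()
    for _ in range(n_iter):
        mu, sigma = fit(X, y_imp)
        y_imp[censored] = truncated_mean(mu[censored], sigma[censored], y[censored])
    return y_imp
```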
A counterintuitive example from practice (same hyperparameters in same rows)
Preliminary results and demo
• Experiments with the noise-free kernel
- Great cross-validation results for SPEAR & CPLEX
- Poor cross-validation results for SAPS
• Explanation
- Even when averaging over 100 instances, the response is NOT noise-free
- SAPS is continuous: one can pick configurations arbitrarily close to each other, and if their results differ substantially, the SE kernel must have a huge variance → very poor results
- The Matérn kernel works better for SAPS
TODOs
• Finish censoring
• Consider noise (present even when averaging over instances); change the active learning criterion
• Scaling
• Efficiency of implementation: reusing work for multiple predictions
TODO: Active learning under noise
• [Williams, Santner & Notz, 2000]
• Very heavy on notation, but there is good stuff in there
• 1) Actively choose a parameter setting
- The best setting so far is not known → f_min is now a random variable
- Take joint samples of performance from the predictive distributions for all settings tried so far, take the min of those samples, and compute the expected improvement as if that min were the deterministic f_min
- Average the expected improvements computed for 100 independent samples (sketched below)
• 2) Actively choose an instance to run for that parameter setting: minimize the posterior variance
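A sketch of step 1 (Monte Carlo EI with an uncertain f_min; `mu_obs`/`cov_obs` are the joint posterior at the tried settings, `mu_x`/`sigma_x` the prediction at the candidate):

```python
import numpy as np
from scipy.stats import norm

def noisy_ei(mu_obs, cov_obs, mu_x, sigma_x, n_samples=100, rng=None):
    """Sample joint performance at all previously tried settings, take each
    sample's min as f_min, evaluate the closed-form EI, and average."""
    rng = np.random.default_rng(rng)
    samples = rng.multivariate_normal(mu_obs, cov_obs, size=n_samples)
    f_mins = samples.min(axis=1)                 # one f_min per joint sample
    z = (f_mins - mu_x) / sigma_x
    eis = (f_mins - mu_x) * norm.cdf(z) + sigma_x * norm.pdf(z)
    return eis.mean()
```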
TODO: scaling
• Bayesian committee machine
- More or less a mixture of GPs, each of them fit on a small subset of the data (cluster the data ahead of time)
- Fairly straightforward wrapper around GP code (or really any code that provides Gaussian predictions); see the sketch below
- Maximizing cross-validated performance is easy
- In principle we could update by just updating one component at a time
- But in practice, once we re-optimize the hyperparameters we’re changing every component anyway
- Likewise we can do rank-1 updates for the basic GPs, but a single matrix inversion is really not the expensive part (rather, the 1000s of matrix inversions for kernel parameter optimization)
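A sketch of the committee's prediction combination at one test point (assuming a zero-mean GP prior; any code providing Gaussian predictions can feed it):

```python
import numpy as np

def bcm_predict(mus, variances, prior_var):
    """Bayesian committee machine combination of M GP experts' predictions:
    expert precisions add, and the prior precision (counted M times)
    is subtracted M-1 times."""
    mus = np.asarray(mus, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(mus)
    precision = np.sum(1.0 / variances) - (m - 1) / prior_var
    var = 1.0 / precision
    return var * np.sum(mus / variances), var
```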
Future work
• We can get main effects and interaction effects, much like in ANOVA
- The integrals seem to be solvable in closed form
• We can get plots of the predicted mean and variance as one parameter is varied, marginalized over all others
- Similarly as two or three are varied; this allows for plots of interactions (see the sampling-based sketch below)
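As a sampling-based stand-in for those closed-form integrals, a sketch of a marginal-effect plot (`predict(config)` is a placeholder for the model's mean prediction; `domains` lists the values of each discretized parameter):

```python
import numpy as np

def marginal_effect(predict, domains, dim, n_samples=1000, rng=None):
    """Predicted mean response as parameter `dim` is varied, marginalized
    over all other parameters by Monte Carlo."""
    rng = np.random.default_rng(rng)
    effects = []
    for v in domains[dim]:
        total = 0.0
        for _ in range(n_samples):
            cfg = [v if i == dim else rng.choice(d) for i, d in enumerate(domains)]
            total += predict(cfg)
        effects.append(total / n_samples)
    return list(domains[dim]), effects
```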