
Page 1: Title

Performance Prediction and Automated Tuning of Randomized and Parametric Algorithms: An Initial Investigation

Frank Hutter¹, Youssef Hamadi², Holger Hoos¹, and Kevin Leyton-Brown¹
¹University of British Columbia, Vancouver, Canada
²Microsoft Research Cambridge, UK

Page 2: Motivation: Why automatic parameter tuning? (1)

• Most approaches for solving hard combinatorial problems (e.g. SAT, CP) are highly parameterized
  Tree search:
  - Variable/value heuristics
  - Propagation
  - Whether and when to restart
  - How much learning
  Local search:
  - Noise parameter
  - Tabu length in tabu search
  - Strength of penalty increase and decrease in DLS
  - Perturbation, acceptance criterion, etc. in ILS

Page 3: Motivation: Why automatic parameter tuning? (2)

• Algorithm design: for a new algorithm or application, a lot of time is spent on parameter tuning
• Algorithm analysis: comparability. Is algorithm A faster than algorithm B only because its developers spent more time tuning it?
• Algorithm use in practice: want to solve MY problems fast, not necessarily the ones the developers used for parameter tuning

Page 4: Related work in automated parameter tuning

• Best fixed parameter setting for an instance set [Birattari et al. '02, Hutter '04, Adenso-Díaz & Laguna '05]
• Algorithm selection/configuration per instance [Lobjois & Lemaître '98, Leyton-Brown, Nudelman et al. '02 & '04, Patterson & Kautz '02]
• Best sequence of operators / changing the search strategy during the search [Battiti et al. '05, Lagoudakis & Littman '01 & '02]

Page 5: Overview

• Previous work on empirical hardness models [Leyton-Brown, Nudelman et al. ’02 & ’04]

• EH models for randomized algorithms

• EH models and automated tuning for parametric algorithms

• Conclusions

Page 6: Empirical hardness models: Basics (1 algorithm)

• Training: Given a set of t instances s1, ..., st
  For each instance si:
  - Compute instance features xi = (xi1, ..., xim)
  - Run the algorithm and record its runtime yi
  Learn a function f: features → runtime, such that yi ≈ f(xi) for i = 1, ..., t
• Test: Given a new instance st+1:
  Compute features xt+1
  Predict runtime yt+1 = f(xt+1)
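A minimal sketch of this training/test loop, assuming hypothetical helpers compute_features and run_algorithm, and using plain least squares as a stand-in for the regression method (the slides' actual model, ridge regression over basis functions, appears on a later slide):

```python
import numpy as np

def train_eh_model(instances, compute_features, run_algorithm):
    """Fit f: features -> runtime by least squares on (x_i, y_i) pairs."""
    X = np.array([compute_features(s) for s in instances])  # x_i = (x_i1, ..., x_im)
    y = np.array([run_algorithm(s) for s in instances])     # recorded runtime y_i
    w, *_ = np.linalg.lstsq(X, y, rcond=None)               # y_i ~= f(x_i) = x_i . w
    return w

def predict_runtime(w, compute_features, new_instance):
    """Predict y_{t+1} = f(x_{t+1}) for an unseen instance s_{t+1}."""
    return compute_features(new_instance) @ w
```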

Page 7: Empirical hardness models: Which instance features?

• Features should be polytime computable (for us, in seconds):
  - Basic properties, e.g. #vars, #clauses, their ratio (13)
  - Graph-based characteristics (10)
  - Estimates of DPLL search space size (5)
  - Local search probes (15)
• Combine features to form more expressive basis functions:
  - Basis functions φ = (φ1, ..., φq) can be arbitrary combinations of the features x1, ..., xm
• Basis functions used for SAT in [Nudelman et al. '04]:
  - 91 original features xi
  - Pairwise products of features xi · xj
  - Only a subset of these (drop useless basis functions)
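The pairwise-product expansion can be sketched as follows; the subset selection that drops useless basis functions is not shown:

```python
import numpy as np

def quadratic_basis(x):
    """Expand raw features x_1..x_m into basis functions: the original
    features plus all pairwise products x_i * x_j (i <= j)."""
    x = np.asarray(x, dtype=float)
    i, j = np.triu_indices(len(x))
    return np.concatenate((x, x[i] * x[j]))  # linear terms + pairwise products
```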

Page 8: Empirical hardness models: How to learn the function f: features → runtime?

• Runtimes vary by orders of magnitude, so we need a model that can deal with that:
  Log-transform the output, e.g. a runtime of 10³ sec gives yi = 3
• Learn a linear function of the basis functions: f(φi) = φi wᵀ ≈ yi
  Learning reduces to fitting the weights w (ridge regression: w = (δI + ΦᵀΦ)⁻¹ Φᵀ y), sketched below
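A sketch of this learning step, where Phi is the t-by-q matrix whose rows are the basis-function vectors φi, and delta is a hypothetical regularization constant:

```python
import numpy as np

def fit_ridge(Phi, runtimes, delta=1e-2):
    """Ridge regression on log10-transformed runtimes:
    w = (delta*I + Phi^T Phi)^{-1} Phi^T y."""
    y = np.log10(runtimes)  # runtime 10^3 sec -> y_i = 3
    q = Phi.shape[1]
    return np.linalg.solve(delta * np.eye(q) + Phi.T @ Phi, Phi.T @ y)

def predict_log_runtime(w, phi):
    """f(phi) = phi . w, an estimate of log10 runtime."""
    return phi @ w
```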

Page 9: Algorithm selection based on empirical hardness models (e.g. SATzilla)

• Given a portfolio of n different algorithms A1, ..., An:
  Pick the best algorithm for each instance
• Training: Learn n separate functions fj: features → runtime of algorithm j
• Test (for each new instance st+1):
  Predict runtime yj,t+1 = fj(xt+1) for each algorithm
  Choose the algorithm Aj with minimal yj,t+1
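A sketch of the per-instance selection step, assuming each learned model fj is a weight vector over the same basis functions:

```python
import numpy as np

def select_algorithm(weight_vectors, phi_new):
    """Pick the portfolio algorithm with the smallest predicted runtime.
    weight_vectors[j] holds the weights of the learned model f_j."""
    predicted = np.array([phi_new @ w for w in weight_vectors])
    return int(np.argmin(predicted))  # index j of the chosen algorithm A_j
```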

Page 10: Overview

• Previous work on empirical hardness models [Leyton-Brown, Nudelman et al. ’02 & ’04]

• EH models for randomized algorithms

• EH models and automated tuning for parametric algorithms

• Conclusions

Page 11: Empirical hardness models for randomized algorithms

• Can this same approach predict the run-time of incomplete, randomized local search algorithms? Yes!
• Incomplete → limited to satisfiable instances (train & test)
• Local search → no changes needed
• Randomized → ultimately, we want to predict entire run-time distributions (RTDs)
  For our algorithms, these RTDs are typically exponential and can thus be characterized by a single sufficient statistic, such as median run-time
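For intuition, here is the worked form of that last claim (a sketch, not the paper's code): an exponential RTD with median m has rate ln(2)/m, so one predicted median determines the whole distribution:

```python
import numpy as np

def exponential_rtd(median_runtime, t):
    """CDF of an exponential RTD with median m: P(T <= t) = 1 - 2^(-t/m).
    The rate is ln(2)/m, so the median is a sufficient statistic."""
    return 1.0 - np.exp2(-np.asarray(t, dtype=float) / median_runtime)

# e.g. with a predicted median of 10 s, the chance of solving within 20 s is
# exponential_rtd(10.0, 20.0) == 0.75
```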

Page 12: Prediction of median run-time (only the red parts changed)

• Training: Given a set of t instances s1, ..., st
  For each instance si:
  - Compute features xi = (xi1, ..., xim)
  - Run the algorithm multiple times to get runtimes yi1, ..., yik
  - Compute the median mi of yi1, ..., yik
  Learn a function f: features → median run-time, such that mi ≈ f(xi)
• Test: Given a new instance st+1:
  Compute features xt+1
  Predict median run-time mt+1 = f(xt+1)
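A sketch of the changed training step, again with hypothetical compute_features and run_algorithm helpers; everything downstream of the targets stays as on slide 6:

```python
import numpy as np

def median_runtime_targets(instances, compute_features, run_algorithm, k=10):
    """Build training data for a randomized solver: k runs per instance,
    then regress on the median m_i of the observed runtimes y_i1..y_ik."""
    X = np.array([compute_features(s) for s in instances])
    m = np.array([np.median([run_algorithm(s) for _ in range(k)])
                  for s in instances])
    return X, m  # then learn f with m_i ~= f(x_i) exactly as before
```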

Page 13: Experimental setup: solvers

• Two SAT solvers:
  - Novelty+ (WalkSAT variant)
  - SAPS (Scaling and Probabilistic Smoothing)
  An adaptive version of Novelty+ won the SAT04 competition for random instances; SAPS came second
• Runs cut off after 15 minutes

Page 14: Experimental setup: benchmarks

• Three unstructured distributions:
  - CV-fix: c/v ratio 4.26; 20,000 instances with 400 variables (10,011 satisfiable)
  - CV-var: c/v ratio between 3.26 and 5.26; 20,000 instances with 400 variables (10,129 satisfiable)
  - SAT04: 3,000 instances created with the generators for the SAT04 competition (random instances) with the same parameters (1,420 satisfiable)
• One structured distribution:
  - QWH: quasigroups with holes, 25% to 75% holes; 7,498 instances, satisfiable by construction
• All data sets were split 50:25:25 for train/validation/test

Page 15: Results for predicting (median) run-time for SAPS on CV-var

[Figure: predicted vs. actual run-time; left: prediction based on single runs, right: prediction based on medians of 1000 runs]

Page 16: Results for predicting median run-time based on 10 runs

[Figure: left: SAPS on QWH; right: Novelty+ on SAT04]

Page 17: Predicting complete run-time distributions for exponential RTDs (from the extended CP version)

[Figure: left: RTD of SAPS on the q0.75 instance of QWH; right: RTD of SAPS on the q0.25 instance of QWH]

Page 18: Overview

• Previous work on empirical hardness models [Leyton-Brown, Nudelman et al. ’02 & ’04]

• EH models for randomized algorithms

• EH models and automated tuning for parametric algorithms

• Conclusions

Page 19: Empirical hardness models for parametric algorithms (only the red parts changed)

• Training: Given a set of instances s1, ..., st
  For each instance si:
  - Compute features xi
  - Run the algorithm with some settings pi1, ..., pi,ni to get runtimes yi1, ..., yi,ni
  - Form basis functions φij = φ(xi, pij) of features and parameter settings (quadratic expansion of the parameters, multiplied by the instance features)
  Learn a function g: basis functions → run-time, such that g(φij) ≈ yij
• Test: Given a new instance st+1:
  Compute features xt+1
  For each parameter setting p of interest, construct φt+1,p = φ(xt+1, p) and predict run-time g(φt+1,p), as sketched below
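A sketch of the joint basis functions and the tuning step, under one concrete reading of "quadratic expansion of the parameters, multiplied by the instance features":

```python
import numpy as np

def param_basis(x, p):
    """phi(x, p): quadratic expansion of the parameter vector p
    (1, p_k, p_k * p_l), with each term multiplied by every instance feature."""
    x, p = np.asarray(x, dtype=float), np.asarray(p, dtype=float)
    i, j = np.triu_indices(len(p))
    quad = np.concatenate(([1.0], p, p[i] * p[j]))
    return np.outer(x, quad).ravel()

def best_setting(w, x_new, candidate_settings):
    """Predict g(phi(x_{t+1}, p)) for every setting p of interest and
    return the setting with the smallest predicted (log) runtime."""
    preds = [param_basis(x_new, p) @ w for p in candidate_settings]
    return candidate_settings[int(np.argmin(preds))]
```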

Page 20: Experimental setup

• Parameters:
  - Novelty+: noise ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6}
  - SAPS: all 30 combinations of α ∈ {1.2, 1.3, 1.4} and ρ ∈ {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}
• One additional data set: Mixed = union of QWH and SAT04
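For concreteness, this SAPS grid is just the Cartesian product of the two value sets and could be fed directly to the best_setting sketch above; the α/ρ reading of the two sets follows the default (α, ρ) = (1.3, 0.8) given on the backup slide:

```python
from itertools import product

# The SAPS configuration space from this slide: every (alpha, rho) pair.
alphas = [1.2, 1.3, 1.4]
rhos = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
saps_settings = list(product(alphas, rhos))
assert len(saps_settings) == 30  # matches the 30 combinations above
```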

Page 21: Results for predicting SAPS run-time with 30 different parameter settings on QWH (from the extended CP version)

[Figure: left: 5 instances, one symbol per instance; right: 1 instance in detail (the blue diamonds in the left figure)]

Page 22: Results for automated parameter setting

Data set | Algo | Speedup over default params | Speedup over best fixed params
---------|------|-----------------------------|-------------------------------
SAT04    | Nov+ | 0.89                        | 0.89
QWH      | Nov+ | 177                         | 0.91
Mixed    | Nov+ | 13                          | 10.72
SAT04    | SAPS | 2                           | 1.07
QWH      | SAPS | 2                           | 0.93
Mixed    | SAPS | 1.91                        | 0.93

(Slide annotation: "Not the best algorithm to tune. Do you have a better one?")

Page 23: Results for automated parameter setting, Novelty+ on Mixed

[Figure: left: compared to random parameters; right: compared to best fixed parameters]

Page 24: Overview

• Previous work on empirical hardness models [Leyton-Brown, Nudelman et al. ’02 & ’04]

• EH models for randomized algorithms

• EH models and automated tuning for parametric algorithms

• Conclusions

Page 25: Conclusions

• We can predict the run-time of randomized, incomplete, parameterized local search algorithms
• We can automatically find good parameter settings:
  - Better than the default settings
  - Sometimes better than the best possible fixed setting
• There's no free lunch:
  - Long initial training time
  - Need domain knowledge to define features for a domain (only once per domain)

Page 26: Future work

• Even better predictions:
  - More involved ML techniques, e.g. Gaussian processes
  - Predictive uncertainty, to know when our predictions are not reliable
• Reduce training time:
  - Especially for high-dimensional parameter spaces, the current approach will not scale
  - Use active learning to choose the best parameter configurations to train on
• We need compelling domains: please come talk to me!
  - Complete parametric SAT solvers
  - Parametric solvers for other domains where features can be defined (CP? Even planning?)
  - Optimization algorithms

Page 27: The End

• Thanks to:
  - Holger Hoos, Kevin Leyton-Brown, Youssef Hamadi
  - The reviewers, for helpful comments
  - You, for your attention

Page 28: Experimental setup: solvers (details)

• Two SAT solvers:
  - Novelty+ (WalkSAT variant)
    - Default noise setting 0.5 (= 50%) for unstructured instances
    - Noise setting 0.1 used for structured instances
  - SAPS (Scaling and Probabilistic Smoothing)
    - Default setting (α, ρ) = (1.3, 0.8)
  An adaptive version of Novelty+ won the SAT04 competition for random instances; SAPS came second
• Runs cut off after 15 minutes

Page 29: Which features are most important? (from the extended CP version)

• Results are consistent with those for deterministic tree-search algorithms:
  - Graph-based and DPLL-based features matter
  - Local search probes are even more important here
• Only very few features are needed for good models:
  - Previously observed for all-sat data [Nudelman et al. '04]
  - A single quadratic basis function is often almost as good as the best feature subset
• Strong correlation between features: many choices yield comparable performance