design of experiments for the tuning of …...design of experiments for the tuning of optimisation...

Design of Experiments for theTuning of Optimisation Algorithms

Enda Ridge

PhD Thesis

The University of York

Department of Computer Science

October 2007

Abstract

This thesis presents a set of rigorous methodologies for tuning the performance of

algorithms that solve optimisation problems.

Many optimisation problems are difficult and time-consuming to solve exactly.

An alternative is to use an approximate algorithm that solves the problem to an

acceptable level of quality and provides such a solution in a reasonable time. Us-

ing optimisation algorithms typically requires choosing the settings of tuning pa-

rameters that adjust algorithm performance subject to this compromise between

solution quality and running time. This is the parameter tuning problem.

This thesis demonstrates that the Design Of Experiments (DOE) approach can

be adapted to successfully address the parameter tuning problem for algorithms

that find approximate solutions to optimisation problems. The thesis introduces

experiment designs and analyses for (1) determining the problem characteristics

affecting algorithm performance (2) screening and ranking the most important tun-

ing parameters and problem characteristics and (3) tuning algorithm parameters to

maximise algorithm performance for a given problem instance. Desirability func-

tions are introduced for tackling the compromise of achieving satisfactory solution

quality in reasonable running time.

Five case studies apply the thesis methodologies to the Ant Colony System and

the Max-Min Ant System algorithms for the Travelling Salesperson Problem. New

results are reported and open questions are answered regarding the importance

of both existing tuning parameters and proposed new tuning parameters. A new

problem characteristic is identified and shown to have a very strong effect on the

quality of the algorithms’ solutions. The tuning methodologies presented here yield

solution quality that is as good as or better than than the general parameter set-

tings from the literature. Furthermore, the associated running times are orders of

magnitude faster than the results obtained with the general parameter settings.

All experiments are performed with publicly available algorithm code, publicly

available problem generators and benchmarked experimental machines.

Contents

Abstract 1

List of Figures 7

List of Tables 11

Acknowledgments 13

Author’s Declaration 15

I Preliminaries 19

1 Introduction and motivation 211.1 Hypothesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . 25

1.2 Thesis structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

1.3 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

2 Background 292.1 Combinatorial optimisation . . . . . . . . . . . . . . . . . . . . . . 29

2.2 The Travelling Salesperson Problem (TSP) . . . . . . . . . . . . . . 30

2.3 Approaches to solving combinatorial optimisation pro-blems . . . 31

2.4 Ant Colony Optimisation (ACO) . . . . . . . . . . . . . . . . . . . . 34

2.5 Design Of Experiments (DOE) . . . . . . . . . . . . . . . . . . . . . 47

2.6 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

II Related Work 51

3 Empirical methods concerns 533.1 Is the heuristic even worth researching? . . . . . . . . . . . . . . 54

3.2 Types of experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

3.3 Life cycle of a heuristic and its problem domain . . . . . . . . . . 56

3.4 Research questions . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

3.5 Sound experimental design . . . . . . . . . . . . . . . . . . . . . . 59

3.6 Heuristic instantiation and problem abstraction . . . . . . . . . . 64

3.7 Pilot Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

3.8 Reproducibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

3

CONTENTS

3.9 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

3.10 Responses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

3.11 Random number generators . . . . . . . . . . . . . . . . . . . . . 71

3.12 Problem instances and libraries . . . . . . . . . . . . . . . . . . . 71

3.13 Stopping criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

3.14 Interpretive bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

3.15 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

4 Experimental work 774.1 Problem difficulty . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

4.2 Parameter tuning of other metaheuristics . . . . . . . . . . . . . . 79

4.3 Parameter tuning of ACO . . . . . . . . . . . . . . . . . . . . . . . 82

4.4 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

III Design Of Experiments for Tuning Metaheuristics 93

5 Experimental testbed 955.1 Problem generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

5.2 Algorithm implementation . . . . . . . . . . . . . . . . . . . . . . . 97

5.3 Benchmarking the machines . . . . . . . . . . . . . . . . . . . . . 100

5.4 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

6 Methodology 1056.1 Sequential experimentation . . . . . . . . . . . . . . . . . . . . . . 105

6.2 Stage 1a: Determining important problem characteristics . . . . 106

6.3 Stage 1b: Screening . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

6.4 Stage 2: Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

6.5 Stage 3: Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

6.6 Stage 4: Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

6.7 Common case study issues . . . . . . . . . . . . . . . . . . . . . . 125

6.8 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

IV Case Studies 131

7 Case study: Determining whether a problem characteristic affects heuristic perfor-mance 1337.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

7.2 Research question and hypothesis . . . . . . . . . . . . . . . . . . 134

7.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

7.4 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

7.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

7.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

7.7 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141

4

CONTENTS

8 Case study: Screening Ant Colony System 1438.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

8.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

8.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

8.4 Conclusions and discussion . . . . . . . . . . . . . . . . . . . . . . 150

8.5 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151

9 Case study: Tuning Ant Colony System 1539.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153

9.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157

9.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160


9.5 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167

10 Case study: Screening Max-Min Ant System 16910.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169

10.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172

10.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174


10.5 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178

11 Case study: Tuning Max-Min Ant System 17911.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179

11.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183

11.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188


11.5 Chapter summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196

12 Conclusions 19712.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197

12.2 Advantages of DOE . . . . . . . . . . . . . . . . . . . . . . . . . . . 198

12.3 Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199

12.4 Summary of main thesis contributions . . . . . . . . . . . . . . . . 199

12.5 Thesis strengths . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204

12.6 Thesis limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205

12.7 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206

12.8 Closing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207

V Appendices 209

A Design Of Experiments (DOE) 211A.1 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211

A.2 Regions of operability and interest . . . . . . . . . . . . . . . . . . 213

A.3 Experiment Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . 214

A.4 Experiment analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 220

A.5 Hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225

5

CONTENTS

A.6 Error, Significance, Power and Replicates . . . . . . . . . . . . . . 225

B TSPLIB Statistics 229

C Calculation of Average Lambda Branching Factor 233

D Example OFAT Analysis 235D.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235

D.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236

D.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237

D.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238

D.5 Conclusions and discussion . . . . . . . . . . . . . . . . . . . . . . 238

References 243

6

List of Figures

2.1 Growth of TSP problem search space. . . . . . . . . . . . . . . . . 30

2.2 Special cases and generalisations of the TSP. . . . . . . . . . . . . 32

2.3 Experiment setup for the double bridge experiment. . . . . . . . . 35

2.4 An example of a graph data structure. . . . . . . . . . . . . . . . . 36

2.5 The ACO Metaheuristic. . . . . . . . . . . . . . . . . . . . . . . . . 37

2.6 Common tuning parameters and recommended settings for the ACO

algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

2.7 Tuning parameters and recommended settings for MMAS . . . . 46

2.8 Tuning parameters and recommended settings for MMAS . . . . 46

5.1 Relative frequencies of normalised edge lengths for several TSP in-

stances. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

5.2 Results of the DIMACS benchmarking of the experiment testbed. 101

5.3 Data from the DIMACS benchmarking of the experiment testbed. 101

6.1 The sequential experimentation methodology. . . . . . . . . . . . 107

6.2 Schematic for the Two-Stage Nested Design with r replicates. . . 108

6.3 A sample overlay plot. . . . . . . . . . . . . . . . . . . . . . . . . . 126

7.1 Number of outliers deleted during each problem difficulty experi-

ment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

7.2 Relative Error response for ACS on problems of size 300, mean 100. 138

7.3 Relative Error response for ACS on problems of size 700, mean 100. 138

7.4 Relative Error response for MMAS on problems of size 300, mean

100. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

7.5 Relative Error response for MMAS on problems of size 700, mean

100. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

8.1 Descriptive statistics for the ACS screening experiment. . . . . . 145

8.2 Descriptive statistics for the confirmation of the ACS screening

ANOVA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

8.3 95% Prediction intervals for the ACS screening of Relative Error. 147

8.4 95% Prediction intervals for the ACS screening of ADA. . . . . . 147

8.5 95% Prediction intervals for the ACS screening of Time. . . . . . . 148

8.6 Summary of ANOVAs for Relative Error, ADA and Time. . . . . . . 148

7

LIST OF FIGURES

9.1 Descriptive statistics for the full ACS FCC design. . . . . . . . . . 155

9.2 Descriptive statistics for the screened ACS FCC design. . . . . . . 156

9.3 Descriptive statistics for the confirmation of the ACS tuning. . . . 158

9.4 95% Prediction intervals for the full ACS response surface model of

Relative Error-Time. . . . . . . . . . . . . . . . . . . . . . . . . . . 158

9.5 95% Prediction intervals for the screened ACS response surface

model of RelativeError-Time. . . . . . . . . . . . . . . . . . . . . . 159

9.6 RelativeError-Time ranked ANOVA of Relative Error response from

full model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160

9.7 RelativeError-Time ranked ANOVA of time response from full model. 161

9.8 Full RelativeError-Time model results of desirability optimisation. 162

9.9 Screened RelativeError-Time model results of desirability optimi-

sation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162

9.10 Evaluation of Relative Error response in the RelativeError-Time

model of ACS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163

9.11 Evaluation of Time response in the RelativeError-Time model of

ACS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164

9.12 Evaluation of ADA response in the ADA-Time model of ACS . . . 165

9.13 Evaluation of Time response in the ADA-Time model of ACS . . . 165

10.1 Descriptive statistics for the MMAS screening experiment. . . . . 171

10.2 Descriptive statistics for the confirmation of the MMAS screening

ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172

10.3 95% Prediction intervals for the MMAS screening of Relative Error. 173

10.4 95% Prediction intervals for the MMAS screening of ADA. . . . . . 173

10.5 95% Prediction intervals for the MMAS screening of Time. . . . . 174

10.6 Summary of ANOVAs for Relative Error, ADA and Time for MMAS. 174

11.1 Descriptive statistics for the full MMAS experiment design. . . . . 181

11.2 Descriptive statistics for the screened MMAS experiment design. 182

11.3 Descriptive statistics for the MMAS confirmation experiments. . . 184

11.4 95% prediction intervals of Relative Error by the full RelativeError-

Time model of MMAS. . . . . . . . . . . . . . . . . . . . . . . . . . . 185

11.5 95% prediction intervals of Relative Error by the screened RelativeError-

Time model of MMAS. . . . . . . . . . . . . . . . . . . . . . . . . . . 185

11.6 Predictions of Time by the full RelativeError-Time model of MMAS. 186

11.7 Predictions of Time by the screened RelativeError-Time model of

MMAS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186

11.8 95% prediction intervals of ADA by the full ADA-Time model of

MMAS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187

11.9 95% prediction intervals of ADA by the screened ADA-Time model

of MMAS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187

11.10 95% prediction intervals of Time by the full ADA-Time model of

MMAS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187

8

LIST OF FIGURES

11.11 95% prediction intervals of Time by the screened ADA-Time model

of MMAS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188

11.12 RelativeError-Time ranked ANOVA of Relative Error response from

full model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189

11.13 RelativeError-Time ranked ANOVA of time response from full model. 190

11.14 Full RelativeError-Time model results of desirability optimisation. 191

11.15 Screened RelativeError-Time model results of desirability optimi-

sation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191


model of MMAS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192

11.17 Evaluation of the Time response in the relativeError-Time model of

MMAS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193

11.18 Evaluation of the Time response in the RelativeError-Time model

of MMAS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193

11.19 Evaluation of the Time response in the ADA-Time model of MMAS. 194


model of MMAS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194

A.1 Region of operability and region of interest . . . . . . . . . . . . . 213

A.2 Fractional Factorial designs for two to twelve factors. . . . . . . . 216

A.3 Effects and alias chains . . . . . . . . . . . . . . . . . . . . . . . . 217

A.4 Savings in experiment runs when using a fractional factorial design

instead of a full factorial design. . . . . . . . . . . . . . . . . . . . 218

A.5 Central composite designs for building response surface models. 218

A.6 Individual desirability functions. . . . . . . . . . . . . . . . . . . . 221

A.7 Examples of possible main and interaction effects . . . . . . . . . 224

B.1 Some descriptive statistics for the symmetric Euclidean instances

in TSPLIB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230

B.2 Histogram of the bier127 TSPLIB instance. . . . . . . . . . . . . . 230

B.3 Histogram of the Oliver30 TSPLIB instance. . . . . . . . . . . . . . 231

B.4 Histogram of the pr1002 TSPLIB instance. . . . . . . . . . . . . . 231

C.1 Pseudocode for the calculation of the average lambda branching

factor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234

D.1 Fixed parameter settings for the OFAT analysis. . . . . . . . . . . 236

D.2 Descriptive statistics for the six OFAT analyses. . . . . . . . . . . 237

D.3 Summary of results from the six OFAT analyses. . . . . . . . . . . 238

D.4 Plot of the effect of alpha on relative error for a problem with size

400 and standard deviation 10. . . . . . . . . . . . . . . . . . . . . 239





9

LIST OF FIGURES







10

List of Tables

2.1 A selection of ant heuristic applications. . . . . . . . . . . . . . . . 36

3.1 The state of the art in nature-inspired heuristics from 10 years ago. 57

4.1 Evolved parameter values for ACS. . . . . . . . . . . . . . . . . . . 86

6.1 A full factorial combination of two problem characteristics. . . . . 123

7.1 Parameter settings for the problem difficulty experiments . . . . . 135

8.1 Design factors for the screening study with ACS. . . . . . . . . . . 144

9.1 Design factors for the tuning study with ACS. . . . . . . . . . . . 154

10.1 Design factors for the screening study with MMAS. . . . . . . . . 170

11.1 Design factors for the tuning study with MMAS. . . . . . . . . . . 180

11.2 Amount of outliers removed from MMAS tuning analyses. . . . . 183

A.1 Numbers of each effect estimated by a full factorial design of 10

factors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215

A.2 Some common response transformations. . . . . . . . . . . . . . 222

11

Acknowledgments

I am very grateful to my family and friends whose support and encouragement

helped me through the PhD process. I thank my supervisor Daniel Kudenko at

the University of York for his supervision. He promptly reviewed my writing and

was always available when I had questions or doubts. I thank my departmental

assessor, Professor John Clark for his advice and Dimitar Kazakov for reviewing

some of my early papers. The thorough examination of the thesis and constructive

criticisms by Thomas Stutzle, at l’Universite Libre de Bruxelles, and John Clark

greatly improved the thesis.

I also wish to thank Daniel and my colleagues Leonardo Freitas and Arturo

Servin for allowing me to use their machines for my experimental work. Pierre

Andrews, Rania Hodhod, Silvia Quarteroni, Juan Perna and Sergio Mena also

made their machines available when a deadline required some additional com-

puting power. I am very grateful to my colleague and friend Jovan Cakic who

passed away during the course of my research. Jovan helped me quickly set up

the original C program on which much of this thesis is based.

I am grateful to the anonymous reviewers of my publications whose comments

helped shape my research. The research was also greatly improved by discussions

with Ruben Ruiz, Thomas Bartz-Beielstein, Holger Hoos, David Woodruff, Marco

Chiarandini, and Mike Preuss at various international conferences and with Simon

Poulding at York. Major parts of the thesis were proof read by Leonardo Freitas.

I thank Pauline Greenhough, Filomena Ottaway, Judith Warren, Diane Neville,

Richard Selby, Carol Lock, Nicholas Black, Ian Patrick and all the administrative

and technical support staff at the Department for their help during the PhD.

I am very grateful to Michael Madden, my MSc supervisor, and Colm O’ Riordan

at the National University of Ireland, Galway who encouraged and supported my

application to York. Finally, I gratefully acknowledge the financial support of my

scholarship from the Department of Computer Science at the University of York

and its support of my research travel.

Admhalacha

Taim an-bhuıoch de mo chlann agus de mo chairde as a dtacaıocht agus as a

spreagadh a chuidigh go mor liom i rith an phroiseas PhD. Gabhaim buıochas le

m’fheitheoir Daniel Kudenko in Ollscoil Eabhraic as a aiseolas. Cheartaigh se mo

chuid scripteanna go tapa agus bhı se ann nuair a bhı aon cheist agam no amhras

13

LIST OF TABLES

orm. Gabhaim buıochas le mo mheasunoir rannach i Eabhrac, an tOllamh Sean

O Cleirigh, as a chomhairle agus le Dimitar Kazakov as a leirmheas de chuid de

mo scripteanna. D’fheabhsaıodh an trachtas go mor de thoradh an dianscrudu o

Thomas Stutzle, o l’Universite Libre de Bruxelles, agus o Shean O Cleirigh.

Gabhaim buıochas freisin le Daniel agus le mo chomhaltaı Leonardo Freitas

agus Arturo Servin as ligean dom a rıomhairı a usaid do mo thurgnamh. Chuir

Pierre Andrews, Rania Hodhod, Silvia Quarteroni, Juan Perna agus Sergio Mena

a rıomhairı ar fail dom freisin nuair a bhı me faoi bhru le cuspoir. Taim an-

bhuıoch de chara agus chomhalta liom Jovan Cakic a fuair bas le linn mo thaighde.

Chuidigh Jovan liom an tasc-chlar rıomhaire, ar a bhfuil chuid mhor den trachtas

seo bunaithe, a chruthu go tapa.

Taim buıoch de na leirmheastoirı neamhaitheanta de mo chuid foilseachain mar

chuidigh a gcuid tuairiscı go mor liom cruth a chur ar mo thaighde. D’fheabhsaıodh

an taighde go mor de bharr ple le Ruben Ruiz, Thomas Bartz-Beielstein, Holger

Hoos, David Woodruff, Marco Chiarandini, agus Mike Preuss ag comhdhalacha idir-

naisiunta eagsula agus le Simon Poulding in Eabhrac. Leigh agus cheartaigh

Leonardo Freitas pıosaı mora den trachtas.

Gabhaim buıochas le Pauline Greenhough, Filomena Ottaway, Judith Warren,

Diane Neville, Richard Selby, Carol Lock, Nicholas Black, Ian Patrick agus an

fhoireann riarachain ar fad sa Roinn as a gcabhair le linn an PhD.

Taim an-bhuıoch de Michael Madden, m’fheitheoir MSc, agus de Colm O’ Rior-

dan in Ollscoil na hEireann, Gaillimh as a spreagadh agus as a dtacaiocht do mo

iarratas go Eabhrac. Sa deireadh thiar, gabhaim buıochas leis an Roinn Riomhe-

olaıochta in Ollscoil Eabhraic as a dtacaıocht airgeadais do mo thaisteal taighde.

14

Author’s Declaration

This thesis describes original research carried out by the author Enda Ridge under

the supervision of Dr. Daniel Kudenko at the University of York. This research has

not been previously submitted to the University of York or to any other university

for the award of any degree. Some chapters of the thesis are based on articles that

the author published or submitted for publication in the peer-reviewed scientific

literature during the course of the thesis research. The details of these publications

follow.

The early ideas in this research arose from explorations into parallel and de-

centralised versions of Ant Colony Optimisation (ACO) algorithms.

1. Enda Ridge, Daniel Kudenko, Dimitar Kazakov, Edward Curry. Parallel,Asynchronous and Decentralised Ant Colony System, in Proceedings of

AISB 2006: Adaptation in Artificial and Biological Systems. First Interna-

tional Symposium on Nature-Inspired Systems for Parallel, Asynchronous

and Decentralised Environments, vol. 2, T. Kovacs and J. A. R. Marshall,

Eds. AISB, 2006, pp. 174-177.

2. Enda Ridge, Daniel Kudenko, and Dimitar Kazakov, A Study of Concur-rency in the Ant Colony System Algorithm, in Proceedings of the IEEE

Congress on Evolutionary Computation, 2006, pp. 1662-1669.

3. Enda Ridge, Edward Curry, Daniel Kudenko, Dimitar Kazakov, Nature-Inspir-ed Systems for Parallel, Asynchronous and Decentralised Environments,

in Multi-Agent and Grid Systems, vol. 3, H. Tianfield and R. Unland, Eds.

IOS Press, 2007.

It quickly became obvious that these experiments would have a large amount of

experimental noise arising from the parallel and asynchronous nature of the soft-

ware. This prompted a search for how experiments with Ant Colony Optimisation

(ACO) had been conducted in the literature and how the original sequential single

machine versions of the algorithms were set up. An examination of the litera-

ture revealed there were few guidelines and no rigourous approaches to setting up

ACO algorithms. The original research direction changed. ‘Roadmap’ publications

called for, among other things, recommended experiment designs and analyses for

experiments with metaheuristics such as ACO.

4. Enda Ridge and Edward Curry, A Roadmap of Nature-Inspired SystemsResearch and Development, Multi-Agent and Grid Systems, vol. 3, IOS

15

LIST OF TABLES

Press, 2007.

5. Marco Chiarandini, Luıs Paquete, Mike Preuss, Enda Ridge, Experimentson Metaheuristics: Methodological Overview and Open Issues, Institut

for Matematik og Datalogi, University of Southern Denmark, Technical Report

IMADA-PP-2007-04 (http://bib.mathematics.dk/preprint.php?id=IMADA-PP-

2007-04), March 2007, ISSN 0903-3920.

A preliminary version of the screening and tuning methodologies of the thesis

appeared in the following publication.

6. Enda Ridge and Daniel Kudenko, Sequential Experiment Designs for Scr-eening and Tuning Parameters of Stochastic Heuristics, in Workshop on

Empirical Methods for the Analysis of Algorithms at the Ninth International

Conference on Parallel Problem Solving from Nature, L. Paquete, M. Chiaran-

dini, and D. Basso, Eds., 2006, pp. 27-34.

A refined version of this methodology is described in Chapter 6 and is used in

the case studies of Chapters 8 to 11.

Initial attempts to apply this methodology were not performing as well as ex-

pected and so an investigation was conducted into possible unknown problem

characteristics that might be interfering with the methodology’s models. This led

to the following publications, the second of which contains the updated data of

Chapter 7.

7. Enda Ridge and Daniel Kudenko, An Analysis of Problem Difficulty fora Class of Optimisation Heuristics, in Proceedings of the Seventh Euro-

pean Conference on Evolutionary Computation in Combinatorial Optimisa-

tion (EvoCOP), vol. 4446, Lecture Notes in Computer Science, C. Cotta and

J. Van Hemert, Eds. Springer-Verlag, 2007, pp. 198-209. ISBN 978-3-540-

71614-3.

8. Enda Ridge and Daniel Kudenko, Determining whether a problem charac-teristic affects heuristic performance. A rigorous Design of Experimentsapproach, in Recent Advances in Evolutionary Computation for Combina-

torial Optimization. Springer, Studies in Computational Intelligence, 2008.

ISBN 1860-949X.

The first application of the methodology was published in the following papers,

updated versions of which appear in Chapters 8 and 9. The third of these was

nominated for best paper in its track at the Genetic and Evolutionary Computation

conference 2007.

9. Enda Ridge and Daniel Kudenko, Screening the Parameters Affecting Heur-istic Performance, in Proceedings of the Genetic and Evolutionary Compu-

tation Conference, vol. 1, D. Thierens, H.-G. Beyer, M. Birattari, et al., Eds.

ACM, 2007. ISBN 978-1-59593-697-4.

16

LIST OF TABLES

10. Enda Ridge and Daniel Kudenko, Screening the Parameters Affecting Heur-istic Performance. The Department of Computer Science, The University of

York, Technical Report YCS 415 (www.cs.york.ac.uk/ftpdir/reports/index.php),

April 2007.

11. Enda Ridge and Daniel Kudenko, Analyzing Heuristic Performance withResponse Surface Models: Prediction, Optimization and Robustness in

Proceedings of the Genetic and Evolutionary Computation Conference, D.

Thierens, H.-G. Beyer, M. Birattari, et al., Eds. ACM, 2007, p. 150-157.

ISBN 978-1-59593-697-4.

Finally, the methodology was applied to the MMAS heuristic and published in

the following paper, an updated version of which appears in Chapter 11. The paper

was winner of the best paper award at the Engineering Stochastic Local Search

Algorithms workshop.

12. Enda Ridge and Daniel Kudenko, Tuning the Performance of the MMASHeuristic in Engineering Stochastic Local Search Algorithms. Designing, Im-

plementing and Analyzing Effective Heuristics, vol. 4638, Lecture Notes in

Computer Science, T. Stutzle and M. Birattari, Eds. Berlin / Heidelberg:

Springer, 2007, pp. 46-60. ISBN 978-3-540-74445-0.

17

Part I

Preliminaries

19

1Introduction and motivation

This thesis presents rigorous empirical methodologies for modelling and tuning the

performance of algorithms that solve optimisation problems.

Consider the very common problem of efficiently assigning limited indivisible re-

sources to meet some objective. For example, a manufacturing plant must sched-

ule machines to a particular job in the correct order so that machine utilisation is

maximised and a product is manufactured as quickly as possible. Low cost airlines

must assign cabin crew shifts from a minimum size of workforce and to as many

aircraft as possible. Logistics companies need to deliver products to a set of loca-

tions in an order that minimises delivery cost. Many such similar problems occur

in management, finance, engineering and physics. These problems are known as

Combinatorial Optimisation (CO) problems.

CO problems are notoriously difficult to solve because a large number of poten-

tial solutions must be considered. Constraints on the available resources will limit

the feasible alternatives that need to be considered. However, most CO problems

still contain sufficient alternatives to make the best choices of available options

difficult. CO problems typically require exponential time for solution in the worst

case. In plain terms, as the problem gets larger, the difficulty of finding an exact

solution increases extremely quickly. This has lead to the use of heuristic solu-

tion methods—methods that sacrifice the guarantee of finding an exact solution

in order to find a satisfactory solution in reasonable time. We term this reduc-

tion in solution quality in exchange for an increase in solution time the heuristiccompromise.

Metaheuristics1 are a more recent attempt to combine basic heuristics into a

flexible higher-level framework in order to better solve CO problems. Some of the

most popular metaheuristics for combinatorial optimisation are Ant Colony Opti-

misation (ACO), Evolutionary Computation (EC), Iterated Local Search (ILS), Sim-

1 The terms metaheuristic and heuristic are used interchangeably throughout the thesis.

21

CHAPTER 1. INTRODUCTION AND MOTIVATION

ulated Annealing (SA) and Tabu Search (TS). Many of these metaheuristics have

achieved notable successes in solving difficult and important problems. Industry is

taking note of this. Several companies incorporate metaheuristics into their solu-

tions of complex optimisation problems [12]. These include ILOG (www.ilog.com),

SAP (www.sap.com), NuTech Solutions (www.nutechsolutions.com),

AntOptima (www.antoptima.com) and EuroBios (www.eurobios.com). Metaheuris-

tics are therefore a research area of growing importance.

The flexibility of the metaheuristic framework comes at a cost. Metaheuristics

typically require a relatively large amount of ‘tuning’ in order to adjust them to the

particular problem at hand. This tuning involves setting values of many tuningparameters, much as one would adjust the dials on an old-fashioned television set

to find a given station. This situation is exacerbated if one considers parameteris-

ing internal components of the metaheuristic and then adding or removing these

parameterised components to modify performance. We term these design param-eters. Some metaheuristics can have anything from five to more than twenty-five

of these tuning parameters [33] and the scope for design parameters is effectively

limitless. It quickly becomes very difficult to search through all possible tuning

parameter settings and thus the potential performance of the metaheuristic is not

realised. This is the parameter tuning problem.

The parameter tuning problem is one of the most important research challenges

for any given metaheuristic2. The main elements of this research challenge are as

follows.

1. Screening problem characteristics to determine which problem character-

istics affect metaheuristic performance.

2. Screening tuning parameters to determine which tuning parameters affect

metaheuristic performance.

3. Modelling the relationship between tuning parameters, problem characteris-

tics and performance.

4. Predicting metaheuristic performance for a given problem instance or set of

instances given particular tuning parameter settings.

5. Tuning metaheuristic performance for a given problem instance or set of in-

stances by recommending appropriate tuning parameter settings.

6. Assessing robustness of the tuned metaheuristic performance to variations

in problem instance characteristics. That is, determining whether tuned pa-

rameter settings for a given combination of problem instance characteristics

deteriorate significantly when applied to similar problem instance character-

istics.

The key obstacles to addressing these challenges are as follows:

2 There are, of course, other very important research challenges. Comparing heuristics, for example,is an important challenge that is fraught with its own difficulties.

22

http://www.ilog.com

http://www.sap.com

http://www.nutechsolutions.com

http://www.antoptima.com/

http://www.eurobios.com/


• Problem space. All of the important problem characteristics are generally

not known and may be difficult to determine.

• Parameter space3. The number of tuning parameters and the possible com-

binations of values they can take on is large or even infinite.

• Multiple performance metrics. Performance must be analysed in terms of

both solution quality and solution time because of the heuristic compromise.

• Application scenario. The emphasis in a particular parameter tuning prob-

lem will depend on the specific application scenario. If problem instances are

likely to be similar in characteristics then it is advantageous to have a general

model of the relationship between parameters, instances and performance. If

a small number of problem instances are likely to be tackled and those in-

stances require significant resources for their solution then a relatively fast

tuning approach is to be preferred.

These challenges and obstacles must be addressed for every new metaheuristic

that is proposed, for every modification to an existing metaheuristic that is pro-

posed, and for every new problem type that is addressed. The parameter tuning

problem is ubiquitous. Without addressing these challenges and overcoming these

obstacles, the metaheuristic is of little use in practice as its user cannot set it up

for maximum performance. So how does one address these challenges?

One can distinguish two broad approaches [28]. Analytical approaches attempt

to analytically prove characteristics of the algorithm such as its worst-case and

average-case behaviour. Empirical analyses implement the algorithm in computer

code and evaluate its behaviour on selected problems. Both of these approaches

have been unsatisfactory to date.

The analytical approach is the more ideal of the two in principle because of its

potential generality and pure mathematical foundation. While it is to be expected

that analytical approaches will improve with time and effort, they are far from ideal

at their current level of maturity. The mathematical tools do not yet exist to suc-

cessfully formalise and theorise about the behaviour and performance of existing

cutting-edge metaheuristics. While early attempts at analysis are emerging, they

generally resort to extreme simplifications to the metaheuristic description to ren-

der the analyses tractable. There is also a lack of comparisons of the theoretical

predictions to actual implementations to determine whether the theory predicts

the reality. These simplifications make the majority of conclusions inapplicable for

practical purposes.

An empirical approach would seem an attractive alternative by virtue of its

simplicity—collect enough data and interpret it without bias. The reality is very

different. Which data should be collected? What issues affect the measurement

and collection of the data? How much data is enough data? How should data be

3 We use the term parameter space when considering all the possible combinations of tuning pa-rameters. We use the term design space when considering all the possible combinations of both tuningparameter settings and problem characteristics.

23


interpreted? Can data interpretation be backed by mathematical precision or must

we be limited to subjective interpretation? How do we ensure that an empirical

analysis is both repeatable and reproducible?4

An examination of the professional research journals shows that while empir-

ical analyses of metaheuristics are often large and broad ranging, they are sel-

dom backed by the scientific rigour that one would expect in more mature dis-

ciplines such as the physical, medical and social sciences. Few of the questions

from the previous paragraph regarding empirical methodology are either recog-

nised or clearly addressed by researchers. Proper experimental designs are seldom

used. Interpretations of results are subjective opinions rather than sound statisti-

cal analyses. Parameters are selected without justification or based on the reports

from other studies without verification of their appropriateness for the current sce-

nario [2]. This leaves the metaheuristic ill-defined, experiments irreproducible and

leads to an underestimation of the time needed to deploy the metaheuristic [69].

The list of failings is long and has often been lamented in the literature of the last

two decades [69, 64, 101, 7, 48, 65].

While these criticisms in the literature are justified, others point out that few

publications go further and explicitly illustrate the application of sound established

scientific methodology to the analysis of metaheuristics [28]. Without research

that sets a good example, the impoverished state of the field’s methodology has

thus persisted. Researchers in the natural sciences have available an extensive

lore of laboratory techniques to guide the development of rigorous and conclusive

experiments. This has not been the case in algorithmic research [79]. Attempts

to improve this situation with illustrative case studies and to educate researchers

with tutorials are emerging [122, 27, 9, 90]. A comprehensive methodology for

addressing the aforementioned research challenges is needed. A comprehensive

illustration of the application of such a methodology is needed. Fortunately, a

good candidate methodology already exists.

The field of Design of Experiments (DOE) is defined as:

. . . a systematic, rigorous approach to engineering problem-solving that

applies principles and techniques at the data collection stage so as to

ensure the generation of valid, defensible, and supportable engineering

conclusions. In addition, all of this is carried out under the constraint

of a minimal expenditure of engineering runs, time, and money. [1]

As well as providing this rigorous and efficient approach to data collection,

DOE also provides statistically designed experiments. A statistically designed ex-

periment offers a number of advantages over a design that does not use statisti-

cal techniques [89]. Attention is focussed on measuring sources of variability in

results. The required number of tests is determined reliably and may often be re-

duced. Detection of effects is more precise and the correctness of conclusions is

4 A repeatable experiment is one which the original experimenter can redo and get very similarresults. A reproducible experiment is one which another experimenter can reproduce independentlyand get similar results that lead to the same conclusions.

24


known with the mathematical precision of statistics.

DOE is a well-established field that has existed for over eighty years. It evolved

for the manufacturing industry and is now well supported by commercial soft-

ware. The National Institute of Standards and Technology describes four general

engineering problem areas to which DOE may be applied [1]:

• Screening/Characterizing: the engineer is interested in understanding the

process as a whole in the sense that he/she wishes to rank factors that affect

the process in order of importance.

• Modelling: the engineer is interested in modelling the process with the output

being a good-fitting (high predictive power) mathematical relationship.

• Optimizing: the engineer is interested in optimising the process by adjusting

the factors that effect the process.

• Comparative: the engineer is interested in assessing whether a given choice

is preferable to an alternative.

The first three of these application areas map directly to the parameter tuning

research challenges for metaheuristics identified earlier5. The metaheuristic being

studied is the ‘process’ to which DOE is applied. The rigour of DOE provides

the framework to address the concerns regarding the methodology of empirical

analyses. The statistically designed experiments address any concerns about the

subjective nature of the interpretation of results.

1.1 Hypothesis Statement

We can now identify the central hypothesis of this research:

The problem of tuning a metaheuristic can be successfully addressed with a

Design Of Experiments approach.

If the parameter tuning problem is addressed successfully, then we can expect

• to make verifiably accurate predictions of metaheuristic performance with a

given confidence.

• to make verifiably accurate recommendations on the most important tuning

parameters with a given confidence.

• to make verifiably accurate recommendations on tuning parameter settings

with a given confidence.

5 Comparative DOE studies are appropriate for the comparison of heuristics, typically answeringquestions such as whether one heuristic is better than another. The difficulties of comparative studiesare covered in the literature. Comparative studies are appropriate once all other issues regardingdesign, setup and running have been addressed. This thesis focuses on tuning and so should facilitatefairer comparative studies.

25


• to make all of these recommendations in terms of solution quality and solu-

tion time.

The specific metaheuristic studied in this thesis is Ant Colony Optimisation

(ACO) [47]. The CO problem domain to which ACO will be applied is the Travel-

ling Salesperson Problem [75]. The importance of and need for this research has

already been highlighted in the ACO field [118].

1.2 Thesis structure

The thesis is divided into three parts. Preliminaries are the necessary topics that

must be covered to place the research in context. The second part, Related Work,

presents a synthesis of the methodological issues that arise in empirical analyses of

metaheuristics and critically reviews the literature on parameter tuning in light of

these issues. The third part, Design Of Experiments for Tuning Metaheuristics, is

concerned with methodology. It introduces the experimental testbed and presents

one of the thesis’ main contributions, a Design of Experiments methodology for

metaheuristic parameter tuning. The final part, Case Studies, contains several

examples of the successful application of the methodology. The specific chapters

are now summarised.

Chapter 2 on page 29 gives a background on combinatorial optimisation and

the Travelling Salesperson Problem, the type of optimisation and problem domain

studied in this thesis. Various approaches to solving combinatorial optimisation

problems are covered. The discussion then focuses on Ant Colony Optimisation

(ACO), the family of metaheuristics used to illustrate the methodology advocated

in the thesis. The chapter concludes with an overview of the Design Of Experiments

field.

Chapter 3 brings together and discusses many of the issues that arise when

performing empirical analyses of heuristics. Some of these issues have often been

raised in the research literature but are scattered across a range of related research

fields. This chapter therefore draws on literature from fields such as Operations

Research, Heuristics, Performance Analysis, Design of Experiments and Statistics.

Chapter 4 is a critical review of the literature on parameter tuning in light

of the methodological issues highlighted in the previous chapter. It begins with

approaches to analysing problem difficulty for heuristics. Parameter tuning is

addressed in terms of metaheuristics other than ACO and in terms of the ACO

metaheuristic. For the treatment of ACO parameter tuning, the chapter reviews

analytical, automated and empirical approaches.

Chapter 5 describes the experimental testbed. It covers the problem generator

and metaheuristic code used. It also details the benchmarking of the experiment

machines. All topics are covered in light of the empirical analysis issues discussed

in Chapter 3. This chapter is key to the reproducibility of the results the thesis

presents.

Chapter 6 is a detailed step-by-step description of the Design Of Experiments

methodology that the thesis introduces. The methodology is crafted in terms of the

26


empirical analysis concerns of Chapter 3. This chapter serves as a template for all

the case studies reported in the final part of the thesis.

Chapters 7 to 11 on page 179 are the thesis case studies. They illustrate all

aspects of the thesis’ Design Of Experiment methodology of Chapter 6. Case stud-

ies cover the two best performing members of the ACO metaheuristic family, Ant

Colony System and Max-Min Ant System. Many new results for the ACO field are

presented and open questions from the literature are answered. This underscores

the benefits of adopting the thesis’ rigorous Design Of Experiments methodology.

The thesis concludes with Chapter 12 on page 197. Appendix A is an overview

of Design Of Experiments (DOE) terminology and concepts. It is provided for the

convenience of the reader who is unfamiliar with DOE. It should not be taken as a

replacement for comprehensive textbooks on the subject [89, 84, 85]. Appendix B

contains some statistics related to the TSP. Appendix C is an important complexity

calculation related to the MMAS heuristic.

1.3 Chapter summary

This chapter has introduced and motivated the main thesis of this research.

• Problems of combinatorial optimisation were introduced and the difficulty of

solving them was explained.

• Metaheuristics were introduced as a popular emerging approach for solving

CO problems.

• The parameter tuning problem was identified as a key research challenge

that will always be faced when dealing with newly proposed metaheuristics,

proposed changes to existing metaheuristics and new problem types. The

difficulty of the parameter tuning problem was explained and the importance

of solving the problem in terms of the heuristic compromise of solution time

and solution quality was emphasised. Approaches to addressing the parame-

ter tuning problem were categorised as either analytical or empirical and the

current deficiencies in the state-of-the-art of both approaches were explained.

• Design Of Experiments (DOE) was identified as a well-established field that

may be a very good candidate for empirically solving the parameter tuning

problem in a rigorous fashion.

This lead to the central hypothesis of this thesis:



27

2Background

The previous chapter introduced combinatorial optimisation problems, discussed

their importance across academia and industry and explained why they are typ-

ically difficult to solve. Metaheuristics were introduced as a general framework

for solving such problems and the parameter tuning problem was presented as one

of the key obstacles to the successful deployment of metaheuristics. The chapter

highlighted the lack of experimental rigour in the field’s attempts to analyse and

understand its heuristics, particularly its lack of a rigourous approach to the pa-

rameter tuning problem. This led to the hypothesis that rigourous techniques can

be adapted from the Design Of Experiments (DOE) field to successfully tackle the

parameter tuning problem.

This chapter gives a more detailed background to the areas mentioned in the

previous chapter’s motivation and hypothesis. It begins with a general descrip-

tion of combinatorial optimisation before focussing on the particular combinatorial

optimisation problem addressed in this thesis. The approaches to solving combi-

natorial optimisation problems are reviewed. The chapter then focuses on the

particular family of metaheuristics that is studied in this thesis. The chapter con-

cludes with some background on the Design Of Experiments techniques that will

be adapted to the parameter tuning problem in this thesis.

2.1 Combinatorial optimisation

Optimisation problems in general divide naturally into two classes: those where

solutions are encoded with real-valued variables and those where solutions are

encoded with discrete variables. Combinatorial Optimisation (CO) problems are of

the latter type.

An illustrative example of a CO problem is that of class timetabling. Such

timetabling typically involves assigning a group of teachers and students to class-

rooms. This assignment is subject to the constraints that a teacher cannot teach

29

CHAPTER 2. BACKGROUND

all subjects, students are only taking a limited number of all the available classes,

teachers and students cannot be in two classrooms at once and no more than one

class can be taught in a classroom at a given time. The variables are discrete

because we cannot consider some fraction of a student, room or teacher. The diffi-

culty of the problem lies in the large number of possible solutions that have to be

searched and the constraints on keeping all teachers and students satisfied. Some

other popular examples of CO problems are the Travelling Salesperson Problem

(TSP) [75], the Quadratic Assignment Problem (QAP) [55, p. 218] and the Job Shop

Scheduling Problem (JSP) [55, p. 242].

The ubiquity of CO problems and their importance for logistics, manufacture,

scheduling and other industries has resulted in a large body of research devoted

to their understanding, analysis and solution.

This thesis is concerned with a particular type of combinatorial optimisation

problem called the Travelling Salesperson Problem.

2.2 The Travelling Salesperson Problem (TSP)

Informally, the Travelling Salesperson Problem (TSP) can be described in the fol-

lowing way.

Given a number of cities and the costs of travelling from any city to any

other city, what is the cheapest round-trip route that visits each city

exactly once? [121]

The most direct solution would be to try all the ordered combinations of cities

and see which combination, or tour, is cheapest. Using such a brute force searchrapidly becomes impractical because the number of possible combinations of n

cities to consider is the factorial of n. This rapid growth in problem search space

size is illustrated in Figure 2.1.

1.00E+001.00E+161.00E+321.00E+481.00E+641.00E+801.00E+961.00E+1121.00E+1281.00E+1441.00E+160

0 20 40 60 80 100

Number of Cities (n)

Com

bina

tions

of c

ities

Figure 2.1: Growth of TSP problem search space. The horizontal axis is the number of cities in a TSPproblem. The vertical axis is the number of combinations of cities that have to be considered. Thefigure shows an exponential growth in search space size with problem size.

In fact, the TSP has been shown to be Nondeterministic Polynomial-time hard(NP-hard). Informally, this means that it is contended that the TSP cannot be

30

http://en.wikipedia.org/wiki/Polynomial-time


solved to optimality within polynomially bounded computation time in the worst

case. A detailed examination of the TSP, computational complexity theory and NP-

hardness is beyond the scope of this thesis. The reader is referred to the literature

for a discussion of this important topic [70]. For practical purposes, the difficulty

of the TSP means that a sophisticated approach to its solution is required.

The difficulty of solving the TSP to optimality, despite its conceptually simple

description, has made it a very popular problem for the development and testing

of combinatorial optimisation techniques. The TSP “has served as a testbed for

almost every new algorithmic idea, and was one of the first optimization problems

conjectured to be ‘hard’ in a specific technical sense” [70, p. 37]. This is partic-

ularly so for algorithms in the Ant Colony Optimisation (ACO) field where ‘a good

performance on the TSP is often taken as a proof of their usefulness’ [47, p. 65].

The type of TSP described at the start of this section can be termed the general

asymmetric TSP. It is asymmetric because the cost of travelling between two given

cities can be different depending on the direction of travel. The cost from city

1 to city 2 can be different to the cost from city 2 to city 1. There are several

further categories of TSP problem that we can distinguish [70, p. 58-61]. Their

relationship to one another in terms of generalisations and specialisations of the

general asymmetric TSP are illustrated in Figure 2.2 on the following page.

This thesis focuses exclusively on symmetric TSP instances. The reader is re-

ferred to the literature for details of the other TSP types [70, p. 58-61]. The sym-

metric TSP specialisation was chosen as a problem domain because the heuristics

researched in this thesis were originally developed for this TSP type.

This thesis follows the usual convention of using the term problem to describe

a general problem such as the Travelling Salesperson Problem and an instance to

be a particular case of a problem.

2.3 Approaches to solving combinatorial optimisation pro-

blems

Algorithms to tackle combinatorial optimisation problems can be classified as ei-

ther exact or approximate. Exact methods are guaranteed to find an optimal so-

lution in bounded time. Unfortunately, many problems are NP-hard like the TSP

and so may require exponential time in the worst case. This impracticality of exact

methods has led to the use of approximate (or heuristic) methods—methods that

sacrifice the guarantee of finding an optimal solution in order to find a satisfactory

solution in reasonable time. We term this the heuristic compromise. This com-

promise is even mentioned implicitly in some definitions of CO problems [42, p.

244].

Approximate methods (or heuristics) can be distinguished as being either con-structive methods or local search (or improvement) methods. Constructive meth-

ods start from scratch and add solution components until a complete solution is

found. The nearest neighbour heuristic for the TSP is an example of a constructive

heuristic. It begins at some city and repeatedly chooses the nearest unvisited city

31


K-Salesman TSP Dial-a-ride Stacker Crane

GeneralAsymmetric

TSP

Asymmetric triangle InequalityTSP Symmetric TSP

MixedChinesePostman

DirectedHamiltonian

Cycle

Symmetric TriangleInequality TSP

Hamiltonian Cycle Rectilinear TSP Euclidean TSP

Hamiltonian Cycle forGrid Graphs

Figure 2.2: Special cases and generalisations of the TSP. The TSP type studied in this thesis is high-lighted in bold. Image adapted from [70, p. 59].

32


until a complete tour has been constructed. Local search (or improvement) heuris-

tics start with an initial solution and iteratively try to improve on this solution by

searching within an appropriately defined neighbourhood of the current solution1.

Others [11] provide an overview of local search approaches. This research focuses

on constructive methods rather than local search.

It is clear that there is a myriad of possible combinations of constructive heuris-

tics and local search heuristics. Some are more suited to particular types of com-

binatorial optimisation problems and instances than others. Metaheuristics try

to combine these more basic heuristics (both constructive and local search) into

higher-level frameworks in order to better search a solution space. Some examples

of metaheuristics for combinatorial optimisation [16, p. 270] are Ant Colony Opti-

misation [47], Evolutionary Computation [83], Simulated Annealing [73], Iterated

Local Search [66], and Tabu Search [56].

The general high-level framework of the metaheuristic makes it difficult to de-

fine exactly what a metaheuristic is. Many definitions have been summarised in

the literature [16, p. 270].

A metaheuristic is an iterative master process that guides and modi-

fies the operations of subordinate heuristics to efficiently produce high-

quality solutions. It may manipulate a complete (or incomplete) single

solution or a collection of solutions at each iteration. The subordinate

heuristics may be high (or low) level procedures, or simple local search,

or just a construction method. [120]

Metaheuristics are typically high-level strategies which guide an under-

lying, more problem specific heuristic, to increase their performance.

. . . . Many of the metaheuristic approaches rely on probabilistic deci-

sions made during the search. But, the main difference to pure ran-

dom search is that in metaheuristics algorithms randomness is not used

blindly but in an intelligent, biased form. [117, p. 23]

Interestingly, the first definition permits a stand-alone constructive heuristic

without local search to be considered as a metaheuristic.

It can be useful to consider the different dimensions along which metaheuristics

can be classified ([16, p. 272] and [117, p. 33-35]). The metaheuristic studied in

this research can then be classified in relation to other metaheuristics.

• Nature-inspired versus non nature-inspired. There are nature-inspired

algorithms, like Genetic Algorithms and Ant Algorithms, and non nature-

inspired ones such as Tabu Search and Iterated Local Search. This dimen-

sion is of little use as most modern metaheuristics are hybrids that fit in both

classes.

• Population-based versus trajectory methods. This describes whether an

algorithm works on a population of solutions or a single solution at any time.

1 We have used the terms local search and improvement together here. Henceforth, we only use theterm local search since this is now the more fashionable term for such heuristics.

33


Population-based methods evolve a set of points in the search space. Trajec-

tory methods focus on the trajectory of a single solution in the search space.

• Dynamic versus static objective function. Dynamic metaheuristics modify

the fitness landscape, as defined by the objective function, during search to

escape from local minima.

• One versus various neighbourhood structures. Some metaheuristics al-

low swapping between different fitness landscapes to help diversify search.

Others operate on one neighbourhood only.

• Memory usage versus memory-less methods. Some metaheuristics use

adaptive memory. This involves keeping track of recent decisions made and

solutions found or generating synthetic parameters to describe the search.

Metaheuristics without adaptive memory determine their next action solely

on the current state of their search process.

Having described the need for heuristic (approximate) methods for tackling com-

binatorial optimisation problems and the concept of a metaheuristic, we can now

describe the metaheuristic family examined in this thesis.

2.4 Ant Colony Optimisation (ACO)

Ant Colony Optimisation (ACO) [47] is a metaheuristic based on the way many

species of real ants forage for food. It is helpful to consider this process in a little

detail before describing the actual ACO heuristics.

Real ants manage to find paths between their nest and food sources over large

distances relative to a single ant’s size. They manage to coordinate this foraging for

large swarms of ants despite individual ants having only rudimentary vision and no

centralised swarm leader. It turns out that real ants communicate between them-

selves by leaving chemical markers in their environment. These markers are called

pheromones. By laying down pheromones and sensing existing pheromones, real

ants can locate and converge on trails leading to food sources. These pheromones

also evaporate over time so that as a food source is exhausted and fewer ants visit

it, the trail eventually disintegrates.

The original ant algorithm was inspired by the so-called ‘double bridge exper-

iment’, an experiment in biology that demonstrated this pheromone-laying be-

haviour for real ants. The experiment is summarised here to help in understanding

the subsequent algorithm descriptions. It is described in more detail in the ACO

literature [47, p. 1-5]. A double bridge was set up to connect a nest of ants to a food

source. One bridge was twice as long as the other (Figure 2.3 on the next page).

Ants leave their nest, encounter the fork in their path at point 1 and randomly

choose one of the two bridges. Ants choosing the shorter bridge will arrive at the

food source and start returning to the nest sooner. Because more ants can make

the journey along the shorter bridge in the same time as ants on the longer bridge,

the pheromone markers build up more quickly on the shorter bridge. Subsequent

34


ants, leaving the nest and encountering the fork at point 1 in Figure 2.3, sense a

higher level of pheromone on the shorter bridge and therefore favour choosing the

shorter bridge. This positive feedback of attractive pheromone trails enables the

swarm of ants to successfully find the shorter path to the food source without any

centralised leader and without any global vision of the two bridges. Two points

are particularly noteworthy about this experiment. Firstly, when the vast majority

of the ants had converged on the shorter bridge, a small proportion continued to

choose the longer bridge because of the random decision process. This can be con-

sidered as a type of continuous exploration of the environment. Secondly, when

presented with an even shorter bridge after convergence, the ants were unable to

move to the new shortest bridge because pheromone levels were so high on the

original bridge on which they had already converged. The natural evaporation of

the pheromone chemical was too slow to allow ants to ‘forget’ the first bridge.

Nest Food600

15 cm

Nest Food1 2

(a) (b)

Figure 1.1Experimental setup for the double bridge experiment. (a) Branches have equal length. (b) Branches havedi¤erent length. Modified from Goss et al. (1989).

0

50

100

0-20 20-40 40-60 60-80 80-100

% of traffic on one of the branches

0

50

100

0-20 20-40 40-60 60-80 80-100

(a) (b)

% o

f exp

erim

ents

% o

f exp

erim

ents

% of traffic on the short branch

Figure 1.2Results obtained with Iridomyrmex humilis ants in the double bridge experiment. (a) Results for the case inwhich the two branches have the same length (r ¼ 1); in this case the ants use one branch or the other inapproximately the same number of trials. (b) Results for the case in which one branch is twice as long asthe other (r ¼ 2); here in all the trials the great majority of ants chose the short branch. Modified fromGoss et al. (1989).

1.1 Ants’ Foraging Behavior and Optimization 3

Figure 2.3: Experiment setup for the double bridge experiment. Ants leave the nest and move towardsthe food source. One bridge is longer than the other (adapted from [47, p. 3]).

This ability of real ant swarms to find the shortest route along constrained paths

using pheromone markers and random decision processes was the inspiration for

the ant colony heuristic.

2.4.1 The Ant Colony Heuristic

The idea proposed in the original ant heuristic [44] and developed in many ACO

heuristics since then can be summarised in a general sense as follows. A combi-

natorial optimisation problem consists of a set of components. A solution of the

problem is an ordering of these components and a cost is associated with the or-

dering of each solution. This situation is represented by a data structure called a

graph (Figure 2.4 on the following page). Nodes in the graph (the black dots) are

solution components and a directed edge between two nodes is the cost of ordering

those components one after the other in the problem’s solution.

A number of artificial ants construct solutions by moving on the problem’s fully

connected graph representation. A movement from one node to another represents

a given ordering of those nodes in the constructed solution. Movements are gov-

erned by stochastic decisions. The constraints of the problem are built into the

ants’ decision processes. Ants can be constrained to only construct feasible solu-

tions or can also be allowed to construct infeasible solutions when this is beneficial.

The edges of the graph have an associated pheromone value and heuristic value.

35


IEEE Transactions on Systems, Man, and Cybernetics–Part B, Vol.26, No.1, 1996, pp.1-13 11

The same process can be observed in the graphs of Fig. 6, where the AS was applied to avery simple 10-cities problem (CCA0, from [20]), and which depict the effect of ant search onthe trail distribution. In the figure the length of the edges is proportional to the distancesbetween the towns; the thickness of the edges is proportional to their trail level. Initially (Fig.6a) trail is uniformly distributed on every edge, and search is only directed by visibilities. Lateron in the search process (Fig. 6b) trail has been deposited on the edges composing good tours,and is evaporated completely from edges which belonged to bad tours. The edges of the worsttours actually resulted to be deleted from the problem graph, thus causing a reduction of thesearch space.

1

2

3

4

5

6

7

8

9

10

1

2

3

4

5

6

7

8

9

10

a) b)

Fig. 6. Evolution of trail distribution for the CCA0 problem.a) Trail distribution at the beginning of search.b) Trail distribution after 100 cycles.

Besides the tour length, we also investigated the stagnation behavior, i.e. the situation inwhich all the ants make the same tour. This indicates that the system has ceased to explore newpossibilities and no better tour will arise. With some parameter settings we observed that, afterseveral cycles, all the ants followed the same tour despite the stochastic nature of the algorithmsbecause of a much higher trail level on the edges comprising that tour than on all the others.This high trail level made the probability that an ant chooses an edge not belonging to the tourvery low. For an example, see the Oliver30 problem, whose evolution of average branching ispresented in Fig. 7. In fact, after 2500 cycles circa, the number of arcs exiting from each nodesticks to the value of 2, which – given the symmetry of the problem – means that ants arealways following the same cycle.

This led us to also investigate the behavior of the ant-cycle algorithm for differentcombination of parameters α and β (in this experiment we set NCMAX=2500). The results aresummarized in Fig. 8, which was obtained running the algorithm ten times for each couple ofparameters, averaging the results and ascribing each averaged result to one of the threefollowing different classes.

• Bad solutions and stagnation. For high values of α the algorithm enters the stagnationbehavior very quickly without finding very good solutions. This situation is represented bythe symbol ∅ in Fig. 8;

• Bad solutions and no stagnation. If enough importance was not given to the trail (i.e., α wasset to a low value) then the algorithm did not find very good solutions. This situation isrepresented by the symbol ∞.

Figure 2.4: An example of a graph data structure. Nodes (the black dots) represent solution compo-nents and edges (the lines joining the nodes) represent the costs of ordering the connected nodes oneafter the other in the solution. The graph is fully connected because every node is connected to everyother node by an edge (adapted from [44]).

The pheromone value is updated by the ants while the heuristic value comes from

knowledge of the problem or the specific problem instance. Both the pheromone

value and the heuristic value of an edge are components of an ant’s stochastic

decision process when it considers the edges to move along. Stutzle and Dorigo

provide a more formal discussion of how the ant heuristic works [47, p. 34-36].

Ant Colony heuristics have been applied to a wide range of problems that can

be represented by such a graph structure (Table 2.1).

Problem References

1 Travelling Salesman Problem [44], [46], [118]

2 Quadratic Assignment Problem [77], [118]

3 Scheduling [39], [51], [80], [17]

4 Vehicle Routing [25]

5 Set Packing [54]

6 Graph Colouring [32]

7 Shortest Supersequence Problem [81]

8 Sequential Ordering [53]

9 Constraint Satisfaction Problems [116]

10 Data Mining [93]

11 Edge disjoint paths problem [88]

12 Bioinformatics [113]

13 Industrial [10], [59]

14 Dynamic [62, 61]

Table 2.1: A selection of ant heuristic applications (from [15]).

Since this thesis focuses on the TSP problem, we now explain how ant colony

heuristics are applied specifically to the TSP.

36


2.4.2 Application to the Travelling Salesperson Problem

The application of the general ant colony heuristic of the previous section to the

TSP of Section 2.2 on page 30 is straightforward. A solution is an ordering of all the

graph nodes because the TSP is a tour of all cities. Ants are therefore restricted to

constructing feasible tours only. Pheromone values are associated with each edge.

Higher pheromone values reflect a greater desirability for visiting one node after

another. The heuristic associated with each edge is simply the inverse of the cost

of adding that edge to the constructed solution. This cost is typically the distance

between the two nodes where distances can be calculated in several ways. These

costs are typically stored in a cost matrix.

2.4.3 The ACO Metaheuristic

Since the introduction of the original ant colony heuristic, Ant System [44], a pat-

tern in implementation has emerged that has allowed Ant System and many of

the subsequent ant colony heuristics to be grouped within a metaheuristic frame-

work. This metaheuristic is called Ant Colony Optimisation (ACO) [43]. The ACO

metaheuristic consists of several stages, illustrated in Figure 2.5.

End While

While (stopping criterion is not yet met)Initialise pheromone trails

For (each ant)

Construct Solutions using aprobabilistic decision rule

End ForFor (ants)

Apply local search

End For

For (graph edges)

Update pheromones

End For

MainLoop

1

2

3

4 Daemon actions

Figure 2.5: The ACO Metaheuristic. The four main stages described in the text are numbered.

1. Initialise Pheromone trails. An initial pheromone value is applied to each

edge in the problem instance.

2. Construct Solutions. Ants visit adjacent states of the problem by moving

between nodes on the problem graph. Once solutions have been constructed,

local search (Section 2.3 on page 31) can be applied to the solutions.

37


3. Update Pheromones. Pheromone trails are modified by evaporating from

and by depositing pheromone onto the problem graph’s edges. Evaporation

decreases the pheromone associated with an edge and deposition increases

the pheromone.

4. Daemon Actions. Centralised actions occur that are not part of the individ-

ual ant actions. These typically involve global information such as determin-

ing the best solution found by the construct solutions phase.

All ACO stages are scheduled by a Schedule Activities construct. The meta-

heuristic does not impose any detail on how this scheduling might occur. Solution

construction, for example, might occur asynchronously [111], in sequence or in

parallel.

This thesis investigates two ant colony heuristics within the ACO metaheuris-

tic framework. These are Max-Min Ant System [118] and Ant Colony System [46].

Stutzle and Dorigo [47, p. 69] state that one may distinguish between those heuris-

tics that descend directly from the original Ant System and those that propose sig-

nificant modifications to the structure of Ant System. Of the heuristics studied in

this thesis, MMAS is of the former type and Ant Colony System is of the latter. The

following sections provide a detailed description of the ACO heuristics studied. The

descriptions follow the 4 stages in the ACO metaheuristic description. Ant System

is described first for completeness.

2.4.4 Ant System (AS)

Ant System (AS) was first introduced to the peer-reviewed literature in 1996 [44]

as a heuristic for the TSP. The details of its four stages within ACO are as follows.

Stage 1: Initialise Pheromone trails

An initial pheromone value τ0 is applied to all edges in the problem. In the ACOTSP

code [47] this initial value was calculated according to the following equation:

τ0 =1

ρ ·NNTour(2.1)

where ρ is a heuristic tuning parameter related to the update pheromones stage

described in Section 2.4.4 on the next page and NNTour is the length of a single

tour generated using the nearest neighbour heuristic. If local search has been

specified for the algorithm then this local search is applied to the solution from the

nearest neighbour heuristic.

Stage 2: Construct Solutions

AS ants apply the following so-called random proportional rule when choosing the

next TSP city to visit. The probability of an ant at a city i choosing a next city j, is

given by

38


pij =[τij ]

α [ηij ]β∑

l∈Fi

[τil]α [ηil]

βif j ∈ Fi (2.2)

where Fi is the set of cities that the ant has not yet visited. τij is the pheromone

level on the graph edge connecting cities i and j and ηij is the heuristic value

for that edge. α and β are heuristic tuning parameters that adjust the relative

influence of pheromone and heuristic values respectively.

Stage 3: Update pheromones

Pheromones are updated with both evaporation and deposition once all ants have

constructed a solution.

EvaporationIn general, evaporation of pheromone occurs on all edges in the problem graph.

In the original source code, evaporation was limited to edges in the candidatelist (see Section 2.4.8 on page 44) if local search was used. For any given edge

connecting nodes i and j, the new pheromone value τij after evaporation is given

by

τij = (1− ρ)τij (2.3)

where ρ is a heuristic tuning parameter controlling the rate of pheromone evapo-

ration.

DepositionAfter evaporation, all ants deposit pheromone along the problem graph edges

belonging to their constructed solution. For any given edge in an ant’s solution

connecting nodes i and j, the new pheromone value τij after deposition is given by

τij = τij + 1/C (2.4)

where C is the cost of the solution built by the ant. Since better solutions have

lower costs, equation ( 2.4) means that better solutions receive a larger deposition

of pheromone.

Stage 4: Daemon Actions.

There are no actions in this stage of AS.

2.4.5 Max-Min Ant System (MMAS)

Max-Min Ant System [118] makes several modifications within the AS structure.

These modifications involve the use of limits on pheromone values (τmax and τmin)and the reinitialisation of edge pheromone limits.

39


Stage1: Initialise Pheromone trails

The maximum pheromone value τmax for all edges is initialised as per Equation

( 2.1 on page 38). The initial value of the pheromone minimum τmin is then set

according to

τmin =τmax

2n(2.5)

where n is the TSP problem size (the number of nodes in the graph). All edges are

initialised to the maximum trail value.


The Construct Solutions phase is the same as for AS (Section 2.4.4 on page 38).

Stage3: Update Pheromones

Before any evaporation occurs, the trail limits are updated according to the follow-

ing equations. The trail maximum is always calculated as follows.

τmax =1

ρ · Cbest so far(2.6)

where Cbest so far is the tour length of the best so far ant, the ant that produced

the best solution during the course of the heuristic so far. The calculation of the

trail minimum in the original source code [47] on which the thesis experiments are

based was confounded with whether local search was used. The first method, used

when local search was in use, calculated the new trail minimum as in Equation

( 2.5). This method is described in the book [47]. However, when local search was

not in use, the source code accompanying the book used another calculation.

τmin =τmax

(1− elog(

p2 ))

(candlist+1

2

)elog(

p2 )

(2.7)

where p is another possible heuristic tuning parameter. This was the calculation

used in this thesis research with p fixed at 0.05, a value that was hard-coded in the

source code. Equation ( 2.7) is similar in form to a version used in the literature

[118].

EvaporationPheromone evaporation is the same as for AS (Section 2.4.4 on the preceding

page and Equation ( 2.3 on the previous page)). After evaporation, the trail limits

are checked. Any pheromone value less than the trail min value is reset to be equal

to trail min. Any pheromone value greater than trail max is reset to be equal to

trail max.

DepositionPheromone deposition is also very similar to deposition for AS (Section 2.4.4 on

the preceding page and Equation ( 2.4 on the previous page)) except that only a sin-gle ant is allowed to deposit pheromone. The choice of whether this ant is the best

40


ant so far (best so far) or the best ant from the current iteration (best of iteration)

is rather complicated.

The best so far ant is used every u gb heuristic iterations. In all other itera-

tions the best of iteration ant is used. This frequency of best so far ant can vary.

For example, in the original source code and one piece of literature [118], the fre-

quency is varied according to a schedule. The schedule approach was used when

local search was also used. Alternatively, when local search was not in use, the

best so far ant was used every u gb iterations and this value was fixed at 25 in the

original code.

Clearly there are many possible schedules that can be applied to the frequency

of pheromone deposition with best so far ant. In this research we take a simpler

approach of having a fixed frequency restart freq with which best so far ant is

used, as in the case of no local search in the original source code. This fixed

frequency is a heuristic tuning parameter.


In MMAS, the daemon actions involve occasionally reinitialising the pheromone

trail levels. Reinitialisation occurs if both of two conditions are met. In the litera-

ture [47, p. 76], determining the reinitialisation was described as one condition orthe other being met. This research uses an and condition to maintain backwards

compatibility with the original source code. The first condition is whether a given

threshold number of iterations since the last solution improvement has been ex-

ceeded. The second condition (which is expensive to calculate relative to the first

condition) is whether the branching factor has dropped below a given threshold.

Branching factor is a measure of the uniformity of pheromone levels on all edges

in the problem’s graph. Its calculation and its expense are discussed in more

detail in Appendix C on page 233. The check is done after a fixed number of iter-

ations because of the expense of calculating branching factor. There are therefore

three tuning parameters controlling pheromone reinitialisation: threshold itera-

tions (reinit iters), threshold branching factor (reinit Branch) and check frequency

(reinit freq). In the original source code, these were hard coded to 250, 1.0001

and 100 respectively. However, at least one case in the literature [118] uses a

reinit iters of 50.

In the research reported in this thesis, we fix reinit freq =1 so that checks on

these conditions are made in every iteration. We made this decision because the

nesting of the other two parameters within the checking frequency made it im-

possible to combine properly all combinations of these tuning parameters in an

experiment design.

When a reinitialisation is due, trails are reinitialised to the trail max value that

is calculated as:

τmax =1

ρ · Cbest so far(2.8)

41


2.4.6 Ant Colony System (ACS)

Ant Colony System [46] differs significantly from AS in its solution construction

and pheromone evaporation procedures.

Stage 1: Initialise Pheromone Trails

The initial pheromone value for all edges is given by

τ0 =1

n ·NNTour(2.9)

where n is the problem size and NNTour is the length of a nearest neighbour tour.

This is different from AS and MMAS (Equation ( 2.1 on page 38)) where the n term

has replaced the pheromone evaporation term ρ. As with AS and MMAS, if local

search is in use, it is applied to the solution generated by the nearest neighbour

heuristic.


Solution construction is notably different from previous algorithms. An ant at a

city i chooses a next city j as follows{maxj∈Fi

{[τij ]

α [ηij ]β}if q ≤ q0

J(2.10)

where q is a random variable uniformly distributed in the range [0, 1]. q0 is a tuning

parameter that determines the threshold q value below which exploitation occurs

and above which exploration occurs in Equation ( 2.10). Fi is the set of feasible

cities (cities not yet visited by the ant). J is a randomly chosen city using the same

Equation ( 2.2 on page 39) as AS and MMAS, repeated below for convenience.

pij =[τij ]

α [ηij ]β∑

l∈Fi

[τil]α [ηil]

βif j ∈ Fi (2.11)

ACS was the first ACO algorithm to use a different decision process for explo-

ration and exploitation. The original source code facilitated applying this decision

process to solution construction in all heuristics provided with ACOTSP [47]. This

thesis takes advantage of this detail to apply the exploration/exploitation threshold

option to all heuristics studied.

When a given ant moves between two nodes, a local pheromone evaporation is

immediately applied to the edge connecting those nodes. After a movement from

node i to node j, the new pheromone level on the connecting edge τij is given by

τij = (1− ρlocal)τij + ρlocalτ0 (2.12)

where ρlocal is a heuristic tuning parameter and τ0 is the initial pheromone value of

Equation ( 2.9). Because of this local pheromone evaporation, the order in which

ants construct solutions in ACS may affect the pheromone levels presented to a

42


subsequent ant and ultimately the solutions produced by the swarm. There are

two distinct methods to construct solutions given a set of ants. Firstly, one can

iterate through the set of ants, allowing each ant to make one move and associated

local pheromone evaporation. This can be considered parallel solution construc-

tion and was the default implementation in the source code. Secondly, one can

move through the set of ants only once, allowing each ant to build a full tour and

apply all associated local pheromone evaporations. This can be considered as se-quential solution construction. It was an open question in the literature [47, p. 78]

whether there was a difference between sequential and parallel solution construc-

tion so this research included solution construction type as a tuning parameter for

ACS. This thesis will answer that open question.

Stage 3: Update pheromones

EvaporationThere is no evaporation in the update pheromones phase of ACS because phe-

romone has already been evaporated in the construct solutions phase.

DepositionPheromone deposition occurs along the trail of a single ant according to the

following:

τij = (1− ρ) τij + ρ1

Cchosen ant(2.13)

where C is the tour length of the chosen ant and ρ is a tuning parameter. It

is claimed in the literature that the use of the best so far ant is preferable for

instances greater than size 100 [47, p. 77]. We wished to investigate this claim

methodically and so created a tuning parameter that determines the chosen ant

used in ACS pheromone deposition. The tuning parameter determines whether the

chosen ant is the best so far ant or the best of iteration ant.


There are no daemon actions for the ACS algorithm.

2.4.7 Other ACO heuristics

There are of course many other variants within the ACO metaheuristic framework.

Best-Worst Ant System (BWAS) [31], although included in the original source code,

was omitted from our studies. This was because some of the behaviours of BWAS

were triggered by a CPU time measure. This made it impossible to guarantee back-

wards compatibility of our code with the original source code of Stutzle and Dorigo.

Ant System (AS), Elitist Ant System and Rank-based Ant System were omitted be-

cause they do not perform as well as MMAS and ACS and so have become less

popular. Applying the methodology, experiment designs and analyses introduced

by this thesis to AS, EAS, RAS and BWAS would be a straightforward matter to

explore as future work.

43


2.4.8 Additional tuning parameters

There are other possible tuning parameters that have been suggested in the liter-

ature or are implicitly used in the original source code or have been introduced in

our implementation of the original source code. We describe these parameters here

and discuss our decisions on their inclusion in the subsequent thesis research.

Exploration and exploitation

It was mentioned in the description of tour construction for ACS (Section 2.4.6

on page 42) that a random decision is made between exploration and exploita-

tion based on an exploration/exploitation threshold and that the original source

code allowed the application of this decision to all ACO algorithms. We decided

to include the use of the exploration/exploitation threshold in all ACO algorithms

investigated. If the threshold is not important, then a q0 threshold of 0 will be

recommended by the tuning methodology and tour construction will default to the

original case of only using the random proportional rule (see Equation ( 2.10 on

page 42)).

Candidate lists

A speed-up trick known as a candidate list was first used in ACS. A candidate list

restricts the number of available choices at each tour construction step to a list

of choices that are rated according to some heuristic. For the TSP, one possible

candidate list for a given city is a list of some number of neighbouring cities, sorted

into increasing distance from the current city. This number of neighbours, the

candidate list length, is a possible tuning parameter. For a static TSP problem,

candidate lists can be constructed for each TSP city at the start of the heuristic

run. This was the case in the original source code. Candidate lists simplify tour

construction as follows. When an artificial ant makes a decision on the next city to

visit, it first checks its current city’s candidate list. If all cities in the list have been

visited, the ant applies the usual tour construction rules to the remaining cities.

If, however, there are unvisited cities in the candidate list, the ant chooses from its

current city’s candidate list according to the usual rule.

For this research, candidate lists were applied to all ACO heuristics. List length

was expressed as a percentage of the problem size.

Computation limit

The previous section described candidate lists and how they are used to limit ant

decisions in the ACS heuristic tour construction. An examination of the origi-

nal source code reveals that candidate lists also influence the update pheromones

stages. Specifically, evaporation of pheromone and subsequent update of phero-

mone levels on edges are limited to the edges in each node’s candidate list. However

the influence of candidate lists was further complicated by its confounding with the

use of local search. Specifically, if local search was applied, then evaporation and

44


update were limited to the candidate list. If local search was not specified, evapora-

tion and update were applied to all edges. The decision was taken in this research

to introduce a new heuristic tuning parameter, called the computation limit, that

specifies whether pheromone updates should be limited to the candidate lists or

applied to all edges leading from every node. This tuning parameter can be applied

independently of whether local search was specified. Applying computation to all

edges is obviously extremely expensive and so in this research, computation limit

is always set to be limited to the node candidate lists. In Design of Experiments

(DOE) terms it is a held-constant factor (Section A.1.2 on page 211).

For MMAS, the candidate list length was also involved in the calculation of

updated trail minimum (Equation ( 2.7 on page 40) in Section 2.4.5 on page 40)

and computation of branching factor for the trail reinitialisation decision. These

calculations are not affected by the new computation limit parameter and can be

specified independently.

2.4.9 Summary of tuning parameters

Figure 2.6 summarises the various tuning parameters that are common to all the

ACO heuristics in this research and, where available, the recommended tuning

parameter settings from the literature [47, p. 71]. Some of these settings were

hard-coded from the ACOTSP author’s experience with the algorithms. In the ab-

sence of such experience it is useful to parameterise these hard-coded values and

experiment with tuning them.

# Parameter Description AS MMAS ACS

1 Exponent of pheromone term in the random proportional rule. 1 1 1

2 Exponent of heuristic term in the random proportional rule 2 to 5 2 to 5 2 to 5

3 Global pheromone evaporation term. 0.5 0.02 0.14 Number of ants. n n 105 Exploration/exploitation threshold. None None 0.96 Length of candidate list.

7 Placement Type of ant placement on the TSP graph. random random random

8 Local search type The type of local search to use.

9 Don’t look bits A parameter related to the local search routine.

10 Neighbourhood size

The number of neighbours examined by the local search routine.

11 Computation limitWhether certain computations are limited to the candidate list or applied to the whole problem.

α

βρm

0qc

Figure 2.6: Common tuning parameters and recommended settings for the ACO algorithms. Thesetuning parameters are common to all ACO algorithms in this research. n is the size of problem in termsof number of nodes.

The parameters and recommended settings from the literature for MMAS and

ACS are given in Figure 2.7 on the next page and Figure 2.8 on the following page

respectively. The nested parameter column in the MMAS table is for parameters

that only make sense as part of their parent parameter.

45


# ParameterNested parameter Recommended

12 Trail min update type

The calculation used when a new trail minimum is set.

None. Varies in the literature.

13 p A term used in one particular type of trail min update calculation.

None. Hard-coded to 0.05 in the original source code.

14 restart_freq The frequency with which the best-so-far ant is used to deposit pheromone.

None. Varies between fixed frequency and more complicated scheduled frequencies

15 reinit_freqThe frequency, in terms of iterations, with which a check is done on the need for trail reinitialisation.

None. Hard-coded in original source to 100.

16 reinit_itersThe threshold iterations without solution improvement after which a trail reinitialisation is considered.

None. Hard-coded in original source to 250.

17 reinit_Branch The threshold branching factor after which a trail reinitialisation is considered.


18 lambdaUsed to determine the cut-off point for inclusion of an edge in the branching factor calculation.


Figure 2.7: Tuning parameters and recommended settings for the MMAS algorithm.

# Parameter Description Recommended

12 A term in the local pheromone evaporation equation. 0.1

13 Const The solution construction method. parallel

14Pheromone deposition ant

The choice of whether to use the best_so-far ant or the best_of_iteration ant in pheromone deposition.

None

localρ

Figure 2.8: Tuning parameters and recommended settings for the ACS algorithm.

46


It is clear from these summaries that the ACO algorithms have many tuning

parameters2. Looking at the common parameters alone, there are eleven tuning

parameters. When specific heuristics are considered, the number of tuning pa-

rameters increases to potentially 18 for MMAS and 14 for ACS. We say that there

is a large parameter space. Moreover, there are no recommendations for many of

the parameter settings. Sophisticated techniques are required so that it is feasible

to experiment with such large numbers of tuning parameters. Such techniques

exist in a field called Design of Experiments.

2.5 Design Of Experiments (DOE)

This section may contain some terminology unfamiliar to the reader. Further de-

tails on the specific DOE techniques and issues encountered in this thesis are

summarised in Appendix A and are detailed in the literature [1, 89, 84, 85].

The National Institute of Standards and Technology defines Design Of Experi-

ments (DOE or sometimes DEX) as:

. . . a systematic, rigorous approach to engineering problem-solving that

applies principles and techniques at the data collection stage so as to

ensure the generation of valid, defensible, and supportable engineering

conclusions. In addition, all of this is carried out under the constraint

of a minimal expenditure of engineering runs, time, and money. [1]

The systematic approach comes from the clear methodologies and experiment

designs used by DOE. The analysis of the designs is supported with statistical

methods, providing the user with defensible conclusions and mathematically pre-

cise statements about confidence in those conclusions. The DOE principles of data

collection ensure that only sufficient data of a high quality is collected, improving

the efficiency and cost of the experiment. The main capabilities of DOE are as

follows [112]:

1. Quantify multiple variables simultaneously: Many factors and many re-

sponses can be investigated in a single experiment.

2. Identify variable interactions: the joint effect of factors on a response can

be identified and quantified.

3. Identify high impact variables: the relative importance of all factors on the

responses can be ranked.

2 It is possible to draw a distinction between tuning parameters and what we shall term designparameters. Tuning parameters are known to affect heuristic performance and so must be specified forevery deployment of the heuristic. Design parameters, by contrast, are alternative heuristic componentsthat have been parameterised so that they can be plugged into the heuristic. The aim is to determinewhether any of the alternative designs have a favourable affect on performance. An example in thecurrent research is the use of sequential or parallel solution construction in ACS (Section 2.4.6 onpage 42). If an alternative value of the design parameter is shown not to affect performance then thatalternative is removed as a parameter and the improved design is thereby fixed.

47


4. Predictive capability within design space: performance at new points in

the design space can be predicted.

5. Extrapolation capability outside design space: occasionally and with some

caution, performance outside the design space can be extrapolated.

These capabilities make DOE an essential approach for any research dealing

with large and expensive experiments. In industry, users of DOE include NASA3

and Google4.

The Operations Research community has been aware of these advantages for

some time, acknowledging that the risk of not adopting DOE is that the absence of

a statistically valid, systematic approach can result in the drawing of insupportable

conclusions [3]. Adenso-Dıaz and Laguna [2] give a brief list of OR papers that have

used statistical experiment design over the past 30 years. Some discussions are

quite general and offer guidelines on the subject [7, 35, 60]. Experimental design

techniques have been used to compare solution methods [3] and to find effective

parameter values [33, 123]. None of these papers’ techniques or methodologies,

however, have become so widespread that they approach being the standard for

experimental work in OR. None have been applied to ACO heuristics.

More often than not, ACO research uses a trial-and-error approach to answer-

ing its research questions. Birattari [12, p. 34-35] identifies two disadvantages

of the trial-and-error approach to parameter tuning and relates these to an in-

dustrial and academic context. From an industrial perspective, the trial-and-error

approach is time-consuming and requires a very specialised practitioner. From

an academic perspective, the approach does not facilitate a methodical scientific

analysis.

When the need for a more methodical approach to parameter tuning is acknowl-

edged, researchers may attempt a One-Factor-At-A-Time (OFAT) analysis. OFAT

involves tuning a single parameter when all others are held fixed, repeating this

process with each parameter one at a time. However it is well recognised outside

the heuristics field that DOE has many advantages over OFAT. Czitrom [37] illus-

trates these advantages by taking three real-world engineering problems that were

tackled with an OFAT analysis and re-analysing them with designed experiments.

In summary, the following advantages are clearly illustrated:

• Efficiency. Designed experiments require fewer resources, in terms of exper-

iments, time and material, for the amount of information obtained.

• Precision. The estimates of the effects of each factor are more precise. Full

factorial and fractional factorial designs use all observations to estimate the

effects of each factor and each interaction. OFAT experiments typically use

only two treatments at a time to estimate factor effects.

3 Obtained by searching the NASA Technical Reports server (http://ntrs.nasa.gov/search.jsp) withthe phrase “Design Of Experiments”.

4 Web page of Peter Norvig, current Director of Research at Google (http://norvig.com/experiment-design.html).

48

http://ntrs.nasa.gov/search.jsp

http://norvig.com/experiment-design.html

http://norvig.com/experiment-design.html


• Interactions. Designed experiments can estimate interactions between fac-

tors but this is not the case with OFAT experiments.

• More information. There is experimental information in a larger region of

the design space. This makes process optimisation more efficient because

the whole factor space can be studied and searched.

Despite all the advantages and capabilities of DOE presented above, we still

encounter several common excuses for not using DOE [112]. We list these here

with our own refutation of those excuses.

• Claim of no interactions: it may indeed be the case that there are no in-

teractions between factors. This claim can only be defended after a rigorous

DOE analysis has shown it to be true.

• OFAT is the standard: we have seen that trial-and-error approaches and

OFAT approaches are the norm. However, the comparison to DOE to OFAT

that we presented from another field [37] shows that if OFAT is the standard

then it is a seriously deficient standard that must be improved.

• Statistics are confusing: it is true that we cannot expect heuristics re-

searchers to become experts in statistics and Design Of Experiments. That

is the job of statisticians. It is also true that becoming an expert is not nec-

essary for leveraging the power and capabilities of DOE. In other fields such

as medicine and engineering, the research questions are often repetitive. Is

this drug effective? Can this manufacturing process be improved? This per-

mits identifying a small set of experiment designs and analyses that serve

to answer those common research questions. We will see in Chapter 3 that

heuristic tuning involves a similarly small set of research questions. This the-

sis will demonstrate the use of the designs and analyses to answer the most

important of those questions. Even the statistical analyses themselves can

be performed in software that shields the user from unnecessary statistical

details and guides the user in interpreting statistical analyses.

• Experiments are too large: it is true that experiments with tuning heuristics

are large. We have already mentioned the prohibitive size of the design space

in Chapter 1’s list of obstacles to the parameter tuning problem. This thesis

will introduce new experiment designs that permit answering the common

research questions with an order of magnitude fewer experiments.

2.6 Chapter summary

This chapter covered the following topics.

• Combinatorial optimisation. Combinatorial optimisation (CO) was intro-

duced and described. The Travelling Salesperson Problem was highlighted as

a particular type of CO problem.

49


• The Travelling Salesperson Problem. The reasons for the popularity of the

TSP were given along with a summary of the various types of TSP. The Sym-

metric TSP is the focus of this thesis.

• Heuristics. The difficulty of finding exact solutions to CO problems necessi-

tates the use of approximate methods or heuristics.

• Metaheuristics. Metaheuristics were introduced as an attempt to gather

various heuristics into common frameworks.

• Ant Colony Optimisation. Ant Colony Optimisation is a particular meta-

heuristic based on the foraging behaviour of real ants. The ACO heuristics

have been applied to many CO problems that can be represented by a graph

data structure. Several types of ACO heuristic were described in detail.

• Design of Experiments. The field of Design Of Experiments was introduced

and its capabilities highlighted. The advantages of DOE over its alternatives,

trial-and-error or One-Factor-At-A-Time were described. Some common ex-

cuses for not adopting DOE were refuted.

The next Chapter will review the issues that arise when using DOE and describe

how DOE should be adapted for experiments with tuning metaheuristics.

50

Part II

Related Work

51

3Empirical methods concerns

Thus far, this thesis has motivated rigorous empirical research on the parameter

tuning problem for metaheuristics. It was hypothesised that the parameter tuning

problem could be successfully addressed by adapting techniques from the field

of Design Of Experiments. A background on combinatorial optimisation and the

Travelling Salesperson Problem was detailed. The Ant Colony Optimisation (ACO)

family of metaheuristics was introduced and described.

Criticisms of the lack of experimental rigour in the experimental analysis of

heuristics have been made on several occasions in the operations research field

[64, 65]. Such criticisms and calls for increased rigour have also appeared in

the evolutionary computation field [48, 122, 103]. While there has been much

useful and creative research in the ACO field, the issue of experimental rigour

has never been to the fore. In the following, we bring together the most relevant

criticisms, suggestions and general issues relating to the design and analysis of

experiments with heuristics that have appeared in the heuristics and operations

research literature over the previous three decades. We relate these to the relatively

new ACO field. This will facilitate a critical review of the literature on ACO and the

parameter tuning of ACO in the next chapter. It will also strongly influence the

development of the thesis methodology in subsequent chapters.

The material in this chapter is presented approximately in the order an ex-

perimenter would encounter the issues when working in the field. Some issues,

such as reproducibility and responses for example, have an unavoidable overlap—

a poor choice of response or poor reporting will reduce reproducibility for example.

A familiarity with statistics and Design Of Experiments is assumed. A necessary

background on these topics is given in Appendix A and in the literature [89, 84].

53

CHAPTER 3. EMPIRICAL METHODS CONCERNS

3.1 Is the heuristic even worth researching?

A question that heuristics research often neglects is whether the heuristic is even

worth researching. It is tempting to expend effort on extensions of nature-inspired

metaphors and refinements of algorithm details. In fact, these endeavours were

identified as important goals in the early stages of the ACO field [30]. While much

useful work is done in this direction, it is important not to lose sight of the purpose

of optimisation heuristics which is to address the heuristic compromise and solve

difficult optimisation problems to a satisfactory quality in reasonable time. John-

son [69] lists some questions that should be asked before beginning a heuristics

research project:

• What are the questions you want your experiments to address?

• Is the algorithm implemented correctly and does it generate all the data you

will need? We add that, all else being equal, the analytical tractability of the

algorithm changes the conclusions that can be made from our experiments.

• What is an adequate set of test instances and runs?

• Given current computer specifications, which problem instances are too small

to yield meaningful distinctions and which are too large for feasible running

times?

• Who will care about the answers given the current state of the literature?

This final question of ‘care’ ties in with the analysis of Barr et al [7, p. 12] who

state that a heuristic method makes a contribution if it is:

• Fast: produces higher quality solutions quicker than other approaches.

• Accurate: identifies higher quality solutions than other approaches.

• Robust: less sensitive to differences in problem characteristics, data quality

and tuning parameters than other approaches.

• Simple: easy to implement.

• Generalisable: can be applied to a broad range of problems.

• Innovative: new and creative in its own right.

An examination of the literature reveals that not only do researchers often fail

to ask questions regarding speed, accuracy and robustness, they often fail even to

collect the necessary data that would permit answering these questions.

We can speak of one heuristic dominating another heuristic in terms of one or

more of these qualities when the dominating heuristic scores better on these quali-

ties than the dominated heuristic. For example, we often find that a given heuristic

may dominate another in terms of speed and accuracy but is in turn dominated in

terms of its generalizability. In general, a highly dominated heuristic is not worth

studying given the aforementioned overarching aim of heuristics research. It is

54


nonetheless worthwhile to study a dominated algorithm in some circumstances

[69]. Firstly, the algorithm may be in widespread use or its dominating rival may

be so complicated that it is unlikely to enter into widespread use. Secondly, the

algorithm may embody a general approach applicable to many problem domains

and studying how best to adapt it to a given domain may be of interest.

ACO certainly does embody a general approach for combinatorial optimisation

problems that can be represented by graphs (Section 2.4 on page 34). The version

of ACO studied in this thesis does not incorporate local search and therefore it

could be argued that the thesis experiments with a dominated algorithm. This ar-

gument is easily countered in several ways. Firstly, much research in ACO is still

conducted without local search. Secondly, and more importantly, the emphasis

of this thesis is on parameter tuning rather than on the design of new ACO ver-

sions that improve the state-of-the-art in TSP solving. ACO is a useful subject of

study because of its large number of tuning parameters. Although the thesis’ DOE

approach to tuning will later be shown to improve ACO performance, no claims

are made about the competitiveness of this performance in relation to state-of-

the-art TSP solution methods. This does not preclude applying the thesis’ DOE

methodologies to such state-of-the-art methods.

Assuming that it is worthwhile to study the heuristic in question, the experi-

menter must then determine what type of study will be conducted.

3.2 Types of experiment

Barr et al [7] distinguish between just two types of computational experiments with

algorithms: (1) comparing the performance of different algorithms for the same

class of problems or (2) characterising an algorithm’s performance in isolation. In

fact, there are several types of experiment identified in the literature.

• Dependency study [79] (or Experimental Average-case study [69]). This

aims to discover a functional relationship between factors and algorithm per-

formance measures. It focuses on average behaviour, generating evidence

about the behaviour of an algorithm for which direct probabilistic analysis is

too difficult. For example, one may investigate whether and how the tuning

parameters α and β (Section 2.4) increase the convergence rate of ACO.

• Robustness study [79]. A robustness study looks at the distributional prop-

erties observed over several random trials. Typical questions that a robust-

ness study addresses are: how much deviation from average is there? What

is the range in performance at a given design point? Are there unusual values

in the measurements?

• Probing study [79] (or Experimental Analysis paper [69]). These studies

‘open up’ an algorithm and measure particular internal features of its oper-

ation, attempting to explain and understand the strengths, weaknesses and

workings of an algorithm. For example, an ACO probing study might in-

vestigate whether different types of trail reinitialisation schedule improve the

55


performance of MMAS (Section 1 2.4.5 on page 39).

• Horse race study [69] (or Competitive Testing [65]). A horse race study

attempts to demonstrate the superiority of one algorithm over another by

running the algorithm on benchmark problem instances. This is typical of

the majority of research in the ACO field. The horse race study has its place

towards the latter stages of a heuristic’s life cycle (Section 3.3). However, its

scientific merits have been strongly criticised [69, 65].

• Application study [69]. An application study uses a particular code in a

particular application and describes the impact of the code in that context.

For example, one might report the application of ACS to a scheduling problem

in a manufacturing plant.

We consider the application study as a specific context for the other types of

study. Dependency studies, robustness studies, probing studies and horse race

studies could all conceivably be conducted with a particular code in a particular

application. The choice of experiment type will depend very much on the life cycles

of both the heuristic and the problem domain in question. This thesis is primarily

a dependency study as it studies the relationship between tuning parameters and

performance. It also has some characteristics of a probing study in that design fac-tors, factors that represent parameterised design decisions, are also experimented

with.

3.3 Life cycle of a heuristic and its problem domain

The heuristic life cycle consists of two main phases, (1) research and (2) develop-

ment [101]. Research aims to produce new heuristics for existing problems or to

apply existing heuristics in creative ways to new problems. Development aims to

refine the most efficient heuristic for a specific problem. Software implementation

details and the application domain become more important in this situation.

Birattari [12] breaks development into two phases. There is what he also terms

a development phase in which the algorithm is coded and tuned. This phase relies

on past problem instances. The second phase is the production phase in which

the algorithm is no longer developed and is deployed to the user. This phase is

characterised by the need to cope with new problem instances.

The research phase of the heuristic lifecycle requires dependency, robustness

and probing studies. The development phase requires horse race and application

studies. Although ACO is still in the research phase of its life cycle, the majority

of work reported on it is more appropriate for the development phase as it focuses

on the typical horse race and application study issues.

The problem domain also has a life cycle [101, p. 264] and this impacts heuris-

tic research. For some problems in the early stages of their life cycle, there exist

few if any solution algorithms. In these cases, being able to consistently construct

a feasible solution is a significant achievement. Later in the problem’s life cycle, a

body of consistent algorithms that produces feasible solutions already exists. At

56


this stage, research must demonstrate either an insight into the algorithm’s be-

haviour (probing study) or must demonstrate that the algorithm performs better

than other existing methods (horse race and application studies). The TSP as used

in this thesis is undoubtedly in the later stages of its life cycle.

Over ten years ago in 1996, Colorni et al [30] identified 4 progressions in nature-

inspired heuristics and used these to compare the state of the art of 6 types of

heuristic. Their stages, in order of progress, are:

1. the presence of practical results,

2. the definition of a theoretical framework,

3. the availability of commercial packages, and

4. the study of computational complexity and related principles.

We disagree with this ordering, although it is often encountered in computer

science research. The meaning of ‘practical results’ is vague. Assuming the au-

thors mean results on real problems rather than small scale abstractions, then

complexity studies and theoretical frameworks can certainly precede such ‘practi-

cal results’. Their stages make no distinction between problem and heuristic life

cycle. Their view on the state-of-the-art with respect to these stages is summarised

in Table 3.1.

Results Theory Packages Complexity

Simulated An-nealing

Well devel-oped

Well devel-oped

Developing Developing

Tabu Search Developing Developing Developing Developing

Neural Nets Developing Developing Developing Emerging

Genetic Algo-rithms

Developing Developing Emerging Emerging

Sampling andClustering

Developing Emerging Emerging

Ant Systems Emerging Emerging

Table 3.1: The state of the art in nature-inspired heuristics from 10 years ago. Adapted from [30].

With hindsight, we see that this assessment was overly optimistic. A robust

theory and complexity analysis for ant colony algorithms has yet to be established

and research has only recently moved in this direction [42]. There are still no

commercial packages in widespread use, although we have mentioned anecdotal

evidence for the use of ant colony approaches within several companies (Chapter

1). Even some of the supposedly established results that we will review in the next

chapter may have to be revised in light of the experiment design issues we review

in this section and the results this thesis reports.

57


3.4 Research questions

Having decided on the type of experimental study that is required, based on the

heuristic and problem life cycles, the experimenter can then proceed to refine the

study into one or more specific research questions. There are two main issues in

research with heuristics: how fast can solutions be obtained and how close do the

solutions come to being optimal [101]? These questions cannot be answered in

isolation. Rather we must consider the trade off between feasibility and solution

quality [7, p. 14], a trade off that this thesis terms the heuristic compromise.

This is of course a simplification and other authors have tried to enumerate the

various research questions that one can investigate within this heuristic compro-

mise of quality and speed. In the following, we have categorised these questions

within the types of experimental study identified previously.

3.4.1 Dependency Study

• What are the effects of type and degree of parametric change on the perfor-

mance of each solution methodology [3, p. 880]?

• What are the effects of problem set and size on the performance of each

method [3, p. 880]?

• What are the interaction effects on the solution techniques when the above

factors are changed singly or in combination [3, p. 880], [30]?

• How does running time scale with instance size and is there any dependence

on instance structure [69, 30]?

3.4.2 Robustness

• How robust is the algorithm [7, p. 14]?

• Does a new class of instances cause significant changes in the behaviour of a

previously studied algorithm [69]?

• For a given machine, how predictable are running times/operation counts for

similar problem instances [69]?

• How is running time affected by machine architecture [69]?

• How far is the best solution from those more easily found [7, p. 14]?

• What are the answers to these questions for other performance measures

[69]?

3.4.3 Probing study

• How do implementation details, heuristics and data structure choices affect

running time [69]?

58


• What are the computational bottlenecks of the algorithm and how do they

depend on instance size and instance structure [69]?

• What algorithm operation counts best explain running time [69, 30]?

3.4.4 Horse race

• Is there a best overall method for solving the problem? [3, p. 880]

• What is the quality of the best solution found? [7, p. 14]

• How long does it take to determine the best solution? [7, p. 14]

• How quickly does the algorithm find good solutions? [7, p. 14]

• How does an algorithm’s running time compare to those of its top competitors

and are those comparisons affected by instance size and structure? [69]

3.5 Sound experimental design

Once one or more research questions have been identified, an experiment can be

planned and executed. A general procedure for experimentation has three steps

[28].

1. Design. An experimental design is conceived. This is the general plan the

experimenter uses to gather data. Crowder et al [34] quote a definition of

good experimental design.

The requirements for a good experiment are that the treatment com-

parisons should as far as possible be free from systematic error,

that they should be made sufficiently precise, that the conclusions

should have a wide range of validity, that the experimental arrange-

ment should be as simple as possible, and finally that the uncer-

tainty in the conclusions should be assessable. [4]

2. Data gathering and exploration. When all data have been gathered, an ex-ploratory analysis is conducted. This involves looking for patterns and trends

in the data using plots and descriptive statistics. Appropriate transformations

of the data may have to be done.

3. Analysis. Formal statistical analyses are performed.

There is some recognition in the literature that formal statistical analyses are

an integral part of an experimental procedure. Attempts have been made to detail

the specifics of the experimental procedure for heuristics.

Developing a sound experimental design involves identifying the vari-

ables expected to be influential in determining code performance (both

those which are controllable and those which are not), deciding the

appropriate measures of performance and evaluating the variability of

59


these measures, collecting an appropriate set of test problems, and fi-

nally, deciding exactly what questions are to be answered by the experi-

ment [35].

The identification of a methodical and ordered experimental design procedure

is to be welcomed. However, there is a problem with the ordering presented in that

a decision on the research question is left to the very end of the design process. We

posit that this should be the very first step in any procedure because the nature

of the questions the experimenter wants to ask will determine all subsequent de-

cisions in the design. A research question involving comparisons (Section 3.4.4 on

the preceding page) requires a different design to a research question concerning

relationships (Section 3.4.1 on page 58).

A more comprehensive seven step outline of the design and analysis of an ex-

periment comes from outside the heuristics field [84, p. 14].

1. Recognition of and statement of the problem. Although it seems obvious,

it is often difficult to develop a statement of the problem. If the process is new

then a common initial objective is factor screening, determining which factors

are unimportant and need not be investigated. A better understood process

may require optimisation. A system that has been modified may require con-firmation to determine whether it performs the same way as it did in the past.

A discovery objective occurs when we wish to explore new variables such as

an improved local search component. Robustness studies are needed when

there are circumstances in which the responses may seriously degrade.

Clearly, these objectives are reflected in the types of study categorised in

Section 3.4 on page 58.

2. Selection of the response variable and the need for replicates. The re-

sponse variable(s) chosen must provide useful information about the process

under study. Measurement error, or errors in the measuring equipment, must

be considered and may require the use of repeated measurements of the re-

sponse. This is typically the case with measurements of CPU time.

3. Choice of factors, levels, and range. Factors are either potential factorsor nuisance factors [84, p. 15]. Because there is often a large number of

potential factors, they are classified as either:

• Design factors: these are the factors that are actually selected for study.

• Held-constant factors: these factors may have an effect on the re-

sponse(s) but because they are not of interest, they are held constant

at a specific level during the entire experiment.

Nuisance factors are not of interest in the study but may have large effects on

the response. They therefore must be accounted for. Nuisance factors can be

classified [84, p. 16] as:

• Controllable: A controllable nuisance factor is one whose levels can be

set by the experimenter. In traditional design of experiments, a batch

60


of raw material is a common controllable nuisance factor. In heuristics

DOE, the random seed for the heuristic’s random number generator is a

very common one.

• Uncontrollable: These factors are uncontrollable but can be measured.

Techniques such as analysis of covariance can then be used to com-

pensate for the nuisance factor’s effect. In traditional DOE, operating

conditions such as ambient temperature may be an uncontrollable but

measurable nuisance factor. In heuristics DOE, CPU usage by back-

ground processes is a good example.

Once the design factors have been selected, the experimenter chooses both

the ranges over which the factors are varied and the specific factor levels at

which experiments will be conducted. The region of interest (Section A.2 on

page 213) is usually determined using practical experience and theoretical

understanding, when available. When there is no knowledge of the heuris-

tic, a pilot study can give quick and useful guidelines on appropriate factor

ranges.

The specific factor levels are often a function of the experimental design.

4. Choice of experimental design: choice of design depends on the experimen-

tal objectives. Some designs are more appropriate for modelling, some for

optimisation, some for screening. Some designs can better fit into a sequen-

tial experimental procedure and so are more efficient in terms of experimental

resources. Decisions are also made on the number of replicates and the use

of blocking.

An emphasis is clearly being placed on the importance of deciding on the

research question(s) early on in the experiment procedure and not at the end.

5. Performing the experiment: In the traditional DOE environment of manu-

facturing it is often difficult to plan and organise an experiment. The process

must be carefully monitored to ensure that everything is done according to

plan.

This is less of an issue in the majority of experiments with heuristics for

combinatorial optimisation. However, experiments involving heuristics and

humans, in some visual recognition task say, would have to pay very careful

attention in this step. It goes without saying that all code should be checked

for correctness and bugs.

6. Analysis of the data: Statistical methods are required so that results and

conclusions are objective rather than judgmental. Statistical methods do not

prove cause but rather provide guidelines to the reliability of a result [84, p.

19]. They should be used in combination with engineering knowledge.

Of particular importance here is the danger of misinterpretation of hypothesis

tests and p values. This is addressed in Appendix V on page 211.

61


7. Conclusions and recommendations: Graphical methods are most useful to

ensure that results are practically significant as well as statistically signifi-

cant. Conclusions should not be drawn without confirmation testing. That is,

new independent experiment runs must be conducted to confirm the conclu-

sions of the main experiment.

Confirmation testing is rare in the ACO literature. Birattari [12] draws at-

tention to the need for independent confirmation as is typical in machine

learning. This thesis places a strong emphasis on independent confirmation

of all its statistical analyses.

Cohen [29] identifies several tips for performance analysis of which the most

relevant to heuristics performance analysis are reproduced here.

1. Use bracketing standards. The tested program’s anticipated performance

should exceed at least one standard and fall short of another.

A typical upper standard in heuristics is an optimal solution. Often times

however the optimal solution is not known. The use of optimal solutions is

discussed in Section 3.10 on page 68. A typical lower standard is a randomly

generated solution. Alternative lower standards for heuristics are simple re-

producible heuristics such as a greedy search.

Adjusted Differential Approximation (Section 3.10.2 on page 69), a quality re-

sponse used in this thesis, incorporates a comparison to an expected random

solution to a problem.

2. Many measures. It is not expensive to collect many performance measures.

If collecting relatively few measures, a pilot study can once again help, deter-

mining which are highly correlated. Highly correlated measures are redun-

dant.

3. Conflicting measures. Collect opponent performance measures. Conflicting

measures are unavoidable with heuristics due to the heuristic compromise of

lower solution time and higher solution quality.

These design of experiment steps and tips provide metaheuristics research with

some much needed procedural rigour. However there remain many pitfalls for the

experimenter.

3.5.1 Common mistakes

Many of the issues that arise in designed experiments for other fields such as

manufacture are thankfully not an issue for the heuristic engineer. Consequently,

metaheuristics researchers have few excuses for poor methodology. Measurement

errors due to gauge calibration, for example do not arise. No human data entry

with the possibility of mistakes is generally required. Nonetheless, some of the

lessons from traditional DOE [67] do translate to designed experiments for heuris-

tics.

62


• Too narrow factor ranges. Running too narrow a range from high to low for

the factors can make it seem that key factors do not affect the process. The

reality is that they do not affect the process in the narrow range examined.

• Too wide factor ranges. Running too wide a range of factors may recommend

results that are not usable in the real process.

• Sample size and effect size. The sample size must be large enough to detect

the effect size that the experimenter has deemed to be significant and yet not

so large as to detect the tiniest of effects of no practical significance.

The risk of all of these mistakes being made can be greatly mitigated by invest-

ing a small amount of resources in a pilot study. Jain [68, p. 14-25] lists further

common mistakes made in performance evaluation. The most relevant of these are

as follows.

1. Biased Goals. The definition of a performance evaluation project’s goals can

implicitly bias its methodology and conclusions. A goal such as showing that

“OUR system is better than THEIRS” can turn a problem into one of finding

metrics such that OUR system turns out better rather than finding the fair

metrics for comparison [68, p. 15].

This is the danger that others also highlight [65]. The problem of bias is

discussed in further detail in Section 3.14 on page 74.

2. Unsystematic approach [68]. Analysts sometimes select parameter values,

performance metrics and problem instances arbitrarily, making it difficult

to draw any conclusions. This is, unfortunately, very common in the meta-

heuristics field. This thesis provides methodical guidelines and steps from

the Design Of Experiments field that replace this unsystematic approach.

3. Incorrect Performance Metrics. Changing the choice of metrics can change

the conclusions of a study. It is important to conduct a study with several

competing metrics so that any effect of choice of metric can be understood

and accounted for. This is not an issue if the experimenter has followed the

recommendations of recording many performance measures (Section 3.5 on

page 59).

4. Ignoring Significant Factors. Not all parameters have the same effect on

performance and so it is important to identify the most important parameters.

This is dealt with in the screening step mentioned in Section 3.5 on page 59.

5. Inappropriate Experimental Design. Proper selection of parameters and

number of measurements can lead to more information from the same num-

ber of experiments. Jain [68] also highlights the problem with the OFAT

approach and his preference for factorial and fractional factorial designs as

introduced in this thesis.

6. No Sensitivity Analysis. Without a sensitivity analysis, one cannot be sure

whether the conclusions would change if the analysis were done in a slightly

63


different setting. Furthermore, a sensitivity analysis can help confirm the

relative importance of factors.

7. Omitting Assumptions and Limitations. This can lead a reader of the re-

search to apply an analysis to another context where the assumptions are no

longer valid.

Even within a well-defined experimental framework, the experimenter must be-

ware of these many common pitfalls.

3.6 Heuristic instantiation and problem abstraction

A research question has been identified and an experiment design has been se-

lected to answer this question. The experimenter must now think about the imple-

mentation of the algorithm that is the subject of the experiment and the problem

domain to which the algorithm will be applied. We can consider both algorithms

and problems at different levels of instantiation. Several authors [28, 65, 79] dis-

cuss how different levels of algorithm instantiation are appropriate for different

types of analyses. A general description may be enough to determine whether an

algorithm has a running time that is exponential in the length of its input. Hooker

[65] likens this to an astronomer who tests a hypothesis about the behaviour of

galaxies by creating a simulation. This simulation can improve our understanding

even though the running time is much faster than the real phenomenon. Further

algorithm instantiation, such as details of data structures, is needed to count crit-

ical operations as a function of the input. A complete instantiation in a particular

language with a particular compiler and running on a particular machine is needed

to generate CPU times for particular inputs. This thesis uses fully instantiated al-

gorithms.

As instantiation increases, so too does the importance of implementation is-

sues. There are three main advantages to using efficient algorithm implementa-

tions [69]. Such implementations better support claims of practicality and com-

petitiveness. There is less possibility for the distortion of results achieved by al-

gorithms that are significantly slower than those used in practice. Finally, faster

implementations allow one to experiment with more and larger problem instances.

Clearly there is a balance between code that has been implemented efficiently for

research purposes and code that has been fine-tuned as for a competitive indus-

trial product.

The problem domain can also be treated at several levels of abstraction [12].

• Lowest level. This is a mathematical model of a well defined practical prob-

lem. This level is most often used in industry and application studies where

it is desired to solve a particular instance rather than make generalisations

across a class of instances.

• Intermediate level. At this level, abstractions such as the Travelling Sales-

person Problem and Quadratic Assignment Problem capture the features and

64


constraints of a class of problems. This thesis is focussed at this level of

problem instantiation.

• Highest level. The highest level of abstraction includes high level ideas such

as deceptive problems [57] but does not represent a specific real world prob-

lem.

Once the appropriate levels of algorithm instantiation and problem abstraction

have been agreed, the experimenter can begin pilot studies.

3.7 Pilot Studies

The discussions of common mistakes in experiment design already mentioned the

usefulness of pilot studies (Section 3.5.1 on page 62). A pilot study is simply a

small scale set of experiment runs that are used for the exploratory analysis of a

process. Pilot studies help refine a full blown experiment design in several ways

[101].

1. Pilot studies can indicate that some factors initially thought important actu-

ally have little effect or have a single best level that can be fixed in all runs.

They help identify design factors (Section 3.5 on page 59). They also can

indicate where two or more factors can be collapsed into a single one.

2. Pilot studies help determine the number and values of levels to use in deter-

mining whether the factor has practical significance.

3. Pilot studies reveal how much variability we can expect in outcomes. This

influences the number of replicates that will be necessary in a sample in

order to obtain reliable results.

4. Pilot studies can help design the algorithm itself, by highlighting appropriate

output data and stopping criteria.

Pilot studies are therefore an important part of the early stages of an exper-

iment design, reducing the risk of some common design mistakes (Section 3.5.1

on page 62). They can never be a replacement for designed experiments with suf-

ficient sample sizes and correct statistical analyses. Conclusions should not be

drawn from pilot studies.

3.8 Reproducibility

The reproducibility of research results is of course fundamentally important to all

sciences. Computer science and research in metaheuristics should be no different.

Reproducing research with computers in general and metaheuristics in particular

presents some unique challenges.

1. Differences between machines [65, 35]: It is difficult to guarantee that al-

gorithms being tested by different researchers are run on machines with the

65


exact same specifications. Specifying the processor speed, memory etc. is not

enough. What other CPU processes may have run throughout the experiment

or for periods during the experiment? Even if a researcher goes to all the

trouble of setting up a clean environment, how reproducible is that environ-

ment going to be for other researchers who do not have access to the same

machines? How reproducible will that environment remain as technology ad-

vances with new hardware and operating system versions for example? We

will see in Section 3.9 on the next page that many of these concerns can be

overcome with benchmarking but there is as yet no discussion of appropri-

ate benchmarks for ACO heuristics and their problem domains. This thesis

devotes a whole chapter (Chapter 5) to benchmarking its code.

2. Differences in coding skill [65]: It is often unclear what coding technique

is best for a given algorithm. Even if a given technique could be agreed on,

it is difficult to guarantee that different programmers have applied the same

technique fairly. This can be mitigated by using and sharing code. This the-

sis uses code that was made available online by Stutzle [47]. However, the

porting of this code from C to Java and the associated refactoring into an

object-oriented implementation undoubtedly introduces further implementa-

tion differences.

3. Degree of tuning of algorithm parameters [65]: Given that it is possible to

adjust parameters so that an algorithm performs well on a set of problems, we

must ask how much adjustment should be done and whether this adjustment

has been done in the same way as in the original research. This thesis intro-

duces a methodical approach to tuning metaheuristics and therefore greatly

improves this aspect of reproducibility of research.

Strictly then, the reproducibility of an algorithm means that ‘if you ran the

same code on the same instances on the same machine/compiler/operating sys-

tem/system load combination you would get the same running time, operation

counts, solution quality (or the same averages, in the case of a randomised algo-

rithm)’ [69].

This is impossible in practice. A broader notion of reproducibility [69] is re-

quired that is acceptable in classical scientific studies. This notion recognises that

while the classical scientist will use the same methods, he will typically use dif-

ferent apparatus, similar but distinct materials and possibly different measuring

techniques. The experiment is deemed reproduced if it produces data consistent

with the original experiment and reaches the same conclusions. Such a notion of

reproducibility must be expected from metaheuristics research.

3.8.1 Reporting results for reproducibility

Even if methods are reproduced exactly for a heuristic experiment, the way results

are calculated and reported can reduce reproducibility. Many of the common ap-

proaches to reporting the performance of an algorithm have drawbacks from the

perspective of reproducibility [69].

66


• Report the solution value: This is not reproducible since we cannot per-

form similar experiments on similar instances and determine if we are getting

similar results. Furthermore, it provides no insight into the quality of the

algorithm.

• Report the percentage excess over best solution currently known: this

is reproducible only if the current best solution is explicitly stated. Unfortu-

nately, current bests are a moving target and so leave us in doubt about the

algorithm’s true quality.

• Report the percentage excess over an estimate of a random problem’sexpected optimal: this is reproducible if the estimate and its method of

computing are explicitly stated. It is meaningful only if the estimate is con-

sistently close to the expected optimal.

• Report the percentage excess over a well-defined lower bound: this is

reproducible when the lower bound can be feasibly computed or reliably ap-

proximated.

• Report the percentage excess over some other heuristic: this is repro-

ducible so long as the other heuristic is completely specified. This involves

more than naming the heuristic or citing a reference. Johnson [69] recom-

mends using a simple algorithm as the standard. This standard is preferably

easily specified and deterministic.

3.9 Benchmarking

A machine can be fully described when results are originally reported. Over time,

it becomes increasingly difficult to estimate the relative speeds between the ear-

lier system and the current one because of changes in technology. This has two

consequences. Firstly, the existing results cannot be reproduced. Secondly, new

results cannot even be easily related to the existing results. The solution to this

is benchmarking. Benchmarking is the process of running standard tests on stan-

dard problems on a set of machines so that the machines can be fairly compared in

terms of performance. Johnson [69] advocates benchmarking code in the following

way. The benchmark source code is distributed with the experiment code. The

benchmark is compiled and run on the same machine and with the same compiler

as used for the experiment implementations. The run times for a specified set of

problem instances of varying sizes is reported. Future researchers can calibrate

their own machines in the same way and attempt to normalise existing results to

their newer results. Benchmarking is common in scientific computing and was in-

troduced to the heuristics community at the DIMACS challenges1. Benchmarking

for ACO TSP algorithms has never been reported to our knowledge. The bench-

marking process for this thesis is reported in Chapter 5.

1 http://public.research.att.com/∼dsj/chtsp/download.html

67

http://public.research.att.com/~dsj/chtsp/download.html


3.10 Responses

The issue of responses was already touched on in our discussion of reproducibility

(Section 3.8 on page 65). The choice of performance measure depends on the

questions that motivate the research [79]. A study of how growth rate is affected

by problem size (‘big O’ studies) would count the dominant operation identified in

a theoretical analysis. A study to recommend strategies for data structures might

measure the number of data structure updates. The literature offers some general

guidelines for choosing good performance measures [79].

• Data should not be summarised too early. Algorithms should report outputs

from every trial rather than the means over a number of trials. This is espe-

cially important when data have unusual distributional properties.

• A good performance measure will exhibit small variation within a design point

compared to the variation between distinct design points.

Barr et al [7, p. 14] observe that research questions can broadly be categorised

as questions of quality, computational effort and robustness. They advise that

measures from each category should be used in a well-rounded study. The litera-

ture also examines more specific responses.

3.10.1 CPU Time

Johnson [69] advocates always reporting CPU times, even if they are not the sub-

ject of a study. He presents some reasons why running times are not reported and

his counter arguments.

• The main subject of the study is one component of the running time, forexample local optimisation. Readers will still want to know how important

this component is relative to the overall running time.

• The main subject of the study is a combinatorial count related to the al-gorithm’s operation. To establish the meaningfulness of this count, readers

will need to study its correlation with running time. For example, an investi-

gation of pruning schemes used by an algorithm could mislead a reader if it

did not report that the better scheme took significantly longer to run.

This issue also arises in research with ACO where there is often a temptation

to extend the ant metaphor without examining the real cost of this added

complexity.

• The main concern of the study is to investigate the quality of solutionsproduced by an approximate algorithm. The main motivation of using an

approximate algorithm is that it trades quality of solution for reduced running

time. Readers will want to know what the trade off is.

McGeoch [79] acknowledges that it is often difficult to find combinatorial mea-

sures that predict running times well when an algorithm is highly instantiated.

68


Coffin and Saltzman [28] argue that CPU time is an appropriate comparison crite-

rion for algorithms when the algorithms being compared have significantly different

architectures and no comparable fundamental operations. Barr et al [7, p. 15-16]

advise recording the following times.

• Time to best-found solution: this is the time required by the heuristic to

find the solution the author reports. This should include all pre-processing.

• Total run time: this is the total algorithm execution time until the execution

of its stopping rule.

• Time per phase: the timing and quality of solution at each phase should be

reported.

One should exercise caution with the time to best solution response. It is only

after the experiment has concluded that we know this was the best solution found.

It can be deceptive to report this value in isolation if the reader is not told how long

the algorithm actually ran for. This is related to the issue of best solution from a

number of runs (Section 3.10.5 on the next page).

3.10.2 Relative Error and Adjusted Differential Approximation

According to Barr et al [7, p. 15], comparison should be made to the known optimal

solution. We have already mentioned some criticisms of the specifics of how this

comparison is made (Section 3.8 on page 65). Birattari [12] discusses measures of

performance in terms of solution quality. He rightly dismisses absolute error as it

is not invariant with a scaling of the cost function. He also dismisses the use of

relative error since it is not invariant under some transformations of the problem,

as first noted by Zemel [124]. An example is given of how an affine transformation2

of the distance between cities in the TSP, leaves a problem that is essentially the

same but has a different relative error of solutions. Birattari uses a variant of

Zemel’s differential approximation measure [125] defined as:

cde(c, i) =c− ci

crndi − ci(3.1)

where cde(c, i) is the differential error of a solution instance i with cost c, ci is the

cost of the optimal solution and crndi is the expected cost value of a random solu-

tion to instance i. An additional feature of this Adjusted Differential Approximation(ADA) is that its value for a random solution is 1, so the measure indicates how

good a method is relative to a trivial method which in this case is a random solu-

tion. It can therefore be considered as incorporating a lower bracketing standard

(Section 3.5 on page 59).

2 An affine transformation is any transformation that preserves collinearity (i.e., all points lying ona line initially still lie on a line after transformation) and ratios of distances (e.g., the midpoint of aline segment remains the midpoint after transformation). Geometric contraction, expansion, dilation,reflection, rotation, and shear are all affine transformations.

69


ADA is not yet a widely used solution quality response. This thesis measures

and analyses both relative error and ADA in keeping with Cohen’s [29] recommen-

dation on multiple performance measures (Section 3.5 on page 59).

3.10.3 Relative Terms

Relative terms are responses expressed such as some type of quotient such as the

number of iterations/average number of iterations. Crowder et al [35] are not in

favour of using relative terms when reporting performance. While relative terms

do make comparison more difficult, Johnson [69] argues that relative performance

indicators are often enlightening. It is important that enough information is pro-

vided so that the original components of the relative term can be recovered for

reproducibility (Section 3.8 on page 65).

3.10.4 Frequency of Optimum

While it is of interest to determine the probability with which an algorithm will

find an optimal solution for a given instance, it has limitations when used as a

metric [69]. Firstly, it limits analysis to instances for which optima are actually

known. Secondly, it ignores how near the algorithm gets when it doesn’t find

the optimum. Thirdly, it cannot distinguish between algorithms on larger prob-

lem instances where the probability of finding the optimal solution is usually 0.

Moreover, this response overemphasises finding an optimum when the heuristic

compromise is about finding a good enough solution in reasonable time.

3.10.5 Best Solution from a number of runs

Birattari and Dorigo [13] criticise the use of the best solution from a number of runs

as advocated by others [48]. They dismiss this measure as ‘not of any real interest’

since it is an over optimistic measure of a stochastic algorithm. The authors also

counter the reasoning that in a real world scenario one would always use the best

of several runs [48]. Firstly, it leads to an experiment measuring the performance

of a random restart version of the algorithm. Secondly, this random restart version

is so trivial (repeated run of the same algorithm with no improvement or input from

the previous run) that it would not be a sound restart strategy anyway with the

given resources. Johnson [69] levels two further criticisms at the reporting of the

best solution found from multiple runs on a problem instance. Because the best

run is a sample from the tail of a distribution it is necessarily less reproducible

than the average. Also, if running time is reported, it is generally for that best run

of the algorithm and not for the entire number of runs that yielded the reported

best solution (Section 3.10.1 on page 68). This obscures the time actually required

to find the reported solution. If the number of runs is not stated, there is no way

to determine the real running time. Even when the number of runs is reported,

multiplying the number of runs by the reported run time would overestimate the

time needed. Actions such as setting up data structures need only be done once

when multiple runs are performed.

70


3.10.6 Use of Averages

Reports of averages should be accompanied at least by a measure of distribution.

Any scaling or normalising of averages should be carefully explained so that raw

averages can be recovered if necessary.

3.11 Random number generators

Several problems can occur with the use of pseudo-random number generators

and differences in numerical precision of machines [79]. These problems can be

identified with replication. Firstly, a faulty implementation of a generator can

introduce a bias in the stream of numbers produced and this can interact with the

algorithm. Treatments should be replicated with more than one random number

generator. Secondly, differences in numerical precision of machines can introduce

biases into an algorithm’s behaviour. Treatments should be replicated with the

same generator and seeds on different machines.

It is difficult to implement a good generator correctly [92]. The source code in

this thesis uses the minimal generator of Park and Miller [92] described in the

literature [99, p. 279]. This is the generator used in the original source code by

Stutzle and Dorigo [47].

3.12 Problem instances and libraries

There are two basic types of test instance: (1) instances from real-world appli-

cations and (2) randomly generated instances. The former are found in libraries

such as TSPLIB [102] or come from private sources. The latter come from instance

generators. A generator is software that, given some parameters, produces a ran-

dom problem instance consistent with those parameters. Real-world data sets are

desirable because the instances automatically represent many of the patterns and

structures inherent in the real world [101]. However, real-world data sets are often

proprietary and may not span all the problem characteristics of interest.

Randomly-generated test instances offer many conveniences.

• Control of problem characteristics [101]. If the problem generator is prop-

erly designed, then the problem characteristics are explicitly under the re-

searcher’s control. This enables the researcher to cover regions of the design

space that may not be well covered by available real-world data or libraries.

This control can be a necessity with the experiment designs in the Design Of

Experiments approach. When problems can be distinguished by some pa-

rameter, these parameters should be treated as independent variables in the

analysis [28, p. 28].

• Replicates [101]. The problem generator can create an unlimited supply of

problem instances. This is particularly valuable in high variance situations

for which statistical methods demand many replicates.

71


• Known optimum [101]. Some problem generators can generate problem in-

stances with a known optimal solution. Knowing the optimum is important

both for bracketing standards (Section 3.5 on page 59) and for the calculation

of some response measures (Section 3.10 on page 68). However, knowing an

optimum may bias an experiment.

• Stress testing [69]. Problem generators can be used to determine the largest

problem size that can be feasibly run on a given machine. This is important

when deciding on ranges of problem sizes to experiment with in the pilot study

phase (Section 3.7 on page 65). Barr et al [7, p. 18] also support this argu-

ment. They state that many factors do not show up on small instances but

do appear on larger instances. Experiments with smaller instances therefore

may not lead to accurate predictions for larger more realistic instances.

A poorly designed generator can lead to misleading unstructured random prob-

lem instances. Johnson [69] refers to Asymmetric TSP papers that report codes

that easily find optimal solutions to generated unstructured problems with sizes

of the order of thousands of cities yet struggle to solve structured instances from

TSPLIB of sizes less than 53 cities.

Online libraries of problem sets, be they real-world or randomly generated,

should be used with caution [101].

• Quality [101]. It is sometimes unclear where a particular instance originated

from and whether the instance actually models a real-world problem. Inclu-

sion in a library generally does not make any guarantees about the quality of

the instance.

• Not Representative [101, 65]. Some instances appearing in publications

may be contrived to illustrate a particular feature of an algorithm or to illus-

trate an algorithm’s pathological behaviour. They are therefore not suitable

as representative instances and may even be misleading.

• Biased [101]. Problem instances are often published precisely because an

algorithm performs well specifically on those instances. The broader issue of

bias is covered in Section 3.14 on page 74.

• Misdirected research focus [101, 65]. The availability of benchmark test

instances can draw researchers into making algorithms perform well on those

instances. As Hooker [65] puts it, ‘the tail wags the dog’ as problems begin to

design algorithms. This changes the context of a study from one of research to

one of development and encourages the premature publication of horse race

studies (Section 3.2 on page 55) before an algorithm is completely understood.

In summary, it would seem that problem generators are a necessity for designed

experiments. It is preferable to have access to a generator rather than relying on

benchmark libraries. Generators that are well-established and tested are prefer-

able to developing one’s own. This thesis uses a generator from a large community

research competition [58].

72


3.13 Stopping criteria

Heuristics can run for impractically long time periods. A stopping criterion is some

condition that causes the heuristic to halt execution. One typically sees several

types of stopping criteria in the heuristics literature. We term these (1) CPU time

stopping criterion, (2) computational count stopping criterion and (3) quality stop-

ping criterion respectively. In the first two types, a heuristic is halted after a given

amount of time or after a given number of computational counts (such as the num-

ber of iterations). In the third type, the heuristic is halted once a given solution

quality (typically the optimum solution) is achieved.

Running experiments with a time stopping criterion has been criticised on the

grounds of reproducibility [69]. A run on a different machine or with a different

implementation will have a distinctly different quality because of differences be-

tween the experimental material (Section 3.8 on page 65). Johnson goes so far

as to state that ‘the definition of an algorithm in this way is not acceptable for a

scientific paper’ [69].

Using a computational count as a stopping criterion is preferred by some au-

thors [69] and is generally the most common type of stopping criterion in the liter-

ature. Furthermore, one can report running time alongside computational count.

This permits other researchers to reproduce the work (using the computational

count) and observe differences in run times caused by their machines and imple-

mentations.

Johnson [69] objects to the use of attaining an optimal value as a stopping

criterion on the grounds that in practice one does not typically run an approximate

algorithm on an instance for which an optimal solution is known. In addition, we

argue that this overemphasises the search for optima when this is not the purpose

of a heuristic.

There is some evidence that the choice of stopping criterion could affect the

appropriate choice of tuning parameter settings for a heuristic. Socha [115] inves-

tigated the influence of the variation of a running time stopping criterion on the

best choice of parameters for the Max-Min Ant System (Section 2.4.5 on page 39)

heuristic applied to the University Course Timetabling Problem. Three levels of a

local search component, ten levels of pheromone evaporation rate, eleven levels of

pheromone lower bound and four levels of fixed run-time were investigated. The

local search levels were not varied with the other two parameters so we do not

know whether these interact. Furthermore, the parameter levels used with the

separate local search investigation were not reported and analyses were performed

on only two instances. The remaining three algorithm parameters were compared

in a full factorial type design on a single instance with 10 replicates. The motiva-

tion for this number of replicates was not mentioned. A fractional factorial design

would have been sufficient to determine an effect due to stopping criterion and

this would have offered huge savings in the number of experiment runs. Despite

this, the work does seem to indicate that different parameter settings are more

appropriate for different run-times of MMAS for one instance of the UCTP. This is

intuitive when one realises that the parameters investigated, pheromone evapora-

73


tion and pheromone lower bound, have an influence on the explore/exploit nature

of the MMAS algorithm. Obviously, exploration is a more sensible strategy when

a greater amount of run-time is available. Pellegrini et al [96] attempt an analysis

of the effect of run-time on solution parameters but this has many flaws that we

discuss in Section 4.3.1 on page 82.

The result of Socha [115] has the following implication for parameter tuning

experiment designs; results are restricted to the specific stopping criterion used.

Either (1) the stopping criterion (and a range of its settings) should be included as a

factor in the experiments or (2) the analyses should be conducted at several levels

of the stopping criterion settings. For example, if a fixed iteration stopping criterion

were used then the number of fixed iterations could be included as a factor or

analyses should be conducted after several different fixed iterations. The former

approach permits the most general conclusions at the cost of greatly increased

experimental resources.

3.14 Interpretive bias

The issue of bias is well recognised in the medical research field [71]. Its dangers

are equally relevant to the heuristics field. Bias is probably unavoidable given the

nature of science.

Good science inevitably embodies a tension between the empiricism of

concrete data and the rationalism of deeply held convictions. Unbiased

interpretation of data is as important as performing rigorous experi-

ments. This evaluative process is never totally objective or completely

independent of scientists’ convictions or theoretical apparatus. [71, p.

1453]

There are several types of bias that can affect the interpretation of results and

we relate these to the heuristics field here.

• Confirmation bias. Researchers evaluate research that supports their prior

beliefs differently from research challenging their convictions. Higher stan-

dards are expected of the research that challenges convictions. This bias is

often unintentional.

• Rescue bias. This bias involves selectively finding faults in an experiment

that contradicts expectations. It is generally a deliberate attempt to evade

evidence.

• Auxiliary hypothesis bias. This is a form of rescue bias in which the original

hypothesis is modified in order to imply that results would have been different

had the experiment been different.

• Mechanism bias. Evidence is more easily accepted when it is supported by

accepted scientific mechanisms.

74


• ‘Time will tell’ bias. Scientific scepticism necessitates a judicious attitude

of requiring more evidence before accepting a result. This bias affects the

amount of such evidence that is deemed necessary.

“A new scientific truth does not triumph by convincing its opponents

and making them see the light, but rather because its opponents

eventually die, and a new generation grows up that is familiar with

it.” Max Planck [97]

• Orientation bias. This reflects a phenomenon of experimental and recording

error being in the direction that supports the hypothesis. This arises in the

pharmaceuticals industry, for example, where trials consistently favour the

new pharmaceutical treatments.

Clearly these biases can affect interpretation of results regardless of the atten-

tion paid to the aforementioned issues.

3.15 Chapter summary

This chapter has covered the following topics.

• Concerns regarding many aspects of experiment design for heuristics have

appeared throughout the heuristics literature over the past three decades.

These concerns have not been addressed in the Ant Colony Optimisation lit-

erature.

• It is important to ask whether the heuristic is even worth researching. The

temptation to invent creative extensions to algorithms or explore new nature-

inspired metaphors can distract us from the real task of producing optimisa-

tion heuristics that produce feasible solutions in acceptable time.

• There are several types of experiment one can conduct. The appropriate type

will depend on the life cycle of the heuristic and the problem domain. Each

type of experiment can answer several types of research question.

• There are clearly defined steps to good design and analysis of experiments.

Nonetheless, there are many potential pitfalls for the analyst and many design

and analysis decisions that must be made and justified.

• Different levels of heuristic instantiation and problem domain abstraction

are appropriate for different types of study and research question. This the-

sis studies a highly instantiated metaheuristic applied to a problem type of

medium abstraction.

• Machines should be benchmarked so that results can be correctly interpreted

by other researchers and can be scaled to different types of experiment mate-

rial (machine architecture, compiler, programming language etc).

75


• A broad notion of reproducibility for empirical research with metaheuristics

states that an experiment is reproducible if others can produce consistent

data that leads to the same conclusions.

• There are many types of performance responses one can measure and report.

• One should exercise caution in the choice of random number generator. It

is difficult to implement a generator well and poor implementations can bias

research results.

• Problem instances can be so-called real-world instances or randomly gener-

ated instances. Both have their advantages and disadvantages. Randomly

generated instances are probably more appropriate when one needs explicit

control of problem instance characteristics. It is difficult to implement a gen-

erator well and so is preferable to use an established and well tested genera-

tor. The are several potential dangers of online libraries of instances, be they

real world instances or randomly generated ones.

• Because heuristics can run for a significant time, continuously improving

their solution, one needs to choose a stopping criterion to halt an experiment.

Stopping criteria are generally based on a computation count, a clock time or

when a predefined solution quality is attained.

• There are several types of interpretive bias that can affect the researcher’s

assessment of results, even from the most rigorously designed experiment.

The next chapter will review experiments on tuning metaheuristics in light of

the concerns summarised in this chapter.

76

4Experimental work

The previous chapter summarised the most important experiment design and anal-

ysis concerns that have been raised in the heuristics literature. The discussion of

these concerns was related to Ant Colony Optimisation (ACO) research at a general

level. This chapter examines research that is relevant to parameter tuning of ACO

in the context of the concerns that have been identified. It begins with a review

of the most significant attempts to analyse problem difficulty for algorithms. The

chapter then continues with approaches to tuning heuristics and metaheuristics

other than ACO. This is necessary because the fields of operations research and

heuristics in general have been better than the ACO field at recognising and ad-

dressing the parameter tuning problem. Lessons can be learned from these fields.

Finally, this chapter addresses parameter tuning approaches for the ACO meta-

heuristic, the focus of this thesis. Of course, parameter tuning should be a major

part of any ACO research effort. It is integral to the effective application of the

heuristic. A comprehensive review of parameter tuning would therefore neces-

sitate reviewing almost all ACO literature. A glance through the ACO literature

should convince the reader that methodical and reproducible parameter tuning of

ACO is rarely addressed, despite its identification as an open research topic [47].

This chapter will therefore limit its scope to papers that have explicitly proposed

and investigated methods for the parameter tuning of ACO.

4.1 Problem difficulty

Some problem instances are more difficult for an algorithm (exact or heuristic) to

solve than other instances. It is critically important to understand which instances

can be expected to be more difficult for a given algorithm. Essentially this involves

investigating which levels of one or more problem characteristics (and combina-

tions of levels) have a significant effect on problem difficulty.

Fischer et al [49] investigated the influence of Euclidean TSP structure on the

77

CHAPTER 4. EXPERIMENTAL WORK

performance of two algorithms, one exact and one heuristic. The exact algorithm

was branch-and-cut [5] and the heuristic was the iterated Lin-Kernighan algorithm

[63]. In particular, the TSP structural characteristic investigated was the distribu-

tion of cities in Euclidean space. The authors varied this distribution by taking a

structured problem instance and applying a perturbation operator to the city dis-

tribution until the instance resembled a randomly distributed problem. There were

two perturbation operators. A reduction operator removed between 1% to 75% of

the cities in the original instance. A shake operator offset cities from their origi-

nal location. Using 16 original instances, 100 perturbed instances were created for

each of 8 levels of the perturbation factor. Performance on perturbed instances was

compared to 100 instances created by uniformly randomly distributing cities in a

square. Predictably, increased perturbation lead to increased solution times that

were closer to the times for a completely random instance of the same size. It was

therefore concluded that random Euclidean TSP instances are relatively hard to

solve compared to structured instances. Unfortunately, it is unavoidable that the

reduction operator confounds changed problem with a reduction in problem size, a

known factor in problem difficulty. Nonetheless, the research of Fischer et al leads

us to suspect that structured instances possess some feature that algorithms can

exploit in their solution whereas completely random instances are lacking that

feature and consequently may be unrealistically difficult. These results tie in with

arguments over the merits of problem instance generators discussed previously

(Section 3.12 on page 71).

Van Hemert [119] evolved TSP instances of a fixed size that were difficult to solve

for two heuristics: Chained Lin-Kernighan and Lin Kernighan with Cluster Com-

pensation. TSP instances of size 100 were created by uniform randomly selecting

100 coordinates from a 400x400 grid. An initial population of such instances was

evolved for each of the algorithms where higher fitness was assigned to instances

that required a greater effort to solve. This effort was a combinatorial count of

the algorithms’ most time-consuming procedure. This is an interesting approach

that side-steps the difficult issues related to CPU time measurement discussed in

Section 3.10.1 on page 68 while still acknowledging the relevance and importance

of CPU time. Van Hemert then analysed the evolved instances using several inter-

esting approaches. His aim was to determine whether the evolutionary procedure

made the instances more difficult to solve and whether that difficulty was specific

to the algorithm. The first approach considered was box plots of the mean, median

and 5 and 95 percentile range. Secondly, the author looked at the frequency with

which each algorithm found an optimal solution in each of the problem sets and

the average discrepancy between the algorithm solution and the known optimum.

The problems with the first of these responses has already been discussed (Sec-

tion 3.10.4 on page 70). The average number of clusters in each set was measured

with a deterministic clustering algorithm. The average distribution of tour seg-

ment lengths was measured for both problem sets as well as the average distance

between pairs of nodes. Finally, to verify whether difficult properties were com-

mon to both algorithms, each algorithm was run on the other algorithm’s evolved

problem set. A set evolved for one algorithm was less difficult for the other al-

78


gorithm. However, the alternative evolved set still required more effort than the

random set indicating that some difficult instance properties were shared by both

evolved problem sets. Van Hemert’s conclusions may have been limited by the lack

of a rigorous experiment design. The approach can be summarised as evolving

instances and then looking for characteristics that might explain any observed dif-

ferences in problem hardness. This offers no control over problem characteristics.

Ideally, one should hypothesise a characteristic that affects hardness and then

test that hypothesis while controlling for all other characteristics. This was exactly

the approach taken in the next piece of research and in this thesis.

Cheeseman et al [26] explored the idea of defining an ‘order parameter’ for NP

instances such that critical values of this parameter describe instances that are

particularly hard to solve. The basic idea is that such a critical value divides the

space of problems into two regions. One region is underconstrained and so has a

high density of solutions. This makes it relatively easy to find a solution. The other

region is overconstrained and so has very few solutions. However, these solutions

typically have very distinct local maxima/minima and so again are relatively easy

to find. The difficult problems occur at the boundary between these two regions

where there are many minima/maxima corresponding to almost complete solu-

tions. In essence, the algorithm is forced to investigate many ‘false leads’. In some

ways, this concept of critical values of an order parameter resembles that of phasetransitions used in statistical mechanics and physics.

Cheeseman et al [26] investigated the presence of these transitions when vari-

ous algorithms were applied to four problems: finding Hamiltonian circuits, graph

colouring, k-satisfyability and the Travelling Salesperson Problem. In the TSP in-

vestigations, three problem sizes of 16, 32 and 48 were investigated. For each

problem size, many instances were generated such that each instance had the

same mean cost but a varying standard deviation of cost. Mean and standard

deviation of edge lengths were controlled by drawing edge lengths from a Log-

Normal distribution. The computational effort for an exact algorithm to solve each

of these instances was measured and plotted against standard deviation of TSP

edge lengths. The plots showed an increase in the magnitude and sharpness of the

phase transition with increasing problem size.

Although conducted only with an exact algorithm on relatively small instance

sizes, this research leads us to expect that edge length standard deviation may

have a significant influence on problem difficulty for other heuristics. This is an

important research question that this thesis will answer for the ACO algorithms.

4.2 Parameter tuning of other metaheuristics

Adenso-Dıaz and Laguna [2] have used a factorial design combined with a local

search procedure to systematically find the best parameter values for a heuristic.

Their method, CALIBRA, was demonstrated on 6 different combinatorial optimi-

sation applications, mostly related to machine scheduling. CALIBRA begins by

finding a set of ‘optimal’ parameter values using the Taguchi methodology [95] in a

79


2k factorial design. The Taguchi methodology is based on a linear assumption that

can lead to large differences between the predicted ‘optimal’ values and the true

optimum. CALIBRA therefore uses this analysis only as a guideline to focus the

search through the parameter space. An iterative local search is then used to im-

prove on the parameter values within a refined region of the parameter space. The

parameter values found by CALIBRA led to better algorithm performance than the

values used by some other authors. In all other situations, the CALIBRA parame-

ter values did not perform significantly better or worse than the parameter values

used by the original authors. A main limitation of CALIBRA is that it can only

tune five algorithm parameters. A more serious limitation is that CALIBRA does

not examine interactions between parameters and so cannot be used in situations

where such interactions might be significant. Later chapters will demonstrate that

interactions are present in ACO tuning parameters and so the more sophisticated

experiment designs of this thesis are required.

Coy et al [33] present a systematic procedure for finding good heuristic param-

eter settings on a range of Vehicle Routing Problems (VRP). This methodology was

applied to two local search heuristics with 6 tuning parameters and a total of 34

VRPs. The new parameters found results that were, on average, 1% to 4% better

than the best known solutions. Broadly, Coy et al’s procedure works by finding

high quality parameter settings for a small number of problems in the problem

set and then combining these settings to achieve a good set of parameters for the

complete problem set. The procedure is as follows:

1. A subset of the problem set is chosen for analysis. This subset should be

representative of the key problem characteristics in the entire set. In their

paper’s case, example key VRP characteristics are demand distribution and

customer distribution.

2. A starting value and range for each parameter is determined. This requires

either a judgement based on previous experience with the heuristic or a pilot

study.

3. A factorial or fractional factorial design is used to determine the parameter

settings. Linear regression gives a linear approximation of the response sur-

face. The path of steepest descent along this surface is calculated, beginning

at the starting point identified from the parameter study.

4. Step 4 is repeated for each problem in the analysis set.

5. The parameter vectors determined in step 4 are averaged to obtain the final

parameter settings for the heuristic over all problem instances.

Coy et al’s approach does not use higher order regression models (such as

quadratic). The author’s chose the simpler linear approach because different re-

sponse surfaces are averaged over all test instances. The authors believed their

approximate approach would not be significantly enhanced by a more complicated

response surface. This comparison was not performed however.

80


In Coy et al’s work, their VRP problems were chosen based on three charac-

teristics, distribution of customers, distribution of demand and problem size. The

decisions were based on a graphical analysis of these characteristics for each in-

stance. In their conclusions, Coy et al acknowledge that the method will perform

poorly in two scenarios:

• if the representative test problems are not chosen correctly or

• if the problem class is so broad that it requires very different parameter set-

tings.

While they recommend creating problem subclasses based on the ‘significant’

problem characteristics, they give no detail of how such significance could be de-

termined. The first of these shortcomings could conceivably be mitigated in an

application scenario by building up a repository of instances to which the tuning

procedure has been applied. The second shortcoming is more troublesome. It is

likely that many problems have instances that require quite different parameter

settings. Coy et al’s method does not build a model of the relationship between

instances, parameter settings and performance. It therefore cannot recommend

parameter settings for varying combinations of problem characteristics. The de-

signs introduced in this thesis can make such recommendations.

Parsons and Johnson [94] used a 24 full factorial replicated design to screen

the best parameter settings for four genetic algorithm parameters applied to a data

search problem (specifically DNA sequencing). Their stopping criterion was a fixed

number of trials. They used a type of sequential experimentation procedure of

running first a half fraction, then the other half fraction and finally the replicates

of the data. Only two parameters were deemed important and so a steepest ascent

approach with these parameters was used to determine the centre point of a central

composite design for building a response surface. This response surface then

allowed the authors to improve the genetic algorithm performance on the tested

data set. Experiments on larger data sets with these parameter settings showed

improvements in both solution quality and computational effort. Unfortunately,

no analysis of the data set characteristics was made so we cannot determine why

parameters tuned on one data set worked so well for larger data sets. The authors

could have halved the number of experiment runs by using a 2IV (4-1) fractional

factorial (Section A.3.2 on page 215) instead of the 24 full factorial.

Analysis of Variance (ANOVA) and response surface models have been used for

parameter tuning on several occasions in the heuristics literature. For example,

Van Breedam [22] attempts to find significant parameters for a genetic algorithm

and a simulated annealing algorithm applied to the vehicle routing problem us-

ing an analysis of variance technique. Seven GA and eight SA parameters are

examined. Park and Kim [91] used a non-linear response surface method to find

parameter settings for a simulated annealing algorithm.

None of these methods have been applied to ACO.

81


4.3 Parameter tuning of ACO

Existing approaches to understanding the relationship between tuning parame-

ter settings, problem characteristics and performance of ACO fall into a three

categories: (1) Analytical Approaches, (2) Automated Approaches and (3) Empir-

ical Approaches. Recent analytical approaches attempt to understand parameters

and recommend parameter settings based on mathematical proof. Automated ap-proaches attempt to use some other algorithm or heuristic to tune the ACO heuris-

tic. The automated approach may use the heuristic itself in a kind of introspective

manner, in which case we term the approach Automated Self-tuning. Alternatively,

Heuristic Tuning uses some other heuristic to search for good parameter settings

for the tuned heuristic. The final category is Empirical approaches. These gather

data about the heuristic and attempt to analyse it to draw conclusions about the

heuristic. Empirical approaches in ACO are either Trial-and-Error or occasionally

the One-Factor-At-a-Time (OFAT) approach. This thesis uses the Design Of Exper-iments approach, bringing the experiment designs and analysis procedures from

DOE to bear on the parameter tuning problem.

4.3.1 Analytical Approaches

Dorigo and Blum [42] provide a survey of analytical results relating to ant colony

optimisation. They acknowledge that the convergence proofs they summarise for

ACO are of little use to a practitioner since the proofs often assume the availability

of either infinite time or infinite space [42, p. 246]. Their discussion is preceded

with a simplification of the transition probabilities of the Construct Solutions phase

(Section 2.4.3 on page 37) so that heuristic information is omitted. Their motiva-

tion for this simplification is to ‘ease the derivations’ [42, p. 256] however this

simplification ignores the reality that heuristic information is well-established as

a highly significant contributor to algorithm performance. The authors also admit

that none of the convergence proofs discussed make reference to the ‘astronomi-

cally large’ time to find the optimal solution [42, p. 260].

Neumann and Witt attempt an analysis of run time and evaporation rate on the

OneMax problem [86] and the LeadingOnes and BinVal problems [41]. An abstract

ACO algorithm is analysed in both cases. However, no comparison with empirical

data is given and so it is impossible to tell whether any of the proofs’ assumptions

have affected their analyses’ application to instantiated ACO algorithms. Despite

this, the authors claim that ‘It is shown that the so-called evaporation factor ρ, the

probably most important parameter in ACO algorithms, has a crucial impact on

the runtime’ [41, p. 34]. These claims are far too general given that their analyses

are at such an early stage and apply only to a single abstract algorithm with no

test of predictions on real instantiated algorithms. These claims will be tested later

in this thesis.

Pellegrini et al [96] attempt an analytical prediction of suitable parameter set-

tings for MMAS for a given run time. The most important parameters are deemed

to be the number of ants, the pheromone evaporation rate and the exponents of the

82


pheromone and heuristic terms. This is a subjective judgement as no screening or

other analysis is done to verify it. Dorigo and Stutzle’s MMAS code [47] was used

and so we should expect results to be consistent with results from this thesis. Ex-

periments were performed without local search and assuming pheromone update

is not time consuming. This assumption is likely incorrect if one examines the code

used. This shows that when no local search is used, pheromone evaporation takes

places on all edges in the problem rather then being limited to the edges in the can-

didate list. This issue was described in Section 2.4.8 on page 44 and a computationlimit parameter was introduced. In general, the reasoning about the parameters is

reported quite vaguely. For example, ‘It is easy to see that the number of iterations

is the leading force, at least until a certain threshold. Nonetheless, the number

of nodes has a remarkable impact as well.’ We have no measure of ‘easy’, leading

force’, ‘a certain threshold’ or ‘remarkable’.

The values recommended by the analysis are then compared to the values rec-

ommended by an automated tuning algorithm, F-Race [14]. Available solution time

was set to six arbitrarily chosen levels varying between 5 and 120 seconds. The

F-Race recommendations were observed to match the predicted trends of the anal-

ysis for the exponent parameters and the number of ants but not the pheromone

decay parameter for just two distinct levels of instance size, 300 and 600. The

failure on prediction of pheromone value may be due to the incorrect assumption

highlighted earlier. Regardless, the results do not confirm the author’s analysis

but rather confirm that the authors’ analysis would appear to agree with some

of the F-Race results. Once again, parameters were treated in isolation and the

unstated assumption is that there is no possibility of many different combinations

of parameter settings achieving the same performance under the constraint of a

fixed solution time. Times were not benchmarked properly so we cannot compare

other tuning procedures to Pellegrini et al’s analysis.

Hooker [64] highlights the failings of theoretical analysis of algorithm perfor-

mance on several fronts.

• Not practical. The results do not usually tell us how an algorithm will per-

form on practical problems.

• Not representative. Complexity results are asymptotic or apply to a worst

case that seldom occurs. Worst case analyses, by definition, do not give an

indication of how a heuristic will perform in more representative scenarios

[101]. Average case results presuppose a probability distribution over ran-

domly generated problems that is typically unreflective of reality.

• Too simplified. Results are usually obtained for the simplest kinds of algo-

rithms and so do not apply to the complex algorithms used in practice.

All the analyses mentioned demonstrate some or all of Hooker’s failings. In gen-

eral, results are still too premature to recommend parameter settings for an ACO

algorithm when presented with a given problem instance and the complete heuris-

tic compromise. Analytical approaches are not ready to address the parameter

tuning problem.

83


4.3.2 Automated Approaches

Self-tuning

Randall [100] has examined how ACS (Section 2.4.6 on page 42) can use its own

mechanisms to tune its parameters at the same time as it is solving TSP and QAP

problems. Four tuning parameters were examined, β, local and global pheromone

update factors ρlocal and ρglobal and the exploration/exploitation threshold q0. The

number of ants m was arbitrarily fixed at 10. Each ant maintained its own pa-

rameter values. A separate pheromone matrix was used to learn new parameter

values. The self-tuning test ACS was compared to a control ACS with fixed param-

eter values taken from another author’s implementation from 7 years previously

on different problem instances [46]. Twelve instances from the TSPLIB (sizes 48 to

442) and QAPLIB (sizes 12 to 64) were used for comparison of the test and con-

trol. Experiments on each instance were repeated 10 times with different random

seeds and were halted after 3000 iterations. The choice of number of replicates

was not justified. Only a single fixed iteration stopping criterion was examined.

Randall claims that this number of iterations ‘should give the ACS solver sufficient

means to adequately explore the search space of these problems.’ [46, p. 378].

No evidence was given to support this claim. Furthermore, because problem size

varies, a fixed iteration will give a less adequate exploration of the search space of

larger instances. Two responses were measured, the percentage relative error of

the solution and the CPU time until the best solution was found. Note that there

is some disagreement over the use of this measure as it does not account for the

3000 iterations that were actually run (Section 3.10.1 on page 68). Responses were

listed in a table with a row for each problem instance. No attempt was made to

qualify the significance of differences between control and test responses with a

statistical test.

Although it is difficult to interpret the results in these circumstances, it seems

that the parameter tuning strategy had little practically significant effect on the

quality of solutions found and therefore is not better than a set of parameter val-

ues recommended by other authors in different circumstances. Part of the diffi-

culty in detecting differences between test and control is that many of the problem

instances were so small as to be solved relatively easily in the set-up phase of ACS,

when a nearest neighbour heuristic is applied to the TSP graph (Section 2.4).

Although the self-tuning approach is intuitively appealing, it has two major defi-

ciencies. Firstly, it offers no understanding of the relationship between parameters

and performance—the algorithm tunes itself and it is hoped that performance im-

proves. An understanding of this relationship necessitates building some model

(analytical or empirical) of the relationship. Of course, the particular application

scenario will determine whether modelling this relationship is advantageous (see

Chapter 1). The second deficiency is related to the first. Without a model, there

is no understanding of the relative importance of tuning parameters. This runs

the risk of wasted resources tuning parameters that actually have no effect on

performance.

84


Heuristic Tuning

Botee and Bonabeau [19] investigated the use of a simple Genetic Algorithm to

tune 12 parameters of a modified version of the Ant Colony System algorithm on

two problems from TSPLIB, Oliver30 and Eil51. The parameters they evolved are

summarised in Table 4.1 on the next page. The ACS equations were modified in

several ways to give the genetic algorithm more flexibility. The trail evaporation

parameter ρ, which was the same for local and global pheromone updates in the

original ACS, was separated into ρlocal and ρglobal for the local and global phe-

romone update equations respectively. A new numerator Q and a new exponent γ

were introduced into the pheromone deposit Equation ( 2.13 on page 43).

τij = (1− ρ) τij + ρQ

(Cchosen ant)γ (4.1)

The chosen ant was always the best so far ant. The ACS implementation was

augmented with 2-opt local search. The number of repetitions was of the form

σ = a · nb where a and b determine how σ scales with problem size n. The original

trail value was set to a small value rather than according to the nearest neighbour

approach advocated in previous literature [46]. The candidate list length was fixed

at 1.

Each colony, characterised by the given parameters, was treated as an indi-

vidual. The population of 40 colonies was randomly generated and run for 100

generations. The GA found a set of parameter values that always found the opti-

mal solution to Oliver30 in fewer ant cycles than ACS. The solution found for Eil51

was comparable to the best known solution.

The modified algorithm, tuned by the GA, found an optimal solution to Oliver30

after 4928 ant cycles (14 ants for 352 iterations averaged over 30 repetitions). The

original algorithm found the optimal solution in 8300 ant cycles (10 ants for 830

iterations). The results for Eil51 are not comparable to the original paper [45]

as this paper used a different but similar problem Eil50. The evolved parameter

values are summarised in Table 4.1 on the next page.

While the reported results are interesting, the general conclusions we can draw

from them are limited. Firstly, the experiments introduced 4 new ACS parameters

at once—the γ exponent and Q numerator in the global pheromone update equa-

tion, and the separation of the pheromone decay parameter into local and global

versions. The use of σ repetitions of a local 2-opt search procedure was also intro-

duced and the candidate list length was removed by fixing it at a value 1. It there-

fore makes it impossible to tell whether the improvements in ant cycles that the

authors report are due to the genetic algorithm’s tuning or the introduced param-

eters or the removed parameter. These factors are confounded (Section A.1.6 on

page 212). The authors then carry all the parameter values from the Oliver30 prob-

lem instance to the Eil51 instance, except for number of ants, which is changed

from 14 to 25. Neither the use of the same parameters or the arbitrary change in

the number of ants is justified. Secondly, while the performance improvement of

40% on Oliver30 seems very large, only two simple problem instances were tested

and the differences in performance between both algorithms were not tested for

85


Symbol Parameter Values

m Number of ants 14

q0 Exploration/exploitationthreshold

0.38

α Influence of pheromonetrails

0.37

β Influence of heuristic 6.68

ρlocal Local pheromone decay 0.30

ρglobal Global pheromone depo-sition

0.31

Q Global pheromone up-date term

78.04

γ Global pheromone up-date term

0.67

τ0 Initial pheromone levels 0.41

a Local search term 5

b Local search term 0.97

Table 4.1: Evolved parameter values for ACS. Results are from Botee and Bonabeau [19, p. 154]applied to Oliver30 and Eil51.

statistical significance. A 40% improvement on such small problems is probably of

little practical significance.

Overall, the approach of tuning ACS with the GA has some other weak points.

The GA is itself a heuristic and so probably needs its own tuning. We have seen that

this is best done with a DOE approach (Section 4.2 on page 79). The authors do not

specify where their GA parameter settings such as population size come from. This

introduces another parameter tuning problem on top of the ACS parameter tuning

problem. Their methodology does not incorporate any screening of parameters.

The GA therefore expends time tuning parameters that may not have any effect on

ACS performance.

Birattari [12] uses algorithms derived from a machine learning technique known

as racing [78] to incrementally tune the parameters of several metaheuristics. Tun-

ing is achieved with a fixed time constraint where the goal is to find the best config-

uration of an algorithm within this time. While the dual problem of finding a given

threshold quality in as short a time as possible is acknowledged, the author does

not pursue the idea of a simultaneous bi-objective optimisation of both time and

quality. Solution times were subsequently investigated by others [38]. Compar-

isons are made between 4 types of racing algorithm and a baseline algorithm that

uses a brute force approach to tuning. These racing algorithms were used to tune

Iterated Local Search for the Quadratic Assignment Problem and Max-Min Ant Sys-

tem for the Travelling Salesperson Problem. Experiments were run on Dorigo and

Stutzle’s code [12, p. 120], the same original source code on which this thesis is

based.

86


4.3.3 Empirical Approaches

One Factor at a Time

Even if the previous self-tuning (Section 4.3.1 on page 83) and heuristic tuning

(Section 4.3.2 on page 85) approaches were successful, they are of limited use

when attempting to understand the all-important relationship between tuning pa-

rameters, problem characteristics and performance. Understanding this relation-

ship requires a model. Building a model involves sampling various points in the

space of parameter settings and problem instances (the design space) and then

measuring performance at those points. When there are many parameters or prob-

lem characteristics, the researcher must confront a vast high-dimensional design

space. One way to tackle this is to use a One-Factor-At-a-Time (OFAT) approach.

OFAT involves fixing the values of all but one of the tuning parameters. The re-

maining parameter is varied until performance is maximised. Another parameter

is chosen to be varied and all other parameter values are fixed. This process con-

tinues one factor at a time until all parameters have been tuned. While OFAT may

occasionally be quick, there are limitations to the conclusions that one may draw

from an OFAT analysis (Section 2.5 on page 47). However this approach occurs in

the literature and so an illustrative case is reviewed here.

Stutzle studied the three modifications to Ant System introduced for Max-Min

Ant System [118, p. 18] with an OFAT approach. Default parameter values were

β = 2, α = 1, m = n, ρ = 0.98 and candidate list lengths were 20. ρ was varied

between 0.7 and 0.99 for two small instances from TSPLIB, KroA100 and d198.

The response measured was quality of solution. Stutzle’s recommendation was a

low value of ρ when a low number of tour constructions are performed and a high

value of ρ when a high number of tour constructions are performed.

The trade-off of initialisation to τmin or τmax was also examined along with the

use of the global or solution best ant for pheromone update. An informal examina-

tion of a table of the differences between the trail initialisations showed a negligible

practical difference in solution quality (0.9% maximum) on small instances of size

51 to 318. However, only a single problem characteristic, problem size, was re-

ported. Interestingly, the difference between trail initialisation methods increased

with problem size but this trend was unfortunately not explored further. A similar

result was obtained for the choice of ant for pheromone update. MMAS was then

compared to Ant System [44], Elitist Ant System [44] and Rank-based Ant System

[24] where the parameter settings for these algorithms are listed but no motivation

of the choice of these parameter settings was given. Comparisons were made on

three small instances of size 51, 100 and 198 with a fixed tours stopping criterion.

The results were listed in absolute terms without a statistical test for significance.

We can express that results table in terms of relative solution quality. Although

MMAS did indeed find the best solutions, we see that the difference from the next

best solution provided by ACS was only 0.6%. It is claimed that since MMAS out-

performs ACS, and ACS was demonstrated to outperform other nature-inspired

algorithms [46], MMAS is therefore competitive with other algorithms. However,

this claim ignores the resources used in tuning these algorithms before their per-

87


formance was measured.

The investigation of the benefits of several variants of MMAS with local search

could have been done differently. Firstly, some of the parameters were inexplica-

bly changed. Specifically, the number of ants was now fixed at m = 25 and the

pheromone decay held constant at ρ = 0.8. Solution CPU time was reported but

without any accompanying benchmarking. The instance sizes were varied from

198 to 1577.

Design Of Experiments

The Design of Experiments (DOE) approach is preferable to OFAT. Surprisingly,

before the publication of results from this thesis [104, 106, 109, 105, 107], DOE

has been almost completely absent from the ACO literature.

Silva and Ramalho [114] give a small summary of the use of DOE techniques

in ACO. However, their categorisation of the techniques is unusual. They include

One-Factor-At-a-Time analysis as a ‘simple’ type of DOE and the general category

of ‘data analysis’ is seen as separate from DOE. Only one reference [52] is listed

as applying DOE but a reading of the reference shows that it does not actually

use DOE. The authors then illustrate the use of a full 2k factorial with 7 factors

and 5 replicates on a single instance of the Sequential Ordering Problem. A single

solution quality response that is independent of CPU time is measured. Normal

plots and residual plots are used to check model quality. The authors then use

what they term the ‘observation method’ to recommend tuning parameter values.

It is not clear what this method is. Non-integers values of α = 0.25 and β = 1.5 are

recommended. This recommendation would actually lead to extremely high CPU

times because of non-integer exponentiation in the ant decisions (Equation ( 2.2 on

page 39)). The authors did not measure the CPU response and so would have been

unaware of this problem. This dramatically supports the argument for recording

CPU time, regardless of the focus of the experiment (Section 3.10.1 on page 68).

Gaertner and Clark [50] attempted to find optimal parameter settings for three

parameters of a single ACO heuristic, Ant Colony System, using a full factorial de-

sign. While there were many flaws in the execution and analysis of their research,

we include it in this section because of its use of a factorial design. Although

the authors identified 6 tuning parameters, α, β, ρ, q0, m and Q, they immedi-

ately argued that all but 3 of these could be omitted from consideration. Firstly,

they claimed that it is sufficient to fix α and only vary β. They claimed that Q is

a constant despite listing it as a parameter. Finally, they claimed that m could

‘reasonably’ be set to the number of cities in the problem. This left only three

parameters that were actually considered, β, ρ and q0. We know from our review

in Section 2.4.9 on page 45 that the number of tuning parameters is actually far

greater for ACS. The authors then partitioned the three parameters β, ρ and q0

into 14, 9 and 11 values respectively. No reasoning was given for this granularity

of partitioning or why the number of partitions varied between parameters. Each

‘treatment’ was run 10 times with a 1000 iteration or optimum found stopping cri-

terion on a single 30 city instance, Oliver30. This resulted in 13,860 experiment

88


runs that took the authors several weeks to execute on a dual 2.2GHz proces-

sor with 2Gb RAM. While the excessive running time may have been due to poor

implementation, the approach was nonetheless incredibly inefficient—a response

surface design (Section A.3.4 on page 218) for 3 factors with a full factorial and

10 replicates would have required approximately 150 runs, 1% of that used by the

authors. It was also expensive although we cannot relate their figures to present

day values because no benchmarking was reported. CPU time was not reported

for the various parameter settings and so the heuristic compromise was ignored.

The authors also make an unfair comparison with other authors’ work, claiming

that they find the optimal solution faster on average without local search. This

claim fails to acknowledge the authors’ significant effort (13,860 runs over several

weeks) to find their conclusion and the use of prior knowledge of an optimum to

stop an experiment once the optimum was found. Section 3.13 on page 73 dis-

cussed the problem with using an optimum as a stopping criterion. The authors

claim that their parameter setting is robust because their empirical search found

the same parameter setting for 3 values of relative error, 0%, 1% and 5%. This is

not a robustness analysis. The authors made no attempt to see how the response

varies when the input parameter is perturbed.

This thesis presents a far more rigorous experimental approach that draws con-

tradictory conclusions to those of Gaertner and Clark, is an order of magnitude

more efficient in terms of experiment runs and deals with all ACS tuning parame-

ters across a space of problem instances rather than a single instance.

4.4 Chapter Summary

This chapter covered the following topics.

• Problem Difficulty. It is critically important to determine the characteris-

tics that affect the difficulty of a problem presented to a heuristic. Without

this, it is impossible to generalise the relationship between instances, tuning

parameters and heuristic performance.

– Some authors have tried manipulating instances and examining the re-

sulting affect on problem difficulty. Others have tried to evolve difficult

instances and then determine why those instances were difficult.

– Some authors have hypothesised that a particular problem characteristic

made an instance difficult and then generated many instances with dif-

ferent levels of that hypothesised characteristic. This approach is prefer-

able since it fits within the more scientific approach of hypothesise and

test. Such an approach has not yet been applied to ACO heuristics for

the TSP.

– Results with exact algorithms for the TSP suggest that the standard de-

viation of edge lengths in the TSP instance has an effect on the difficulty

of the instance.

89


• Parameter Tuning of other heuristics. Other heuristics have been tuned

using basic Design Of Experiments techniques such as factorial and frac-

tional factorial designs.

• Parameter tuning of ACO heuristics.

– Approaches to tuning ACO can be categorised as either (1) Analytical,

(2) Automated or (3) Empirical. Analytical approaches attempt to prove

properties about the parameter-problem-performance relationship using

mathematical proof. Automated approaches use an algorithm to auto-

matically tune the heuristic. This algorithm may be the heuristic itself.

Empirical approaches gather data from actual algorithm runs and at-

tempt to build a model to reason about the data and draw conclusions.

– Automated Tuning with another heuristic, such as a genetic algorithm,

is inefficient for two reasons. Firstly, there is no ability to screen out

parameters that are not effecting performance and so effort is wasted on

tuning potentially ineffective parameters. Secondly, the tuning heuris-

tic does not build up a model of the relationship between parameters,

problem instances and performance of the tuned heuristic. This severely

limits what can be learned from running the tuning procedure.

– Automated Self-tuning involves applying the heuristic’s own optimisa-

tion mechanisms to the heuristic’s tuning parameters. This is intuitively

a sensible approach to parameter tuning. However, it suffers from the

same lack of screening and modelling as the automated tuning approach.

Examples from the literature have been poorly executed experimentally

and so we cannot determine whether this is a viable approach to param-

eter tuning.

– Researchers often use a One-Factor-At-a-Time (OFAT) approach. While

this does give useful insights into the importance and effects of various

parameter settings, the OFAT approach has many recognised deficiencies

for parameter tuning.

– The Design of Experiments approach has been used on two occasions to

recommend ACS parameter settings. There were several flaws with the

execution of the DOE methods however. Furthermore, the authors did

not examine CPU time and so made recommendations of setting exponent

tuning parameters to non-integer values.

This concludes the first part of the thesis. Chapter 2 on page 29 gave a back-

ground on combinatorial optimisation and the Travelling Salesperson Problem.

Metaheuristics were introduced as an approach to finding approximate solutions

to these difficult and important problems and the most important Ant Colony Op-

timisation (ACO) heuristics were described in detail. A review of related work in

Chapter 3 began by collecting and organising the many experiment design and

analysis issues that arise in empirical research with metaheuristics. Several ap-

proaches to determining the problem characteristics that affect performance and

90


to parameter tuning were reviewed in this chapter. However, in light of the con-

cerns raised in Chapter 3, the vast majority of these approaches and their execu-

tion have been deficient in several ways.

The next part of this thesis will address these deficiencies, comprehensively de-

scribing an adapted DOE approach for addressing the parameter tuning problem.

91

Part III

Design Of Experiments for TuningMetaheuristics

93

5Experimental testbed

Before detailing the adapted DOE methodology that this thesis introduces, we must

first address the thesis’ ‘apparatus’. This chapter covers all issues related to the ex-perimental testbed. The experimental testbed in metaheuristic research comprises

three items. These are:

1. the code for the problem generator that creates the test instances that the

algorithms then solve,

2. the code for the metaheuristics on which the experimenter is conducting

research, and

3. the machines that run all experiments in the research.

Of course, either the machines or the problem generators could be the subject of

the research (Section 3.4 on page 58). One often asks whether a problem generator

is appropriate for an algorithm. In an industrial context, one may be concerned

about the machine characteristics that best suit the algorithms and problems. This

chapter deals with these three experimental testbed items in order.

5.1 Problem generator

We have already discussed the difficulty in creating a problem generator and the

arguments for and against the use of problem generators (Section 3.12 on page 71).

Problem generators are a large area of research because of these difficulties and so

are beyond the scope of this research. It is therefore desirable to choose a problem

generator that is acceptable for other researchers. Preferably, the generator will

already have been subjected to extensive use so that any peculiarities with the

generator are more likely to have become known. We have chosen to use a problem

generator provided with the 8th DIMACS Implementation Challenge: The Travelling

95

CHAPTER 5. EXPERIMENTAL TESTBED

Salesman Problem1. The DIMACS challenge was a large competition held within

the TSP research community with the stated goals of:

• creating ‘a reproducible picture of the state of the art in the area of TSP

heuristics (their effectiveness, their robustness, their scalability, etc.), so that

future algorithm designers can quickly tell on their own how their approaches

compare with already existing TSP heuristics’ and

• enabling ‘current researchers to compare their codes with each other, in

hopes of identifying the more effective of the recent algorithmic innovations

that have been proposed. . . ’.

One way the DIMACS challenge facilitated these goals was to provide researchers

with problem generators to generate instances on which their codes could be

tested. Problem generators were provided for several of the possible types of TSP

instance (Section 2.2). In particular, the DIMACS challenge provided a generator

called portmgen to generate symmetric instances with a given number of nodes

where edges lengths were chosen with a uniform random distribution.

Other researchers have used instances with edges drawn from a Log-Normal

distribution (Section 3.1 on page 54) so that the standard deviation of edge lengths

could be treated as a factor and controlled in the experimental sense. This was

shown to have an important effect on problem difficulty for an exact algorithm

[26]. This same factor is investigated later in this thesis (Chapter 7). The Log-

Normal distribution is the probability distribution of any random variable whose

logarithm is normally distributed. There is a good introduction to the Log-Normal

Distribution online2. The distribution has the probability density function:

f (x;µ, σ) =e− ln(x−µ)2/2σ2

xσ√

2π(5.1)

for x > 0 where µ and σ are the mean and standard deviation of the variable’s

logarithm. For our purposes of controlling edge length standard deviation, we note

that relationships can be derived to solve for the Log-Normal parameters µ and σ of

Equation ( 5.1) given a desired expected mean E (x) and expected variance V ar (x)of the resulting distribution.

µ = ln (E (x))− 12

ln

(1 +

V ar (x)E (x)2

)(5.2)

σ2 = ln

(1 +

V ar (x)E (x)2

)(5.3)

For example, if we want a Log-Normal distribution with a certain standard de-

viation and certain mean, these equations will tell us what values of parameters µ

and σ to use when creating our distribution.

1 http://www.research.att.com/∼dsj/chtsp/2 http://en.wikipedia.org/w/index.php?title=Log-normal distribution&oldid=136064053

96


For this research, the DIMACS portmgen generator [58] was ported to Java

and refactored into an Object-Oriented implementation we call Jportmgen. The

generator’s behaviour was preserved during the porting using unit tests. The DI-

MACS portmgen and the thesis’ Jportmgen produced identical instances for a given

pseudo-random generator seed. The code was then modified such that chosen edge

lengths exhibited a Log-Normal distribution with a desired mean and standard de-

viation, as per Cheeseman et al [26]. This new implementation therefore allows

the experimenter to control problem size, edge length mean and edge length stan-

dard deviation while remaining true to the DIMACS generator accepted for the TSP

community’s largest research project. Different distributions can be plugged into

the generator, including the original DIMACS uniform distribution.

Although Cheeseman et al [26] did not state their motivation for using the

Log-Normal distribution, a plot of the relative frequencies of the normalised edge

lengths of Euclidean instances from the online benchmark library, TSPLIB [102],

shows that the majority have a Log-Normal shape (Appendix B). Figure 5.1 shows

relative frequencies of the normalised edge lengths of several instances created by

Jportmgen.

0

0.5

1

0 0.2 0.4 0.6 0.8Normalised Edge Lengths

Nor

mal

ised

Rel

ativ

e Fr

eque

ncy

Mean 100, StDev 70Mean 100, StDev 30Mean 100, StDev 10

Figure 5.1: Relative frequencies of normalised edge lengths for several TSP instances of the samesize and same mean cost. Instances are distinguished by their standard deviation. All instancesdemonstrate the characteristic Log-Normal shape.

Unless otherwise stated, all future references to the problem generator will refer

to the Jportmgen generator. All generated instances in the thesis are created with

this Log-Normal version of the DIMACS portmgen.

5.2 Algorithm implementation

The next important aspect of the testbed is the algorithm implementation. Repro-

ducible algorithm implementations are both extremely important and yet difficult

to achieve (Section 3.8 on page 65). The best way to overcome these issues is to

provide the source code on which all experiments are conducted. Furthermore, in

the interest of advancing the field, research and its results should be both consis-

tent with previous research and extensible in future research by others. Meeting

these basic demands of a scientific field requires the community’s adoption of a

97


standard implementation of its ACO algorithms. The closest thing to a standard

implementation for ACO algorithms is the C code written by Stutzle and Dorigo for

their definitive book on the field [47]. This was made available to the community on

the world wide web3 and is recommended for experiments with ACO [42, p. 275].

For the reasons mentioned above (reproducibility, relevance to previous re-

search and extensibility in future research) we made the decision to use the ACOTSP

code of Stutzle and Dorigo. ACOTSP was a procedural C implementation. We

translated ACOTSP into a Java implementation that we will refer to henceforth as

JACOTSP. The Java implementation now benefits from the usual Object-Oriented

advantages4, in particular its extensibility. The class hierarchy in JACOTSP en-

sures that algorithm subclasses share the same data structures and differ only

in the implementation details of individual methods. The Template design pat-

tern [72] proves particularly useful in this regard. The Delegator pattern allows

different termination conditions, for example, to be ‘plugged in’ to the algorithms

without disrupting the rest of their structure. JACOTSP runs on symmetric TSP

problems using 6 ACO algorithms namely Ant System, Ant Colony System, Rank-

based Ant System, Elitist Ant System, Best-Worst Ant System and Max-Min Ant

System. One may question the impact of Java on computation times when com-

pared to the original ACOTSP C implementation. While the early releases of Java

were indeed slow, subsequent releases addressed this issue in the context of sci-

entific computing [23]. Java is now an acceptable choice for scientific computing

according to performance benchmarks5 and is used for high performance scientific

computing in laboratories such as CERN 6. To focus too closely on relative running

times would be to miss the aim of the thesis which is to demonstrate effective tun-

ing of a heuristic resulting in improvements in solution quality and solution time.

As discussed in Chapter 3, these solution times will always be dependent on the

particular implementation details, regardless of the programming language used.

The random number generator used in ACOTSP and ported to JACOTSP is the

Minimal Random Number Generator of Park and Miller [92]. Its implementation

and a discussion of its merits can be found in the literature [99, p. 278-279]. The

reimplementation of the random number generator, in particular, ensures that JA-

COTSP produces the same behaviour (and ultimately the same solutions) as its

ACOTSP predecessor. This backwards compatibility was ensured with unit tests

that compare the output files of ACOTSP with those of JACOTSP for a variety of

input parameters. Such compatibility does not make sense when new tuning pa-

rameters are identified or when aspects of the algorithm’s internal design are pa-

rameterised and varied. Breaking such compatibility is inevitable if the ACO algo-

rithms are to evolve. Given that the random number generator is well-established

and that backwards compatibility was important, we did not investigate alternative

generators as is sometimes advisable (Section 3.11 on page 71).

3 ACOTSP, available at http://iridia.ulb.ac.be/∼mdorigo/ACO/aco-code/public-software.html4In general, the use of an object model leads to systems with the following attributes of well-

structured complex systems: abstraction, encapsulation, modularity and hierarchy [18].5http://shootout.alioth.debian.org/6http://dsd.lbl.gov/˜hoschek/colt/

98

http://iridia.ulb.ac.be/~mdorigo/ACO/aco-code/public-software.html

http://shootout.alioth.debian.org/

http://dsd.lbl.gov/~hoschek/colt/


The original ACOTSP contained a timer that reported the CPU time for which

ACOTSP was running. A slightly different approach was taken in JACOTSP be-

cause accessing CPU times in Java was problematic in the Java version in which

the JACOTSP project was started. Newer versions have since overcome this. For

this reason, the decision was taken to use a timer supplied with the Colt project7.

Colt is a set of high performance scientific computing libraries used at the CERN

labs. JACOTSP therefore measures elapsed time rather than CPU time. The timer

was paused during the calculation and output of data that is not essential to the

functioning of the JACOTSP ant algorithms. For example, branching factor cal-

culation is not timed for ACS but is timed for MMAS because it is used in trail

reinitialisation. Any concerns over the interruption of the timer by other operating

system processes are easily allayed by randomising experiment running orders.

Unless otherwise stated, the times reported in this thesis’ case studies are elapsed

times rather than CPU times.

5.2.1 Profiling

Exponentiation is a mathematical operation, written an, involving two numbers,

the base a and the exponent n. When n is a whole number (an integer), the expo-

nentiation operation corresponds to repeated multiplication. However, when the

exponent is a real number (say 1.73) a different approach to calculation is re-

quired and this approach is computationally very expensive. In Java, the language

of the JACOTSP implementation, the natural logarithm method is used for real ex-

ponents. The details of this method are beyond the scope of this discussion. A

simple profiling of the JACOTSP code showed that real exponent values caused

the vast majority of computational effort to be expended on exponentiation. Re-

call from the design of ACO (Section 2.4.3 on page 37) that two exponentiations

are involved in every ant movement decision (see equation ( 2.2 on page 39) for

example).

The implication of this and the common knowledge of the expense of exponenti-

ation is that tuning parameters that are exponents should be limited to be integer

values only. However, we have seen at least one case in the literature [114] where

authors looking only at solution quality and not recording CPU time actually rec-

ommend non-integer values of these exponents (Section 4.3.3 on page 88). Any

gain in quality from using a non-integer α and β will most likely be offset by the

huge deterioration in solution time. This is further evidence to support the recom-

mendation of measuring CPU time (Section 3.10.1 on page 68) and for this thesis’

emphasis on the heuristic compromise.

7 http://dsd.lbl.gov/˜hoschek/colt/index.html

99

http://dsd.lbl.gov/~hoschek/colt/index.html


5.3 Benchmarking the machines

The remaining aspect of our experimental testbed is the physical machines on

which all experiments are conducted. Clearly, all machines can differ widely.

There are differences in processor speeds, memory sizes, chip types, operating

systems, operating system versions and, in the case of Java, different versions of

different virtual machines. Even if machines are identical in terms of all of these

aspects, they may still differ in terms of running background processes such as

virus checkers. This is the unfortunate reality of the majority of computational re-

search environments. Furthermore, such differences will almost certainly occur in

the computational resources of other researchers who attempt to reproduce or ex-

tend previous work of others. Ultimately, such differences mean that experiments

that are identical in terms of the two previous testbed issues of algorithm code

and problem instances will still differ when run on supposedly identical machines.

These differences necessitate the benchmarking of the experimental testbed (Sec-

tion 3.9 on page 67).

Reproducibility of results (Section 3.8 on page 65) is a second important mo-

tivation for benchmarking. Other researchers can reproduce the benchmarking

process on their own experimental machines. They can thus better interpret the

CPU times reported in this research by scaling them in relation to their own bench-

marking results. This mitigates the decline in relevance of reported CPU times with

inevitable improvements in technology. It is hoped that the benchmarking advo-

cated in this thesis becomes commonplace in reported ACO and metaheuristics

research.

5.3.1 Benchmarking method

The clear and simple benchmarking procedure of the DIMACS [58] challenge is

applied here and its results described below.

1. A set of TSP instances is generated with one of the DIMACS problem gener-

ators. These instances range in size from one thousand nodes to one million

nodes.

2. The DIMACS greedy search, a deterministic algorithm, is applied to each in-

stance for a given number of repetitions and the total time for all repetitions

is recorded. The number of repetitions varies inversely with the size of the

instance. For example, the instance of size 1 million is solved only once by

the greedy search while the instance of size 1 thousand is solved 1 thousand

times.

5.3.2 Results and discussion

The results of the DIMACS benchmarking of our experimental machines are illus-

trated in Figure 5.2 on the facing page and the corresponding data are presented

in Figure 5.3 on the next page.

100


DIMACS Benchmarking of experiment machines

0.00

10.00

20.00

30.00

40.00

50.00

E1k.0 E3k.0 E10k.0 E31k.0 E100k.0 E316k.0 E1M.0

Instance

Tim

e (s

)

116253111156188136

Figure 5.2: Results of the DIMACS benchmarking of the experiment testbed.

26/09/2007benchmarking.xls

Instance Size Repetitions 116 253 111 156 188 136 96E1k.0 1000 1000 5.45 4.38 5.25 5.00 3.31 4.81 3.37E3k.0 3000 316 5.67 4.61 7.61 5.25 3.78 5.23 3.75E10k.0 10000 100 6.91 7.25 8.81 6.44 4.99 6.64 5.32E31k.0 31000 32 11.77 16.52 13.41 11.20 9.48 11.00 10.84E100k.0 100000 10 22.86 26.53 26.77 21.03 10.87 19.85 12.82E316k.0 316000 3 28.61 32.05 34.50 27.61 12.56 25.86 14.70E1M.0 1000000 1 39.52 44.80 49.23 38.03 16.55 35.44 19.31

Time (s)

Figure 5.3: Data from the DIMACS benchmarking of the experiment testbed.

101


The horizontal axis represents the different instances for which the benchmark-

ing was conducted where instances are arranged in order of increasing size. The

vertical axis is the total time in seconds for the benchmarking run. Each bar rep-

resents a different machine, identified by a machine ID. It is evident that every

machine’s benchmark time increases with instance size. The differences between

machines become more pronounced as instance size increases. In general, ma-

chine 188 is always fastest.

The benchmarking times indicate that despite the similarity of the specification

of most of the machines, there are still differences in CPU times and these differ-

ences seem to amplify in larger instances. The benchmarking has thus identified a

nuisance factor in the experimental testbed and the need to randomise experiment

runs across the experiment testbed. Efforts to use completely identical machines

for all experiments will still encounter this nuisance factor. Any successful perfor-

mance analysis methodology will have to cope with this reality.

5.4 Chapter summary

This chapter covered the following:

• Decision to use a publicly available problem generator. There are concerns

over the difficulties of developing reliable problem generators. This research

uses a publicly available problem generator that was also used in the DIMACS

challenge, a large research competition within the TSP community.

• Modified generator to control problem characteristic. Problem instance

standard deviation of edge lengths may be an important problem character-

istic affecting problem difficulty. To this end, a modified DIMACS problem

generator draws its edge lengths such that they exhibit a Log-Normal dis-

tribution with a desired mean and standard deviation. The choice of this

distribution is in keeping with previous work by Cheeseman et al [26].

• OOP implementation of publicly available algorithm source code. There

is a suite of publicly available C code of the main ACO algorithms to accom-

pany the field’s main book [47]. We have reimplemented this code in Java

and refactored it into an extensible Object-Oriented (OO) implementation.

Our java code continues to reproduce the solutions of the original C code

by Dorigo and Stutzle. Results from this research are therefore applicable to

other research that has used their publicly available C code.

• Highlighting the computationally expensive exponentiation calculation.We saw from our overviews of the ACO metaheuristic (Section 2.4.3 on page 37)

that all involve a very large number of exponentiation calculations in which

the tuning parameters α and β are the exponents. It is well known that ex-

ponentiation with non-integer exponents is computationally very expensive.

This was highlighted in our profiling of the code but is often missed by re-

searchers who ignore the heuristic compromise and do not record CPU times.

Tuning of the α and β parameters will therefore be restricted to integer values.

102


• Benchmarked experimental machines. The DIMACS benchmarking ap-

proach was applied to all machines used during the course of this research.

The emphasis of the thesis research is not to compare algorithms in a ‘horse-

race’ study (Section 3.4.4 on page 59). Benchmarking the machines benefits

future research in that CPU times reported in the thesis can be interpreted

and scaled by other researchers using the same accepted benchmarking ap-

proach, regardless of the inevitable differences in their experimental testbed.

103

6Methodology

Chapter 3 brought together high-level concerns spanning all aspects of Design

Of Experiments (DOE) with particular emphasis on DOE for metaheuristics. The

reader is referred to Appendix A for a background on DOE. This chapter focuses

on a sequential experimentation methodology that deals with these concerns. The

methodology efficiently takes the experimenter from the initial situation of almost

no knowledge of the metaheuristic’s behaviour to the desired situation of a mod-

elled metaheuristic with recommendations on tuning parameter settings for given

problem characteristics. The methodology is based on a well-established proce-

dure from DOE that this thesis modifies for its application to metaheuristics. It

was first introduced to the ACO community only recently by the author [104]. The

design generation and statistical calculations can be performed with most modern

statistical software packages including SPSS, NCSS PASS, Minitab and Microsoft

Excel. Examples of the methodology’s successful application to the parameter tun-ing problem are presented in the chapters in Part IV of the thesis. This chapter

begins with a relatively high-level overview of the whole sequential experimenta-

tion methodology before detailing all its stages and decisions.

6.1 Sequential experimentation

Experimentation for process modelling and process improvement is inherently it-

erative. This is no different when the studied process is a metaheuristic. Box

[20] gives some sample questions that often arise after an experiment has been

conducted.

“That factor doesn’t seem to be doing anything. Wouldn’t it have been

better if you had included this other variable?”

“You don’t seem to have varied that factor over a wide enough range.”

105

CHAPTER 6. METHODOLOGY

“’The experiments with high factor A and high factor B seem to give the

best results; it’s a pity you didn’t experiment with these factors at even

higher levels.”

In sequential experimentation, there are six directions in which a subsequent

experiment commonly moves [20]. These depend on the results from the first

experiment.

1. Move to a new location in the design space because the initial results suggest

a trend that is worth pursuing.

2. Stay at the current location in the design space and add further treatments

to the design to resolve ambiguities that may exist in the design. Such ambi-

guities are typically due to effects that cannot be separated from one another

due to the nature of the experiment design. Such ‘entanglement’ of effects is

termed aliasing (Section A.3 on page 214).

3. Rescale the design if it appears that certain variables have not been scaled

over wide enough ranges.

4. Remove or add factors to the experiment.

5. Repeat some runs to better estimate the replication error.

6. Augment the design to assess the curvature of the response. This is particu-

larly important when large two-factor interactions occur between factors.

These questions and the decisions for subsequent experiments are part of a

larger sequential experimentation methodology. This ‘bigger picture’ is illustrated

in Figure 6.1 on the next page.

The main advantage of the sequential experimentation methodology is its effi-

ciency of resources. It is rare that the experimenter begins with a full knowledge

of the metaheuristic (hence the point of experimentation). A revision of experiment

design decisions is therefore inevitable as more is learned about the metaheuris-

tic. The sequential methodology avoids the risky and oft unsuccessful approach of

running a large all-encompassing expensive experiment up front. Instead, many of

the designs and their existing data are incorporated into subsequent experiments

so that no experimental resources go to waste. Factors are carefully examined

before a decision on their inclusion in an experiment. Calculations of statistical

power reveal when a sufficient number of replicates have been gathered.

We now describe each of the stages in this sequential experimentation method-

ology of this thesis with reference to modelling and tuning the performance of a

metaheuristic.

6.2 Stage 1a: Determining important problem character-

istics

The first stage in the sequential experimentation approach is to determine all prob-

lem characteristics that affect at least one of the responses of interest. Without

106


Heuristic tuningparameters

Known & suspectedproblem characteristics

screening

runs to confirmmodel1.

Scr

eeni

ng augment designwith foldover

estimate main effects andinteractions

curvature?

augment design with axialpoints or new design space

Response Surface Methods

runs to confirmmodel

2. M

odel

ling

Numerical Optimisation

3. T

unin

g

None

Stage

runs to confirmresults

4. E

valu

atio

n

Overlay plots for optimalparameter settings

Figure 6.1: The sequential experimentation methodology. The methodology covers four main stages,screening, modelling, tuning and evaluation of results.

107


sufficient problem characteristics, the response surface models from later in the

procedure will not make good predictions of performance on new instances.

6.2.1 Experiment Design

The main difficulty encountered in attempting to experiment with problem instance

characteristics is that of the uniqueness of instances. That is, while several in-

stances may have the same characteristic that is hypothesised to affect the re-

sponse, these instances are nonetheless unique. For example, there is a poten-

tially infinite number of possible instances that all have the same characteristic

of problem size. The uniqueness of instances will therefore cause different values

of the response despite the instances having identical levels of the hypothesised

characteristic. The experimenter’s difficulty is one of separating the effect (if any)

due to the hypothesised characteristic from the unavoidable variability between

unique instances.

A given heuristic encounters instances with different levels of some character-

istic. The experimenter wishes to determine whether there is a significant overall

difference in heuristic performance response for different levels of this character-

istic. The experimenter also wishes to determine whether there is a significant

variability in the response when unique instances have the same level of the prob-

lem characteristic.

There is a well-established experiment design to overcome this difficulty. It is

termed a two-stage nested (or hierarchical) design. Figure 6.2 illustrates the two-

stage nested design schematically.

1

1 2 3

2

4 5 6

Problemcharacteristic

Instance

Observations

y111y112

y11r

y121y122

y12r

y131y132

y13r

y241y242

y24r

y251y252

y25r

y261y262

y26r

Figure 6.2: Schematic for the Two-Stage Nested Design with r replicates. (adapted from [84]). There areonly two levels of the parent problem characteristic factor. Note the instance numbering to emphasisethe uniqueness of instances within a given level of the problem characteristic.

Note that this thesis applies the two-stage nested design to the heuristics re-

searcher’s aim of determining whether a problem characteristic merits inclusion in

the sequential experimentation methodology. This design cannot capture possible

interactions between more than one problem characteristic. Capturing such inter-

actions would require a more complicated crossed nested design or the factorial

designs encountered later in the sequential experimentation procedure. The goal

here is to quickly determine whether the hypothesised characteristic should be

included in subsequent experiments in the sequential methodology. This design,

first introduced into ACO research by the author [105, 110] is now receiving wider

108


attention and theoretical backing in the community [6].

6.2.2 Method

This is an overview of the method for determining whether a problem character-

istic should be included in subsequent stages of the methodology. A case study

illustrates this method with real data in Chapter 7.

1. Responses variables. Identify the response(s) to be measured. For experi-

ments with metaheuristics, these must reflect some measure of solution qual-

ity and some measure of solution time.

2. Design factors and factor ranges. Choose the problem characteristic hy-

pothesised to affect the response of interest and the range over which that

factor will be varied. The null hypothesis is that changes in this charac-

teristic cause no significant change in the metaheuristic performance. The

alternative hypothesis is that changes in the characteristic do indeed cause

changes in the performance.

3. Held-constant factors. Fix all tuning parameters and all other problem char-

acteristics at some values that remain fixed for the duration of the experiment.

These values may be values commonly encountered in the literature, values

in the middle of the range of interest or values determined to be of interest by

pilot studies. As noted above, this does not permit an examination of interac-

tions between the hypothesised characteristic and both other characteristics

and the metaheuristic’s tuning parameters. These interactions are examined

later in the sequential methodology.

4. Experiment design. Generate a 2-stage nested design where the hypothe-

sised characteristic is the parent factor (Factor A) and the unique instances

(Factor B) are nested within a given level of this parent. This design is il-

lustrated schematically in Figure 6.2 on the preceding page. Preferably, there

should be at least three levels of the parent factor so that it can be determined

whether the hypothesised effect of the problem characteristic is linearly re-

lated to the performance metric (Section 3.5 on page 59). We can have as

many instances nested within each parent level as we like. The number of in-

stances must be the same within each level of the parent. A treatment is then

a run of the metaheuristic on an instance within a given level of the problem

characteristic.

5. Replicates and randomise. Replicate each treatment and randomise the run

order. Collect the data in the randomised run order.

6. Analysis. The data can now be analysed with the General Linear Model. The

statistical technicalities behind this analysis have recently been explained in

the context of metaheuristics research [6]. These translate into the following

settings in most statistical software:

109


• Factor A is entered into the model as the parent. Factor B is nested

within Factor A.

• Factor B is set as a random factor since the unique instances are ran-

domly generated.

• Factor A is a fixed factor because its levels were chosen by the experi-

menter.

7. Diagnostics. The usual diagnostic plots (Section A.4.2 on page 221) are ex-

amined to verify that the assumptions for the application of ANOVA have not

been violated.

8. Response Transformation. A transformation of the response may be needed

(Section A.4.3 on page 222). If so, the experimenter returns to step 5 and

reanalyses the transformed response.

9. Outliers. If outliers are identified in the data, these are deleted and the

experimenter returns to step 5.

10. Power. The gathered data are analysed to determine the statistical power. If

insufficient power has been reached for the study’s level of significance then

further replicates are added and the experimenter returns to the Replicates

and randomise step.

11. Interpretation. When satisfied with the model diagnostics, the ANOVA table

from the analysis can be interpreted.

12. Visualisation. The box plot is also useful for visualising the practical signifi-

cance of any statistically significant effects.

This approach can be repeated for any problem characteristic that is hypoth-

esised to affect performance. Once satisfied with a set of characteristics, the ex-

perimenter proceeds to do a larger scale screening experiment with all the tuning

parameters and all the relevant problem characteristics. Of course, if the set of

problem characteristics turns out to be insufficient, the experimenter returns to

this stage of the overall sequential experimentation methodology.

6.3 Stage 1b: Screening

The next stage in the sequential experimentation methodology is screening. Screen-

ing aims to determine which factors have a statistically significant effect and prac-

tically significant effect on each response as well as the relative size of these ef-

fects. Chapters 8 on page 143 and 10 on page 169 present detailed case studies

of screening. The experiment design and methodology presented here were first

introduced to the ACO community by the author [104, 107, 108]. A similar DOE

screening methodology was applied to Particle Swarm Optimisation a year later

[74].

110


6.3.1 Motivation

A detailed motivation for screening heuristic tuning parameters and problem char-

acteristics has been covered in Chapter 1. It is important to screen heuristic tuning

parameters and problem characteristics for several reasons. We learn which pa-

rameters have no effect on performance and this saves experimental resources in

subsequent performance modelling experiments. It also improves the efficiency of

other heuristic tuning methods (Section 4.3.2 on page 85) by reducing the search

space these methods must examine. This was already discussed on several occa-

sions as one of the major advantages of the DOE approach to parameter tuning

over alternatives like automated tuning (Section 4.3.2 on page 84). Screening ex-

periments also provide a ranking of the importance of the tuning parameters. In a

case of limited resources, this arms the experimenter with knowledge of the most

important factors to examine and those parameters that one may afford to treat

as held-constant factors. Finally, screening is a useful design tool. Alternative

new heuristic features can be ‘parameterised’ into the heuristic and a screening

analysis will reveal whether these new features have any significant effect on per-

formance.

6.3.2 Research questions

The research questions in any heuristic screening study can be phrased as follows.

Screening. Which of the given set of heuristic tuning parameters and

problem characteristics have an effect on heuristic performance in terms

of solution quality and solution time?

Ranking. Of the tuning parameters and problem characteristics that

affect heuristic performance, what is the relative importance of each in

terms of solution quality and solution time?

Adequacy of a Linear Model. Is a linear model of the responses ade-

quate to predict performance or is a higher order model required?

These research questions lead to a potentially large number of hypotheses. It

would not contribute to this case study to exhaustively list all here. Some illus-

trative examples follow. These examples can be specified for any tuning parameter

or problem characteristic and must be analysed for all performance metrics in the

study. A screening hypothesis would look like the following.

• Null Hypothesis H0: the tuning parameter A has a significant affect on per-

formance measure X.

• Alternative Hypothesis H1: the tuning parameter A has no significant affect

on performance measure X.

A ranking hypothesis would look like the following.

• Null Hypothesis H0: the tuning parameters A and B have an equally important

effect on performance measure X.

111


• Alternative Hypothesis H1: the tuning parameter A has a stronger effect than

tuning parameter B on performance measure X.


Full factorial designs are expensive in terms of experimental resources and pro-

vide more information than is needed in a screening experiment (Section A.3.2

on page 215). For screening purposes it is sufficient to use a fractional factorial(FF). The minimum appropriate fractional factorial design resolution for screening

is a resolution IV since no main effects are aliased but a resolution V design is

preferable when possible since this provides information on unaliased second or-

der effects. Fractional factorials, aliasing and resolution are discussed in more

detail in Appendix A.2.

6.3.4 Method

The methodology for factor screening is described below.




2. Design factors and factor ranges. Choose the algorithm tuning parameters

and problem characteristics that will be screened as well as the ranges over

which these factors and problem characteristics will vary. Sometimes factors

have a restricted range due to their nature. Alternatively, factors may have

an open range. In either case, a pilot study (Section 3.7 on page 65) may be

required to determine sensible factor ranges.

3. Held constant factors and values. Because of limited resources, it may not

be possible to experiment with all potential design factors. A pilot study may

have revealed that some factors have a negligible effect on performance. These

factors must be held at a constant value for the duration of the experiments.

4. Experimental Design. For the given number of factors, choose an appro-

priate fractional factorial design. If the number of factors and resource lim-

itations prevent the use of a resolution V design then examine the available

resolution IV designs for the given number of factors. Where there are sev-

eral available designs of resolution IV, check whether the design requiring a

smaller number of treatments has a satisfactory aliasing structure. Note that

any resolution IV design will have aliased two-factor interactions. However,

knowledge of the system and its most likely interactions may make some of

these aliases negligible. Furthermore, judicious assignment of factors will re-

sult in the most important factors having the least aliasing in the generated

design.

The choice of design will fix the number and value of the factor levels. The

chart provided in Section A.3.2 on page 215 can help in making a decision

between resolution, number of runs and aliasing.

112


5. Run Order. This is now the minimum design with a single replicate. Generate

a random run order for the design.

6. Gather Data. Collect data according to the treatments and their random run

order.

7. Significance level, effect size and replicates. Choose an appropriate sig-

nificance (alpha) level for the study. Choose an appropriate effect size that

the screening must detect. For the study’s chosen significance level (5% in

this thesis), examine the design’s power to detect the chosen effect size. If the

power is not greater than 80%, add replicates to the design and return to the

Run Order step to gather the extra data. Sufficient data has been gathered

when power reaches 80%.

At this stage, sufficient data has been gathered with which to build a model of

each response. The steps in this model building stage of the methodology are the

subject of the next section.

6.3.5 Build a Model

The following model building steps must be repeated for each of the responses

separately. Before this however, it is necessary to check that none of the responses

are highly correlated. Only one response in a set of highly correlated responses

needs to be analysed. A scatter plot of each response against each other response

visually demonstrates correlation. Recall that two solution quality responses are

measured in this research (Section 3.10.2 on page 69). These responses are of

course highly correlated. However, both responses are always analysed separately.

It is an open question for the ACO field as to which solution quality response is the

more appropriate and we wished to examine whether the conclusions using one

would differ from the conclusions using the other.

1. Find important effects. Various techniques such as Half-normal plots can

be used to identify the most important effects that should be included in a

model of the data. This thesis uses backwards regression (Section A.4.1 on

page 220) with an alpha out value of 0.1.

2. ANOVA test. The result of the backwards regression is an ANOVA on the

model containing these most important effects.

3. Diagnosis. The usual diagnostic tools (Section A.4.2 on page 221) are used

to verify that the ANOVA model is correct and that the ANOVA assumptions

have not been violated.

4. Response Transformation. The diagnostics may reveal that a transforma-

tion of the response is required. In this case, perform the transformation and

return to the Find Important Effects step.

5. Outliers. If the transformed response is still failing the diagnostics, it may be

that there are outliers in the data. These should be identified and removed

before returning to the Find Important Effects step.

113


6. Model significance. Check that the overall model is significant.

7. Model Fit. Check that the predicted R-Squared value is in reasonable agree-

ment with the Adjusted R-Squared value and that both of these are close to

1. Check that the model has a signal to noise ratio greater than about 4.

At this stage in the procedure, it may be necessary to augment the design de-

pending on whether a resolution IV or resolution V design had been used and

depending on whether aliased effects are statistically significant. This is one of

the common iterative experimentation situations identified earlier (Section 6.1 on

page 105).

6.3.6 Augment model

In a resolution IV design, some second order effects will be aliased with other ef-

fects. If the experimenter deems these effects to be important then the design must

be augmented with additional treatments so that the aliasing of these effects can

be removed. There is a methodical approach to fractional factorial augmentation

that is termed foldover. The details of foldover are beyond the scope of this thesis

but are covered in the literature [84]. In essence, foldover is a way to add specific

treatments to a design such that a target effect will no longer be aliased in the new

augmented design. If foldover is performed, the new design can be analysed as per

Sections 6.3.4 on page 112 and 6.3.5 on the previous page.

At this point, we have reduced models for each response that pass the diagnos-

tics and in which no important effects are aliased. However, diagnostics involve

some subjective judgements. While the ANOVA procedure can be robust to slight

violations of these diagnostics, it is still good practice to independently confirm the

models’ accuracy on some real data.

6.3.7 Confirmation

Before drawing any conclusions from a model, it is important to confirm that the

model is sufficiently accurate. As in traditional DOE, confirmation is achieved

by running experiments at new randomly chosen points in the design space and

comparing the actual data to the model’s predictions. Confirmation is not a new

rigorous experiment and analysis in itself but rather a quick informal check. In the

case of a heuristic, these randomly chosen points in the design space equate to new

problem instances and new randomly chosen combinations of tuning parameters.

The methodology that this thesis proposes is as follows:

1. Treatments. A number of treatments are chosen where a treatment consists

of a new problem instance and a new set of tuning parameter values with

which the instance will be solved.

2. Generate Instances. The required problem instances are generated.

3. Select Tuning Parameters. It is important to remember that the screening

design uses only high and low values of the factors. It therefore can only pro-

duce a linear model of the response. This will not be an accurate predictor of

114


the response within the centre of the design space if the response actually ex-

hibits curvature. However, the model should still be accurate near the edges

of the design space (at the high and low values of the factors). The randomly

chosen tuning parameters should therefore be restricted to be within a cer-

tain percentage of the edges of the factors’ ranges. In this research, a limit of

within 10% of the factor high and low values is used.

4. Random run order. A random run order is generated for the treatments

and a given number of replicates. In this research, 3 replicates are used. 3

replicates is enough to give an estimate of how variable the response is for a

given treatment. We are conducting this confirmation to ensure our subjec-

tive decisions in the model building were correct. We are not conducting a

new statistically designed experiment that would introduce further subjective

diagnostic decisions as per the previous section.

5. Prediction Intervals. The collected data is compared to the model’s 95%

high and low prediction intervals (Section A.3.5 on page 219). We identify two

criteria upon which our satisfaction with the model (and thus confidence in

its predictions) can be judged [106].

• Conservative: we should prefer models that provide consistently higher

predictions of relative error and higher solution time than those actually

observed. We typically wish to minimise these responses and so a con-

servative model will predict these responses to be higher than their true

value.

• Matching Trend: we should prefer models that match the trends in

heuristic performance. The model’s predictions of the parameter combi-

nations that give the best and worst performance should match the com-

binations that yield the actual metaheuristic’s observed best and worst

performance.

6. Confirmation. If the model is not a satisfactory predictor of the actual al-

gorithm then the experimenter must return to the model building phase and

attempt to improve the model.

At this stage, we have built models of each response from the gathered data and

the models have been confirmed to be good predictors of the algorithm responses

around the edges of the design space. We can now analyse the models and rank

and screen the factors.

6.3.8 Analysis

The following steps for analysing a screening model must be repeated for each

response independently.

1. Rank most important factors. The terms in the model should be ranked

according to their ANOVA Sum of Square values (Section A.4 on page 220).

115


These ranks can then be studied alongside the corresponding p values for the

model terms. The most important model terms will have large sum of square

values and a p value that shows they are statistically significant.

2. Screen factors. Factors that are not statistically significant and have a rel-

atively low ranking can be removed immediately from the subsequent experi-

ments as they do not have an important influence on the response. Further-

more, factors that are statistically significant but practically insignificant can

also be considered for removal from subsequent experiments. The extent to

which we screen out factors will depend on the experimental resources avail-

able for subsequent experiments, our knowledge of the metaheuristic’s tuning

parameters and our confidence in the screening experiment’s recommenda-

tions. Note also that if a factor is important for even one response then it must

remain in the subsequent experiments. Subsequent experiments combine all

responses into a single model of performance.

3. Model graphs. Graphs of the response for each factor should be examined.

This serves two purposes. It confirms our decision to screen factors in the

previous step. Furthermore, it shows us whether statistically significant and

highly ranked factors actually have a practically significant effect on the re-

sponse.

At this stage, the relative importance of each term in the model has been as-

sessed. A linear relationship between the factors and the response has been as-

sumed. The screening design has therefore yielded a planar relationship between

the factors and the response. The relationship between the factors and response

is often of a higher order than planar. A higher order relationship will exhibit some

curvature. It is therefore important to determine whether such curvature exists so

that we can establish the need to use a more sophisticated response surface and

associated experiment design in subsequent experiments.

6.3.9 Check for curvature

Adding centre points to a design allows us to determine whether the response

surface is not planar but actually contains some type of curvature. A centre point

is simply a treatment that is a combination of all factors’ values at the centre of

the factors’ ranges1. The average response value from the actual data at the centre

points is compared to the estimated value of the centre point that comes from

averaging all the factorial points. If there is curvature of the response surface in the

region of the design, the actual centre point value will be either higher or lower than

predicted by the factorial design points. If no curvature exists then the screening

experiment’s planar models should be sufficient to predict responses. This should

first be confirmed as per Section 6.3.7 on page 114 but with the difference that

1 Centre points must therefore be replicated for each level of each categoric factor since the ‘middle’of a categoric factor does not make any sense.

116


treatments are drawn from throughout the design space rather than just from the

edges.

If the analysis with centre points reveals the possibility of curvature in the ac-

tual data then further experiments are required to predict the responses through-

out the whole design space. These experiments are the subject of the next stage in

the methodology.

6.4 Stage 2: Modelling

The response surface methodology is similar to the screening methodology of Sec-

tion 6.3.4 on page 112 in many regards. It involves a fractional factorial type de-

sign, analysed with ANOVA. Its data are collected following the good practices for

experiments with heuristics developed in Chapter 3 and illustrated in the screen-

ing methodology of Section 6.3 on page 110. The most significant difference in the

response surface methodology is in the experiment design it requires. The screen-

ing design used a simple linear model to determine whether there was a significant

difference between high and low levels of the factors. For this purpose, only the

edges of the design space were of interest and only these were examined when con-

firming the model (see Section 6.3.7 on page 114). A Response Surface Model, by

contrast, requires a more sophisticated design since it attempts to build a model of

the factor-problem-performance relationship across the whole design space. The

experiment designs and methodology presented here were first introduced to the

ACO field by the author [104, 106].

6.4.1 Motivation

A detailed motivation for performance modelling has already been presented in

Chapter 1. Modelling heuristic performance is a sensible way to explore the vast

design space of tuning parameter settings and their relationship to problem in-

stances and heuristic performance. A good model can be used to quickly rec-

ommend tuning parameter settings that maximise performance given an instance

with particular characteristics. Performance models can also provide visualisa-

tions of the robustness of parameter settings in terms of changing problem char-

acteristics.

6.4.2 Research questions

This parameter tuning study addresses the following research questions.

• Screening. Which tuning parameters and which problem characteristics

have no significant effect on the performance of the metaheuristic in terms

of solution quality and solution time? If a screening study has already been

conducted correctly and the tuning study is performed on the screened pa-

rameter set, one would expect all remaining parameters to have a significant

effect on performance. The screening study is a more efficient method of an-

swering screening questions but the tuning study can nonetheless identify

117


further unimportant parameters that may have been missed in the screening

study.

• Ranking. What is the relative importance of the most important tuning pa-

rameters and problem characteristics?

• Sufficient order model. What is the minimum order model that satisfactorily

models performance? We know from the screening study whether or not a

linear model is sufficient. Tuning studies offer more advanced fit analysesthat recommend the minimum order model that is required for the data.

• Relationship between tuning, problems and performance. What is the

relationship between tuning parameters, problem characteristics and the re-

sponses of solution quality and solution time. A tuning study yields a math-

ematical equation describing this relationship for each response.

• Tuned parameter settings. What is a good set of tuning parameter settings

given an instance with certain characteristics? Are these settings better than

what can be achieved with randomly chosen settings? Are these settings

better than alternative settings from the literature?

The first two research questions are identical to questions in the screening

study of Chapter 8. The tuning study does not obviate the need for a screening

study. A screening study is a simpler and more efficient method for answering

these particular research questions.


Building a response surface requires a particular type of experiment design. There

are several alternatives available and these are discussed in Section A.3.4 on

page 218. In this thesis, the Face-Centred Composite (FCC) design is used for

all models. The FCC is most appropriate for situations where design factors are

restricted to be within a certain range. This is the case with many metaheuristic

tuning parameters. For example, in ACO, the pheromone related parameter ρ must

be within the range 0 < ρ < 1.

6.4.4 Method

The method for response surface modelling is detailed below. As already noted, it

is very similar to the screening methodology of Section 6.3 on page 110.




2. Design factors and factor ranges. Choose the algorithm tuning parameters

and problem characteristics whose relationship to performance metrics will

be modelled by the response surfaces. Note that problem characteristics must

be included in the model so that we can correctly model their relationship to

118


the tuning parameters and the heuristic performance. If the response surface

design has been augmented from a previous screening design by adding star

points then the factors and factor ranges are already determined.

3. Held constant factors and values. Because of limited resources, it may not

be possible to experiment with all potential design factors. A pilot study may

have revealed that some factors have a negligible effect on performance. These

factors must be held at a constant value for the duration of the experiments.

4. Screened out factors. Factors that have been screened out in the previous

screening study can be set to any values with impunity.

5. Experiment design. For the given number of factors, choose an appropriate

fractional factorial design for the factorial part of the Face-Centred Compos-

ite design. Where there are several available designs of resolution IV, check

whether the design requiring a smaller number of treatments has a more

satisfactory aliasing structure. The choice of the FCC design determines the

location of the design’s star points and consequently the levels of the factors.

6. Run Order. This is now the minimum design with a single replicate. Generate

a random run order for the design.

7. Gather Data. Collect data according to the treatments and their run order.

8. Significance level, effect size and replicates. Choose an appropriate sig-

nificance (alpha) level for the study. Choose an appropriate effect size that

the screening must detect. For the study’s chosen significance level (5% in

this thesis), examine the design’s power to detect the chosen effect size. If

the power is not greater than 80%, introduce replicates into the design. Add

replicates to the design and return to the Run Order step and gather the

extra data. Sufficient data has been gathered when power reaches 80% by

convention.

At this stage, we have sufficient data to build a response surface model of each

response.

6.4.5 Build a Model

Building the models for a response surface differs in several ways from building

a model for screening. When screening, a model of each response was analysed

separately. This meant that outliers removed from one response’s model would not

necessarily be removed from another response’s model. With the response surface

models however, we will ultimately be combining all the models’ recommendations

into a simultaneous tuning of all the responses. It is therefore more appropriate

that outlier runs deleted from the analysis of one response will remain deleted from

the analysis of other responses.

When screening, it was sufficient to use two levels of each factor and so we were

limited to a linear/planar model of the responses. The response surface model is

119


more complicated in that it can model higher-order surfaces. The first step is

therefore to determine the most appropriate model for the data.

1. Model Fitting. The highest order model that can be generated from the FCC

design is quadratic. All lower order models (linear and 2-factor interaction)

are generated and then assessed on two aspects: their significance and their

R-squared values.

(a) Begin with the lowest order model, the linear model. If the model is not

significant, it is removed from consideration and the next highest order

model is examined for significance.

(b) Examine the adjusted R-squared and predicted R-squared values. These

should be within 0.2 of one another and as close to 1 as possible. If

the R-squared values are not satisfactory, the model is removed from

consideration and the next highest order model is examined.

If models of a higher order than quadratic are required then an alternative

design to the FCC will have to be used.

2. Find important effects. A stepwise linear regression is performed on the

chosen model to estimate its coefficients. Note that this is different from the

screening stage of Section 6.3 on page 110. Here, terms are being removed

from the model to give the most parsimonious model possible. Screening, by

contrast, determines which factors (and all associated model terms) do not

even take part in the experiment design. Of course, if screening has been

accurate then few main effect terms should be removed from the response

surface model by stepwise linear regression.

3. Diagnosis. The usual diagnostics of the linear regression model are per-

formed. If the model passes these tests then its proposed coefficients can be

accepted.

4. Response Transformation. The diagnostics may reveal that a transforma-

tion of the response is required. In this case, perform the transformation and

return to step 2.

5. Outliers. If the transformed response is still failing the diagnostics, it may be

that there are outliers in the data. These should be identified and removed

before returning to step 2.

6. Model significance. Check that the overall model is significant.

7. Model Fit. Check that the predicted R-Squared value is in reasonable agree-

ment with the Adjusted R-Squared value and that both of these are close to

1. Check that the model has a signal to noise ratio greater than about 4.

As with the screening procedure, it is good practice to independently confirm

the models’ accuracy on some real data as recommended by the DOE approach.

120


6.4.6 Confirmation

The methodology for confirming our response surface models differs in one im-

portant way from the previous method for confirming our screening models (Sec-

tion 6.3.7 on page 114). The randomly chosen instances and parameter settings

are now drawn from across the entire design space rather than being limited to the

edges of the design space. The methodology is the same as that of Section 6.3.7 on

page 114 in all other aspects and so is not repeated here.

6.4.7 Analysis

The following analysis is performed for each model separately.

1. Rank most important factors. The terms in the model equations should be

ranked according to their ANOVA F values. These ranks can then be stud-

ied alongside the corresponding p values for the model terms. The rankings

should be in approximate agreement with the previous screening study.

2. Model graphs. Graphs of the responses for each factor should be examined.

This shows us whether statistically significant and highly ranked factors ac-

tually have a practically significant effect on the response. It also shows the

likely location of optimal response values. Surface plots of pairs of factors are

particularly insightful.

At this stage, the experimenter has a model of each of the responses over the

entire design space and these models have been confirmed to be accurate predic-

tors of the actual metaheuristic. It is now possible to use this model for tuning the

actual metaheuristic.

6.5 Stage 3: Tuning

Screening has provided us with a reduced set of the most important problem char-

acteristics and the most important algorithm tuning parameters. The Response

Surface Model has given us mathematical functions relating these characteristics

and tuning parameters to each of the responses of interest. The accuracy of these

models’ predictions have been methodically confirmed and we are satisfied that

the models meets our quality criteria (see Section 6.4.6).

Since the Response Surface Models are mathematical functions of the factors,

it is possible to numerically optimise the responses by varying the factors. This

allows us to produce the most efficient process. There are several possible optimi-

sation goals. We may wish to achieve a response with a given value (target value,

maximum or minimum). Alternatively, we may wish that the response always falls

within a given range (relative error less than 10%). More usually, we may wish to

optimise several responses because of the heuristic compromise. We have seen in

the literature review of Chapter 4 on page 77 that such tuning rarely deals with

121


both solution quality and solution time simultaneously and so neglects the heuris-

tic compromise. We introduce here a technique from Design Of Experiments that

allows multiple response models to be simultaneously tuned.

6.5.1 Desirability Functions

The multiple responses are expressed in terms of desirability functions [40]2. De-

sirability functions are described in more detail in Section A.3.6 on page 220. The

overall desirability for all responses is the geometric mean of the individual desir-

abilities.

The well-established Nelder-Mead downhill simplex [98, p. 326] is then applied

to the response surface model’s equations such that the desirability is maximized.

We specify the optimization in this research with the dual goals of minimizing both

solution error and solution time, while allowing all algorithm-related factors to vary

within their design ranges3. Equal priority is given to the dual goals. This does not

preclude running an optimisation that favours solution quality or favours solution

time, as determined by the tuning application scenario. Problem characteristics

are also factors in the model since we want to establish the relationship between

these problem characteristics, the algorithm parameters and the performance re-

sponses. It does not make sense to include these problem characteristic factors

in the optimization. The optimisation process would naturally select the easiest

problems as part of its solution. We therefore needed to choose fixed combina-

tions of the problem characteristics and perform the numerical optimizations for

each of these combinations. A sensible choice of such combinations is a three-

level factorial of the characteristics. A more detailed description of these methods

follows.

6.5.2 Method

These are the steps used to simultaneously optimise the desirability of the solution

quality and solution time responses.

1. Combinations of problem characteristics. A three-level factorial combina-

tion of the problem characteristics is created. In the case of two characteris-

tics, this creates 9 combinations of problem characteristics. These combina-

tions are illustrated in Table 6.1 on the next page.

2. Numerical optimisation. For each of these combinations, a numerical opti-

misation of desirability is performed using the Nelder-Mead Simplex with the

2 From the NIST/SEMATECH e-Handbook of Statistical Methods, available athttp://www.itl.nist.gov/div898/handbook/pri/section5/pri5322.htm.

3 It is important to note that optimisation of desirability does not necessarily lead to parameter rec-ommendations that yield optimal metaheuristic performance. As stated in the opening of Section 6.5.1,desirability functions are a geometric mean of the desirability of each individual response. Furthermore,a response surface model is an interpolation of the responses from various points in the design space.There is therefore no guarantee that the recommended parameters result in optimal performance, theyonly result in tuned performance that is better than performance in most of the design space. Onemust be careful to distinguish between optimised desirability and tuned parameters.

122


Standard Order Characteristic A Characteristic B

1 Level 1 of A Level 1 of B









Table 6.1: A full factorial combination of two problem characteristics with three levels of eachcharacteristic. This results in nine treatments.

following settings:

• Cycles per optimisation is 30.

• Simplex fraction is 0.1.

• Design points are used as the starting points.

• The maximum number of solutions is 25.

The optimisation goal is to minimise the solution error response and the so-

lution run time response. These goals can be given different priorities if, for

example, quality is deemed more important than time. The problem charac-

teristics are fixed at the values corresponding to the 3 level factorial combi-

nation.

3. Choose best solution. When the optimisation has completed, the solution

with the highest desirability is taken and the others are discarded. Note that

there may be several solutions of very similar desirability but with differing

factor settings. This is due to the nature of the multiobjective optimisation

and the possibility of many regions of interest (Section A.2 on page 213).

4. Round off integer-valued parameters. If a non-integer value of an exponent

tuning parameter was recommended, this is rounded to the nearest integer

value for the reasons explained in Section 5.2.1 on page 99.

This optimisation procedure has given us recommended parameter settings for

9 locations covering the problem space. Of course, a user requiring more refined

parameter recommendations will have to run this optimisation procedure for the

problem characteristics of the scenario to hand. Optimisation of desirability is not

expensive but requires access to appropriate tools. Alternatively, an interpolation

of the recommendations across the design space is possible. The experimenter can

now evaluate the tuning recommendations.

123


6.6 Stage 4: Evaluation

There are two main aspects to the evaluation of recommended tuning parameter

settings. Firstly, we wish to assess how well the presented method’s recommended

parameter settings perform in comparison to alternative recommended settings.

Secondly, we wish to determine how robust the recommended settings are when

used with other problem characteristics.

6.6.1 Comparison

We wish to determine by how much the results obtained with the tuned parame-

ter settings are better than the results obtained with randomly chosen parameter

values and the results obtained with alternative parameter settings. We wish to

compare with randomly chosen parameters to demonstrate that the effort involved

in the methodology does indeed offer an improvement in performance over not

using any method at all. We compare with alternative parameter settings to see

whether the methodology is competitive with the literature or supports values rec-

ommended in the literature. Such alternative settings may include those recom-

mended by others or those determined using some other tuning methodology. In

this research, all results using DOE tuned parameters are compared to the results

obtained with the parameter settings recommended in the literature [47].

The methodology is described below. It is similar in terms of set-up to the previ-

ous confirmation experiments for the screening models (Section 6.3.7 on page 114)

and response surface models (Section 6.4.6 on page 121). It must be repeated for

every combination of problem characteristics used to cover the design space (see

Section 6.5.2 on page 122).

1. Generate problem instances. A number of problem instances are created at

randomly chosen locations in the problem space.

2. Randomly choose combinations of tuning parameters. For each instance,

a number of sets of parameter settings are chosen randomly within the design

space.

3. Use models to choose parameter settings. For each instance, a parameter

setting is chosen using the desirability optimisation of the response surface

model. In this research, two models were built: a relative error vs time model

and an ADA vs time model. There are therefore two sets of parameter recom-

mendations, one for each model.

4. Solve instances. The instances are solved using the various parameter set-

tings.

5. Plot. A scatter plot is used to illustrate the differences, if any, between the

solutions obtained with the various parameter settings.

It is important once again to draw the reader’s attention to the heuristic compro-

mise. We may find that recommended parameter settings offer a similar solution

124


quality to the DOE method’s recommendations. However, the DOE methodology is

constrained to also find settings that offer good solution times. Only when both

responses are examined do we have a realistic comparison of the performance of

the parameter settings.

6.6.2 Robustness

At this point, we have tuned parameter settings for each member of a set of combi-

nations of problem characteristics. This set was chosen so that it covers the whole

space of possible problems in some sensible fashion. We also have the model

equations of the responses that allow us to calculate tuned parameter settings for

new combinations of problem instance characteristics. It may not always be con-

venient or necessary to perform these calculations. A technique from Design Of

Experiments called overlay plots is adapted in this thesis so that statements can be

made about the robustness of tuned parameter settings across a range of different

problem instance characteristics [106].

Overlay plots are a visual representation of regions of the design space in which

the responses fall within certain bounds. They can be considered as a more relaxed

optimisation than that of tuning where the experimenter was looking for maxima

and minima. For example, the experimenter might query the response models for

ranges of some tuning parameters within which the solution time response is less

than a certain value. Overlay plots are very useful in the context of robustness of

tuned parameters. Recall that the researcher has several sets of tuned parameter

values for various combinations of problem characteristics. For any one of these

combinations of problem characteristics, the associated tuned parameter settings

should result in maxima or minima of the responses. Now one may also wonder

whether adjacent combinations of problem characteristics in the problem space

could also be quite well solved with the same tuned parameter settings. Overlay

plots allow one to visualise the answer to this question. Figure 6.3 on the next

page illustrates an overlay plot.

The horizontal and vertical axes represent values of two problem instance char-

acteristics. The tuned parameter settings are listed to the left. The white area

represents all instance combinations that are solvable with these parameter set-

tings within given relaxed bounds on the performance responses. Clearly, this area

will be larger for more robust tuned parameter settings.

Overlay plots are a powerful tool that is only available when one has a model of

metaheuristic performance across the whole design space. They are backed by the

same rigour and statistical confidence as all DOE methods.

6.7 Common case study issues

Thus far, this chapter has detailed the methodologies that the thesis has adapted

from DOE and will apply to the parameter tuning problem. Chapter 3 highlighted

how even within a good methodology there are many experimental design decisions

that must be taken. This section summarises those common issues and decisions

125


Figure 6.3: A sample overlay plot. The horizontal and vertical axes represent two problem charac-teristics. The tuning parameter settings are listed on the left. Two constraints on solution have beenspecified: the time must be less than 5 seconds and the relative error must be less than 1%. The whitearea is the region of the problem space that can be solved within these constraints on time and solutionquality.

taken across all subsequent case studies. They are reported here to avoid rep-

etition in the case studies but a stand-alone case study should report all of the

following for completeness and to aid reproducibility.

6.7.1 Instances

All TSP instances were of the symmetric type. In the Euclidean TSP, cities are

points with integer coordinates in the two-dimensional plane. For two cities at

coordinates (x1, y1) and (x2, y2), the distance between the cities is computed ac-

cording to the Euclidean distance√

(x1 − x2)2 + (y1 − y2)2. However this definition

of distance must be modified slightly. In the given form, this equation produces

irrational numbers that can require infinite precision to be described correctly.

This causes problems when comparing tour lengths produced by solution tech-

niques. In all problems encountered in this thesis, distance is calculated using

(int)[√

(x1 − x2)2 + (y1 − y2)2 + 0.5]. This is the so-called EUC2D distance type as

specified in the online TSP benchmark library TSPLIB [102] and as used in the

original ACOTSP code (Section 5.2 on page 97).

The TSP problem instances ranged in size from 300 cities to 500 cities with cost

matrix standard deviation ranging from 10 to 70. All instances had a mean of 100.

The same instances were used for each replicate of a design point.

Instances were generated with a version of the publicly available portmgen

problem generator from the DIMACS challenge [58] as described in Section 5.1

on page 95.

126


6.7.2 Stopping criterion

All experiments except those in Chapter 7 were halted after a stagnation stopping

criterion. Stagnation was defined as a fixed number of iterations in which no

improvement in solution value had been obtained. Responses were measured at

several levels of stagnation during an experiment run: 50, 100, 150, 200 and 250

iterations. This facilitated examining the data at alternative stagnation levels to

ensure that conclusions were the same regardless of stagnation level.

6.7.3 Response variables

Three response variables were measured. The time in seconds to the end of an ex-

periment reflects the solution time. The adjusted differential approximation (ADA)

and relative error from a known optimum reflect the solution quality. These were

described in Section 3.10.2 on page 69.

Concorde [5] was used to calculate the optima of the instances. Expected ran-

dom solution values of the instances, as used in the ADA calculation, were gener-

ated by randomly permuting the order of cities in a TSP instance 200 times and

taking the average tour length from these permutations.

6.7.4 Replicates

The design points were replicated in a work up procedure (Section A.6.1 on page 227)

until sufficient power of 80% was reached for detecting a given effect size with an

alpha level of 5% for all responses. The 80% power and 5% significance level were

chosen by convention. The size of effect that could feasibly be detected depended

on the particular response and the particular experiment design.

6.7.5 Benchmarks

All experimental machines were benchmarked as per the DIMACS benchmarking

procedure described in Chapter 5. The results of this benchmarking are presented

in Section 5.3 on page 100. These benchmarks should be used when scaling CPU

times in future research.

6.7.6 Factors, levels and ranges

Held-Constant Factors

There are several held constant factors common to all case studies. Local search, a

technique typically used in combination with ACO heuristics, was omitted. There

are two reasons for this omission. Firstly, there is a large number of local search

alternatives from which to choose and choosing one would have restricted the

thesis’s conclusions to a particular local search implementation. Secondly, the

overwhelming contribution to ACO solution quality comes from local search. This

would defeat the thesis aim of evaluating and demonstrating the tuning of a heuris-

tic with a large number of parameters. All instances had a cost matrix mean of 100.

127


The computation limit parameter (Section 2.4.8 on page 44) was fixed at being

limited to the candidate list length as this resulted in significantly lower solution

times.

Nuisance Factors

A limitation on the available computational resources necessitated running exper-

iments across a variety of machines with slightly different specifications. There

was no control over background processes running on these machines. Runs

were executed in a randomised order across these machines to counteract any

uncontrollable nuisance factors due to the background processes and differences

in machine specification.

6.7.7 Outliers

This research used the approach of deleting outliers from an analysis until the

analysis passed the usual ANOVA diagnostics (Section A.4.2 on page 221).

6.8 Chapter summary

This chapter has detailed the sequential experimentation methodology that is used

in the rest of the thesis. The following topics were covered.

• Iterative experimentation. Experimentation for modelling any process is

inevitably iterative. Metaheuristics are no different. The experimenter often

knows little about the heuristic at the start of experimentation. Once data

is gathered and understanding of the heuristic deepens, the original experi-

ment design decisions may have to be revised. Any procedure for algorithm

modelling must efficiently incorporate the iterative nature of experimentation.

• Sequential experimentation procedure. There is a well-established sequen-

tial experimentation procedure in traditional DOE. This thesis modifies that

procedure so that it can be applied to metaheuristics.

• Choosing problem characteristics. Experimentation with metaheuristics is

different from traditional DOE in many regards. In particular, both problem

characteristics and algorithm tuning parameters must be incorporated into

the model so that parameter settings for new instances can be selected. In

the spirit of sequential experimentation, we would like to determine quickly

which problem characteristics should be included in subsequent experiment

designs rather than building a large all-encompassing and expensive design.

The two-stage nested design was introduced in Section 6.2 on page 106 as a

methodical way to generalise performance effects due to a problem charac-

teristic despite the individual uniqueness of each problem instance.

• Performing confirmation runs. Confirmation runs are important when val-

idating conclusions in traditional DOE. Many DOE analyses involve some

128


subjective decisions. It is important then to confirm conclusions from the

DOE procedures. This increases our confidence in our analysis—if the meth-

ods give good predictions of actual performance then they are sufficient for

our engineering purposes. There are two major types of confirmation runs

encountered here.

1. Model confirmation runs are used to confirm that the ANOVA equation

is a good prediction of the actual response.

2. Tuning confirmation runs are used to confirm that the recommended

tuning parameter settings are indeed as good as or better than alterna-

tives.

• Factor Screening. Screening experiments allow us to rank the importance

of each tuning parameter and problem characteristic, determining those that

matter most to performance and those that have no statistically significant

or practically significant effect. Screening experiment designs do not have

to be of a high enough resolution to make statements about higher order

interactions. They are a quick way to reduce the size of subsequent response

surface designs.

• Response Surface Modelling. Response surface modelling determines the

relationship between tuning parameters, problem characteristics and perfor-

mance responses. A surface is built for each response separately.

• Desirability functions. Desirability functions are a DOE technique for com-

bining multiple responses into a single response. We have introduced desir-

ability functions to the ACO community [104, 106, 109]. This permits easy

tuning of factors while observing the heuristic compromise of high solution

quality in reasonable solution time.

• Tuning. Once all responses are expressed in a single desirability function,

the multi-objective optimisation of all the responses in terms of the tuning

parameters can be performed using well-established numerical optimisation

methods. Optimisation can only be performed for fixed combinations of prob-

lem characteristics. Including problem characteristics in the optimisation

would not make sense as the optimisation would select the easiest combi-

nation of problem characteristics. We therefore perform the optimisations at

combinations of problem characteristics that span the design space. The rec-

ommendations from tuning for these combinations of problem characteristics

can be interpolated across the design space. Alternatively, tuning can be per-

formed for every specific combination of problem characteristics that the user

is presented with.

• Overlay plots. Since tuned parameter settings relate to specific combina-

tions of problem characteristics, it is useful to determine how robust those

parameter settings are to changes in the original problem characteristic val-

ues. Overlay plots provide a useful visual tool for doing this. Given a bound

129


on the responses, we can plot all the combinations of problem characteris-

tics that the algorithm will solve within these bounds using the given tuned

parameter settings.

The chapters in the next part of this thesis will apply this sequential experimen-

tation methodology and its procedures to several ACO algorithms to test the thesis

hypothesis.

130

Part IV

Case Studies

131

7Case study: Determining whether

a problem characteristic affectsheuristic performance

This Chapter reports a case study on how to determine whether a problem charac-

teristic affects the performance of a metaheuristic. The methodology for this case

study was proposed and described in Chapter 6.

The Chapter reports a new result for ACO. The standard deviation of TSP edge

lengths has a significant effect on the difficulty of TSP instances for two ACO algo-

rithms, ACS and MMAS1. The results reported in this Chapter have been published

[105, 110]. The results support conclusions from similar experiments with exact

algorithms [26] and provide a detailed illustration of the application of techniques

of which the research community is becoming increasingly aware [6].

7.1 Motivation

An integral component of the construct solutions phase (Section 2.4.3 on page 37)

of ACO algorithms depends on the relative lengths of edges in the TSP. These edge

lengths are often stored in a TSP cost matrix. The probability with which an ant

chooses the next node in its solution depends, among other things, on the rela-

tive length of edges connecting to the nodes being considered (Equation ( 2.2 on

page 39)). Intuitively, it would seem that a high variance in the distribution of edge

lengths would result in a different problem to a low variance in the distribution of

edge lengths. This has already been investigated for exact algorithms for the TSP

1Recall from Section 6.7.6 on page 127 that edge length mean was held constant during all experi-ments. The results in terms of standard deviation of edge lengths can therefore be interpreted in termsof the scale-free ratio of standard deviation to mean edge length.

133

CHAPTER 7. CASE STUDY: DETERMINING WHETHER A PROBLEM CHARACTERISTICAFFECTS HEURISTIC PERFORMANCE

(Section 4.1 on page 77, [26]). It was shown that the standard deviation of edge

lengths in a TSP instance has a significant effect on the problem difficulty for an

exact algorithm. This leads us to suspect that standard deviation of edges lengths

may also have a significant effect on problem difficulty for the ACO heuristics.

This research is worthwhile for several reasons. Current research on ACO algo-

rithms for the TSP does not report the problem characteristic of standard deviation

of edge lengths. Assuming that such a problem characteristic affects performance,

this means that for instances of the same or similar sizes, differences in perfor-

mance are confounded (Section A.1.6 on page 212) with possible differences in

standard deviation of edge lengths. Consequently, too much variation in perfor-

mance is attributed to problem size and none to problem edge length standard

deviation. Furthermore, in attempts to model ACO performance, all important

problem characteristics must be incorporated into the model so that the relation-

ship between problems, tuning parameters and performance can be understood.

With this understanding, performance on a new instance can be satisfactorily pre-

dicted given the salient characteristics of the instance.

7.2 Research question and hypothesis

The research question of this case study can be phrased as follows:

Does the variability of edge lengths in the Travelling Salesperson Prob-

lem affect the difficulty of the problem for the ACO metaheuristic?

This can be refined to the following research hypotheses, phrased in terms of

either MMAS or ACS:

• Null Hypothesis H0: the standard deviation of edge lengths in TSP instances’

cost matrices has no effect on the average quality of solutions produced by

the algorithm.

• Alternative Hypothesis H1: the standard deviation of edge lengths in TSP

instances’ cost matrices affects the average quality of solutions produced by

the algorithm.

7.3 Method

7.3.1 Response Variable

The response variable was solution quality, measured as per Section 6.7.2 on

page 127. The response was measured after 1000, 2000, 3000, 4000 and 5000

iterations.

7.3.2 Instances

Instances were generated as per Section 6.7.1 on page 126. Standard deviation of

TSP cost matrix was varied across the 5 levels: 10, 30, 50, 70 and 100. Three prob-

134


lem sizes; 300, 500 and 700 were used in the experiments. The same instances

were used for the two algorithms and the same instance was used for replicates of

a design point.

7.3.3 Factors, Levels and Ranges

Design Factors

There were two design factors. The first was the standard deviation of edge lengths

in an instance. This was a fixed factor, since its levels were set by the experimenter.

Five levels: 10, 30, 50, 70 and 100 were used. The second factor was the individual

instances with a given level of standard deviation of edge lengths. This was a

random factor since instance uniqueness was caused by the problem generator

and so was not under the experimenter’s direct control. Ten instances were created

within each level of edge length standard deviation.

Held-constant Factors

There were many common held-constant factors as per Section 6.7.6 on page 127.

This study also contained further held-constant factors. Problem size was fixed

for a given experiment. Sizes of 300, 500 and 700 were investigated. Two ACO

algorithms were investigated: MMAS (Section 2.4.6 on page 42) and ACS (Sec-

tion 2.4.5 on page 39). These algorithms were chosen because they are claimed to

be the best performing of the ACO heuristics and because they are representative

of the two main types of ACO heuristic: Ant System descendents and non-Ant Sys-

tem descendents respectively. The held constant tuning parameter settings for the

heuristics are listed in the following table.

Parameter Symbol ACS MMAS

Ants m 10 25

Pheromone emphasis α 1 1

Heuristic emphasis β 2 2

Candidate List length 15 20

Exploration threshold q0 0.9 N/A

Pheromone decay ρglobal 0.1 0.8

Pheromone decay ρlocal 0.1 N/A

Solution construction Sequential Sequential

Table 7.1: Parameter settings for the problem difficulty experiments with the ACS and MMAS algo-rithms. Values are taken from the original publications [118, 47]. See Section 2.4 on page 34 for adescription of these tuning parameters and the MMAS and ACS heuristics.

It is important to stress that this research’s use of parameter values from the

literature by no means implies support for such a ‘folk’ approach to parameter

selection in general. Selecting parameter values as done here strengthens any

conclusions in two ways. It shows that results were not contrived by searching for

135


a unique set of tuning parameter values that would demonstrate the hypothesised

effect. Furthermore, it makes the research conclusions applicable to all other

research that has used these tuning parameter settings without the justification

of a methodical tuning procedure, as proposed in this thesis. Recall from the

motivation (Section 7.1 on page 133) that demonstrating an effect of edge length

standard deviation on performance with even one set of tuning parameter values

is sufficient to merit the factor’s consideration in parameter tuning studies. The

results from this research will therefore directly affect the results on parameter

tuning in later chapters.

7.3.4 Experiment design, power and replicates

The experiment design is a two-stage nested design (Section 6.2.1 on page 108).

The standard deviation of edge lengths is the parent factor. The individual in-

stances factor is nested within this.

The heuristics are probabilistic (Section 2.4) and so repeated runs with identi-

cal inputs (instances, parameter settings etc.) will produce different results. All

treatments are thus replicated in a work up procedure (Section A.6.1 on page 227)

until sufficient power of 80% was reached to detect an effect for the study’s signifi-

cance level of 1%. Power was calculated with Lenth’s power calculator [76]. For all

experiments, 10 replicates were sufficient to meet these requirements.

7.3.5 Performing the Experiment

Randomised run order

Available computational resources necessitated running experiments across a va-

riety of similar machines. Runs were executed in a randomised order across

these machines to counteract any uncontrollable nuisance factors. While such

randomising is strictly not necessary when measuring a machine-independent re-

sponse, it is good practice nonetheless.

Stopping Criterion.

Experiments were halted after a fixed iteration stopping criterion (Section 3.13 on

page 73). The number of fixed iterations was 5000. A potential problem with this

approach is that the choice of combinatorial count can bias the results. Should

we stop after 1000 iterations or 1001? Taking response measurements after 1000,

2000, 3000, 4000 and 5000 iterations mitigated this concern. The data were

separately analysed at the 1000 and 5000 measurement points.

7.4 Analysis

7.4.1 ANOVA

The two-stage nested designs were analysed with the General Linear Model. Stan-

dard deviation was treated as a fixed factor since we explicitly chose its levels and

136


instance was treated as a random factor. The technical reasons for this decision

in the context of experiments with heuristics have recently been well explained in

the metaheuristics literature [6].

To make the data amenable to statistical analysis, a transformation (as per

Section A.4.3 on page 222) of the responses was required for each analysis. The

transformations were either a log10, inverse square root transformation or a square

root transformation.

Outliers were deleted and the model building repeated until the models passed

the usual ANOVA diagnostics for the ANOVA assumptions of model fit, normality,

constant variance, time-dependent effects, and leverage (Section A.4.2 on page 221).

Figure 7.1 lists the number of data points deleted in each analysis.

Algorithm Problem size ADA Relative

ErrorACS 300 3 3

500 0 2700 5 4

MMAS 300 3 7500 1 2700 4 2

Figure 7.1: Number of outliers deleted during the analysis of each problem difficulty experiment. Eachexperiment had a total of 500 data points.

The number of outliers deleted in each case is very small in comparison to the

total number of 500 data points. Further details on these analyses and diagnostics

are available in many textbooks [84] and in Appendix A.

7.5 Results

In all cases, the effect of standard deviation of edge length on solution quality

was deemed statistically significant at the p < 0.01 level. The effect of individual

instance was also deemed statistically significant at the p < 0.01 level, however, an

examination of the data shows that this effect was not of practical significance.

The following figures illustrate box plots of the data for the problem sizes of

300 and 700 for ACS and MMAS. The same trends were observed for problems

of size 500 and so these plots are omitted. In each box-plot, the horizontal axis

shows the five levels of standard deviation of the instances’ edge lengths at the five

measurement points, 1000, 2000, 3000, 4000 and 5000 iterations. The vertical

axis shows the solution quality response in its original scale. There is a separate

plot for each algorithm and each problem size. Vertical axes have not been set to

the same scale in the various plots. This is to discourage performance comparisons

between plots because parameters had not been tuned to the different problem

sizes. Outliers have been included in these plots.

An examination of the plots shows that only standard deviation had a practically

significant effect on solution quality.

At each measurement point, there was a slight improvement in the response.

137


A:problemStDevRelErr 5000RelErr 4000RelErr 3000RelErr 2000RelErr 1000

1007050301010070503010100705030101007050301010070503010

14

12

10

8

6

4

2

0

Relative Error, ACS, size 300, mean 100

Figure 7.2: Relative Error response for ACS on problems of size 300, mean 100.

A:problemStDevRel Err 5000Rel Err 4000Rel Err 3000Rel Err 2000Rel Err 1000

1007050301010070503010100705030101007050301010070503010

18

16

14

12

10

8

6

4

2

0

Relative Error, ACS, size 700, mean 100

Figure 7.3: Relative Error response for ACS on problems of size 700, mean 100.

138



1007050301010070503010100705030101007050301010070503010

20

15

10

5

Relative Error, MMAS, size 300, mean 100

Figure 7.4: Relative Error response for MMAS on problems of size 300, mean 100.


1007050301010070503010100705030101007050301010070503010

25

20

15

10

5

Relative Error MMAS, size 700, mean 100

Figure 7.5: Relative Error response for MMAS on problems of size 700, mean 100.

139


This is to be expected since the metaheuristic has a larger number of iterations in

which to solve the problems. In all cases, problem instances with a lower standard

deviation had a significantly lower relative error value than instances with a higher

standard deviation.

In all cases, there was a higher variability in the relative error response between

instances with a higher standard deviation. The same conclusions were drawn

from an analysis of the ADA quality response.

7.6 Conclusions

For ACS and MMAS, applied to TSP instances generated with log-normally dis-

tributed edge lengths such that all instances have a fixed cost matrix mean of 100

and a cost matrix standard deviation varying from 10 to 70:

1. a change in cost matrix standard deviation leads to a statistically and prac-

tically significant change in the difficulty of the problem instances for these

algorithms.

2. there is no practically significant difference in difficulty between instances

that have the same size, cost matrix mean and cost matrix standard deviation.

3. there is no practically significant difference between the difficulty measured

after 1000 algorithm iterations and 5000 algorithm iterations.

Difficulty here means relative error from an optimum and the adjusted differ-

ential approximation. We therefore reject the null hypothesis of Section 7.2 on

page 134.

7.6.1 Implications

These results are important for the ACO community for the following reasons:

• They demonstrate in a rigorous, designed experiment fashion, that quality of

solution of an ACO TSP algorithm is affected by the standard deviation of the

cost matrix.

• They demonstrate that cost matrix standard deviation must be considered as

a factor when building predictive models of ACO TSP algorithm performance.

• They clearly show that performance analysis papers using ACO TSP algo-

rithms must report instance cost matrix standard deviation as well as in-

stance size since two instances with the same size can differ significantly in

difficulty.

• They motivate an improvement in benchmark libraries so that they provide a

wider crossing of both instance size and instance cost matrix standard devia-

tion. Plots of instances in the TSPLIB show that generated instances generally

have the same shaped distribution of edge costs (Appendix B).

140


7.6.2 Assumptions and restrictions

For completeness and for clarity, we state that this case study did not examine the

following issues.

• It did not examine clustered problem instances or grid problem instances.

These are other common forms of TSP in which nodes appear in clusters and

in a very structured grid pattern respectively. The conclusions should not be

applied to other TSP types without a repetition of this case study.

• Algorithm performance was not being examined since no claim was made

about the suitability of the parameter values for the instances encountered.

Rather, the aim was to demonstrate an effect for standard deviation and so

argue that it should be included as a factor in experiments that do examine

algorithm performance. These experiments are the subject of subsequent

case studies in the thesis.

• We cannot make a direct comparison between algorithms since algorithms

were not tuned methodically. That is, we are not entitled to say that ACS did

better than MMAS on, say, instance X with a standard deviation of Y.

• We cannot make a direct comparison of the response values for different sized

instances. Clearly, 3000 iterations explores a bigger fraction of the search

space for 300-city problems than for 500 city problems. Such a comparison

could be made if it was clear how to scale iterations with problem size. Such

scaling is an open question.

7.7 Chapter summary

This Chapter presented a case study on determining whether a problem charac-

teristic has an effect on the difficulty of problems for a given heuristic. Specifically,

it investigated whether the standard deviation of edge lengths in TSP instances

affects the quality of solutions produced by the MMAS and ACS heuristics.

The result, that symmetric TSP edge length standard deviation affects problem

difficulty is a new result for the ACO community. This result is particularly im-

portant for approaches to modelling and analysing ACO performance. This will be

illustrated in subsequent chapters in the thesis.

141

8Case study: Screening Ant

Colony System

This Chapter reports a case study on screening the factors affecting the perfor-

mance of a heuristic. The methodology for this case study was described in Chap-

ter 6. The particular heuristic studied is Ant Colony System for the Travelling

Salesperson Problem (Section 2.4.6 on page 42).

This chapter reports many new results for ACS. Established tuning parameters

previously thought to affect performance are actually shown to have no effect at

all. New tuning parameters that were thought to affect performance are investi-

gated. A new TSP problem characteristic is shown to have a very strong effect on

performance, confirming the results of Chapter 7. All analyses are conducted for

two performance measures, quality of solution and solution time. This provides an

accurate measure of the heuristic compromise that is rarely seen in the literature.

Finally, it is shown that models of ACS performance must be of a higher order than

linear. The results reported in this Chapter have been published in the literature

[108, 107].

8.1 Method

8.1.1 Response Variables

Three responses were measured as per Section 6.7.2 on page 127. These responses

were percentage relative error from a known optimum (henceforth referred to as

Relative Error), adjusted differential approximation (henceforth referred to as ADA)

and solution time (henceforth referred to as Time).

143

CHAPTER 8. CASE STUDY: SCREENING ANT COLONY SYSTEM


Design Factors

There were 12 design factors, 10 representing the ACS tuning parameters and 2

representing the TSP problem characteristics being investigated. The design fac-

tors and their high and low levels are summarised in the following table. A descrip-

tion of the ACS tuning parameter factors was given in Section 2.4.9 on page 45.

H:solutionConstruction and J:antPlacement could be considered as parameterised

design features as mentioned in Section 6.3.1 on page 111. The antsFraction and

nnFraction are expressed as a percentage of problem size.

The factor ranges were chosen to encompass common parameter values en-

countered in the literature. The experiences of other researchers using ACS has

been that 10 ants is a good parameter setting. If this is the case, our experi-

ments will confirm this recommendation methodically. Recall that this is a se-

quential experimentation methodology and does not preclude incorporating the

experimenter’s prior knowledge into the chosen factor ranges.

Factor Name Type Low High

A alpha Numeric 1 13

B beta Numeric 1 13

C antsFraction Numeric 1.00 110.00

D nnFraction Numeric 2.00 20.00

E q0 Numeric 0.01 0.99

F rho Numeric 0.01 0.99

G rhoLocal Numeric 0.01 0.99

H solutionConstruction Categoric Parallel sequential

J antPlacement Categoric Random same

K pheromoneUpdate Categoric BestSoFar bestOfIteration

L problemSize Numeric 300 500

M problemStDev Numeric 10.00 70.00

Table 8.1: Design factors for the screening study with ACS. There are two problem characteristicfactors (L-problemSize and M-problemStDev). The remaining 10 factors are ACS tuning parameters.


The held constant factors are as per Section 6.7.6 on page 127.

8.1.3 Instances

All TSP instances were of the symmetric type and were created as per Section 6.7.1

on page 126. The TSP problem instances ranged in size from 300 cities to 500

cities with cost matrix standard deviation ranging from 10 to 70. All instances had

a mean of 100. The same instances were used for each replicate of a design point.

144



The experiment design was a Resolution IV (12-5) fractional factorial with 24 centre

points.

The number of replicates was 8, determined using the work-up procedure (Sec-

tion A.6.1 on page 227) until a power of 80% was achieved for a significance level

of 5% when detecting an effect size of 0.18 standard deviations. This yielded a total

of 1048 runs. Figure 8.1 gives the descriptive statistics for the collected data and

the actual effect size for each response.

Iterations Time Relative Error ADA

50 Mean 51.79 7.77 4.15StDev 103.90 10.36 4.06Max 1173.42 56.63 20.21Min 0.13 0.45 0.64Actual Effect Size of 0.18 standard deviations 18.70 1.87 0.73





Figure 8.1: Descriptive statistics for the ACS screening experiment. Statistics are given at five stag-nation points ranging from 50 to 250 iterations. The actual effect size equivalent to 0.18 standarddeviations is also listed for each response variable.

8.1.5 Performing the experiment

Responses were measured at five stagnation levels. An examination of the de-

scriptive statistics in Figure 8.1 verifies that the stagnation level did not have a

large effect on the response values and therefore the conclusions after a 250 iter-

ation stagnation should be the same as after lower iteration stagnations. The two

solution quality responses show a small but practically insignificant decrease in

solution error as the stagnation iterations is increased.

145


The Time response increases with increasing stagnation iterations because the

experiments take longer to run. For all cases, the level of stagnation iterations

has little practically significant effect on the three responses. ACS did not make

any large improvements when allowed to run for longer. It is therefore sufficient to

perform analyses at the 250 iterations stagnation level.

8.2 Analysis

8.2.1 ANOVA

Effects for each response model were selected using stepwise regression (Sec-

tion A.4.1 on page 220) applied to a full 2 factor interaction model with an alpha

out threshold of 0.10. Some terms removed by stepwise regression were added

back into the final model to preserve hierarchy.

To make the data amenable to statistical analysis, a transformation of the re-

sponses was required for each analysis. The transformation was a log10 for all three

responses.




35 data points (3% of total data) were removed when analysing Relative Error. 36

data points (3% of total data) were removed when analysing ADA. 32 data points

(3% of total data) were removed when analysing Time.

8.2.2 Confirmation

Confirmation experiments were run according to the methodology detailed in Sec-

tion 6.3.7 on page 114 in order to confirm the accuracy of the ANOVA models. Re-

call that the general idea is to run the algorithm on new randomly chosen combina-

tions of parameter settings and problem characteristics and compare performance

to the ANOVA models’ predictions. The randomly chosen treatments produced

actual algorithm responses with the descriptives listed in the following figure.

Max Min Mean StDevRelative Error 250 85.59 0.91 11.78 17.60ADA 250 39.37 0.93 7.57 9.69Time 250 16141 1 477 1869

Figure 8.2: Descriptive statistics for the confirmation of the ACS screening ANOVA. The responsedata is from runs of the actual algorithm on the randomly generated confirmation treatments with astagnation stopping criterion of 250 iterations.

The large ranges of each response reinforce the motivation for correct parameter

tuning as there is clearly a high cost in incorrectly tuned parameters.

The next three figures illustrate the 95% prediction intervals (Section A.3.5 on

page 219) and actual data for the three response models, Relative Error, ADA and

Time respectively.

146


0.05.0

10.015.020.025.030.035.040.045.050.0

0 10 20 30 40 50Treatment

Rel

ativ

e E

rror 2

50

Relative Error25095% PI low

95% PI high

Figure 8.3: 95% Prediction intervals for the ACS screening of Relative Error.

05

10152025303540

0 10 20 30 40 50Treatment

AD

A 2

50

ADA 250

95% PI low

95% PI high

Figure 8.4: 95% Prediction intervals for the ACS screening of ADA.

The two solution quality responses are well predicted by their models. The

models match the trends of the actual data, successfully picking up the extremely

low and extremely high response values which vary over a range of 85% for relative

error and 39 for ADA. Both quality models tend to overestimate high values.

The time response is also well predicted by its model. Time is subject to the

nuisance factor of different experiment machines and is a more variable response

due to the nature of the ACS algorithm. The extremely high times and extremely

low times are predicted well for all the 50 treatments. This was achieved over a

range of over 16000s (see Figure 8.1 on page 145).

The three ANOVA models are therefore satisfactory predictors of the three ACS

performance responses for factor values within 10% of the factor range limits listed

in Section 8.1 on page 144.

8.3 Results

Figure 8.6 on the next page gives a summary of the Sum of Squares ranking and

the ANOVA F and p values for the three responses. Only the main effects are listed.

Those main effects that rank in the top 12 are highlighted in bold. Rankings

are based on the full two factor interaction model of 78 terms, before stepwise

regression was applied.

147


1

10

100

1000

10000

0 10 20 30 40 50Treatment

Tim

e 25

0

Time 250

95% PI low

95% PI high

Figure 8.5: 95% Prediction intervals for the ACS screening of Time.

ScreeningResults_ACS.xls

Main Effect Rank F value p value Rank F value p value Rank F value p value

A-alpha 62 1.15 0.2840 61 1.38 0.2405 16 12.01 0.0006B-beta 3 2103.73 < 0.0001 3 2128.02 < 0.0001 13 16.70 < 0.0001C-antsFraction 11 305.13 < 0.0001 10 310.74 < 0.0001 1 37566.17 < 0.0001D-nnFraction 4 1739.78 < 0.0001 4 1744.51 < 0.0001 3 1680.96 < 0.0001E-q0 2 6920.35 < 0.0001 2 6955.10 < 0.0001 7 41.76 < 0.0001F-rho 45 4.63 0.0317 44 5.08 0.0244 56 0.94 0.3313G-rhoLocal 13 227.57 < 0.0001 12 226.37 < 0.0001 41 2.88 0.0901H-solutionConstruction 14 151.18 < 0.0001 13 154.72 < 0.0001 4 80.62 < 0.0001J-antPlacement 75 0.01 0.9386 75 0.00 0.9831 54 1.28 0.2576K-pheromoneUpdate 47 4.41 0.0360 46 4.85 0.0278 5 49.39 < 0.0001L-problemSize 12 289.62 < 0.0001 18 52.52 < 0.0001 2 3334.25 < 0.0001M-problemStDev 1 43076.29 < 0.0001 1 9426.17 < 0.0001 43 2.57 0.1096

Relative Error ADA Time

1 of 1

Figure 8.6: Summary of ANOVAs for Relative Error, ADA and Time. Only the main effects are shown.Rankings are based on the full two factor interaction model of 78 terms, before backward selection wasapplied.

148


This table contains several results and answers to some open questions in the

ACO literature.

8.3.1 Screened factors

J-AntPlacement, is statistically insignificant at the 0.05 level for both quality re-

sponses and the time response. J-AntPlacement can therefore be directly screened

out.

Factor F-rho is statistically insignificant at the 0.05 level for the time response

and statistically significant for both quality responses. Despite its significance, the

rankings of F-Rho are very low (45, 44 and 56 out of 78 effects) across all three

responses. F-rho is therefore screened out.

A-Alpha is statistically insignificant at the 0.05 level for both quality responses

but statistically significant at the 0.05 level for the time response. This is reflected

in the low rankings for the quality responses and the high ranking (almost within

the top 12) for the time response. A-alpha should be set to 1 since this requires

marginally less time to compute the ACS decision equation than a higher value of

A-alpha.

K-pheromoneUpdate is statistically significant for all three responses but has

a very low ranking for both quality responses. An examination of the plot of time

for the K-pheromoneUpdate factor shows that the effect on time is not practically

significant. K-pheromoneUpdate can therefore be screened out.

In summary, 4 factors are screened out, A-Alpha, F-Rho, J-AntPlacement and

K-PheromoneUpdate.

8.3.2 Relative Importance of Factors

Of the two problem characteristics, the factor with the larger effect on solution

quality is the problem standard deviation M-problemStDev. The problem size L-

problemSize has a stronger effect on solution time than M-problemStDev. ACS

takes longer to reach stagnation on larger problem instances than smaller in-

stances.

Of the remaining unscreened tuning parameters, the heuristic exponent B-beta,

the amount of ants C-antsFraction, the length of candidate lists D-nnFraction and

the exploration/exploitation threshold E-q0 have the strongest effects on solution

quality. The same is true for solution time.

These results are important because they highlight the most important tuning

parameters and problem characteristics in terms of both heuristic performance

dimensions. These factors are the minimal set of design factors one should exper-

iment with when modelling and tuning ACS.

8.3.3 Adequacy of a Linear Model.

A test for curvature as per Section 6.3.9 on page 116 shows there is a significant

amount of curvature in all three responses. This means that the linear model for

screening is not adequate to explore the whole design space. A higher order model

149


of all three responses is required. This is an important result because it confirms

that a One-Factor-At-a-Time (Section 4.3.3 on page 87) approach is insufficient for

investigating the performance of ACS.

8.3.4 Comparison of Solution Quality Measures

It was already mentioned that two solution quality responses were measured to

investigate if either response lead to different conclusions. An examination of the

ANOVA summaries reveals that both Relative Error and ADA have almost the same

rankings and same statistical significance or insignificance for all factors except L-

problemSize. While L-problemSize has a significant effect on both responses, the

ADA response has a lower ranking of 18 compared to the Relative Error response

ranking of 12. This is due to the nature of how the two responses are calculated

(Section 3.10.2 on page 69).

This result shows that for screening of ACS, the choice of solution quality re-

sponse will not affect the conclusions. However, ADA may be preferable as it ex-

hibits a lower range and variability than Relative Error (Figure 8.1 on page 145).

The advantage of a lower variability was discussed in the context of statistical

power (Section A.6 on page 225).

8.4 Conclusions and discussion

The following conclusions are drawn from the ACS screening study. These conclu-

sions apply for a significance level of 5% and a power of 80% to detect the effect

sizes listed in Figure 8.1 on page 145. These effect sizes are a change in solution

time of 140s, a change in Relative Error of 1.8% and a change in ADA of 0.71.

Issues of power and effect size are discussed in Section A.6 on page 225.

• Tuning Ant placement not important. The type of ant placement has no

significant effect on ACS performance in terms of solution quality or solution

time. This was an open question in the literature. It is remarkable because

intuitively one would expect a random scatter of ants across the problem

graph to explore a wider variety of possible solutions. This result shows that

this is not the case.

• Tuning alpha not important. Alpha has no significant effect on ACS per-

formance in terms of solution quality or solution time. This confirms the

common recommendation in the literature of setting alpha equal to 1. Alpha

has also been analysed with an OFAT approach in Appendix D on page 235.

• Tuning Rho not important. Rho has no significant effect on ACS perfor-

mance in terms of solution quality or solution time. This is a new result for

ACS. It is a surprising result since Rho is a term in the ACS update phero-

mone equations and other analytical results in a much simplified scenario

suggested it was important [?].

150


• Tuning Pheromone Update Ant not important. The ant used for pheromone

updates is practically insignificant for all three responses. An examination

of the plot of time for the K-pheromoneUpdate factor shows that the effect

on time is not practically significant. K-pheromoneUpdate can therefore be

screened out.

• Most important tuning parameters. The most important ACS tuning pa-

rameters are the heuristic exponent B-beta, the amount of ants as a fraction

of problem size C-antsFraction, the length of candidate lists as a fraction of

problem size D-nnFraction and the exploration/exploitation threshold E-q0.

These are the factors one should focus on as design factors when experimen-

tal resources are limited.

• Problem standard deviation is important. This confirms the main result of

Chapter 7 in identifying a new TSP problem characteristic that has a signif-

icant effect on the difficulty of a problem for ACS. ACO research should be

reporting this characteristic in the literature.

• Higher order model needed. A higher order model, greater than linear, is

required to model ACS solution quality and ACS solution time. This is an

important result because it demonstrates for the first time that simple OFAT

approaches seen in the literature are insufficient for accurately modelling and

tuning ACS performance.

• Comparison of solution quality responses. The is no difference in the con-

clusions of the screening study using either the ADA or Relative Error solu-

tion quality responses. ADA has a slightly smaller variability and so results

in more powerful experiments than Relative Error.

8.5 Chapter summary

This chapter has presented a case study on screening the tuning parameters and

problem characteristics that affect the performance of ACS. This illustrated the

application of the methodology in Section 6.3 on page 110 with a fully instantiated

ACO heuristic, Ant Colony System. Many new results were presented and exist-

ing recommendations in the literature were confirmed in a rigorous fashion. In

the next chapter, these results will be used to efficiently build an accurate model

of ACS performance. The full model and the reduced model using the screening

results will be compared. This will confirm that the screening decisions recom-

mended in this study were correct.

151

9Case study: Tuning Ant Colony

System

This Chapter reports a case study on tuning the factors affecting the performance

of a heuristic. The methodology for this case study was described in Chapter

6. The particular heuristic studied is Ant Colony System (ACS) for the Travelling


This chapter reports many new results for ACS. All analyses are conducted for



It is shown that models of ACS performance must be at least quadratic. ACS is

tuned using a full parameter set and a screened parameter set resulting from the

case study of the previous chapter. This verifies that screening decisions from the

previous chapter are correct.

The results reported in this Chapter have been published in the literature [106].

9.1 Method






153

CHAPTER 9. CASE STUDY: TUNING ANT COLONY SYSTEM


Design factors

In the full parameter set RSM, there were 12 design factors as per the screening

study of Chapter 8. The factors and their high and low levels are repeated in the

following table for convenience.

Factor Name Type Low Level High Level


B beta Numeric 1 13





G rhoLocal Numeric 0.01 0.99

H solutionConstruction Categoric parallel sequential

J antPlacement Categoric random same

K pheromoneUpdate Categoric bestSoFar bestOfIteration

L problemSize Numeric 300 500

M problemStDev Numeric 10.00 70.00

Table 9.1: Design factors for the tuning study with ACS. The factor ranges are also given.

Tuning parameters for the Screened model, based on the results of the previous

Chapter, did not include A-alpha, F-rho, J-antPlacement and K-pheromoneUpdate.

These two screened factors took on randomly chosen values within their range for

each experiment run.



9.1.3 Instances






The experiment design was a Minimum Run Resolution V Face-Centred Composite

(Section A.3.4 on page 218) with six centre points.

The number of replicates was increased in a work-up procedure (Section A.6.1

on page 227) until a power of 80% was achieved for an significance level of 5% when

154


detecting a given effect size. The next two figures give the descriptive statistics

for the collected data and the actual effect size for each response in the full and

screened experiments with the FCC design.


50 Mean 65.33 11.01 5.60StDev 194.96 19.49 7.45Max 3131.77 125.84 41.75Min 0.17 0.55 0.66Actual Effect Size of 0.2 stdevs 38.99 3.90 1.49





Figure 9.1: Descriptive statistics for the full ACS FCC design. The actual detectable effect size of 0.2standard deviations is shown for each response and for each stagnation point. There is little practicaldifference in effect size of the solution quality responses for an increase in the stagnation point.

The full design could achieve sufficient power with 5 replicates while detecting

an effect of size 0.2 standard deviations.

The screened design could achieve sufficient power with 10 replicates while

detecting an effect of size 0.29 standard deviations. Unfortunately, experimental

resources did not permit using a larger number of replicates. This is further mo-

tivation for the use of DOE and fractional factorials. Without the vast savings of

fractional factorials (Section A.3.3 on page 218) this experiment would have been

completely infeasible.



scriptive statistics verifies that the stagnation level did not have a large effect on

the response values and therefore the conclusions after a 250 iteration stagnation

should be the same as after lower iteration stagnations. The two solution quality

155








Figure 9.2: Descriptive statistics for the screened ACS FCC design. The actual detectable effect sizeof 0.3 standard deviations is shown for each response and for each stagnation point. There is littlepractical difference in effect size of the solution quality responses for an increase in the stagnationpoint.

156


responses show a small but practically insignificant decrease in solution error as

the stagnation iterations is increased.

The Time response increases with increasing stagnation iterations because the

experiments take longer to run. For all cases, the level of stagnation iterations

has little practically significant effect on the three responses. ACS did not make

any large improvements when allowed to run for longer. It is therefore sufficient to

perform analyses at the 250 iterations stagnation level.

9.2 Analysis

9.2.1 Fitting

A fit analysis was conducted for each response in the full experiments and the

screened experiments. For both the full and screened cases, at least a quadraticmodel was required to model the responses. For the Minimum Run Resolution V

Face-Centred Composite, cubic models are aliased and so were not considered.

9.2.2 ANOVA


tion A.4.1 on page 220) applied to a full quadratic model with an alpha out thresh-

old of 0.10. Some terms removed by stepwise regression were added back into the

final model to preserve hierarchy.


sponses was required for each analysis. The transformation was a log10 for all three

responses.




138 data points (∼5% of total data) were removed when analysing the full model of

ADA-Time. 122 data points (∼5% of total data) were removed when analysing the

full model of RelativeError-Time. 47 data points (∼5% of total data) were removed

when analysing the screened model of ADA-Time. 15 data points (∼2% of total

data) were removed when analysing the screened model of RelativeError-Time.

9.2.3 Confirmation


tion 6.4.6 on page 121 in order to confirm the accuracy of the ANOVA models.

The randomly chosen treatments produced actual algorithm responses with the

descriptives listed.



The next two figures illustrate the 95% prediction intervals (Section A.3.5 on

page 219) and actual confirmation data for full and screened response surface

models of Relative Error and Time.

157


Iterations TimeRelative

Error ADA100 Mean 70.14 7.01 3.83

StDev 84.14 3.48 3.27Max 528.28 17.65 16.15Min 1.55 3.12 1.00

150 Mean 109.80 6.88 3.77StDev 130.37 3.41 3.24Max 774.39 17.65 16.15Min 2.17 2.77 1.00



Figure 9.3: Descriptive statistics for the confirmation of the ACS tuning. The response data is fromruns of the actual algorithm on the randomly generated confirmation treatments.

ACS_RelErr-Time.xls

Full Response Surface ModelScreened FCD RSMFull CCC RSMScreened CCC RSM

Full Response Surface Model

0.05.0

10.015.020.025.030.035.0

0 10 20 30 40 50Treatment

Rel

ativ

e E

rror 2

50

RelativeError 250

95% PI low

95% PI high


1

10

100

1000

10000

0 10 20 30 40 50Treatment

Tim

e 25

0

Time 250

95% PI low

95% PI high

Screened FCD RSM

0.05.0

10.015.020.025.030.035.0

0 10 20 30 40 50Treatment

Rel

ativ

e E

rror 2

50

RelativeError 25095% PI low

95% PI high

Screened FCD RSM

1

10

100

1000

10000

0 10 20 30 40 50Treatment

Tim

e 25

0

Time 250

95% PI low

95% PI high

Figure 9.4: 95% Prediction intervals for the full ACS response surface model of Relative Error-Time.The horizontal axis is the randomly generated treatment. The vertical axis is the Relative Error or Timeresponse.

158


ACS_RelErr-Time.xls

Full Response Surface ModelScreened FCD RSMFull CCC RSMScreened CCC RSM


0.05.0

10.015.020.025.030.035.0

0 10 20 30 40 50Treatment

Rel

ativ

e E

rror 2

50

RelativeError 250

95% PI low

95% PI high


1

10

100

1000

10000

0 10 20 30 40 50Treatment

Tim

e 25

0

Time 250

95% PI low

95% PI high

Screened FCD RSM

0.05.0

10.015.020.025.030.035.0

0 10 20 30 40 50Treatment

Rel

ativ

e E

rror 2

50

RelativeError 25095% PI low

95% PI high

Screened FCD RSM

1

10

100

1000

10000

0 10 20 30 40 50Treatment

Tim

e 25

0

Time 250

95% PI low

95% PI high

Figure 9.5: 95% Prediction intervals for the screened ACS response surface model of RelativeError-Time. Screening was conducted in the previous Chapter. The horizontal axis is the randomly generatedtreatment. The vertical axis is the Relative Error or Time response.

159


For both screened and full model, the predictions are very similar for both the

Relative Error and Time responses. This shows that the screening decisions from

the previous case study were correct. Looking at the predictions in general we see

that time was better predicted than relative error. The time models match all the

trends in the actual data. The relative error models however exhibit some false

peaks and miss some actual peaks.

Similar results were observed for the full and screened models of ADA-Time.

The ADA model failed to predict one more peak than the relative error model.

The RelativeError-Time and ADA-Time models are therefore deemed good pre-

dictors of the ACS responses.

9.3 Results

9.3.1 Screening and relative importance of factors

The following two figures give the ranked ANOVAs of the Relative Error and Time

models from the RelativeError-Time analysis. The terms have been rearranged in

order of decreasing sum of squares so that the largest contributor to the models

comes first.

Rank TermSum of squares F value p value Rank Term

Sum of squares F value p value

1 J-problemStDev 182.23 121362.13 < 0.0001 36 EM 0.21 137.62 < 0.00012 E-q0 64.87 43205.00 < 0.0001 37 F-rho 0.20 134.68 < 0.00013 D-nnFraction 22.01 14656.74 < 0.0001 38 BH 0.20 132.31 < 0.00014 DE 17.71 11794.89 < 0.0001 39 DF 0.20 131.48 < 0.00015 BD 8.86 5901.77 < 0.0001 40 GH 0.19 124.68 < 0.00016 EJ 4.61 3068.80 < 0.0001 41 F^2 0.18 122.13 < 0.00017 B-beta 4.46 2967.32 < 0.0001 42 J^2 0.15 100.38 < 0.00018 AD 4.36 2904.26 < 0.0001 43 AH 0.14 94.24 < 0.00019 BE 3.84 2554.84 < 0.0001 44 A-alpha 0.12 80.01 < 0.0001

10 H-problemSize 3.66 2439.84 < 0.0001 45 FJ 0.12 77.53 < 0.000111 AJ 3.38 2250.48 < 0.0001 46 EK 0.10 63.99 < 0.000112 AB 3.26 2174.28 < 0.0001 47 HJ 0.09 58.21 < 0.000113 CE 2.76 1836.57 < 0.0001 48 DK 0.06 42.16 < 0.000114 G-rhoLocal 2.57 1714.47 < 0.0001 49 JK 0.06 42.03 < 0.000115 D^2 2.45 1631.98 < 0.0001 50 DH 0.06 41.72 < 0.000116 DJ 2.31 1540.95 < 0.0001 51 AE 0.06 37.78 < 0.000117 AF 2.26 1504.79 < 0.0001 52 FM 0.05 33.60 < 0.000118 E^2 2.24 1492.39 < 0.0001 53 HK 0.05 32.45 < 0.000119 BJ 2.21 1468.81 < 0.0001 54 CG 0.04 28.05 < 0.000120 B^2 2.13 1417.47 < 0.0001 55 BM 0.04 25.03 < 0.000121 C-antsFraction 1.93 1286.26 < 0.0001 56 DM 0.04 25.02 < 0.000122 EG 1.88 1253.72 < 0.0001 57 GK 0.04 24.44 < 0.000123 CD 1.69 1125.00 < 0.0001 58 JM 0.03 22.86 < 0.000124 CJ 1.50 1001.96 < 0.0001 59 BC 0.03 22.76 < 0.000125 EF 1.30 862.88 < 0.0001 60 GJ 0.03 19.37 < 0.000126 EH 1.23 820.42 < 0.0001 61 CM 0.03 17.58 < 0.000127 DG 0.98 649.57 < 0.0001 62 BG 0.03 17.42 < 0.000128 K-solutionConstruction 0.84 562.08 < 0.0001 63 CH 0.02 16.61 < 0.000129 AG 0.42 279.16 < 0.0001 64 FG 0.02 9.99 0.001630 CF 0.38 254.96 < 0.0001 65 KM 0.01 7.98 0.004831 H^2 0.32 215.21 < 0.0001 66 GL 0.01 7.51 0.006232 M-pheromoneUpdate 0.28 188.89 < 0.0001 67 BK 0.01 7.49 0.006233 G^2 0.26 172.66 < 0.0001 68 GM 0.01 6.05 0.014034 A^2 0.24 162.14 < 0.0001 69 EL 0.01 5.58 0.018335 FH 0.21 141.53 < 0.0001 70 L-antPlacement 0.01 4.39 0.0363

71 FL 0.00 2.92 0.087872 BF 0.00 2.75 0.0972

Figure 9.6: RelativeError-Time ranked ANOVA of Relative Error response from full model. The tablelists the remaining terms in the model after stepwise regression in order of decreasing Sum of Squares.

Looking first at the Relative Error rankings, we see that the least important

160




1 C-antsFraction 1214.31 35326.80 < 0.0001 23 FJ 0.43 12.41 0.00042 C^2 248.18 7219.98 < 0.0001 24 AE 0.42 12.36 0.00043 H-problemSize 122.02 3549.80 < 0.0001 25 G-rhoLocal 0.35 10.30 0.00134 D-nnFraction 31.19 907.44 < 0.0001 26 AB 0.33 9.68 0.00195 E-q0 4.88 141.91 < 0.0001 27 B-beta 0.31 9.09 0.00266 DH 4.57 132.87 < 0.0001 28 AG 0.30 8.64 0.00337 EM 3.37 98.14 < 0.0001 29 HK 0.27 7.79 0.00538 HJ 2.75 80.00 < 0.0001 30 BE 0.25 7.34 0.00689 K-solutionConstruction 2.59 75.23 < 0.0001 31 BC 0.24 7.08 0.0078

10 M-pheromoneUpdate 2.44 71.03 < 0.0001 32 FM 0.24 6.88 0.008811 EF 1.95 56.76 < 0.0001 33 DK 0.22 6.29 0.012212 BD 1.44 41.91 < 0.0001 34 FH 0.19 5.39 0.020313 DE 1.43 41.68 < 0.0001 35 AH 0.18 5.31 0.021314 GM 1.31 38.21 < 0.0001 36 AJ 0.18 5.20 0.022715 DM 1.23 35.79 < 0.0001 37 GH 0.17 4.91 0.026816 CD 1.23 35.73 < 0.0001 38 BJ 0.17 4.86 0.027617 EG 1.14 33.10 < 0.0001 39 AC 0.15 4.29 0.038518 CG 1.04 30.19 < 0.0001 40 EJ 0.14 4.19 0.040819 BM 0.63 18.19 < 0.0001 41 J-problemStD 0.13 3.69 0.054720 CH 0.59 17.20 < 0.0001 42 BG 0.11 3.26 0.071221 CF 0.50 14.58 0.0001 43 AF 0.11 3.06 0.080422 JM 0.45 13.21 0.0003 44 HM 0.09 2.76 0.0966

45 A-alpha 0.01 0.28 0.596346 F-rho 0.01 0.22 0.6386

Figure 9.7: RelativeError-Time ranked ANOVA of time response from full model. The table lists theremaining terms in the model after stepwise regression in order of decreasing Sum of Squares.

main effects are L-antPlacement, A-alpha, F-rho and M-pheromoneUpdate. These

are exactly the terms that were deemed unimportant in the screening study of the

previous chapter.

By far the most important terms are the main effects of the candidate list length

tuning parameter and the exploration/exploitation tuning parameter as well as

their interaction. This is a very important result because it shows that candidate

list length, a parameter that we have often seen set at a fixed value or not used is

actually one of the most important parameters to set correctly.

Looking at the Time rankings, we see that L-antPlacement was completely re-

moved from the model. The least important main effects were then F-rho and

A-alpha. These results also confirm the screening decisions. However, the M-

pheromoneUpdate term has now risen in importance in its effect on time.

By far the most important tuning parameters are the amount of ants and the

lengths of their candidate lists. This is quite intuitive as the amount of processing

is directly related to these parameters. The result regarding the cost of the amount

of ants is particularly important because the amount of ants does not have a rel-

atively strong effect on solution quality. The extra time cost of using more ants

will not result in gains in solution quality. This is an important result because

it methodically confirms the often recommended parameter setting of letting the

number of ants equal to a small number (usually 10).

An examination of the ranked ANOVAs from the ADA-Time model gives the same

ranking of tuning parameter contributions to time and ADA. As with the previous

ACS screening study, the choice of solution quality response does not change the

conclusions of the relative importance of the tuning parameters.

161


9.3.2 Tuning

A desirability optimisation is performed as per Section 6.5.2 on page 122. Recall

that equal preference is given to the minimisation of relative error and solution

time. The results from the full and screened RelativeError-Time models are pre-

sented in the following two figures.ACS_RSM_All_R5_relErr-Time_05.dx7_desirability.xls

Size

StD

ev

alph

a

beta

ants

Frac

tion

nnFr

actio

n

q0 rho

rhoL

ocal

solu

tionC

onst

ruct

ion

antP

lace

men

t

pher

omon

eUpd

ate

Tim

e05

Rel

ativ

e Er

ror

Des

irabi

lity

300 10 8 2 1.00 1.00 0.99 0.69 0.96 parallel random bestSoFar 1.15 0.46 0.96300 40 13 5 1.00 1.00 0.98 0.95 0.28 sequential random bestSoFar 1.46 1.24 0.86300 70 1 11 1.00 20.00 0.98 0.05 0.70 parallel random bestSoFar 1.77 2.18 0.80400 10 8 4 1.00 1.00 0.99 0.11 0.81 parallel random bestSoFar 2.42 0.46 0.92400 40 13 6 2.19 1.16 0.97 0.99 0.03 parallel random bestOfItera 2.83 1.33 0.82400 70 1 11 1.61 20.00 0.98 0.01 0.07 parallel random bestOfItera 4.92 2.59 0.73500 10 7 3 1.13 1.00 0.99 0.86 0.01 parallel same bestOfItera 4.88 0.38 0.88500 40 13 7 1.00 1.00 0.99 0.99 0.48 parallel random bestSoFar 4.25 1.35 0.80500 70 1 10 1.04 19.78 0.99 0.05 0.01 parallel same bestOfItera 9.24 2.54 0.70

1 of 1

Figure 9.8: Full RelativeError-Time model results of desirability optimisation. The table lists therecommended parameter values for combinations of problem size and problem standard deviation. Theexpected time and relative error are listed with the desirability value. ACS_Screened02_R10_relErr-Time_02.dx7_desirability.xls

Size

StDev

beta

antsFraction

nnFraction

q0 rhoLocal

solutionConstruction

Time05

RelError05

Desirability

300 10 1 1.00 1.00 0.99 0.99 parallel 0.93 0.51 0.98300 40 1 1.00 1.00 0.99 0.99 sequential 1.13 2.32 0.82300 70 12 1.00 20.00 0.99 0.01 parallel 2.70 4.45 0.71400 10 1 1.00 1.00 0.99 0.99 parallel 1.91 0.59 0.93400 40 1 1.00 1.00 0.99 0.99 sequential 2.21 2.71 0.77400 70 13 1.00 20.00 0.99 0.01 parallel 5.19 3.68 0.69500 10 1 1.00 1.00 0.99 0.99 parallel 3.92 0.69 0.87500 40 5 1.03 1.13 0.99 0.04 sequential 3.71 3.16 0.73500 70 13 1.00 20.00 0.99 0.01 parallel 9.98 3.01 0.68

1 of 1

Figure 9.9: Screened RelativeError-Time model results of desirability optimisation. The table lists therecommended parameter values for combinations of problem size and problem standard deviation. Theexpected time and relative error are listed with the desirability value.

The rankings of the ANOVA terms has already highlighted the factors that have

little effect on the responses. These screened factors can take on any values in the

desirability optimisations. This is confirmed by examining the desirability recom-

mendations from the full and screened models. The most important factors, com-

prising the screening model, have recommended settings that strongly agree with

162


the recommended settings from the full model. For example, beta is always low,

except when the problem standard deviation is high. The exploration/exploitation

threshold q0 is always at a maximum of 0.99, implying that exploitation is always

preferred to exploration. AntsFraction is always low. The remaining unimportant

factors take on a variety of values in the full model desirability optimisation.

The predicted values of time from both models agree with one another to within

a second. The predictions of relative error are higher from the screened model. The

quality of the desirability optimisation recommendations can now be evaluated.

9.3.3 Evaluation of tuned settings

The tuned parameter recommendations from the desirability optimisation are eval-

uated as per the methodology of Section 6.6 on page 124. Some illustrative plots

are given in the following two figures. On each plot, the horizontal axis lists the

randomly generated treatments and the vertical axis lists the response value. Each

plot contains the data for the response recorded using the settings from the de-

sirability optimisation of the full and screened experiments. The responses pro-

duced by using parameter settings recommended in the literature (Section 2.4.9

on page 45) and some randomly chosen parameter settings are also listed.

Relative Error vs Time model after 250 iteration stagnation

0

5

10

15

20

25

30

0 1 2 3 4 5 6 7 8 9 10

Treatment

Rel

ativ

e E

rror 2

50

FullScreenedBookRandom

Figure 9.10: Evaluation of Relative Error response in the RelativeError-Time model of ACS on problemsof size 400 and standard deviation 70. The horizontal axis is the randomly generated treatment. Thereare plots of the results from four parameter settings, the settings from a desirability optimisation of thefull relativeError-Time model, the settings from a desirability optimisation of the screened relativeError-Time model, the settings recommended in the literature and randomly generated settings.

For both Relative Error and Time, the parameter settings from the full and

screened models perform about the same as the parameter settings from the litera-

ture. Interestingly, on a small number of occasions, randomly chosen settings per-

form better than all other settings. Similar results were found with all eight other

combinations of problem characteristics for the full and screened RelativeError-

Time models. This result confirms the recommendation of generally good ACS set-

tings in the literature summarised in Section 2.4.9 on page 45. In particular, both

the literature and the desirability optimisation agree on the recommended settings

163



1

10

100

1000

10000

0 1 2 3 4 5 6 7 8 9 10

Treatment

Tim

e 25

0


Figure 9.11: Evaluation of Time response in the RelativeError-Time model of ACS on problems of size400 and standard deviation 70. The horizontal axis is the randomly generated treatment. There areplots of the results from four parameter settings, the settings from a desirability optimisation of the fullrelativeError-Time model, the settings from a desirability optimisation of the screened relativeError-Time model, the settings recommended in the literature and randomly generated settings.

for the most important factors according to the screening study. Both recommend

low values of Beta and AntsFraction and high values of the exploration/exploitation

threshold q0.

Results from the ADA-Time desirability optimisation were different as the rec-

ommended parameter settings from the literature were chosen with a relative error

response in mind rather than an ADA response. The following two figures illustrate

representative results for ADA and Time. Again, on a few occasions, the randomly

chosen settings perform better than all alternatives. There is little difference in

solution times between the desirability settings and the literature settings. How-

ever, there is a very large difference when one considers ADA. This shows that one

should not use the literature recommended parameter settings if one is measuring

an ADA solution quality response.


The following conclusions are drawn from the ACS tuning study. The first of these

relate to screening and ranking and serve to confirm the conclusions from the

screening study of the previous chapter (Section 8.4 on page 150). These screening

and tuning conclusions apply for a significance level of 5% and a power of 80% to

detect the effect sizes listed in Figure 9.1 on page 155. These effect sizes are a

change in solution time of 224s, a change in Relative Error of 3.76% and a change

in ADA of 1.43. Issues of power and effect size are discuss in Section A.6 on

page 225.

• Tuning Ant placement not important. The type of ant placement has no

significant effect on ACS performance in terms of solution quality or solution

time.

164


ADA vs Time model after 250 iteration stagnation

0

5

10

15

20

25

30

35

40

0 1 2 3 4 5 6 7 8 9 10

Treatment

AD

A 2

50FullScreenedBookRandom

Figure 9.12: Evaluation of ADA response in the ADA-Time model of ACS on problems are of size 500and standard deviation 10. The horizontal axis is the randomly generated treatment. There are plots ofthe results from four parameter settings, the settings from a desirability optimisation of the full ADA-Time model, the settings from a desirability optimisation of the screened ADA-Time model, the settingsrecommended in the literature and randomly generated settings.


1

10

100

1000

10000

0 1 2 3 4 5 6 7 8 9 10

Treatment

Tim

e 25

0


Figure 9.13: Evaluation of Time response in the ADA-Time model of ACS on problems are of size 500and standard deviation 10. The horizontal axis is the randomly generated treatment. There are plots ofthe results from four parameter settings, the settings from a desirability optimisation of the full ADA-Time model, the settings from a desirability optimisation of the screened ADA-Time model, the settingsrecommended in the literature and randomly generated settings.

165


• Tuning Alpha not important. Alpha has no significant effect on ACS per-

formance in terms of solution quality or solution time. This confirms the

common recommendation in the literature of setting alpha equal to 1. Alpha

has also been analysed with an OFAT approach in Appendix D on page 235.

• Tuning Rho not important. Rho has no significant effect on ACS perfor-

mance in terms of solution quality or solution time. This is a new result for

ACS.

• Tuning Pheromone Update Ant not important. The ant used for pheromone

updates is ranked highly for solution time. However, omitting this factor from

the screened model did not affect the performance of the screened model.

• Most important tuning parameters. The most important ACS tuning pa-

rameters are the heuristic exponent B-beta, the amount of ants C-antsFraction,

the length of candidate lists D-nnFraction, the exploration/exploitation thresh-

old E-q0 and rhoLocal.

The tuning study also provides further results that the screening study of the

previous case study could not.

• Minimum order model. A model that is of at least quadratic order is required

to model ACS solution quality and ACS solution time. This is a new result for

ACS and shows that an OFAT approach is not an appropriate way to tune the

performance of ACS.

• Relationship between tuning, problems and performance. Both the mod-

els of RelativeError-Time and ADA-Time were good predictors of ACS perfor-

mance across the entire design space. The prediction intervals for full and

screened models were very similar, confirming that the decisions from the

screening study in the previous chapter were correct.

• Tuned parameter settings. There was much similarity between the rec-

ommended tuned parameter settings from the full and screened models and

both settings resulted in similar ACS performance. Recommended settings

from desirability optimisation resulted in similar solution quality and solu-

tion time as settings in the literature. There are immense performance gains

to be achieved as evidenced by the relatively poor performance of many ran-

domly chosen parameter settings. The reader may have intuitively expected

randomly chosen values to perform poorly but we emphasise that their evalu-

ation is nonetheless an important control for the test of the DOE methodology.

• Comparison of solution quality responses. The is no difference in screening

and ranking conclusions from using the ADA or Relative Error solution quality

responses for ACS.

166


9.5 Chapter summary

This chapter presented a case study applying the methodology of Chapter 6 to the

tuning of the Ant Colony System (ACS) heuristic. Many new results were presented

and existing recommendations in the literature were confirmed in a rigorous fash-

ion. The conclusions of the screening study in the previous chapter were also

confirmed.

167

10Case study: Screening Max-Min

Ant System

This chapter presents a case study on the screening of the Max-Min Ant System

(MMAS) heuristic. Several new results for MMAS are presented. These results

have not yet been published in the literature. The chapter follows the sequential

experimentation procedure of Chapter 6, beginning with a screening study. The

tuning study will follow in the next Chapter.

Established tuning parameters previously thought to affect performance are

actually shown to have no effect at all. New tuning parameters that were thought

to affect performance are investigated. A new TSP problem characteristic is shown

to have a very strong effect on performance, confirming the results of Chapter 7.

All analyses are conducted for two performance measures, quality of solution and

solution time. This provides an accurate measure of the heuristic compromise

that is rarely seen in the literature. Finally, it is shown that models of MMAS

performance must be of a higher order than linear.

10.1 Method

10.1.1 Response variables





169

CHAPTER 10. CASE STUDY: SCREENING MAX-MIN ANT SYSTEM

10.1.2 Factors, Levels and Ranges

Design Factors

There were 12 design factors, 10 representing the MMAS tuning parameters and 2

representing the TSP problem characteristics being investigated. The design fac-

tors and their high and low levels are summarised in Table 10.1. A description of

the MMAS tuning parameters was given in Section 2.4.5 on page 39. The parame-

ter M:antPlacement could be considered as a parameterised design feature as men-

tioned in Section 6.3.1 on page 111. As with the ACS case studies, we acknowledge

that an experimenter’s prior experience with MMAS may suggest narrower ranges

for these factors. When this experience was not available in this thesis, we chose

ranges around the values hard-coded in the original source code. For example,

restart freq was hard-coded to 25 and so a range of 2 to 40 was experimented with

here. It is a simple matter to rerun this case study with different ranges if desired.



B beta Numeric 1 13





G reinitBranchFac Numeric 0.50 2.00

H reinitIters Numeric 2 80

J problemStDev Numeric 10 70

K problemSize Numeric 300 500

L restartFreq Numeric 2 40

M antPlacement Categoric random same

Table 10.1: Design factors for the screening study with MMAS. There are two problem characteristicfactors (J-problemStDev and K-problemSize). The remaining 10 factors are MMAS tuning parameters.



10.1.3 Instances





170



The experiment design was a Resolution IV (12-5) fractional factorial (Section A.3.2

on page 215) with 6 centre points. The number of replicates was 8, determined

using the work-up procedure of Section 6.7.4 on page 127 for a power of about

80%, a significance level of 5% and an effect size of 0.18 standard deviations.

This yielded a total of 1030 runs. The following figure summarises the descriptive

statistics of the three response variables across all treatments at the 5 stagnation

measuring points and the actual effect size that is equivalent to 0.18 standard

deviations.


50 Mean 102.29 8.15 6.08StDev 378.70 13.34 10.36Max 4549.08 118.75 43.46Min 0.18 0.41 0.34Actual Effect Size of 0.18 sDev 68.17 2.40 1.86





Figure 10.1: Descriptive statistics for the MMAS screening experiment. Statistics are given at fivestagnation points ranging from 50 to 250 iterations. The actual effect size equivalent to 0.18 standarddeviations is also listed for each response variable.





should be the same as after lower iteration stagnations. The two solution qual-

ity responses show a small but practically insignificant decrease in solution error

171


as the stagnation iterations is increased. The Time response increases with in-

creasing stagnation iterations because the experiments take longer to run. For all

cases, the level of stagnation iterations has little practically significant effect on

the three responses. MMAS did not make any large improvements when allowed

to run for longer. It is therefore sufficient to perform analyses at the 250 iterations

stagnation level.

10.2 Analysis

10.2.1 ANOVA


tion A.4.1 on page 220) applied to a full 2 factor interaction model with an alpha

out threshold of 0.10. Some terms removed by backward selection were added

back into the final model to preserve hierarchy.

To make the data amenable to statistical analysis, a transformation of the

responses was required for each analysis. The transformation was a log10 (Sec-

tion A.4.3 on page 222) for all three responses.




24 data points (2% of total data) were removed when analysing Relative Error. 24

data points (2% of total data) were removed when analysing ADA. 10 data points

(1% of total data) were removed when analysing Time.

10.2.2 Confirmation


tion 6.3.7 on page 114 in order to confirm the accuracy of the ANOVA models.

The randomly chosen treatments produced actual algorithm responses with the

descriptives listed in the following figure.

Max Min Mean StDevRelative Error 250 107.25 0.27 7.21 15.35ADA 250 37.62 0.79 5.25 8.00Time 250 4954 1 239 599

Figure 10.2: Descriptive statistics for the confirmation of the MMAS screening ANOVA. The responsedata is from runs of the actual algorithm on the randomly generated confirmation treatments.



The following three figures illustrate the 95% prediction intervals and actual

data for the three response models, Relative Error, ADA and Time respectively.

The two solution quality responses are well predicted by their models. The mod-

els match the trends of the actual data, successfully picking up the extremely low

172


0.00

5.00

10.00

15.00

20.00

25.00

30.00

0 10 20 30 40 50Treatment

Rel

ativ

e E

rror 2

50

Relative Error 25095% PI low95% PI high

Figure 10.3: 95% Prediction intervals for the MMAS screening of Relative Error. The horizontal axisshows the randomly generated treatment number.

05

10152025303540

0 10 20 30 40 50Treatment

AD

A 2

50

ADA 25095% PI low95% PI high

Figure 10.4: 95% Prediction intervals for the MMAS screening of ADA. The horizontal axis shows therandomly generated treatment number.

173


and extremely high response values which vary over a range of 107% for relative

error and 37 for ADA.

1.0

10.0

100.0

1000.0

10000.0

0 10 20 30 40 50Treatment

Tim

e 25

0

Time 25095% PI low95% PI high

Figure 10.5: 95% Prediction intervals for the MMAS screening of Time. The horizontal axis shows therandomly generated treatment number.

The time response is also well predicted by its model. This was achieved over

a range of 5000s. The three ANOVA models are therefore satisfactory predictors of

the three MMAS performance responses for factor values within 10% of the factor

range limits listed in Section 10.1.2 on page 170.

10.3 Results

The next figure gives a summary of the Sum of Squares ranking and the ANOVA

F and p values for the three responses. Only the main effects are listed. Those

main effects that rank in the top 12 are highlighted in bold. Rankings are based

on the full two factor interaction model of 78 terms, before backward selection was

applied. ScreeningResults_MMAS-TODO.xls

Rank F value p value Rank F value p value Rank F value p valueA-alpha 23 59.80 < 0.0001 24 59.80 < 0.0001 6 342.47 < 0.0001B-beta 3 1270.75 < 0.0001 3 1270.75 < 0.0001 40 5.79 0.0163C-antsFraction 8 760.90 < 0.0001 8 760.90 < 0.0001 1 28012.34 < 0.0001D-nnFraction 5 1006.33 < 0.0001 5 1006.33 < 0.0001 3 2941.57 < 0.0001E-q0 2 2378.04 < 0.0001 2 2378.04 < 0.0001 5 352.99 < 0.0001F-rho 13 547.03 < 0.0001 13 547.03 < 0.0001 8 172.10 < 0.0001G-reinitBranchFac 22 65.12 < 0.0001 23 65.12 < 0.0001 10 114.19 < 0.0001H-reinitIters 16 111.65 < 0.0001 17 111.65 < 0.0001 63 0.97 0.3255J-problemStDev 1 21133.33 < 0.0001 1 6690.14 < 0.0001 72 0.17 0.6833K-problemSize 44 7.56 0.0061 16 120.56 < 0.0001 2 3489.94 < 0.0001L-restartFreq 63 0.44 0.5078 63 0.44 0.5078 35 7.88 0.0051M-antPlacement 53 2.97 0.0853 54 2.97 0.0853 50 2.61 0.1066

Relative Error ADA Time

1 of 1

Figure 10.6: Summary of ANOVAs for Relative Error, ADA and Time for MMAS. Only the main effectsare shown. Effects with a top 12 ranking are in bold. Rankings are based on the full two factorinteraction model of 78 terms, before stepwise regression was applied.

The screening study of MMAS and the associated ANOVAs yield several impor-

174


tant results and answers to open questions from the ACO literature.

10.3.1 Screened factors

Factor L-RestartFreq is statistically insignificant at the 0.05 level for the two quality

responses but significant for the time response. The factor has low rankings across

all three responses.

M-AntPlacement, is statistically insignificant at the 0.05 level for all three re-

sponses. It has a low ranking for all three responses, leading us to expect this

factor to have little effect on performance.

In summary, 2 factors are screened out, M-AntPlacement and L-RestartFreq.

10.3.2 Relative Importance of Factors

Of the two problem characteristics, the factor with the larger effect on solution

quality is the problem standard deviation J-problemStDev. The problem size K-

problemSize has a stronger effect on solution time than J-problemStDev. This is

because MMAS takes longer to reach stagnation on larger problem instances than

smaller instances.

Of the remaining unscreened tuning parameters, the heuristic exponent B-beta,

the amount of ants C-antsFraction, the length of candidate lists D-nnFraction, the

exploration/exploitation threshold E-q0 and F-Rho have the strongest effects on

solution quality. The same is true for solution time except for B-beta which has a

low ranking for solution time.

G-ReinitBranchFac is statistically significant for all three responses but is only

ranked in the top third for the quality responses. It has a high ranking for Time.

This highlights that G-ReinitBranchFac should be considered as a tuning param-

eter rather than being hard-coded as is typically the case.

Although statistically significant for all responses, A-Alpha only has a high

ranking for solution Time. Alpha could possibly be considered for screening.

These results are important because they highlight the most important tuning

parameters and problem characteristics in terms of both performance dimensions

of the heuristic compromise. These factors are the minimal set of design factors

one should experiment with when modelling and tuning MMAS performance.

10.3.3 Adequacy of a Linear Model.

A test for curvature as per Section 6.3.9 on page 116 shows there is a statistically

significant amount of curvature in all three responses. This means that the linear

model from the screening study is not adequate to explore the whole design space.

A higher order model of all three responses is required. This is an important result

because it confirms that a One-Factor-At-a-Time (OFAT) approach is insufficient

for investigating the performance of MMAS.

175


10.3.4 Comparison of Solution Quality Measures

It was already mentioned that two solution quality responses were measured to

investigate if the choice of one response over the other lead to different conclu-

sions. An examination of the ANOVA summaries reveals that both Relative Error

and ADA have the same rankings (±1 places) and same statistical significance or

insignificance for all factors except K-problemSize. K-problemSize has a statisti-

cally significant effect on Relative Error only. This is due to the nature of how the

two quality responses are calculated (Section 3.10.2 on page 69).

This result shows that for screening of MMAS, the choice of solution quality re-

sponse will not affect the conclusions. However, ADA may be a preferable response

as it exhibits a lower range and variability than Relative Error. The advantage of a

lower variability was discussed in the context of statistical power (Section A.6 on

page 225).


The following conclusions are drawn from the MMAS screening study. The first

results concern tuning and design parameters that have no impact on MMAS per-

formance. These screening and tuning conclusions apply for a significance level of

5% and a power of 80% to detect the effect sizes listed in Figure 10.1 on page 171.

These effect sizes are a change in solution time of 187s, a change in Relative Error

of 2.2% and a change in ADA of 1.5. Issues of power and effect size are discuss in

Section A.6 on page 225.

1. Tuning Restart frequency tuning parameter not important. The number

of iterations used in the restart frequency has no significant effect on MMAS

performance in terms of solution quality or solution time. This is a highly

unexpected result as the restart frequency is a fundamental feature of MMAS

(Section 2.4.5 on page 39)

2. Tuning Ant placement design parameter not important. The type of ant

placement has no significant effect on MMAS performance in terms of either

solution quality or solution time. This was an open question in the litera-

ture. The result is remarkable because intuitively one would expect a random

scatter of ants across the problem graph to explore a wider variety of possi-

ble solutions. This result shows that this is not the case. MMAS design can

be fixed with either a random scatter method or single node method of ant

placement.

Other results were as follows.

3. Alpha only important for solution time. The choice of Alpha only signifi-

cantly effects solution time. Although statistically significant for the quality

responses, it has a low ranking.

176


4. Problem standard deviation is important. This confirms the main result of

Chapter 7 in identifying a new TSP problem characteristic that has a signifi-

cant effect on the difficulty of a problem for MMAS. ACO research should be

reporting this characteristic in the literature.

5. Most important parameters. The rankings show that the most important

tuning parameters affecting solution quality or solution time or both are beta,

antsFraction, length of candidate list, exploration/exploitation threshold and

the pheromone update term rho.

6. Beta not important for solution time. The choice of Beta only affects so-

lution quality and not solution time. It has always been known that Beta

strongly affects solution quality.

7. New tuning parameter. Reinitialisation Branching Factor tuning parameter

has a strong effect on time and a moderate but statistically significant effect

on quality..

8. Higher order model of MMAS behaviour needed. A higher order model,

greater than linear, is required to model MMAS solution quality and MMAS

solution time. This is an important result because it demonstrates for the

first time that simple OFAT approaches seen in the literature are insufficient

for accurately modelling and tuning MMAS performance.

9. Comparison of solution quality responses. The is no difference in conclu-

sions from the ADA and Relative Error solution quality responses for MMAS.

The ADA response is therefore preferable for screening because it exhibits a

lower variability than Relative Error and therefore results in more powerful

experiments.

The result regarding alpha confirms the literature’s general recommendation

that Alpha be set to 1 [47, p. 71]. It also contradicts Pellegrini et al’s [96] analysis

of the most important MMAS parameters, where alpha was one of these parame-

ters and the length of candidate list and exploration/exploitation threshold were

omitted. This result is all the more remarkable given that Pellegrini et al’s research

used the ACOTSP code and this study used the backwards compatible JACOTSP

code. It highlights the danger of using only intuitive reasoning to rank the impor-

tance of tuning parameters rather than a rigorous DOE approach with the fully

instantiated algorithm.

The ranking of the tuning parameters contradicts the claim of Doerr and Neu-

mann [41, p. 38] about rho being the most important parameter affecting ACO

solution run time. This thesis’ screening study shows that, excluding problem

characteristics, rho is in fact the 7th most important factor to affect solution time

to stagnation for MMAS.

The Reinitialisation Branching Factor is usually held constant (Section 2.4.5

on page 39) in reported research but this study has shown that it is an important

tuning parameter to be considered.

177



This chapter presented a case study applying the methodology of Chapter 6 to the

screening of the Max-Min Ant System (MMAS) heuristic. Many new results were

presented. Existing recommendations in the literature were confirmed and other

claims were refuted in a rigorous fashion. The next chapter will use the results of

this screening to tune the performance of MMAS.

178

11Case study: Tuning Max-Min Ant

System

This Chapter reports a case study on tuning the factors affecting the performance

of a heuristic. The methodology for this case study was described in Chapter

6. The particular heuristic studied is Max-Min System (MMAS) for the Travelling


This chapter reports many new results for MMAS. All analyses are conducted for



It is shown that models of MMAS performance must be at least quadratic. MMAS

is tuned using a full parameter set and a screened parameter set resulting from

the case study of the previous chapter. This verifies that screening decisions from

the previous chapter are correct.

The results reported in this Chapter have been published in the literature [109].

11.1 Method






179

CHAPTER 11. CASE STUDY: TUNING MAX-MIN ANT SYSTEM


Design Factors

In the full RSM, there were 12 design factors as per the screening study of the

previous chapter. The factors and their high and low levels are repeated in the

following table for convenience.



B beta Numeric 1 13





G reinitBranchFac Numeric 0.50 2.00

H reinitIters Numeric 2 80

J problemStDev Numeric 10 70

K problemSize Numeric 300 500

L restartFreq Numeric 2 40

M antPlacement Categoric random same

Table 11.1: Design factors for the tuning study with MMAS. The factor ranges are also given.

Tuning parameters for the Screened model, based on the results of Section 10.3

on page 174, did not include L-RestartFreq and M-antPlacement. These two fac-

tors took on randomly chosen values within their range for each experiment run.


The held constant factors are as per Section 6.7.6 on page 127. There were ad-

ditional held-constant factors for MMAS. The p term in the trail minimum update

(Section 2.4.5 on page 39) was fixed at 0.05, as hard-coded in the original ACOTSP.

The lambda value used in the Trail Reinitialisation of the daemon actions calcula-

tion (Section 2.4.5 on page 39) is fixed at 0.05.

11.1.3 Instances





180



The experiment design for both models was a Minimum Run Resolution V Face-

Centred Composite (Section A.3.4 on page 218) with six centre points.

The work-up procedure was slightly different to that specified in Section 6.7.4

on page 127 due to a limitation on experimental resources. In this study, a target

power of 80% and significance level of 5% were fixed by convention. A number

of replicates of 8 was fixed according to the number of feasible experiment runs

that could be conducted with the given resources. The collected data was then

examined to determine the smallest effect size that could be detected given these

constraints. Fortunately, the variability of the data was low enough to permit a

reasonable effect size to be detected with this power, significance level and num-

ber of replicates. The descriptive statistics for the full experiment and screened

experiment data are given in the following two figures.







Figure 11.1: Descriptive statistics for the full MMAS experiment design. The actual detectable effectsize of 0.25 standard deviations is shown for each response and for each stagnation point. There islittle practical difference in effect size of the solution quality responses for an increase in the stagnationpoint.

The full design could achieve sufficient power while detecting an effect of size

0.25 standard deviations.

The screened design could only achieve sufficient power while detecting an ef-

fect of size 0.41 standard deviations.

181








Figure 11.2: Descriptive statistics for the screened MMAS experiment design. The actual detectableeffect size of 0.41 standard deviations is shown for each response and at each stagnation point. Thereis little practical difference in the detectable effect size of the solution quality responses for differentstagnation iterations levels.

182






should be the same as after lower iteration stagnations. The two solution quality

responses show a small but practically insignificant decrease as the stagnation

iterations is increased. The Time response increases with increasing stagnation

iterations because the experiments take longer to run. For all cases, the level of

stagnation iterations has little practically significant effect on the three responses.

MMAS did not make any large improvements when allowed to run for longer. It is

therefore sufficient to perform analyses of the full and screened models at the 250

iterations stagnation level.

11.2 Analysis

11.2.1 Fitting

A fit analysis was conducted for each response in the full experiment and the

screened experiment. In both cases, at least a quadratic model was required to

model the responses. For the Minimum Run Resolution V Face-Centred Compos-

ite, cubic models are aliased and so were not considered.

11.2.2 ANOVA

Effects for each response model were selected using backward selection applied to

a full quadratic model with an alpha threshold of 0.10. Some terms removed by

backward selection were added back into the final model to preserve hierarchy. To

make the data amenable to statistical analysis, a transformation of the responses

was required for each analysis (Section A.4.3 on page 222). The transformations

were a log10 in all but one case., In the case of ADA in the ADA-Time model, a

square root transformation with d=0.5 was used.




The next table summarises the outliers deleted from each analysis.

Experiment Model Number % of runs

Full ADA-Time 56 4

Screened ADA-Time 50 8

Full RelativeError-Time 41 3

Screened RelativeError-Time 53 9

Table 11.2: Amount of outliers removed from MMAS tuning analyses. The percentage of the total runsdeleted is greater in the screened experiments than the full experiments.

183


For both models, the screened experiments required more outliers to be deleted

than the full experiments. The amount of outliers (8% and 9%) for the two screened

experiments may be cause for concern. This concern is addressed with some con-

firmation experiments.

11.2.3 Confirmation


tion 6.3.7 on page 114. The randomly chosen treatments produced actual algo-

rithm responses with the descriptives listed in the following figure.

Iterations TimeRelative

Error ADA100 Mean 65.30 5.89 3.62

StDev 81.26 4.35 3.92Max 380.82 25.68 25.31Min 3.27 0.69 0.43




Figure 11.3: Descriptive statistics for the MMAS confirmation experiments. This is the actual dataproduced from the randomly generated confirmation treatments. All three responses vary over wideranges, highlighting the cost of incorrectly chosen tuning parameter settings.

The same confirmation runs were used for the screened and full experiments.

This allows a direct comparison between the predictive capabilities of the models

from the screened and full experiments.

The following two figures compare the predictive capabilities of the full and

screened RelativeError-Time models for the Relative Error response over fifty ran-

domly generated treatments. Both models predict the relative error response well

for almost all of the 50 treatments extending over a range of 25%. There are about

three poorly performing predictions, two where the models overpredicted and one

where the models underpredicted the actual MMAS performance. Both the full and

screened models have the same shape, confirming the accuracy of the screening

study in the previous chapter.

184



0.00

5.00

10.00

15.00

20.00

25.00

30.00

0 5 10 15 20 25 30 35 40 45 50

Treatment

Rel

ativ

e E

rror 2

50Relative Error 25095% PI low95% PI high

Figure 11.4: 95% prediction intervals of Relative Error by the full RelativeError-Time model of MMAS.The horizontal axis is the randomly generated treatment number.

Screened Response Surface Model

0.00

5.00

10.00

15.00

20.00

25.00

30.00

0 5 10 15 20 25 30 35 40 45 50

Treatment

Rel

ativ

e E

rror 2

50

Relative Error 250

95% PI low

95% PI high

Figure 11.5: 95% prediction intervals of Relative Error by the screened RelativeError-Time model ofMMAS. The horizontal axis is the randomly generated treatment number.

185


The following two figures compare the predictive capabilities of the full and

screened RelativeError-Time models for the Time response. Both the full and

screened models of RelativeError-Time are excellent predictors of the Time re-

sponse. All extreme points are well predicted. Both models have the same shape,

confirming the accuracy of the screening study from the previous chapter.


1

10

100

1000

0 5 10 15 20 25 30 35 40 45 50Treatment

Tim

e 25

0


Figure 11.6: Predictions of Time by the full RelativeError-Time model of MMAS. The horizontal axis isthe randomly generated treatment number.


1

10

100

1000

0 5 10 15 20 25 30 35 40 45 50Treatment

Tim

e 25

0


Figure 11.7: Predictions of Time by the screened RelativeError-Time model of MMAS. The horizontalaxis is the randomly generated treatment number.

The next two figures compare the predictive capabilities of the full and screened

ADA-Time models for the ADA response. Both models are good predictors of ADA

across the majority of confirmation treatments. The full model fails to predict

the exact values of two peaks but does nonetheless identify them as peaks. The

screened model has a slightly different shape to the full model. This may be due

either to decisions in the modelling process or and incorrect screening decision.

The next two figures compare the predictive capabilities of the full and screened

ADA-Time models for the Time response. Both models are excellent predictors of

the Time response over a range of thousands of seconds and there is little differ-

ence in the models’ predictions.

186



0.00

5.00

10.00

15.00

20.00

25.00

0 5 10 15 20 25 30 35 40 45 50Treatment

AD

A 2

50


Figure 11.8: 95% prediction intervals of ADA by the full ADA-Time model of MMAS. The horizontalaxis is the randomly generated treatment number.


0.00

5.00

10.00

15.00

20.00

25.00

0 5 10 15 20 25 30 35 40 45 50

Treatment

AD

A 2

50


Figure 11.9: 95% prediction intervals of ADA by the screened ADA-Time model of MMAS. The horizon-tal axis is the randomly generated treatment number.


1

10

100

1000

10000

0 5 10 15 20 25 30 35 40 45 50

Treatment

Tim

e 25

0


Figure 11.10: 95% prediction intervals of Time by the full ADA-Time model of MMAS. The horizontalaxis is the randomly generated treatment number.

187



1

10

100

1000

10000

0 5 10 15 20 25 30 35 40 45 50Treatment

Tim

e 25

0Time 25095% PI low95% PI high

Figure 11.11: 95% prediction intervals of Time by the screened ADA-Time model of MMAS. The hori-zontal axis is the randomly generated treatment number.

In general, we conclude that both models are good predictors of the responses

and that the screening decisions from the screening study were correct. The con-

cerns raised in the previous section regarding effect size and outlier deletion have

been mitigated. The models can therefore be used to make recommendations on

good tuning parameter settings for a given instance.

11.3 Results

11.3.1 Screening and relative importance of factors

The following two figures give the ranked ANOVAs of the Relative Error and Time

models from the RelativeError-Time analysis. The terms have been rearranged in

order of decreasing sum of squares so that the largest contributor to the models

comes first.

Looking first at the Relative Error rankings, we see that the least important

main effects are A-alpha, M-antPlacement, and H-reinitIters. The first two of these

are the terms that were deemed unimportant in the screening study of the previous

chapter. However, in the screening study it was L-restartFreq that was deemed

unimportant rather than H- reinitIters. This may adversely affect the desirability

optimisation of the next section.

By far the most important terms are the main effects of the exploration/exploit-

ation tuning parameter, the candidate list length tuning parameter and exponent

B-Beta as well as their interactions. This is a very important result because it

shows that candidate list length, a parameter that we have often seen set at a fixed

value or not used is actually one of the most important parameters to set correctly.

G-reinitBranchFac, which is not normally considered a parameter at all, is also

very important for solution quality.

Looking at the Time rankings, we see that antPlacement was completely re-

moved from the model. The least important main effects were then L-restartFreq

and H-reinitIters. These results also confirm the screening decisions.

188




1 J-problemStDev 96.79 27278.14 < 0.0001 35 BK 0.59 165.63 < 0.00012 B-beta 12.95 3648.58 < 0.0001 36 HL 0.59 164.91 < 0.00013 E-q0 12.69 3576.14 < 0.0001 37 BL 0.55 154.09 < 0.00014 BE 12.21 3440.86 < 0.0001 38 CF 0.54 151.67 < 0.00015 D-nnFraction 5.84 1646.78 < 0.0001 39 FL 0.50 140.41 < 0.00016 DE 4.47 1260.23 < 0.0001 40 JK 0.44 124.25 < 0.00017 BD 3.89 1096.49 < 0.0001 41 BJ 0.44 122.94 < 0.00018 G-reinitBranchFac 3.89 1095.75 < 0.0001 42 A^2 0.43 120.87 < 0.00019 GJ 2.79 785.30 < 0.0001 43 EH 0.43 120.76 < 0.0001

10 FH 2.68 756.69 < 0.0001 44 CK 0.42 118.41 < 0.000111 BC 2.61 735.55 < 0.0001 45 FG 0.42 117.05 < 0.000112 CD 2.57 725.08 < 0.0001 46 CH 0.41 115.01 < 0.000113 AB 2.50 704.56 < 0.0001 47 J^2 0.38 106.55 < 0.000114 AK 2.41 680.27 < 0.0001 48 DG 0.37 104.57 < 0.000115 AJ 2.39 673.34 < 0.0001 49 CE 0.36 101.04 < 0.000116 HJ 2.34 658.60 < 0.0001 50 H-reinitIters 0.34 96.05 < 0.000117 HK 2.28 642.08 < 0.0001 51 AL 0.29 81.91 < 0.000118 EL 2.00 565.03 < 0.0001 52 AE 0.27 76.12 < 0.000119 GK 1.90 535.92 < 0.0001 53 FK 0.25 70.12 < 0.000120 EJ 1.84 519.39 < 0.0001 54 AG 0.22 62.70 < 0.000121 C-antsFraction 1.82 513.19 < 0.0001 55 BF 0.20 57.43 < 0.000122 GL 1.75 492.62 < 0.0001 56 EK 0.17 49.28 < 0.000123 DF 1.68 474.43 < 0.0001 57 F-rho 0.17 47.94 < 0.000124 DK 1.51 426.61 < 0.0001 58 BH 0.14 40.61 < 0.000125 EF 1.40 395.95 < 0.0001 59 AD 0.14 38.56 < 0.000126 CL 1.20 338.79 < 0.0001 60 F^2 0.13 36.25 < 0.000127 K-problemSize 1.14 322.31 < 0.0001 61 DJ 0.12 34.29 < 0.000128 KL 1.01 284.32 < 0.0001 62 E^2 0.11 30.90 < 0.000129 L-restartFreq 0.85 238.83 < 0.0001 63 M-antPlacement 0.08 21.47 < 0.000130 D^2 0.69 194.48 < 0.0001 64 AC 0.08 21.37 < 0.000131 DH 0.66 185.14 < 0.0001 65 BG 0.06 17.64 < 0.000132 FJ 0.65 182.44 < 0.0001 66 AF 0.06 15.97 < 0.000133 AH 0.62 174.99 < 0.0001 67 EG 0.05 15.35 < 0.000134 CJ 0.62 174.03 < 0.0001 68 B^2 0.05 14.48 0.0001

69 C^2 0.05 13.28 0.000370 DL 0.04 11.10 0.000971 A-alpha 0.03 9.67 0.001972 EM 0.03 7.05 0.008073 JL 0.02 6.44 0.0113

Figure 11.12: RelativeError-Time ranked ANOVA of Relative Error response from full model. The tablelists the remaining terms in the model after stepwise regression in order of decreasing Sum of Squares.

189




1 C-antsFraction 447.72 28026.84 < 0.0001 31 AK 0.66 41.11 < 0.00012 K-problemSize 47.69 2985.59 < 0.0001 32 DK 0.63 39.22 < 0.00013 D-nnFraction 42.94 2688.11 < 0.0001 33 CF 0.55 34.54 < 0.00014 E-q0 8.06 504.59 < 0.0001 34 B-beta 0.45 28.07 < 0.00015 CG 5.58 349.35 < 0.0001 35 BJ 0.39 24.28 < 0.00016 DG 5.08 318.31 < 0.0001 36 DE 0.39 24.14 < 0.00017 A-alpha 4.95 309.83 < 0.0001 37 DJ 0.35 22.18 < 0.00018 EG 4.61 288.38 < 0.0001 38 BK 0.35 21.84 < 0.00019 AG 3.57 223.53 < 0.0001 39 EL 0.33 20.92 < 0.0001

10 C^2 3.46 216.87 < 0.0001 40 BH 0.32 19.86 < 0.000111 BC 3.41 213.56 < 0.0001 41 JL 0.29 17.87 < 0.000112 AE 3.02 189.01 < 0.0001 42 BE 0.26 16.46 < 0.000113 AC 2.64 165.50 < 0.0001 43 HK 0.24 14.71 0.000114 CJ 2.39 149.68 < 0.0001 44 HJ 0.23 14.39 0.000215 G-reinitBranchFac 2.31 144.33 < 0.0001 45 EJ 0.20 12.65 0.000416 F-rho 1.66 104.20 < 0.0001 46 D^2 0.20 12.48 0.000417 CE 1.52 94.85 < 0.0001 47 CK 0.18 11.16 0.000918 FJ 1.50 94.14 < 0.0001 48 EK 0.17 10.89 0.001019 DF 1.50 93.69 < 0.0001 49 GJ 0.14 8.65 0.003320 AD 1.48 92.87 < 0.0001 50 GL 0.13 8.45 0.003721 BG 1.30 81.17 < 0.0001 51 G^2 0.10 6.34 0.011922 J-problemStDev 1.25 78.56 < 0.0001 52 FL 0.10 6.17 0.013123 HL 1.20 75.24 < 0.0001 53 FH 0.09 5.39 0.020424 BF 1.19 74.42 < 0.0001 54 DL 0.06 3.93 0.047625 AB 0.98 61.19 < 0.0001 55 BL 0.05 3.07 0.079926 CD 0.95 59.67 < 0.0001 56 CH 0.04 2.75 0.097727 KL 0.84 52.89 < 0.0001 57 EF 0.04 2.68 0.101728 AH 0.73 45.43 < 0.0001 58 FG 0.04 2.52 0.113029 JK 0.72 44.89 < 0.0001 59 H-reinitIters 0.01 0.49 0.483930 GK 0.66 41.51 < 0.0001 60 L-restartFreq 0.01 0.36 0.5483

Figure 11.13: RelativeError-Time ranked ANOVA of time response from full model. The table lists theremaining terms in the model after stepwise regression in order of decreasing Sum of Squares.

By far the most important tuning parameters are the amount of ants and the

lengths of their candidate lists. This is quite intuitive as the amount of processing

is directly related to these parameters. The result regarding the cost of the amount

of ants is particularly important because the amount of ants does not have a rel-

atively strong effect on solution quality. The extra time cost of using more ants

will not result in gains in solution quality. This is an important result because it

contradicts the often recommended parameter setting of letting the number of ants

equal to the problem size.

An examination of the ranked ANOVAs from the ADA-Time model gives the

same ranking of tuning parameter contributions to time and ADA. As with the

previous screening study, the choice of solution quality response does not change

the conclusions of the relative importance of the tuning parameters.

11.3.2 Tuning

A desirability optimisation is performed as per Section 6.5.2 on page 122. The

results from the full and screened RelativeError-Time models are presented in the

following two figures.

The rankings of the ANOVA terms has already highlighted the factors that have

little effect on the responses. These screened factors can take on any values in

the desirability optimisations. This is confirmed by examining the desirability rec-

ommendations from the full and screened models. The most important factors,

comprising the screening model, have recommended settings that usually agree

190


Desirability results for the Full MMAS algorithm

Size StDev alph

a

beta

ants

Frac

tion

nnFr

actio

n

q0 rho

rein

itBra

nchF

ac

rein

itIte

rs

rest

artF

req

antP

lace

men

t

Tim

e

Rel

ativ

e Er

ror

Des

irabi

lity

300 10 1 1 1.00 1.00 0.99 0.01 0.50 80 2 random 0.72 0.15 1.00300 40 1 1 1.00 1.00 0.99 0.01 0.50 80 2 random 0.94 0.56 0.95300 70 13 13 1.13 1.00 0.04 0.68 0.57 80 39 same 1.01 0.54 0.95400 10 1 1 1.00 1.50 0.96 0.22 0.51 80 39 same 1.53 0.19 0.96400 40 13 6 1.00 1.00 0.95 0.70 0.50 80 20 random 2.18 0.84 0.87400 70 13 12 1.00 1.00 0.01 0.64 0.62 35 30 random 2.55 1.16 0.84500 10 13 2 1.00 1.01 0.99 0.49 0.51 31 23 same 4.25 0.37 0.91500 40 13 13 1.01 1.32 0.01 0.59 0.50 5 30 random 4.45 0.99 0.82500 70 13 13 1.00 1.40 0.01 0.48 0.51 38 34 same 4.27 1.14 0.81

MMAS_RSM_All_R8_RelErr-Time_05.dx7_desirability.xls

Figure 11.14: Full RelativeError-Time model results of desirability optimisation. The table lists therecommended parameter values for combinations of problem size and problem standard deviation. Theexpected time and relative error are listed with the desirability value.

Desirability results for the Full MMAS algorithm

Size

St D

ev

alph

a

beta

ants

Frac

tion

nnFr

actio

n

q0 rho

rein

itBra

nchF

ac

rein

itIte

rs

Tim

e

Rel

ativ

e Er

ror

Des

irabi

lity

300 10 3 11 1.00 1.10 0.98 0.25 0.50 80 1 0.40 0.97300 40 1 7 8.83 1.00 0.98 0.56 0.50 77 3 0.62 0.89300 70 13 6 1.00 2.34 0.99 0.52 0.50 77 1 2.29 0.73400 10 13 5 1.00 1.00 0.89 0.53 0.50 3 3 0.41 0.94400 40 1 3 2.79 1.00 0.94 0.69 0.60 77 3 1.11 0.80400 70 13 9 1.00 1.00 0.99 0.53 1.12 76 5 2.82 0.64500 10 13 8 1.00 1.02 0.85 0.24 0.50 50 5 0.41 0.90500 40 13 11 1.20 1.04 0.99 0.57 0.85 80 8 1.18 0.74500 70 13 10 1.11 1.00 0.14 0.46 0.61 75 4 3.04 0.63

MMAS_RSM_Screened_R8_RelErr-Time_06.dx7_desirability.xls

Figure 11.15: Screened RelativeError-Time model results of desirability optimisation. The table liststhe recommended parameter values for combinations of problem size and problem standard deviation.The expected time and relative error are listed with the desirability value.

191


with the recommended settings from the full model. For example, AntsFraction

and nnFraction are always low in both models. The remaining unimportant factors

take on a variety of values in the full model desirability optimisation, for example

AntPlacement.

The predicted values of both relative error and time from both models agree

closely with one another. The quality of the desirability optimisation recommenda-

tions can now be evaluated.

11.3.3 Evaluation of tuned settings

The tuned parameter recommendations from the desirability optimisation are eval-

uated as per the methodology of Section 6.6 on page 124. Some illustrative plots

are given in the following figures. On each plot, the horizontal axis lists the ran-

domly generated treatments and the vertical axis lists the response value. Each

plot contains the data for the response recorded using the settings from the desir-

ability optimisation of the full and screened experiments. The responses produced

by using parameter settings recommended in the literature and some randomly

chosen parameter settings are also listed.


0

5

10

0 1 2 3 4 5 6 7 8 9 10Treatment

Rel

ativ

e E

rror 2

50

Full DesirabilityScreened DesirabilityBookRandom

Figure 11.16: Evaluation of Relative Error response in the RelativeError-Time model of MMAS. Prob-lems are of size 500 and standard deviation 10.

For Relative Error, the parameter settings from the full and screened models

perform slightly better than the parameter settings from the literature. Interest-

ingly, on a small number of occasions, randomly chosen settings perform better

than all other settings. It is not until we examine the other side of the heuristic

compromise that we see the advantage of the DOE approach to parameter tun-

ing. Here, the results from the DOE desirability optimisation are two orders of

magnitude better than the results from the settings recommended in the literature

(Section 2.4.9 on page 45).

For both models, there is little difference between the performance in terms of

time or quality, supporting the conclusions made in the screening experiment of

the previous chapter.

As with ACS, results were quite different when evaluating the ADA-Time model.

192



1

10

100

1000

10000

0 1 2 3 4 5 6 7 8 9 10

Treatment

Tim

e 25

0


Figure 11.17: Evaluation of the Time response in the relativeError-Time model of MMAS. Problemsare of size 500 and standard deviation 10.

In the next two figures we see that the parameter settings from the full and

screened models outperform the settings from the literature by an order of magni-

tude in solution quality and three orders of magnitude in solution time. Again, the

comparison of the performance of the DOE settings with the literature settings is

not appropriate for ADA as the literature settings were recommended in the context

of relative error. However this highlights the importance of not apply the literature

recommended settings without an understanding of the solution quality response.

There is no practically significant difference between parameter setting recommen-

dations from the full and screened models. This indicates that the decisions from

the screening study were correct.


0

5

10

15

20

25

30

35

40

0 1 2 3 4 5 6 7 8 9 10Treatment

AD

A 2

50


Figure 11.18: Evaluation of the Time response in the RelativeError-Time model of MMAS. Problemsare of size 300 and standard deviation 10.

193



1

10

100

1000

0 1 2 3 4 5 6 7 8 9 10Treatment

Tim

e 25

0Full DesirabilityScreened DesirabilityBookRandom

Figure 11.19: Evaluation of the Time response in the ADA-Time model of MMAS. Problems are of size300 and standard deviation 10.

Not all parameter recommendations performed so well. The next figure shows

the evaluation of the Relative Error performance from the RelativeError-Time model

for problems of size 400 and standard deviation 70. The parameter settings from

the literature perform as well as or better than those obtained with the DOE ap-

proach. Randomly chosen parameter settings perform better than all others on

many occasions.


0

5

10

15

0 1 2 3 4 5 6 7 8 9 10

Treatment

Rel

ativ

e E

rror 2

50


Figure 11.20: Evaluation of Relative Error response in the RelativeError-Time model of MMAS. Prob-lems are of size 400 and standard deviation 70.

Similar results of poor relative error performance despite excellent time perfor-

mance were also observed for other combinations of problem size and standard

deviation. This was not an issue with ACS and may be due to the more compli-

cated nature of MMAS or an interaction between the stagnation stopping criterion

and the less aggressive behaviour of MMAS relative to ACS.

194



The following conclusions are drawn from the MMAS tuning study. These screen-

ing and tuning conclusions apply for a significance level of 5% and a power of

80% to detect the effect sizes listed in Figure 11.1 on page 181. These effect sizes

are a change in solution time of 474s, a change in Relative Error of 3.36% and a

change in ADA of 1.56. Issues of power and effect size are discuss in Section A.6

on page 225.

• Tuning Ant Placement not important. This factor had a very low rank-

ing in both full models. The screening study correctly identified the lack of

importance of Ant Placement.

• Tuning Restart frequency not important. The number of iterations used

in the restart frequency has no significant effect on MMAS performance in

terms of solution quality or solution time. This is a highly unexpected result

as the restart frequency is a fundamental feature of MMAS (Section 2.4.5 on

page 39)

• Alpha only important for solution time. The choice of Alpha only effects

solution time. Although statistically significant for the quality responses, it

has a low ranking. This confirms the literature’s general recommendation

that Alpha be set to 1.

• Beta not important for solution time. The choice of Beta only effects solu-

tion quality and not solution time. This is a new result in the ACO literature.

• Reinitialisation Branching Factor is important. This has a strong effect

on Time and a moderate but statistically significant effect on quality. This

highlights the importance of tuning this parameter which is usually held con-

stant.

• Sufficient order model.. A model of at least quadratic order is required to

model MMAS solution quality and MMAS solution time. This confirms the re-

sult of the screening study that simple OFAT approaches seen in the literature

are insufficient for accurately modelling and tuning MMAS performance.

• Relationship between tuning, problems and performance. Both the mod-

els of RelativeError-Time and ADA-Time were good predictors of MMAS per-

formance across the entire design space. The prediction intervals for full and

screened models were very similar, confirming that the decisions from the

screening study were correct. However a ranking of the full model terms from

the tuning study suggests it may not have been appropriate to screen out

restart frequency.

• Tuned parameter settings. There was little similarity between the recom-

mended tuned parameter settings from the full and screened models but both

settings resulted in similar MMAS performance. This indicates that there may

be many combinations of parameter settings that give similar performance for

195


MMAS. Finding one of these settings is important however as recommended

settings resulted in similar or better solution quality than random settings

and two orders of magnitude better time than random settings. There are

immense performance gains to be achieved.

For some combinations of problem size and problem standard deviation, the

RelativeError-Time recommendations resulted in worse solution quality than

some randomly chosen parameter settings. Solution time was nonetheless

improved. These poor recommendations may be due to the complex nature

of MMAS that may be difficult to model with an interpolative DOE approach.

This complexity is particularly evident in the daemon actions phase of MMAS

(Seciton 2.4.5 on page 39) where reinitialisations occur. This ‘restart’ nature

of MMAS may be difficult to model with the DOE approach.


This chapter presented a case study applying the methodology of Chapter 6 to

the tuning of the Max-Min Ant System (MMAS) heuristic. Many new results were

presented and existing recommendations in the literature were confirmed in a rig-

orous fashion. The conclusions of the screening study in the previous chapter were

also confirmed.

196

12Conclusions

This thesis presents a rigorous Design Of Experiments (DOE) approach for the

tuning of heuristic algorithms for Combinatorial Optimisation (CO). The thesis

therefore draws on well-established fields such as Operations Research, Design

Of Experiments, Empirical Analysis and Empirical Methodology and contributes a

much needed rigour [64, 103, 48, 65] to fields such as Heuristics, Metaheuristics,

and Ant Colony Optimisation.

This chapter summarises the thesis. It begins with a brief overview of the main

problem that the thesis addresses. It then examines the advantages of the ap-

proach taken in the thesis, leading to the main hypothesis of the work. The contri-

butions from this thesis are listed along with the thesis strengths and limitations.

The chapter closes with a discussion of possibilities for future work.

12.1 Overview

CO is a ubiquitous and extremely important type of optimisation problem that oc-

curs in scheduling, planning, timetabling and routing. However, these are typically

very difficult problems to solve exactly and so are generally tackled with approxi-

mate (heuristic) approaches and more general frameworks of heuristic approaches

called metaheuristics. Heuristics sacrifice finding an exact solution to a problem

and instead find an approximate solution in reasonable time. They therefore in-

herently involve a tradeoff between solution quality and solution time called the

heuristic compromise. Popular and successful metaheuristics include Evolutionary

Computation, Tabu Search, Iterated Local Search and Ant Colony Optimisation.

The flexibility of metaheuristics to deal with a range of problems comes at a cost—

their application to a particular problem typically requires setting the values of a

large number of tuning parameters. These tuning parameters are inputs to the

metaheuristic that govern its behaviour. Exploring this large parameter space and

relating it to problem characteristics and performance in terms of the heuristic

197

CHAPTER 12. CONCLUSIONS

compromise is called the parameter tuning problem. Methodically solving the pa-

rameter tuning problem for metaheuristics is probably the single most important

obstacle to producing metaheuristics that are well understood scientifically and

easily applied in practical scenarios.

12.2 Advantages of DOE

This thesis takes an empirical approach to parameter tuning, adapting method-

ologies, experiment designs and analyses from the field of Design Of Experiments

[84, 85]. It therefore fits between analytical and automated tuning approaches. It

efficiently provides the raw data and trends to which the analytical camp should fit

their models. It recommends good quality parameter tuning settings against which

the automated tuning camp can compare their tuners’ recommendations.

DOE was chosen for several reasons. It is well-established theoretically and

well supported in terms of software tools. This means it should be relatively easy

to convince metaheuristics researchers of the thesis’ methodology and to encour-

age them to adopt it. While a familiarity with statistical analyses and their in-

terpretation is of course required, many of the more involved aspects can be left

to the statistical analysis software. This is certainly the case in other fields such

as psychology and medicine where researchers have a tradition of following good

practice in experiment design, data collection and data analysis without neces-

sarily being expert statisticians. The traditional areas to which DOE is applied in

engineering map almost directly to the common research questions that one asks

in metaheuristics research so DOE’s power and maturity can be transferred di-

rectly to metaheuristics research. DOE offers efficiency in terms of the amount

of data that needs to be gathered. This is critical when attempting to understand

immense design spaces. All DOE conclusions are based on statistical analyses and

so are supported with mathematical precision. This allays any concerns regarding

subjective interpretation of results.

The main advantage of DOE over automated approaches is that it produces

a model of the data. If the model is verified to be a good quality model then it

can be used to explore, understand and ask questions of the actual algorithm

performance. Automated approaches [12, 8] provide relatively fast solutions but

do not provide this power to explore the parameter-performance relationship. They

generally provide only a tuned solution. Understanding where this solution came

from requires understanding the complex tuning algorithm code and its dynamics.

Understanding the results from a DOE analysis requires only understanding a

clear and well-established methodology. While some application scenarios will not

require this understanding, scientific reproducibility certainly demands it.

These advantages of DOE over alternatives for tuning metaheuristics lead to the

thesis hypothesis.

198


12.3 Hypothesis

The main hypothesis of this thesis is:



This hypothesis was tested in a multitude of ways. DOE nested designs were

used to investigate the importance of problem characteristics on heuristic per-

formance. DOE screening designs were used to rank the importance of tuning

parameters and problem characteristics. DOE response surface models were used

to model the relationship between tuning parameters, problem characteristics and

performance. Desirability functions were used to tune performance while simulta-

neously addressing the two aspects of the heuristic compromise

The key finding was that the Design of Experiments approach is indeed an

excellent method for modelling and tuning metaheuristics like Ant Colony Op-

timisation. This was demonstrated with independent confirmation experiments

and with comparisons to tuning parameter settings taken from the literature.

This was recognised by the community through peer-reviewed publications at

the field’s main conferences with best paper nominations and best paper awards

[105, 106, 110, 108, 104, 109]. This result is now covered in more detail.

12.4 Summary of main thesis contributions

The following is a summary of the main contributions from this thesis.

1. Synthesis. The poor quality of methodology in the metaheuristics field is

probably due to a lack of awareness of the issues involved. Chapter 3 sur-

veyed the literature to gather together these issues as they have been dis-

cussed over the past 30 years in fields such as heuristics and operations

research. Issues covered included:

(a) the types of research question and study,

(b) the stages in sound experiment design and some common mistakes,

(c) the issues of heuristic instantiation and problem abstraction,

(d) the importance of pilot studies,

(e) reproducibility of results,

(f) benchmarking of machines,

(g) the advantages and disadvantages of various performance responses,

(h) random number generators,

(i) problem instances,

(j) stopping criteria, and

199


(k) interpretative bias

This thesis strived to apply best practice regarding these issues and should

serve as a much needed illustration of good experiment design and analysis.

2. Heuristic code JACOTSP. All experiments were run with our Java version

(JACOTSP) of the original C source code (ACOTSP) accompanying the litera-

ture [47]. JACOTSP was informally verified to produce the same behaviour as

ACOTSP by comparing their outputs on a variety of problem instances and a

variety of tuning parameter settings. JACOTSP also offers the usual advan-

tages of an Object-Oriented (OO) design, namely extensibility and reuse. It is

intended to make JACOTSP available to the community, creating a focal point

for more reproducible research with ACO.

3. Problem generator code Jportmgen. All problem instances were generated

with our Java port (Jportmgen) of a generator used in a large open competition

in the optimisation community (portmgen) [58]. Jportmgen was informally

verified to produce the same instances as the original portmgen. Again, the

OO design permitted instances to be generated with edge lengths that follow

a plugged-in distribution. This was important for experiments with problem

characteristics.

4. Methodology for investigating if a problem characteristic affects per-formance. A detailed methodology was presented for determining whether

a given problem characteristic affects heuristic performance. This involved

the introduction of the nested design and its analysis to the ACO field. The

methodology was illustrated with a published case study of ACS and MMAS

[105, 110]. This method and experiment design are now attracting attention

in the stochastic local search field [6].

5. Methodology for screening tuning parameters and problem characteris-tics. A detailed methodology and efficient experiment designs were presented

for ranking the most important tuning parameters and problem character-

istics that affect performance and for screening out those that do not effect

performance. This methodology was published along with illustrative case

studies of its application [108, 104]. A further methodology was presented

for independently confirming the accuracy of the screening model and its rec-

ommendations. The thesis is the first use of fractional factorial designs (Sec-

tion A.3.2 on page 215) for screening ACO tuning parameters and problem

characteristics.

6. Methodology for modelling the relationship between tuning parameters,problem characteristics and performance. A detailed methodology and ef-

ficient experiment designs were presented for modelling the relationship be-

tween tuning parameters, problem characteristics and performance. This

methodology was published along with illustrative case studies of its appli-

cation [106, 109]. The thesis is the first use of response surface models and

fractional factorial designs for modelling ACO.

200


7. Desirability functions and emphasis on the heuristic compromise. Thr-

oughout the thesis, the heuristic compromise has been emphasised. The mul-

tiobjective problem of reducing running time while improving solution quality

was tackled by the introduction of desirability functions. Using desirabil-

ity functions and tuning heuristic desirability does not exclude the analysis

of heuristic solution time and heuristic solution quality separately. It is a

convenient approach to deal with the heuristic compromise. Confirmation

experiments demonstrated that tuning parameter settings found with the de-

sirability approach offered orders of magnitude savings in solution time over

parameters taken from the literature.

8. A new important problem characteristic for ACO. The analysis of problem

difficulty from the case study in Chapter 7 showed that the standard deviation

of edge lengths in a TSP instance has a significant effect on problem difficulty

for ACS and MMAS. This means that research should report the standard de-

viation of instances. This result was confirmed in subsequent screening and

modelling case studies in which it was shown that problem instance standard

deviation had a very large effect on solution quality. This may extend to other

metaheuristics.

9. New results from screening ACS and MMAS. The screening experiments

answered many open questions regarding the importance of various tuning

parameters and problem characteristics. From screening ACS, it was shown

that:

• Tuning Ant placement not important. The type of ant placement has

no significant effect on ACS performance in terms of solution quality or

solution time. This was an open question in the literature. It is remark-

able because intuitively one would expect a random scatter of ants across

the problem graph to explore a wider variety of possible solutions. This

result shows that this is not the case.

• Tuning Alpha not important. Alpha has no significant effect on ACS

performance in terms of solution quality or solution time. This confirms

the common recommendation in the literature of setting alpha equal to

1. An OFAT analysis of alpha for ACS is reported in Appendix D.

• Tuning Rho not important. Rho has no significant effect on ACS per-

formance in terms of solution quality or solution time. This is a new

result for ACS. It is a surprising result since Rho is a term in the ACS

update pheromone equations and analytical approaches in very simpli-

fied scenarios have concluded that rho is important [41].

• Tuning Pheromone Update Ant not important. The ant used for phe-

romone updates is practically insignificant for all three responses. An

examination of the plot of time for the K-pheromoneUpdate factor shows

that the effect on time is not practically significant. K-pheromoneUpdate

can therefore be screened out.

201


• Most important tuning parameters. The most important ACS tuning

parameters are the heuristic exponent B-beta, the amount of ants C-

antsFraction, the length of candidate lists D-nnFraction and the explo-

ration/exploitation threshold E-q0.

• Problem standard deviation is important. This confirms the main re-

sult of Chapter 7 in identifying a new TSP problem characteristic that has

a significant effect on the difficulty of a problem for ACS. ACO research

should be reporting this characteristic in the literature.

• Higher order model needed. A higher order model, greater than linear,

is required to model ACS solution quality and ACS solution time. This is

an important result because it demonstrates for the first time that simple

OFAT approaches seen in the literature are insufficient for accurately

tuning ACS performance.

• Comparison of solution quality responses. The is no difference in con-

clusions from the ADA and Relative Error solution quality responses.

ADA has a slightly smaller variability and so results in more powerful

experiments than Relative Error.

From screening MMAS, it was shown that:

(a) Tuning Restart Frequency not important. The tuning parameter Restart

Frequency is statistically insignificant for solution quality and solution

time in the factor ranges experimented with. It may, however, become

important when very high solution quality is required.

(b) Tuning AntPlacement not important. As with ACS, the design param-

eter AntPlacement does not have a significant effect on solution quality

or solution time. Either random scatter or single random node placement

can be used when placing ants on the TSP graph.

(c) Tuning Alpha only important for solution time. The choice of the

Alpha tuning parameter value only effects solution time. Although sta-

tistically significant for the quality responses, it has a low ranking. This

confirms the literature’s general recommendation that Alpha be set to 1

[47, p. 71].

(d) Problem difficulty results confirmed. The result of the study of prob-

lem characteristics affecting performance was confirmed with problem

edge length standard deviation having a very strong effect on solution

quality for MMAS.

(e) Important tuning parameters. Of the remaining unscreened tuning pa-

rameters, the heuristic exponent Beta, the amount of ants antsFraction,

the length of candidate lists nnFraction, the exploration/exploitation thres-

hold q0 and and pheromone decay term Rho have the strongest effects

on solution quality. The same is true for solution time except for beta

which has a low ranking for solution time.

202


(f) New parameter. Reinitialisation Branching Factor (ReinitBranchFac) is

statistically significant for all three responses but is only ranked in the

top third for the quality responses. It has a high ranking for Time. This

highlights that ReinitBranchFac should be considered as a tuning pa-

rameter rather than being hard-coded as is typically the case.

(g) Higher order model of MMAS behaviour needed. A higher order model,

greater than linear, is required to model MMAS solution quality and

MMAS solution time. This is an important result because it demonstrates

for the first time that simple OFAT approaches seen in the literature are

insufficient for accurately tuning MMAS performance.

(h) Comparison of solution quality responses. The is no difference in con-

clusions from the ADA and Relative Error solution quality responses for

MMAS. The ADA response is therefore preferable for screening because

it exhibits a lower variability than Relative Error and therefore results in

more powerful experiments.

The modelling case studies in Chapters 9 and 11 both confirmed results from

the other screening case studies and yielded new results in their own right.

• Confirmation of screening study results. The models of ACS and

MMAS performance were built using the full set of tuning parameters

and the reduced set resulting from the previous screening experiments.

Both the full and reduced models were good predictors of performance in

terms of both solution quality and solution time. This confirmed the ac-

curacy of the previous screening studies’ recommendations. In general,

the ranking of the importance of the tuning parameters was in broad

agreement with the ranking from the screening study. Some small differ-

ences are to be expected because the screening analyses are conducted

on each response separately while the modelling analyses are conducted

on each response simultaneously.

These results confirm that screening studies for ACO can be trusted for

screening out parameters that do not affect performance and so reducing

the parameter space to explore in more expensive modelling studies.

• Possibility of multiple regions of interest. The recommended param-

eter settings from the desirability optimisation of full and screened mod-

els were not the same for MMAS. However, when the recommendations

were independently evaluated, both gave similarly competitive solution

qualities and similarly huge savings in solution time on new problem

instances. This highlights the possibility of multiple regions of interest

in the parameter space of MMAS. This is important because it confirms

the futility of attempting to recommend ‘optimal’ parameter settings. It

also illustrates the more complicated parameter-performance relation-

ship that emerges when one considers the two aspects of the heuristic

compromise simultaneously. There are probably more than one regions

203


in the parameter space where a similar compromise in solution quality

and solution time can be found.

• Quadratic models needed. Fit analyses for the ACS and MMAS response

surface models showed that a surface of order at least quadratic is re-

quired to model these metaheuristics. This rules out the use of OFAT

approaches (Section 2.5 on page 47) to tuning these heuristics. The

quadratic models were independently confirmed to be good predictors of

performance across the parameter space. It is an open question whether

higher order models and the associated increase in experiment expense

would yield even better predictions of performance.

Note that all the aforementioned results were obtained at a significance level of

5% and the largest effect size that could be detected with a power of 80%. Please

refer to the individual case studies for the details of these effect sizes. Note that

these effect sizes were limited by the number of replicates that could be run with

the available experimental resources. In an ideal situation, the experimenter would

determine the effect size from the experimental objectives and then increase the

replicates until sufficient power was achieved.

12.5 Thesis strengths

12.5.1 Rigour and efficiency

The strengths of this thesis come from the strengths of DOE (Section 2.5 on

page 47). The thesis’ methodologies are adapted from well-established and tested

methodologies used in other fields such as manufacture. They are therefore proven

on decades of scientific and industrial experience. The experiment designs allow

for a very efficient use of experiment resources while still obtaining all of the most

important information from the data. Until this thesis, there has been little aware-

ness of the potential of these designs in the ACO literature. Their efficiency is crit-

ically important when experiments are expensive due to large parameter spaces

and difficult problem instances. In particular, the designs provide a vast saving in

experiment runs (Section A.3.3 on page 218). Because DOE and Response Sur-

face Models build a model of performance across the whole design space, many

research questions can be explored. Numerical optimisation of this surface can

recommend tuning parameter settings for different weightings of the responses of

interest. One may obtain settings appropriate for long run times and high quality

or short run times and lower levels of solution quality. All of these questions are

answered on the same model without need to rerun experiments.

12.5.2 Generalizability

The methodology and results from this thesis are of interest to both those who

design heuristics and engineers who wish to deploy a heuristic. Designers can use

the thesis methodology to rank the contribution of new additions to a heuristic

204


(design parameters) as well as to understand and model the contribution of tuning

parameters to changes in performance. DOE provides a rigorous approach for

testing hypotheses about a new heuristic, categorically determining whether new

techniques/components make a significant impact on performance. For engineers,

DOE provides a verifiably accurate model of behaviour, allowing the heuristic to

be quickly retuned to new problem instances without running a large set of new

experiments.

Although all case studies illustrate the application of the thesis’ methodolo-

gies to ACO, there is no reason why the methodologies cannot be applied to other

heuristics.

12.5.3 Reproducibility and empirical best practice

An effort has been made throughout the thesis to address the methodological is-

sues raised and discussed in Chapter 3. The thesis uses algorithm and problem

generator code that is backwards compatible with codes commonly used in the

field. This strengthens the reproducibility of its results and makes its conclusions

applicable to all previous work that has used these codes. Experiment machines

were properly benchmarked according to an established procedure [58]. This im-

proves the thesis’ reproducibility and applicability for all subsequent research work

that may refer to this thesis.

12.6 Thesis limitations

DOE is not a panacea for the myriad difficulties that arise in the empirical analysis

of heuristics. It goes a long way towards overcoming many of those difficulties.

However, the conclusions and contributions of this thesis are necessarily limited

in a few ways.

• Computational expense. Despite the efficiency of the DOE designs that this

thesis introduced, running sufficient experiments to gather sufficient data

is still computationally expensive. Of course, the experiments would have

been orders of magnitude more expensive had a less sophisticated approach

been used. This expense is increased when there are many categorical tun-

ing parameters due to the nature of how the designs are built. However,

any expense is mitigated by the amount of useful and structured informa-

tion obtained. DOE yields a full model of the data that can be explored in

many ways and used to make new predictions about the heuristic perfor-

mance across the entire design space. The DOE designs and methods in this

thesis are the state-of-the-art approach for building such models. If a user

is concerned about quickly tuning parameters in a one-off scenario, an au-

tomated approach may be a preferable alternative. Recall however that use

of an automated method implies that the user is content to trust its black-

box approach and requires no understanding of the parameter, problem and

performance relationship.

205


• Categorical factors. The previous point mentioned how categorical tuning

parameters increase the size and consequently the expense of the experi-

ments. It must be pointed out that the 2-level fractional factorial designs

used in screening can only take on two values for each factor. This is not

a severe limitation when one considers the main motivation of a screening

design—to determine whether a factor should be included in the more expen-

sive Response Surface Model design.

• Nested parameters. A parameter type that we term a nested parameter arose

in the analysis of MMAS parameters in Section 2.4.9 on page 45. These are

parameters that only make sense within their parent parameter. Factorial ex-

periment designs cannot analyse these types of parameters directly. Values of

the nested parameter and its parent could be lumped into a single categorical

parameter. Our summary of tuning parameters in Section 2.4.9 on page 45

suggests that these types of parameter might not be so common anyway.

12.7 Future work

The research presented in this thesis should be developed and extended along the

following lines.

• Further ACO algorithms. The original ACOTSP and our JACOTSP contain

further ACO algorithms, Rank-based Ant System [24], Elitist Ant System and

Best-Worst Ant System [31]. These could easily be investigated with the thesis

methodologies to see whether similar results are obtained regarding the im-

portance of various tuning parameters. It would also be of interest to extend

these algorithms with local search.

• Comparison to OFAT. It would strengthen the argument for the use of DOE

in favour of OFAT if a comprehensive comparison of the two methods were

conducted as has been done in other fields [37].

• Comparison to other tuning methods. If would be interesting to compare

DOE to the results from other tuning approaches such as automated tuning.

• Further heuristics. The methodology could also be applied to other heuris-

tics. Screening studies, for example, have already been independently applied

to Particle Swarm Optimisation [74].

• Useful tools. A strength of DOE is its support in software. There is no

excuse for the metaheuristics practitioner to claim that statistics are too time-

consuming or complicated to use (Section 2.5 on page 47). Modern statistical

analysis software shields the user from much of this complexity. A greater

awareness of this software and tutorials on how to use it with metaheuristics

is urgently needed.

206


12.8 Closing

It is hoped that this thesis has convinced the reader of the merits of the DOE

approach when applied to the problem of tuning metaheuristics. The parameter

tuning problem is ubiquitous in the field and must be tackled in every new piece

of metaheuristics research. The methodologies of this thesis, or some appropri-

ate adaptation of them, should be used when setting up ACO heuristics. There

is no longer any excuse for inheriting values from other publications or for fuzzy

reasoning with words and intuition about the parameters that need to be tuned.

Adopting this thesis’ methodologies will add to the expense of metaheuristics ex-

periments. However, this is the unavoidable reality of dealing with metaheuristics

with a large number of tuning parameters. The researcher who embraces this the-

sis’ methodologies will have at their disposal an established, efficient, rigourous,

reproducible approach for making strong conclusions about the relationship be-

tween metaheuristic tuning parameters, problem characteristics and performance.

207

Part V

Appendices

209

ADesign Of Experiments (DOE)

This appendix provides a basic background on the main Design Of Experiments

(DOE) and statistics concepts used in this thesis and introduced in Section 2.5 on

page 47. The material in this appendix is adapted and compiled from the literature

[84, 1, 89, 85] for the reader’s convenience and is not intended to replace a detailed

study of those texts.

The appendix begins with an introduction to the terminology for the DOE field.

It then provides a short explanation of the various DOE topics referred to in the

thesis.

A.1 Terminology

The following terms are encountered frequently in the design and analysis of ex-

periments.

A.1.1 Response variable

The response variable is the measured variable of interest. In the analysis of meta-

heuristics, one typically measures the solution quality and solution time required

by a heuristic as these are reflections of the heuristic compromise. The DOE ap-

proach can be used in heuristic design as well as heuristic performance analysis

and so the choice of response variable is limited only by the experimenter’s imag-

ination. In some cases, it may be appropriate to measure the frequency of some

internal heuristic operation for example.

A.1.2 Factors and Levels

A factor is an independent variable manipulated in an experiment because it is

thought to affect one or more of the response variables. The various values at

which the factor is set are known as its levels. In heuristic performance analysis,

211

APPENDIX A. DESIGN OF EXPERIMENTS (DOE)

the factors include both the heuristic tuning parameters and the most important

problem characteristics. These are sometimes distinguished by referring to them

as tuning factors and problem factors respectively. Factors can also be new heuris-

tic components that are hypothesised to improve performance. Sometimes factors

are distinguished as being either design factors (or primary factors) or held-constantfactors (or secondary factors). Design factors are those factors that are being stud-

ied because we are interested in their effects on the responses. Held-constant

factors are those factors that are known to affect the responses but are not of in-

terest in the present study. They should be held at a constant value throughout

all experiments.

A.1.3 Treatments

A treatment is a specific combination of factor levels. The particular treatments will

depend on the particular experiment design and on the ranges over which factors

are varied.

A.1.4 Replication

Replicates are repeated runs of a treatment. Replicates are needed when a studied

process produces different response measurements for identical runs of a treat-

ment. This is always the case with stochastic heuristics. The number of replicates

required in an experiment is linked to the statistical concept of power discussed

later.

A.1.5 Effects

An effect is a change in the response variable due to a change in one or more

factors. We can define main effects as follows:

The main effect of a factor is a measure of the change in the response

variable to changes in the level of the factor averaged over all levels of all

the other factors. [89]

Higher order effects (or interactions) are the effect that occurs when the com-

bined change in two factors produces an effect greater (or less) than that of the

sum of effects expected from either factor alone. An interaction occurs when the

effect one factor has depends on the level of another factor. A second order effect

is due to two factors, a third order to three and so on.

A.1.6 Confounding

Two are more effects are said to be confounded if it is impossible to separate the ef-

fects when the subsequent statistical analysis is performed. This is best described

with an example.

A computer scientist has developed a new algorithm and wishes to compare

it with an established algorithm. He has two machines available to him. The

212


established algorithm will be run on one machine and the experimental algorithm

to the other. The characteristic to be measured as an index of performance will

be the run time of the algorithm to solve a particular problem. However, when the

two run times are compared, it is impossible to say how much of the difference is

due to the algorithms and how much is due to inherent differences (age, operating

system version, memory) between the two machines.

The effects of algorithm and machine are thus confounded. Confounding is due

to poor experimental planning and execution, particularly to poor control of factors.

It is important to stress the difference between confounding and aliasing. Aliasing

is an inability to distinguish several effects due to the nature of the experiment

design rather than poor execution. It is a deliberate and known price that we pay

for using more efficient designs such as fractional factorials, as discussed later.

A.2 Regions of operability and interest

There are two regions within an experimental design space [85]. The region ofoperability is the region in which the equipment, process etc. works and it is the-

oretically possible to conduct an experiment and measure responses. In ACO, the

region of operability is sometimes bounded, as with the exploration/exploitation

parameter ρ which must be within a range 0 < ρ < 1. With other tuning parameters

such as α, the region of interest is, in theory, unbounded . Within this region of

operability, there may be one or more regions of interest. A region of interest is a

region to which an experimental design is confined. The region of interest is typ-

ically chosen because we believe it contains the optimal process settings. These

regions are illustrated schematically in the following figure.

O

R

R’

Figure A.1: Region of operability and region of interest (adapted from [85]).

The region of operability is often not known until the process has been well

studied and may change depending on circumstances. Myers and Montgomery

[85] offer the following comments on the difficulty of choosing an experimental

region of interest.

. . . in many situations the region of interest (or perhaps even the re-

gion of operability) is not clear cut. Mistakes are often made and adjust-

213


ments adopted in future experiments. Confusion regarding type of designshould never be an excuse for not using designed experiments. Using [a

design for the wrong type of region] for example, will still provide impor-

tant information that will, among other things, lead to more educated

selection of regions for future experiments. [85, p. 317]

A.3 Experiment Designs

There many experiment designs to choose from. The design an experimenter uses

will depend on many factors including the particular research question, whether

experiments are in the early stages of research and the experimental resources

available. This section focuses on the advanced designs that appear in this the-

sis. It begins with a simpler more common design as this provides the necessary

background for understanding the subsequent designs.

A.3.1 Full and 2k Factorial Designs

A full factorial design consists of a crossing of all levels of all factors. The number

of levels of each factor can be two or more and need not be the same for each

factor. These levels may be quantitive (scalar), such as values of pheromone de-

cay constant; or they may be qualitative, such as types of algorithm. This is an

extremely powerful but expensive design. A more useful type of factorial for DOE

uses k factors, each at only 2 levels. The so-called 2k factorial design provides

the smallest number of runs with which k factors can be studied in a full factorial

design. Factorials have some particular advantages and disadvantages [89]. These

are worth noting given the importance that factorials play in experimental design.

The advantages are that:

• greater efficiency is achieved in the use of available experimental resources

in comparison to what could be learned from the same number of experiment

runs in a less structured context such as an OFAT analysis [37],

• information is obtained about the interactions, if any, of factors because the

factor levels are all crossed with one another, and

• results are more comprehensive over a wider range of conditions due to the

combining of factor levels in one experiment.

Of course, these advantages come at a price. As the number of factors grows,

the number of treatments in a 2k design rapidly overwhelms the experiment re-

sources. Consider the case of 10 continuous factors. A naıve full factorial design

for these ten factors will require a prohibitive 210 = 1024 treatments. The full facto-

rial experiment is the ideal design for many of the research questions in this thesis

but the size of metaheuristic design spaces limits its applicability. A more efficient

design is required.

214


A.3.2 Fractional Factorial Design

The previous section mentioned the exponential increase in expense of factorial

designs with an increase in the design factors. There are benefits to this expense.

A 210 full factorial will provide data to evaluate all the effects listed in the next table.

For screening however, the experimenter is interested only in the main effects (the

design factors) and perhaps the two-factor effects. This makes the full factorial

inefficient for screening purposes.

Effect Number es-timated

Main 10

Two-factor 45

Three factor 120

Four-factor 210

Five-factor 252

Six-factor 210

Seven-factor 120

Eight-factor 45

Nine-factor 10

Ten-factor 1

Table A.1: Numbers of each effect estimated by a full factorial design of 10 factors.

If it is assumed that higher-order interactions are insignificant, information on

the main effects and lower-order interactions can be obtained by running a fraction

of the complete factorial design. This assumption is based on the sparsity of effectsprinciple. This states that a system or process is likely to be most influenced by

some main effects and low-order interactions and less influenced by higher order

interactions.

A judiciously chosen fraction of the treatments in a full factorial will yield in-

sights into only the lower order effects. This is termed a fractional factorial. The

price we pay for the fractional factorial’s reduction in number of experimental

treatments is that some effects are indistinguishable from one another. They are

aliased. Additional treatments, if necessary, can disentangle these aliased effects

should an alias group be statistically significant. The advantage of the fractional

factorial is that it facilitates sequential experimentation. The additional treatments

and associated experiment runs need only be performed if aliased effects are sta-

tistically significant. Depending on the number of factors, and consequently the

design size, a range of fractional factorials can be produced from a full factorial.

The amount of higher order effects that are aliased is described by the design’s

resolution. For Resolution III designs, all effects are aliased. Resolution IV designs

have unaliased main effects but second-order effects are aliased. Resolution V

designs estimate main and second-order effects without aliases.

The details of how to choose a fractional factorial’s treatments are beyond the

215


scope of this thesis. It is an established algorithmic procedure that is well covered

in the literature [84] and is provided in all modern statistical analysis software.

The fractional factorials used in this research are summarised in the next figure

which shows the relationship between number of factors, design resolution and

associated number of experiment treatments.

−4 1

IV2

2232

4252

6272

8292 12 3

VI2 −

3 1

III2 −

5 1

V2 −

6 1

VI2 − 7 2

IV2 − 8 3

IV2 − 9 4

IV2 − 10 5

IV2 − 11 6

IV2 − 12 7

IV2 −

7 1

VII2 − 8 2

V2 − 9 3

IV2 − 10 4

IV2 − 11 5

IV2 − 12 6

IV2 −

12 5

IV2 −

8 4

IV2 −7 3

IV2 −6 2

IV2 −

5 2

III2 − 6 3

III2 −

9 5

III2 − 10 6

III2 − 11 7

III2 − 12 8

III2 −

8 1

VIII2 − 9 2

VI2 − 10 3

V2 − 11 4

V2 −

9 1

IX2 − 10 2

VI2 − 11 3

VI2 − 12 4

VI2 −

10 1

X2 − 11 2

VII2 −

8

16

32

64

256

128

512

4 65 7 8 9 10 11 1232

4

7 4

III2 −

Number of factors

Num

ber o

f tre

atm

ents

Figure A.2: Fractional Factorial designs for two to twelve factors. The required number of treatmentsis listed on the left. Resolution III designs (do not estimate any terms) are coloured darkest followed byResolution IV designs (estimate main effects only) followed by Resolution V and higher (estimate maineffects and second order interactions).

The minimum appropriate fractional factorial design resolution for screening is

therefore resolution IV since screening aims to remove factors (main effects) that

do not effect the responses. A resolution V design is preferable when resources

allow because it also tells us what second order effects are present without the

need for additional treatments and experiment runs.

It is informative to consider the two available resolution IV designs for 9 factors

in the next figure as examples of the importance of examining alias structure.

The 2(9-4) design requires 32 treatments while the 2(9-3) is more expensive with

64 treatments. The cheaper 2(9-4) design has 8 of its 9 main effects aliased with 3

third order interactions. The 2(9-3) design has only 4 of its 9 main effects aliased

with a single third-order interaction. The second order interactions are almost all

aliased in the more expensive 2(9-3) design but the aliasing is more favourable

than the cheaper 2(9-4) design. Resources permitting, the more expensive 2(9-3)

design is therefore more desirable for screening main effects.

Screening designs based on 2k factorials and fractional factorials can only pro-

duce linear models of a response because each factor appears at only two levels.

For a more complicated relationship between factors and response, a more sophis-

216


2(9-3) Resolution IV 2 (9-4) Resolution IVTerm Term Alias

1 [A] = A + DHJ [A] = A2 [B] = B [B] = B + CHJ + DGJ + EFJ3 [C] = C [C] = C + BHJ + DGH + EFH4 [D] = D + AHJ [D] = D + BGJ + CGH + EFG5 [E] = E [E] = E + BFJ + CFH + DFG6 [F] = F [F] = F + BEJ + CEH + DEG7 [G] = G [G] = G + BDJ + CDH + DEF8 [H] = H + ADJ [H] = H + BCJ + CDG + CEF9 [J] = J + ADH [J] = J + BCH + BDG + BEF10 [AB] = AB + CDG [AB] = AB + CDF + CEG + DEH + FGH11 [AC] = AC + BDG + EFH [AC] = AC + BDF + BEG + DEJ + FGJ12 [AD] = AD + HJ + BCG [AD] = AD + BCF + BEH + CEJ + FHJ13 [AE] = AE + CFH [AE] = AE + BCG + BDH + CDJ + GHJ14 [AF] = AF + CEH [AF] = AF + BCD + BGH + CGJ + DHJ15 [AG] = AG + BCD [AG] = AG + BCE + BFH + CFJ + EHJ16 [AH] = AH + DJ + CEF [AH] = AH + BDE + BFG + DFJ + EGJ17 [AJ] = AJ + DH [AJ] = AJ + CDE + CFG + DFH + EGH18 [BC] = BC + ADG + GHJ [BC] = BC + HJ + ADF + AEG19 [BD] = BD + ACG [BD] = BD + GJ + ACF + AEH20 [BE] = BE [BE] = BE + FJ + ACG + ADH21 [BF] = BF [BF] = BF + EJ + ACD + AGH22 [BG] = BG + ACD + CHJ [BG] = BG + DJ + ACE + AFH23 [BH] = BH + CGJ [BH] = BH + CJ + ADE + AFG24 [BJ] = BJ + CGH [BJ] = BJ + CH + DG + EF25 [CD] = CD + ABG + EFJ [CD] = CD + GH + ABF + AEJ26 [CE] = CE + AFH + DFJ [CE] = CE + FH + ABG + ADJ27 [CF] = CF + AEH + DEJ [CF] = CF + EH + ABD + AGJ28 [CG] = CG + ABD + BHJ [CG] = CG + DH + ABE + AFJ29 [CH] = CH + AEF + BGJ [DE] = DE + FG + ABH + ACJ30 [CJ] = CJ + BGH + DEF [DF] = DF + EG + ABC + AHJ31 [DE] = DE + CFJ [ABJ] = ABJ + ACH + ADG + AEF32 [DF] = DF + CEJ33 [DG] = DG + ABC34 [EF] = EF + ACH + CDJ35 [EG] = EG36 [EH] = EH + ACF37 [EJ] = EJ + CDF38 [FG] = FG39 [FH] = FH + ACE40 [FJ] = FJ + CDE41 [GH] = GH + BCJ42 [GJ] = GJ + BCH43 [ABE] = ABE + FGJ44 [ABF] = ABF + EGJ45 [ABH] = ABH + BDJ46 [ABJ] = ABJ + BDH + EFG47 [ACJ] = ACJ + CDH48 [ADE] = ADE + EHJ49 [ADF] = ADF + FHJ50 [AEG] = AEG + BFJ51 [AEJ] = AEJ + BFG + DEH52 [AFG] = AFG + BEJ53 [AFJ] = AFJ + BEG + DFH54 [AGH] = AGH + DGJ55 [AGJ] = AGJ + BEF + DGH56 [BCE] = BCE57 [BCF] = BCF58 [BDE] = BDE + FGH59 [BDF] = BDF + EGH60 [BEH] = BEH + DFG61 [BFH] = BFH + DEG62 [CEG] = CEG63 [CFG] = CFG

Alias

Figure A.3: Effects and alias chains for a 2 (9-3) resolution IV design and a 2(9-4) resolution IV design.

217


ticated design is required. These are called Response Surface designs.

A.3.3 Efficiency of Fractional Factorial Designs

The following figure makes explicit the huge savings in experiment runs when

using a fractional factorial design instead of a full factorial design.

Design Treatments% saving of treatments Design* Treatments

% saving of treatments

Full 512 Full 5312 (9-5) III 16 97 Half 275 502(9-4) IV 32 94 Quarter 147 752(9-3) IV 64 88 Min Run 65 912(9-2) VI 128 75 * FCC with 1 centre point

Figure A.4: Savings in experiment runs when using a fractional factorial design instead of a fullfactorial design. The savings for screening designs are on the left and the savings for response surfacedesigns are on the right. In both cases, fractional factorial designs offer enormous savings in numberof treatments over the full factorial alternative.

A.3.4 Response Surface Designs

There are several types of experiment design for building response surface models.

This research uses Central Composite Designs (CCD). A CCD contains an imbedded

factorial (or fractional factorial design). This is augmented with both centre points

and a group of so-called ‘star points’ that allow estimation of curvature. Let the

distance from the centre of the design space to a factorial point be ±1 unit for each

factor. Then, the distance from the centre of the design space to a star point is ±αwhere |α| > 1. The value of α depends on certain properties desired for the design

and on the number of factors involved. The number of centre point runs the design

is to contain also depends on certain properties required for the design. There are

three types of central composite design, illustrated in Figure A.5.

Factorial point

Centerpoint

Axial point

CCC FCC ICC

Figure A.5: Central composite designs for building response surface models. From left to right thesedesigns are the Circumscribed Central Composite (CCC), the Face-Centred Composite (FCC) and theInscribed Central Composite (ICC). The design space is represented by the shaded area. The factorialpoints are black circles and the star points are grey squares.

218


The designs differ in the location of their axial points. The choice of design

depends on the nature of the factors being experimented with.

• Circumscribed Central Composite (CCC). In this design, the star points

establish new extremes for the low and high settings for all factors. These

designs require 5 levels for each factor. Augmenting an existing factorial or

resolution V fractional factorial design with star points can produce this de-

sign.

• Inscribed Central Composite (ICC). For those situations in which the limits

specified for factor settings are truly limits, the ICC design uses the factor

settings as the star points and creates a factorial or fractional factorial design

within those limits. This design also requires 5 levels of each factor.

• Face-Centred Composite (FCC). In this design, the star points are at the

centre of each face of the factorial space, so alpha = ± 1. This design requires

just 3 levels of each factor.

An existing factorial or resolution V design from the screening stage can be aug-

mented with appropriate star points to produce the CCC and FCC designs. This is

not the case with the ICC and so it is less useful in the sequential experimentation

scenario (see Section 6.1).

Many of the tuning parameters encountered in ACO algorithms have a restric-

tive range of values they can take on. For example, the exploration/exploitation

threshold q0 (Section 2.4.6 on page 42) must be greater than or equal to 0 and less

than or equal to 1. The problem instance characteristics also have a restrictive

range imposed by the user. We cannot hope to model all possible instances and so

must restrict our instance characteristic ranges to those that will be encountered

in the application of the algorithm. The FCC is designed for scenarios where such

restrictions on factor ranges are enforced. Clearly, it is the most appropriate design

in the current ACO parameter tuning scenario. The FCC is used in all response

surface modelling in this thesis.

A.3.5 Prediction Intervals

A regression model from the response surface design is used to predict new values

of the response given values of the tuning parameter and problem characteristic

input variables. The model’s p% prediction interval is the range in which you can

expect any individual value from the actual heuristic to fall into p% of the time. The

prediction interval will be larger (a wider spread) than a confidence interval since

there is more scatter in individual values than in averages. Montgomery [84, p.

394-396] describes the mathematical formulation of prediction intervals and their

applications. In particular, prediction intervals should be used in confirmation

experiments to verify that models of the heuristic behaviour are correct. This thesis

is the first use in the heuristics literature of prediction intervals and independent

confirmation runs to verify conclusions.

219


A.3.6 Desirability functions

The concept of a desirability function can be briefly described as follows [84, 1].

The desirability function approach is a widely used industrial method for optimis-

ing multiple responses. The basic idea is that a process with many quality char-

acteristics is completely unacceptable if any of those characteristics are outside

some desired limits. For each response, Yi, a desirability function di(Yi) assigns a

number between 0 and 1 to the possible values of the response Yi. di(Yi) = 0 is a

completely undesirable value and di(Yi) = 1 is an ideal response value. These indi-

vidual k desirabilities are combined into an overall desirability D using a geometric

mean:

D = (d1(Y1)× d2(Y2)× . . .× dk(Yk))1/k (A.1)

A particular class of desirability function was proposed by Derringer and Suich

[40]. Let i and Ui be the lower limit and upper limits respectively of response i. Let

Ti be the target value. If the target value is a maximum then

di =

0 yi < Li(yi−Li

Ti−Li

)rLi ≤ yi ≤ Ti

1 yi > Ti

(A.2)

If the target is a minimum value then

di =

1 yi < Ti(Ui−yi

Ui−Ti

)rLi ≤ yi ≤ Ti

0 yi > Ui

(A.3)

The value r adjusts the shape of the desirability function. A value of r = 1 is

linear. A value of r > 1 increases the emphasis of being close to the target value.

A value of 0 < r < 1 decreases this emphasis. These cases are illustrated in the

following figure.

A.4 Experiment analysis

Experiment analysis is the steps one takes after designing an experiment and gath-

ering data. The analysis steps in this thesis are listed in Chapter 6 on methodology.

Some of these steps are covered in more detail here.

A.4.1 Stepwise regression.

Various techniques can be used to identify the most important terms that should

be included in a regression model. This thesis uses an automated approach called

stepwise regression. Usually, this takes the form of a sequence of F-tests, but

other techniques are possible. The 2 main stepwise regression approaches are:

1. Forward selection. This involves starting with no variables in the model,

220


Figure A.6: Individual desirability functions. On the left is a maximise function and on the right is aminimise function. Figure adapted from [84, p. 426].

trying out the variables one by one and including them if they are ’statistically

significant’.

2. Backward selection. This involves starting with all candidate variables and

testing them one by one for statistical significance, deleting any that are not

significant according to an alpha out value.

This thesis uses backward selection for the choice of terms in all its analyses

with an alpha out value of 0.1. There are several criticisms of stepwise regression

methods worth noting.

1. A sequence of F-tests is often used to control the inclusion or exclusion of

variables, but these are carried out on the same data and so there will be

problems of multiple comparisons for which many correction criteria have

been developed.

2. It is difficult to interpret the p-values associated with these tests, since each

is conditional on the previous tests for inclusion and exclusion.

Nonetheless, the accuracy of all models in this thesis is independently analysed

with confirmation runs. This should allay any engineer’s concerns over the use of

stepwise regression for metaheuristic screening and modelling.

A.4.2 ANOVA diagnostics

Once an Analysis of Variance (ANOVA) has been calculated, some diagnostics must

be examined to ensure that the assumptions on which ANOVA depends have not

been violated.

• Normality. A Normal Plot of Studentised Residuals should be approximately

a straight line. Deviations from this may indicate that a transformation of the

response is appropriate.

221


• Constant Variance. A plot of Studentised Residuals against predicted re-

sponse values should be a random scatter. Patterns such as a ‘megaphone’

may indicate the need for a transformation of the response.

• Time-dependent effects. A plot of Studentised Residuals against run order

should be a random scatter. Any trend indicates the influence of some time-

dependent nuisance factor that was not countered with randomisation.

• Model Fit. A plot of predicted values against actual response values will

identify particular treatment combinations that are not well predicted by the

model. Points should align along the 450 axis.

• Leverage and Influence. Leverage measures the influence of an individual

design point on the overall model. A plot of leverage for each treatment indi-

cates any problem data points.

• A plot of Cook’s distance against treatment measures how much the regres-

sion changes if a given case is removed from the model.

It is an open question how much a violation of these diagnostics invalidates

the conclusions from an ANOVA. Coffin and Saltzman [28] believe that the ANOVA

F-test is extremely robust to unequal variances provided that there is approxi-

mately the same number of observations in each treatment group. Diagnostics

were always examined and passed in the analyses of this thesis. Furthermore,

any concerns regarding these diagnostics are allayed with the use of independent

confirmation runs.

A.4.3 Response Transformation

If a model is correct and the assumptions are satisfied, the residuals should be

unrelated to any other variable, including the predicted response. This can be

verified by plotting the residuals against the fitted values. The plot should be un-

structured. However, sometimes nonconstant variance is observed. This is where

the variance of the observations increases as the magnitude of the observations in-

creases. The usual approach to dealing with this problem is to transform the data

and run the ANOVA on the transformed data. The next table gives some popular

transformations.

Name Equation

Logarithmic Y ′ = log10(Y + d)

Square root Y ′ =√Y + d

InverseSquare Root

Y ′ = 1/√

Y + d

Table A.2: Some common response transformations.

The appropriate transformation is chosen based on the shape of the data or, in

the case of this thesis, using an automated technique called a Box-Cox plot [21].

222


A.4.4 Outliers

An outlier is a data value that is much larger or smaller than the majority of the

experiment data. Outliers are important because they affect the ANOVA assump-

tions and can render conclusions from a statistical analysis invalid. Outliers are

easily identified with the ANOVA diagnostics. In this thesis, we take the approach

of deleting outliers. Some would disagree with this approach because outliers in

responses such as solution quality are not due to any random noise but are instead

actual repeatable data values. This is true but we must still somehow deal with the

outliers and make the data amenable to statistical analysis. If the proportion of

outliers deleted is reported then the reader can be assured that the outliers repre-

sented a very small proportion of the total data. If confirmation runs are reported

then the reader is reassured that the model was accurate despite the deletion of

outliers.

A.4.5 Dealing with Aliasing

Once a model has been successfully built and analysed, allowing for necessary

transformations and outliers, there may still be obstacles to interpreting the ANOVA

results. Some designs such as the fractional factorial reduce the number of ex-

periment runs required at the cost of some of the model’s effects being aliased.

Aliased effects are those effects that cannot be distinguished from one another. We

say that these effects form an alias chain. For example, if the main effect A and a

second order effect AB are aliased then we cannot tell whether it is A or AB that

contributes to the model. When a significant effect is aliased, several approaches

are available to the experimenter to determine the correct model term to which the

effect should be attributed [84, p. 289].

1. Engineering judgement. A first attempt is to use engineering judgement to

justify ignoring some terms in the alias. It may be known from experience

that one of the aliased effects is not important and can be discarded.

2. Ockham’s razor. Consider a 24−1IV experiment. It has four main effects that

we shall call A, B, C and D. Suppose that the significant main effects are A,

C and D and the significant aliased interactions are AC and AD, aliased as

follows [AC] → AC + AD and [AD] → AD + BC. The fact that AC and AD are

the interactions composed only of significant main effects, it is more likely

that these interactions are the significant interactions in the alias chains.

Montgomery [84] cites this as an application of Ockham’s razor—the simplest

explanation of the effect is most likely the correct one.

Failing the application of either of these two approaches, one must augment the

design to de-alias the significant effects.

3. Augment design. A foldover procedure is a methodical and efficient way to

introduce more treatments into a fractional design so that a particular effect

can be de-aliased. The foldover procedure produces double the number of

223


new treatments for which data must be gathered (once for each replicate).

The augmented design should foldover on those most significant model terms

that the experimenter wishes to de-alias

Once dealiasing has been completed, the experimenter can interpret significant

main effects and interactions.

A.4.6 Interpreting interactions

There are several possibilities when plotting two-factor interactions from a two-way

analysis of variance [29]. These possibilities are illustrated in the following figure.

There are two factors denoted by A and B and these factors were tested at two

levels A1, A2 and three levels B1, B2 and B3 respectively.

Interachons among Variables Analys~s of Vanance

120 , AreaBurned

slow medium fast

Wind S p e d

Figure 7.2 A plot of mean acreage lost at three wind speeds.

A1 A2 A1 A2

Figure 7.3 Examples of main effects and interaction effects in two-way analyses of variance. Figure A.7: Examples of possible main and interaction effects [29]. The possibilities are numbered 1to 6.

The interpretation of these possibilities is as follows.

• Example 1: There is a main effect for A represented by the increasing slope.

There is a main effect for B, represented by the vertical distance between

lines. There is no interaction AB since the lines are always parallel.

• Example 2: There is a main effect for A but now there is no main effect for

B since the lines are no longer separated by a vertical distance. There is no

interaction AB either.

• Example 3: There are no effects whatsoever.

224


• Example 4: There is a main effect for A represented by the increasing slopes.

There is no main effect B because the vertical distances between levels of B

are reversed at the two levels of A. There is an interaction between A and B.

• Example 5: There is no main effect for A as the slopes at different levels of

B cancel one another out. There is a main effect of B. There is an interaction

AB.

• Example 6: There is a main effect of A. There is a main effect of B. There is

also an interaction AB.

The presence of interactions clearly complicates an analysis because it means

that a main effect cannot be interpreted in isolation. The inability to detect interac-

tions is one of the most important shortcomings of the OFAT approach (Section 2.5

on page 47) and one of the main strengths of the DOE approach.

A.5 Hypothesis testing

Hypothesis testing (sometimes called significance testing) is an objective method

of making comparisons with a knowledge of the risks associated with reaching

the wrong conclusions. A statistical hypothesis is a conjecture about the problem

situation. One may conjecture, for example, that the mean heuristic performances

at two levels 1 and 2 of a factor are equal. This is written as:

H0 : µ1 = µ2

H1 : µ1 6= µ2

The first statement is the null hypothesis and the second statement is the alter-native hypothesis. A random sample of data is taken. A test statistic is computed.

The null hypothesis is rejected if the test statistic falls within a certain rejectionregion for the test. It is extremely important to emphasise that hypothesis testing

does not permit us to conclude that we accept the null hypothesis. The correct

conclusion is always either a rejection of the null hypothesis or a failure to reject

the null hypothesis.

The p-value is the probability of obtaining a test statistic that is at least as far as

the observed value from the value specified in the null hypothesis, where the null

hypothesis value is calculated under the assumption that the null hypothesis is

true. Smaller p-values indicate that the data are inconsistent with the assumption

that the null hypothesis is true

A.6 Error, Significance, Power and Replicates

Two types of error can be committed when testing hypotheses [84, p. 35]. If

the null hypothesis is rejected when it is actually true, then a Type I Error has

occurred. If the null hypothesis is not rejected when it is false then a Type II Errorhas occurred. These error probabilities are given special symbols

225


• α = P (Type I error) = P (reject H0|H0 true)

• β = P (Type II error) = P (fail to reject H0|H0 false)

In the context of Type II errors, it is more convenient to use the power of a test

where

Power = 1− β = P (reject H0| H0 false)

It is therefore desirable to have a test with a low α and a high power. The

probability of a Type I Error is often called the significance level of a test. The

particular significance level depends on the requirements of the experimenter and,

in a research context, on the conventional acceptable level. Unfortunately, with so

little adaptation of statistical methods to the analysis of heuristics, there are few

guidelines on what value to choose. Norvig cites a value as low as 0.0000001% in

research work at Google [87]. All experiments in this thesis use a level of either 1%

or 5%. The power of a test is usually set to 80% by convention. The reason for this

choice is due to diminishing returns. It requires an exponentially increasing num-

ber of replicates to increase power beyond about 80% and there is little advantage

to the additional power this confers.

Miles [82] describes the relationship between significance level, effect size, sam-

ple size and power using an analogy with searching.

• Significance Level: This is the probability of thinking we have found some-

thing when it is not really there. It is a measure of how willing we are to risk

a Type I error.

• Effect Size: The size of the effect in the population. The bigger it is, the

easier it will be to find. This is a measure of the practical significance of a

result, preventing us claiming a statistically significant result that has little

consequence [101].

• Sample size: A larger sample size leads to a greater ability to find what we

were looking for. The harder we look, the more likely we are to find it.

The critical point regarding this relationship is that what we are looking for

is always going to be there—it might just be there in such small quantities that

we are not bothered about finding it. Conversely, if we look hard enough, we are

guaranteed to find what we are looking for. Power analysis allows us to make sure

that we have looked reasonably hard enough to find it. A typical experiment design

approach is to agree the significance level and choose an effect size based on prac-

tical experience and experiment goals. Given these constraints, the sample size

is increased until sufficient power is reached. If a response has a high variability

then a larger sample size will be required.

Different statistical tests and different experiment designs involve different power

calculations. These calculations can become quite involved and the details of their

calculation are beyond the scope of this thesis. Power calculations are supplied

226


with most good quality statistical analysis software. Some are even provided on-

line [76]. Power considerations have had limited exposure in the heuristics field

[28] but play a strong role in this thesis.

A.6.1 Power work up procedure

Sufficient power is achieved with a so-called work-up procedure [36]. This is an

iterative procedure whereby data is calculated for a design with a number of repli-

cates, power is calculated and replicates are added if sufficient power was not

achieved. This process repeats until sufficient power is reached. The work up

procedure is an efficient way to ensure the experiment has enough power without

wasting resources on unnecessary replicates.

227

BTSPLIB Statistics

This appendix reports statistics and plots of some TSPLIB [102] instances. The in-

stances are symmetric Euclidean instances as described in Section 2.2 on page 30.

These statistics and plots are referenced in the conclusions of Chapter 7.

The table gives some descriptive statistics of the symmetric Euclidean instances.

All instances have approximately the same ratio of standard deviation to mean.

Figure B.2 on the next page to Figure B.4 on page 231 are histograms illustrating

the normalised frequency of normalised edge lengths in several of the symmetric

Euclidean TSP instances. All histograms have a shape that can be represented by

a Log-Normal Distribution.

229

APPENDIX B. TSPLIB STATISTICS

Instance Standard Deviation Mean CoefficientOliver30 21.08 43.93 0.48kroA100 916.04 1710.70 0.54kroB100 912.90 1687.54 0.54kroC100 910.74 1700.55 0.54kroD100 867.21 1631.10 0.53kroE100 933.55 1732.15 0.54eil101 16.35 33.92 0.48lin105 670.85 1177.35 0.57pr107 3105.26 5404.24 0.57pr124 2848.45 5623.35 0.51bier127 3082.05 4952.47 0.62ch130 169.98 356.22 0.48pr136 2945.43 6073.99 0.48pr144 2813.44 5639.51 0.50ch150 169.35 359.31 0.47kroA150 919.03 1717.35 0.54kroB150 922.40 1711.61 0.54pr152 3668.43 6914.83 0.53kroA200 917.36 1701.17 0.54kroB200 892.36 1664.18 0.54ts225 3321.76 7080.03 0.47tsp225 95.21 183.58 0.52pr226 3708.92 7503.01 0.49gil262 48.25 101.92 0.47pr264 2557.95 4248.45 0.60pr1002 3160.87 6435.61 0.49vm1084 4149.34 7907.79 0.52rl1304 3670.93 7190.12 0.51rl1323 3724.98 7403.58 0.50nrw1379 543.67 1032.34 0.53vm1748 4235.17 8548.22 0.50rl1889 4012.50 7834.80 0.51pr2392 3125.49 6374.92 0.49

Figure B.1: Some descriptive statistics for the symmetric Euclidean instances in TSPLIB. Instancesare presented in order of increasing size. The columns are the standard deviations of edge lengths, themean edge lengths and the ratio of standard deviation to mean.

bier127.tsp

00.10.20.30.40.50.60.70.80.9

1

0.00 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90

Normalised Edge Length

Nor

mal

ised

Fre

quen

cy

Figure B.2: Histogram of normalised frequency of normalised edge lengths of the bier127 TSPLIBinstance.

230

APPENDIX B. TSPLIB STATISTICS

Oliver30.tsp

00.10.20.30.40.50.60.70.80.9

1

0.00 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90

Normalised Edge Length

Nor

mal

ised

Fre

quen

cy

Figure B.3: Histogram of normalised frequency of normalised edge lengths of the Oliver30 TSPLIBinstance.

pr1002.tsp

00.10.20.30.40.50.60.70.80.9

1

0.00 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90Normalised Edge Length

Nor

mal

ised

Fre

quen

cy

Figure B.4: Histogram of normalised frequency of normalised edge lengths of the pr1002 TSPLIBinstance.

231

CCalculation of Average Lambda

Branching Factor

Average lambda branching factor was first introduced as a descriptive statistic for

ACS performance [46] but is now an integral part of trail reinitialisation in the

MMAS daemon actions (Section 2.4.5 on page 39). It is discussed in detail in the

literature [47, p. 87]. The following figure represents its calculation in pseudocode

adapted from JACOTSP.

Broadly, the calculation can be described as follows. For every city in the TSP,

go through the city’s candidate list calculating a branching factor cut-off. For each

edge in the city’s candidate list, count the edges that are above this cut-off value.

Average the counts for every city in the TSP.

If we let the TSP size be n and we let the candidate list length for all cities be cl

then the complexity of the calculation is given by:

O(n · 2cl + n) ≈ O(2n) if cl� n (C.1)

Clearly the calculation is expensive and this is exacerbated if the candidate list

length approaches the problem size. However, this expense does not show when

CPU time is not recorded.

The expense of the calculation is mitigated in three ways:

• The branching factor is often not computed at each iteration.

• The complexity is the same as that of construction of one solution by an ant.

• The candidate list length cl is typically very small.

233

APPENDIX C. CALCULATION OF AVERAGE LAMBDA BRANCHING FACTOR

1

/**

* Method to calculate the average lambda branching factor.

*/

public double computeAverageBranchingFactor()

{

/*An array to store number of branches from each TSP city*/

double[] num_branches = new double[tspSize];

/*O(tspSize)*/

for (each aCity in TSP)

{

/*O(cl)*/

final double cutoff = calculateCutOffForLambdaBranchingFactorForACity(aCity);

/*O(cl)*/

num_branches[aCity] = countEdgesAboveCutOffFromCity(aCity, cutoff);

}

/*O(tspSize)*/

final double averageNumberOfBranches = calculateAverageOf(num_branches[tspSize]);

double result = averageNumberOfBranches / (tspSize * 2);

return result;

}

Figure C.1: Pseudocode for the calculation of the average lambda branching factor.

234

DExample OFAT Analysis

This appendix reports a One-Factor-at-A-Time (OFAT) approach to tuning a single

parameter from the ACS heuristic (Section 2.4 on page 34). The particular param-

eter is alpha, which plays a role in an artificial ant’s solution building decisions.

D.1 Motivation

The ACS screening study of Chapter 8 and the ACS tuning study of Chapter 9 both

reported that alpha did not have a statistically significant effect on either solution

quality or solution time. This is an interesting and important result because it is

intuitively unexpected and contradicts the accepted view of the importance of alpha

[96]. The methods and experiment designs introduced in this thesis are new to the

ACO field and the metaheuristics field in general. It is of interest, therefore, to

attempt the same analysis, in so far as is possible, using a more familiar empirical

technique called the One-Factor-at-A-Time (OFAT) approach. An OFAT approach

involves taking one of the algorithm tuning parameters and allowing it to vary

while all other tuning parameters are held fixed at some other values. The free

parameter is tuned until performance is maximised. The procedure then moves

on to another of the tuning parameters, allowing it to vary while all others are

held fixed. This study applies an OFAT analysis to the alpha tuning parameter.

The aim of the study is to determine whether an OFAT approach will lead to a

different conclusion from the DOE approach. It is not an endorsement of the OFAT

approach, the demerits of which were highlighted in Section 2.5 on page 47. In

keeping with the thesis’ strong emphasis on experimental rigor, the OFAT analysis

is conducted with a designed experiment and supporting statistical analyses.

235

APPENDIX D. EXAMPLE OFAT ANALYSIS

D.2 Method

D.2.1 Response Variables

Two responses were measured, relative error from a known optimum and elapsed

solution time, as per Section 6.7 on page 125.

D.2.2 Factors, Levels and ranges

Design Factors

Being an OFAT analysis, there was 1 design factor. This factor was the alpha

tuning parameter for the ACS algorithm, described in Section 2.4 on page 34.

Alpha was set at the following five levels: 1, 3, 5, 7, 12.


The held constant factors are as per Section 6.7.6 on page 127. There were ad-

ditional held-constant factors required of the OFAT approach. All other tuning

parameters were fixed at 6 different settings. These settings came from the desir-

ability optimisation results from the full ACS response surface model, given in the

following figure. ACS_RSM_All_R5_relErr-Time_05.dx7_desirability.xls

Size

StD

ev

beta

ants

Frac

tion

nnFr

actio

n

q0 rho

rhoL

ocal

solu

tionC

onst

ruct

ion

antP

lace

men

t

pher

omon

eUpd

ate

Tim

e

Rel

ativ

e Er

ror

Des

irabi

lity

400 10 4 1.00 1.00 0.99 0.11 0.81 parallel random bestSoFar 2.42 0.46 0.92400 40 6 2.19 1.16 0.97 0.99 0.03 parallel random bestOfItera 2.83 1.33 0.82400 70 11 1.61 20.00 0.98 0.01 0.07 parallel random bestOfItera 4.92 2.59 0.73500 10 3 1.13 1.00 0.99 0.86 0.01 parallel same bestOfItera 4.88 0.38 0.88500 40 7 1.00 1.00 0.99 0.99 0.48 parallel random bestSoFar 4.25 1.35 0.80500 70 10 1.04 19.78 0.99 0.05 0.01 parallel same bestOfItera 9.24 2.54 0.70

1 of 1

Figure D.1: Fixed parameter settings for the OFAT analysis. These are reproduced from the results ofthe desirability optimisation of the ACS full response surface model. The response predictions from thetuning have also been included.

Note that in practice, one may not have access to these tuned parameter set-

tings. A researcher conducting an OFAT analysis without any prior knowledge

would have no guidelines on the values at which the other parameters should be

fixed.

D.2.3 Instances



236




For each OFAT analysis, a single problem instance was used. These were the same

instances used in the ACS tuning case study.

D.2.4 Experiment design, power and replicates

The experiment design for each of the OFAT analyses is a single factor 5-level

factorial. All 5 treatments were replicated 10 times. A work up procedure was not

needed in this case.

The next figure gives the descriptive statistics for the collected data and the

actual detectable effect size for the quality and time responses with a significance

level of 5% and a power of 80%.

Problem size

Problem StDev Range Min Max Mean

Std. Dev.

400 10 0.30 0.74 1.04 0.89 0.07400 40 1.72 3.29 5.01 4.14 0.43400 70 6.30 4.17 10.47 6.73 1.73500 10 0.35 0.62 0.97 0.80 0.09500 40 1.55 3.25 4.79 3.91 0.36500 70 4.52 6.03 10.55 8.11 1.25

Problem size

Problem StDev Range Min Max Mean

Std. Dev.

400 10 3.67 1.52 5.19 2.95 0.96400 40 6.05 2.50 8.55 4.13 1.30400 70 13.15 4.38 17.53 8.20 3.32500 10 4.03 1.97 6.00 3.15 0.89500 40 7.97 2.60 10.56 4.97 1.58500 70 31.83 4.63 36.45 14.08 7.67

Figure D.2: Descriptive statistics for the six OFAT analyses. Relative Error is reported above and Timebelow. There are six combinations of problem size and problem standard deviation.

D.2.5 Performing the experiment

Responses were measured at a 250 iteration stagnation stopping criterion.

Available computational resources necessitated running experiments across a

variety of similar machines. Runs were executed in a randomised order across

these machines to counteract any uncontrollable nuisance factors. The experi-

mental machines are benchmarked as per Section 5.3 on page 100.

D.3 Analysis

D.3.1 ANOVA


sponses was required for some of the analyses. These transformations were a

log10, inverse or inverse square root.

237


No outliers were detected. The models passed the usual ANOVA diagnostics for

the ANOVA assumptions of model fit, normality, constant variance, time-dependent

effects, and leverage (Section A.4.2 on page 221).

D.4 Results

The next figure summarises each of the OFAT analyses for relative error and time.

It reports the detectable effect size for a significance threshold of 5% and a power

of 80% and the statistical significance result.

Problem size

Problem StDev

St Devs at 80% power

Effect size

ANOVAsig?

Effect size

ANOVAsig?

400 10 2 0.14 No 1.91 No400 40 0.35 0.15 Yes 0.45 Yes400 70 0.35 0.61 Yes 1.16 Yes500 10 0.4 0.04 Yes 0.35 Yes500 40 2 0.71 No 3.16 Yes500 70 2 2.50 Yes 15.34 No

Relative Error Time

Figure D.3: Summary of results from the six OFAT analyses. The figure gives the detectable effect sizein terms of both the number of standard deviations and the actual response units for the relative errorresponse and the time response.

Some of the analysis showed a statistically significant effect for alpha on the

responses of solution quality and solution time. Note that the detectable effect

sizes are small relative to those of the screening and tuning case studies. This

is due to the lower variability in the responses when varying only a single tuning

parameter.

The following figures show the plots of the relative error response for the OFAT

analyses. Each plot shows the five levels of alpha on the horizontal axis and in-

cludes a 95% Fisher’s Least Significant Difference interval [84, p. 96].

Alpha had a statistically significant effect on solution quality in four out of

the six experiments. An examination of the range over which average relative error

varied in these significant case shows that the largest difference was approximately

3.9% for problems of size 400 and standard deviation 70 and approximately 0.1%

for problems of size 500 and standard deviation 10.

All analyses except that of size 400 and standard deviation of 70 recommended

an alpha value of 1 to minimise relative error.

D.5 Conclusions and discussion

We draw the following conclusion from these results. For ACS, with all tuning

parameters except alpha set to the values in Figure D.1 on page 236:

• alpha has a statistically significant effect on solution quality for instances

with a size and standard deviation combination of 400-40, 400-70, 500-10

and 500-70.

238


Figure D.4: Plot of the effect of alpha on relative error for a problem with size 400 and standarddeviation 10. Alpha was not statistically significant in this case.

Figure D.5: Plot of the effect of alpha on relative error for a problem with size 400 and standarddeviation 40. Alpha was statistically significant in this case.

239




240


Figure D.8: Plot of the effect of alpha on relative error for a problem with size 500 and standarddeviation 40. Alpha was not statistically significant in this case.


241


• Apart from one anomalous result, an alpha value of 1 is recommended to

minimise relative error.

The first of these conclusions appears to contradict the results of Chapters 8

and 9. These concluded that alpha had a relatively unimportant effect on solution

quality. However, there are several important differences between the previous

experiments and the current OFAT analysis.

Firstly, the fractional factorial and response surface designs experimented with

many more factors. This resulted in a larger variability in the response, as listed in

the descriptive statistics of Figure 9.1 on page 155, for example. The OFAT anal-

ysis, varying only alpha and conducted on a single instance, had a much smaller

variance in its response measurements, as listed in Figure D.2 on page 237. The

consequence is that the OFAT analysis could detect much smaller effects for a

given significance level and power than the fractional factorial screening and the

response surface. This does not mean that OFAT is a better approach than DOE.

The OFAT conclusions are more accurate in their context. This context is the par-

ticular fixed values of the other parameter settings and a single problem instance.

As discussed in Section 2.5, the OFAT analysis tells us nothing about interac-

tions and is inefficient in comparison to DOE approaches in terms of the infor-

mation gained from a given number of experiments. Most importantly, for some

response surface shapes, an incorrect OFAT starting point can lead to incorrect

tuning recommendations. Unfortunately, since the response surface shape cannot

be deduced with OFAT, the experimenter does not know if these incorrect tuning

recommendations are being made. The only safe option in this case is to use a

DOE approach.

242

References

[1] NIST/SEMATECH Engineering Statistics Handbook, 2006.

[2] ADENSO-DIAZ, B., AND LAGUNA, M. Fine-Tuning of Algorithms Using Frac-tional Experimental Designs and Local Search. Operations Research 54, 1(2006), 99–114.

[3] AMINI, M. M., AND RACER, M. A Rigorous Computational Comparison ofAlternative Solution Methods for the Generalized Assignment Problem. Man-agement Science 40, 7 (1994), 868–890.

[4] ANDERSON, V. L., AND MCLEAN, R. A. Design of experiments: a realisticapproach. M. Dekker Inc., New York, 1974.

[5] APPLEGATE, D., BIXBY, R., CHVATAL, V., AND COOK, W. Implementingthe Dantzig-Fulkerson-Johnson algorithm for large traveling salesman prob-lems. Mathematical Programming Series B 97, 1-2 (2003), 91–153.

[6] BANG-JENSEN, J., CHIARANDINI, M., GOEGEBEUR, Y., AND JØRGENSEN, B.Mixed Models for the Analysis of Local Search Components. In EngineeringStochastic Local Search Algorithms. Designing, Implementing and AnalyzingEffective Heuristics, T. Stutzle, M. Birattari, and H. Hoos, Eds., vol. 4638.Springer, Berlin / Heidelberg, 2007, pp. 91–105.

[7] BARR, R. S., GOLDEN, B. L., KELLY, J. P., RESENDE, M. G. C., AND STEW-ART, W. R. Designing and Reporting on Computational Experiments withHeuristic Methods. Journal of Heuristics 1 (1995), 9–32.

[8] BARTZ-BEIELSTEIN, T. Experimental Research in Evolutionary Computation.The New Experimentalism. Natural Computing Series. Springer, 2006.

[9] BARTZ-BEIELSTEIN, T., AND PREUSS, M. Experimental Research in Evolu-tionary Computation. Tutorial at the genetic and evolutionary computationconference, June 2005.

[10] BAUTISTA, J., AND PEREIRA, J. Ant Algorithms for Assembly Line Balanc-ing. In Proceedings of the Third International Workshop on Ant Algorithms,M. Dorigo, G. D. Caro, and M. Sampels, Eds., vol. 2463 of Lecture Notes inComputer Science. Springer, 2002, p. 65.

[11] BENTLEY, J. L. Fast algorithms for the geometric traveling salesman prob-lem. ORSA Journal on Computing 4 (1992), 387–411.

[12] BIRATTARI, M. The Problem of Tuning Metaheuristics. Phd, Universite Librede Bruxelles, 2006.

[13] BIRATTARI, M., AND DORIGO, M. How to assess and report the performanceof a stochastic algorithm on a benchmark problem: Mean or best result on anumber of runs? Optimization Letters (2006).

[14] BIRATTARI, M., STUTZLE, T., PAQUETE, L., AND VARRENTRAPP, K. A RacingAlgorithm for Configuring Metaheuristics. In GECCO 2002: Proceedings ofthe Genetic and Evolutionary Computation Conference, New York, USA, W. B.Langdon, E. Cant-Paz, K. E. Mathias, R. Roy, D. Davis, R. Poli, K. Balakr-ishnan, V. Honavar, G. Rudolph, J. Wegener, L. Bull, M. A. Potter, A. C.

243

REFERENCES

Schultz, J. F. Miller, E. K. Burke, and N. Jonoska, Eds. Morgan Kaufmann,2002, pp. 11–18.

[15] BLUM, C. Ant colony optimization: Introduction and recent trends. Physicsof Life Reviews 2, 4 (2005), 353–373.

[16] BLUM, C., AND ROLI, A. Metaheuristics in combinatorial optimization:Overview and conceptual comparison. ACM Computing Surveys 35, 3 (2003),268–308.

[17] BLUM, C., AND SAMPELS, M. An ant colony optimization algorithm for shopscheduling problems. Journal of Mathematical Modelling and Algorithms 3, 3(2004), 285–308.

[18] BOOCH, G. Object-oriented Analysis and Design with Applications, second ed.The Benjamin/Cummings Publishing Company, Inc., 1994.

[19] BOTEE, H. M., AND BONABEAU, E. Evolving Ant Colony Optimization. Ad-vances in Complex Systems 1 (1998), 149–159.

[20] BOX, G. E. P. Sequential experimentation and sequential assembly of de-signs. Quality Engineering 5, 2 (1992), 321–330.

[21] BOX, G. E. P., AND COX, D. R. An Analysis of Transformations. Journal ofthe Royal Statistical Society Series B (Methodological) 26, 2 (1964), 211–252.

[22] BREEDAM, A. V. Improvement Heuristics for the Vehicle Routing ProblemBased on Simulated Annealing. European Journal of Operations Research86, 3 (1995), 480–490.

[23] BULL, J. M., SMITH, L. A., BALL, C., POTTAGE, L., AND FREEMAN, R. Bench-marking Java against C and Fortran for scientific applications. Concurrencyand Computation: Practice and Experience 15, 3-5 (2003), 417–430.

[24] BULLNHEIMER, B., HARTL, R. F., AND STRAUSS, C. A New Rank Based Ver-sion of the Ant System: A Computational Study. Central European Journalfor Operations Research and Economics 7, 1 (1999), 25–38.

[25] BULLNHEIMER, B., HARTL, R. F., AND STRAUSS, C. An Improved Ant SystemAlgorithm for the Vehicle Routing Problem. Annals of Operations Research89 (1999), 319–328.

[26] CHEESEMAN, P., KANEFSKY, B., AND TAYLOR, W. M. Where the Really HardProblems Are. In Proceedings of the Twelfth International Conference on Ar-tificial Intelligence, vol. 1. Morgan Kaufmann Publishers, Inc., USA, 1991,pp. 331–337.

[27] CHIARANDINI, M., PAQUETE, L., PREUSS, M., AND RIDGE, E. Experimentson Metaheuristics: Methodological Overview and Open Issues. Tech. Rep.IMADA-PP-2007-04, Institut for Matematik og Datalogi, University of South-ern Denmark, 20 March.

[28] COFFIN, M., AND SALTZMAN, M. J. Statistical Analysis of ComputationalTests of Algorithms and Heuristics. INFORMS Journal on Computing 12, 1(2000), 24–44.

[29] COHEN, P. R. Empirical Methods for Artificial Intelligence. The MIT Press,Cambridge, Massachusetts, 1995.

[30] COLORNI, A., DORIGO, M., MAFFIOLI, F., MANIEZZO, V., RIGHINI, G., ANDTRUBIAN, M. Heuristics from Nature for Hard Combinatorial Problems. In-ternational Transactions in Operational Research 3, 1 (1996), 1–21.

[31] CORDN, O., FERNANDEZ, I., HERRERA, F., AND MORENO, L. A New ACOModel Integrating Evolutionary Computation Concepts: The Best-Worst AntSystem. In Proceedings of ANTS’2000. From Ant Colonies to Artificial Ants:Second Interantional Workshop on Ant Algorithms, Brussels, Belgium, Septem-ber 7-9, 2000. 2000, pp. 22–29.

244

REFERENCES

[32] COSTA, D., AND HERTZ, A. Ants Can Colour Graphs. The Journal of theOperational Research Society 48, 3 (1997), 295–305.

[33] COY, S., GOLDEN, B., RUNGER, G., AND WASIL, E. Using Experimental De-sign to Find Effective Parameter Settings for Heuristics. Journal of Heuristics7, 1 (2001), 77–97.

[34] CROWDER, H. P., DEMBO, R. S., AND MULVEY, J. M. Reporting Computa-tional Experiments in Mathematical Programming. Mathematical Program-ming 15 (1978), 316–329.

[35] CROWDER, H. P., DEMBO, R. S., AND MULVEY, J. M. On Reporting Com-putational Experiments with Mathematical Software. ACM Transactions onMathematical Software 5, 2 (1979), 193–203.

[36] CZARN, A., MACNISH, C., VIJAYAN, K., TURLACH, B., AND GUPTA, R. Sta-tistical Exploratory Analysis of Genetic Algorithms. IEEE Transactions onEvolutionary Computation 8, 4 (2004), 405–421.

[37] CZITROM, V. One-Factor-at-a-Time versus Designed Experiments. The Amer-ican Statistician 53, 2 (1999), 126–131.

[38] DEN BESTEN, M. L. Simple Metaheuristics for Scheduling: An empirical inves-tigation into the application of iterated local search to deterministic schedulingproblems with tardiness penalties. Phd, Germany.

[39] DEN BESTEN, M. L., STUTZLE, T., AND DORIGO, M. Ant colony optimizationfor the total weighted tardiness problem. In Proceedings of PPSN-VI, sixthinternational conference on parallel problem solving from nature, M. Schoe-nauer, K. Deb, G. Rudolph, X. Yao, E. Lutton, J. J. Merelo, and H.-P. Schwe-fel, Eds., vol. 1917 of Lecture Notes in Comput Science. Springer, Berlin, 2000,pp. 611–620.

[40] DERRINGER, G., AND SUICH, R. Simultaneous Optimization of Several Re-sponse Variables. Journal of Quality Technology 12, 4 (1980), 214–219.

[41] DOERR, B., NEUMANN, F., SUDHOLT, D., AND WITT, C. On the RuntimeAnalysis of the 1-ANT ACO Algorithm. In Proceedings of the Genetic andEvolutionary Computation Conference, vol. 1. ACM, 2007, pp. 33–40.

[42] DORIGO, M., AND BLUM, C. Ant colony optimization theory: A survey. Theo-retical Computer Science 344, 2-3 (2005), 243–278.

[43] DORIGO, M., AND CARO, G. D. The Ant Colony Optimization Meta-Heuristic.In New Ideas in Optimization, D. Corne, M. Dorigo, F. Glover, D. Dasgupta,P. Moscato, R. Poli, and K. V. Price, Eds., Mcgraw-Hill’S Advanced Topics InComputer Science. McGraw-Hill, 1999, pp. 11–32.

[44] DORIGO, M., AND COLORNI, A. The Ant System: Optimization by a colonyof cooperating agents. IEEE Transactions on Systems, Man, and CyberneticsPart B 26, 1 (1996), 1–13.

[45] DORIGO, M., AND GAMBARDELLA, L. M. Ant Colonies for the Travelling Sales-man Problem. BioSystems 43, 2 (1997), 73–81.

[46] DORIGO, M., AND GAMBARDELLA, L. M. Ant Colony System: A CooperativeLearning Approach to the Traveling Salesman Problem. IEEE Transactionson Evolutionary Computation 1, 1 (1997), 53–66.

[47] DORIGO, M., AND STUTZLE, T. Ant Colony Optimization. The MIT Press,Massachusetts, USA, 2004.

[48] EIBEN, A., AND JELASITY, M. A critical note on experimental researchmethodology in EC. In Proceedings of the 2002 IEEE Congress on Evolu-tionary Computation. IEEE, 2002, pp. 582–587.

245

REFERENCES

[49] FISCHER, T., STUTZLE, T., HOOS, H., AND MERZ, P. An Analysis Of TheHardness Of TSP Instances For Two High Performance Algorithms. In Pro-ceedings of the Sixth Metaheuristics International Conference. 2005, pp. 361–367.

[50] GAERTNER, D., AND CLARK, K. L. On Optimal Parameters for Ant ColonyOptimization Algorithms. In Proceedings of the 2005 International Conferenceon Artificial Intelligence, vol. 1. CSREA Press, 2005, pp. 83–89.

[51] GAGNE, C., PRICE, W. L., AND GRAVEL, M. Comparing an ACO algo-rithm with other heuristics for the single machine scheduling problem withsequence-dependent setup times. Journal of the Operational Research Soci-ety 53 (2002), 895–906.

[52] GAMBARDELLA, L. M., AND DORIGO, M. HAS-SOP: hybrid Ant System for theSequential Ordering Problem. Tech. Rep. IDSIA-11-97, IDSIA, 19 April.

[53] GAMBARDELLA, L. M., AND DORIGO, M. HAS-SOP: An Ant Colony SystemHybridized with a New Local Search for the Sequential Ordering Problem.INFORMS Journal on Computing 12, 3 (2000), 237–255.

[54] GANDIBLEUX, X., DELORME, X., AND T’KINDT, V. An Ant Colony Optimi-sation Algorithm for the Set Packing Problem. In Proceedings of the FourthInternational Workshop on Ant Colony Optimization and Swarm Intelligence,M. Dorigo, M. Birattari, C. Blum, L. M. Gambardella, F. Mondada, andT. Stutzle, Eds., vol. 3172 of Lecture Notes in Computer Science. 2004, pp. 49–60.

[55] GAREY, M. R., AND JOHNSON, D. S. Computers and Intractability : A Guideto the Theory of NP-Completeness. Books in the Mathematical Sciences. W.H. Freeman, 1979.

[56] GLOVER, F. Tabu Search - Part I. ORSA Journal on Computing 1, 3 (1989),190–206.

[57] GOLDBERG, D. E. Genetic algorithms in search, optimization, and machinelearning. Addison-Wesley Publishing Company, Inc., 1989.

[58] GOLDWASSER, M., JOHNSON, D. S., AND MCGEOCH, C. C., Eds. Proceedingsof the Fifth and Sixth DIMACS Implementation Challenges. American Mathe-matical Society, 2002.

[59] GOTTLIEB, J., PUCHTA, M., AND SOLNON, C. A Study of Greedy, LocalSearch, and Ant Colony Optimization Approaches for Car Sequencing Prob-lems. In Proceedings of EvoWorkshops 2003: Applications of EvolutionaryComputing, S. Cagnoni, J. R. Cardalda, D. Corne, J. Gottlieb, A. Guillot,E. Hart, C. Johnson, E. Marchiori, J.-A. Meyer, M. Middendorf, and G. Raidl,Eds., vol. 2611 of Lecture Notes in Computer Science. Springer, Berlin, 2003,pp. 246–257.

[60] GREENBERG, H. Computational Testing: Why, how and how much? ORSAJournal on Computing 2, 1 (1990), 94–97.

[61] GUNTSCH, M., AND MIDDENDORF, M. Pheromone Modification Strategies forAnt Algorithms Applied to Dynamic TSP. In Proceedings of EvoWorkshops2001: Applications of Evolutionary Computing, E. J. W. Boers, J. Gottlieb,P. L. Lanzi, R. E. Smith, S. Cagnoni, E. Hart, G. R. Raidl, and H. Tijink,Eds., vol. 2037 of Lecture Notes in Computer Science. Springer, Berlin, 2001,p. 213.

[62] GUNTSCH, M., AND MIDDENDORF, M. Applying Population Based ACO to Dy-namic Optimization Problems. In Ant Algorithms : Third International Work-shop,, . Dorigo, G. D. Caro, and M. Sampels, Eds., vol. 2463 of Lecture Notesin Computer Science. Springer, 2002, p. 111.

246

REFERENCES

[63] HELSGAUN, K. An effective implementation of the Lin-Kernighan travelingsalesman heuristic. European Journal of Operational Research 126, 1 (2000),106–130.

[64] HOOKER, J. N. Needed: An Empirical Science of Algorithms. OperationsResearch 42, 2 (1994), 201–212.

[65] HOOKER, J. N. Testing heuristics: We have it all wrong. Journal of Heuristics1 (1996), 33–42.

[66] HOOS, H., AND STUTZLE, T. Stochastic Local Search, Foundations and Appli-cations. Morgan Kaufmann, 2004.

[67] HYBARGER, J. The Ten Most Common Designed Experiment Mistakes. StatTeaser (December 2006).

[68] JAIN, R. The art of computer systems performance analysis: techniques forexperimental design, measurement, simulation and modeling. John Wiley andSons Inc., 1991.

[69] JOHNSON, D. S. A Theoretician’s Guide to the Experimental Analysis of Al-gorithms. In Proceedings of the Fifth and Sixth DIMACS Implementation Chal-lenges, M. Goldwasser, D. S. Johnson, and C. C. McGeoch, Eds. AmericanMathematical Society, 2002, pp. 215–250.

[70] JOHNSON, D. S., AND PAPADIMITRIOU, C. H. Computational Complexity. InThe Traveling Salesman Problem, E. L. Lawler, J. K. Lenstra, A. H. G. R.Kan, and D. B. Shmoys, Eds., Wiley Series in Discrete Mathematics andOptimization. John Wiley and Sons, 1995, pp. 37–85.

[71] KAPTCHUK, T. J. Effect of interpretive bias on research evidence. BritishMedical Journal 326 (2003), 1453–1455.

[72] KERIEVSKY, J. Refactoring to Patterns. The Addison-Wesley Signature Series.Addison-Wesley, 2005.

[73] KIRKPATRICK, S., GELATT, C. D., AND VECCHI, M. P. Optimization by Simu-lated Annealing. Science 220, 4598 (1983), 671–680.

[74] KRAMER, O., GLOGER, B., AND GOEBELS, A. An experimental analysis ofevolution strategies and particle swarm optimisers using design of experi-ments. In Proceedings of the Genetic and Evolutionary Computation Confer-ence. ACM, 2007, pp. 674–681.

[75] LAWLER, E. L., LENSTRA, J. K., KAN, A. H. G. R., AND SHMOYS, D. B., Eds.The Traveling Salesman Problem - A Guided Tour of Combinatorial Optimiza-tion. Wiley Series in Discrete Mathematics and Optimization. John Wiley andSons, New York, USA.

[76] LENTH, R. V. Java Applets for Power and Sample Size. 2006.

[77] MANIEZZO, V., AND COLORNI, A. The Ant System Applied to the QuadraticAssignment Problem. IEEE Transactions on Knowledge and Data Engineering11, 5 (1999), 769–778.

[78] MARON, O., AND MOORE, A. Hoeffding races: Accelerating model selectionsearch for classification and function approximation. Advances in NeuralInformation Processing Systems 6 (1994), 59–66.

[79] MCGEOCH, C. C. Toward an experimental method for algorithm simulation.INFORMS Journal on Computing 8, 1 (1996), 1–15.

[80] MERKLE, D., MIDDENDORF, M., AND SCHMECK, H. Ant colony optimizationfor resource-constrained project scheduling. IEEE Transactions on Evolution-ary Computation 6, 4 (2002), 333–346.

247

REFERENCES

[81] MICHELS, R., AND MIDDENDORF, M. An Ant System for the Shortest CommonSupersequence Problem. In New Ideas in Optimization, D. Corne, M. Dorigo,and F. Glover, Eds. McGraw-Hill, 1999, pp. 51–61.

[82] MILES, J. Getting the Sample Size Right: A Brief Introduction to PowerAnalysis, 2007.

[83] MITCHELL, M., AND TAYLOR, C. E. Evolutionary Computation: An Overview.Annual Review of Ecology and Systematics 20 (1999), 593–616.

[84] MONTGOMERY, D. C. Design and Analysis of Experiments, 6 ed. John Wileyand Sons Inc, 2005.

[85] MYERS, R. H., AND MONTGOMERY, D. C. Response Surface Methodology.Process and Product Optimization Using Designed Experiments. Wiley Seriesin Probability and Statistics. John Wiley and Sons Inc., 1995.

[86] NEUMANN, F., AND WITT, C. Runtime Analysis of a Simple Ant Colony Op-timization Algorithm. In Theory of Evolutionary Algorithms, D. V. Arnold,T. Jansen, M. D. Vose, and J. E. Rowe, Eds., Dagstuhl Seminar Proceed-ings. Internationales Begegnungs- und Forschungszentrum fuer Informatik(IBFI), Schloss Dagstuhl, Germany, Dagstuhl, Germany, 2006.

[87] NORVIG, P. Mistakes in Experimental Design and Interpretation, 2007.

[88] NOWE, A., VERBEECK, K., AND VRANCX, P. Multi-type Ant Colony: The EdgeDisjoint Paths Problem. In Proceedings of the Fourth International Workshopon Ant Colony, Optimization and Swarm Intelligence, M. Dorigo, M. Birattari,C. Blum, L. M. Gambardella, F. Mondada, and T. Stutzle, Eds., vol. 3172 ofLecture Notes in Computer Science. Springer, 2004, pp. 202–213.

[89] OSTLE, B. Statistics in Research, 2nd ed. Iowa State University Press, 1963.

[90] PAQUETE, L., CHIARANDINI, M., AND BASSO, D. Proceedings of the Workshopon Empirical Methods for the Analysis of Algorithms. In International Con-ference on Parallel Problem Solving From Nature (Reykjavik, Iceland, 2006).

[91] PARK, M.-W., AND KIM, Y.-D. A systematic procedure for setting parametersin simulated annealing algorithms. Computers and Operations Research 25,3 (1998), 207–217.

[92] PARK, S. K., AND MILLER, K. W. Random number generators: good ones arehard to find. Communications of the ACM 31, 10 (1988), 1192–1201.

[93] PARPINELLI, R., LOPES, H., AND FREITAS, A. Data mining with an ant colonyoptimization algorithm. IEEE Transactions on Evolutionary Computation 6(2002), 321–332.

[94] PARSONS, R., AND JOHNSON, M. A Case Study in Experimental Design Ap-plied to Genetic Algorithms with Applications to DNA Sequence Assembly.American Journal of Mathematical and Management Sciences 17, 3 (1997),369–396.

[95] PEACE, G. S. Taguchi Methods: A Hands-On Approach. Addison-Wesley,1993.

[96] PELLEGRINI, P., FAVARETTO, D., AND MORETTI, E. On Max-Min Ant System’sparameters. In Fifth International Workshop on Ant Colony Optimization andSwarm Intelligence, vol. 4150 of Lecture Notes in Computer Science. SpringerBerlin, 2006, pp. 203–214.

[97] PLANCK, M. Scientific autobiography and other papers. Williams and Norgate,London, 1950.

[98] PRESS, W. H., FLANNERY, B. P., TEUKOLSKY, S. A., AND VETTERLING, W. T.Numerical Recipes in Pascal: the art of scientific computing. Cambridge Uni-versity Press, 1989.

248

REFERENCES

[99] PRESS, W. H., TEUKOLSKY, S. A., VETTERLING, W. T., AND FLANNERY, B. P.Numerical Recipes in C: the art of scientific computing. Cambridge UniversityPress, Cambridge, 1992.

[100] RANDALL, M. Near Parameter Free Ant Colony Optimisation. In Proceedingsof the Fourth International Workshop on Ant Colony, Optimization and SwarmIntelligence, M. Dorigo, M. Birattari, C. Blum, L. M. Gambardella, F. Mon-dada, and T. Stutzle, Eds., vol. 3172 of Lecture Notes in Computer Science.Springer, Berlin, 2004, pp. 374–381.

[101] RARDIN, R. L., AND UZSOY, R. Experimental Evaluation of Heuristic Opti-mization Algorithms: A Tutorial. Journal of Heuristics 7 (2001), 261–304.

[102] REINELT, G. TSPLIB - A traveling salesman problem library. ORSA Journalof Computing 3 (1991), 376–384.

[103] RIDGE, E., AND CURRY, E. A Roadmap of Nature-Inspired Systems Researchand Development. Multi-Agent and Grid Systems 3, 1 (2007).

[104] RIDGE, E., AND KUDENKO, D. Sequential Experiment Designs for Screeningand Tuning Parameters of Stochastic Heuristics. In Workshop on EmpiricalMethods for the Analysis of Algorithms at the Ninth International Conferenceon Parallel Problem Solving from Nature, L. Paquete, M. Chiarandini, andD. Basso, Eds. 2006, pp. 27–34.

[105] RIDGE, E., AND KUDENKO, D. An Analysis of Problem Difficulty for a Classof Optimisation Heuristics. In Proceedings of the Seventh European Confer-ence on Evolutionary Computation in Combinatorial Optimisation, C. Cottaand J. V. Hemert, Eds., vol. 4446 of Lecture Notes in Computer Science.Springer-Verlag, 2007, pp. 198–209.

[106] RIDGE, E., AND KUDENKO, D. Analyzing Heuristic Performance with Re-sponse Surface Models: Prediction, Optimization and Robustness. In Pro-ceedings of the Genetic and Evolutionary Computation Conference. ACM,2007, pp. 150–157.

[107] RIDGE, E., AND KUDENKO, D. Screening the Parameters Af-fecting Heuristic Performance. Technical Report YCS 415(www.cs.york.ac.uk/ftpdir/reports/index.php), The Department of Com-puter Science, The University of York, April 2007.

[108] RIDGE, E., AND KUDENKO, D. Screening the Parameters Affecting HeuristicPerformance. In Proceedings of the Genetic and Evolutionary ComputationConference, D. Thierens, H.-G. Beyer, M. Birattari, J. Bongard, J. Branke,J. A. Clark, D. Cliff, C. B. Congdon, K. Deb, B. Doerr, T. Kovacs, S. Kumar,J. F. Miller, J. Moore, F. Neumann, M. Pelikan, R. Poli, K. Sastry, K. O.Stanley, T. Stutzle, R. A. Watson, and I. Wegener, Eds., vol. 1. ACM, 2007.

[109] RIDGE, E., AND KUDENKO, D. Tuning the Performance of the MMAS Heuris-tic. In Engineering Stochastic Local Search Algorithms. Designing, Implement-ing and Analyzing Effective Heuristics, T. Stutzle and M. Birattari, Eds.,vol. 4638 of Lecture Notes in Computer Science. Springer, Berlin / Heidel-berg, 2007, pp. 46–60.

[110] RIDGE, E., AND KUDENKO, D. Determining whether a problem characteristicaffects heuristic performance. A rigorous Design of Experiments approach.In Recent Advances in Evolutionary Computation for Combinatorial Optimiza-tion, Studies in Computational Intelligence. Springer, 2008.

[111] RIDGE, E., KUDENKO, D., AND KAZAKOV, D. A Study of Concurrency inthe Ant Colony System Algorithm. In Proceedings of the IEEE Congress onEvolutionary Computation. 2006, pp. 1662–1669.

[112] SCOTT, L. DOE Strategies: An Overview of the Methodology and Concepts,2006.

249

REFERENCES

[113] SHMYGELSKA, A., AND HOOS, H. An ant colony optimisation algorithm for the2D and 3D hydrophobic polar protein folding problem. BMC Bioinformatics6, 1 (2005), 30.

[114] SILVA, R. M. A., AND RAMALHO, G. L. Going the Extra Mile in Ant ColonyOptimization. In Proceedings of the Fourth Metaheuristics International Con-ference. 2001, pp. 361–366.

[115] SOCHA, K. The Influence Of Run-time Limits On Choosing Ant System Pa-rameters. In Proceedings of the Genetic and Evolutionary Computation Con-ference, E. Cantu-Paz, J. A. Foster, K. Deb, L. Davis, R. Roy, U.-M. O’Reilly,H.-G. Beyer, R. K. Standish, G. Kendall, S. W. Wilson, M. Harman, J. We-gener, D. Dasgupta, M. A. Potter, A. C. Schultz, K. A. Dowsland, N. Jonoska,and J. F. Miller, Eds., vol. 2723. Springer, 2003, pp. 49–60.

[116] SOLNON, C. Ants can solve constraint satisfaction problems. IEEE Transac-tions on Evolutionary Computation 6, 4 (2002), 347–357.

[117] STUTZLE, T. Local Search Algorithms for Combinatorial Problems - Analysis,Algorithms and New Applications. Phd, TU Darmstadt, 1998.

[118] STUTZLE, T., AND HOOS, H. H. Max-Min Ant System. Future GenerationComputer Systems 16, 8 (2000), 889–914.

[119] VAN HEMERT, J. I. Property Analysis of Symmetric Travelling SalesmanProblem Instances Acquired Through Evolution. In Proceedings of theFifth Conference on Evolutionary Computation in Combinatorial Optimization,G. R. Raidl and J. Gottlieb, Eds., vol. 3448. Springer-Verlag, Berlin, 2005,pp. 122–131.

[120] VOSS, S., MARTELLO, S., OSMAN, I. H., AND ROUCAIROL, C., Eds. Meta-Heuristics - Advances and Trends in Local Search Paradigms for Optimization.Kluwer Academic Publishers, Dordrecht, The Netherlands, 1999.

[121] WIKIPEDIA, T. F. E. I. Travelling salesman problem, 2007.

[122] WINEBERG, M., AND CHRISTENSEN, S. An Introduction to Statistics for ECExperimental Analysis. Tutorial at the ieee congress on evolutionary compu-tation, 2004.

[123] XU, J., CHIU, S., AND GLOVER, F. Fine-tuning a tabu search algorithmwith statistical tests. International Transactions on Operations Research 5, 3(1998), 233–244.

[124] ZEMEL, E. Measuring the quality of approximate solutions to zero-one pro-gramming problems. Mathematics of Operations Research 6 (1981), 319–332.

[125] ZLOCHIN, M., AND DORIGO, M. Model based search for combinatorial op-timization: a comparative study. In Proceedings of the Seventh Interna-tional Conference on Parallel Problem Solving from Nature, J. J. M. Guervs,P. Adamidis, and H.-G. Beyer, Eds., vol. 2439. Springer-Verlag, Berlin, Ger-many, 2002, pp. 651–661.

250

design of experiments for the tuning of …...design of experiments for the tuning of optimisation...

Documents